
Using Time Series Analysis to Build a Predictive Model – A Case Study

Businesses all over the world require predictive models to forecast sales, revenue growth, costs, and more. It is imperative for firms to gain a proper understanding of what the future holds for them, based on their actions or the lack thereof, because this enables them to craft and implement a viable, robust enterprise-wide strategy. Time series analysis (TSA) thus serves as a powerful tool for forecasting changes in product demand, price fluctuations of raw materials, and other factors that impact decision-making.

This case study shows how Xavor built a top-notch predictive pricing model using time series analysis for a company in the polymer chemical industry.

Problem Statement 

The management team at a client company in Texas wanted to predict monthly product sales and the price of chemical raw materials. The client used these raw materials to develop polymer products. The company was particularly interested in knowing whether the sale of one polymer product model could be used to predict the sale of another polymer product.  

Xavor’s data scientists know that time is money. We therefore offered the client dynamic decision-making technology as a crucial tool to facilitate the managerial decision-making process. The tool sought to enable managers to make informed decisions regarding a wide range of business activities, including decisions that involve analyzing the direct relationship between time, sales, and money.

We all make forecasts when taking strategic decisions under uncertainty. We may think we are not forecasting, but in truth, all our decisions are directed by our anticipation of the results of our actions or inaction.

Analytical Design 

Our Advanced Analytics and Data Science team designed a state-of-the-art solution for our client. The goal was to determine the optimal price of the chemical raw materials sourced by the client.   

Data  

A time series analysis data set was created by combining data from different sources, including data from the client as well as the internet. The TSA data set was resampled to a monthly frequency, with missing values, holidays, sales seasons, and similar effects handled appropriately.

A brief description of each feature was also documented (Feature Name | Data Type | Brief Description). It enabled us to better understand the business problem and its dependencies.

Data Statistics  

The client’s sales department gave us two years of data, which had the following distinct characteristics:

  • 1,294 variables ingested, 95% external vs. 5% internal. 
  • Number of features: 74. 
  • Number of samples: 19,993.
  • Economic indicators for the Chinese mainland and its five biggest export partners (Hong Kong, Japan, Korea, US, and Vietnam) – trade, wages, economic conditions, and country-specific industry health indicators.  
  • Market indicators of the four major customer industries of the client’s polymer products – automotive, construction, appliances, and containers.  
  • Schedules for major trade and production spikes – consumer shopping holidays, western and Chinese national holidays, and weather conditions. 
  • Industry data – futures and market prices of upstream raw materials, factory capacity and shutdowns in major producing countries, and MDI imports. 
  • Internal data from SAP and Vendavo – inventory, deal values and volumes, production, costs, and customer information. 

Data Cleaning and Transformation  

Based on the conducted study and an in-depth understanding of the given data set, we removed all redundant features from the raw data set. To tackle the missing data problem, we used simple imputation techniques, such as mean and mode filling, alongside more advanced imputation methods.
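A minimal sketch of the mean and mode filling step is shown below; the file name and column layout are assumptions, not the client's actual schema:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("tsa_dataset.csv")  # hypothetical file name

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Mean filling for numeric features, mode (most frequent) for categorical ones.
df[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(df[numeric_cols])
df[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[categorical_cols])
```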

Each feature used for training the models had a different range, leading to an extensive feature space and making training prone to unstable gradients. As part of the data transformation step, the data set’s attributes were therefore standardized for use by machine learning algorithms and statistical modeling.
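Standardization can be sketched as follows, assuming `X_train` and `X_test` are the numeric feature matrices produced by the cleaning step:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit only on the training data
X_test_std = scaler.transform(X_test)        # reuse training statistics to avoid leakage
```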

Feature Engineering  

We built micro- and macroeconomic time series variables to serve as potential predictors of demand. We also selected the transactional data (19,993 samples), using 90% of it for model training and the remaining 10% for model evaluation.
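Because the data is temporal, this split must be chronological rather than shuffled; a minimal sketch, with variable names assumed:

```python
# `data` is the monthly TSA DataFrame, sorted by date; hold out the most recent 10%.
split = int(len(data) * 0.9)
train, test = data.iloc[:split], data.iloc[split:]
```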

Furthermore, we applied variable selection methods to more than 74 macroeconomic variables to identify the most promising linear and nonlinear predictors, lagged predictors, and combinations of predictors.   
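One common way to carry out this kind of screening is to build lagged copies of each candidate variable and rank them with a random forest; the sketch below illustrates the idea with assumed column names, not the client's actual variables:

```python
from sklearn.ensemble import RandomForestRegressor

# `data` is the monthly DataFrame; `macro_cols` lists the macroeconomic variables.
# Add 1- and 3-month lags of each variable as candidate (lagged) predictors.
for col in macro_cols:
    data[f"{col}_lag1"] = data[col].shift(1)
    data[f"{col}_lag3"] = data[col].shift(3)
data = data.dropna()

# Rank all candidates by random forest feature importance and keep the top 20.
X, y = data.drop(columns=["price"]), data["price"]
rf = RandomForestRegressor(n_estimators=500, random_state=42).fit(X, y)
top_features = X.columns[rf.feature_importances_.argsort()[::-1][:20]]
```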

Statistical Tests  

Statistical models like ARIMA assume that the time series is stationary, i.e., that properties such as its mean and variance do not change over time. We performed two statistical tests to verify these initial assumptions:

  • Augmented Dickey-Fuller Test (ADF)  
  • Granger Causality Test  
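Both tests are available in statsmodels; here is a minimal sketch, assuming `prices`, `model_a_sales`, and `model_b_sales` are pandas Series drawn from the monthly data set:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

# ADF: the null hypothesis is a unit root, i.e., a non-stationary series.
adf_stat, p_value, *_ = adfuller(prices)
if p_value > 0.05:
    prices = prices.diff().dropna()  # difference the series until stationary

# Granger causality: does one product model's sales help predict another's?
grangercausalitytests(pd.concat([model_b_sales, model_a_sales], axis=1), maxlag=6)
```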

Strategy  

The project started by treating the time series as a regression problem, in line with the client’s requirements. We used machine learning algorithms such as random forest to select the optimal features, and then used these algorithms again with the selected features to predict the prices of the chemicals used in producing the polymer products.

Since prices were to be predicted for the most recent month based on two years of data, the major issue was that changing market conditions caused a drastic price hike in that final month.

Therefore, we devised a semi-supervised machine learning technique consisting of k-means clustering and random forest to address the problem. 

 

The natural structure of the TSA data set reflected the characteristics of different industries and business applications, observed mainly through current customers, exports, and prospective customers.

The method first used the clustering algorithm, supported by statistical techniques such as silhouette analysis, to divide the product categories into several groups and assign each data point to one of the clusters. Each group of chemical products had its own characteristics, needs, and customer domains.

We found the optimal number of clusters to be eight, based on our understanding of the variables.
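This selection step can be sketched with k-means and silhouette scores (scikit-learn; `X_std` is the standardized feature matrix from the transformation step):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 13):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_std)
    scores[k] = silhouette_score(X_std, labels)
best_k = max(scores, key=scores.get)  # eight clusters in our case
```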

Next, we applied supervised learning algorithms (random forest and SVM) to every cluster to predict the price of each individual chemical product. Dividing the data into clusters proved very helpful, as it significantly reduced the error rate. Later on, we devised marketing strategies for each category of users based on these features, sales, and the needs of the client organization.
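The per-cluster step can be sketched as follows; this is a simplified illustration with `X_std` and `y` assumed from the earlier steps, not the project's exact configuration:

```python
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Assign every sample to one of the eight clusters found above.
cluster_labels = KMeans(n_clusters=8, random_state=42, n_init=10).fit_predict(X_std)

models = {}
for c in range(8):
    mask = cluster_labels == c
    # One price model per cluster; an SVR can be swapped in and compared the same way.
    models[c] = RandomForestRegressor(n_estimators=300, random_state=42).fit(
        X_std[mask], y[mask]
    )
```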

 

We used the TSA data set to train the ARTXP algorithm of Microsoft Time Series Analysis and further optimized it for predicting the next likely value in a series. Moreover, we used the results of the standard ARIMA algorithm to improve the accuracy of long-term price predictions for the chemical products manufactured by the client.

ARIMA forecasts a time series by regressing the variable on its own past values and past forecast errors. We used each algorithm separately to train the model; however, blending the results into an ARTXP-ARIMA hybrid yielded better price predictions across the range of polymer product chemicals.
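The ARIMA side can be sketched with statsmodels; the (p, d, q) order below is illustrative rather than the tuned order from the project, and the simple averaging blend is likewise an assumption:

```python
from statsmodels.tsa.arima.model import ARIMA

# `train_prices` / `test_prices`: the chronologically split monthly price series.
arima = ARIMA(train_prices, order=(2, 1, 1)).fit()
arima_forecast = arima.forecast(steps=len(test_prices))

# Blend with the ARTXP output (produced in SQL Server Analysis Services).
hybrid_forecast = 0.5 * arima_forecast + 0.5 * artxp_forecast
```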

Moreover, we handled the inherent complexities of time series forecasting with the Prophet library, which we used to predict the future sales of individual polymer products.

Prophet can use Markov chain Monte Carlo (MCMC) sampling to generate its forecast uncertainty. It was also essential to account for the frequency of our time series: because we were working with monthly data, we explicitly specified the desired frequency of the timestamps (in this case ‘MS’, the start of the month, in Prophet’s conventions).

The make_future_dataframe function in Facebook’s Prophet library then generated 24 monthly timestamps for us based on the historical data provided by the client company. In other words, we set out to predict the future sales of the polymer products two years into the future.

We also incorporated the effect of the holiday season into the forecasting model, tailored to user requirements and supplemented with prior knowledge of demand in the key industries.

Finally, we used Prophet to return the components of our forecasts, which helped reveal how the trend and the yearly seasonal patterns of the time series contributed to the company’s overall polymer product sales.
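Putting these pieces together, the Prophet workflow looks roughly as follows; the holiday table and data frame names are illustrative, and Prophet expects a history with columns ds and y:

```python
import pandas as pd
from prophet import Prophet

# Hypothetical holiday table; the project's actual holiday calendar was richer.
holidays = pd.DataFrame({
    "holiday": "chinese_new_year",
    "ds": pd.to_datetime(["2021-02-12", "2022-02-01"]),
})

m = Prophet(holidays=holidays, mcmc_samples=300)  # MCMC sampling for uncertainty
m.fit(sales_df)  # sales_df: monthly sales history with columns ds, y

future = m.make_future_dataframe(periods=24, freq="MS")  # 24 month-start timestamps
forecast = m.predict(future)
m.plot_components(forecast)  # trend, yearly seasonality, and holiday effects
```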

 

Visualization  

Some distinguishable patterns and trends appeared in the data when we plotted it. We also visualized the TSA data using time series decomposition, which allowed us to break the series down into three distinct components: trend, seasonality, and noise.
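Such a decomposition plot can be produced with statsmodels, assuming `prices` is a monthly pandas Series indexed by date:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(prices, model="additive", period=12)
decomposition.plot()  # trend, seasonal, and residual (noise) panels
plt.show()
```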

The decomposition plot clearly showed that the prices of the chemical raw materials were unstable. It also showed a seasonal price hike at the beginning of each month.

Model Maintenance  

We suggested that the company update the model with recent sales data every quarter so that its predictions capture recent trends. Xavor also created a general prediction model to make corrections for departments that did not update sales data accurately or consistently; we used it to develop predictions for all the regions.

Results  

A web-based forecasting and prediction tool was developed that allowed the client’s management team to enter updated values of predictor variables each month and forecast the future demand for the specific chemical product.   

The client company subsequently validated the custom cluster-based model by comparing forecast vs. actual chemical prices over the first several months. The resulting forecast accuracy was impressive, so they could use the model in their business to increase revenue and product sales. It enabled them to:

  • Use the model forecasts with weekly and monthly visualizations as input for business operations.  
  • Conduct a subsequent study, applying the forecasting and prediction method to another, similar polymer product category.

We recorded metrics against each characteristic for each of the time series models and drew conclusions from them. Based on the MAE and MAPE metrics, the custom semi-supervised model stood out in terms of both forecasting sales and predicting chemical prices.
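Both metrics can be computed directly with scikit-learn; the variable names in this sketch are assumptions:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# `y_true`: actual prices; the predictions come from each trained model.
for name, y_pred in {"ARIMA": arima_pred, "Prophet": prophet_pred,
                     "Cluster-based": cluster_pred}.items():
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    print(f"{name}: MAE = {mae:.2f}, MAPE = {mape:.2%}")
```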

In terms of execution time, however, all the models performed on roughly the same scale, with ARIMA running marginally longer.

Conclusion  

The approach and experiments adopted in this study reasonably met the client company’s objectives. The approach was to compare the performance of the in-house time series model against widely used conventional models and libraries in TSA.

Our team also studied the generalizability and behavior of the models for long-term forecasts. As the forecast horizon increased, the cluster-based semi-supervised model generalized better than the other time series models employed, despite those being state-of-the-art models in the industry.

This was particularly true in this case, where the market change was correlated with confounding variables. However, this behavior holds only within a standard forecast horizon; beyond it, market volatility outweighs any model’s predictive behavior.

The web application gave the customer the choice to select the best-performing model based on the forecast’s duration and the sales prediction.

Are you looking to build and use a similar price-predicting model for your company? Get in touch with us at [email protected].   
