Data preprocessing
The common methods for improving the information content of the raw data (which are very often messy) include imputation of missing data, accumulation, aggregation, outlier detection, transformations, expanding or contracting, and so on. All of these techniques are discussed in separate sections of Chapter 6.
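As a rough illustration only (the data set and variable names are hypothetical, not taken from the book's examples), two of these preprocessing operations, accumulating transactional records to a common time interval and converting a driver series to that same interval, might be sketched in SAS as follows:

/* Accumulate a hypothetical daily transactional table to a monthly series;
   SETMISSING= keeps empty months visible for later imputation */
proc timeseries data=work.raw_sales out=work.monthly_sales;
   id date interval=month accumulate=total setmissing=missing;
   var sales;
run;

/* Convert a hypothetical quarterly driver to the monthly interval so that
   the Ys and Xs can be aligned in one modeling data set */
proc expand data=work.quarterly_drivers out=work.monthly_drivers
            from=qtr to=month;
   id date;
   convert gdp / observed=average method=join;
run;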
Data preparation deliverables
The key deliverable in this step is a clean data set with combined and aligned targeted variables (Ys) and potential drivers (Xs) based on preprocessed internal and external data.
Of equal importance to the preprocessed data set is a document that describes the details of the data preparation along with the scripts to collect, clean and harmonize the data.
Variable reduction/selection steps
The objective of this block of the work process is to reduce the number of potential economic drivers for the dependent variable by various data mining methods. The data reduction process is done in two key substeps: (1) variable reduction and (2) variable selection in static transactional data. The main difference between the two substeps is the relation of the potential drivers or independent variables (Xs) to the targeted or dependent variables (Ys). In the case of variable reduction, the focus is on the similarity between the independent variables, not on their association with the dependent variable. The idea is that some of the Xs are highly related to one another; removing these redundant variables therefore reduces the dimensionality of the data. In the case of variable selection, the independent variables are chosen based on their statistical significance or similarity with respect to the dependent variables. The details of the methods for variable reduction and selection are presented in Chapter 7, and a short description of the corresponding substeps and deliverables is given below.
Variable reduction via data mining methods
Since there is already a rich literature in the statistical and machine learning disciplines concerning approaches to variable reduction or selection, this book often refers to and contrasts methods used on “non-time series” or transactional data. New methods specifically for time series data are also discussed in more detail in Chapter 7. In the transactional data approach, the association among the independent variables is explored directly. Typical techniques used in this case are variable cluster analysis and principal component analysis (PCA). In both methods, the analysis can be based on either correlation or covariance matrices. Once the clusters are found, the variable with the highest correlation to the cluster centroid in each cluster is chosen as a representative of the whole cluster. Another frequently used approach is variable reduction via PCA, where a transformed set of new variables (based on the correlation structure of the original variables) that describes some minimum amount of the variation in the data is used. This reduces the dimensionality of the problem in the independent variables.
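A minimal SAS sketch of these two transactional-style reduction techniques, assuming a hypothetical data set work.drivers with candidate drivers x1-x50, might look like this (the thresholds and the number of components are illustrative choices only):

/* Variable clustering: group correlated drivers so that one representative
   per cluster (e.g., the variable closest to the cluster centroid) is kept */
proc varclus data=work.drivers maxeigen=0.7 short;
   var x1-x50;
run;

/* Principal component analysis on the correlation matrix: keep enough
   components (here 10) to describe the desired share of the variation */
proc princomp data=work.drivers out=work.pc_scores n=10;
   var x1-x50;
run;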
In time series-based variable reduction, the time factor is taken into account. One of the most commonly used methods is similarity analysis, where the data is first phase shifted and time warped. A distance metric is then calculated to obtain the similarity measure between each pair of time series xi and xj. Variables within some critical distance of each other are assumed to be similar, and one of them can be selected as representative. In the case of correlated inputs, the dimensionality of the original data set can be reduced significantly by removing the similar variables. PCA can also be used on time series data; an example is the work done by the Chicago Fed, which developed a National Activity Index (CFNAI) based on 85 variables representing different sectors of the US economy.4
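As a hedged sketch of the similarity idea (again with hypothetical data set and variable names), PROC SIMILARITY can compute warped distance measures between pairs of candidate series:

/* Pairwise similarity among candidate drivers; COMPRESS= and EXPAND= allow
   limited time warping, and OUTSUM= collects the distance for each pair */
proc similarity data=work.monthly outsum=work.sim_summary;
   id date interval=month;
   input x1-x10;
   target x1-x10 / measure=sqrdev
                   compress=(localabs=3) expand=(localabs=3);
run;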
Variable selection via data mining methods
Again, there is quite a rich literature on variable or feature selection for transactional data mining problems. In variable selection, the significant inputs are chosen based on their association with the dependent variable. As with variable reduction, different methods are applied to data with a time series nature than to transactional data. The first approach uses traditional transactional data mining variable selection methods. Some of the well-known methods, discussed in Chapter 7, are correlation analysis, stepwise regression, decision trees, partial least squares (PLS), and genetic programming (GP). In order to use these approaches on time series data, the data has to be preprocessed properly. First, both the Ys and Xs are made stationary by taking the first difference. Second, some dynamics are added to the system by introducing lags for each X. As a result, the number of extended X variables to consider as inputs increases significantly; however, this makes it possible to capture dynamic dependencies between the independent and the dependent variables. This approach is often referred to as the poor man's approach to time series variable selection, since much of the extra work goes into preparing the data, after which non-time series approaches are applied.
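A simplified sketch of this preparation in SAS, assuming a hypothetical monthly data set work.monthly with a target y and drivers x1-x3, is shown below; the differencing, lag depth, and significance levels are illustrative choices only.

/* Make the series stationary with first differences and add two lags of
   each differenced driver to capture dynamic effects */
data work.stationary;
   set work.monthly;
   dy  = dif(y);
   dx1 = dif(x1);  dx1_l1 = lag1(dx1);  dx1_l2 = lag2(dx1);
   dx2 = dif(x2);  dx2_l1 = lag1(dx2);  dx2_l2 = lag2(dx2);
   dx3 = dif(x3);  dx3_l1 = lag1(dx3);  dx3_l2 = lag2(dx3);
run;

/* Apply a transactional-style selection method (stepwise regression) to the
   differenced and lagged candidates */
proc reg data=work.stationary;
   model dy = dx1 dx1_l1 dx1_l2 dx2 dx2_l1 dx2_l2 dx3 dx3_l1 dx3_l2
          / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;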
The second approach is geared more specifically toward time series. There are four methods in this category. The first is the correlation coefficient method. The second is a special version of stepwise regression for time series models. The third is similarity, as discussed earlier in the variable reduction substep, but in this case the distance metric is between the Y and the Xs; thus, the smaller the similarity metric, the stronger the relationship of the corresponding input to the output variable. The fourth approach is co-integration, a specialized test of whether two time series variables move together in the long run. Much more detail concerning these analyses is presented in Chapter 7.
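Two of these time series oriented checks might be sketched as follows, again with hypothetical data set and variable names (the differenced data set work.stationary built above for the cross-correlations, and a levels data set work.levels for the co-integration test):

/* Cross-correlations between the differenced target and a differenced driver
   indicate at which lags the driver leads the target */
proc arima data=work.stationary;
   identify var=dy crosscorr=(dx1) nlag=12;
run;
quit;

/* Johansen co-integration test on the undifferenced series: do y and x1
   share a common long-run trend? */
proc varmax data=work.levels;
   model y x1 / p=2 cointtest=(johansen);
run;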
One important addition to the variable selection is to be sure to include the subject matter experts' (SMEs') favorite drivers, or those discussed as such in market studies (such as CMAI in the chemical industry) or by the market analysts.
Event selection
Events are a specific type of class variable in forecasting. These class variables help describe large discrete shifts and deviations in the time series. Examples of such variables are advertising campaigns before Christmas and Mother's Day, mergers and acquisitions, natural disasters, and so on. It is very important to clarify and define the events and their types in this phase of project development.
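For example (with hypothetical dates, data set, and variable names), simple event indicators can be coded as dummy variables in a DATA step before modeling:

/* Recurring event: pre-Christmas advertising window (November-December);
   one-time event: an acquisition treated as a level shift from June 2009 on */
data work.monthly_events;
   set work.monthly;
   xmas_promo  = (month(date) = 11 or month(date) = 12);
   acquisition = (date >= '01JUN2009'd);
run;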
Variable reduction and selection deliverables
The key deliverable from the variable reduction and selection step is a reduced set of Xs that are less correlated to one another. It is assumed that it includes only the most relevant drivers or independent variables, selected by consensus based on their statistical significance and expert judgment. However, additional variable reduction is possible during the forecasting phase. Selected events are another important deliverable before beginning the forecasting activities.
As always, document the variable reduction/selection actions. The documentation includes a detailed description of all steps for variable reduction and selection, as well as the arguments for the final selection based on statistical significance and subject matter experts' approval.
Forecasting model development steps
This block of the work process includes all necessary activities for delivering forecasting models with the best performance, based on the available preprocessed data and the reduced set of potential independent variables. Among the numerous options for designing forecasting models, the focus in this book is on the most commonly used practical approaches for univariate and multivariate models. The related techniques and development methodologies are described in Chapters 8–11 with minimal theory and sufficient detail for practitioners. The basic substeps and deliverables are described below.
Basic forecasting steps: identification, estimation, forecasting
Even the most complex forecasting models are based on three fundamental steps: (1) identification, (2) estimation, and (3) forecasting. The first step is identifying a specific model structure based on the nature of the time series and the modeler's hypothesis. Examples