Applied Data Mining for Forecasting Using SAS. Tim Rey
The software basis for handling data in relational database systems is the Structured Query Language (SQL). It includes the operators needed to search for data items as well as various aggregations and joins of tables. The leading relational database systems include Oracle, SAP MaxDB and Sybase, Microsoft SQL Server, and IBM DB2. The good news is that the key existing software programs for data mining, such as SAS Enterprise Miner, IBM SPSS [1], and StatSoft STATISTICA Data Miner [2], include all the necessary software interfaces to collect data from diverse sources [3]. For example, SAS offers a specialized tool, SAS/ACCESS, that has almost universal capabilities for access, retrieval, and integration with any available data source [4].
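As a small illustration of the joins and aggregations mentioned above, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and values are hypothetical and serve only to show the shape of such a query; a production system would run similar SQL against one of the relational databases listed above.

```python
import sqlite3

# Hypothetical in-memory tables; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, family TEXT);
    CREATE TABLE sales (product_id INTEGER, month TEXT, volume REAL);
    INSERT INTO products VALUES (1, 'resins'), (2, 'resins'), (3, 'solvents');
    INSERT INTO sales VALUES
        (1, '2010-01', 10.0), (2, '2010-01', 5.0), (3, '2010-01', 7.0),
        (1, '2010-02', 12.0), (2, '2010-02', 6.0), (3, '2010-02', 8.0);
""")

# Join sales to products, then aggregate volume by product family and month.
rows = conn.execute("""
    SELECT p.family, s.month, SUM(s.volume) AS total_volume
    FROM sales AS s
    JOIN products AS p ON p.product_id = s.product_id
    GROUP BY p.family, s.month
    ORDER BY p.family, s.month
""").fetchall()
conn.close()

for family, month, total_volume in rows:
    print(family, month, total_volume)
```

The result is one row per product family and month, which is the kind of summary table that typically feeds the later data preparation steps.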
3.3.2 Data Preparation Software
It is recommended that the selected software has the following functionality for data preparation:
Data manipulation capabilities that include functions for summary tables generation, data split, concatenation, transposition, stacking, sorting, flexible filtering, joining tables, and so on.
Missing data handling that includes different options to impute missing data.
Data description capabilities that are usually based on basic descriptive statistics, frequency tables, histograms, and so on.
Data visualization capabilities that include a broad spectrum of graphics, such as 3-D scatter plots, contour plots, parallel plots, and so on.
Data pre-processing capabilities that include filtering, outlier detection and removal, data sampling, data partitioning, data transformation, and so on.
Examples of software tools with these capabilities are SAS Enterprise Guide, JMP, IBM SPSS, and StatSoft STATISTICA Data Miner.
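Two of the capabilities listed above, missing data imputation and outlier detection, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used by any of the named products: missing values are assumed to be encoded as None, imputation uses the median, and outliers are flagged with a median absolute deviation (MAD) rule. The function names, the threshold k = 3, and the sample data are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers_mad(values, k=3.0):
    """Flag values whose distance from the median exceeds k scaled MADs.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data. If the MAD is zero, nothing is flagged.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return [False] * len(values)
    threshold = k * 1.4826 * mad
    return [abs(v - center) > threshold for v in values]

# Hypothetical raw series with two gaps and one suspicious spike.
raw = [10, 12, None, 11, 95, 13, None, 12]
filled = impute_median(raw)
flags = flag_outliers_mad(filled)
print(filled)
print(flags)
```

Whether a flagged value should be removed, capped, or kept as a genuine event is a modeling decision that the code does not make.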
3.3.3 Data Mining Software
From the broad range of available data mining methods and functions, the following capabilities for variable reduction and selection are needed for forecasting applications:
Basic statistical capabilities that include building and analyzing linear regression models with options for variable selection by forward and backward stepwise regression.
Multivariate analysis capabilities that include cross-correlation analysis, PCA, and PLS.
Clustering capabilities that include dividing variables into clusters by linear or nonlinear methods, similarity analysis, and building decision trees.
Variable selection capabilities that include different algorithms for variable selection, such as stepwise regression, decision trees, gradient boosting, singular value decomposition (SVD), and so on.
The three most popular software options for industrial applications that offer most of these capabilities are SAS Enterprise Miner, IBM SPSS, and StatSoft STATISTICA Data Miner.
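To make the idea of cross-correlation analysis for variable selection concrete, the following sketch screens candidate inputs by their lagged Pearson correlation with the target series. It is a simplified illustration under stated assumptions, not the algorithm implemented in the products named above. The function names, the candidate series, and the maximum lag are hypothetical, and the target is constructed so that the "driver" series leads it by exactly two periods.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes nonzero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_lag(candidate, target, max_lag):
    """Return (lag, correlation) with the largest absolute correlation.

    A lag of k pairs candidate[t - k] with target[t], so the candidate leads.
    """
    scored = []
    for lag in range(max_lag + 1):
        x = candidate[: len(candidate) - lag]
        y = target[lag:]
        scored.append((lag, pearson(x, y)))
    return max(scored, key=lambda item: abs(item[1]))

# Hypothetical data: target[t] = 2 * driver[t - 2] + 1 for t >= 2.
candidates = {
    "driver": [1, 3, 2, 5, 4, 6, 8, 7, 9, 12],
    "noise": [5, 1, 4, 2, 6, 3, 5, 2, 4, 3],
}
target = [0, 0, 3, 7, 5, 11, 9, 13, 17, 15]

best = {name: best_lag(series, target, max_lag=3) for name, series in candidates.items()}
ranking = sorted(best, key=lambda name: abs(best[name][1]), reverse=True)
print(best["driver"])
print(ranking)
```

A screen of this kind only ranks candidates one at a time; it does not account for redundancy among inputs, which is where methods such as PCA, PLS, or variable clustering come in.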
3.3.4 Forecasting Software
The recommended capabilities for effective development of forecasting models in industrial applications are as follows.
Time series analysis capabilities that include generating time series, different time plots, correlations, seasonality adjustments, decompositions, and so on.
Forecasting model generation capabilities that include the most popular methods, such as exponential smoothing, ARIMA, unobserved components, and so on with a variety of diagnostic statistics and model performance metrics.
Event modeling capabilities that enable the introduction of large discrete shifts during model development.
Hierarchical forecasting capabilities that include developing a model hierarchy at the desired level based on the existing business structure and reconciling this with the final forecast.
Scenario generation capabilities for multivariate-based forecasting models—these different “what if” scenarios can show the impact of the key inputs on the final forecast.
The most powerful software tools that offer these capabilities are SAS Forecast Studio, Autobox from Automatic Forecasting Systems, and Forecast Pro from Business Forecast Systems.
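Of the forecasting methods listed above, simple exponential smoothing is the easiest to show in code. The sketch below is a minimal illustration, assuming a fixed smoothing weight alpha and an initial level equal to the first observation; commercial tools typically estimate these quantities and report many more diagnostics. The function names and sample series are hypothetical, and the mean absolute error stands in for the broader set of model performance metrics.

```python
def simple_exponential_smoothing(series, alpha):
    """Return one-step-ahead forecasts for simple exponential smoothing.

    forecasts[t] is the forecast of series[t]; the final element is the
    forecast for the first period after the end of the series.
    """
    level = series[0]      # assumption: initialise the level with the first value
    forecasts = [level]
    for y in series:
        level = alpha * y + (1 - alpha) * level
        forecasts.append(level)
    return forecasts

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical series and smoothing weight.
series = [10, 12, 11, 13]
forecasts = simple_exponential_smoothing(series, alpha=0.5)

# Skip period 0, whose forecast is the initial level by construction.
mae = mean_absolute_error(series[1:], forecasts[1:-1])
print(forecasts)
print(round(mae, 4))
```

In practice, alpha would be chosen by minimising an error metric such as this one on historical data, ideally on a holdout period rather than on the fitting sample.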
3.3.5 Software Selection Criteria
In addition to the specific technical capabilities of the key software components for a data mining for forecasting system, the following generic selection criteria are recommended:
Cost depends mostly on the ease-of-use of the corresponding packages. Most of the time the tools based on building blocks (such as SAS Enterprise Miner) or the high-performance forecasting tools (such as SAS Forecast Server) cost more. However, the increased productivity they deliver is significant. An additional advantage is the shorter learning and product adaptation time, which lowers the total cost.
Functionality should be checked carefully. It is strongly recommended that you verify that the necessary technical functionality described in the previous sections is available, and that you avoid any compromises. The capability to add new methods is also recommended.
Ease-of-use is enhanced by programming based on building blocks, a high level of automation for data pre-processing and model generation, an interactive graphic interface, and minimal programming necessary to deploy models (all features of SAS Enterprise Guide, for example).
Report generation is a significant step during model development as well as during model deployment and when transferring ownership to clients. During the model building phase many detailed reports with time series analysis, model diagnostics, or variable selection results are needed for successful decision-making. For model deployment, good reporting capabilities for model performance and value tracking are critical in order to keep the client happy.
The learning effort required depends on the software's ease-of-use, users' experience in statistics and forecasting within the organization, and the training courses and materials offered by the vendor. Products with a steep learning curve can significantly delay implementation efforts and reduce the impact of the technology for data mining in forecasting.
Global support 24/7—a fast, professional response to model development and implementation issues that is available globally is critical for the success of data mining for forecasting in industry. This is one of the key factors to consider when selecting the proper software vendor. Very few have the capacity to provide this type of service.
3.4 Data Infrastructure
Developing and maintaining a data infrastructure that can reliably supply data to the forecasting models, both while they are being developed and once they are applied, is a critical step for final success. The data infrastructure for data mining in forecasting consists of two key parts: internal data from the business and external data from various sources, such as Global Insight, Bloomberg, CMAI, and so on. The essence of both cases is described briefly in the following sections.
3.4.1 Internal Data Infrastructure
Very often, creating an internal data infrastructure for data mining in forecasting is the key bottleneck of the whole effort. Several issues contribute to this situation. The first issue is the diverse nature of data sources in different parts of the business. This issue is especially difficult to resolve during the transition period after mergers and acquisitions, when various types of databases need to be integrated. The second issue is the different time interval and duration with which historical data are kept in the system. Very often the time interval (week, month, or quarter) is different and inconsistent for the historical periods of interest. A similar situation is observed with the duration of historical data. In many cases the time history is too short to represent the patterns necessary to build and validate a good forecasting model. The third issue is structural changes in the business, since the corresponding models need to be rebuilt with revised history after each significant change.
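One common way to deal with inconsistent time intervals is to aggregate every series to the coarsest interval of interest. The following sketch, offered as one possible approach rather than a prescribed one, sums monthly values into quarters and drops quarters that do not have all three months. The function name, the 'YYYY-MM' period format, and the sample values are assumptions; summing is appropriate for flow quantities such as sales volume, while stock quantities such as inventory levels would call for a different rule.

```python
from collections import defaultdict

def monthly_to_quarterly(monthly):
    """Sum monthly values into quarters, keeping only complete quarters.

    `monthly` maps 'YYYY-MM' strings to numeric values.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for period, value in monthly.items():
        year, month = period.split("-")
        quarter = f"{year}-Q{(int(month) - 1) // 3 + 1}"
        sums[quarter] += value
        counts[quarter] += 1
    # A quarter with fewer than three months would understate the total.
    return {q: sums[q] for q in sorted(sums) if counts[q] == 3}

# Hypothetical monthly history that ends partway through the third quarter.
monthly = {f"2010-{m:02d}": float(m) for m in range(1, 8)}
quarterly = monthly_to_quarterly(monthly)
print(quarterly)
```

Dropping incomplete quarters shortens the usable history, which ties directly to the duration problem described above.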
The internal data infrastructure depends on the corporate data infrastructure. One option to communicate and synchronize the extracts is by using a separate server. (See the example in Figure