Applied Data Mining for Forecasting Using SAS. Tim Rey
3.4.1 Internal Data Infrastructure
3.4.2 External Data Infrastructure
3.5 Organizational Infrastructure
3.5.1 Developers Infrastructure
3.5.3 Work Process Implementation
3.1 Introduction
Applying data mining for forecasting in a business requires serious investments in hardware, software, and training, but a cultural change must also take place. It is very important to estimate the size of the investment based on technical requirements and the products that are available in the market. The four main components of any forecasting infrastructure are hardware, software, data, and organization. The first three components build the technical basis to support applied data mining for forecasting, and the fourth component is critical to effectively change the culture of the organization. This chapter is focused on an enterprise-wide implementation strategy of data mining for forecasting. The importance of integrating the selected options into the existing corporate infrastructure is discussed at the end of the chapter.
3.2 Hardware Infrastructure
The objective of this section is to give the reader a condensed overview of the potential hardware architectures for implementing data mining for forecasting systems in an industrial setting. Three options are discussed briefly below: (1) PC network, (2) client/server, and (3) cloud computing infrastructures. However, because technology changes rapidly, today's recommendations can easily become obsolete tomorrow.
3.2.1 Personal Computers Network Infrastructure
The least expensive hardware solution for implementing data mining for forecasting systems in an industrial setting is to avoid any additional hardware expenses and use the existing information system infrastructure. Usually, this is based on a PC network. The key advantages of this option are as follows:
low cost
easy integration in the existing information system infrastructure
minimal installation and maintenance efforts
robust performance due to the decentralized architecture
The main limitations of the PC network infrastructure solution for implementing data mining for forecasting systems are as follows:
limitations for large data set processing
slower processing speed relative to servers
limited operating systems options
3.2.2 Client/Server Infrastructure
The client/server model assumes a division of the computing resources between clients or workstations with local processing capabilities and servers with large memory and disk space and more powerful processors. The clients request services such as data, and the servers retrieve resources and deliver the requested information. The number of servers required depends on the number of clients, network speed and capacity, global and local operation, reliability, and so on.
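The division of work described above can be illustrated with a minimal in-process sketch in Python. It is not a networked implementation and not part of the SAS architecture discussed in this chapter; the class names `DataMartServer` and `ModelerClient`, the request format, and the sample `sales` table are all hypothetical and chosen only for illustration. The server owns the data and answers requests, while the client does its own local processing on what the server delivers.

```python
from dataclasses import dataclass, field


@dataclass
class DataMartServer:
    """Hypothetical server that owns the data and answers client requests."""
    tables: dict = field(default_factory=dict)

    def handle(self, request: dict) -> dict:
        # The server retrieves the requested resource and delivers it.
        name = request.get("table")
        if name not in self.tables:
            return {"status": "error", "message": f"unknown table: {name}"}
        return {"status": "ok", "rows": list(self.tables[name])}


@dataclass
class ModelerClient:
    """Hypothetical client (workstation) with local processing capability."""
    server: DataMartServer

    def mean_of(self, table: str, column: str) -> float:
        # Step 1: request the data from the server.
        response = self.server.handle({"table": table})
        if response["status"] != "ok":
            raise LookupError(response["message"])
        # Step 2: process the delivered rows locally on the client.
        values = [row[column] for row in response["rows"]]
        return sum(values) / len(values)


# Illustrative data only; these values do not come from the book.
server = DataMartServer(tables={"sales": [
    {"month": 1, "units": 120},
    {"month": 2, "units": 135},
    {"month": 3, "units": 150},
]})
client = ModelerClient(server=server)
print(client.mean_of("sales", "units"))  # prints 135.0
```

In a real deployment the call to `handle` would travel over the network, which is why the number of clients and the network capacity drive the number of servers required.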
An example of a minimal client/server infrastructure based on SAS is shown in Figure 3.1. The example includes four types of servers and two types of clients: modeler PC and final user PC. One server is allocated to handle metadata. A data mart server, based on Oracle, interacts with the large database cluster containing the corporate data. The third server hosts the SAS server and is devoted to compute-intensive tasks. Several clients can share the server resources, either for developing new models or for running developed models as stored processes.
The key advantages of the client/server infrastructure for implementing data mining for forecasting are given below:
very powerful processing capabilities
large memory and high-throughput disk
the use of different operating systems
capacity to process large data sets
Figure 3.1: An example of client/server infrastructure based on SAS
The disadvantages of this option are as follows:
high cost
more complex maintenance and support
lower reliability, because clients cannot work when the servers are down
The advantages, however, outweigh the disadvantages and the client/server infrastructure is the standard solution for large-scale industrial applications of data mining and forecasting.
3.2.3 Cloud Computing Infrastructure
Another potential solution, called cloud computing, uses powerful external and internal computing resources, and includes grid computing for parallel processing, multi-tiered computer architecture, and the capacity to handle super-large data sets. Such services are currently offered by a number of vendors including well-established industry leaders. Some of the advantages of using this option are as follows:
low implementation and maintenance cost
super-computer power, which is continuously upgraded by the cloud owner
data consolidation in very large data sets
increased reliability
The disadvantages of using a cloud computing infrastructure are summarized as follows:
security of proprietary data
initial transfer of very large corporate data to the cloud
limited software
trust issues
information technology (IT) management resistance
This option is still in an exploratory phase and has generated a lot of hype. However, if the technical and economic advantages are proved with more industrial applications, it could become a popular hardware infrastructure in the near future.
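The parallel processing that grid computing provides can be illustrated on a single machine. The following Python sketch is an assumption-laden toy, not a cloud or grid implementation: the function `naive_forecast`, the moving-average rule it uses, and the three sample series are hypothetical. It shows only the underlying pattern, in which each time series is an independent task that can be handed to a separate worker.

```python
from concurrent.futures import ThreadPoolExecutor


def naive_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Hypothetical series keyed by product name (illustrative values only).
series = {
    "product_a": [10, 12, 14, 16],
    "product_b": [5, 5, 5, 5],
    "product_c": [100, 90, 80, 70],
}

# Each series is independent, so the forecasts can be computed by separate workers.
# Threads stand in here for the grid nodes a real deployment would use.
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as its inputs.
    results = dict(zip(series, pool.map(naive_forecast, series.values())))

print(results)
# prints {'product_a': 14.0, 'product_b': 5.0, 'product_c': 80.0}
```

Because the tasks share no state, the same fan-out structure scales from a few local workers to many remote nodes, which is the property that makes large-scale forecasting a natural fit for grid-style infrastructure.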
3.3 Software Infrastructure
The lion's share of the cost of implementing data mining for forecasting systems, especially for the PC network infrastructure, is not hardware but software infrastructure. One of the key decisions to make in advance is the scale of the effort. For large-scale forecasting at a corporate level that is to be implemented across the globe, an integrated software environment made up of all necessary components, with global support, is strongly recommended. An example of such infrastructure (based on SAS software) is discussed in this book.
3.3.1 Data Collection Software
This part of the infrastructure depends strongly on the existing corporate information system architecture. Unfortunately, that architecture can be very diverse and span different database platforms. In most cases, however, the data are organized in relational databases and stored in separate tables for each entity. The relationship between the tables is defined by two columns: a primary key column and a foreign key column (Svolba 2006). Data that are accessed from a relational database are usually extracted table by table and are merged according to the primary