Applied Data Mining for Forecasting Using SAS. Tim Rey


3.4.1 Internal Data Infrastructure

3.4.2 External Data Infrastructure

3.5 Organizational Infrastructure

3.5.1 Developers Infrastructure

3.5.2 Users Infrastructure

3.5.3 Work Process Implementation

3.5.4 Integration with IT

3.1 Introduction

Applying data mining for forecasting in a business requires serious investment in hardware, software, and training, and it also requires a cultural change. It is very important to estimate the size of the investment based on the technical requirements and the products that are available on the market. The four main components of any forecasting infrastructure are hardware, software, data, and organization. The first three components form the technical basis that supports applied data mining for forecasting, and the fourth is critical to changing the culture of the organization effectively. This chapter focuses on an enterprise-wide implementation strategy for data mining for forecasting. The importance of integrating the selected options into the existing corporate infrastructure is discussed at the end of the chapter.

3.2 Hardware Infrastructure

The objective of this section is to give the reader a condensed overview of the potential hardware architectures for implementing data mining for forecasting systems in an industrial setting. Three options are discussed briefly below: (1) PC network, (2) client/server, and (3) cloud computing infrastructures. Because technology changes rapidly, however, today's recommendations can easily become obsolete tomorrow.

3.2.1 Personal Computers Network Infrastructure

The least expensive hardware solution for implementing data mining for forecasting systems in an industrial setting is to avoid any additional hardware expenses and use the existing information system infrastructure. Usually, this is based on a PC network. The key advantages of this option are as follows:

low cost

easy integration in the existing information system infrastructure

minimal installation and maintenance efforts

robust performance due to the decentralized architecture

The main limitations of the PC network infrastructure solution for implementing data mining for forecasting systems are as follows:

limited capacity for processing large data sets

slower processing speed relative to servers

limited operating system options

3.2.2 Client/Server Infrastructure

The client/server model assumes a division of the computing resources between clients or workstations with local processing capabilities and servers with large memory and disk space and more powerful processors. The clients request services such as data, and the servers retrieve resources and deliver the requested information. The number of servers required depends on the number of clients, network speed and capacity, global and local operation, reliability, and so on.

An example of a minimal client/server infrastructure based on SAS is shown in Figure 3.1. The example includes four types of servers and two types of clients—modeler PC and final user PC. One server is allocated to handle metadata. A data mart server, based on Oracle, interacts with the large database cluster containing the corporate data. The third server includes the SAS server and is devoted to intensive computing tasks. Several clients can share the server resources either for developing new models or running developed models as stored processes.
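The division of labor between clients and the compute server can be illustrated with a minimal in-process sketch in Python. It is not SAS code and does not reproduce the architecture in Figure 3.1; the class `ComputeServer`, the function `moving_average_forecast`, and the process name `ma_forecast` are hypothetical names introduced only for illustration. A modeler client registers a developed model on the server, and a final-user client later requests a result by name, so the computation runs on the server side.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def moving_average_forecast(history: List[float], window: int = 3) -> float:
    """Toy model: forecast the next value as the mean of the last `window` points."""
    recent = history[-window:]
    return sum(recent) / len(recent)


@dataclass
class ComputeServer:
    """Hypothetical stand-in for the server that hosts developed models."""
    stored_processes: Dict[str, Callable[..., float]] = field(default_factory=dict)

    def register(self, name: str, process: Callable[..., float]) -> None:
        # A modeler client deploys a developed model under a name.
        self.stored_processes[name] = process

    def handle_request(self, name: str, **params) -> float:
        # A final-user client asks for a result; the work runs on the server.
        if name not in self.stored_processes:
            raise KeyError(f"unknown stored process: {name}")
        return self.stored_processes[name](**params)


server = ComputeServer()

# Modeler PC: publish the developed model as a named stored process.
server.register("ma_forecast", moving_average_forecast)

# Final user PC: run the stored process without knowing how it is implemented.
forecast = server.handle_request(
    "ma_forecast", history=[10.0, 12.0, 14.0, 16.0], window=2
)
print(forecast)
```

In a real deployment the request would travel over the network and the server would also handle authentication, metadata, and data access; the sketch keeps only the request-by-name pattern.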

The key advantages of the client/server infrastructure for implementing data mining for forecasting are given below:

very powerful processing capabilities

large memory and high-throughput disk

the use of different operating systems

capacity to process large data sets

Figure 3.1: An example of client/server infrastructure based on SAS

The disadvantages of this option are as follows:

high cost

more complex maintenance and support

reduced reliability, because a server outage affects every client that depends on it

The advantages, however, outweigh the disadvantages, and the client/server infrastructure is the standard solution for large-scale industrial applications of data mining and forecasting.

3.2.3 Cloud Computing Infrastructure

Another potential solution, called cloud computing, uses powerful external and internal computing resources, and includes grid computing for parallel processing, multi-tiered computer architecture, and the capacity to handle super-large data sets. Such services are currently offered by a number of vendors including well-established industry leaders. Some of the advantages of using this option are as follows:

low implementation and maintenance cost

super-computer power, which is continuously upgraded by the cloud owner

data consolidation in very large data sets

increased reliability

The disadvantages of using a cloud computing infrastructure are summarized as follows:

security risks for proprietary data

the initial transfer of very large volumes of corporate data to the cloud

limited software availability

trust issues

information technology (IT) management resistance

This option is still in an exploratory phase and has generated a lot of hype. However, if its technical and economic advantages are proven in more industrial applications, it could become a popular hardware infrastructure in the near future.

3.3 Software Infrastructure

The lion's share of the cost of implementing data mining for forecasting systems, especially on a PC network infrastructure, goes to software rather than hardware. One of the key decisions to make in advance is the scale of the effort. For large-scale corporate forecasting that is to be implemented across the globe, an integrated software environment that includes all necessary components and has global support is strongly recommended. An example of such an infrastructure (based on SAS software) is discussed in this book.

3.3.1 Data Collection Software

This part of the infrastructure strongly depends on the existing corporate information system architecture. Unfortunately, that architecture can be very diverse and span different database platforms. In most cases, however, the data are organized in relational databases and stored in a separate table for each entity. The relationship between two tables is defined by a pair of columns: the primary key and the foreign key (Svolba 2006). Data that are accessed from a relational database are usually extracted table by table and are merged according to the primary
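The primary key and foreign key relationship described above can be illustrated with a small sketch that uses Python's built-in `sqlite3` module. The tables `customers` and `orders`, their columns, and the sample rows are hypothetical and are not taken from the book; they only show how two tables extracted separately can be merged on a key.

```python
import sqlite3

# In-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key of the customer entity
    region      TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
    amount      REAL
);
INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
INSERT INTO orders VALUES (100, 1, 250.0), (101, 1, 100.0), (102, 2, 75.0);
""")

# Merge the two tables by matching the foreign key to the primary key.
rows = conn.execute("""
    SELECT o.order_id, c.region, o.amount
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each order row picks up the region of the customer it references, which is the kind of table-by-table merge that precedes building a modeling data set.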
