Profit Driven Business Analytics. Baesens Bart
the model output in a user-friendly way, how to integrate it with other applications (e.g., marketing campaign management tools, risk engines), and how to make sure the analytical model can be appropriately monitored and backtested on an ongoing basis.
It is important to note that the process model outlined in Figure 1.1 is iterative in nature, in the sense that one may have to return to previous steps during the exercise. For instance, during the analytics step, a need for additional data may be identified that will necessitate additional data selection, cleaning, and transformation. The most time-consuming step is typically the data selection and preprocessing step, which usually takes around 80% of the total effort needed to build an analytical model.
ANALYTICAL MODEL EVALUATION
Before adopting an analytical model and making operational decisions based on the obtained clusters, rules, patterns, relations, or predictions, the model needs to be thoroughly evaluated. Depending on the exact type of output, the setting or business environment, and the particular usage characteristics, different aspects may need to be assessed during evaluation in order to ensure the model is acceptable for implementation.
A number of key characteristics of successful analytical models are defined and explained in Table 1.7. These broadly defined evaluation criteria may or may not apply, depending on the exact application setting, and will have to be further specified in practice.
Table 1.7 Key Characteristics of Successful Business Analytics Models
Various challenges may occur when developing and implementing analytical models, possibly leading to difficulties in meeting the objectives expressed by the key characteristics of successful analytical models listed in Table 1.7. One such challenge concerns the dynamic nature of the relations or patterns retrieved from the data, which affects the usability and lifetime of the model. For instance, in a fraud detection setting, fraudsters are observed to constantly try to outsmart detection and prevention systems by developing new strategies and methods (Baesens et al. 2015). Therefore, adaptive analytical models and detection and prevention systems are required in order to detect and resolve fraud as soon as possible. Closely monitoring the performance of the model in such a setting is an absolute must.
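As an illustration of what such monitoring could look like, the following minimal Python sketch computes the area under the ROC curve (AUC) for a batch of scored cases and flags periods in which the AUC falls more than a chosen tolerance below a reference value. The function names, the tolerance of 0.05, and all numbers are hypothetical choices made for this sketch; they are not taken from the text, and other performance measures could be monitored in the same way.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def flag_degraded_periods(auc_by_period, reference_auc, max_drop=0.05):
    """Return the periods whose AUC dropped more than max_drop
    below the reference AUC measured at model development."""
    return [
        period
        for period, value in auc_by_period.items()
        if reference_auc - value > max_drop
    ]


# Hypothetical monitoring data: AUC measured on each new batch of labeled cases.
observed = {"period_1": 0.79, "period_2": 0.76, "period_3": 0.70}
degraded = flag_degraded_periods(observed, reference_auc=0.80)
print(degraded)
```

A flagged period would typically trigger a closer investigation and, where needed, re-estimation of the model on more recent data.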
Another common challenge in a binary classification setting, such as predicting customer churn, concerns the imbalanced class distribution, meaning that one class or type of entity is much more prevalent than the other. When developing a customer churn prediction model, the historical dataset typically contains many more nonchurners than churners. Furthermore, the costs and benefits related to detecting or missing either class are often strongly imbalanced and may need to be accounted for to optimize decision making in the particular business context. In this book, various approaches are discussed for dealing with these specific challenges. Other issues may arise as well, often requiring ingenuity and creativity to be solved. Hence, both are key characteristics of a good data scientist, as is discussed in the following section.
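The effect of class imbalance and unequal misclassification costs can be illustrated with a small Python sketch. It uses a standard result from cost-sensitive classification rather than a method stated in the text: if predicted probabilities are well calibrated and correct decisions carry no cost, expected cost is minimized by flagging a customer when the predicted churn probability exceeds cost_fp / (cost_fp + cost_fn). The toy labels, scores, and cost figures below are hypothetical.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return max(positives, negatives) / len(labels)


def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected cost, assuming
    calibrated probabilities and zero cost for correct decisions."""
    return cost_fp / (cost_fp + cost_fn)


def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Sum the misclassification costs incurred at a given threshold."""
    cost = 0.0
    for y, p in zip(labels, scores):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp  # retention offer sent to a nonchurner
        elif predicted == 0 and y == 1:
            cost += cost_fn  # churner missed
    return cost


# Hypothetical toy data: 95 nonchurners (0) and 5 churners (1).
labels = [0] * 95 + [1] * 5
scores = [0.05] * 90 + [0.30] * 5 + [0.30] * 3 + [0.60] * 2

# Hypothetical costs: missing a churner is ten times as costly as a wasted offer.
COST_FP, COST_FN = 10.0, 100.0

baseline = majority_baseline_accuracy(labels)
threshold = cost_optimal_threshold(COST_FP, COST_FN)
cost_default = total_cost(labels, scores, 0.5, COST_FP, COST_FN)
cost_tuned = total_cost(labels, scores, threshold, COST_FP, COST_FN)
print(baseline, round(threshold, 3), cost_default, cost_tuned)
```

In this toy example, always predicting "nonchurner" is 95% accurate yet detects no churners, and lowering the threshold from 0.5 to the cost-based value reduces the total misclassification cost even though it produces more false alarms.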
ANALYTICS TEAM
The analytics process is essentially a multidisciplinary exercise where many different job profiles need to collaborate. First of all, there is the database or data warehouse administrator (DBA). The DBA ideally is aware of all the data available within the firm, the storage details and the data definitions. Hence, the DBA plays a crucial role in feeding the analytical modeling exercise with its key ingredient, which is data. Since analytics is an iterative exercise, the DBA may continue to play an important role as the modeling exercise proceeds.
Another very important profile is the business expert. This could, for instance, be a credit portfolio manager, brand manager, fraud investigator, or e-commerce manager. The business expert has extensive business experience and business common sense, which usually proves very valuable and crucial for success. It is precisely this knowledge that will help to steer the analytical modeling exercise and interpret its key findings. A key challenge here is that much of the expert knowledge is tacit and may be hard to elicit at the start of the modeling exercise.
Legal experts are gaining in importance because not all data can be used in an analytical model, owing to concerns such as privacy and discrimination. For instance, in credit risk modeling, one typically cannot discriminate between good and bad customers based on gender, beliefs, ethnic origin, or religion. In Web analytics, information is typically gathered by means of cookies, which are files that are stored on the user's browsing computer. However, when gathering information using cookies, users should be appropriately informed. This is subject to regulation at various levels (regional, national, and supranational, e.g., at the European level). A key challenge here is that privacy and other regulatory issues vary highly depending on the geographical region. Hence, the legal expert should have good knowledge about which data can be used when, and which regulation applies in which location.
The software tool vendors should also be mentioned as an important part of the analytics team. Different types of tool vendors can be distinguished here. Some vendors only provide tools to automate specific steps of the analytical modeling process (e.g., data preprocessing). Others sell software that covers the entire analytical modeling process. Some vendors also provide analytics-based solutions for specific application areas, such as risk management, marketing analytics, or campaign management.
The data scientist, modeler, or analyst is the person responsible for doing the actual analytics. The data scientist should possess a thorough understanding of all big data and analytical techniques involved and know how to implement them in a business setting using the appropriate technology. In the next section, we discuss the ideal profile of a data scientist.
Whereas a previous section discussed the characteristics of a good analytical model, this section elaborates on the key characteristics of a good data scientist from the perspective of the hiring manager. The discussion is based on our consulting and research experience, having collaborated with many companies worldwide on the topic of big data and analytics.
A Data Scientist Should Have Solid Quantitative Skills
Obviously, a data scientist should have a thorough background in statistics, machine learning, and/or data mining. The distinction between these various disciplines is becoming more and more blurred and is actually no longer that relevant. They all provide a set of quantitative techniques to analyze data and find business-relevant patterns within a particular context such as fraud detection or credit risk management. A data scientist should be aware of which technique can be applied, when, and how, and should not focus too much on the underlying mathematical (e.g., optimization) details but, rather, have a good understanding of what analytical problem a technique solves and how its results should be interpreted. In this context, the education of engineers in computer science and/or business/industrial engineering should aim at an integrated, multidisciplinary view, with graduates trained both in the use of the techniques and in the business acumen necessary to bring new endeavors to fruition.
It is also important to spend enough time validating the analytical results obtained, so as to avoid situations often referred to as data massage and/or data torture, whereby data are (intentionally) misrepresented and/or too much time is spent discussing spurious correlations.
When selecting the optimal quantitative technique, the data scientist should consider the specificities of the context and the business problem at hand. Key requirements for business analytics models have been discussed in a previous section, and the data scientist should have a basic understanding of, and intuition for, all of those. Based on a combination of these requirements, the data scientist should be capable of selecting the best analytical technique to solve the particular business problem.
A Data Scientist Should Be a Good Programmer
By definition, data scientists work with data. This involves plenty of activities such as sampling and preprocessing of data, model estimation, and post-processing (e.g., sensitivity analysis, model deployment, backtesting, model validation). Although many user-friendly software tools are on the market nowadays to automate and support these tasks, every analytical