Smarter Data Science. Cole Stryker
features to include (feature engineering) or which features to exclude (feature selection), this chapter will help you determine which features you'll need for the models that you develop. You'll also learn about the importance of organizing data and the purpose of democratizing data.
Data-Driven Decision-Making
The most advanced algorithms cannot overcome a lack of data. Organizations that seek to prosper from AI by acting upon its revelations must have access to sufficient and relevant data. But even if an organization possesses the data it requires, it does not automatically become data-driven. A data-driven organization must be able to trust both the data that goes into an AI model and the data that comes out of it. The organization then needs to act on that data rather than on intuition, prior experience, or longstanding business policies.
Practitioners often communicate something like the following sentiment:
[O]rganizations don't have the historical data required for the algorithms to extract patterns for robust predictions. For example, they'll bring us in to build a predictive maintenance solution for them, and then we'll find out that there are very few, if any, recorded failures. They expect AI to predict when there will be a failure, even though there are no examples to learn from.
From “Reshaping Business with Artificial Intelligence: Closing the Gap Between Ambition and Action” by Sam Ransbotham, David Kiron, Philipp Gerbert, and Martin Reeves, September 6, 2017 (sloanreview.mit.edu/projects/reshaping-business-with-artificial-intelligence)
Even if an organization has a well-defined problem that machine learning or deep learning algorithms could solve, a lack of data can lead to a negative experience because the model cannot be adequately trained. Many AI models, neural networks in particular, work through hidden layers rather than explicit, human-readable rules. Special attention must therefore be paid to tracing the decision-making process so that outcomes are fair, transparent, and consistent with organizational and legal policies.
This raises the question of when it is appropriate to be data-driven. For many organizations, loose terms such as a system of record serve as qualitative signals that the data should be safe to use. Because no single rule can be applied to grade data, other approaches must be considered. The primary interrogatives are a reasonable starting point for gaining the insight needed to manage the risk-based decisions that come with being a data-driven organization.
Using Interrogatives to Gain Insight
In Rudyard Kipling's 1902 book Just So Stories, the story of “The Elephant's Child” contains a poem that begins like this:
I keep six honest serving-men (They taught me all I knew);
Their names are What and Why and When and How and Where and Who.
Kipling had codified the six primitive interrogatives of the English language. Collectively, these six words of inquiry—what, where, when, how, why, and who—can be regarded as a means to gain holistic insight into a given topic. It is why Kipling tells us, “They taught me all I knew.”
The interrogatives became a foundational aspect of John Zachman's seminal 1987 and 1992 papers: “A Framework for Information Systems Architecture” and “Extending and Formalizing the Framework for Information Systems Architecture” (the latter coauthored with John Sowa). Zachman correlated the interrogatives to a series of basic concepts that are of interest to an organization. The sequence in which the interrogatives are presented is inconsequential, and no one interrogative is more or less important than any of the others, but Zachman typically used the following sequence: what, how, where, who, when, why.
What: The data or information the organization produces
How: A process or a function
Where: A location or communication network
Who: A role played by a person or computational agent
When: A point in time, potentially associated with triggers that are fired or signals that are raised
Why: A goal or subgoal revealing motivation
NOTE
Zachman's article “A Framework for Information Systems Architecture” can be found at ieeexplore.ieee.org/document/5387671. “Extending and Formalizing the Framework for Information Systems Architecture” is available at ieeexplore.ieee.org/document/5387433.
By using Zachman's basic concepts of the six interrogatives, an organization can begin to understand or express how much the organization knows about something in order to infer a degree of trust and to help foster data-driven processes.
If a person or a machine has access to a piece of information or an outcome from an AI model, the person or machine can begin a line of inquiry to determine trust. For example, if the person or machine is given a score (representing the interrogative what), they can then ask, “How was this information produced? Where was this information produced? Who produced this information? When was this information produced? Is this information appropriate to meet my needs (why)?”
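The line of inquiry above can be sketched as a small record that holds one answer per interrogative and reports which questions remain open. This is only an illustrative sketch: the Provenance class, its field names, the unanswered helper, and the sample values are assumptions made for this example, not part of Zachman's framework or of any particular library.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Provenance:
    """Answers to the six interrogatives for one piece of information.

    Hypothetical structure for illustration; fields follow Zachman's
    usual sequence (what, how, where, who, when, why).
    """
    what: Optional[str] = None   # the information itself, e.g. a model score
    how: Optional[str] = None    # process or function that produced it
    where: Optional[str] = None  # location or system where it was produced
    who: Optional[str] = None    # person or computational agent responsible
    when: Optional[str] = None   # point in time it was produced
    why: Optional[str] = None    # need the information is meant to serve


def unanswered(record: Provenance) -> List[str]:
    """Return the interrogatives that have no recorded answer."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


# Sample values are invented purely to exercise the sketch.
score = Provenance(
    what="churn score = 0.82",
    how="classification model, version 3",
    who="nightly batch scoring job",
    when="2024-01-15T02:00:00Z",
)
print(unanswered(score))  # ['where', 'why']
```

A consumer of the score could treat a non-empty result as a prompt for further inquiry before acting on the information. How many open questions are tolerable is a judgment the organization has to make for itself.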
The Trust Matrix
To help visualize how the six interrogatives, taken together, support trust and becoming data-driven, the interrogatives can be mapped to the x-axis of a trust matrix (shown in Figure 2-1). The y-axis reflects the time horizons: past, present, and future.
Figure 2-1: Trust matrix
The past represents something that has occurred. The past is history and can inform us as to what happened, what was built, what was bought, what was collected (in terms of money), and so on. The present is about the now and can inform us as to things that are underway or in motion. The present addresses what is happening, what is being built, who is buying, and so on. The future is about things to be. We can prepare for the future by planning or forecasting. We can budget, and we can predict.
Revealing the past can yield hindsight, revealing the present can yield insight, and anticipating the future can yield foresight. The spectrum across the time horizons provides the viewpoints for what happened, is happening, and could or will happen. While the divisions are straightforward, the concept of the present can actually span the past and the future. Consider “this year.” This year is part of the present, but the days gone are also part of the past, and the days to come are also part of the future. Normally, the context of inquiry can help to remove any such temporal ambiguity.
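One way to picture the matrix is as a grid with one cell per pairing of time horizon and interrogative. The sketch below is a minimal illustration under stated assumptions: the dictionary layout, the function names, and the idea of counting answered cells per horizon are choices made for this example and are not prescribed by the trust matrix itself.

```python
from itertools import product

INTERROGATIVES = ("what", "how", "where", "who", "when", "why")  # x-axis
HORIZONS = ("past", "present", "future")                          # y-axis


def empty_trust_matrix():
    """Build one cell per (horizon, interrogative) pair, initially unknown."""
    return {cell: None for cell in product(HORIZONS, INTERROGATIVES)}


def answered_per_horizon(matrix):
    """Count, for each horizon, how many interrogatives have an answer."""
    return {
        h: sum(matrix[(h, i)] is not None for i in INTERROGATIVES)
        for h in HORIZONS
    }


# Sample entries are invented purely to exercise the sketch.
matrix = empty_trust_matrix()
matrix[("past", "what")] = "units sold last quarter"
matrix[("past", "who")] = "point-of-sale system"
matrix[("present", "what")] = "orders in progress"
matrix[("future", "what")] = "next-quarter forecast"
matrix[("future", "how")] = "forecasting model"

print(answered_per_horizon(matrix))  # {'past': 2, 'present': 1, 'future': 2}
```

The grid has 18 cells (three horizons by six interrogatives). A simple count like this only shows where answers are missing; it says nothing about how broad or deep each answer is.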
At each x-y intersection lies what the organization can reasonably know. What is knowable has two dimensions, breadth and depth, as shown in Figure 2-2. Breadth is a reflection of scope and represents how much is known about a given topic. For example, some organizations may have a retention policy that requires information to be expunged after a given number of years, say seven. In that case, the breadth of information the organization has access to is constrained to the most recent seven years.
Figure 2-2: Breadth and depth slivers
Conversely, depth is a reflection of detail. The topic of ethnography is addressed here. For example, a person may purchase a product, and if that product is gifted to