Machine Learning For Dummies. John Paul Mueller
Remember: Big data can come from any source, even your email. The article at
https://www.semrush.com/blog/deep-learning-an-upcoming-gmail-feature-that-will-answer-your-emails-for-you/
discusses how Google uses your email to create a list of potential responses for new emails. You can read about the process involved for the user at https://www.lifewire.com/how-to-send-canned-replies-automatically-in-gmail-1172080. Instead of having to respond to every email individually, you can simply select a canned response at the bottom of the page. This sort of automation isn't possible without the original email data source. Looking for big data only in the expected locations will blind you to the big data sitting in common places that most people don't think of as data sources. Tomorrow's applications will rely on these alternative data sources, but to create those applications, you must begin seeing the data hidden in plain view today.
Some of these applications already exist, and you’re completely unaware of them. The video at https://research.microsoft.com/apps/video/default.aspx?id=256288
makes the presence of these kinds of applications more apparent. By the time you complete the video, you begin to understand that many uses of machine learning are already in place and that users take them for granted (or have no idea that the application is even present). Many developers see the quest for the ultimate machine learning experience as the master algorithm, which is the topic of a book entitled The Master Algorithm, by Pedro Domingos (https://www.amazon.com/exec/obidos/ASIN/0465094279/datacservip0f-20/).
Locating test data sources
As you progress through the book, you discover the need to teach whichever algorithm you’re using (don’t worry about specific algorithms; you see a number of them later in the book) how to recognize various kinds of data and then to do something interesting with it. This training process ensures that the algorithm reacts correctly to the data it receives after the training is over. Of course, you also need to test the algorithm to determine whether the training is a success. In many cases, the book helps you discover ways to break a data source into training and testing data components in order to achieve the desired result. Then, after training and testing, the algorithm can work with new data in real time to perform the tasks that you verified it can perform.
In some cases, you might not have enough data at the outset for both training (the essential first step) and testing. When this happens, you might need to create a test setup to generate more data, rely on data generated in real time, or create the test data artificially. You can also use similar data from existing sources, such as a public or private database. The point is that you need both training and testing data that will produce a known result before you unleash your algorithm into the real world of uncertain data.
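The idea of dividing one data source into training and testing portions can be sketched in a few lines of plain Python. This is only an illustration of the concept; in practice, a library routine such as scikit-learn's train_test_split typically handles the job. The function name and the 80/20 ratio here are choices made for the example, not requirements:

```python
import random

def split_data(data, test_ratio=0.2, seed=42):
    """Shuffle the data, then reserve a portion of it for testing."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = data[:]          # copy so the original list stays untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

samples = list(range(100))      # stand-in for 100 labeled examples
train, test = split_data(samples)
print(len(train), len(test))    # 80 20
```

The training portion teaches the algorithm; the testing portion, which the algorithm never sees during training, verifies that the training succeeded.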
Specifying the Role of Statistics in Machine Learning
Some sites online would have you believe that statistics and machine learning are two completely different technologies. For example, when you read Statistics vs. Machine Learning, fight! (http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/), you get the idea that the two technologies are not only different but downright hostile toward each other. (Later updates to the article are important in that they show the learning process that the author, and many of us, go through in trying to make sense of these technologies.) The fact is that statistics and machine learning have a lot in common, and statistics represents one of the five tribes (schools of thought) that make machine learning feasible. The five tribes are
Symbolists: The origin of this tribe is in logic and philosophy. This group relies on inverse deduction to solve problems.
Connectionists: The origin of this tribe is in neuroscience. This group relies on backpropagation to solve problems.
Evolutionaries: The origin of this tribe is in evolutionary biology. This group relies on genetic programming to solve problems.
Bayesians: The origin of this tribe is in statistics. This group relies on probabilistic inference to solve problems.
Analogizers: The origin of this tribe is in psychology. This group relies on kernel machines to solve problems.
The ultimate goal of machine learning is to combine the technologies and strategies embraced by the five tribes into a single algorithm (the master algorithm) that can learn anything (see Figure 2-1). Of course, achieving that goal is a long way off. Even so, scientists such as Pedro Domingos (http://homes.cs.washington.edu/~pedrod/) are currently working toward it.
FIGURE 2-1: The five tribes will combine their efforts toward the master algorithm.
This book follows the Bayesian tribe strategy, for the most part, in that you solve most problems using some form of statistical analysis. You do see strategies embraced by other tribes described, but the main reason you begin with statistics is that the technology is already well established and understood. In fact, many elements of statistics qualify more as engineering (in which theories are implemented) than science (in which theories are created). The next section of the chapter delves deeper into the five tribes by viewing the kinds of algorithms each tribe uses. Understanding the role of algorithms in machine learning is essential to defining how machine learning works.
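To give you a small taste of the Bayesian strategy, the following sketch applies Bayes' theorem to a toy spam-filtering question. All the numbers are invented purely for illustration; they aren't drawn from any real mail corpus:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Invented numbers: 40% of mail is spam; the word "offer" appears
# in 60% of spam messages and in 5% of legitimate messages.
p_spam = 0.4
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# Total probability of seeing the word "offer" in any message.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Probability that a message containing "offer" is spam.
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))   # 0.889
```

Even with made-up numbers, the pattern is the heart of probabilistic inference: update a prior belief (40 percent of mail is spam) with evidence (the message contains "offer") to produce a sharper posterior belief (about 89 percent likely spam).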
Understanding the Role of Algorithms
Everything in machine learning revolves around algorithms. An algorithm is a procedure or formula used to solve a problem. The problem domain affects the kind of algorithm needed, but the basic premise is always the same: to solve some sort of problem, such as driving a car or playing dominoes. In the first case, the problems are many and complex, but the ultimate problem is getting a passenger from one place to another without crashing the car. Likewise, the goal of playing dominoes is to win. The following sections discuss algorithms in more detail.
Defining what algorithms do
An algorithm is a kind of container. It provides a box for storing a method to solve a particular kind of problem. Algorithms process data through a series of well-defined states. The states need not be deterministic, but they are defined nonetheless. The goal is to create an output that solves a problem. In some cases, the algorithm receives inputs that help define the output, but the focus is always on the output.
Algorithms must express the transitions between states using a well-defined and formal language that the computer can understand. In processing the data and solving the problem, the algorithm defines, refines, and executes a function. The function is always specific to the kind of problem being addressed by the algorithm.
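A classic example makes these ideas concrete. Euclid's algorithm for the greatest common divisor moves through a series of well-defined states, with each loop iteration a formal transition, until it produces the output that solves the problem:

```python
def gcd(a, b):
    """Euclid's algorithm: each pair (a, b) is a well-defined state, and
    the loop expresses the transitions between states until b reaches 0."""
    while b != 0:
        a, b = b, a % b   # transition to the next state
    return a              # the output that solves the problem

print(gcd(48, 18))        # 6
```

Notice how the inputs (48 and 18) help define the output, but the algorithm itself is simply a stored method: the same container solves the problem for any pair of whole numbers you hand it.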
Considering the five main techniques
As described in the previous section, each of the five tribes has a different technique and strategy for solving problems, and each approach results in unique algorithms. Combining these algorithms should eventually lead to the master algorithm, which will be able to solve any given problem. The following sections provide an overview of the five main algorithmic techniques.