Applied Univariate, Bivariate, and Multivariate Statistics Using Python. Daniel J. Denis
issue may still remain. For instance, how will you convince your committee that what you have measured is actually a good measure of self-esteem?
1.8 Data Analysis, Data Science, Machine Learning, Big Data
In recent years, the “data explosion” has gripped much of science, business, and almost every field of inquiry. Thanks to computers and advanced data warehouse capacities that could have only been dreamt of in years past (and will seem trivial in years to come), the “data deluge” is officially upon us. The ease with which statistical and software analyses can be conducted has increased dramatically. New designations for quantitative analyses often fall under the names of data science and machine learning, and because data is so cheap to store, many organizations, academic and otherwise, can collect and store massive amounts of data, so much so that analysis of such data sometimes falls under the title of “big data.” For example, world population data regarding COVID-19 were analyzed in an attempt to spot trends in the virus across age groups and the extent of comorbidity with other illnesses, among other things. Such analyses are usually done on very large and evolving databases. The mechanisms for storing and accessing such data are, rightly so, not truly areas of “statistics” per se, and have more to do with data engineering and the like. The field of machine learning, an area primarily in computer science, is an emerging area that emphasizes modern software technology in analyzing data, deciphering trends, and visually depicting results via advanced and sophisticated graphics. As you venture further into data analysis in general, some of the algorithms you use may come from this field.
Though the fields of data science, machine learning, and other allied fields are relatively new and exciting, it is nonetheless important for the reader to not simply and automatically associate new words with necessarily new “things.” Human beings are creatures of psychological association, and so when we hear of a new term, we often create a new category for that term in our minds, and we assume that since there is a new word, there must be an equivalent new category. However, that does not necessarily imply the new association we have created is one-to-one with the reality of the object. The new vehicle promoted by a car company may be an older design “updated” rather than an entirely new vehicle. Hence, when you hear new terminology in quantitative areas, it is imperative that you never stop with the word alone, but instead delve in deeper to see what is actually “there” in terms of new substance. Why is this approach to understanding important? It is important because otherwise, especially as a newcomer to these areas, you may come to believe that what you are studying is entirely novel. Indeed, it may be “new,” but it may not be as novel or categorically different from the “old” as you may at first think. Likewise, humanistic psychology of the 1950s was not entirely new. The Greeks had very similar ideas. The marketing was new, but the ideas were generally not.
As an example, suppose you are fitting a model to data in machine learning and are concerned about overfitting the model to your data, which, in general, means you are fitting a functional form that matches the obtained data too closely, potentially leading to poor replication and generalizability when the model is applied to new data. You may read about overfitting in a machine learning book and believe the concept applies only to machine learning. That is, you may believe overfitting is a property of machine learning models only! How false! While it is a term often used in machine learning, it definitely is not a term specific to the field. Historically, not only have the term and concept of overfitting been used in statistics, but prior to the separation of statistics from mathematics, examples of scientists being concerned about overfitting are scattered throughout history! Hence, “overfitting” is not a concept unique to the field from which you are learning, any more than algorithms are unique to computer science. Historically, algorithms have existed forever, and even the Babylonians were using primitive algorithms (Knuth, 1972). If you are not at least somewhat aware of history, you may come to believe new words and terms necessarily imply new “things.” The concept is usually old news, however. That does not imply the new use of the word is not at least somewhat unique and that it is not being applied to a new algorithm (for example). However, it is likely that the concept existed well before the word was paired with the thing it is describing in a given field. Likewise, if you believe that support vector machines have anything to do with machines (and I have had students assume there must be a “machine” component within its mathematics!), you need to remember that words are imperfect descriptors for what is actually there. Indeed, much of language in general is nothing more than an approximation to what we truly wish to communicate.
As any linguist will tell you, language is far from a precise method of communication, but it is often the best we can do. Likewise, with music, a series of notes played on the piano with the goal of communicating a sentiment or emotional quality will necessarily not do so perfectly. It is an approximation. But how awkward it would be for the musician to follow up his or her performance with “What I meant to say was …” or “What is really behind those notes is …” Notions of machine learning, data science, statistics, mathematics, all conjure up associations, but you need to unpack and unravel those associations if you are to understand what is really there. In other words, just as an abstract numerical system may not perfectly coincide with the representation of physical phenomena, so it is true that an abstract linguistic system (of which you might say numerical systems might be a special case) rarely coincides perfectly with the objects it seeks to describe.
This discussion is not meant to start a “turf war” over the priority of human intellectual invention. Far from it. If we were to do that, then we would have to also acknowledge that though Newton and Leibniz put the final touches on the calculus, the idea that they “invented” it, in the truest sense of the word, is a bit of a stretch. Priority disputes in the history of human discovery usually prove futile and virtually impossible to resolve, even among those historians who study the most ancient of roots of intellectual invention on a full-time basis. That is, even assigning priority to ancient discoveries of intellectual concepts is exceedingly difficult (especially without lawyers!), which provides further evidence that “modern” concepts are often not modern at all. As another example, the concept of a computer may not be a modern invention. Historians have shown that its primitive origin may possibly go back to Charles Babbage and the “Analytical Engine,” and its concept probably goes back much further than that (Green, 2005). As the saying goes, the only thing new is the history we are unaware of or, as Mark Twain once remarked, few if any ideas are original, and can usually be traced back to earlier ones.
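To make the overfitting concern raised earlier in this section concrete, here is a minimal sketch in Python. It uses nothing beyond ordinary least-squares polynomial fitting, which is classical statistics rather than anything specific to machine learning. The data-generating process (a straight line plus normal noise), the sample sizes, the polynomial degrees, and the seed are all assumptions chosen purely for illustration. They are not taken from the text.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Fixed seed, chosen arbitrarily so the illustration is reproducible.
rng = np.random.default_rng(0)

def simulate(n):
    """Draw n points from a hypothetical process: y = 2 + 3x + noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted model on the data (x, y)."""
    return float(np.mean((y - model(x)) ** 2))

x_train, y_train = simulate(12)   # small sample used to fit the models
x_test, y_test = simulate(200)    # fresh data the models never saw

results = {}
for degree in (1, 9):
    # Least-squares polynomial fit of the given degree to the training data.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train),
                       mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree}: train MSE = {train_err:.3f}, "
          f"test MSE = {test_err:.3f}")
```

Because the straight line is a special case of the degree-9 polynomial, the more flexible model can never fit the training data worse in the least-squares sense. On the fresh data, however, it will typically do worse, since much of what it has matched is noise. That gap between training error and error on new data is what “overfitting” refers to, whichever field happens to be using the word.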
1.9 “Training” and “Testing” Models: What “Statistical Learning” Means in the Age of Machine Learning and Data Science
One aspect of the “data revolution,” with data science and machine learning leading the way, has been the emphasis on the concept of statistical learning. As mentioned, simply because we assign a new word or phrase to something does not necessarily mean that it represents something entirely new. The phrase “statistical learning” is testimony to this. In its simplest and most direct form, statistical learning simply means fitting a model or algorithm to data. The model then “learns” from the data. But what does it learn exactly? It essentially learns what the values of its parameter estimates should be in order to maximize or minimize a function. For example, in the case of a simple linear regression, if we take a model of the type y_i = α + βx_i + ε_i and fit it to some data, the model “learns” from the data the best values for a and b, which are estimators of α and β, respectively. Given that we are using ordinary least-squares as our method of estimation, the regression model “learns” what a and b should be such that the sum of squared errors is kept to a minimum value (if you don’t know what all this means, no worries, you’ll learn it in Chapter 7). The point here for this discussion is that the “learning” or “training” consists simply of selecting scalars