Applied Univariate, Bivariate, and Multivariate Statistics Using Python. Daniel J. Denis
to have learned or been “trained” from the data. This is, at its most essential and rudimentary level, what statistical learning actually means in many (not all) contexts. If we subject that model to new data after that, thus “sharpening” its scalars, the model “updates” what its estimates should be in order to continue optimizing a function. Note that this more or less parallels the idea of human learning, in that the model (or “you”) is “learning from experience” as a new experience is incorporated into knowledge. For example, a worker learns how to maximize his or her potential in a job through trial and error, otherwise known as “experience.” If one day his or her boss corrects him or her, that new “data” is incorporated into the learning mechanism. If on another day the individual is reinforced for doing something right, that is also incorporated into the learning mechanism. Of course, we cannot see the scalars or estimators (they are largely metaphorical in this case), but you get the idea. Learning “optimizes” some function through exposure to new experience. In classical learning theory in psychology, for instance, the rat in a Skinner box learns that if he presses the lever, he will receive a pellet of food. If he doesn’t press the lever, he doesn’t receive food. The rat is optimizing the function (it’s in his little brain, and it’s metaphorical; we can’t see it) that will allow him to distinguish which response gets the food. This is learning! When the rat is “trained” enough, he starts making predictions nearly perfectly with very few errors. So it is with the statistical model; it does an increasingly good job at “getting it right” as it is trained on more and more data (i.e. more “experience”). It also “learns” from what it did wrong, just as the rat learns that if he doesn’t press the lever, he doesn’t eat.
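To make the idea of a model “learning” its scalars a little more concrete, here is a minimal sketch in plain Python. It assumes a simple linear regression fit by ordinary least-squares to simulated data. The function names `fit_line` and `simulate`, the population values (intercept 2.0, slope 0.5), and the sample sizes are all hypothetical choices made for illustration; they do not come from any data set in this book.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares estimates (intercept, slope) for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

random.seed(1)
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5  # assumed population values (illustrative)

def simulate(n):
    """Draw n (x, y) pairs from the assumed population, with random error."""
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, 1) for xi in x]
    return x, y

x, y = simulate(2000)

# "Train" on progressively more experience and watch the estimates update.
for n in (10, 100, 2000):
    a, b = fit_line(x[:n], y[:n])
    print(f"n = {n:4d}: intercept = {a:.3f}, slope = {b:.3f}")
```

With more observations the estimates tend to settle closer to the population values, although any single small sample can land near or far from them by chance.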
Is any of this “new?” Of course not! In a very real way, pioneers of regression in the 1890s, the likes of Karl Pearson and George Udny Yule (see Denis and Docherty, 2007), were computing these same regression coefficients on their own data, though not with the use of computers. However, back then it was not referred to as a model learning from data; it was simply seen as a novel statistical method that could help address a social problem of the day. Even earlier than that, Legendre and Gauss in the early nineteenth century (1800s) were developing the method of least-squares that would eventually be used in applying regression methods later that century. Again, these were not called statistical learning methods back then. The idea of calling them learning methods seems to have arisen mostly in statistics, but is now center stage in data science and machine learning. However, a lot of this is due to the zeitgeist, where “zeitgeist” means the “spirit of the times” we are in, which is one of computers, artificial intelligence, and the idea that if we supply a model with enough data, it can eventually “fly itself” so to speak. Hence, the idea here is of “training” as well. This idea is very popular in digit recognition, in that the model is supplied with enough data that it “learns” to discriminate between whether a number is a “2,” for instance, or a “4” by learning its edges and most of the rest of what makes these numbers distinct from one another. Of course, the training of a model is not always done via ordinary least-squares regression. Other models are used, and the process can get quite complex and will not always follow this simple regression idea. Sometimes an algorithm is designed to search for patterns in data, in which case the statistical method is considered to be unsupervised because it has no a priori group structure to guide it, as in so-called supervised learning.
Principal components, exploratory factor analysis, and cluster analysis are examples of this. However, even in these cases, optimization criteria have been applied. For example, in principal components analysis, we are still solving for optimal values of scalars, but instead of minimizing the sum of squared (vertical) errors, we are maximizing the variance accounted for in the original variables subjected to the procedure (how this is done will become clear when we survey PCA later in the book).
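As a rough preview of what “maximizing variance” means, the following sketch uses two simulated, correlated variables and searches by brute force for the direction (a unit-length pair of weights) along which the projected data have the largest variance. It then compares that maximum with the largest eigenvalue of the 2 × 2 covariance matrix, which is what a principal components routine would report for the first component. The simulated data, the 1-degree search grid, and the name `projected_variance` are assumptions made purely for illustration; this is not how PCA software actually computes its solution.

```python
import math
import random

random.seed(7)
n = 300

# Hypothetical simulated data: two positively correlated variables.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 0.5) for a in x1]

# Center each variable at its mean.
m1, m2 = sum(x1) / n, sum(x2) / n
x1 = [a - m1 for a in x1]
x2 = [b - m2 for b in x2]

def projected_variance(theta):
    """Sample variance of the data projected onto (cos theta, sin theta)."""
    w1, w2 = math.cos(theta), math.sin(theta)
    scores = [w1 * a + w2 * b for a, b in zip(x1, x2)]
    return sum(s * s for s in scores) / (n - 1)

# Brute-force search over directions in 1-degree steps.
best_var, best_deg = max(
    (projected_variance(math.radians(d)), d) for d in range(180)
)

# Largest eigenvalue of the 2 x 2 covariance matrix, in closed form.
s11 = sum(a * a for a in x1) / (n - 1)
s22 = sum(b * b for b in x2) / (n - 1)
s12 = sum(a * b for a, b in zip(x1, x2)) / (n - 1)
lambda1 = (s11 + s22) / 2 + math.sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)

print(f"largest projected variance found: {best_var:.4f} at {best_deg} degrees")
print(f"largest eigenvalue of covariance: {lambda1:.4f}")
```

The two printed values should agree closely, differing only because the search is restricted to whole degrees. The weights that achieve the maximum play the same role as the regression coefficients did earlier: they are scalars chosen to optimize a criterion.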
Now, in the spirit of statistical learning and “training,” validating a model has become equally emphasized, in the sense that after a model is trained on one set of data, it should be applied to a similar set of data to estimate the error rate on that new set. But what does this mean? How can we understand this idea? Easily! Here are some easy examples of where this occurs:
The pilot learns in the simulator or test flights and then his or her knowledge is “validated” on a new flight. The pilot was “trained” in landing in a thunderstorm yesterday and now that knowledge (model) will be evaluated in a new flight on a new storm.
Rafael Nadal, tennis player, learns from his previous match how to not make errors when returning the ball. That learning is evaluated on new data, which is a new tennis match.
A student in a statistics class learns from the first test how to adjust his or her study strategies. That knowledge is validated on test 2 to see how much was learned.
Of course, we can go on and on. The point is that the idea of statistical learning, including concepts of machine learning, is meant to exemplify the zeitgeist we find ourselves in, which is one of increased automation, computers, artificial intelligence, and the idea of machines becoming more and more self-sufficient and learning by themselves how to make “optimal” decisions (e.g. self-driving cars). However, what is really going on “behind the scenes” is essentially mathematics, and usually a problem of optimizing scalars to satisfy particular constraints imposed on the problem.
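The train-then-validate idea behind the pilot, the tennis player, and the student can be sketched in a few lines of Python. The example below assumes a deliberately simple, hypothetical classification rule: two groups measured on one predictor, with the “learned” cutoff placed midway between the two group means in the training sample. That rule is then applied to a fresh sample from the same simulated population to estimate its error rate. The population values, sample sizes, and function names (`simulate`, `train`, `error_rate`) are illustrative assumptions only.

```python
import random

random.seed(3)

def simulate(n_per_group):
    """Simulate (score, group) pairs for two groups from an assumed population."""
    data = [(random.gauss(0.0, 1.0), 0) for _ in range(n_per_group)]
    data += [(random.gauss(2.0, 1.0), 1) for _ in range(n_per_group)]
    return data

def train(data):
    """'Learn' a cutoff: the midpoint between the two group means."""
    scores0 = [x for x, g in data if g == 0]
    scores1 = [x for x, g in data if g == 1]
    mean0 = sum(scores0) / len(scores0)
    mean1 = sum(scores1) / len(scores1)
    return (mean0 + mean1) / 2

def error_rate(data, cutoff):
    """Proportion misclassified; assumes group 1 has the higher mean."""
    wrong = sum(1 for x, g in data if (1 if x > cutoff else 0) != g)
    return wrong / len(data)

training = simulate(50)     # yesterday's storm
validation = simulate(200)  # a new storm from the same population

cutoff = train(training)
training_error = error_rate(training, cutoff)
validation_error = error_rate(validation, cutoff)

print(f"learned cutoff:             {cutoff:.3f}")
print(f"error rate on training set: {training_error:.3f}")
print(f"error rate on new data:     {validation_error:.3f}")
```

Only the second error rate is a test in the sense described above, because the cutoff was chosen without ever seeing the validation cases.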
In this book, while it can be said that we do “train” models by fitting them, we do not cross-validate them on new data. Since it is essentially an introduction and primer, we do not take that additional step. However, you should know that such a step is often a good one to take if you have such data at your disposal to make cross-validation doable. In many cases, scientists may not have such cross-validation data available to them, at least not yet. Hence, “splitting the sample” into a training and test set may not be doable due to the size of the data. However, that does not necessarily mean testing cannot be done. It can be, on a new data set that is assumed to be drawn from the same population as the original training set. Techniques for cross-validation do exist that minimize having to collect very large validation samples (e.g. see James et al., 2013). Further, to use one of our previous metaphors, validating the pilot’s skill may be delayed until a new storm is available; it does not necessarily have to be done today. Hence, and in general, when you fit a model, you should always have it in mind to validate that model on new data, data that was not used in the training of the model. Why is this last point important? Quite simply because if the pilot is testing his or her skills on the same storm in which he or she was trained, it’s hardly a test at all. The pilot already knows the intricacies and details of that particular storm, so it is not really a test of new skills; it is more akin to a test of how well he or she remembers how to deal with that specific storm and (returning to our statistical discussion) capitalizes on chance factors. This is why if you are to cross-validate a model, it should be done on new “test” data, never the original training data.
If you do not cross-validate the model, you can generally expect the fit on the training data to be optimistic, such that the model will appear to fit “better” than it actually would on new data. This is the primary reason why cross-validation of a model is strongly encouraged. Either way, clear communication of your results is the goal, in that if you fit a model to training data and do not cross-validate it on test data, inform your audience of this so they can know what you have done. If you do cross-validate it, likewise inform them. Hence, in this respect, it is not “essential” that you cross-validate immediately, but what is essential is that you are honest and open about what you have done with your data and clearly communicate to your readers why the current estimates of model fit are likely to be a bit inflated due to not immediately testing out the model on new data. In your next study, if you are able to collect a sample from the same population, evaluate your model on new data to see how well it fits. That will give you a more honest assessment of how good your model really is. For further details on cross-validation, see James et al. (2013), and for a more thorough and deeper theoretical treatment, see Hastie et al. (2009).
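One technique of the kind alluded to above, which reuses a single modest sample rather than requiring a large separate validation sample, is k-fold cross-validation. The sketch below is a minimal version in plain Python, assuming a simple least-squares regression on simulated data. The choice of 5 folds, the sample size of 30, the population values, and the function names are all hypothetical; see James et al. (2013) for a proper treatment.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares estimates (intercept, slope) for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

random.seed(11)
n = 30
# Hypothetical simulated sample from an assumed population.
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 2) for xi in x]

# Fit and evaluate on the SAME data: the optimistic figure.
a, b = fit_line(x, y)
train_mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n

# 5-fold cross-validation: each case is predicted by a model
# that was fit without it.
k = 5
indices = list(range(n))
random.shuffle(indices)
folds = [indices[i::k] for i in range(k)]

squared_errors = []
for held_out in folds:
    held = set(held_out)
    x_fit = [x[i] for i in indices if i not in held]
    y_fit = [y[i] for i in indices if i not in held]
    a_k, b_k = fit_line(x_fit, y_fit)
    squared_errors += [(y[i] - (a_k + b_k * x[i])) ** 2 for i in held_out]

cv_mse = sum(squared_errors) / n

print(f"mean squared error on training data: {train_mse:.3f}")
print(f"5-fold cross-validated error:        {cv_mse:.3f}")
```

The cross-validated figure typically comes out larger than the training figure, which is the inflation in apparent fit described above. Whichever figure you report, say which one it is.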
1.10 Where We Are Going From Here: How to Use This Book