Applied Univariate, Bivariate, and Multivariate Statistics Using Python. Daniel J. Denis
not so much in terms of data analysis, but rather in examples of how hypothesis-testing works and the like. In this way, it is hoped examples and analogies “hit home” a bit more for readers and students, making the issues “come alive” somewhat rather than featuring abstract examples.
Python code is “unpacked” and explained in many, though not all, places. Many existing books on the market contain explanations of statistical concepts (to varying degrees of precision) and then plop down a bunch of code the reader is expected to simply implement and understand. While we do not avoid this entirely, for the most part we guide the reader step-by-step through both the concepts and the Python code used. The goal of the book is to understand how statistical methods work, not to arm you with a bunch of code whose underlying logic you do not understand. Principal components code, for instance, is meaningless if you do not first understand and appreciate to some extent what components analysis is about.
Statistical Knowledge vs. Software Knowledge
Having now taught at both the undergraduate and graduate levels for the better part of fifteen years to applied students in the social and sometimes natural sciences, I have opened each course, to the delight of my students (sarcasm), with a lecture of sorts on the differences between statistical and software knowledge. Very little of the warning is grasped, I imagine, though the real-life experience of the warning usually surfaces later in their graduate careers (such as at thesis or dissertation defenses where they may fail to understand their own software output). I will repeat some of that sermon here. While this distinction has always been important historically, it has perhaps never been more important than in the present day, given the influx of computing power available to virtually every student in the sciences and related areas, and the relative ease with which such computing power can be used. Allowing a new teen driver to drive a Dodge Hellcat with upward of 700 horsepower would be unwise, yet newcomers to statistics and science have access, from their first day, to the equivalent in computing power. The statistician is shaking his or her head in disapproval, for good reason. We live in an age where data analysis is available to virtually anybody with a laptop and a few lines of code. The code can often be dug up online in a matter of seconds, even with very little software knowledge. And of course, with many software programs coding is not even a requirement, as windows and GUIs (graphical user interfaces) have become so easy to use that one can obtain an analysis in seconds or even milliseconds. Though this has its advantages, it is not necessarily a good thing.
On the one hand, it does allow the student of applied science to “attempt” to conduct his or her data analyses. Yet on the other, as the adage goes, a little knowledge can be a dangerous thing. Being a student of the history of statistics, I can tell you that before computers were widely available, conducting statistical analyses was possible only for those who could drudge through computations by hand in generating their “output” (which of course took the form of paper-and-pencil summaries, not the software output we have today). These computations took hours upon hours to perform, and hence, if one were going to do a statistical analysis, one did not embark on such an endeavor lightly. That does not mean the final solution would necessarily be valid, but rather that folks may have been more likely to give serious thought to their analyses before conducting them. Today, a student can run a MANOVA in literally 5 minutes using software, but, unfortunately, this does not imply the student will understand what they have done or why they have done it. Random assignment to conditions may never even have been performed, yet in the haste to implement the software routine, the student fails to understand or appreciate how limiting their output will be. Concepts of experimental design get lost in the haste to produce computer output; the student of the “modern age” of computing somehow “missed” this step in his or her quickness to, as it were, perform “advanced statistics.” Further, the result is “statistically significant,” yet the student has no idea what Wilks’s lambda is or how it is computed, nor is the difference between statistical significance and effect size understood. The limitations of what the student has produced are not appreciated, and faulty substantive (and often philosophically illogical) conclusions follow. I kid you not, I have been told by a new student that the only problem with the world is a lack of computing power. Once computing power increases, experimental design will be a thing of the past, or so the student believed. Some incoming students enter my class with such perceptions, failing to realize that discovering a cure for COVID-19, for instance, is not a computer issue. It is a scientific one. Computers help, but they do not on their own resolve scientific issues. Instructors faced with these initial misconceptions from their students have a tough row to hoe, especially when forcing on their students fundamental linear algebra in the first two weeks of the course rather than computer code and statistical recipes.
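To make the remark about Wilks’s lambda a little more concrete, here is a minimal sketch of how the statistic can be computed by hand. It is an illustration only, not code from this book: the function name wilks_lambda and the simulated data are hypothetical, and the sketch assumes the standard definition of lambda as the determinant of the within-groups sums of squares and cross-products matrix W divided by the determinant of the total matrix T = W + B.

```python
import numpy as np


def wilks_lambda(groups):
    """Return Wilks's lambda, det(W) / det(T), for a list of (n_i x p) arrays.

    Hypothetical helper for illustration; each array holds one group's
    observations on the same p response variables.
    """
    pooled = np.vstack(groups)
    # Within-groups SSCP: deviations of each observation from its own group mean.
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    # Total SSCP: deviations of every observation from the grand mean (T = W + B).
    centered = pooled - pooled.mean(axis=0)
    T = centered.T @ centered
    return np.linalg.det(W) / np.linalg.det(T)


# Simulated (made-up) data: two response variables, 200 cases per condition.
rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
small_shift = rng.normal(loc=0.05, scale=1.0, size=(200, 2))
large_shift = rng.normal(loc=2.0, scale=1.0, size=(200, 2))

lam_small = wilks_lambda([control, small_shift])
lam_large = wilks_lambda([control, large_shift])
print(f"small mean shift: lambda = {lam_small:.3f}")
print(f"large mean shift: lambda = {lam_large:.3f}")
```

Lambda lies between 0 and 1. Values near 1 indicate that group membership accounts for very little of the generalized variance, while values near 0 indicate strong separation among the groups. Whether a given value is declared “statistically significant” depends additionally on sample size, which is one reason significance and effect size should not be conflated.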
The problem, succinctly put, is that in many sciences, and contrary to the opinion you might expect from someone writing a data analysis text, students learn too much about how to obtain output at the expense of understanding what the output means or the process that is important in drawing proper scientific conclusions from said output. Sadly, in many disciplines, a course in “Statistics” would be more appropriately, and unfortunately, called “How to Obtain Software Output,” because that is pretty much all the course teaches students to do. How did statistics education in applied fields become so watered down? Since when did cultivating the art of analytical or quantitative thinking not matter? Faculty who teach such courses in such a superficial style should know better and instead teach courses with a lot more “statistical thinking” rather than simply generating software output. Among students (who should not necessarily know better – that is what makes them students), there often exists the illusion that simply because one can obtain output for a multiple regression, this somehow implies a multiple regression was performed correctly in line with the researcher’s scientific aims. Do you know how to conduct a multiple regression? “Yes, I know how to do it in software.” This is not a correct answer to the question of knowing how to conduct a multiple regression! One need not even understand what multiple regression is to “compute one” in software. As a consultant, I have also had a client or two from very prestigious universities email me a bunch of software output and ask me “Did I do this right?” assuming I could evaluate their code and output without first knowing their scientific goals and aims. “Were the statistics done correctly?” Of course, without an understanding of what they intended to do or the goals of their research, such a question is not only figuratively, but also literally impossible to answer aside from assuring them that the software has a strong reputation for accuracy in number-crunching.
This overemphasis on computation, software or otherwise, is misguided, is a real problem, and is responsible for many misuses and abuses of applied statistics in virtually every field of endeavor. However, it is especially acute in the social sciences because the objects on which the statistics are computed are often statistical or psychometric entities themselves, which makes understanding how statistical modeling works even more vital to understanding what can vs. what cannot be concluded from a given statistical analysis. Though these problems are also present in fields such as biology and others, they are less acute there, since the reality of the objects in these fields is usually more agreed upon. To be blunt, a t-test on whether a COVID-19 vaccine works or not is not too philosophically challenging. Finding the vaccine is difficult science to be sure, but analyzing the results statistically usually does not require advanced statistics. However, a regression analysis on whether social distancing is a contributing factor to depression rates during the COVID-19 pandemic is not quite as easy on a methodological level. One is so-called “hard science” on real objects, the other might just end up being a statistical artifact. This is why social science students, especially those conducting non-experimental research, need rather deep philosophical and methodological training so they do not read “too much” into a statistical result, something the physical scientist may never have had to confront due to the nature of his or her objects of study. Establishing scientific evidence and supporting a scientific claim in many social (and even natural) sciences is exceedingly