scientific claims presumably supported by bloated statistical analyses. Just look at the methodological debates that surrounded COVID-19, which concerns an object of study that is relatively “easy” philosophically! Step away from concrete science, throw in advanced statistical technology and complexity, and you enter a world where establishing evidence is philosophical quicksand. Many students who use statistical methods fall into these pits without even knowing it, and it is the instructor’s responsibility to keep them grounded in what a statistical method can vs. cannot do. I have told students countless times, “No, the statistical method cannot tell you that; it can only tell you this.”
Hence, students of the empirical sciences need to be acutely aware of, and appreciate, the deeper issues involved in conducting their own science. This implies a heavier emphasis not on how to conduct a billion different statistical analyses, but on understanding the issues that arise in conducting the “basic” analyses they are performing. It is a matter of fact that many students who fill their theses or dissertations with applied statistics may nonetheless fail to appreciate that very little of scientific usefulness has been achieved. What has too often been achieved instead is a blatant abuse of statistics masquerading as scientific advancement. The student “bootstrapped standard errors” (Wow! Impressive!), but in the midst of a dissertation that is scientifically unsound or, at a minimum, very weak on a methodological level.
A perfect example of how statistical analyses can be abused is the so-called “mediation” analysis (you might infer from the quotation marks that I am generally not a fan, and for a very good reason, I may add). At lightning speed, a student or researcher can regress Y on X, introduce Z as a mediator, and, if the result is statistically significant, draw the conclusion that “Z mediates the relationship between Y and X.” That is fine, so long as it is clearly understood that what has been established is statistical mediation (Baron and Kenny, 1986), and not necessarily anything more. To say that Z mediates Y and X in a real substantive sense requires, of course, much more knowledge of the variables and/or of the research context or design. It requires, first and foremost, defining what one means by “mediation” in the first place. Simply because one computes statistical mediation does not, in any way whatsoever, justify drawing the conclusion that “X goes through Z on its way to Y,” or anything even remotely similar. Crazy talk! Of course, understanding this limitation should be obvious, right? Not so for many who conduct such analyses. What would such a conclusion even mean? In most cases, with most variables, it simply does not make sense, regardless of how much statistical mediation is established. Again, this should be blatantly obvious; however, many students (and researchers) are unaware of it, failing to realize or appreciate that a statistical model cannot, by itself, impart a “process” onto variables. All a statistical model can typically do, by itself, is partition variability and estimate parameters. Fiedler et al. (2011) recently summarized the rather obvious fact that, without the validity of prior assumptions, statistical mediation is simply, and merely, variance partitioning. Fisher, the inventor of ANOVA (analysis of variance), already warned us of this when he said of his own (at the time) novel method that ANOVA was merely a way of “arranging the arithmetic.” Whether that arrangement is meaningful has to come from the scientist and from a deep consideration of the objects on which the arrangement is being performed. This idea, that the science matters more than the statistics applied to it, is at risk of being lost, especially in the social sciences, where statistical models regularly “run the show” (at least in some fields) due to the difficulty, in many cases, of operationalizing or controlling the objects of study.
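To make concrete just how mechanical this procedure is, below is a minimal sketch of the three regressions behind “statistical mediation” in the Baron and Kenny (1986) style. The data are simulated, the variable names X, Z, and Y are hypothetical, and the statsmodels library is assumed to be available; the point is simply that the entire output is a handful of estimated regression coefficients, nothing more.

# Minimal sketch of Baron and Kenny (1986) style "statistical mediation"
# on simulated data with hypothetical variables X, Z, Y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=n)                        # predictor
Z = 0.5 * X + rng.normal(size=n)              # candidate "mediator"
Y = 0.4 * Z + 0.2 * X + rng.normal(size=n)    # outcome

# Step 1: regress Y on X (total effect, c)
total = sm.OLS(Y, sm.add_constant(X)).fit()

# Step 2: regress Z on X (path a)
path_a = sm.OLS(Z, sm.add_constant(X)).fit()

# Step 3: regress Y on X and Z (direct effect c' and path b)
direct = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()

print("total effect c  :", total.params[1])
print("path a          :", path_a.params[1])
print("path b          :", direct.params[2])
print("direct effect c':", direct.params[1])
# The "indirect effect" a*b is just arithmetic on these estimates; nothing
# here establishes that X "goes through" Z in any substantive sense.
print("indirect a*b    :", path_a.params[1] * direct.params[2])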
Returning to our mediation example, if the context of the research problem lends itself to a physical or substantive definition of mediation, or to some other physical process, such that there is good reason to believe Z is truly, substantively, “mediating,” then the statistical model can be used to establish support for this already-presumed relation, in the same way a statistical model can be used in regression to quantify the generational transmission of physical qualities from parent to child. The process itself, however, is not due to the fitting of a statistical model. Never in the history of science or statistics has a statistical model ever generated a process. At best, it has merely described one. Many students, however, excited to have bootstrapped those standard errors in their model and all the rest of it, are apt to draw substantive conclusions from a statistical model that simply do not hold water. In such cases, one is better off not running a statistical model at all than using it to draw inane, philosophically egregious conclusions that could usually be easily corrected in any introductory philosophy of science or research methodology course. Abusing and overusing statistics does little to advance science. It simply provides a cloak of complexity.
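And those “bootstrapped standard errors”? They, too, are nothing but computation. A minimal sketch, continuing the hypothetical X, Z, Y example above, shows that bootstrapping the indirect effect a*b amounts to resampling rows and redoing the arithmetic; it quantifies the sampling variability of a product of coefficients and says nothing about any substantive process.

# Minimal sketch: bootstrapped standard error of the indirect effect a*b,
# continuing the hypothetical simulated example above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=n)
Z = 0.5 * X + rng.normal(size=n)
Y = 0.4 * Z + 0.2 * X + rng.normal(size=n)

def indirect_effect(x, z, y):
    """Return a*b from the two mediation regressions."""
    a = sm.OLS(z, sm.add_constant(x)).fit().params[1]
    b = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit().params[2]
    return a * b

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)   # resample rows with replacement
    boot.append(indirect_effect(X[idx], Z[idx], Y[idx]))

boot = np.array(boot)
print("bootstrapped SE of a*b  :", boot.std(ddof=1))
print("95% percentile interval :", np.percentile(boot, [2.5, 97.5]))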
So, what is the conclusion and recommendation to draw from what might appear to be a very cynical discussion with which to introduce this book? Understanding the science and the statistics must come first. Understanding what can vs. cannot be concluded from a statistical result is the “hard part,” not computing something in Python, at least not at our level of computation (at more advanced levels, of course, computing can be exceptionally difficult, as evidenced by the necessity of advanced computer science degrees). Python code can always be looked up for applied science purposes, but “statistical understanding” cannot. At least not so easily. Before embarking on either a statistics course or a computation course, students are strongly encouraged to take a rigorous research design course, as well as a philosophy of science course, so that they might better appreciate the limitations of the “claims to evidence” in their projects. Otherwise, statistics, and the computers that compute them, can just as easily be misused and abused as used correctly, and sadly, they often are. Instructors and supervisors also need to better educate students against the reckless fitting of statistical models and the computing of inordinate amounts of statistics without careful guidance on what can vs. cannot be interpreted from such numerical measures. Design first, statistics second.
Mathematical vs. “Conceptual” Understanding
One important aspect of learning and understanding any craft is knowing where and why making distinctions is important and, at the opposite end of the spectrum, where divisions simply blur what is really there. One area where this is especially true is in learning, or at least “using,” a technical discipline such as mathematics or statistics to better understand another subject. Many instructors of applied statistics strive to teach statistics at a “conceptual” level, which, to them at least, means making the discipline less “mathematical.” This is done presumably to attract students who may otherwise be fearful of mathematics with all of its formulas and symbolism. However, this distinction, I argue, does more harm than good, and completely misses the point. The truth of the matter is that mathematics are concepts. Statistics are likewise concepts. Attempting to draw a distinction between two things that are the same does little good and only provides more confusion for the student.
A linear function, for example, is a concept, just as a standard error is a concept. That they are symbolized does not take away from the fact that there is a softer, more malleable “idea” underneath them, which the symbolic definition has merely attempted to capture. The sooner the student of applied statistics recognizes this, the sooner he or she will stop psychologically associating mathematics with “mathematics,” and instead associate it with what it really is: a form of conceptual development and refinement of intellectual ideas. The mathematics is usually in many