Statistical Significance Testing for Natural Language Processing. Rotem Dror
which has made data-driven performance comparison much more complicated. This is because these models are non-deterministic due to their non-convex objective functions, complex hyperparameter tuning process, and training heuristics such as random dropouts that are often applied in their implementation. Chapter 5 hence defines a framework for a statistically valid comparison between two DNNs based on multiple solutions each of them produces for a given dataset. The chapter summarizes previous attempts in the NLP literature to perform this comparison task and evaluates them in light of the proposed framework. Then, it presents a new comparison method that is better fitted to the pre-defined framework. This chapter is based on our ACL 2019 paper [Dror et al., 2019].
The second challenge is crucial for the efforts to extend the reach of NLP technology to multiple domains and languages. These well-justified efforts result in a large number of comparisons between algorithms, across corpora from a large number of languages and domains. The goal of this chapter is to provide the NLP community with a statistical analysis framework, termed Replicability Analysis, which will allow us to draw statistically sound conclusions in evaluation setups that involve multiple comparisons. The classical goal of replicability analysis is to examine the consistency of findings across studies in order to address the basic dogma of science, namely that a finding is more convincingly true if it is replicated in at least one more study [Heller et al., 2014, Patil et al., 2016]. We adapt this goal to NLP, where we wish to ascertain the superiority of one algorithm over another across multiple datasets, which may come from different languages, domains, and genres. This chapter is based on our TACL paper [Dror et al., 2017].
Finally, while this book aims to provide a basic framework for proper statistical significance testing in NLP research, it is by no means the final word on this topic. Indeed, Chapter 7 presents a list of open questions that are still to be addressed in future research. We hope that this book will contribute to the evaluation practices in our community and eventually to the development of more effective NLP technology.
CHAPTER 2
Statistical Hypothesis Testing
We begin with a definition of the statistical hypothesis testing framework. This fundamental framework will then allow us to discuss statistical significance tests (Chapter 3) and later on their application to experimental research in NLP.
A statistical hypothesis is defined as a hypothesis that is testable by observing and analyzing a process modeled by a set of random variables. In the basic setting, two datasets are compared and a hypothesis is proposed for the statistical relationship between them. This hypothesis is usually suggested as an alternative to an ideal null hypothesis that (often) proposes no relationship between the two datasets. If the observed relationship between the datasets is sufficiently unlikely under the null hypothesis, that is, its probability falls below a threshold called the significance level, the null hypothesis is rejected.
In order to distinguish between the null hypothesis and the alternative hypothesis, we consider two conceptual types of errors. The first type of error occurs when the null hypothesis is wrongly rejected while the second occurs when we wrongfully do not reject the null hypothesis. These two types of errors are known as type I and type II errors, and we will further elaborate on them later on.
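The two error types can be laid out as a small decision table. The sketch below is only an illustration: the function name `classify_outcome` is hypothetical, and in a real experiment we never observe whether the null hypothesis is actually true, so this labeling is conceptual rather than something we can compute from data.

```python
def classify_outcome(null_is_true: bool, null_rejected: bool) -> str:
    """Name the outcome of a test, given the (normally unknown) truth."""
    if null_is_true and null_rejected:
        return "type I error"    # the null hypothesis is wrongly rejected
    if not null_is_true and not null_rejected:
        return "type II error"   # the null hypothesis is wrongly not rejected
    return "correct decision"


# Enumerate the four combinations of truth and decision.
for null_is_true in (True, False):
    for null_rejected in (True, False):
        outcome = classify_outcome(null_is_true, null_rejected)
        print(f"H0 true: {null_is_true!s:5}  H0 rejected: {null_rejected!s:5}  ->  {outcome}")
```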
In empirical machine learning research in general, and in the NLP community in particular, we would often like to prove the superiority of one algorithm over the other, and present this superiority in terms of a statistically significant improvement according to an evaluation metric, such as accuracy or F-score.1 Therefore, we begin by formulating a general hypothesis testing framework for the comparison between two algorithms. This is the common type of hypothesis testing framework applied in NLP, and its detailed formulation will help us develop our ideas.
2.1 HYPOTHESIS TESTING
We wish to compare two algorithms, A and B. As an example, let us consider a comparison between two machine translation (MT) algorithms: phrase-based MT (such as the Moses MT system [Koehn et al., 2007]) vs. an LSTM Neural Encoder-decoder Network (e.g., the model described in Cho et al. [2014]). In order to compare the two algorithms, we would experiment with several different parallel corpora. Let X be the set of such corpora, i.e., a collection of datasets X = {X1, X2,…, XN}, where each dataset Xi consists of sentence pairs, one sentence from the source language and one from the target language. That is, for all i ∈ {1,…, N}, Xi = {xi,1,…, xi,ni}, where xi,j is a source language sentence and its translation.
The difference in performance between the two algorithms is measured with one or more evaluation metrics. In our example, when evaluating the performance of machine translation systems, we may use several evaluation measures to assess the quality of translation from various angles. For example, we would probably like our MT system to provide an accurate translation but we may also want to encourage creativity and linguistic richness, and prefer systems that do not excessively repeat the same words and phrases. Accordingly, we would evaluate it using two widely used metrics: BLEU [Papineni et al., 2002] and PINC [Chen et al., 2011]. We denote our set of metrics as M = {M1,…, Mm}.2
So far, we have our two MT algorithms A and B, applied to the datasets in X and evaluated with a set of metrics M = {M1,…, Mm}. We denote with Mj(ALG, Xi) the value of the measure Mj when algorithm ALG is applied to the dataset Xi. Without loss of generality, we assume that higher values of the measure are better.
We define the difference in performance between two algorithms, A and B, according to the measure Mj on the dataset Xi as:
δj(Xi) = Mj(A, Xi) − Mj(B, Xi).
Finally, using this notation we formulate the following statistical hypothesis testing problem:
H0: δj(Xi) ≤ 0
H1: δj(Xi) > 0.
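To make the notation concrete, the following minimal sketch computes the value of a measure for an algorithm on a dataset, and then the difference between the values obtained by A and B. Everything in it is an assumption made for illustration: the toy `exact_match` metric stands in for BLEU or PINC, the two "algorithms" are simple lookup tables, and the names `metric_value` and `delta` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# A dataset Xi is a sequence of (source sentence, reference translation) pairs.
Dataset = Sequence[Tuple[str, str]]
Algorithm = Callable[[str], str]
Metric = Callable[[Sequence[str], Sequence[str]], float]


def exact_match(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Toy metric (an assumption, not BLEU or PINC): share of exact matches."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)


def metric_value(metric: Metric, algorithm: Algorithm, dataset: Dataset) -> float:
    """Mj(ALG, Xi): the value of the measure when ALG is applied to the dataset."""
    outputs = [algorithm(source) for source, _ in dataset]
    references = [reference for _, reference in dataset]
    return metric(outputs, references)


def delta(metric: Metric, alg_a: Algorithm, alg_b: Algorithm, dataset: Dataset) -> float:
    """Difference in performance: Mj(A, Xi) - Mj(B, Xi)."""
    return metric_value(metric, alg_a, dataset) - metric_value(metric, alg_b, dataset)


# Hypothetical toy corpus and two lookup-table "translation systems".
toy_corpus = [("uno", "one"), ("dos", "two"), ("tres", "three"), ("cuatro", "four")]
lexicon_a = {"uno": "one", "dos": "two", "tres": "three", "cuatro": "for"}
lexicon_b = {"uno": "one", "dos": "too", "tres": "three", "cuatro": "for"}


def alg_a(source: str) -> str:
    return lexicon_a.get(source, "")


def alg_b(source: str) -> str:
    return lexicon_b.get(source, "")


# A gets 3 of 4 sentences right and B gets 2 of 4, so the difference is 0.25.
print(delta(exact_match, alg_a, alg_b, toy_corpus))
```

A positive difference on its own does not settle the question: the hypothesis test asks whether such a difference is large enough to be unlikely under the null hypothesis.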
The goal of testing the above hypotheses is to determine if algorithm A is significantly better than algorithm B on the dataset Xi using the evaluation measure Mj. In our example, this translates to the following question: “Is the LSTM-based MT system better than the phrase-based one on the Wikipedia parallel corpus when considering the BLEU metric?”
If we strive to show that the LSTM is superior to the phrase-based system (in the specific setup of the Wikipedia Corpus and the BLEU metric), we would need to provide statistically valid evidence. Our hypotheses can be described as follows: The (somewhat pessimistic) null hypothesis would state that there is no significant performance difference between the LSTM and the phrase-based system, or that the latter performs even better, while the alternative hypothesis would state that the LSTM performs significantly better.
More generally, in our formulation the null hypothesis, H0, states that there is no difference between the performance of algorithm A and algorithm B, or that B performs better. This hypothesis is tested against the alternative statement, H1, that A is superior. If the statistical test results in rejecting the null hypothesis, one concludes that A outperforms B in this setup, i.e., on dataset Xi with respect to the evaluation metric Mj. Otherwise, there is not enough evidence in the data to reject the null hypothesis. In this case, it is not customary to claim that we accept the null hypothesis, since the null hypothesis is the starting point, and by posing an alternative hypothesis we try to challenge the idealized state.
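The decision rule described above can be sketched in a few lines. The sketch makes several assumptions that are not part of the text: it uses a paired sign-flip permutation test as one possible way to obtain a p-value, it works on hypothetical per-example score differences between A and B, and it takes a significance level of 0.05. Strictly speaking, the sign-flip test assumes that under the null hypothesis the differences are symmetric around zero. The function names are illustrative only.

```python
from itertools import product
from typing import Sequence


def sign_flip_p_value(differences: Sequence[float]) -> float:
    """One-sided exact p-value of a paired sign-flip permutation test.

    Under the null hypothesis each per-example difference (A minus B) is
    assumed equally likely to have either sign. The p-value is the share of
    sign assignments whose total is at least as large as the observed total.
    Enumeration is exact, so it is only practical for a small number of examples.
    """
    observed = sum(differences)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(differences)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, differences))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
    return at_least_as_large / total


def decide(p_value: float, alpha: float = 0.05) -> str:
    """Reject H0 when the p-value does not exceed the significance level."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


# Hypothetical per-example differences where A is consistently better.
consistent = [2, 3, 1, 4, 2, 3]
p = sign_flip_p_value(consistent)
print(p, decide(p))  # 0.015625 reject H0

# Hypothetical differences with no clear winner.
mixed = [2, -3, 1, -1]
p = sign_flip_p_value(mixed)
print(p, decide(p))  # 0.6875 do not reject H0
```

Note that the second outcome is reported as "do not reject H0" rather than "accept H0", in line with the convention discussed above.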
Naturally, we could be wrong in our conclusion. Our specific experiments may show that the LSTM outperforms the phrase-based system in a certain setup, but this does not necessarily reflect the true nature of things. Let us now properly define the two types of errors that we may encounter in our hypothesis test.
• Type I error—rejection of the null hypothesis