Statistical Significance Testing for Natural Language Processing. Rotem Dror
6.2 A Multiple Hypothesis Testing Framework for Algorithm Comparison
6.3 Replicability Analysis with Partial Conjunction Testing
6.4 Replicability Analysis: Counting
6.5 Replicability Analysis: Identification
6.7 Real-World Data Applications
6.7.2 Statistical Significance Testing
6.7.4 Results Summary and Overview
7 Open Questions and Challenges
Preface
The field of Natural Language Processing (NLP) has made substantial progress in the last two decades. This progress stems from multiple sources: the data revolution that has made abundant amounts of textual data from a variety of languages and linguistic domains available, the development of increasingly effective predictive statistical models, and the availability of hardware that can apply these models to large datasets. This dramatic improvement in the capabilities of NLP algorithms carries the potential for a great impact.
The extended reach of NLP algorithms has also led NLP papers to place increasing emphasis on their experiment and result sections, showing comparisons between multiple algorithms on various datasets from different languages and domains. It can be safely argued that the ultimate test for the quality of an NLP algorithm is its performance on well-accepted datasets, sometimes referred to as “leader-boards”. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.
The goal of this book is to discuss the main aspects of statistical significance testing in NLP. Particularly, we aim to briefly summarize the main concepts so that they are readily available to the interested researcher, address the key challenges of hypothesis testing in the context of NLP tasks and data, and discuss open issues and the main directions for future work.
We start with two introductory chapters that present the basic concepts of statistical significance testing: Chapter 2 provides a brief presentation of the hypothesis testing framework, and Chapter 3 introduces common statistical significance tests. Then, Chapter 4 discusses the application of statistical significance testing to NLP. In that chapter, we assume that two algorithms are compared on a single dataset, based on a single output that each of them produces, and discuss the relevant significance tests for various NLP tasks and evaluation measures. The chapter emphasizes the aspects in which NLP tasks and data differ from common examples in the statistical literature, such as the non-Gaussian distribution of the data and the dependence between the participating examples (e.g., sentences in the same corpus). This chapter, which extends our ACL 2018 paper [Dror et al., 2018], provides our recommended matching of NLP tasks and their evaluation measures to statistical significance tests.
The next two chapters relax two of the basic assumptions of Chapter 4: (a) that each of the compared algorithms produces a single output for each test example (e.g., a single parse tree for a given input sentence), and (b) that the comparison between the two algorithms is performed on a single dataset. In particular, Chapter 5 addresses the comparison between two algorithms based on multiple solutions that each of them produces for a single dataset, while Chapter 6 addresses the comparison between two algorithms across several datasets.
The first challenge stems from the recent emergence of Deep Neural Networks (DNNs), which has made data-driven performance comparison much more complicated. This is because these models are non-deterministic, owing to their non-convex objective functions, complex hyperparameter tuning processes, and training heuristics, such as random dropout, that are often applied in their implementation. Chapter 5, therefore, defines a framework for a statistically valid comparison between two DNNs based on the multiple solutions each of them produces for a given dataset. The chapter summarizes previous attempts in the NLP literature to perform this comparison task and evaluates them in light of the proposed framework. Then, it presents a new comparison method that is better fitted to the pre-defined framework. This chapter is based on our ACL 2019 paper [Dror et al., 2019].
The second challenge is crucial for the efforts to extend the reach of NLP technology to multiple domains and languages. These well-justified efforts result in a large number of comparisons between algorithms, across corpora from a large number of languages and domains. The goal of Chapter 6 is to provide the NLP community with a statistical analysis framework, termed Replicability Analysis, which will allow us to draw statistically sound conclusions in evaluation setups that involve multiple comparisons. The classical goal of replicability analysis is to examine the consistency of findings across studies in order to address the basic dogma of science, namely that a finding is more convincingly true if it is replicated in at least one more study [Heller et al., 2014, Patil et al., 2016]. We adapt this goal to NLP, where we wish to ascertain the superiority of one algorithm over another across multiple datasets, which may come from different languages, domains, and genres. This chapter is based on our TACL paper [Dror et al., 2017].
Finally, while this book aims to provide a basic framework for proper statistical significance testing in NLP research, it is by no means the final word on this topic. Indeed, Chapter 7 presents a list of open questions that are still to be addressed in future research. We hope that this book will contribute to the evaluation practices in our community and eventually to the development of more effective NLP technology.
INTENDED READERSHIP
The book is intended for researchers and practitioners in NLP who would like to analyze their experimental results in a statistically sound manner. Hence, we assume technical background in computer science and related areas such as statistics and probability, mostly at the undergraduate level. Moreover, while in Chapter 4 we discuss various NLP tasks and their proposed