Evidence-Based Statistics. Peter M. B. Cahusac
rain than sunshine this afternoon’.
The natural logarithm of the LR is known as the support, denoted S. The support quantifies the comparative evidence on a scale of −∞ to +∞, with midpoint 0 representing no evidence in favour of either hypothesis. Unlike p, S is a graded measure of evidence without clear cutoffs or thresholds.
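The definition of support can be sketched in a few lines of Python. This is a minimal illustration only: the function name `support` and the example ratios are assumptions made here, not values taken from the text.

```python
import math

def support(likelihood_ratio: float) -> float:
    """Return the support S, the natural log of a likelihood ratio."""
    if likelihood_ratio <= 0:
        raise ValueError("a likelihood ratio must be positive")
    return math.log(likelihood_ratio)

# Hypothetical ratios chosen for illustration; they are not from the text.
for lr in (0.1, 1.0, 10.0):
    print(f"LR = {lr:>4}: S = {support(lr):+.2f}")
```

An LR of 1 gives S = 0 (no evidence either way), and reciprocal ratios give supports of equal size and opposite sign, which is what makes the scale symmetric about its midpoint.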
If the collected data are not strongly influenced by prior considerations, it is somewhat reassuring that the three approaches usually reach the same conclusion. However, it is not difficult to find examples where the likelihood evidence points one way and the hypothesis testing decision points the other (see Section 3.7, de Winter and Cahusac [25], p. 89, and Dienes [6], p. 127).
1.2.2 The Likelihood/Evidential Approach
In advocating the evidential approach, Royall wrote in 2004: ‘Statistics today is in a conceptual and theoretical mess. The discipline is divided into two rival camps, the frequentists and the Bayesians, and neither camp offers the tools that science needs for objectively representing and interpreting statistical data as evidence’ [24], p. 127.
In making sense of data and drawing inferences, it is natural to consider different rival hypotheses to explain how a set of observations arose. Significance testing tests a single hypothesis, typically the null hypothesis. The top of Figure 1.1 illustrates the typical situation when testing a sample mean. The sampling distribution for the mean is located over the null value (see the vertical dashed line down to the horizontal axis). The sample mean, indicated by the continuous vertical line, lies in the shaded rejection region. The shaded region represents 5% of the area under the sampling distribution curve, with 2.5% in each tail. Significance testing states a pre-specified significance level α, typically 5%. Since the sample mean lies within the shaded area, we can say that p < .05 and, given our α, we reject the null hypothesis.
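The decision rule described above can be sketched as follows. This is a simplified illustration that assumes a normal sampling distribution with a known standard error (a z test rather than a t test); the function name `two_sided_p` and the numerical values are hypothetical and are not the data behind Figure 1.1.

```python
import math

def two_sided_p(sample_mean: float, null_value: float,
                standard_error: float) -> float:
    """Two-tailed p value for a sample mean under a normal sampling
    distribution centred on the null value (known standard error assumed)."""
    z = (sample_mean - null_value) / standard_error
    # erfc(|z| / sqrt(2)) is the combined area in both tails beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical values for illustration only.
alpha = 0.05
p = two_sided_p(sample_mean=2.28, null_value=0.0, standard_error=1.0)
print(f"p = {p:.4f}; reject the null at alpha = {alpha}: {p < alpha}")
```

Note that the calculation refers only to the null value: the tail area is measured on a curve centred on the null, not on the observed mean.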
Estimation, a key element in statistical analysis, has often been ignored in the face of dichotomous decisions reached from statistical tests. If results are reported as non-significant, it is assumed that there is no effect or difference between population parameters. Alternatively, highly significant results based on large samples are assumed to represent large effects. The increased use of confidence intervals [26] is a great improvement that allows us to see how large or small the magnitude of an effect is, and hence whether it is of practical/clinical importance. These advances have increased the credibility of well-reported studies and facilitated our understanding of research results. The confidence interval is illustrated in the middle portion of Figure 1.1. This is centred on the sample mean (shown by the end-stopped line) and gives a range of plausible values for the population mean [26]. The interval has a frequentist interpretation: 95% of such intervals, calculated from random samples taken from the population of interest, will contain the population parameter. The confidence interval focusses our attention on the obtained sample mean value, and the 95% limits indicate how far this value is from parameter values of interest, especially the null. The interval helps us determine whether the effect we have observed is of practical importance.
Figure 1.1 From sampling distribution to likelihood function. The top curve shows the sampling distribution used for testing statistical significance. It is centred on the null hypothesis value (often 0), and the standard error used to calculate the curve comes from the observed data. Below this, in the middle, is shown the 95% confidence interval. This uses the sample mean and standard error from the observed data. At the bottom is shown the likelihood function, within which is plotted the S-2 likelihood interval. Like the confidence interval, both the likelihood function and the likelihood interval use the observed data.
At the bottom of Figure 1.1 is shown the likelihood function. This is none other than a rescaled version of the sampling distribution that we saw around the null value. It is calculated from the data, specifically from the sample mean and variance. It contains all the information that we can extract from the data. It is centred on the sample mean, which represents the maximum likelihood estimate (MLE) for the population mean. The likelihood function can then be used to compare different hypothesis parameter values. Using simply the height of the curve, the likelihood function allows us to calculate the relative likelihood, in terms of a ratio, for any two parameter values from competing hypotheses. We may compare any value of interest with the null. For example, we may take a value that is of practical importance. This might be situated above or below the sample mean value. If this value lies between the null and the sample mean, then the ratio relative to the null will be ≥1. If the value is less than the null, then the ratio will be <1. The same will be true on the other side of the sample mean until the counternull value is reached, after which the ratio will be ≤1. The maximum LR is obtained at the sample mean value. For the illustrated data, this ratio was 13.4, giving an S of 2.6. The evidence represented by the likelihood function is centred on the observed data statistic. The same function centred on the null, as used in significance testing, now seems somewhat artificial.
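The comparison of curve heights described above can be sketched in Python. The sketch assumes a normal likelihood for the mean with a known standard error, and it takes the counternull to be the mirror image of the null about the sample mean, which holds for a symmetric likelihood. The sample mean of 2.28 and standard error of 1 are hypothetical values, chosen only so that the maximum ratio comes out close to the 13.4 quoted for the illustrated data; they are not the actual data behind Figure 1.1.

```python
import math

def log_likelihood(mu: float, sample_mean: float,
                   standard_error: float) -> float:
    """Log-likelihood of mu for a normal mean, up to an additive constant."""
    z = (sample_mean - mu) / standard_error
    return -0.5 * z * z

def likelihood_ratio(mu_1: float, mu_2: float, sample_mean: float,
                     standard_error: float) -> float:
    """Ratio of the likelihood curve heights at mu_1 and mu_2."""
    return math.exp(log_likelihood(mu_1, sample_mean, standard_error)
                    - log_likelihood(mu_2, sample_mean, standard_error))

# Hypothetical data summary (assumed, not taken from Figure 1.1).
sample_mean, standard_error, null_value = 2.28, 1.0, 0.0
# Assumed definition for a symmetric likelihood: reflect the null
# about the sample mean.
counternull = 2.0 * sample_mean - null_value

for mu in (-1.0, 1.0, sample_mean, 3.5, counternull, 5.0):
    lr = likelihood_ratio(mu, null_value, sample_mean, standard_error)
    print(f"mu = {mu:5.2f}: LR vs null = {lr:6.2f}, S = {math.log(lr):+.2f}")
```

Values between the null and the counternull give a ratio above 1, the ratio peaks at the sample mean, and values beyond either end give a ratio below 1, matching the pattern described in the text.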
The likelihood interval shown in Figure 1.1 is that calculated for a support of 2 (S-2). It closely resembles the 95% confidence interval, although its interpretation is more direct: values within the interval are consistent with the collected data. For any value outside the interval, there is at least one other hypothesis value, here the sample mean, that is favoured over it by more than moderate evidence.
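The resemblance between the two intervals can be checked with a short sketch. It assumes a normal likelihood with a known standard error, for which the support for the sample mean over a value z standard errors away is z²/2, so the S-2 limits lie 2 standard errors either side of the mean. The function names and the summary values are hypothetical, not the data behind Figure 1.1.

```python
import math
from statistics import NormalDist

def support_interval(sample_mean: float, standard_error: float,
                     s: float = 2.0) -> tuple[float, float]:
    """Values whose support relative to the sample mean is no worse than -s.
    For a normal likelihood this solves z**2 / 2 = s for the limits."""
    half_width = math.sqrt(2.0 * s) * standard_error
    return sample_mean - half_width, sample_mean + half_width

def confidence_interval(sample_mean: float, standard_error: float,
                        level: float = 0.95) -> tuple[float, float]:
    """Normal-theory confidence interval (known standard error assumed)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return sample_mean - z * standard_error, sample_mean + z * standard_error

# Hypothetical data summary for illustration only.
sample_mean, standard_error = 2.28, 1.0
print("S-2 interval:", support_interval(sample_mean, standard_error))
print("95% CI:      ", confidence_interval(sample_mean, standard_error))
```

Under these assumptions the S-2 interval uses a multiplier of 2 and the 95% confidence interval a multiplier of about 1.96, which is why the two nearly coincide.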
The precise meaning of p values obtained in statistical tests is difficult for the average scientist to grasp. Even seasoned researchers misunderstand them. In contrast, the likelihood approach is conceptually simple. It uses the likelihood function, derived from the sampling distribution of the collected data, to provide comparative evidence for two specified hypotheses. The likelihood approach uses nothing other than the evidence obtained in the collected sample. For p values, the tail regions of the sampling distribution centred on the null are used. These regions include values beyond the sample statistic which were not observed. What can be the justification for including values that were not observed? Later in his career, Fisher [27], p. 71, admits that ‘This feature is indeed not very defensible save as an approximation’. ‘To what?’ replies Edwards [28]. It is interesting that Fisher then proceeds to compare likelihoods (pp. 71–73): ‘It would, however, have been better to have compared the different possible values of p, in relation to the frequencies with which the actual values observed would have been produced by them, as is done by the Mathematical Likelihood …’, concluding: ‘The likelihood supplies a natural order of preference among the possibilities under consideration’. The use of the LR is computationally simple and intuitively attractive. Tsou and Royall observe: ‘Strong theoretical arguments imply that for directly representing and interpreting statistical data as evidence, the proper vehicle is the likelihood function’, adding pointedly: ‘These arguments have had limited impact on statistical practice’ [29]. Perhaps the ritualized [30] and over-rehearsed use of p values has made them so ingrained in the scientific community that the conceptually simpler LR statistic has now become more difficult to grasp.
1.2.3 Types of Approach Using Likelihoods
A key feature of the evidential approach is the use of an LR based upon two values selected by the researcher. The LR then reveals which value is better supported by the observations. Typically, one of these values is the null hypothesis and the other a