Evidence-Based Statistics. Peter M. B. Cahusac
use of verbal labels such as ‘large’ or ‘small’ can sometimes be misleading [33]. What may be considered a large effect in one area (e.g. epidemiology) may be considered small in another (e.g. a drug treatment for hypertension). A popular standardized measure of effect size for a difference in means is d. This is actually Hedges' standardized statistic, which uses the sample standard deviation SD, rather than Cohen's, which uses the population parameter σ.4
\[ d = \frac{\bar{x}_1 - \bar{x}_2}{SD} \tag{1.1} \]
The relative effect sizes using d can be described as:
d | Description |
0.2 | Small |
0.5 | Medium |
0.8 | Large |
1.3 | Very large |
A more general measure is provided by the correlation coefficient r. However, the transform between r and d is not linear, since r is restricted to the range −1 to 1 while d varies between negative and positive infinity. For example, a medium effect r of 0.3 corresponds to a d of 0.63 (on the large side), and a large r of 0.5 corresponds to a very large d of 1.15. Using d allows us to relate more naturally to the measurements that are made.
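The nonlinearity can be checked numerically. The sketch below assumes the common conversion d = 2r/√(1 − r²), which applies to two groups of equal size. The text does not state which formula it uses, although this one reproduces the two values quoted. The function names r_to_d and d_to_r are illustrative.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a correlation r to a standardized mean difference d.

    Assumes the conversion for two equal-sized groups:
    d = 2r / sqrt(1 - r^2). Requires -1 < r < 1.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


def d_to_r(d: float) -> float:
    """Inverse conversion under the same assumption: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d * d + 4.0)


# r = 0.3 gives d of about 0.63, and r = 0.5 gives d of about 1.15.
for r in (0.3, 0.5):
    print(f"r = {r:.1f}  ->  d = {r_to_d(r):.2f}")
```

As r approaches ±1 the denominator approaches zero, so d grows without bound, which is why equal steps in r do not correspond to equal steps in d.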
Effect size is generally unaffected by sample size, unlike the p value. If the null is not true then the p value obtained will vary according to the sample size: other things being equal, the larger the sample, the smaller the p value. When considering sample size and the strength of evidence provided by p values, opposite conclusions are reached by different statisticians [4], p. 71. In Figure 1.2, the 95% confidence intervals around means are plotted for two sets of data. For each interval, the same standard deviation is used and the same p value is obtained for the mean's difference from 0. However, the sample sizes vary, so that with N = 4 there is a 2.6 difference from 0, and with N = 80 there is a 0.6 difference from 0. Hence, the size of the effect is much larger for the interval based on few observations, which might indicate that this result is of more practical importance than the result obtained with the larger data set. However, it is also argued that the data with larger N represent stronger evidence, although the effect size is much smaller and the p values are identical.
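The dependence of the p value on sample size can be illustrated with a short calculation. This is a sketch under a simplifying assumption not made in the text: a two-sided one-sample z test with known standard deviation, so that z = d√N. The function name and the chosen values of d and N are illustrative only, and the sketch does not attempt to reproduce Figure 1.2.

```python
import math


def two_sided_p_from_d(d: float, n: int) -> float:
    """Two-sided p value for a one-sample z test (SD assumed known).

    With standardized effect size d and n observations the test
    statistic is z = d * sqrt(n), and p = P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    z = abs(d) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2.0))


# Same effect size, increasing sample size: the p value shrinks.
for n in (4, 20, 80):
    print(f"d = 0.5, N = {n:2d}: p = {two_sided_p_from_d(0.5, n):.4g}")
```

The effect size d is held fixed throughout, so any change in the p value here comes from N alone.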
Figure 1.2 Effect size versus sample size: which provides most evidence against H0?
1.4 Calculations
Evidence is measured by the natural logarithm of the LR, known as the support S. The words evidence and support will be used interchangeably.
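As a minimal sketch of this definition, the functions below convert between a likelihood ratio and the support, S = ln(LR). The function names are illustrative and not taken from the text.

```python
import math


def support_from_lr(lr: float) -> float:
    """Support S is the natural logarithm of the likelihood ratio."""
    if lr <= 0.0:
        raise ValueError("a likelihood ratio must be positive")
    return math.log(lr)


def lr_from_support(s: float) -> float:
    """Inverse relation: LR = exp(S)."""
    return math.exp(s)


# An LR of 1 gives S = 0; each unit of S multiplies the LR by e.
for s in (1, 2, 3, 4):
    print(f"S = {s}  <->  LR = {lr_from_support(s):.1f}")
```

Because S is a logarithm, reversing the order of the two hypotheses inverts the LR and simply changes the sign of S.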
Deciding how many decimal places to show during calculations is tricky. Values in the text are usually given to an accuracy that allows one to check formulae and equations, which are often presented in stages. Occasionally, there will be mismatches with the final answer, which is based on the most accurate calculation possible. These can usually be checked from the raw data using Excel or R.
The support S will generally be expressed to only one decimal place. The use of S is merely a guide to the strength of evidence. It is graded rather than thresholded.
The evidential approach does not require any statistical tables. All calculations can be performed from first principles with a hand calculator, R, or an Excel spreadsheet.
1.5 Summary of the Evidential Approach
1 Choose a parameter value for the primary hypothesis H1: either a value corresponding to practical importance, a value of minimum importance, or the expected value. Otherwise use a medium effect size, e.g. d = ±0.5. Alternatively, use the MLE.
2 Choose a secondary hypothesis H2 to compare with H1. Often this is the null hypothesis H0.
3 Calculate S12, or S10 when H2 is H0, or SM when H1 is the MLE.
4 Assess the relative evidence for the two hypotheses on the graded scale from −∞ to +∞.
5 Always use likelihood intervals, typically for S-2 and S-3. Likelihood intervals are more flexible and may be more informative than examining S for particular hypotheses.
6 If possible and convenient, plot the likelihood function.
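The steps above can be sketched for one simple case. The code assumes a normal model for a single mean with known standard deviation, which is a simplification chosen here for illustration and is not stated in the text. The function names and the data summary (xbar, sd, n) are hypothetical.

```python
import math


def support_normal_mean(xbar, sd, n, mu1, mu2):
    """Support S12 for mean mu1 versus mu2 (normal model, SD known).

    log L(mu) = -n * (xbar - mu)^2 / (2 * sd^2) + constant, so
    S12 = log L(mu1) - log L(mu2) and the constant cancels.
    """
    def log_lik(mu):
        return -n * (xbar - mu) ** 2 / (2.0 * sd ** 2)

    return log_lik(mu1) - log_lik(mu2)


def likelihood_interval(xbar, sd, n, m):
    """S-m likelihood interval: all mu whose support relative to the
    MLE (the sample mean) is no lower than -m."""
    half_width = sd * math.sqrt(2.0 * m / n)
    return xbar - half_width, xbar + half_width


# Hypothetical data summary, not taken from the text.
xbar, sd, n = 0.4, 1.0, 25

# Steps 1-3: H1 is a medium effect (d = 0.5), H2 is the null (mean 0).
s10 = support_normal_mean(xbar, sd, n, mu1=0.5 * sd, mu2=0.0)
# Alternative for step 1: use the MLE as H1, giving SM.
s_m = support_normal_mean(xbar, sd, n, mu1=xbar, mu2=0.0)
print(f"S10 = {s10:.1f}, SM = {s_m:.1f}")

# Step 5: S-2 and S-3 likelihood intervals.
for m in (2, 3):
    lo, hi = likelihood_interval(xbar, sd, n, m)
    print(f"S-{m} interval: {lo:.2f} to {hi:.2f}")
```

Under this model SM can never be smaller than the support for any other H1 against the same H2, since the MLE maximizes the likelihood. Plotting log L(mu) over a grid of mu values would cover step 6.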
Figure 1.3 gives a flow diagram showing the sequence used to calculate and assess the evidence from a data sample.
Figure 1.3 A flow diagram illustrating the general procedure of calculating and assessing evidence. At the top, we start with defining hypotheses of interest. The primary hypothesis H1 is that specified by an effect size or the sample statistic (maximum likelihood estimate (MLE)). The secondary hypothesis H2 specifies another value of interest, often this is the null hypothesis. The support S is calculated from the logarithm of the LR for H1 versus H2. If the MLE is used then the maximum LR is calculated, which becomes SM on taking logs. The value of S indicates the strength of evidence for one of the hypotheses against the other. If the value is negative then this represents evidence in favour of H2. If the value is positive then this represents evidence in favour of the primary hypothesis H1. The magnitude of the negative or positive support values indicates the relative strength of the evidence, from ±1 meaning weak, ±2 moderate, ±3 strong, and ≥±4 extremely strong. An LR of 1 represents an S of 0, which is no evidence in favour of either hypothesis. The likelihood function should be calculated wherever possible and likelihood interval provided when presenting results. Thanks to Alfaisal student, Muhammad Affan Elahi, for the suggestion to use flow charts here and for Figure 2.12.
References
1 Taper ML, Lele SR, editors. The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations. Chicago: University of Chicago Press; 2004.
2 Pearson ES. ‘Student’ as statistician. Biometrika. 1939; 30 (3/4):210–50.
3 Edwards AWF. Likelihood. Baltimore: Johns Hopkins University Press; 1992.
4 Royall RM. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall; 1997.
5 Hacking I. Logic of Statistical Inference. Cambridge: Cambridge University Press; 1965.
6 Dienes Z. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Basingstoke: Palgrave Macmillan; 2008.
7 Baguley T. Serious