Interventional Cardiology. Group of authors
width one needs a trial four times the size.
Interpreting p‐values
Use of significance tests is often misleadingly oversimplified by putting too much emphasis on whether p is above or below 0.05. A p < 0.05 means the result is statistically significant at the 5% level, but this threshold is an arbitrary guideline and does not mean one has firm proof of an effect. By definition, even if two treatments are truly identical there is a 1 in 20 chance of reaching p < 0.05. Conversely, p > 0.05, reported as not statistically significant (or n.s.), does not necessarily imply that a clinically meaningful difference does not exist.
This concept is illustrated graphically in Figure 6.1. In the left‐hand panel of this figure, similar treatment effects are obtained from two different studies, one of which is significant and one of which is not. Lack of significance should not be the sole basis on which to interpret the findings, particularly as the effect size appears to be large, albeit imprecise. In contrast, in the right‐hand panel, a p‐value of 0.05 is obtained with two very different effect sizes, one large and the other much smaller. Again, focusing on the p‐value as the sole discriminator of importance in treatment effect would ignore the very large and perhaps clinically relevant gradient of effect between the treatments.
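The "1 in 20" statement can be checked with a small simulation. The sketch below is illustrative only: the event rate (30%), the group size (200 per arm), and the number of simulated trials (2000) are arbitrary assumptions, and the test used is a simple pooled two‐proportion z‐test rather than any method from a specific trial.

```python
import math
import random

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def two_proportion_p(events_a, n_a, events_b, n_b):
    """Pooled two-proportion z-test (equivalent to a 2x2 chi-squared test)."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no events, or only events, in both groups
    z = (events_a / n_a - events_b / n_b) / se
    return two_sided_p_from_z(z)

def false_positive_rate(n_trials=2000, n_per_group=200, true_rate=0.30, seed=1):
    """Simulate trials of two truly identical treatments (assumed settings)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Both arms share the same true event rate, so the null hypothesis holds.
        a = sum(rng.random() < true_rate for _ in range(n_per_group))
        b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if two_proportion_p(a, n_per_group, b, n_per_group) < 0.05:
            hits += 1
    return hits / n_trials

rate = false_positive_rate()
print(f"Share of null trials reaching p < 0.05: {rate:.3f}")
```

The printed share should lie close to 0.05, although it will vary a little with the seed and the assumed settings.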
Figure 6.1 Interpreting p‐values.
Link between p‐values and confidence intervals
For a categorical outcome, if p < 0.05 then the 95% CI for the risk ratio (or odds ratio) will exclude 1, while if p > 0.05 then the 95% CI will include 1. Similarly, for a continuous outcome, if p < 0.05 then the 95% CI for the mean difference will exclude 0, while if p > 0.05 then the 95% CI will include 0. Thus, by looking at the CI alone one can infer whether the treatment difference is significant at the 5% level.
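The link can be made concrete for a risk ratio. The following is a minimal sketch, assuming a Wald interval and a Wald test, both computed on the log scale; the event counts are hypothetical and are not taken from any trial. The correspondence is exact only when the interval and the test are built from the same statistic, as they are here.

```python
import math

def risk_ratio_summary(events_new, n_new, events_ctrl, n_ctrl):
    """Risk ratio, Wald 95% CI on the log scale, and matching two-sided p-value."""
    rr = (events_new / n_new) / (events_ctrl / n_ctrl)
    # Standard error of log(risk ratio) for two independent binomial samples.
    se_log = math.sqrt(1 / events_new - 1 / n_new + 1 / events_ctrl - 1 / n_ctrl)
    lower = math.exp(math.log(rr) - 1.96 * se_log)
    upper = math.exp(math.log(rr) + 1.96 * se_log)
    z = math.log(rr) / se_log
    p = math.erfc(abs(z) / math.sqrt(2))
    return rr, lower, upper, p

# Hypothetical trials: (events, patients) on new treatment versus control.
examples = {
    "larger effect": (30, 200, 50, 200),
    "smaller effect": (40, 200, 50, 200),
}

for label, counts in examples.items():
    rr, lower, upper, p = risk_ratio_summary(*counts)
    excludes_one = upper < 1 or lower > 1
    print(f"{label}: RR {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}, "
          f"p = {p:.3f}, CI excludes 1: {excludes_one}")
```

In the first hypothetical trial the interval lies entirely below 1 and p < 0.05. In the second the interval spans 1 and p > 0.05.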
Time to event data
Many major trials study time to a primary event outcome. For instance, the Evaluation of XIENCE versus Coronary Artery Bypass Surgery for Effectiveness of Left Main Revascularization (EXCEL) trial studied a composite ischemic endpoint: death from any cause, myocardial infarction, or stroke over three‐year follow‐up [3].
A Kaplan–Meier life‐table plot is the main method of displaying such data by treatment group (Figure 6.2). It displays the cumulative percentage of patients experiencing the event over time for each group. This method takes account of patients having different lengths of follow‐up (e.g. any lost to follow‐up before the intended three years).
Figure 6.2 Kaplan–Meier life‐table plot showing pattern of treatment difference over time (e.g. EXCEL 3‐year follow‐up).
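How the Kaplan–Meier method allows for unequal follow‐up can be seen in a small worked sketch. The ten patients below are invented for illustration (time in months, with 1 marking an event and 0 marking a patient censored at that time); the function name is likewise an assumption of this example.

```python
def kaplan_meier_cumulative_incidence(times, events):
    """Return [(time, cumulative incidence)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the patient had the event at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the proportion of at-risk patients who stay event-free.
            survival *= 1 - n_events / at_risk
            curve.append((t, 1 - survival))
        # Censored patients leave the risk set without counting as events.
        at_risk -= n_leaving
    return curve

# Hypothetical data: four events, two early losses to follow-up,
# and four patients event-free at 36 months.
times = [6, 12, 12, 18, 24, 30, 36, 36, 36, 36]
events = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

curve = kaplan_meier_cumulative_incidence(times, events)
for t, incidence in curve:
    print(f"month {t}: cumulative incidence {100 * incidence:.1f}%")
```

In this toy data set the naive proportion is 4/10 = 40%, whereas the Kaplan–Meier estimate at 30 months is about 45%, because patients lost to follow‐up are removed from the denominator from the time they leave.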
Such a plot is a useful descriptive tool, but one needs to use a log‐rank test to see if there is evidence of a treatment difference in the incidence of events. For instance, the PCI (n = 948) and coronary artery bypass grafting (CABG) alone (n = 957) groups experienced the composite ischemic endpoint in 14.5% and 14.1% of patients, respectively. The log‐rank test uses all the data displayed for each group to obtain p = 0.98 (i.e. the data are consistent with the null hypothesis of no treatment difference). The log‐rank test can be thought of as an extension, indeed improvement, to the simpler chi‐squared test comparing two percentages because it takes into account the fact that patients have been followed for, or deaths occur at, differing times from randomization.
With time to event data, the hazard ratio is used to estimate any relative treatment difference in risk. It is similar to, but more complicated to calculate than, the simple relative risk already mentioned. It effectively averages the instantaneous relative risk occurring at different follow‐up times, using what is commonly called a Cox proportional hazards model. In this case the hazard ratio comparing PCI with CABG is 1.00 with 95% CI 0.79–1.26. Thus, there is no increase in hazard. More generally, even when the hazard ratio differs from 1, a 95% CI that includes 1 means the difference in hazard between the two groups is not statistically significant. For instance, the hazard ratio for death from any cause at three‐year follow‐up in the same trial is 1.34 with 95% CI 0.94–1.91 [3]. Although there is an observed 34% increase in hazard, the 95% CI includes 1, reflecting lack of statistical significance.
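The mechanics of the log‐rank test can be sketched in a few lines. This is a simplified two‐group version written for illustration, not the analysis code of any trial: at each event time it compares the events observed in one group with the number expected if both groups shared the same risk. The patient data are invented.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-squared on 1 df, two-sided p-value)."""
    all_pairs = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in all_pairs if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still under follow-up just before time t.
        n_a = sum(x >= t for x in times_a)
        n_b = sum(x >= t for x in times_b)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        # Expected events in group A if risk is the same in both groups.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical follow-up times in months; 1 = event, 0 = censored.
early = ([2, 4, 6, 8, 10, 12], [1, 1, 1, 1, 1, 1])
late = ([20, 22, 24, 26, 28, 30], [1, 1, 1, 1, 1, 1])

chi2_diff, p_diff = logrank_test(*early, *late)
chi2_same, p_same = logrank_test(*early, *early)
print(f"different timing: chi-squared {chi2_diff:.2f}, p = {p_diff:.4f}")
print(f"identical groups: chi-squared {chi2_same:.2f}, p = {p_same:.2f}")
```

Note that both invented groups end with the same percentage of patients having an event (100%), so a simple comparison of two percentages would see no difference, while the log‐rank test detects that events occur much earlier in one group.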
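An approximate p‐value can be recovered from a published hazard ratio and its 95% CI. The sketch below assumes the interval is symmetric on the log scale, which is the usual construction for a Cox model; because the published figures are rounded, the results are approximations and need not match the reported log‐rank p‐values exactly.

```python
import math

def p_value_from_hazard_ratio(hr, lower, upper):
    """Approximate two-sided p-value from a hazard ratio and its 95% CI.

    Assumes the CI was built as exp(log(HR) +/- 1.96 * SE).
    """
    se_log = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    z = math.log(hr) / se_log
    return math.erfc(abs(z) / math.sqrt(2))

# Values quoted in the text for the EXCEL trial.
p_primary = p_value_from_hazard_ratio(1.00, 0.79, 1.26)
p_death = p_value_from_hazard_ratio(1.34, 0.94, 1.91)

print(f"primary composite endpoint: approximate p = {p_primary:.2f}")
print(f"death from any cause:       approximate p = {p_death:.2f}")
```

Both approximate p‐values exceed 0.05, in line with both intervals including 1.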
Quantitative data
For a quantitative measure of patient outcome, it is common to compare the mean outcomes in each treatment group. For example, in the Catheter‐based renal denervation in patients with uncontrolled hypertension in the absence of antihypertensive medications (SPYRAL HTN‐OFF MED) study [4], 80 patients with uncontrolled hypertension were randomized in a blinded fashion to either renal denervation or sham control with a primary efficacy endpoint of change in 24‐hour blood pressure at three months. The mean change of 24‐hour systolic blood pressure from baseline in the renal denervation and sham groups was –9.0 ± 11.0 mmHg and –1.6 ± 10.7 mmHg, respectively. The mean difference between groups was –7.0 mmHg (95% CI –12.0 to –2.1; p = 0.006).
The standard deviation (SD) summarizes the extent of individual patient variation around each mean. If the data are normally distributed, then approximately 95% of individuals will have a value within two standard deviations either side of the mean. This is sometimes called the reference range. However, for a clinical trial outcome measure it is more useful to calculate the standard error of the mean (SEM), which is SD/√N. That is, precision in the estimated mean increases proportionately with the square root of the number of patients. The 95% confidence interval for the mean is mean ± 1.96 × SEM.
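The distinction between SD and SEM can be illustrated numerically. The sketch below uses the mean change (–9.0 mmHg) and SD (11.0 mmHg) quoted above for the renal denervation group, with SEM taken as SD divided by the square root of N. The group size of 40 is an assumption made purely for illustration (an equal split of the 80 randomized patients); the actual group sizes are not given in the text.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

def reference_range(mean, sd):
    """Approximate range containing 95% of individual values (normal data)."""
    return mean - 2 * sd, mean + 2 * sd

def confidence_interval_95(mean, sd, n):
    """95% confidence interval for the mean itself."""
    half_width = 1.96 * standard_error(sd, n)
    return mean - half_width, mean + half_width

mean_change, sd = -9.0, 11.0
n_assumed = 40  # assumption for illustration only, not taken from the trial report

ref_low, ref_high = reference_range(mean_change, sd)
ci_low, ci_high = confidence_interval_95(mean_change, sd, n_assumed)
ci4_low, ci4_high = confidence_interval_95(mean_change, sd, 4 * n_assumed)

print(f"reference range for individuals: {ref_low:.1f} to {ref_high:.1f} mmHg")
print(f"95% CI for the mean, N = {n_assumed}: {ci_low:.1f} to {ci_high:.1f} mmHg")
print(f"95% CI for the mean, N = {4 * n_assumed}: {ci4_low:.1f} to {ci4_high:.1f} mmHg")
```

The reference range describes the spread of individual patients and does not narrow as N grows, whereas the confidence interval for the mean halves in width when the number of patients is quadrupled.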
Trial design: the fundamentals
When planning a clinical trial much energy is devoted to defining exactly what is the new treatment, who are the eligible patients, and what are the primary and secondary outcomes. Then the following statistical design issues need to be considered.
Control group
One essential is that the trial is comparative (i.e. one needs a control group of patients receiving a standard treatment who will be compared with patients receiving the new treatment). Such standard treatment can either be an established active treatment or no treatment (possibly a placebo). Of course, all patients in both groups must have good medical care in all other respects.
Randomization
One needs a fair (unbiased) comparison between new treatment and control, and randomization is the key requirement in this regard. That is, each patient has an equal chance of being randomly assigned to new or standard treatment. Furthermore, an adequate method of handling random assignments is one in which no one is able to predict in advance what the next patient will be assigned to. Hence, allocation based on days of the week, or years of birth, should definitely be avoided. Adequate randomization thus ensures there is no selection bias in deciding which patients get new or standard treatment. Such selection bias is a serious problem in any observational (non‐randomized) studies comparing treatments, making them notoriously unreliable in their conclusions.
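The requirement of unpredictability can be illustrated with a minimal sketch of simple (unrestricted) randomization. It assumes equal allocation to two arms and uses Python's `secrets` module as an unpredictable random source; the arm labels and function name are hypothetical. Real trials typically use a concealed, centrally managed allocation system, which this sketch does not attempt to reproduce.

```python
import secrets

ARMS = ("new treatment", "standard treatment")

def assign_next_patient():
    """Assign one patient with equal probability to either arm.

    The draw does not depend on the patient, the date, or earlier
    assignments, so the next allocation cannot be predicted.
    """
    return secrets.choice(ARMS)

# Hypothetical sequence of 20 consecutively enrolled patients.
allocations = [assign_next_patient() for _ in range(20)]
for patient_number, arm in enumerate(allocations, start=1):
    print(f"patient {patient_number:2d}: {arm}")
```

By contrast, a rule such as "patients enrolled on even days receive the new treatment" is known to the recruiting clinician in advance, which opens the door to selection bias.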
As a consequence, randomization minimizes the possibility that treatment groups will significantly differ in baseline characteristics. However, the possibility for chance variation can never be completely eliminated,