Real World Health Care Data Analysis. Uwe Siebert
(see Chapters 1 and 2), are the data and planned analyses able to produce reliable and valid estimates?
Girman et al. (2013) summarized multiple pre-analysis issues that should be addressed before undertaking any comparative analysis of observational data. One focus of that work was to evaluate the potential for unmeasured confounding relative to the expected effect size (we will address this in Chapter 13). The Duke-Margolis Real-World Evidence Collaborative on the potential use of RWE for regulatory purposes (Berger et al. 2017) comments that “if the bias is too great or confounding cannot be adequately adjusted for then a randomized design may be best suited to generate evidence fit-for regulatory review.” To address this basic concern with confounding, we focus our feasibility analysis in this chapter on two key analytic issues: confirming that inference for the target population is feasible with the current data (common support, positivity assumption, clinical equipoise, and so on) and assessing the ability to address confounders (measured and unmeasured). Both issues relate to core assumptions required for the validity of causal inference based on propensity score analyses.
For instance, while researchers often want to perform analyses that are broadly generalizable, such as an analysis of the full population of patients in the database, a lack of overlap in the covariate distributions of the different treatment groups might simply not allow for quality causal inference over the full sample. If there is no common support (no overlap in the covariate space between the treatment groups), a key assumption necessary for unbiased comparative observational analyses is violated. Feasibility analysis can guide researchers toward comparisons and target populations that the data in hand can support.
Second, valid analyses require that the data are sufficient to allow for statistical adjustment for bias due to confounding. The primary goal of a propensity score-based analysis is to reduce the bias inherent in comparative observational data analysis that is due to measured confounders. The statistical adjustment must balance the two treatment groups with regard to all key covariates that may be related to both the outcome and treatment selection, such as age, gender, and disease severity measures. The success of the propensity score is judged by the balance in the covariate distributions that it produces between the two treatment groups (D’Agostino 2007). For this reason, assessing the balance produced by the propensity score has become a standard and critical piece of any best practice analysis.
Note that the feasibility and balance assessments are conducted as part of the design stage of the analysis. That is, such assessments use only the baseline data and thus are conducted “outcome free.” If the design phase is completed and documented prior to accessing the outcome data, then consumers of the analysis can be assured that no manipulation of the models was undertaken to produce a better result. Of course, this assessment may be an iterative process to find a target population of inference with sufficient overlap and a propensity model that produces good balance in measured confounders. Because this feasibility assessment does not depend on outcome data, the statistical analysis plan can be finalized and documented after learning from the baseline data but prior to accessing the outcome data.
5.2 Best Practices for Assessing Feasibility: Common Support
Through the process of deriving the study objectives and the estimand, researchers will have determined a target population of inference. By this we mean the population of patients that the results of the analysis should generalize to. However, for valid causal analysis there must be sufficient overlap in baseline patient characteristics between the treatment groups. This overlap is referred to as the “common support.” There is no guarantee that the common support observed in the data is similar to the target population of inference desired by the researchers. The goal of this section is to demonstrate approaches to help assess whether there is sufficient overlap in the patient populations in each treatment group allowing for valid inference to a target population of interest.
Multiple quantitative approaches have been proposed to assess the similarity of baseline characteristics between the patients in one treatment group versus another. Imbens and Rubin (2015) state that differences in the covariate distributions between treatment groups will manifest in some difference of the corresponding propensity score distributions. Thus, comparisons of the propensity score distributions can provide a simple summary of the similarities of patient characteristics between treatments, and such comparisons have become a common part of feasibility assessments.
As a tool for feasibility assessment, we therefore propose a graphical display comparing the overlap in the two propensity score distributions, supplemented with the following statistics, which provide quantitative guidance on the selection of methods and the population of inference:
● Walker’s preference score (clinical equipoise)
● standardized differences of means
● variance ratios
● Tipton’s index
● proportion of near matches
Specific guidance for interpreting each summary statistic is provided in the sections that follow. In addition, guidance on trimming non-overlapping regions of the propensity distributions to obtain a common support is discussed.
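As a minimal sketch of what trimming to a common support can look like, the Python snippet below keeps only patients whose propensity score lies within the range observed in both treatment groups. The min/max rule shown here is one simple convention and is an assumption of this sketch, not necessarily the trimming rule recommended later in the chapter. The function names and the propensity scores are hypothetical.

```python
def common_support_bounds(ps_treated, ps_control):
    """Return (low, high): the propensity score range observed in BOTH groups."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    return low, high


def trim_to_common_support(ps_treated, ps_control):
    """Drop patients whose propensity score falls outside the shared range."""
    low, high = common_support_bounds(ps_treated, ps_control)
    if low > high:
        # The two distributions do not overlap at all.
        return [], []
    kept_treated = [p for p in ps_treated if low <= p <= high]
    kept_control = [p for p in ps_control if low <= p <= high]
    return kept_treated, kept_control


# Hypothetical estimated propensity scores (probability of receiving treatment).
ps_treated = [0.35, 0.50, 0.62, 0.80, 0.95]
ps_control = [0.05, 0.20, 0.40, 0.55, 0.70]

bounds = common_support_bounds(ps_treated, ps_control)
kept_treated, kept_control = trim_to_common_support(ps_treated, ps_control)
print("common support:", bounds)
print("treated kept:", kept_treated)
print("control kept:", kept_control)
```

Note that trimming changes the population to which the results apply, so the trimmed sample should be compared against the intended target population of inference.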
5.2.1 Walker’s Preference Score and Clinical Equipoise
Walker et al. (2013) discuss the concept of clinical equipoise as a necessary condition for quality comparative analyses. They define equipoise as “a balance of opinion in the treating community about what really might be the best treatment for a given class of patients.” When there is equipoise, there is better balance between the treatments on measured covariates, less reliance on statistical adjustment, and perhaps more importantly, potentially less likelihood of strong unmeasured confounding. Empirical equipoise is observed similarity in types of patients on each treatment in the baseline patient population. Walker et al. argue that “Empirical equipoise is the condition in which comparative observational studies can be pursued with a diminished concern for confounding by indication …” To quantify empirical equipoise, they proposed the preference score, F, a transformation of the propensity score to standardize for the market share of each treatment,
\[
\ln\left(\frac{F}{1-F}\right) = \ln\left(\frac{PS}{1-PS}\right) - \ln\left(\frac{P}{1-P}\right)
\]
where F and PS are the preference and propensity scores for Treatment A and P is the proportion of patients receiving Treatment A. A patient with a preference score of 0.5 has a probability of receiving Treatment A equal to Treatment A’s market share (that is, PS = P). As a rule of thumb, it is acceptable to pursue a causal analysis if at least half of the patients in each treatment group have a preference score between 0.3 and 0.7 (Walker et al. 2013).
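The preference score and the rule of thumb can be sketched in a few lines of Python. This is an illustrative sketch that assumes the logit-scale definition of Walker et al. (2013), ln(F/(1-F)) = ln(PS/(1-PS)) - ln(P/(1-P)), and that propensity scores have already been estimated. The function names and the data are hypothetical.

```python
import math


def preference_score(ps, p):
    """Preference score F for Treatment A.

    ps: propensity score for Treatment A (strictly between 0 and 1)
    p:  proportion of all patients receiving Treatment A (market share)
    """
    logit_f = math.log(ps / (1 - ps)) - math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit_f))


def proportion_in_equipoise(ps_values, p, low=0.3, high=0.7):
    """Share of patients whose preference score lies in [low, high]."""
    scores = [preference_score(ps, p) for ps in ps_values]
    return sum(low <= f <= high for f in scores) / len(scores)


# Hypothetical propensity scores for patients on Treatments A and B.
ps_a = [0.55, 0.60, 0.68, 0.85]
ps_b = [0.25, 0.45, 0.50, 0.65]
share_a = len(ps_a) / (len(ps_a) + len(ps_b))  # market share of Treatment A

prop_a = proportion_in_equipoise(ps_a, share_a)
prop_b = proportion_in_equipoise(ps_b, share_a)

# Rule of thumb: at least half of EACH group scores between 0.3 and 0.7.
feasible = prop_a >= 0.5 and prop_b >= 0.5
print(prop_a, prop_b, feasible)
```

The standardization matters when market shares are unequal: a patient with a propensity score of 0.8 for a treatment that holds 80% of the market has a preference score of 0.5, not 0.8.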
5.2.2 Standardized Differences in Means and Variance Ratios
Imbens and Rubin (2015) show that it is theoretically sufficient to assess imbalance in propensity score distributions, because differences in the expectation, dispersion, or shape of the covariate distributions will be represented in the propensity score. Thus, comparing the distributions of the propensity scores for each treatment group has been proposed to help assess the overall feasibility and balance questions. In practice, the standardized difference in mean propensity scores along with the ratio of propensity score variances have been proposed as summary measures to quantify the difference in the distributions (Austin 2009, Stuart et al. 2010). The standardized difference in means (sdm) is defined by Austin (2009) as the absolute difference in the mean propensity score for each treatment divided by a pooled estimate of the standard deviation of the propensity scores:
\[
\mathrm{sdm} = \frac{\left|\overline{PS}_{T} - \overline{PS}_{C}\right|}{\sqrt{\left(s_{T}^{2} + s_{C}^{2}\right)/2}}
\]
where \(\overline{PS}_{T}\) and \(\overline{PS}_{C}\) are the mean propensity scores and \(s_{T}^{2}\) and \(s_{C}^{2}\) are the sample variances of the propensity scores in the treated and control groups, respectively.
Austin suggests that standardized differences > 0.1 indicate meaningful imbalance, while Stuart proposes a less stringent threshold of 0.25. As two very different distributions can still produce a standardized difference in means of zero (Tipton 2014), it is advisable to supplement the sdm with the variance ratio. The variance ratio statistic is simply the variance of the propensity scores for the treated group divided by the variance of the propensity scores for the control group. Acceptable ranges for the ratio of variances of 0.5 to 2.0 have been cited (Austin 2009).
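Both summary statistics are straightforward to compute once propensity scores are available. The Python sketch below follows Austin's (2009) form of the standardized difference, with the pooled standard deviation (the square root of the average of the two group variances) in the denominator. It assumes sample variances with an n - 1 divisor. The function names and data are hypothetical.

```python
import math
import statistics


def standardized_difference(ps_treated, ps_control):
    """Absolute difference in mean propensity score over the pooled SD."""
    mean_diff = abs(statistics.mean(ps_treated) - statistics.mean(ps_control))
    pooled_sd = math.sqrt(
        (statistics.variance(ps_treated) + statistics.variance(ps_control)) / 2
    )
    return mean_diff / pooled_sd


def variance_ratio(ps_treated, ps_control):
    """Variance of treated propensity scores divided by that of controls."""
    return statistics.variance(ps_treated) / statistics.variance(ps_control)


# Hypothetical estimated propensity scores.
ps_treated = [0.4, 0.5, 0.6, 0.7]
ps_control = [0.2, 0.3, 0.4, 0.5]

sdm = standardized_difference(ps_treated, ps_control)
vr = variance_ratio(ps_treated, ps_control)

# Thresholds cited in the text: sdm above 0.1 (Austin) or 0.25 (Stuart),
# and a variance ratio outside 0.5 to 2.0, suggest imbalance.
print(f"sdm = {sdm:.3f}, variance ratio = {vr:.3f}")
print("sdm exceeds 0.1:", sdm > 0.1)
print("variance ratio within 0.5 to 2.0:", 0.5 <= vr <= 2.0)
```

In this toy example the two groups have the same spread, so the variance ratio is close to 1, yet the means are far apart and the sdm is large. This illustrates why the two statistics are reported together.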