Real World Health Care Data Analysis. Uwe Siebert
data is ignored and one must assume the generalizability of using only a select subset of patients for the analysis. This method could result in biased estimates when the data are not missing completely at random (MCAR). Even when the data are MCAR, the complete case analysis results in reduced power.
The second way to handle missing data is to treat the missing value of each categorical variable as an additional outcome category and impute the missing value of each continuous variable with the marginal mean while adding a dummy variable to indicate it is an imputed value. However, this approach ignores the correlations among original covariate values and thus is not an efficient approach.
The third method also imputes the missing covariates values, but not by simply creating a new “missing category” or using marginal means. The method is called multiple imputation (MI), which Rubin (1978) first proposed. The key step in the MI method is to randomly impute any missing values multiple times with sampling from the posterior predictive distribution of the missing values given the observed values of the same covariate, thereby creating a series of “complete” data sets. One advantage of this method is that each “complete” data set in the imputation can be analyzed separately to estimate the treatment effect, and the pooled (averaged) treatment effect estimate can be considered as the estimate of the causal treatment effect. Another approach is to use the averaged propensity score estimates from each “complete” data set as the propensity score estimates of the subjects in the analysis. There is no consensus on which of these two approaches is more effective, as evidenced in the simulations of Hill (2004). However, a recent study (Mitra and Reiter, 2011) found the second approach would result in less biased treatment effect estimates than the first one. Therefore, we will incorporate the averaged propensity scores approach when implementing the MI method. In addition, MI procedures allow us to include variables that are not included in the estimation of propensity score and therefore might contain useful information about missing values in important covariates.
Another method is to fit separate regressions in estimation of the propensity score for each distinct missingness pattern (MP) (D’Agostino 2001). For illustrative purposes, assume there are only two confounding covariates, denoted by and. Use a binary indicator “Y/N” if the corresponding covariate value is missing/non-missing for a subject, then the possible missing patterns are shown in Table 4.1.
Table 4.1: Possible Missing Patterns
Missing Pattern | X1 | X2 |
1 | N | N |
2 | Y | N |
3 | N | Y |
4 | Y | Y |
According to Table 4.1, if there are two covariates, for one subject, there are 4 possible missing patterns: (1) both covariate values were missing; (2 and 3) either one covariate value was missing; (4) neither of covariate values were missing. Notice these are “possible” missing patterns, which means the patterns may or may not exist in a real data analysis. To generalize, if there are n confounding covariates, then the number of possible missing patterns is 2^n.
The MP approach includes all nonmissing values for those subjects with the same missing pattern. However, as the subjects of each missing pattern is only a subgroup of the original population, the variability of estimated propensity scores increases because the number of subjects included in each propensity score model is smaller. In practice, to reduce the variability induced by the small numbers in some missing patterns, we suggest pooling the missing patterns with less than 100 subjects iteratively until the pooled missing pattern has at least 100 observations.
For reference, a much more complicated and computationally intensive approach is to jointly model the propensity score and the missingness and then use the EM/ECM algorithm (Ibrahim, et al., 1999) or Gibbs sampling (D’Agostino et al., 2000) to estimate parameters and propensity scores. Due to its complexity, we will not implement this approach in SAS.
Qu and Lipkovich (2009) combined the MI and MP methods and developed a new method called multiple imputation missingness pattern (MIMP) to estimate the propensity scores. In this approach, missing data are imputed using a multiple imputation procedure. Then, the propensity scores are estimated from a logistic regression model including the covariates (with missing values imputed) and a factor (a set of indicator variables) indicating the missingness pattern for each observation. A simulation study showed that MIMP performs as well as MI and better than MP when the missingness mechanism is either completely at random or missing at random, and it performs better than MI when data are missing not at random (Qu and Lipkovich, 2009).
In Programs 4.1 through 4.4, we provide SAS code for the MI, MP and MIMP imputation methods. These programs are similar to the code in Chapter 5 of Faries et al. (2010) but use a new SAS procedure, PROC PSMATCH, for the propensity score estimation. The code is based on the simulated REFLECTIONS data. Note that in the REFLECTIONS data, among all confounders identified by the DAG, only duration of disease (DxDur) has missing values.
Programs 4.1a and 4.1b use the MI procedure in SAS to implement multiple imputation. 100 imputed data sets are generated and PROC PSMATCH then estimates the propensity score for each imputed data set. The macro variable VARLIST contains the list of variables to be included in the later propensity score estimations. The BPIPain_LOCF variable is included in Programs 4.1a and 4.1b as an example of a variable that can be in the multiple imputation model but not the propensity model.
Program 4.1a: Multiple Imputation (MI)
**********************************************************************;
* NIMPUTE: number of imputed data, suggested minimum is 100;
* SEED: random seed in multiple imputation
**********************************************************************;
%let VARLIST=
Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B
PhysicalSymp_B SDS_B DXdur;
PROC MI DATA = REFL2 ROUND=.001 NIMPUTE=100 SEED=123456 OUT=DAT_MI NOPRINT;
VAR &VARLIST BPIPain_LOCF;
RUN;
PROC SORT DATA=DAT_MI;
BY _IMPUTATION_;
RUN;
PROC PSMATCH DATA = DAT_MI REGION=ALLOBS;
CLASS COHORT Gender Race Dr_Rheum Dr_PrimCare;
PSMODEL COHORT(TREATED=’OPIOID’) = &VARLIST Gender Race Dr_Rheum Dr_PrimCare;
OUTPUT OUT = DAT_PS PS = _PS_;
BY _IMPUTATION_;
RUN;
In our case, the covariate with missing values is a continuous variable. Therefore, we used the code in Program 4.1a, where one critical assumption is made: the variables in PROC MI are jointly and individually normally distributed. If there exist categorical covariates with missing values, an alternative approach is to use full conditional model in PROC MI. An attractive feature of this method is that it does not require a multivariate normal assumption. We provide the SAS code in Program 4.1b to implement this approach, assuming that the variable Gender has missing values.
Program 4.1b: Optional PROC MI Alternative for Categorical Covariates
PROC MI DATA = REFL2 ROUND=.001 NIMPUTE=100 SEED=123456 OUT=DAT_MI_FCS NOPRINT;
Class Gender;
VAR &VARLIST Gender BPIPain_LOCF;
fcs discrim(Gender /classeffects=include)
nbiter=100;
RUN;
Programs 4.2 and 4.3 provides code to implement missing pattern approach for estimating propensity scores in SAS. Program 4.2 assigns missing value patterns to the analysis data and pool missing patterns that contain less than 100 subjects. After missing patterns are assigned, program 4.3 uses PROC PSMATCH to estimate the propensity score.
Program 4.2: Assign Missing Patterns and Pool Missing Patterns with Small Number of Observations
******************************************************************************