In short, it's a problem most of the time.
Missing data is one of those odd, ubiquitous problems in data analysis. It occurs often, but rarely do researchers actually speak about it. It's often handled in an automatic way without thinking, or even without knowledge of the potential consequences to the final analysis.
For this post, I'd like to highlight some of the problems behind missing data. To start, we have to consider the specific pattern of missingness - that is, what is the underlying process generating the missing values? Little and Rubin (2014) and Allison (2002) categorize missing data as coming from three possible processes: Missing Completely at Random, Missing at Random, and Missing Not at Random. Despite these opaque names, the three processes are quite distinct and present very different problems to the researcher. Let's briefly talk about each one first:
Missing Completely at Random. When data are missing completely at random (MCAR), there is no assumed underlying pattern to the missing values. Every value is assumed to have the same probability of being missing, regardless of any observed or unobserved data. It is as if someone took a random sample of the data.
- The good news: Under MCAR there is very little bias associated with the missing values. In fact, if the proportion of missing values is quite low you can continue your analysis under casewise deletion (which is the default for most statistical analysis software). Essentially, under MCAR, you are working on a random sample of your full dataset. The biggest drawback here is a loss of statistical power - so from a frequentist standpoint you are increasing your risk of a type II error, and your estimates will be less precise.
- The bad news: Situations where MCAR holds are very, very, (very) rare. The assumption that missing values in your data occurred only by chance is quite strong, and difficult to defend in most realistic cases.
Missing at Random. Although it sounds very similar to MCAR, missing at random (MAR) is quite different. MAR occurs when the pattern of missingness is related to other observed variables within the data. For instance, females might be less likely to fill in questions about their self-reported offending behavior. In this case, the probability that the offending behavior variable is missing is conditional on the respondent's sex. Both MAR and MCAR have
- The good news: The assumption that observations are MAR allows the researcher to use a number of approaches to deal with missing values. These might include multiple imputation, full-information maximum likelihood, and model-based adjustments (especially in the context of Bayesian analysis).
- The bad news: There does not exist any statistical test to determine whether data are actually MAR. The researcher should provide some substantive reason that indicates
Missing Not at Random. The most problematic of the three patterns. Under MNAR, the probability of a missing value is dependent on levels of the value itself. To extend the previous example about self-reported offending, individuals with a high level of offending might be less likely to fill in a response. In contrast to MCAR and MAR, where the pattern is not assumed to come from the missing data itself (the "ignorability" assumption), MNAR is considered "non-ignorable". This means extra steps are needed to account for the missing data.
- The good & bad news: There isn't really a lot of good news here. MNAR represents a fairly difficult problem to handle because it violates the assumptions used for MAR. Worse yet, there does not exist a single statistical test to definitively determine whether data are MNAR or MAR. There are more complicated ways of handling data under MNAR, such as a selection bias model (Heckman, 1977) or a pattern mixture model (Little, 1995).
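To make the three mechanisms concrete, here is a minimal base R sketch using made-up data. The variable names (female, offending), the missingness probabilities, and the effect sizes are all assumptions chosen for illustration. They loosely mirror the self-reported offending example above and are not taken from any real study.

```r
# Toy simulation of the three missingness mechanisms (all values are made up).
set.seed(42)
n <- 10000
female    <- rbinom(n, 1, 0.5)
offending <- rnorm(n, mean = 5 - female, sd = 2)  # hypothetical offending score

# Probability that 'offending' is missing under each mechanism
p_mcar <- rep(0.3, n)                    # MCAR: the same for everyone
p_mar  <- ifelse(female == 1, 0.5, 0.1)  # MAR: depends only on observed sex
p_mnar <- plogis(offending - 5.5)        # MNAR: depends on the value itself

# Helper: set each value to NA with probability p
make_missing <- function(y, p) ifelse(runif(length(y)) < p, NA, y)

y_mcar <- make_missing(offending, p_mcar)
y_mar  <- make_missing(offending, p_mar)
y_mnar <- make_missing(offending, p_mnar)

# Complete-case means versus the mean of the full data
means <- c(
  full = mean(offending),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar,  na.rm = TRUE),
  mnar = mean(y_mnar, na.rm = TRUE)
)

# Under MAR, missingness is random within each sex, so averaging the
# within-sex means (weighted by the share of each sex) recovers the full mean.
sex_means    <- tapply(y_mar, female, mean, na.rm = TRUE)
sex_shares   <- prop.table(table(female))
mar_adjusted <- sum(as.numeric(sex_means) * as.numeric(sex_shares))

print(round(c(means, mar_adjusted = mar_adjusted), 2))
```

In this setup the MCAR mean should land close to the full-data mean, while the MAR mean drifts upward and the MNAR mean drifts downward. The MAR bias disappears once we condition on sex, which is the intuition behind calling MAR "ignorable". No such fix is available under MNAR using the observed data alone.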
So far this discussion has focused on the assumptions behind each of the patterns of missing data. However, it also helps to visualize what these patterns actually look like in a model. Simulations can be extremely useful here - especially because we know what the missing data pattern actually is before we see it. I'll be using the ampute function in mice to generate missing data patterns in a dataset with complete information.
I'll be using data from the 1992 Evaluation of the Focused Offender Disposition Program in Birmingham, Phoenix, and Chicago. This study, among other things, monitored probationers with a history of recent drug use. Part of the study looked at whether ties to conformity and the seriousness of their drug problem predicted probation revocation. Below is a subset of the variables I'll use for this example:
- Gender - Male or Female. Male = 1
- Race - White or Nonwhite. White = 1
- Arrests - Number of prior arrests.
- Conformity Score - Sum score representing a subject's stake in conformity (e.g., school, work, family). Ranges from 0 - 14.
- Drug Severity - Scale measuring the subject's drug use severity. Ranges from 0 - 6.
- Urine - Number of urinalysis tests coming up positive for a controlled substance. Ranges from 0 to "4 or more".
Fitting the Model
For our purposes we're going to generate a model predicting the number of positive urinalysis tests, controlling for gender, race, and the number of prior arrests. In addition, we'll see whether the "conformity score" or the "drug severity score" predict higher numbers of positive urine tests.
Because the response variable, Urine, represents discrete counts of an event (a positive test), a Poisson distribution naturally fits here. However, the data are right censored because we only observe counts up to 4 or more. Luckily, since we're using the highly flexible 'brms' package in R, we can specify a censored Poisson regression. Our model looks like this:
urine | cens(censored) ~ gender + race + arrests + conformity + drug_severity
where the cens() term after the '|' tells brms which observations of the response variable should be considered right-censored. Below are the model results.
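For readers who want to try a model of this shape, here is a minimal sketch of a censored Poisson regression in brms. The data are simulated stand-ins, not the study data. The column names, the coefficients used to generate them, and the sampler settings (chains, iter) are all assumptions made for illustration.

```r
library(brms)

# Simulated stand-in for the study data; every value here is made up.
set.seed(1)
n <- 371
dat <- data.frame(
  gender        = rbinom(n, 1, 0.8),
  race          = rbinom(n, 1, 0.4),
  arrests       = rpois(n, 3),
  conformity    = sample(0:14, n, replace = TRUE),
  drug_severity = sample(0:6, n, replace = TRUE)
)
true_count <- rpois(n, exp(-0.5 + 0.17 * dat$drug_severity))

# Counts are only recorded up to "4 or more": cap them at 4 and flag those rows.
dat$urine    <- pmin(true_count, 4)
dat$censored <- ifelse(true_count >= 4, "right", "none")

# cens() takes "none" / "right" (among other codes) for each observation.
fit <- brm(
  urine | cens(censored) ~ gender + race + arrests + conformity + drug_severity,
  data   = dat,
  family = poisson(),
  chains = 2, iter = 1000, seed = 1  # kept small so the sketch runs quickly
)
summary(fit)
```

One caveat: for count outcomes it is worth checking the brms documentation on how the censoring bound is treated, that is, whether a right-censored 4 is read as "4 or more" or as "more than 4".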
Model 1. Complete Model (N = 371)
The results from the complete model suggest that higher values of the drug severity score increase the expected number of positive urine tests by about 18% per point - such that a person with the highest score would be expected to have about 2.7 times as many positive urine tests as someone with the lowest score. Interestingly, the stake in conformity score does not appear to predict positive urine tests.
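The 2.7 figure follows from compounding the per-point effect across the six-point range of the drug severity scale. A quick check, taking the rounded 18% figure at face value:

```r
per_point <- 1.18    # roughly an 18% increase per one-point increase
points    <- 6 - 0   # from the lowest (0) to the highest (6) severity score

ratio <- per_point^points
round(ratio, 2)      # 2.7

# Equivalent calculation on the log scale used by a Poisson model
exp(points * log(per_point))
```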
Now, let's add some missing data!
Using ampute we can introduce patterns of missing data according to our own specification. Because we know what the actual values should be, we can see how our estimates change under different assumptions of missingness. Here, the function generates missing data according to MCAR, MAR, and MNAR. For this exercise, we'll set the case-wise proportion of missing data to 50%.
Below, the table shows the parameter estimates for the models under each of the missing data patterns. Consistent with what we'd expect, the estimates under MCAR aren't very far from the actual estimates in the complete dataset. However, the estimates for MAR are noticeably different - especially for the variables Arrests and Conformity Score.
Below, we can visualize the bias introduced by looking at estimates for the conformity score for each missing data scenario. Estimates for the data under MCAR are quite close to the true value - although the uncertainty is higher due to the loss in power. Note that the point estimates for both MAR and MNAR are quite far from the true value.
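A call along these lines would generate the three amputed datasets. This is only a sketch on simulated numeric data. The data frame and its columns are made up, and apart from prop and mech it relies on ampute's default missing data patterns and weights, which may differ from the settings used for the results in this post.

```r
library(mice)

# Simulated complete data; ampute expects complete, numeric input.
set.seed(123)
n <- 1000
complete_dat <- data.frame(
  arrests       = rpois(n, 3),
  conformity    = rnorm(n, 7, 2),
  drug_severity = rnorm(n, 3, 1),
  urine         = rpois(n, 1)
)

# One amputed dataset per mechanism, each with about 50% incomplete cases.
mechanisms <- c("MCAR", "MAR", "MNAR")
amputed <- lapply(mechanisms, function(m) {
  ampute(complete_dat, prop = 0.5, mech = m)$amp  # $amp holds the amputed data
})
names(amputed) <- mechanisms

# Share of rows with at least one missing value in each dataset
sapply(amputed, function(d) mean(!complete.cases(d)))
```

Note that prop is the proportion of incomplete rows, not the proportion of missing cells, which matches the "case-wise" wording above.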
Imputation using MICE
Now that we've identified that missing data is indeed a problem under MAR (and MNAR as well - but we'll set aside this problem for now), we can try to improve our inferences by imputing the missing values. Today, multiple imputation remains one of the most popular methods of handling missing data and is generally recommended over things like hot-deck imputation or even (gasp!) mean replacement. Essentially, multiple imputation sets up a fully conditional model that estimates the missing values for each variable conditional on all other variables. The program then uses those estimates when imputing the other variables with missing values. This process is repeated over a number of iterations until the imputed values stabilize. By generating several imputed datasets rather than just one, we account for the uncertainty in the imputed values.
The method I will be using here is called Multiple Imputation by Chained Equations, or MICE. The "chained equations" part comes from the way missing values for each variable are estimated. MICE essentially takes the following steps to estimate missing values:
1. "Place holder" values are inserted for every missing value (e.g. the mean or median).
2. The "place holders" for one variable (Var. 1) are set to missing.
3. The observed values for 'Var. 1' are regressed on all the other variables in the imputation model. The predictions are then used as replacements for the missing values.
4. Steps 2-3 are repeated for each variable with missing values (i.e. Var. 2, Var. 3, etc...), using the imputed predictions themselves as predictors - hence the 'chained' part of MICE.
5. Steps 2-4 are repeated over a specified number of iterations until the estimates for missing values have converged.
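The steps above can be sketched in a few lines of base R. This is a deliberately simplified toy, not what the mice package actually does. It fills in plain linear regression predictions, whereas mice adds random variation to each imputation (for example through predictive mean matching) and produces several imputed datasets. The variables x1, x2, and x3 are simulated for illustration.

```r
# Toy data with three correlated variables; x2 and x3 get missing values.
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- 0.5 * x1 - 0.4 * x2 + rnorm(n, sd = 0.8)
dat <- data.frame(x1, x2, x3)
dat$x2[sample(n, 100)] <- NA
dat$x3[sample(n, 100)] <- NA
miss <- is.na(dat)  # remember which cells were originally missing

# Step 1: insert place holders (here, the observed column means).
imp <- dat
for (v in names(imp)) {
  imp[[v]][miss[, v]] <- mean(dat[[v]], na.rm = TRUE)
}

# Step 5: repeat the whole cycle for a fixed number of iterations.
for (iter in 1:10) {
  # Step 4: cycle through every variable that had missing values.
  for (v in names(imp)[colSums(miss) > 0]) {
    obs <- !miss[, v]
    # Steps 2-3: ignore the place holders for v, regress its observed values
    # on all other variables (which currently hold their latest imputations).
    form <- reformulate(setdiff(names(imp), v), response = v)
    fit  <- lm(form, data = imp[obs, ])
    # Replace the originally missing cells of v with the model predictions.
    imp[[v]][!obs] <- predict(fit, newdata = imp[!obs, ])
  }
}

head(imp)
```

Because the imputations for x2 feed into the model for x3 and vice versa, each pass uses slightly better predictors than the last, which is the "chained" behaviour described in step 4.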
For this example I fit an imputation model using 10 imputations and 5 iterations of the chained equations, using all the variables as predictors for the missing values. After specifying this (admittedly overly simplistic) imputation model for the missing values in the MAR dataset, we can observe the effect it had on the mean parameter estimates. The table below highlights the estimates from the imputed data relative to those from the other missing data scenarios. Here it looks like MICE did a good job pulling the estimate for conformity score much closer to the true value of -0.05.
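An imputation model like the one described above could be specified roughly as follows. This is a sketch on simulated data, with two assumptions worth flagging. First, I read "10 imputations" as m = 10 and the 5 passes of the chained equations as maxit = 5. Second, the analysis model here is an ordinary Poisson glm pooled with Rubin's rules, used as a lightweight stand-in for the censored brms model.

```r
library(mice)

# Simulated stand-in data; all values and coefficients are made up.
set.seed(2024)
n <- 371
dat <- data.frame(
  arrests       = rpois(n, 3),
  conformity    = round(rnorm(n, 7, 2)),
  drug_severity = sample(0:6, n, replace = TRUE)
)
dat$urine <- rpois(n, exp(-0.5 + 0.17 * dat$drug_severity))

# Introduce MAR missingness in about half of the rows.
amp <- ampute(dat, prop = 0.5, mech = "MAR")$amp

# 10 imputed datasets, 5 iterations each, all variables used as predictors
# (the default predictor matrix). Numeric columns default to method "pmm".
imp <- mice(amp, m = 10, maxit = 5, seed = 2024, printFlag = FALSE)

# Fit the analysis model in each imputed dataset, then pool the estimates.
fits   <- with(imp, glm(urine ~ arrests + conformity + drug_severity,
                        family = poisson()))
pooled <- pool(fits)
summary(pooled)
```

To stay within brms, the brm_multiple function accepts the imputed datasets and combines the posterior draws across them, which would be a closer match to the censored model used earlier in this post.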
Missing data is a ubiquitous problem, yet many researchers are unaware of methods to handle it. Substantial biases can be introduced if patterns of missing data are ignored. Even in the best-case scenario, a loss of power increases uncertainty about parameter estimates.
Methods like multiple imputation via free, open-source packages like 'mice' and 'Amelia' can help alleviate some of these problems. However, like all modeling strategies, these are not magic bullets. Imputation models should be developed thoughtfully, including all relevant predictors and preserving patterns in the data.
My goal with this post is to highlight common situations which arise in data analysis, and to show that relatively simple solutions are available.
Allison, P. D. (2002). Missing data: Quantitative applications in the social sciences. British Journal of Mathematical and Statistical Psychology, 55(1), 193-196.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work?. International journal of methods in psychiatric research, 20(1), 40-49.
Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.
Heckman, J. J. (1977). Sample selection bias as a specification error (with an application to the estimation of labor supply functions).
Little, R. J. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90(431), 1112-1121.
Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data. John Wiley & Sons.