I have a fair amount of missing data that I don’t want to delete prior to my analysis. What are the best options available for me to retain these partially missing cases?
Missing data are a common problem faced by nearly all data analysts, particularly with the increasing emphasis on the collection of repeated assessments over time. Data values can be missing for a variety of reasons. A common situation is when a subject provides data at one time point but fails to provide data at a later time point; this is sometimes called attrition. However, data can also be missing within a single administration. For example, a subject might find a question objectionable and not want to provide a response; a subject might be fatigued or not invested in the study and skip an entire section; or there might be some mechanical failure where data are not recorded or items are inadvertently not presented. Regardless of source, it is very common for assessments to be missing for a portion of the sample under study. Fortunately, there are several excellent options available that allow us to retain cases that only provide partial data.
Historically, any case that was missing any data was simply dropped from the analysis, an approach called listwise deletion. Listwise deletion was widely used primarily because no alternative methods existed that would allow for the inclusion of partially missing cases. However, listwise deletion results in lower power and often produces biased estimates with limited generalizability. Other traditional approaches included pairwise deletion (where each correlation was computed using only the cases with data available on that pair of variables), mean imputation (where the observed mean of a variable was substituted for each missing value and treated as if it had actually been observed), and last-value-carried-forward (where the last observation among a set of repeated measures replaced the subsequent missing values). Like listwise deletion, these other approaches also have significant limitations. Fortunately, there are now several modern approaches to missing data analysis that perform markedly better than these traditional methods. Prior to discussing these modern methods, it helps to consider first the alternative underlying mechanisms that lead to the data being missing in the first place. The terminology and associated acronyms for these mechanisms are a bit labyrinthine, but once understood they bring clarity to the issues at hand.
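To make the traditional approaches concrete, here is a minimal sketch in plain Python. The toy data set, the variable names (t1, t2, t3), and the helper functions are all hypothetical and exist only for illustration; None marks a missing value.

```python
from statistics import mean

# Hypothetical toy data: three repeated measures per subject; None = missing.
rows = [
    {"t1": 4.0, "t2": 5.0, "t3": 6.0},
    {"t1": 3.0, "t2": None, "t3": None},
    {"t1": 5.0, "t2": 6.0, "t3": None},
    {"t1": None, "t2": 4.0, "t3": 5.0},
]
variables = ["t1", "t2", "t3"]


def listwise(rows):
    """Keep only cases that are complete on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def pairwise_n(rows, a, b):
    """Number of cases usable for the correlation between variables a and b."""
    return sum(1 for r in rows if r[a] is not None and r[b] is not None)


def mean_impute(rows):
    """Replace each missing value with the observed mean of that variable."""
    means = {v: mean(r[v] for r in rows if r[v] is not None) for v in variables}
    return [
        {v: (r[v] if r[v] is not None else means[v]) for v in variables}
        for r in rows
    ]


def locf(rows):
    """Carry the last observed value forward across the repeated measures."""
    filled_rows = []
    for r in rows:
        last = None
        filled = {}
        for v in variables:
            if r[v] is not None:
                last = r[v]
            filled[v] = last  # stays None if nothing has been observed yet
        filled_rows.append(filled)
    return filled_rows


print("listwise keeps", len(listwise(rows)), "of", len(rows), "cases")
print("pairwise n for (t1, t2):", pairwise_n(rows, "t1", "t2"))
print("mean-imputed case 2:", mean_impute(rows)[1])
print("LOCF case 2:", locf(rows)[1])
```

Even in this tiny example, listwise deletion discards three of four cases, pairwise deletion bases each correlation on a different subsample, and both single-imputation methods fill in values that carry no indication they were never observed.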
The first missing data mechanism is called missing completely at random, or MCAR, and reflects a process in which data are missing in a purely random fashion that is unrelated to either the missing value itself or other observed variables in the data set. An example is when values are missing because of a programming error that randomly governs the presentation of items to subjects. The second mechanism is called missing at random, or MAR. This mechanism reflects a process in which data are not missing as a function of the missing value itself, but can be missing in relation to other observed variables in the data file. For example, men might be twice as likely to be missing as women, but if biological sex is a measured variable in the data file, then conditioning on sex can satisfy MAR, provided missingness does not further depend on the unobserved values themselves. The final mechanism is missing not at random, or MNAR. This kind of missing data (also known as informatively missing or non-ignorably missing) is the most serious mechanism and defines a process in which data are missing due to the missing value itself. For example, an individual may choose not to respond to a question about drug use because that person is elevated on drug use. These three mechanisms are important to delineate because the primary methods for addressing missing data in an analysis depend on the underlying mechanism being MCAR or MAR but not MNAR.
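The three mechanisms can be illustrated with a small simulation. This is only a sketch: the sample size, the effect of sex on the outcome, and every missingness probability below are assumptions chosen for illustration, not values from any real study.

```python
import random
from statistics import fmean

random.seed(2024)
n = 50_000

# Hypothetical population: sex is always observed, and men score higher on y.
sex = [random.random() < 0.5 for _ in range(n)]  # True = male
y = [(1.0 if s else 0.0) + random.gauss(0.0, 1.0) for s in sex]


def observed_mask(p_missing):
    """Return one boolean per case; True means the y value was observed."""
    return [random.random() >= p for p in p_missing]


# MCAR: every case has the same chance of being missing.
mcar = observed_mask([0.3] * n)
# MAR: men are twice as likely to be missing as women; nothing else matters.
mar = observed_mask([0.4 if s else 0.2 for s in sex])
# MNAR: high values of y itself are more likely to be missing.
mnar = observed_mask([0.5 if v > 0.5 else 0.1 for v in y])


def naive_mean(mask):
    """Mean of y among the observed cases, ignoring sex."""
    return fmean(v for v, m in zip(y, mask) if m)


def sex_adjusted_mean(mask):
    """Observed means within sex, weighted by the full-sample sex split."""
    p_male = fmean(1.0 if s else 0.0 for s in sex)
    male = fmean(v for v, s, m in zip(y, sex, mask) if m and s)
    female = fmean(v for v, s, m in zip(y, sex, mask) if m and not s)
    return p_male * male + (1.0 - p_male) * female


full_mean = fmean(y)
print("full data      :", round(full_mean, 3))
for name, mask in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
    print(f"{name:<5} naive    :", round(naive_mean(mask), 3))
    print(f"{name:<5} adjusted :", round(sex_adjusted_mean(mask), 3))
```

Under these assumptions the naive mean is close to the full-data mean only for MCAR. Under MAR the naive mean is biased, but using the observed sex variable recovers the full-data mean. Under MNAR neither estimate recovers it, because the cause of missingness is the unobserved value itself.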
There are two general approaches to fitting models using partially missing cases. The first is multiple imputation, or MI. Under MI, an imputation model is defined that uses variables that were observed in the data file to generate (or impute) numerical values for the observations that were missing. However, to reflect that these values are imputed with uncertainty, this process is repeated multiple times resulting in a different imputed value for each repetition. Thus, 10 or 20 imputed data sets are created, the model of interest is fitted to each data set, and the results from all the estimated models are pooled for subsequent inference. When using MI, the missing data must be MAR given the variables included in the imputation model (e.g., if missingness varies by sex, then sex must be included in the imputation model), but not all of these variables need to be in the analysis model (e.g., sex might be included in the imputation model but not in the fitted model and is thus an “auxiliary” variable).
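The impute, analyze, and pool sequence can be sketched as follows. This is a simplified illustration, not how any particular package implements MI: the data are simulated, the imputation model is a single-predictor regression refit to a bootstrap sample for each imputation (one simple way to reflect uncertainty in the imputation model), and the analysis model is just the mean of y. All function names are hypothetical. The pooling step uses Rubin's rules, in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
from statistics import fmean, variance


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5


def impute_once(x, y, rng):
    """Return one completed copy of y, imputing from x with random noise."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    boot = [rng.choice(complete) for _ in complete]  # bootstrap resample
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    return [
        yi if yi is not None else a + b * xi + rng.gauss(0.0, sd)
        for xi, yi in zip(x, y)
    ]


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total sampling variance."""
    m = len(estimates)
    within = fmean(variances)
    between = variance(estimates)
    return fmean(estimates), within + (1.0 + 1.0 / m) * between


rng = random.Random(7)
n = 2000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_full = [2.0 + xi + rng.gauss(0.0, 1.0) for xi in x]
# Assumed MAR process: y is more often missing when the observed x is positive.
y = [
    None if rng.random() < (0.6 if xi > 0 else 0.1) else yi
    for xi, yi in zip(x, y_full)
]

m = 20
estimates, variances = [], []
for _ in range(m):
    y_imp = impute_once(x, y, rng)
    estimates.append(fmean(y_imp))         # analysis model: the mean of y
    variances.append(variance(y_imp) / n)  # its squared standard error

pooled_mean, pooled_var = pool(estimates, variances)
complete_case_mean = fmean(yi for yi in y if yi is not None)
print("complete-case mean:", round(complete_case_mean, 3))
print("pooled MI mean    :", round(pooled_mean, 3))
print("pooled MI SE      :", round(pooled_var ** 0.5, 3))
```

Here x plays the role of an auxiliary variable: it appears in the imputation model because it predicts missingness, but not in the analysis model, which estimates only the mean of y.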
The second primary approach for accommodating partial missingness is called full information maximum likelihood, or FIML. Here, no raw data are imputed; instead, models are fit to both the complete data and partially missing data and each individual observation contributes whatever data are available to the overall likelihood function. Implicitly, cases with complete data contribute more to the analysis but cases with partial data also contribute what information they have. Because multiple data files are not generated, FIML requires only a single model be estimated from which all inferences are drawn. Since there is no distinction between an imputation model versus the model of interest in FIML, MAR must be satisfied by including all variables predictive of missingness in the fitted model (either as a structural part of the fitted model or innocuously included as auxiliary variables).
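The idea that each case contributes whatever data it has to the likelihood can be sketched for a bivariate normal model. The example below is illustrative only: the data are simulated under an assumed MAR process, and the closed-form maximum likelihood estimates apply only because x is always observed and only y is missing (a monotone pattern). General-purpose FIML software instead maximizes this kind of casewise likelihood iteratively.

```python
import math
import random
from statistics import fmean


def case_loglik(x, y, mu_x, mu_y, var_x, var_y, cov_xy):
    """Log-likelihood contribution of one case under a bivariate normal."""
    if y is None:
        # Only x observed: the case contributes the marginal density of x.
        return -0.5 * (math.log(2 * math.pi * var_x) + (x - mu_x) ** 2 / var_x)
    det = var_x * var_y - cov_xy ** 2
    dx, dy = x - mu_x, y - mu_y
    quad = (var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def total_loglik(data, params):
    """Sum the contributions of complete and partially missing cases."""
    return sum(case_loglik(x, y, *params) for x, y in data)


def ml_moments(pairs):
    """ML (divide-by-n) means, variances, and covariance of complete pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = fmean(xs), fmean(ys)
    vx = fmean((x - mx) ** 2 for x in xs)
    vy = fmean((y - my) ** 2 for y in ys)
    cxy = fmean((x - mx) * (y - my) for x, y in pairs)
    return mx, my, vx, vy, cxy


rng = random.Random(11)
data = []
for _ in range(2000):
    xi = rng.gauss(0.0, 1.0)
    yi = 2.0 + xi + rng.gauss(0.0, 1.0)
    if rng.random() < (0.6 if xi > 0 else 0.1):  # assumed MAR: depends on x
        yi = None
    data.append((xi, yi))

# Listwise deletion: estimate everything from the complete cases only.
complete = [(xi, yi) for xi, yi in data if yi is not None]
listwise = ml_moments(complete)  # (mu_x, mu_y, var_x, var_y, cov_xy)

# Maximum likelihood using all cases: x moments from everyone, the regression
# of y on x from the complete cases, then combine the two pieces.
cx, cy, cvx, cvy, ccxy = listwise
slope = ccxy / cvx
resid_var = cvy - slope * ccxy
all_x = [xi for xi, _ in data]
mu_x = fmean(all_x)
var_x = fmean((xi - mu_x) ** 2 for xi in all_x)
mu_y = cy + slope * (mu_x - cx)
fiml = (mu_x, mu_y, var_x, resid_var + slope ** 2 * var_x, slope * var_x)

print("listwise mean of y:", round(listwise[1], 3))
print("FIML mean of y    :", round(fiml[1], 3))
print("log-likelihood gain:",
      round(total_loglik(data, fiml) - total_loglik(data, listwise), 2))
```

No values are imputed at any point. The partially missing cases improve the estimates of the x moments, and through the covariance between x and y they also correct the estimated mean of y.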
There are many situations in which MI and FIML operate almost precisely the same and it is common for each approach to produce comparable results when based on the same model and data. However, there are situations where each has specific advantages and disadvantages with respect to the other. For instance, FIML does not require the estimation and pooling of results in multiple data sets, making it simpler for the user. In contrast, with MI it is often possible to include many more variables in the imputation model (to help satisfy MAR) than are ultimately of interest in the fitted model. Further, different software packages implement these options to differing degrees, so the user must be fully aware of what each program is doing when fitting models to partially missing cases.
The missingness mechanism we have not yet addressed is MNAR. Unfortunately, MNAR data are distinctly harder to handle. Part of the problem is that one can never establish from the observed data alone whether the data are MAR versus MNAR, because to do so would require actually observing the data that are missing. Thus, approaches for accommodating MNAR, which include selection models and pattern mixture models, are best implemented as sensitivity analyses. These approaches build in assumptions about the non-random missing data process so that we can observe how much the substantive conclusions drawn from the analysis change relative to when assuming MAR. Fortunately, in many applications this is unnecessary as MI and FIML (though they assume MAR) often perform well with MNAR data so long as the informativeness of the missing data process is not strong and the fraction of missing data is not large.
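One simple form of sensitivity analysis in the pattern mixture spirit is a delta adjustment: values imputed under MAR are shifted by an assumed amount, and the analysis is repeated across a range of shifts. The sketch below uses made-up numbers throughout; both the observed scores and the MAR-based imputations are hypothetical, and a real application would apply the shift within each imputed data set before pooling.

```python
from statistics import fmean

# Hypothetical observed scores and MAR-based imputations for nonrespondents.
observed = [2.1, 1.8, 2.5, 2.0, 1.6, 2.4]
imputed_mar = [2.2, 1.9, 2.3]


def mean_under_shift(delta):
    """Estimated mean if nonrespondents score delta above their MAR prediction."""
    return fmean(observed + [v + delta for v in imputed_mar])


# delta = 0 reproduces the MAR analysis; larger values assume stronger MNAR.
for delta in (0.0, 0.5, 1.0):
    print("delta =", delta, "-> estimated mean:", round(mean_under_shift(delta), 3))
```

If the substantive conclusion holds across all plausible values of delta, the MAR-based result can be reported with more confidence. If it changes at modest values of delta, that fragility is worth reporting.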
In sum, the existence of MI and FIML makes listwise deletion of missing cases almost indefensible except in a narrow band of specific situations. Indeed, FIML is now the default estimator in most SEM software packages, so this issue becomes almost transparent to the user. But be certain that you know precisely how the software is handling missing cases and verify that cases are not being dropped without your knowledge. A non-exhaustive sampling of recommended readings is below.
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576.
Graham, J. W. (2012). Missing data: Analysis and design. Springer.
Graham, J. W., Taylor, B. J., Olchowski, A. E., & Cumsille, P. E. (2006). Planned missing data designs in psychological research. Psychological Methods, 11, 323-343.
Harel, O., & Schafer, J. L. (2009). Partial and latent ignorability in missing-data problems. Biometrika, 96, 37-50.
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John Wiley & Sons.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research, 33, 545-571.