Principled missing data methods for researchers

Yiran Dong, Chao-Ying Joanne Peng

Abstract

The impact of missing data on quantitative research can be serious, leading to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings. In this paper, we discuss and demonstrate three principled missing data methods: multiple imputation (MI), full information maximum likelihood (FIML), and the expectation-maximization (EM) algorithm, each applied to a real-world data set. Results are contrasted with those obtained from the complete data set and from the listwise deletion method. The relative merits of each method are noted, along with the features they share. The paper concludes with an emphasis on the importance of statistical assumptions and with recommendations for researchers. The quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into the review standards for manuscripts submitted for publication.

Keywords: EM; FIML; Listwise deletion; MAR; MCAR; MI; MNAR; Missing data.


Source: PubMed
