A comparison of multiple imputation methods for missing data in longitudinal studies

Md Hamidul Huque, John B Carlin, Julie A Simpson, Katherine J Lee

Abstract

Background: Multiple imputation (MI) is now widely used to handle missing data in longitudinal studies. Several MI techniques have been proposed to impute incomplete longitudinal covariates, including standard fully conditional specification (FCS-Standard) and joint multivariate normal imputation (JM-MVN), which treat repeated measurements as distinct variables, and various extensions based on generalized linear mixed models. Although these MI approaches have been implemented in various software packages, there has not been a comprehensive evaluation of the relative performance of these methods in the context of longitudinal data.
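To make "treat repeated measurements as distinct variables" concrete, the sketch below reshapes long-format records (one row per child per wave) into wide format (one row per child, one column per wave), which is the layout that methods such as FCS-Standard and JM-MVN operate on. The function name `long_to_wide`, the column names and the toy values are illustrative assumptions, not taken from the paper or from LSAC.

```python
from collections import defaultdict


def long_to_wide(records, id_key, wave_key, value_key, waves):
    """Reshape long-format rows into one row per subject.

    Each wave's measurement becomes its own variable; a wave with no
    record for a subject is stored as None (i.e. a missing value).
    """
    by_id = defaultdict(dict)
    for row in records:
        by_id[row[id_key]][row[wave_key]] = row[value_key]

    wide = []
    for subject, values in sorted(by_id.items()):
        wide_row = {id_key: subject}
        for wave in waves:
            wide_row[f"{value_key}_w{wave}"] = values.get(wave)
        wide.append(wide_row)
    return wide


# Hypothetical toy data: child 2 has no BMI recorded at wave 2.
long_rows = [
    {"child": 1, "wave": 1, "bmi": 16.2},
    {"child": 1, "wave": 2, "bmi": 16.9},
    {"child": 2, "wave": 1, "bmi": 15.4},
]
wide_rows = long_to_wide(long_rows, "child", "wave", "bmi", waves=[1, 2])
```

In this layout an imputation model sees `bmi_w1` and `bmi_w2` as two separate, possibly incomplete variables, whereas the mixed-model-based extensions work on the long format directly.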

Method: Using both empirical data and a simulation study based on the six waves of the Longitudinal Study of Australian Children (N = 4661), we investigated the performance of a wide range of MI methods available in standard software packages. The target analysis was the association between child body mass index (BMI) and quality of life, estimated using both a linear regression and a linear mixed-effects model.
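The abstract does not state how the analysis-model estimates from the imputed datasets were combined. The standard approach in MI is Rubin's rules, and the sketch below assumes that is what applies here. The function name `pool_rubin` and the numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets with Rubin's rules.

    Returns the pooled point estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors, m >= 2")

    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficients for BMI from m = 3 imputed datasets.
pooled_est, pooled_se = pool_rubin([-0.10, -0.12, -0.08], [0.02, 0.02, 0.02])
```

The between-imputation term is what carries the extra uncertainty due to the missing data, so the pooled standard error is never smaller than the root of the average within-imputation variance.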

Results: We identified and compared 12 different MI methods for imputing missing data in longitudinal studies. Analysis of simulated data under missing at random (MAR) mechanisms showed that the generally available MI methods provided less biased estimates with better coverage for the linear regression model, and that around half of these methods performed well in estimating the regression parameters of a linear mixed model with a random intercept. In the empirical data, we observed an inverse association between child BMI and quality of life, both with available data and with multiple imputation.
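Bias and coverage, the two performance measures reported here, can be computed from a set of simulated datasets roughly as follows. This is a minimal sketch: the function name `bias_and_coverage`, the normal-approximation multiplier of 1.96 and the toy numbers are assumptions, and the paper's own intervals may use a different reference distribution.

```python
from statistics import mean


def bias_and_coverage(estimates, std_errors, true_value, z=1.96):
    """Summarise one coefficient across simulated datasets.

    bias     : mean estimate minus the true (data-generating) value
    coverage : proportion of 95% intervals that contain the true value
    """
    bias = mean(estimates) - true_value
    covered = [
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    ]
    return bias, sum(covered) / len(covered)


# Hypothetical results from four simulated datasets; the true coefficient is -0.10.
bias, coverage = bias_and_coverage(
    estimates=[-0.11, -0.09, -0.10, -0.16],
    std_errors=[0.02, 0.02, 0.02, 0.02],
    true_value=-0.10,
)
```

With 1000 simulated datasets, as in the figures, a method performing well would show bias close to zero and coverage close to the nominal 95%.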

Conclusion: Both FCS-Standard and JM-MVN performed well for the estimation of regression parameters in both analysis models. More complex methods that explicitly reflect the longitudinal structure for these analysis models may only be needed in specific circumstances such as irregularly spaced data.

Keywords: FCS; Joint modelling; Linear mixed model; Longitudinal missing data; MICE; Multilevel multiple imputation; Multiple imputation.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable. De-identified secondary (LSAC) datasets were obtained from the National Centre for Longitudinal Data (NCLD).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Distribution of the bias in the estimated regression coefficients (i.e., mean changes in the QoL z-score associated with each covariate) for analysis model (1) across the 1000 simulated datasets following complete data, available data and 12 multiple imputation methods. The top and bottom panels show the distribution of the bias in the estimated regression coefficients for covariates with missing data, whereas the middle panel shows the distribution of the bias associated with the fully observed covariate
Fig. 2
Estimated coverage of the 95% confidence interval for the regression coefficients in analysis model (1), derived from 1000 simulated datasets. The dotted lines indicate the nominal value of 95%
Fig. 3
Distribution of the bias in the estimated regression coefficients (i.e., mean changes in the QoL z-score associated with each covariate) for analysis model (2) across the 1000 simulated datasets following complete data, available data and 12 multiple imputation methods. Top, left and bottom right panels show the distribution of the bias in the estimated regression coefficients for covariates with missing data, and all other panels show the distribution of the bias associated with fully observed covariates
Fig. 4
Estimated coverage of the 95% confidence interval for the regression coefficients in analysis model (2), derived from 1000 simulated datasets. The dotted lines indicate the nominal value of 95%
Fig. 5
Average computational time (in seconds) for a single imputation for each of the MI methods when applied to a single simulated dataset
Fig. 6
Estimated regression coefficients and 95% CI for analysis model (1) applying available data and all the approaches to handle missing data in LSAC
Fig. 7
Estimated regression coefficients with 95% CI for analysis model (2) applying available data and all the MI approaches to handle missing data in LSAC

Source: PubMed
