Missing data and multiple imputation in clinical epidemiological research

Alma B Pedersen, Ellen M Mikkelsen, Deirdre Cronin-Fenton, Nickolaj R Kristensen, Tra My Pham, Lars Pedersen, Irene Petersen

Abstract

Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can pose considerable challenges in the analysis and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analysis, the missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method for dealing with missing data that accounts for the uncertainty associated with the missing values. Multiple imputation is implemented in most statistical software under the MAR assumption and, when that assumption holds, provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
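As a hedged illustration of how multiple imputation carries the uncertainty from missing data into the final estimate, the sketch below pools results from several imputed datasets using Rubin's rules. The function name `pool_rubin` and the example numbers are assumptions made for illustration; they are not taken from the article.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation results with Rubin's rules.

    estimates: point estimates (e.g. log odds ratios), one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance of estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed datasets (illustrative numbers only).
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"pooled estimate = {pooled:.3f}, pooled SE = {total ** 0.5:.3f}")
```

The between-imputation term is what single value imputation leaves out: with only one filled-in dataset there is no spread across imputations to add, so the reported standard error tends to be too small.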

Keywords: MAR; MCAR; MNAR; missing data; multiple imputation; observational study.

Conflict of interest statement

Disclosure: The authors report no conflicts of interest in this work.

Figures

Figure 1
Distribution of BMI values by outcome in full dataset (A) and in a dataset with 35% missing values (B) for BMI handled by creating a missing BMI category. Abbreviation: BMI, body mass index.
Figure 2
Normal distribution of observed BMI in a full dataset of 10,000 observations. Abbreviation: BMI, body mass index.
Figure 3
Distribution of BMI in a dataset of 10,000 observations, where 35% of BMI values are missing and replaced by the observed mean BMI value. Abbreviation: BMI, body mass index.
Figure 4
Selection of variables in order to create multiple imputed datasets when looking into the association between body mass index and transfusion risk.
Figure 5
The three main stages of implementing multiple imputation.
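
The scenario in the Figure 3 caption (10,000 observations, 35% of BMI values replaced by the observed mean) can be approximated with simulated data to show why single value imputation understates variability. This is a minimal sketch: the BMI mean of 26 and standard deviation of 4, the seed, and the use of MCAR missingness are all assumptions for illustration, not values from the article.

```python
import random
from statistics import mean, stdev

random.seed(2017)  # fixed seed so the sketch is reproducible

N = 10_000
MISSING_PERCENT = 35

# Simulated "full" BMI values; mean 26 and SD 4 are illustrative assumptions.
bmi_full = [random.gauss(26.0, 4.0) for _ in range(N)]

# Set 35% of the values to missing completely at random (MCAR).
missing_idx = set(random.sample(range(N), N * MISSING_PERCENT // 100))
observed = [x for i, x in enumerate(bmi_full) if i not in missing_idx]

# Single value imputation: every missing BMI gets the observed mean.
observed_mean = mean(observed)
bmi_imputed = [observed_mean if i in missing_idx else x
               for i, x in enumerate(bmi_full)]

sd_full = stdev(bmi_full)
sd_imputed = stdev(bmi_imputed)
print(f"SD of full data:          {sd_full:.2f}")
print(f"SD after mean imputation: {sd_imputed:.2f}")
```

Because the imputed values all sit exactly at the mean, they add nothing to the sum of squared deviations, so with 65% of values observed the standard deviation shrinks by a factor of roughly the square root of 0.65 (about 0.81). Any standard errors computed from the filled-in data would be correspondingly too narrow.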

Source: PubMed
