Generative adversarial networks for imputing missing data for big data clinical research

Weinan Dong, Daniel Yee Tak Fong, Jin-Sun Yoon, Eric Yuk Fai Wan, Laura Elizabeth Bedford, Eric Ho Man Tang, Cindy Lo Kuen Lam

Abstract

Background: Missing data is a pervasive problem in clinical research. Generative adversarial imputation nets (GAIN), a novel machine learning approach to data imputation, has the potential to impute missing data accurately and efficiently but has not yet been evaluated in large empirical clinical datasets.

Objectives: This study aimed to evaluate the accuracy of GAIN in imputing missing values in large real-world clinical datasets with mixed-type variables. The computational efficiency of GAIN was also evaluated. The performance of GAIN was compared with that of two commonly used methods, multiple imputation by chained equations (MICE) and missForest.

Methods: Two real-world clinical datasets were used. The first came from a cohort study on the long-term outcomes of patients with diabetes (50,000 complete cases), and the second from a cohort study on the effectiveness of a risk assessment and management programme for patients with hypertension (10,000 complete cases). Missing data (missing at random) in the independent variables were simulated at two missingness rates (20% and 50%). Imputation accuracy was measured by the normalized root mean square error (NRMSE) between imputed and true values for continuous variables and by the proportion of falsely classified entries (PFC) for categorical variables. Computation time per imputation was recorded for each method. Differences in accuracy between imputation methods were compared using ANOVA or a non-parametric test.
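
The evaluation described in the Methods can be illustrated with a minimal sketch. It is not the authors' code. The abstract does not state how NRMSE was normalized or how missing-at-random masks were generated, so the sketch assumes the normalization used in the missForest paper (mean squared error over the masked entries divided by the variance of their true values) and a simple rank-based rule in which the chance of being missing rises with an always-observed variable. The function names (mar_mask, nrmse, pfc), the toy variables (age, sbp, smoker), and the mean/mode baseline imputer are all hypothetical.

```python
import numpy as np


def mar_mask(driver, rate, rng):
    """Return a boolean mask (True = set to missing).

    The probability of being missing increases with the rank of an
    always-observed driver variable, so missingness depends only on
    observed data (MAR). Assumes rate <= 0.5 so that every probability
    stays below 1.
    """
    n = len(driver)
    ranks = driver.argsort().argsort() + 1   # ranks 1..n
    prob = 2.0 * rate * ranks / (n + 1)      # average probability equals rate
    return rng.random(n) < prob


def nrmse(true, imputed, mask):
    """Normalized RMSE over the masked entries of a continuous variable."""
    diff = true[mask] - imputed[mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(true[mask]))


def pfc(true, imputed, mask):
    """Proportion of masked categorical entries imputed incorrectly."""
    return np.mean(true[mask] != imputed[mask])


# Toy data standing in for a clinical dataset (hypothetical variables).
rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(60, 10, n)                    # always observed
sbp = 100 + 0.5 * age + rng.normal(0, 10, n)   # continuous, to be masked
smoker = rng.random(n) < 0.2                   # binary, to be masked

mask = mar_mask(age, rate=0.5, rng=rng)

# Naive baseline imputers: observed mean and observed mode.
sbp_imputed = np.where(mask, sbp[~mask].mean(), sbp)
mode = smoker[~mask].mean() >= 0.5
smoker_imputed = np.where(mask, mode, smoker)

print("missingness rate:", mask.mean())
print("NRMSE (mean imputation):", nrmse(sbp, sbp_imputed, mask))
print("PFC (mode imputation):", pfc(smoker, smoker_imputed, mask))
```

Both metrics are computed only over the entries that were artificially removed, which is why the complete cases are needed as ground truth. Lower values indicate better imputation for both NRMSE and PFC.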

Results: Both missForest and GAIN were more accurate than MICE. GAIN showed accuracy similar to that of missForest when the simulated missingness rate was 20%, but was more accurate when the simulated missingness rate was 50%. GAIN was the most accurate method for imputing skewed continuous and imbalanced categorical variables at both missingness rates. GAIN was also much faster (32 min on a PC) than missForest (1300 min) when the sample size was 50,000.

Conclusion: GAIN was more accurate than MICE and missForest as an imputation method for missing data in large real-world clinical datasets, and was more resistant to a high missingness rate (50%). Its high computation speed is an added advantage in big clinical data research. It holds potential as an accurate and efficient method for missing data imputation in future big data clinical research.

Trial registration: ClinicalTrials.gov ID: NCT03299010; Unique Protocol ID: HKUCTR-2232.

Keywords: Big data; Clinical research; Generative adversarial network; Machine learning; Missing data imputation.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Density plots displaying the distribution of the absolute difference between imputed values and true values on continuous variables by different methods (missingness rate = 50%). (Note: a and b are representative continuous variables in DM-data, c and d are representative continuous variables in HT-data)
Fig. 2
Bar plots displaying the distribution of imputed allocation of categorical variables by different methods (missingness rate = 50%). (Note: a and b are representative categorical variables in DM-data, c and d are representative categorical variables in HT-data; shaded areas indicate the proportion correctly imputed in each category by each method)
Fig. 3
Computation time of one imputation process by each method on DM-data. a Computation time on PC; b Computation time on HPC

Source: PubMed