Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods

Shaun R Seaman, Jonathan W Bartlett, Ian R White

Abstract

Background: Multiple imputation is often used for missing data. When a model contains as covariates more than one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a regression with outcome Y and covariates X and X². In 'passive imputation' a value X* is imputed for X and then X² is imputed as (X*)². A recent proposal is to treat X² as 'just another variable' (JAV) and impute X and X² under multivariate normality.
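The difference between passive imputation and JAV can be sketched in a few lines. The example below is illustrative only and is not the authors' code: the toy data, the helper name `draw_conditional_normal`, and the 30% missingness rate are assumptions, and it makes a single stochastic draw with parameter uncertainty ignored, so it is not proper multiple imputation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): Y depends on X and X^2.
n = 500
x = rng.normal(2.0, 1.0, n)
y = 2 * x + x**2 + rng.normal(0.0, 2.0, n)
missing = rng.random(n) < 0.3  # X missing completely at random
obs = ~missing


def draw_conditional_normal(targets, y_obs, y_mis, rng):
    """Regress each column of `targets` on Y using observed rows, then draw
    values for the missing rows from the fitted (multivariate) normal model.
    Hypothetical helper; parameter uncertainty is ignored for brevity."""
    design_obs = np.column_stack([np.ones_like(y_obs), y_obs])
    design_mis = np.column_stack([np.ones_like(y_mis), y_mis])
    coef, *_ = np.linalg.lstsq(design_obs, targets, rcond=None)
    resid = targets - design_obs @ coef
    cov = np.atleast_2d(np.cov(resid, rowvar=False))
    noise = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=len(y_mis))
    return design_mis @ coef + noise


# Passive imputation: impute X, then square the imputed value.
x_star = draw_conditional_normal(x[obs][:, None], y[obs], y[missing], rng)[:, 0]
passive = np.column_stack([x_star, x_star**2])

# JAV: treat X^2 as a separate variable and impute (X, X^2) jointly.
jav = draw_conditional_normal(
    np.column_stack([x[obs], x[obs] ** 2]), y[obs], y[missing], rng
)

# Passive imputations satisfy X2 = X^2 by construction; JAV imputations do not.
passive_consistent = np.allclose(passive[:, 1], passive[:, 0] ** 2)
jav_consistent = np.allclose(jav[:, 1], jav[:, 0] ** 2)
```

In this sketch the JAV column for X² is generally not the square of the JAV column for X, which is the defining feature of the approach.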

Methods: We use simulation to investigate the performance of three methods that can easily be implemented in standard software: 1) linear regression of X on Y to impute X then passive imputation of X²; 2) the same regression but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study.
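Predictive mean matching (method 2) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the study: the function name `pmm_impute`, the donor pool size `k = 5`, and the toy data are hypothetical, and the regression coefficients are not redrawn from their posterior as a full PMM procedure would do.

```python
import numpy as np


def pmm_impute(x, y, missing, k=5, rng=None):
    """Single-imputation sketch of predictive mean matching for X given Y.

    For each case with missing X, find the k observed cases whose predicted
    X is closest, pick one at random, and borrow its observed X.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs = ~missing
    design = np.column_stack([np.ones_like(y), y])
    coef, *_ = np.linalg.lstsq(design[obs], x[obs], rcond=None)
    pred = design @ coef  # predicted mean of X for every case
    x_obs, pred_obs = x[obs], pred[obs]
    imputed = x.copy()
    for i in np.flatnonzero(missing):
        nearest = np.argsort(np.abs(pred_obs - pred[i]))[:k]  # donor pool
        imputed[i] = x_obs[rng.choice(nearest)]
    return imputed


# Toy usage (assumed data): skewed X, Y with a quadratic dependence on X.
rng = np.random.default_rng(1)
n = 300
x_full = rng.lognormal(0.0, 0.5, n)
y = 2 * x_full + x_full**2 + rng.normal(0.0, 1.0, n)
missing = rng.random(n) < 0.3
x = np.where(missing, np.nan, x_full)

x_imp = pmm_impute(x, y, missing, k=5, rng=rng)
x_sq_imp = x_imp**2  # passive imputation of X^2 from the matched X
```

Because every imputed value is borrowed from an observed donor, the imputations stay within the observed range of X, which is one reason PMM is considered for non-normal covariates.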

Results: JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term and X is missing completely at random. When X is missing at random, JAV may be biased, but this bias is generally less than for passive imputation and PMM. Coverage for JAV was usually good when bias was small. However, in some scenarios with a more pronounced quadratic effect, bias was large and coverage poor. When the analysis was logistic regression, JAV's performance was sometimes very poor. PMM generally improved on passive imputation, in terms of bias and coverage, but did not eliminate the bias.

Conclusions: Given the current state of available software, JAV is the best of a set of imperfect imputation methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.

Figures

Figure 1
Typical datasets for normally or log-normally distributed X (each with mean 2 and variance 1), normally distributed Y with mean 2X + X² or (X - 2)², and R² = 0.1, 0.5 or 0.8. Dotted line shows expected value of Y given X.
Figure 2
Log plasma vitamin C and log dietary vitamin C in 15415 individuals for whom both variables are observed.
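Datasets like those described for Figure 1 could be generated along the following lines. This is a sketch under assumptions, not the authors' simulation code: the function name `simulate` is hypothetical, and the residual variance is chosen from the sample variance of E[Y | X] so that the target R² is approximately achieved. For the log-normal case, the parameters follow from matching moments: with mean m and variance v, the log-scale variance is log(1 + v / m²) and the log-scale mean is log(m) minus half of that.

```python
import numpy as np


def simulate(n, x_dist, mean_fn, r2, rng):
    """Draw X with mean 2 and variance 1, then normal Y with mean mean_fn(X)
    and residual variance chosen so that Var(E[Y|X]) / Var(Y) is about r2."""
    if x_dist == "normal":
        x = rng.normal(2.0, 1.0, n)
    elif x_dist == "lognormal":
        s2 = np.log(1.0 + 1.0 / 2.0**2)   # log-scale variance
        mu = np.log(2.0) - s2 / 2.0       # log-scale mean
        x = rng.lognormal(mu, np.sqrt(s2), n)
    else:
        raise ValueError("x_dist must be 'normal' or 'lognormal'")
    m = mean_fn(x)
    sigma2 = np.var(m) * (1.0 - r2) / r2  # assumption: empirical Var(E[Y|X])
    y = m + rng.normal(0.0, np.sqrt(sigma2), n)
    return x, y


rng = np.random.default_rng(2)
x, y = simulate(200_000, "lognormal", lambda x: 2 * x + x**2, r2=0.5, rng=rng)
achieved_r2 = np.var(2 * x + x**2) / np.var(y)
```

The other mean function in the caption, (X - 2)², can be passed as `lambda x: (x - 2) ** 2`.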

Source: PubMed
