Variable selection - A review and recommendations for the practicing statistician

Georg Heinze, Christine Wallisch, Daniela Dunkler

Abstract

Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. The theory of statistical models is well established if the set of independent variables to consider is fixed and small. In that case, we can assume that effect estimates are unbiased and that the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with a number of candidate variables in the range of 10 to 30. This number is often too large to be fully considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to generalized linear models or to models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise the stability of a final model, the unbiasedness of regression coefficients, and the validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on the application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose quantities, based on resampling the entire variable selection process, that should be routinely reported by software packages offering automated variable selection algorithms.
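As a concrete illustration of the resampling idea mentioned in the abstract, the following Python sketch reruns an AIC-based backward elimination on bootstrap resamples and reports bootstrap inclusion frequencies, one of the quantities proposed for routine reporting. This is a minimal sketch under assumptions: the simulated data, sample size, number of candidate variables, true coefficients, and the use of statsmodels are illustrative choices, not the paper's example.

```python
# Minimal sketch: bootstrap inclusion frequencies for AIC-based backward elimination.
# The data-generating model and all variable names are hypothetical illustrations.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 8
X = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"x{j+1}" for j in range(p)])
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0])  # assumed: only x1, x2, x5 are truly non-zero
y = X.values @ beta + rng.normal(size=n)

def backward_aic(X, y):
    """Backward elimination: keep dropping the variable whose removal lowers the AIC most."""
    selected = list(X.columns)
    current_aic = sm.OLS(y, sm.add_constant(X[selected])).fit().aic
    improved = True
    while improved and selected:
        improved = False
        for var in list(selected):
            trial = [v for v in selected if v != var]
            exog = sm.add_constant(X[trial]) if trial else np.ones((len(y), 1))
            aic = sm.OLS(y, exog).fit().aic
            if aic < current_aic:
                current_aic, best_drop, improved = aic, var, True
        if improved:
            selected.remove(best_drop)
    return selected

# Resample the ENTIRE selection process and record which variables are selected each time.
B = 200
counts = pd.Series(0, index=X.columns, dtype=float)
for _ in range(B):
    idx = rng.integers(0, n, size=n)  # bootstrap resample of rows
    sel = backward_aic(X.iloc[idx].reset_index(drop=True), y[idx])
    counts[sel] += 1

print("Bootstrap inclusion frequencies (%):")
print((100 * counts / B).round(1).sort_values(ascending=False))
```

Variables with inclusion frequencies near 100% are stably selected; frequencies near the nominal selection level for noise variables signal instability of the final model.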

Keywords: change-in-estimate criterion; penalized likelihood; resampling; statistical model; stepwise selection.

© 2017 The Authors. Biometrical Journal Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Figures

Figure 1
Simulation study to illustrate possible differential effects of variable selection. Graphs show scatterplots of the estimated regression coefficients β̂1 and β̂2 in 50 simulated datasets of size N=50 with two standard normal IVs with correlation ρ=0.5. Circles and dots indicate simulated datasets in which a test of the null hypothesis β2=0 yields a p‐value greater than or lower than 0.157, respectively. The dashed lines are regression lines of β̂1 on β̂2; they indicate how β̂1 would change if β2 were set to 0
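The following Python sketch mimics the simulation setup described in the caption: 50 datasets of size N=50, two standard normal IVs with correlation ρ=0.5, and the 0.157 significance threshold for retaining the second IV. The true regression coefficients, error variance, and random seed are assumptions made for illustration, since the caption does not specify them.

```python
# Minimal sketch of the Figure 1 simulation setup.
# Only N=50, rho=0.5, and the 0.157 threshold come from the caption; the rest is assumed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_datasets, N, rho = 50, 50, 0.5
cov = np.array([[1.0, rho], [rho, 1.0]])
beta_true = np.array([1.0, 1.0])  # assumed true effects of the two IVs

results = []
for _ in range(n_datasets):
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=N)
    y = X @ beta_true + rng.normal(size=N)
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    b1, b2 = fit.params[1], fit.params[2]
    p2 = fit.pvalues[2]                    # test of H0: beta_2 = 0
    results.append((b1, b2, p2 < 0.157))   # dots (selected) vs. circles (not selected)

b1s, b2s, selected = map(np.array, zip(*results))
print(f"x2 retained (p < 0.157) in {selected.mean():.0%} of the datasets")
# Regression of beta1-hat on beta2-hat across datasets (the dashed line in the figure):
# its slope indicates how beta1-hat would shift if beta2 were forced to 0.
slope = np.polyfit(b2s, b1s, 1)[0]
print(f"Slope of beta1-hat on beta2-hat: {slope:.2f}")
```

Plotting b1s against b2s, with different markers according to the selection indicator, reproduces the type of scatterplot shown in the figure.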
Figure 2
A schematic network of dependencies arising from variable selection. β, regression coefficient; IV, independent variable; RMSE, root mean squared error

Source: PubMed
