Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the "large N, small p" setting

Peter C Austin, Frank E Harrell Jr, Ewout W Steyerberg

Abstract

Machine learning approaches are increasingly suggested as tools to improve the prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than classical statistical methods. To this end, we examined the impact of the data-generating process on the relative predictive accuracy of six machine and statistical learning methods: bagged classification trees, stochastic gradient boosting machines using trees as the base learners, random forests, the lasso, ridge regression, and unpenalized logistic regression. We performed simulations in two large cardiovascular datasets, each comprising an independent derivation sample and a validation sample collected from a temporally distinct period: patients hospitalized with acute myocardial infarction (AMI, n = 9484 vs. n = 7000) and patients hospitalized with congestive heart failure (CHF, n = 8240 vs. n = 7608). We used six data-generating processes, one based on each of the six learning methods, to simulate outcomes in the derivation and validation samples from 33 predictors in the AMI dataset and 28 in the CHF dataset. We then applied each of the six prediction methods to every simulated derivation sample and evaluated performance in the corresponding simulated validation sample using the c-statistic, generalized R2, Brier score, and calibration measures. While no method had uniformly superior performance across all six data-generating processes and eight performance metrics, (un)penalized logistic regression and boosted trees tended to outperform the other methods across a range of data-generating processes and performance metrics. This study confirms that classical statistical learning methods perform well in low-dimensional settings with large datasets.

Keywords: Machine learning; Monte Carlo simulations; data-generating process; generalized boosting methods; logistic regression; random forests.

Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
AMI sample: model performance assessed using the c-statistic, R-squared, Brier score, and negative log-likelihood. (There are eight panels across each pair of figures, one per performance metric. Each panel consists of six sets of six box plots; each box plot describes the variation in the given performance metric across the 1,000 simulation replicates for a particular combination of data-generating process and analytic method.)
Figure 2.
AMI sample: model performance assessed using ICI, E90, calibration intercept, and calibration slope.
Figure 3.
CHF sample: model performance assessed using the c-statistic, R-squared, Brier score, and negative log-likelihood.
Figure 4.
CHF sample: model performance assessed using ICI, E90, calibration intercept, and calibration slope.


Source: PubMed
