Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis

Shannon Wongvibulsin, Katherine C Wu, Scott L Zeger

Abstract

Background: Clinical research and medical practice can be advanced through the prediction of an individual's health state, trajectory, and responses to treatments. However, the majority of current clinical risk prediction models are based on regression approaches or machine learning algorithms that are static, rather than dynamic. To benefit from the growing availability of large, heterogeneous data sets such as electronic health records (EHRs), clinical decision making needs new tools for individual-level risk prediction that can handle multiple variables, their interactions, and time-varying values.

Methods: We introduce a novel dynamic approach to clinical risk prediction for survival, longitudinal, and multivariate (SLAM) outcomes, called random forest for SLAM data analysis (RF-SLAM). RF-SLAM is a continuous-time, random forest method for survival analysis that combines the strengths of existing statistical and machine learning methods to produce individualized Bayes estimates of piecewise-constant hazard rates. We also present a method-agnostic approach for time-varying evaluation of model performance.
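As an illustration of what a Bayes estimate of a piecewise-constant hazard rate can look like, the sketch below computes the posterior mean of a Poisson event rate under a Gamma prior for a group of observation intervals (for example, those falling in one terminal node of a tree). The Gamma prior, the function name `bayes_hazard`, and the parameters `prior_rate` and `prior_strength` are assumptions made for this example; the abstract does not specify the prior or its parameterization.

```python
def bayes_hazard(events, exposure, prior_rate, prior_strength):
    """Posterior mean of a Poisson event rate under a Gamma prior.

    Assumed (hypothetical) parameterization:
      prior shape = prior_rate * prior_strength
      prior rate  = prior_strength   (acts like prior person-time)
    events   : number of events observed in the group
    exposure : total person-time at risk in the group
    """
    if events < 0 or exposure < 0:
        raise ValueError("events and exposure must be non-negative")
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    # Gamma-Poisson conjugacy: the posterior is
    # Gamma(shape + events, rate + exposure), whose mean is the ratio below.
    return (prior_rate * prior_strength + events) / (prior_strength + exposure)


# With no data the estimate equals the prior rate; with more person-time
# it moves toward the raw rate events / exposure.
print(bayes_hazard(events=0, exposure=0.0, prior_rate=0.05, prior_strength=2.0))
print(bayes_hazard(events=2, exposure=8.0, prior_rate=0.05, prior_strength=2.0))
```

Shrinking toward a prior rate keeps the estimate away from exactly zero in groups with no observed events, which matters when the outcome is rare.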

Results: We derive and illustrate the method by predicting sudden cardiac arrest (SCA) in the Left Ventricular Structural (LV) Predictors of Sudden Cardiac Death (SCD) Registry. We demonstrate superior performance relative to standard random forest methods for survival data. We illustrate the importance of the number of preceding heart failure hospitalizations as a time-dependent predictor in SCA risk assessment.

Conclusions: RF-SLAM is a novel statistical and machine learning method that improves risk prediction by incorporating time-varying information and accommodating a large number of predictors, their interactions, and missing values. RF-SLAM is designed to easily extend to simultaneous predictions of multiple, possibly competing, events and/or repeated measurements of discrete or continuous variables over time.

Trial registration: LV Structural Predictors of SCD Registry (clinicaltrials.gov, NCT01076660), retrospectively registered 25 February 2010.

Keywords: Clinical risk prediction; Dynamic risk prediction; Random forests; Survival analysis.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Random Forests for Survival, Longitudinal, and Multivariate (RF-SLAM) Data Analysis Overview. The RF-SLAM data analysis approach begins with a pre-processing step to create counting process information units (CPIUs) within which we can model the possibly multivariate outcomes of interest (e.g., SCA, HF) and accommodate time-dependent covariates. For the LV Structural Predictors Registry, the time-varying covariates of interest relate to heart failure hospitalizations (HFs), indicated by the blue diamonds. In this case, CPIUs are created from the Survival, Longitudinal, and Multivariate (SLAM) data by creating a new CPIU every half year, corresponding to the frequency of follow-up. The variable int.n represents the interval number indicating time since study enrollment in half-years. The time-varying covariates are int.n and pHF (total number of previous heart failure hospitalizations since study enrollment). Then, these CPIUs (containing the time-varying covariates along with the baseline predictors) are used as inputs in the RF-SLAM algorithm to generate the predicted probability of an SCA. The SCA event indicator is denoted with iSCA (0 if no event within CPIU, 1 if the event occurs within CPIU) and the heart failure hospitalization event indicator is iHF (0 if no event within CPIU, 1 if the event occurs within CPIU).
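The pre-processing step described in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `make_cpius`, its arguments, the `risk_time` field, and the choice to number intervals from 1 are assumptions made for the example. The field names int.n, pHF, iHF, and iSCA follow the caption.

```python
import math


def make_cpius(subject_id, followup_years, sca_time=None, hf_times=(), interval=0.5):
    """Split one subject's follow-up into counting process information units.

    followup_years : time from enrollment to end of follow-up (years)
    sca_time       : time of SCA in years, or None if no SCA was observed
    hf_times       : times (years) of heart failure hospitalizations
    interval       : CPIU length in years (half-year, as in the caption)
    """
    has_event = sca_time is not None and sca_time <= followup_years
    end = sca_time if has_event else followup_years
    n_intervals = max(1, math.ceil(end / interval))
    rows = []
    for k in range(n_intervals):
        start = k * interval
        stop = min((k + 1) * interval, end)
        rows.append({
            "id": subject_id,
            "int.n": k + 1,  # assumed to start at 1
            # hospitalizations strictly before this CPIU begins
            "pHF": sum(1 for t in hf_times if t < start),
            # any hospitalization inside this CPIU
            "iHF": int(any(start <= t < stop for t in hf_times)),
            # follow-up stops at the SCA, so it can only fall in the last CPIU
            "iSCA": int(has_event and k == n_intervals - 1),
            "risk_time": stop - start,
        })
    return rows


# One subject: HF hospitalization at 0.7 years, SCA at 1.2 years.
for row in make_cpius("A", followup_years=3.0, sca_time=1.2, hf_times=[0.7]):
    print(row)
```

Each resulting row would then be joined with the subject's baseline predictors before being passed to the forest.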
Fig. 2
Comparison of Discrimination for Sudden Cardiac Arrest (SCA) Prediction with Different Random Forests Approaches. a, b, c Time-varying AUC curves for the RSF approach, which uses only baseline covariates (panel a), the RF-SLAM approach with only baseline covariates (panel b), and the RF-SLAM approach with both baseline and time-varying covariates (panel c). d, e, f Predicted survival curves from RSF (panel d), the RF-SLAM approach with only baseline covariates (panel e), and the RF-SLAM approach with both baseline and time-varying covariates (panel f). Individuals who experienced an SCA are color-coded in red and all others are color-coded in green. Note that each column of plots corresponds to the same model (i.e., the left column corresponds to the RSF approach, the center column to the RF-SLAM approach with only baseline covariates, and the right column to the RF-SLAM approach with both baseline and time-varying covariates).
Fig. 3
Comparison of Calibration for Sudden Cardiac Arrest (SCA) Prediction with Different Random Forests Approaches. a Calibration curves by decile of predicted risk for the RSF approach, which uses only baseline covariates, b the RF-SLAM approach with only baseline covariates, c the RF-SLAM approach with both baseline and time-varying covariates. For each panel, the difference between the predicted and observed rates is plotted for each decile. The black points indicate the estimates from the original data set. The mean predicted risk (%/year) for each decile is presented at the bottom of the plot. The gray bars indicate the 95% confidence intervals from 500 bootstrapped data sets.
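A calibration comparison of this kind can be sketched as below: observations are ranked by predicted rate, split into equal-sized groups, and the mean predicted rate in each group is compared with the observed event rate. The function name `calibration_by_group`, the person-time weighting of the predicted rate, and the input layout are assumptions for illustration; the bootstrap confidence intervals mentioned in the caption are not included.

```python
def calibration_by_group(predicted, events, risk_time, n_groups=10):
    """Predicted minus observed event rate within quantile groups of predicted risk.

    predicted : predicted event rate per observation (events per year)
    events    : 0/1 event indicator per observation
    risk_time : person-time at risk per observation (years)
    """
    if not (len(predicted) == len(events) == len(risk_time)):
        raise ValueError("inputs must have the same length")
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    results = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        exposure = sum(risk_time[i] for i in idx)
        if not idx or exposure == 0:
            continue  # skip empty groups or groups with no person-time
        # person-time-weighted mean predicted rate (an assumed choice)
        mean_pred = sum(predicted[i] * risk_time[i] for i in idx) / exposure
        observed = sum(events[i] for i in idx) / exposure
        results.append({
            "group": g + 1,
            "mean_predicted": mean_pred,
            "observed": observed,
            "difference": mean_pred - observed,
        })
    return results


# Toy data: two risk groups that are perfectly calibrated on average.
pred = [0.1] * 10 + [0.5] * 10
evts = [0] * 9 + [1] + [1, 0] * 5
time = [1.0] * 20
for row in calibration_by_group(pred, evts, time, n_groups=2):
    print(row)
```

A difference near zero in every group indicates good calibration; repeating the calculation on resampled data sets is one way to attach intervals to each point.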


Source: PubMed
