Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers

Timo M Deist, Frank J W M Dankers, Gilmer Valdes, Robin Wijsman, I-Chow Hsu, Cary Oberije, Tim Lustberg, Johan van Soest, Frank Hoebers, Arthur Jochems, Issam El Naqa, Leonard Wee, Olivier Morin, David R Raleigh, Wouter Bots, Johannes H Kaanders, José Belderbos, Margriet Kwint, Timothy Solberg, René Monshouwer, Johan Bussink, Andre Dekker, Philippe Lambin

Abstract

Purpose: Machine learning classification algorithms (classifiers) for prediction of treatment response are becoming more popular in the radiotherapy literature. The general machine learning literature provides evidence in favor of some classifier families (random forest, support vector machine, gradient boosting) in terms of classification performance. The purpose of this study is to compare such classifiers specifically on (chemo)radiotherapy datasets and to estimate their average discriminative performance for radiation treatment outcome prediction.

Methods: We collected 12 datasets (3496 patients) from prior studies on post-(chemo)radiotherapy toxicity, survival, or tumor control, with clinical, dosimetric, or blood-biomarker features, from multiple institutions and for different tumor sites: (non-)small-cell lung cancer, head and neck cancer, and meningioma. Six common classification algorithms with built-in feature selection (decision tree, random forest, neural network, support vector machine, elastic net logistic regression, LogitBoost) were applied to each dataset using the popular open-source R package caret. The R code and documentation for the analysis are available online (https://github.com/timodeist/classifier_selection_code). All classifiers were run on each dataset in a 100-times-repeated nested fivefold cross-validation with hyperparameter tuning. Performance metrics (AUC, calibration slope and intercept, accuracy, Cohen's kappa, and Brier score) were computed. We ranked classifiers by AUC to determine which classifier is likely to also perform well in future studies. We also simulated the benefit, for investigators applying a classifier to a new dataset, of selecting a classifier based on our study (pre-selection based on other datasets) or of estimating the best classifier for the new dataset itself (set-specific selection), compared with uninformed classifier selection (random selection).
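As an illustrative sketch only (not the authors' R/caret pipeline), the nested cross-validation protocol can be written in plain Python with a toy one-dimensional k-nearest-neighbor classifier whose neighborhood size plays the role of the tuned hyperparameter. The synthetic data, fold construction, and hyperparameter grid are all assumptions for illustration, and the sketch omits the stratification, preprocessing, and 100 repetitions used in the study:

```python
import random

def auc(scores, labels):
    # Mann-Whitney AUC: probability a random positive outranks a random negative
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kfold(n, k, rng):
    # shuffled (non-stratified) k-fold index partition
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_score(train, test_x, k):
    # score = fraction of the k nearest training points with label 1
    nearest = sorted(train, key=lambda xy: abs(xy[0] - test_x))[:k]
    return sum(y for _, y in nearest) / k

def nested_cv(data, k_grid=(1, 5, 15), outer_k=5, inner_k=5, seed=0):
    rng = random.Random(seed)
    outer_aucs = []
    folds = kfold(len(data), outer_k, rng)
    for i, test_idx in enumerate(folds):                     # step 1: outer split
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        # steps 3-5: tune the hyperparameter k via an inner fivefold CV
        best_k, best_auc = None, -1.0
        inner = kfold(len(train), inner_k, rng)
        for k in k_grid:
            inner_aucs = []
            for m, ti in enumerate(inner):
                tr = [train[j] for f in inner[:m] + inner[m + 1:] for j in f]
                te = [train[j] for j in ti]
                s = [knn_score(tr, x, k) for x, _ in te]
                inner_aucs.append(auc(s, [y for _, y in te]))
            mean_auc = sum(inner_aucs) / len(inner_aucs)
            if mean_auc > best_auc:
                best_k, best_auc = k, mean_auc
        # steps 6-8: refit on the full outer-training set, score the held-out fold
        test = [data[j] for j in test_idx]
        scores = [knn_score(train, x, best_k) for x, _ in test]
        outer_aucs.append(auc(scores, [y for _, y in test]))
    return sum(outer_aucs) / len(outer_aucs)

# synthetic data: the feature is drawn one standard deviation higher for class 1
rng = random.Random(42)
data = [(rng.gauss(y, 1.0), y) for y in [0, 1] * 100]
mean_outer_auc = nested_cv(data)
```

The key property of the design, preserved here, is that the hyperparameter is chosen using only the outer-training data, so the outer-test AUC is an unbiased estimate of generalization performance.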

Results: Random forest (best in 6/12 datasets) and elastic net logistic regression (best in 4/12 datasets) showed the best overall discrimination, but there was no single best classifier across datasets. Both classifiers had a median AUC rank of 2. Pre-selection and set-specific selection each yielded a significant average AUC improvement of 0.02 over random selection, with average AUC rank improvements of 0.42 and 0.66, respectively.
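The selection-strategy comparison can be framed as a leave-one-dataset-out exercise: pick a classifier for a "new" dataset using only the mean AUC observed on the other datasets, then compare its AUC rank on the new dataset with the expected rank under random selection. The following sketch uses a small made-up AUC table purely for illustration (the numbers are not from the study):

```python
# hypothetical AUC table: rows = datasets, columns = classifiers
classifiers = ["rf", "elastic_net", "svm", "dt"]
aucs = [
    [0.72, 0.70, 0.68, 0.61],
    [0.69, 0.66, 0.65, 0.60],
    [0.80, 0.78, 0.79, 0.70],
]

def rank_of(row, j):
    # 1 = best AUC within this dataset
    return 1 + sum(a > row[j] for a in row)

# random selection: expected rank is the average over all classifiers
random_rank = sum(rank_of(r, j) for r in aucs
                  for j in range(len(classifiers))) / (len(aucs) * len(classifiers))

def preselect(i):
    # pre-selection: pick the classifier with the best mean AUC on the OTHER datasets
    others = [r for k, r in enumerate(aucs) if k != i]
    means = [sum(r[j] for r in others) / len(others) for j in range(len(classifiers))]
    return max(range(len(classifiers)), key=means.__getitem__)

pre_rank = sum(rank_of(aucs[i], preselect(i)) for i in range(len(aucs))) / len(aucs)
```

In this toy table one classifier dominates, so pre-selection attains rank 1.0 versus an expected 2.5 under random selection; on real data the gap is smaller, as the study's rank improvements of 0.42 and 0.66 show.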

Conclusion: Random forest and elastic net logistic regression yield higher discriminative performance in (chemo)radiotherapy outcome and toxicity prediction than the other studied classifiers. Thus, one of these two classifiers should be the first choice for investigators when building classification models or benchmarking their own modeling results. Our results also show that an informed pre-selection of classifiers based on existing datasets can improve discrimination over random selection.

Keywords: classification; machine learning; outcome prediction; predictive modeling; radiotherapy.

Conflict of interest statement

Andre Dekker, Johan van Soest, and Tim Lustberg are founders and shareholders of Medical Data Works B.V., which provides consulting on medical data collection and analysis projects. Cary Oberije is CEO of ptTheragnostic B.V. Philippe Lambin is member of the advisory board of ptTheragnostic B.V.

© 2018 The Authors. Medical Physics published by Wiley Periodicals, Inc. on behalf of American Association of Physicists in Medicine.

Figures

Figure 1
Experimental design: each dataset is split into five stratified outer folds (step 1). For each of the folds, the data are preprocessed (imputation, dummy coding, deleting zero variance features, rescaling) (step 2). The hyperparameters are tuned in the training set via a fivefold inner CV (steps 3–5). Based on the selected hyperparameters, a model is learned on the training set (step 6) and applied on the test set (step 7). Performance metrics are calculated on the test set (step 8) and stored for all outer folds. This process is repeated 100 times for each classifier. Randomization seeds are stable across classifiers within a repetition to allow pairwise comparison. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 2
Box and scatterplot of the AUC rank (lower being better) per outer fivefold CV aggregated over all datasets and repetitions (12 datasets × 100 repetitions = 1200 data points per classifier). [Color figure can be viewed at wileyonlinelibrary.com]
Figure 3
Pairwise comparisons of each classifier pair (12 datasets × 100 repetitions = 1200 comparisons per pair). The numbers in the plot indicate how often classifier A (y‐axis) achieved an AUC greater than classifier B (x‐axis). The color indicates whether the increased AUCs by classifier A are statistically significant (violet), insignificant (light violet), or have not been tested (gray). The significance cutoff was set to the 0.05‐level (one‐sided Wilcoxon signed‐rank test, Holm–Bonferroni correction for 15 tests). [Color figure can be viewed at wileyonlinelibrary.com]
Figure 4
The mean AUC for each pair of classifier and dataset (100 repetitions = 100 data points per pair). [Color figure can be viewed at wileyonlinelibrary.com]
Figure 5
The mean rank derived from the AUC (100 repetitions = 100 data points per pair). [Color figure can be viewed at wileyonlinelibrary.com]


Source: PubMed
