Causative Classification of Ischemic Stroke by the Machine Learning Algorithm Random Forests

Jianan Wang, Xiaoxian Gong, Hongfang Chen, Wansi Zhong, Yi Chen, Ying Zhou, Wenhua Zhang, Yaode He, Min Lou

Abstract

Background: Prognosis, recurrence rate, and secondary prevention strategies differ by etiology in acute ischemic stroke. However, identifying the cause of a stroke is challenging.

Objective: This study aimed to develop a model to identify the cause of stroke using machine learning (ML) methods and test its accuracy.

Methods: We retrospectively reviewed data from patients in CASE-II (NCT04487340) whose etiology had been determined according to the Trial of ORG 10172 in Acute Stroke Treatment (TOAST) criteria. These data were used to train and evaluate six ML models, namely Random Forests (RF), Logistic Regression (LR), Extreme Gradient Boosting (XGBoost), K-Nearest Neighbor (KNN), AdaBoost, and Gradient Boosting Machine (GBM), for the detection of cardioembolism (CE), large-artery atherosclerosis (LAA), and small-artery occlusion (SAO). Between October 2016 and April 2020, patients were enrolled consecutively for algorithm development (phase one). Between June 2020 and December 2020, patients were enrolled consecutively into a test set for algorithm testing (phase two). Area under the curve (AUC), precision, recall, accuracy, and F1 score were calculated for the prediction model.
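The evaluation metrics named above can be illustrated with a minimal Python sketch. It assumes a one-vs-rest treatment of the three subtypes, which the abstract does not specify; the function name `per_class_metrics` and the toy labels are hypothetical and are not study data.

```python
from collections import Counter

CLASSES = ["CE", "LAA", "SAO"]

def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return overall accuracy and one-vs-rest precision, recall, F1 per class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but the true class was t
            fn[t] += 1  # a true case of class t was missed
    metrics = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, metrics

# Toy labels invented for illustration only (assumption, not study data).
y_true = ["CE", "CE", "LAA", "LAA", "SAO", "SAO"]
y_pred = ["CE", "LAA", "LAA", "LAA", "SAO", "LAA"]
accuracy, metrics = per_class_metrics(y_true, y_pred)
```

In this toy example LAA is over-predicted, so its recall is perfect while its precision drops, which is the kind of trade-off the F1 score summarizes.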

Results: A total of 18,209 patients were enrolled in phase one, of whom 13,590 (6,089 CE, 4,539 LAA, and 2,962 SAO) were included in the model. A total of 3,688 patients were enrolled in phase two, of whom 3,070 (1,103 CE, 1,269 LAA, and 698 SAO) were included in the model. Among the six models, RF, XGBoost, and GBM performed best, and we chose RF as our final model. On the test set, the AUC values of the RF model for predicting CE, LAA, and SAO were 0.981 (95% CI, 0.978-0.986), 0.919 (95% CI, 0.911-0.928), and 0.918 (95% CI, 0.908-0.927), respectively. The most important features for identifying CE, LAA, and SAO were atrial fibrillation and the degree of stenosis of intracranial arteries.
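As a rough illustration of a per-subtype AUC, the sketch below computes a one-vs-rest AUC as the probability that a randomly chosen case of the target subtype receives a higher score than a randomly chosen case of any other subtype. The one-vs-rest setup, the function name `auc_one_vs_rest`, and the toy scores are assumptions; the abstract does not describe how the AUC values or their confidence intervals were derived, and no interval is computed here.

```python
def auc_one_vs_rest(y_true, scores, positive):
    """AUC for `positive` vs. all other classes.

    Computed as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data invented for illustration only (assumption, not study data):
# hypothetical model scores for the CE class.
y_true = ["CE", "CE", "CE", "LAA", "SAO", "LAA"]
ce_scores = [0.9, 0.8, 0.4, 0.5, 0.1, 0.4]
auc_ce = auc_one_vs_rest(y_true, ce_scores, "CE")
```

The pairwise loop is quadratic in the number of cases; it is written for clarity rather than speed.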

Conclusion: The proposed RF model could be a useful diagnostic tool to help neurologists categorize etiologies of stroke.

Clinical trial registration: [www.ClinicalTrials.gov], identifier [NCT01274117].

Keywords: cardioembolism; large-artery atherosclerosis; machine learning; small-artery occlusion; stroke.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2022 Wang, Gong, Chen, Zhong, Chen, Zhou, Zhang, He and Lou.

Figures

FIGURE 1
Illustration of features contributing to the identification of CE by Gini importance values. CE, cardioembolism; NIHSS, National Institutes of Health Stroke Scale. Gini importance is a measurement of the feature importance in the model; the higher the value of Gini importance is, the more important the feature is.
FIGURE 2
Illustration of features contributing to the identification of LAA by Gini importance values. LAA, large-artery atherosclerosis; LVO, large vessel occlusion; NIHSS, National Institutes of Health Stroke Scale. Gini importance is a measurement of the feature importance in the model; the higher the value of Gini importance is, the more important the feature is.
FIGURE 3
Illustration of features contributing to the identification of SAO by Gini importance values. LVO, large vessel occlusion; NIHSS, National Institutes of Health Stroke Scale; SAO, small-artery occlusion. Gini importance is a measurement of the feature importance in the model; the higher the value of Gini importance is, the more important the feature is.
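The quantity behind Gini importance can be sketched for a single split. The Python below computes the Gini impurity of a set of labels and the impurity decrease obtained by splitting on a binary feature. In common random forest implementations, a feature's Gini importance is built from such decreases, summed over the splits that use the feature and averaged across trees. Whether the authors' software follows exactly this scheme is an assumption, and the function names and toy data are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(feature, labels):
    """Impurity decrease from splitting the samples on a binary (0/1) feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted_children = ((len(left) / n) * gini_impurity(left)
                         + (len(right) / n) * gini_impurity(right))
    return gini_impurity(labels) - weighted_children

# Toy data invented for illustration only (assumption, not study data).
labels = ["CE", "CE", "CE", "other", "other", "other"]
atrial_fibrillation = [1, 1, 1, 0, 0, 1]
hypertension = [1, 0, 1, 0, 1, 0]

af_decrease = gini_decrease(atrial_fibrillation, labels)
ht_decrease = gini_decrease(hypertension, labels)
```

In this toy data the atrial fibrillation split separates CE from the other cases far better than the hypertension split, so it yields the larger impurity decrease.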
FIGURE 4
Confusion matrix of the model in identifying CE, LAA, and SAO on the test set. CE, cardioembolism; LAA, large-artery atherosclerosis; SAO, small-artery occlusion. A confusion matrix is obtained by comparing the predicted class of each sample with its actual class. Each column represents the predicted category of the data, and each row represents the true category.
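The row and column convention described in the caption can be made concrete with a short Python sketch. The function name `confusion_matrix` and the toy labels are hypothetical and are not study data.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count samples by (true class, predicted class).

    Rows follow the true class and columns the predicted class,
    both in the order given by `classes`.
    """
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Toy labels invented for illustration only (assumption, not study data).
classes = ["CE", "LAA", "SAO"]
y_true = ["CE", "CE", "LAA", "LAA", "SAO", "SAO"]
y_pred = ["CE", "LAA", "LAA", "LAA", "SAO", "LAA"]
matrix = confusion_matrix(y_true, y_pred, classes)
```

Diagonal entries count correct classifications. An off-diagonal entry in row r and column c counts cases of true class r that were assigned to class c.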


Source: PubMed
