Using Transfer Learning for Improved Mortality Prediction in a Data-Scarce Hospital Setting

Thomas Desautels, Jacob Calvert, Jana Hoffman, Qingqing Mao, Melissa Jay, Grant Fletcher, Chris Barton, Uli Chettipally, Yaniv Kerem, Ritankar Das

Abstract

Algorithm-based clinical decision support (CDS) systems associate patient-derived health data with outcomes of interest, such as in-hospital mortality. However, the quality of such associations often depends on the availability of site-specific training data. Without sufficient quantities of data, the underlying statistical apparatus cannot differentiate useful patterns from noise and, as a result, may underperform. This initial training data burden limits the widespread, out-of-the-box use of machine learning-based risk scoring systems. In this study, we implement a statistical transfer learning technique, which uses a large "source" data set to drastically reduce the amount of data needed to perform well on a "target" site for which training data are scarce. We test this transfer technique with AutoTriage, a mortality prediction algorithm, on patient charts from the Beth Israel Deaconess Medical Center (the source) and a population of 48,249 adult inpatients from the University of California San Francisco Medical Center (the target institution). We find that the amount of training data required to surpass an area under the receiver operating characteristic curve (AUROC) of 0.80 on the target set decreases from more than 4000 patients to fewer than 220. This performance is superior to that of the Modified Early Warning Score (AUROC: 0.76) and corresponds to a decrease in clinical data collection time from approximately 6 months to less than 10 days. Our results highlight the usefulness of transfer learning in specializing CDS systems to new hospital sites without requiring expensive and time-consuming data collection efforts.

Keywords: AUROC; Machine learning; clinical decision support; mortality prediction; transfer learning.
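
The implementation of AutoTriage is not published, but the approach the abstract describes, pooling a large source cohort with scarce target-site examples and reweighting them during training, can be sketched. The following is a minimal illustration, assuming scikit-learn's GradientBoostingClassifier as a stand-in for the proprietary classifier; the function name fit_transfer_model and the 5.0 weight ratio are hypothetical, not the paper's values.

```python
# Minimal sketch of source/target instance weighting for transfer learning.
# AutoTriage itself is proprietary; a gradient-boosted classifier is assumed
# here as a stand-in, and the weight ratio is illustrative, not the paper's.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_transfer_model(X_source, y_source, X_target, y_target, target_weight=5.0):
    """Train one classifier on pooled source + target rows, upweighting target."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([y_source, y_target])
    # Source rows get unit weight; scarce target rows are upweighted so the
    # model specializes to the target site without discarding source signal.
    w = np.concatenate([np.ones(len(y_source)),
                        target_weight * np.ones(len(y_target))])
    model = GradientBoostingClassifier()
    model.fit(X, y, sample_weight=w)
    return model
```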

Conflict of interest statement

DECLARATION OF CONFLICTING INTERESTS: The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: All authors affiliated with Dascena report being employees of Dascena. Dr Chris Barton reports receiving consulting fees from Dascena. Dr Chris Barton and Dr Grant Fletcher report receiving grant funding from Dascena.

Figures

Figure 1
The time intervals between prediction time and the end-of-record event at the University of California San Francisco (n = 48,249, with 1636 in-hospital mortality cases; 3.39% prevalence). Because each record was discretized to at most 1000 hours, end-of-record events beyond this limit were marked at 1000 hours. In such cases, the clinical outcome, death or discharge, still determined the gold-standard label.
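
A small sketch of the end-of-record handling this caption describes; the function name and signature are hypothetical.

```python
# Hypothetical sketch of the Figure 1 truncation rule: elapsed hours are capped
# at 1000, but the gold-standard label still comes from the actual outcome.
MAX_HOURS = 1000

def end_of_record(hours_to_event: float, died_in_hospital: bool):
    capped = min(hours_to_event, MAX_HOURS)  # events past the limit marked at 1000 h
    label = 1 if died_in_hospital else 0     # outcome, not the cap, sets the label
    return capped, label
```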
Figure 2
Inclusion flowcharts for UCSF target data (left) and MIMIC-III source data (right). These flowcharts illustrate the process used to obtain the final target and source data sets. The number of encounters remaining after each step is underneath the corresponding block. MIMIC-III indicates Medical Information Mart for Intensive Care-III; UCSF, University of California San Francisco.
Figure 3
The construction of training and test sets for each cross-validation fold. Source data (green) and target data (blue) were both used in this procedure. For each of the 10 test folds, the corresponding training set was constructed using the whole source set and a variable portion of the remaining target data. During training, the examples from source and target sets received different weights. The performance of the resulting classifier was assessed on the test set.
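
The fold construction in this caption can be sketched as follows. This is an illustration under assumptions, not the authors' code: StratifiedKFold stands in for their 10-fold split, a simple index slice stands in for their subsampling of the remaining target data, and the 5.0 target weight is a placeholder.

```python
# Sketch of the Figure 3 procedure: 10-fold cross-validation over target data,
# where each training set pools the full source set with a variable portion of
# the remaining target folds, and source/target rows receive different weights.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_learning_curve(model_factory, X_src, y_src, X_tgt, y_tgt, n_targets):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    aurocs = []
    for train_idx, test_idx in skf.split(X_tgt, y_tgt):
        take = train_idx[:n_targets]  # variable portion of the remaining target data
        X_train = np.vstack([X_src, X_tgt[take]])
        y_train = np.concatenate([y_src, y_tgt[take]])
        # Different weights for source and target examples (ratio is illustrative).
        w = np.concatenate([np.ones(len(y_src)), 5.0 * np.ones(len(take))])
        model = model_factory()
        model.fit(X_train, y_train, sample_weight=w)
        scores = model.predict_proba(X_tgt[test_idx])[:, 1]
        aurocs.append(roc_auc_score(y_tgt[test_idx], scores))
    return np.mean(aurocs), np.std(aurocs) / np.sqrt(len(aurocs))
```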
Figure 4
Learning curves (mean AUROC) with increasing numbers of target training examples. Error bars are 1 SE. When data availability is low, target-only training exhibits lower AUROC values and higher variability. AUROC indicates area under the receiver operating characteristic; CV, cross-validation. The maximum mean AUROC achieved by the nested cross-validation method is 0.8498.
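
Sweeping the target training-set size with the cv_learning_curve sketch above traces a learning curve of the kind shown in Figure 4. The sizes below are illustrative, and the data arrays (X_src, y_src, X_tgt, y_tgt) are assumed to be already loaded.

```python
# Trace a mean-AUROC learning curve (with 1-SE error) over increasing
# numbers of target training examples; sizes are illustrative.
from sklearn.ensemble import GradientBoostingClassifier

for n in (0, 110, 220, 440, 880, 1760, 3520):
    mean_auc, se_auc = cv_learning_curve(
        GradientBoostingClassifier, X_src, y_src, X_tgt, y_tgt, n_targets=n)
    print(f"{n:5d} target examples: AUROC = {mean_auc:.3f} +/- {se_auc:.3f}")
```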
Figure 5
Calibration curves giving mean regression coefficients α (offset) and β (slope) between predicted and empirical log odds for mortality, as a function of increasing number of target training examples. Error bars are 1 SE; the mean and SE are calculated using CV folds. Perfect calibration corresponds to (α, β) = (0, 1). CV indicates cross-validation.
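
The calibration coefficients in Figure 5 follow the classical recalibration test of Cox (1958): regress the observed outcomes on the predicted log odds and read off the offset and slope. A minimal sketch, assuming scikit-learn and a near-unpenalized logistic fit:

```python
# Cox-style calibration check: fit observed labels against predicted log odds.
# Perfect calibration corresponds to (alpha, beta) = (0, 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_coefficients(y_true, p_pred, eps=1e-6):
    p = np.clip(p_pred, eps, 1.0 - eps)      # avoid infinite log odds
    log_odds = np.log(p / (1.0 - p)).reshape(-1, 1)
    lr = LogisticRegression(C=1e6)           # large C ~ unpenalized fit
    lr.fit(log_odds, y_true)
    return lr.intercept_[0], lr.coef_[0][0]  # alpha (offset), beta (slope)
```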

