Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records

Riccardo Miotto, Li Li, Brian A Kidd, Joel T Dudley

Abstract

Secondary use of electronic health records (EHRs) promises to advance clinical research and better inform clinical decision making. Challenges in summarizing and representing patient data prevent widespread practice of predictive modeling using EHRs. Here we present a novel unsupervised deep feature learning method to derive a general-purpose patient representation from EHR data that facilitates clinical predictive modeling. In particular, a three-layer stack of denoising autoencoders was used to capture hierarchical regularities and dependencies in the aggregated EHRs of about 700,000 patients from the Mount Sinai data warehouse. The result is a representation we name "deep patient". We evaluated whether this representation is broadly predictive of health states by assessing the probability that patients would develop various diseases. We performed the evaluation using 76,214 test patients and 78 diseases from diverse clinical domains and temporal windows. Our results significantly outperformed those achieved using representations based on raw EHR data and alternative feature learning strategies. Prediction performance for severe diabetes, schizophrenia, and various cancers was among the highest. These findings indicate that deep learning applied to EHRs can derive patient representations that offer improved clinical predictions, and could provide a machine learning framework for augmenting clinical decision systems.
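The three-layer stack of denoising autoencoders mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the training details, so everything here is an assumption made for illustration: inputs scaled to [0, 1], masking noise, tied encoder and decoder weights, a cross-entropy reconstruction loss, full-batch gradient descent, and all layer sizes and hyperparameters. Each layer is trained to reconstruct its clean input from a corrupted copy, and its encoding then becomes the input of the next layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_autoencoder(X, n_hidden, noise=0.05, lr=0.1, epochs=20, seed=0):
    """Train one denoising autoencoder with tied weights; return encoder (W, b).

    Assumes X has values in [0, 1]. Hyperparameters are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = X.shape
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)   # encoder bias
    c = np.zeros(n_visible)  # decoder bias
    for _ in range(epochs):
        # Masking noise: zero out a random fraction of the input entries.
        mask = rng.random(X.shape) >= noise
        X_noisy = X * mask
        H = sigmoid(X_noisy @ W + b)   # encode the corrupted input
        R = sigmoid(H @ W.T + c)       # reconstruct with the transposed weights
        # Gradients of the mean cross-entropy between R and the clean input X.
        dR = (R - X) / n_samples
        dH = (dR @ W) * H * (1.0 - H)
        W -= lr * (X_noisy.T @ dH + dR.T @ H)  # encoder and decoder share W
        b -= lr * dH.sum(axis=0)
        c -= lr * dR.sum(axis=0)
    return W, b

def fit_stack(X, layer_sizes, **kwargs):
    """Greedy layer-wise training: each layer learns from the previous encoding."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_denoising_autoencoder(H, n_hidden, **kwargs)
        layers.append((W, b))
        H = sigmoid(H @ W + b)  # clean (uncorrupted) encoding feeds the next layer
    return layers

def encode(X, layers):
    """Map raw feature vectors to the top-layer representation."""
    H = X
    for W, b in layers:
        H = sigmoid(H @ W + b)
    return H

# Toy data standing in for normalized patient vectors (purely synthetic).
rng = np.random.default_rng(42)
X_toy = rng.random((50, 20))
layers = fit_stack(X_toy, layer_sizes=[8, 8, 8])
deep = encode(X_toy, layers)
print(deep.shape)  # (50, 8)
```

In this sketch the output of `encode` plays the role of the learned patient representation: one dense vector per patient that a downstream classifier could consume.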

Figures

Figure 1. Conceptual framework used to derive the deep patient representation through unsupervised deep learning of a large EHR data warehouse.
(A) Pre-processing stage to obtain raw patient representations from the EHRs. (B) The raw representations are modeled by the unsupervised deep architecture leading to a set of general and robust features. (C) The deep features are applied to the entire hospital database to derive patient representations that can be applied to a number of clinical tasks.
Figure 2. Diagram of the unsupervised deep feature learning pipeline to transform a raw dataset into the deep patient representation through multiple layers of neural networks.
Each layer of the neural network is trained to produce a higher-level representation from the result of the previous layer.
Figure 3. R-precision obtained in the disease tagging experiment by the different patient representations over several prediction time intervals (expressed as number of days).
We report results for patients represented with original descriptors (RawFeat) and pre-processed by principal component analysis (PCA), independent component analysis (ICA), Gaussian mixture model (GMM), k-means clustering (K-Means), and three-layer stacked denoising autoencoders (DeepPatient).
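For readers unfamiliar with the metric in Figure 3, R-precision is the precision of a ranked list at rank R, where R is the number of relevant items. The sketch below assumes one plausible reading of the disease tagging setup: for each patient, candidate diseases are ranked by predicted score and the relevant items are the diseases the patient actually developed. The function name, disease labels, and scores are hypothetical and are not taken from the paper.

```python
def r_precision(scores, relevant):
    """R-precision for one patient.

    scores:   dict mapping disease label -> predicted score (hypothetical values)
    relevant: set of disease labels the patient actually developed
    """
    r = len(relevant)
    if r == 0:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    # Rank diseases from highest to lowest score and keep the top R.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_r = ranked[:r]
    return sum(1 for label in top_r if label in relevant) / r

# Made-up example: two true diagnoses, so only the top two predictions count.
example_scores = {
    "diabetes": 0.9,
    "schizophrenia": 0.7,
    "hypertension": 0.4,
    "asthma": 0.2,
}
example_truth = {"diabetes", "hypertension"}
result = r_precision(example_scores, example_truth)
print(result)  # 0.5
```

Averaging this value over all test patients gives a single score per representation and time window, which is the kind of quantity a plot like Figure 3 would compare.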


Source: PubMed
