Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting

David M Vock, Julian Wolfson, Sunayan Bandyopadhyay, Gediminas Adomavicius, Paul E Johnson, Gabriela Vazquez-Benitez, Patrick J O'Connor

Abstract

Models for predicting the probability of experiencing various health outcomes or adverse events over a certain time frame (e.g., having a heart attack in the next 5 years) based on individual patient characteristics are important tools for managing patient care. Electronic health data (EHD) are appealing sources of training data because they provide access to large amounts of rich individual-level data from present-day patient populations. However, because EHD are derived by extracting information from administrative and clinical databases, some fraction of subjects will not be under observation for the entire time frame over which one wants to make predictions; this loss to follow-up is often due to disenrollment from the health system. For subjects without complete follow-up, whether or not they experienced the adverse event is unknown, and in statistical terms the event time is said to be right-censored. Most machine learning approaches to the problem have been relatively ad hoc; for example, common approaches for handling observations in which the event status is unknown include (1) discarding those observations, (2) treating them as non-events, and (3) splitting each such observation into two: one in which the event occurs and one in which it does not. In this paper, we present a general-purpose approach to account for right-censored outcomes using inverse probability of censoring weighting (IPCW). We illustrate how IPCW can easily be incorporated into a number of existing machine learning algorithms used to mine big health care data, including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. We then show that our approach leads to better calibrated predictions than the three ad hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event, using EHD from a large U.S. Midwestern healthcare system.
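
To make the weighting idea concrete, the following is a minimal sketch of one common way to build IPCW weights for a fixed horizon τ. It is not taken from the paper: it assumes the censoring distribution is estimated with a marginal Kaplan-Meier curve (the abstract does not say how the censoring probabilities are modeled), and the function names `censoring_survival` and `ipcw_weights` are hypothetical. Subjects whose τ-year status is unknown receive weight 0; every other subject is weighted by the inverse of the estimated probability of remaining uncensored up to their event time or τ, whichever comes first.

```python
import numpy as np

def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    Assumption for this sketch: censoring does not depend on covariates.
    Returns the sorted unique follow-up times and G evaluated just after each.
    """
    times = np.asarray(times, dtype=float)
    censored = 1 - np.asarray(events, dtype=int)
    uniq = np.unique(times)
    at_risk = np.array([(times >= t).sum() for t in uniq])
    n_cens = np.array([((times == t) & (censored == 1)).sum() for t in uniq])
    return uniq, np.cumprod(1.0 - n_cens / at_risk)

def ipcw_weights(times, events, tau):
    """Return (labels, weights) for the binary outcome 'event by time tau'."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    uniq, surv = censoring_survival(times, events)

    def g_left(t):
        # G(t-): probability of being uncensored just before time t
        idx = np.searchsorted(uniq, t, side="left") - 1
        return 1.0 if idx < 0 else surv[idx]

    had_event = (events == 1) & (times <= tau)
    # Status at tau is known if the event occurred by tau or follow-up exceeds tau
    known = had_event | (times > tau)
    g = np.array([g_left(t) for t in np.minimum(times, tau)])
    weights = np.where(known, 1.0 / g, 0.0)
    return had_event.astype(int), weights

# Toy data (made up for illustration): follow-up in years, 1 = event, 0 = censored
times = [2, 3, 4, 6, 7]
events = [1, 0, 1, 0, 0]
labels, weights = ipcw_weights(times, events, tau=5)
weighted_risk = np.sum(weights * labels) / np.sum(weights)
print(labels, weights, weighted_risk)
```

The resulting weights can then be supplied to any learner that accepts per-observation weights, which is the sense in which the approach is general-purpose. On the toy data, the subject censored at year 3 gets weight 0, and subjects still under observation later are up-weighted to stand in for that subject.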

Keywords: Censored data; Electronic health data; Inverse probability weighting; Machine learning; Risk prediction; Survival analysis.

Copyright © 2016 Elsevier Inc. All rights reserved.

Figures

Figure 1
Distribution of follow-up times, i.e., time from the end of the baseline period until the patient experiences a CV event, the patient disenrolls from the insurance system, or the study ends, in the cohort after applying inclusion/exclusion criteria. The number of subjects whose follow-up ends in a CV event is shown on the right, while the number whose follow-up is censored is given on the left. The large number of subjects with between 7 and 9 years of follow-up are subjects who were part of the health system from the inception of the electronic medical record at their primary care clinic (typically occurring between 2001 and 2002) and remained part of the system until 2011.
Figure 2
Proportion of subjects with unknown τ-year event status as a function of τ, the time from index date in years.
Figure 3
The graphical model for our Bayesian network for CV risk prediction. Nodes represent input variables and edges represent conditional dependencies between the variables. An edge between subgraphs indicates an edge from every node in the source subgraph to every node in the destination subgraph or node. That is, the outcome variable (Event) is connected to every node in the graph. Features listed in the same node are modeled jointly. The full description of each of the features appears in Section 5.1.
Figure 4
Predicted CV risk minus empirical (observed) CV risk across bins defined by the predicted risk. The predicted risk bins were based on clinically relevant cutoffs for the risk of experiencing a cardiovascular event within 5 years: 0-5%, 5-10%, 10-15%, 15-20%, and >20%.

Source: PubMed
