Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes

Jennifer Pittman, Erich Huang, Holly Dressman, Cheng-Fang Horng, Skye H Cheng, Mei-Hua Tsou, Chii-Ming Chen, Andrea Bild, Edwin S Iversen, Andrew T Huang, Joseph R Nevins, Mike West

Abstract

We describe a comprehensive modeling approach to combining genomic and clinical data for personalized prediction in disease outcome studies. This integrated clinicogenomic modeling framework is based on statistical classification tree models that evaluate the contributions of multiple forms of data, both clinical and genomic, to define interactions of multiple risk factors that associate with the clinical outcome and derive predictions customized to the individual patient level. Gene expression data from DNA microarrays are represented by multiple summary measures that we term metagenes; each metagene characterizes the dominant common expression pattern within a cluster of genes. A case study of primary breast cancer recurrence demonstrates that models using multiple metagenes combined with traditional clinical risk factors improve prediction accuracy at the individual patient level, delivering predictions more accurate than those made by using a single genomic predictor or clinical data alone. The analysis also highlights issues of communicating uncertainty in prediction and identifies combinations of clinical and genomic risk factors playing predictive roles. Implicated metagenes identify gene subsets with the potential to aid biological interpretation. This framework will extend to incorporate any form of data, including emerging forms of genomic data, and provides a platform for development of models for personalized prognosis.
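
As an illustration of the metagene idea, the sketch below summarizes one cluster of genes by its dominant common expression pattern. It assumes that this pattern can be taken as the first principal component of the centered cluster, computed by singular value decomposition; the abstract does not state the exact construction, so the function name `metagene`, the sign convention, and the toy data are all hypothetical.

```python
import numpy as np

def metagene(expr):
    """Summarize a gene cluster by one score per sample.

    expr: array of shape (genes, samples) for a single cluster.
    Assumption: the 'dominant common expression pattern' is taken to be
    the first principal component of the row-centered data.
    """
    expr = np.asarray(expr, dtype=float)
    # Center each gene so the component reflects co-variation across samples.
    centered = expr - expr.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    score = s[0] * vt[0]
    # The sign of a singular vector is arbitrary; orient the score so that
    # it correlates positively with the average centered profile.
    if np.dot(score, centered.mean(axis=0)) < 0:
        score = -score
    return score

# Toy cluster (made-up data): three genes sharing one pattern
# with different loadings.
pattern = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
loadings = np.array([1.0, 2.0, 3.0])
cluster = np.outer(loadings, pattern)
scores = metagene(cluster)
```

In this noise-free toy case the score is proportional to the shared pattern; with real microarray data the first component only approximates the common signal.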

Figures

Fig. 1.
Kaplan–Meier survival curves for recurrence based on high-risk/low-risk categorization of breast cancer patients. (A) Empirical survival estimates based on lymph node involvement (low risk, 0–3 positive nodes; high risk, 4 or more positive nodes). (B) Empirical survival estimates based on partition into two groups defined by a threshold in the gene expression pattern of Mg307 and, separately, Mg440. (C) Empirical survival estimates showing evidence of interaction between clinical factors (lymph node status) and genomic factors (in this example, Mg307). (D) Refined empirical survival estimates for two subgroups of the low Mg307 group, defined by a partition on Mg365. (E) Refined empirical survival estimates for two subgroups of the high Mg307 group, defined by a partition on an ER-related metagene, Mg351.
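
The empirical survival estimates in these panels are Kaplan–Meier curves computed within each risk group. A minimal sketch of the standard product-limit estimator follows; the function name `kaplan_meier` and the five-patient example are hypothetical and are not data from the study.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the recurrence-free probability.

    times:  follow-up time for each patient
    events: True if recurrence was observed at that time, False if censored
    Returns a list of (time, survival) pairs at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        recurrences = 0  # observed events at time t
        leaving = 0      # patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            recurrences += int(data[i][1])
            leaving += 1
            i += 1
        if recurrences > 0:
            survival *= 1.0 - recurrences / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Made-up follow-up times (years) and recurrence indicators for one group.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, False, True])
```

To reproduce a panel such as (B), one would split patients by whether their metagene value exceeds a chosen threshold and call the function once per subgroup.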
Fig. 2.
Use of successive metagene analyses to improve predictions of breast cancer recurrence. (Upper) The expression pattern of the genes in Mg307 (ordered vertically by their weighted value in the metagene) on the entire group of 158 patients. Samples are ordered (horizontally) by the value of Mg307, and the vertical black line indicates the split of the patients into two subgroups underlying the empirical survival curves in Fig. 1B. The two subgroups of patients defined by this split were then further split with two additional metagenes. The low Mg307 subgroup is split based on Mg365, and the high Mg307 group is split based on Mg351. (Lower) The subsequent images show the patterns of genes within Mg365 (Left) and Mg351 (Right) for the corresponding two subgroups of patients, arranged similarly within each group and also indicating the second-level splits. These splits underlie the refined survival curve estimates in Fig. 1 D and E.
Fig. 3.
Predictive genomic and clinicogenomic tree models. (A) Metagene tree model. The left box at each node of the tree identifies the number of patients, and the right box gives (as a percentage) the corresponding model-based point estimate of the 4-year recurrence-free probability based on the tree model predictions for that group. (B) Clinicogenomic tree model in a format as described in A. Note the appearance of interactions between lymph node status and Mg307 and Mg365, for example, in relation to the empirical survival curves and metagene expression images in Figs. 1 and 2.
Fig. 4.
Predictor variables in top clinicogenomic tree models. Summary of the level of the tree in which each variable appears and defines a node split. The numbers on the y axis simply index trees, with probabilities (in parentheses) indicating the relative weights of trees based on fit to the data. On the x axis, probabilities (in parentheses) associated with clinical or metagene predictor variables are sums of the probabilities of trees in which each occurs and so define overall weights, indicating the relative importance of each variable to the overall model fit and consequent recurrence predictions.
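
The variable weights described in this caption can be sketched as a simple sum: each predictor's weight is the total probability of the trees in which it appears. The tree probabilities below are invented for illustration only (the variable names echo those in the captions), and `variable_weights` is a hypothetical helper.

```python
from collections import defaultdict

def variable_weights(trees):
    """Sum tree probabilities over the trees in which each variable occurs.

    trees: list of (probability, variables) pairs, where `variables` is the
    collection of predictors that define at least one node split in the tree.
    """
    weights = defaultdict(float)
    for probability, variables in trees:
        # A variable counts once per tree, even if it splits several nodes.
        for name in set(variables):
            weights[name] += probability
    return dict(weights)

# Illustrative values only; these are NOT the probabilities from Fig. 4.
trees = [
    (0.5, ["lymph nodes", "Mg307"]),
    (0.3, ["Mg307", "Mg365"]),
    (0.2, ["lymph nodes", "Mg351"]),
]
weights = variable_weights(trees)
```

Because a variable can occur in several trees, the weights need not sum to one across variables; each lies between 0 and 1 when the tree probabilities themselves sum to one.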
Fig. 5.
Predictions from a clinicogenomic tree model. (Upper) Estimates and approximate 95% confidence intervals for 4-year survival probabilities for each patient. The survival probability of each patient is predicted in an out-of-sample cross-validation based on a model completely regenerated from the data of the remaining patients. Each patient is located on the x axis at the recorded recurrence or censoring time for that patient. Patients indicated in blue are the 4-year recurrence-free cases, and those in red are patients with symptoms that recurred within 4 years. The interval estimates for a few cases that stand out are wide, representing uncertainty due to disparities among predictions from individual tree models that are combined in the overall prediction. (Lower) Summary of predictive survival curves and uncertainty estimates for four patients whose clinical and genomic parameters match four actual cases in the data set (cases indexed as 48, 158, 98, and 135).

Source: PubMed
