Predicting the clinical status of human breast cancer by using gene expression profiles

M West, C Blanchette, H Dressman, E Huang, S Ishida, R Spang, H Zuzan, J A Olson Jr, J R Marks, J R Nevins

Abstract

Prognostic and predictive factors are indispensable tools in the treatment of patients with neoplastic disease. For the most part, such factors rely on a few specific cell surface, histological, or gross pathologic features. Gene expression assays have the potential to supplement what were previously a few distinct features with many thousands of features. We have developed Bayesian regression models that provide predictive capability based on gene expression data derived from DNA microarray analysis of a series of primary breast cancer samples. These patterns have the capacity to discriminate breast tumors on the basis of estrogen receptor status and also on the categorized lymph node status. Importantly, we assess the utility and validity of such models in predicting the status of tumors in crossvalidation determinations. The practical value of such approaches relies on the ability not only to assess relative probabilities of clinical outcomes for future samples but also to provide an honest assessment of the uncertainties associated with such predictive classifications on the basis of the selection of gene subsets for each validation analysis. This latter point is of critical importance in the ability to apply these methodologies to clinical assessment of tumor phenotype.
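The distinction the abstract draws, between predictions based on a gene subset screened once on all cases and predictions in which the gene subset is reselected for each validation analysis, can be illustrated with a minimal sketch. It is not the Bayesian factor regression used in the study: a nearest-centroid rule and a mean-difference gene ranking stand in as simple placeholders, and every function name here (select_genes, predict, loo_accuracy, make_noise_data) is hypothetical.

```python
import random

def select_genes(X, y, k):
    """Rank genes by absolute difference of class means; return the top-k indices."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for g in range(len(X[0])):
        mean_pos = sum(r[g] for r in pos) / len(pos)
        mean_neg = sum(r[g] for r in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]

def predict(X_train, y_train, x_new, genes):
    """Stand-in classifier (assumption): nearest class centroid on the selected genes."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[g] for r in rows) / len(rows) for g in genes]
    def sq_dist(c):
        return sum((x_new[g] - cg) ** 2 for g, cg in zip(genes, c))
    return 1 if sq_dist(centroid(1)) < sq_dist(centroid(0)) else 0

def loo_accuracy(X, y, k, reselect):
    """One-at-a-time crossvalidation.

    reselect=False: genes are screened once using all cases, including the held-out one.
    reselect=True:  genes are reselected from the remaining cases in every fold.
    """
    fixed = select_genes(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_tr, y_tr, k) if reselect else fixed
        correct += predict(X_tr, y_tr, X[i], genes) == y[i]
    return correct / len(X)

def make_noise_data(n_samples, n_genes, seed=0):
    """Pure-noise expression matrix with alternating labels (no real signal)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

if __name__ == "__main__":
    X, y = make_noise_data(n_samples=20, n_genes=500)
    print("screened once:", loo_accuracy(X, y, k=10, reselect=False))
    print("reselected   :", loo_accuracy(X, y, k=10, reselect=True))
```

On data with no real signal, screening genes once on all cases would typically make the held-out predictions look better than chance, because the held-out case has already influenced which genes were chosen. Reselecting within each fold removes that leak, which is why it gives the more honest picture of predictive uncertainty.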

Figures

Figure 1
Factor analysis for ER+/ER− comparison. (A) Pairwise factor analysis. Breast tumors depicted in a scatter plot on two dominant factors underlying 100 genes selected in pure discrimination of the training cases. Each tumor is indicated by a simple index number (see Table 2) and is color coded, with red indicating ER+ cases and blue indicating ER− cases. Only the tumors in the training set are plotted. Factor 1 is clearly discriminatory (Factor 4 is chosen purely for display purposes). (B) Fitted classification probabilities for training cases from the factor regression analysis. The values on the horizontal axis are estimates of the overall factor score in the regression. The corresponding values on the vertical axis are fitted/estimated classification probabilities, with corresponding 90% probability intervals marked as dashed lines to indicate uncertainty about these estimated values. Color coding is as described in A. (C) Predictive probabilities for ER status of each tumor in the validation sample. The analysis was based on the selected subset of 100 genes in the full training sample analysis. Color coding is as described in A.
Figure 2
Expression levels of top 100 genes providing pure discrimination of ER status. Expression levels are depicted by color coding, with black representing the lowest level, followed by red, orange, yellow, and then white as the highest level of expression. Each column in the figure represents all 100 genes from an individual tumor sample, which are grouped according to determined ER status. Each row represents an individual gene, ordered from top to bottom according to regression coefficients (see Table 3).
Figure 3
Out-of-sample crossvalidation predictions of ER status. (A) One-at-a-time crossvalidation predictions of classification probabilities for training cases from the factor regression analysis. The values on the horizontal axis are estimates of the overall factor score in the regression. The corresponding values on the vertical axis are estimated classification probabilities with corresponding 90% probability intervals marked as dashed lines to indicate uncertainty about these estimated values. The analysis and predictions for each tumor are based on the screened subset of 100 most discriminatory genes to parallel current practice in expression studies by other groups. (B) One-at-a-time crossvalidation predictions of classification probabilities for training cases in the ER study, in a format similar to that of A. In this instance, each case is predicted only on the basis of the ER status of the remaining training tumors, with the subset of 100 genes reselected in each case. The figure presents the resulting honest uncertainties about the extent of true predictive accuracy in a practical setting, reflecting inherent variability due to heterogeneity of expression profiles.
Figure 4
Analysis for nodal comparisons. (A) Pairwise factor analysis. Breast tumors depicted in a scatter plot on two dominant factors underlying 100 genes selected in pure discrimination according to nodal status. Each tumor is indicated by a simple index number (see Table 2) and is color coded, with red indicating node-positive cases with at least three identified positive nodes and blue indicating lymph node-negative cases. Factor 1 is clearly discriminatory (Factor 3 is chosen purely for display purposes). (B) One-at-a-time crossvalidation predictions of classification probabilities in nodal analysis. The values on the horizontal axis are estimates of the overall factor score in the regression. The corresponding values on the vertical axis are estimated classification probabilities, with corresponding 90% probability intervals marked as dashed lines to indicate uncertainty about these estimated values. The analysis and predictions for each tumor are based on the screened subset of 100 most discriminatory genes. (C) One-at-a-time crossvalidation predictions in the nodal study, in a format similar to that of B. Each case is predicted only on the basis of the nodal status of the remaining training tumors, with the subset of 100 genes reselected in each case. As such, the analysis exhibits the resulting uncertainties about the extent of true predictive accuracy in a practical setting, reflecting inherent variability due to heterogeneity of expression profiles.

Source: PubMed
