Transcriptomics and machine learning predict diagnosis and severity of growth hormone deficiency

Philip G Murray, Adam Stevens, Chiara De Leonibus, Ekaterina Koledova, Pierre Chatelain, Peter E Clayton

Abstract

Background: The effect of gene expression data on diagnosis remains limited. Here, we show how diagnosis and classification of growth hormone deficiency (GHD) can be achieved from a single blood sample using a combination of transcriptomics and random forest analysis.

Methods: Prepubertal treatment-naive children with GHD (n = 98) were enrolled from the PREDICT study, and controls (n = 26) were acquired from online data sets. Whole blood gene expression was correlated with peak growth hormone (GH) using rank regression, and a random forest algorithm was tested for prediction of the presence of GHD and for classification of GHD as severe (peak GH <4 μg/l) or nonsevere (peak GH ≥4 μg/l). Performance was assessed using area under the receiver operating characteristic curve (AUC-ROC).
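As an illustration of the performance metric named above, the following minimal sketch computes AUC-ROC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It is not the study's pipeline: the function names `is_severe` and `auc_roc`, the peak GH values, and the classifier scores are all made up for the example. Only the 4 μg/l severity cutoff comes from the text.

```python
from typing import Sequence

SEVERE_CUTOFF_UG_L = 4.0  # from the abstract: severe GHD is peak GH < 4 ug/l


def is_severe(peak_gh: float) -> bool:
    """Label a GHD patient as severe using the abstract's peak GH cutoff."""
    return peak_gh < SEVERE_CUTOFF_UG_L


def auc_roc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """AUC-ROC via pairwise comparison of positive and negative scores.

    Each (positive, negative) pair counts 1 if the positive scores higher,
    0.5 if the two scores tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: peak GH (ug/l) and a classifier's score for "severe".
peak_gh = [1.2, 3.5, 3.9, 4.0, 6.8, 9.1]
scores = [0.9, 0.4, 0.8, 0.5, 0.3, 0.1]
labels = [is_severe(v) for v in peak_gh]

print(labels)
print(round(auc_roc(labels, scores), 3))
```

In this toy example 8 of the 9 severe/nonsevere pairs are ranked correctly, so the AUC is 8/9. The pairwise form is quadratic in the number of subjects, which is acceptable at the cohort sizes reported here; a rank-based formulation would scale better.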

Results: Rank regression identified 347 probe sets in which gene expression correlated with peak GH concentrations (r = ± 0.28, P < 0.01). These 347 probe sets yielded an AUC-ROC of 0.95 for prediction of GHD status versus controls and an AUC-ROC of 0.93 for prediction of GHD severity.
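The probe-set selection step can be sketched, under simplifying assumptions, as a rank-correlation filter. The example below substitutes a plain Spearman correlation for the study's rank regression, so it omits the adjustment for sex and age, and it treats 0.28 as a minimum absolute correlation, which is one reading of "r = ± 0.28". The helper names, probe identifiers, and expression values are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def select_probes(
    expression: Dict[str, Sequence[float]],
    peak_gh: Sequence[float],
    min_abs_r: float = 0.28,  # assumed reading of the abstract's threshold
) -> Dict[str, float]:
    """Keep probes whose rank correlation with peak GH meets the threshold."""
    selected = {}
    for probe, values in expression.items():
        r = spearman(values, peak_gh)
        if abs(r) >= min_abs_r:
            selected[probe] = r
    return selected


# Hypothetical expression values for five subjects.
peak_gh = [1.0, 2.5, 4.0, 6.0, 8.5]
expression = {
    "probe_up": [2.1, 2.4, 3.0, 3.3, 4.0],
    "probe_down": [9.0, 7.5, 7.0, 5.2, 4.1],
    "probe_flat": [4.2, 5.1, 4.8, 5.6, 4.0],
}
selected = select_probes(expression, peak_gh)
print(sorted(selected))
```

A filter like this only ranks association strength; the P < 0.01 criterion reported above would require an additional significance test, which is left out of the sketch.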

Conclusion: This study demonstrates highly accurate diagnosis and disease classification for GHD using a combination of transcriptomics and random forest analysis.

Trial registration: NCT00256126 and NCT00699855.

Funding: Merck and the National Institute for Health Research (CL-2012-06-005).

Keywords: Endocrinology; growth factors.

Conflict of interest statement

Conflict of interest: AS, PGM, CDL, PC, and PEC have received honoraria from Merck. EK is an employee of Merck.

Figures

Figure 1. Heatmap of gene expression for those probe sets whose expression correlated with peak GH levels.
(A) Normal children (n = 26) were combined with GHD patients (n = 98); rank regression analysis was adjusted for sex and age as covariates, and clusters of similar gene expression were identified using the Euclidean metric and marked using a dendrogram and white boxes (347 probe sets, 271 unique genes). The distinction between normal and GHD subjects is marked by the break in the heatmap; GHD is defined by a cutoff peak growth hormone level of 10 μg/l, as measured by provocation testing. The vertical white line demarcates the point of inflexion for gene expression at a peak GH level of 4.75 μg/l, while the horizontal white line demarcates those probe sets positively and negatively associated with peak GH levels (< or >4.75 μg/l). (B) Two-way cluster analysis of gene expression in GHD and control subjects. Four distinct clusters of GHD subgroups can be seen from the dendrogram on the horizontal axis, derived via a Euclidean metric. A large number of subjects, however, could not be classified (right of white line). This group contained all but 1 of the normal control subjects and 20 GHD subjects.
Figure 2. Identification of clusters of variation of gene expression related to GHD severity.
(A) Heatmap for the probe sets identified by correlation with peak GH (347 probe sets, 271 unique genes). Five distinct clusters of gene expression are identified via the dendrogram: two positively correlated (red) with peak GH and three negatively correlated (green). Pink, yellow, and blue squares indicate the principal component analysis group for each patient (see panel B). (B) Isomap supervised principal component analysis using only those probe sets whose expression correlated with peak GH identified 3 distinct groups of GHD subjects (n = 98; pink n = 49, yellow n = 37, and blue n = 12).
Figure 3. Network modeling of the overlap of gene expression between clinical markers.
(A) Network models generated using BioGRID (version 3.2.117) were analyzed to define modules of functionally related genes. The “community structure” of these modules was assessed and ranked by their “centrality” score to form a hierarchy related to the biological action of the network. (B) Community structure of modules within the network was assessed using the ModuLand algorithm in Cytoscape 2.8.3. Hierarchy of the first 15 network modules in each of the network models of gene expression overlap between clinical markers. Modules are shown as octagons labeled with the most central gene in the cluster and ranked by network centrality (1st through 15th).
Figure 4. Summary of predicted activity and regulators derived via causal network analysis for the network modules.
The hierarchy of clusters of gene expression shown in Figure 2 was mapped onto identified causal networks. Pathways and master regulators whose activity is positively correlated with GHD severity are colored red; those whose activity is negatively correlated are colored green. Pathway ontology of all modules in the hierarchy is shown in Supplemental Figure 2.

Source: PubMed
