Gaussian and Mixed Graphical Models as (multi-)omics data analysis tools

Michael Altenbuchinger, Antoine Weihs, John Quackenbush, Hans Jörgen Grabe, Helena U Zacharias

Abstract

Gaussian Graphical Models (GGMs) are tools to infer dependencies between biological variables. Popular applications are the reconstruction of gene, protein, and metabolite association networks. GGMs are an exploratory research tool that can be useful to discover interesting relations between genes (functional clusters) or to identify therapeutically interesting genes, but they do not necessarily infer a network in the mechanistic sense. Although GGMs are well investigated from a theoretical and applied perspective, important extensions are not well known within the biological community. GGMs assume, for instance, multivariate normally distributed data. If this assumption is violated, Mixed Graphical Models (MGMs) can be the better choice. In this review, we provide the theoretical foundations of GGMs, present extensions such as MGMs or multi-class GGMs, and illustrate how those methods can provide insight into biological mechanisms. We summarize several applications and present user-friendly estimation software. This article is part of a Special Issue entitled: Transcriptional Profiles and Regulatory Gene Networks edited by Dr. Federico Manuel Giorgi and Dr. Shaun Mahony.

Keywords: (Multi-)omics; Gaussian Graphical Model; Gene regulatory network; Mixed Graphical Model.

Copyright © 2019 Elsevier B.V. All rights reserved.

Figures

Figure 1. Scatterplots before and after variable adjustment.
Figure (a) shows the scatterplot of 1000 measurements of two multivariate normal random variables X and Y. Figure (b) takes into account the effect of a third random variable Z, which is associated with both X and Y. Here, we calculated the residuals $e_X$ and $e_Y$ after a linear regression of X on Z and of Y on Z. We observe that the correlation between X and Y in (a) can be entirely explained by variable Z, as shown in Figure (b). The corresponding Pearson correlation coefficients are given in the lower right corners. Data were simulated from a three-dimensional multivariate normal distribution, $(X, Y, Z)^T \sim N(0, \Omega^{-1})$, where the precision matrix $\Omega$ is defined by $\omega_{11} = \omega_{22} = \omega_{33} = 1$, $\omega_{31} = \omega_{32} = \omega_{13} = \omega_{23} = -0.7$, and 0 elsewhere, as outlined in Supplementary File 1.
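The adjustment described in this caption can be sketched in a few lines of Python. This is not the authors' code (their version is the R script in Supplementary File 1); it is a minimal NumPy illustration that reuses the precision matrix stated above, and the seed, the variable names, and the helper `residuals` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# Precision matrix from the caption: X and Y are linked only through Z.
omega = np.array([[1.0, 0.0, -0.7],
                  [0.0, 1.0, -0.7],
                  [-0.7, -0.7, 1.0]])
sigma = np.linalg.inv(omega)
sigma = (sigma + sigma.T) / 2  # remove round-off asymmetry before sampling

data = rng.multivariate_normal(np.zeros(3), sigma, size=1000)
x, y, z = data.T


def residuals(target, covariate):
    """Residuals of an ordinary least-squares fit of target on covariate (with intercept)."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


e_x = residuals(x, z)
e_y = residuals(y, z)

r_xy = np.corrcoef(x, y)[0, 1]           # marginal Pearson correlation, as in panel (a)
r_partial = np.corrcoef(e_x, e_y)[0, 1]  # correlation after adjusting for Z, as in panel (b)
print(f"corr(X, Y) = {r_xy:.2f}, corr(e_X, e_Y) = {r_partial:.2f}")
```

Because $\omega_{12} = 0$, the population partial correlation of X and Y given Z is zero, so the residual correlation should fluctuate around zero while the marginal correlation stays large.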
Figure 2. Graphical representation of conditional independence.
Figure (a) illustrates the concept of conditional independence. Variables X and Y are conditionally independent given Z. Consequently, no edge is drawn between X and Y, while there is an edge between X and Z, and Y and Z. Figure (b) shows an exemplary precision matrix Ω. Figure (c) shows the corresponding network visualization, and (d) illustrates the first order neighborhood of the variable v1, which includes the node itself and the two adjacent nodes v2 and v4.
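To make the link between a precision matrix, its network, and a first order neighborhood concrete, here is a small Python sketch. The 4 × 4 matrix is hypothetical (the matrix actually shown in panel (b) is not reproduced in this text); it was chosen only so that v1 is adjacent to v2 and v4. The standard relation $\rho_{ij} = -\omega_{ij} / \sqrt{\omega_{ii}\,\omega_{jj}}$ converts precision entries into partial correlations.

```python
import numpy as np

names = ["v1", "v2", "v3", "v4"]

# Hypothetical precision matrix, not the one from the figure.
omega = np.array([[1.0, 0.3, 0.0, -0.4],
                  [0.3, 1.0, 0.2, 0.0],
                  [0.0, 0.2, 1.0, 0.0],
                  [-0.4, 0.0, 0.0, 1.0]])

# Partial correlations: rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
d = np.sqrt(np.diag(omega))
partial_corr = -omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# An edge is drawn wherever an off-diagonal precision entry is non-zero.
adjacency = (omega != 0) & ~np.eye(len(names), dtype=bool)


def first_order_neighborhood(node):
    """Return the node itself followed by all nodes directly connected to it."""
    i = names.index(node)
    return [node] + [names[j] for j in np.flatnonzero(adjacency[i])]


neighborhood_v1 = first_order_neighborhood("v1")
print(neighborhood_v1)
```

With estimated rather than known precision matrices, exact zeros do not occur, so edges are usually selected by a significance test or a sparsity penalty instead of the `!= 0` check used here.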
Figure 3. Distribution of gene-gene Pearson correlations and full order partial correlations.
Figure (a) shows the distribution of gene-gene Pearson correlation coefficients estimated for single-cell RNA sequencing data of melanoma metastases from Tirosh et al. (2016). Figure (b) shows the corresponding distribution of full order partial correlations estimated using the R package GeneNet (Schaefer et al., 2015). The black dashed lines in (a) mark the 99th and 1st percentiles of the (anti-)correlations. In (b), the corresponding lines are shown for partial correlations. Note that in both (a) and (b) the y-axis is on a log scale.
Figure 4. Gaussian Graphical Model for single-cell RNA sequencing data of melanoma metastases (Tirosh et al., 2016).
Figure (a) displays the complete GGM with nodes representing the 1,000 most abundant genes in the data set and edges representing significant (q-value < 0.05) full order partial correlations. The strength of an association is reflected by the edge intensity from strong positive (dark blue) to strong negative association (dark red). Figure (b) displays the first order neighborhood of CD3D, which encodes a protein of the T-cell receptor/CD3 complex. The corresponding R code to reproduce the results is given in the Supplementary File 1.
Figure 5. Partial correlation estimation accuracy.
We simulated data for p = 100 variables and 248 true edges (5% of all possible edges) for different sample sizes. The y-axis gives the deviation between partial correlation estimates and the ground truth, calculated as $\lVert \rho_{\text{estimate}} - \rho_{\text{true}} \rVert_F^2$, where $\rho_{\text{estimate}}$ is the estimate, $\rho_{\text{true}}$ the ground truth, and $\lVert \cdot \rVert_F$ the Frobenius norm. Here, the red curve is the estimate obtained from covariance matrix inversion, which is only possible for sample sizes N > p. N = p is indicated by the vertical black dotted line. The blue line shows the corresponding result using the covariance shrinkage approach of Schaefer et al. (2015). We observe that covariance shrinkage provides estimates for sample sizes N < p and that estimates improve considerably for moderate sample sizes N > p. Note that both axes are on a logarithmic scale.
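The comparison in this figure can be imitated with a rough Python sketch. Note the substitution: instead of the analytic shrinkage estimator of Schaefer et al. (2015), the code below uses a plain linear shrinkage of the sample covariance towards its diagonal with a fixed, hypothetical intensity of 0.2. The edge weights of ±0.2, the seed, and all function names are likewise assumptions made for illustration, so the numbers will not match the figure.

```python
import numpy as np


def partial_correlations(precision):
    """Convert a precision matrix into a matrix of partial correlations."""
    d = np.sqrt(np.diag(precision))
    rho = -precision / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho


def simulate_precision(p, n_edges, rng):
    """Sparse, diagonally dominant (hence positive definite) precision matrix."""
    omega = np.zeros((p, p))
    rows, cols = np.triu_indices(p, k=1)
    chosen = rng.choice(len(rows), size=n_edges, replace=False)
    omega[rows[chosen], cols[chosen]] = 0.2 * rng.choice([-1.0, 1.0], size=n_edges)
    omega = omega + omega.T
    np.fill_diagonal(omega, np.abs(omega).sum(axis=1) + 0.5)
    return omega


def estimation_errors(n_samples, omega, rng, shrinkage=0.2):
    """Squared Frobenius error of partial correlations: (plain inversion, shrinkage)."""
    p = omega.shape[0]
    rho_true = partial_correlations(omega)
    data = rng.multivariate_normal(np.zeros(p), np.linalg.inv(omega), size=n_samples)
    s = np.cov(data, rowvar=False)

    # Fixed-intensity shrinkage towards the diagonal keeps the matrix invertible.
    s_shrunk = (1 - shrinkage) * s + shrinkage * np.diag(np.diag(s))
    err_shrunk = np.sum((partial_correlations(np.linalg.inv(s_shrunk)) - rho_true) ** 2)

    err_inverse = np.nan  # the sample covariance is singular unless N > p
    if n_samples > p:
        err_inverse = np.sum((partial_correlations(np.linalg.inv(s)) - rho_true) ** 2)
    return err_inverse, err_shrunk


rng = np.random.default_rng(1)
omega_true = simulate_precision(p=100, n_edges=248, rng=rng)
errors = {n: estimation_errors(n, omega_true, rng) for n in (50, 120, 1000)}
for n, (err_inverse, err_shrunk) in errors.items():
    print(f"N = {n}: inversion {err_inverse:.1f}, shrinkage {err_shrunk:.1f}")
```

A fixed intensity is a simplification: a data-driven choice of the shrinkage intensity adapts to the sample size, which a constant value cannot do for large N.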
Figure 6. First order neighborhood of the node “cell type”.
The right figure shows the neighborhood of the categorical variable “cell type”. Edge intensity reflects the strength of an association from strong positive (dark blue) to strong negative association (dark red). The node color indicates if the selected gene is specific for B cells (red squares), macrophages (green circles), or T cells (blue circles). T-cell genes are, e.g., CD3D, CD3E, and CD3G, which encode proteins of the T-cell receptor-CD3 complex, CD2, which encodes a surface antigen present on all peripheral blood T cells, and Interleukin 32 (IL32), which encodes a cytokine increased in the activation of T cells. B-cell related genes (red) are, e.g., CD37, which encodes a cell-surface protein whose expression is restricted to cells of the immune system, with highest expression in mature B cells, and HLA-DRA, which is one of the HLA class II alpha chain paralogues and is expressed in antigen presenting cells. The only selected macrophage gene was Lysozyme (LYZ). Lysozymes are associated with the monocyte-macrophage system and enhance the activity of immunoagents. The corresponding classification performance in differentiating T cells, B cells, and macrophages from the remaining cells is shown in the upper left corner.
Figure 7. Partial correlation estimates GGM versus MGM.
Figure (a) compares the partial correlations estimated using a GGM (y-axis) with those estimated using an MGM that additionally contains the cell type as a discrete node (x-axis). For better comparability, we estimated both the GGM and the MGM as described in Altenbuchinger et al. (2019). Figure (b) shows the orange area indicated in (a). Red circles correspond to genes that are directly connected to the cell-type node in the MGM approach.
Figure 8. Differential networks.
Figure (a) shows an example network $\Omega_A$, corresponding to phenotype A, and (b) shows the corresponding network $\Omega_B$ of phenotype B. Both networks share similarities, but differ in selected edges, yielding the differential network $\Omega_A - \Omega_B$ in (c). Blue edges encode positive associations and red edges negative associations.
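As a toy illustration of a differential network, the Python sketch below subtracts two small precision matrices and lists the edges that differ. Both matrices and the node names are hypothetical and are not the networks drawn in the figure.

```python
import numpy as np

names = ["v1", "v2", "v3", "v4"]

# Hypothetical precision matrices for phenotypes A and B.
omega_a = np.array([[1.0, 0.3, 0.0, -0.4],
                    [0.3, 1.0, 0.2, 0.0],
                    [0.0, 0.2, 1.0, 0.0],
                    [-0.4, 0.0, 0.0, 1.0]])
omega_b = np.array([[1.0, 0.3, 0.0, 0.0],
                    [0.3, 1.0, 0.2, 0.0],
                    [0.0, 0.2, 1.0, 0.25],
                    [0.0, 0.0, 0.25, 1.0]])

# Shared edges cancel; only edges that differ between phenotypes remain.
diff = omega_a - omega_b

rows, cols = np.triu_indices(len(names), k=1)
differential_edges = [
    (names[i], names[j], round(float(diff[i, j]), 2))
    for i, j in zip(rows, cols)
    if not np.isclose(diff[i, j], 0.0)
]
print(differential_edges)
```

With estimated networks, differences are never exactly zero, so a practical analysis would need a threshold, a test, or a joint estimation scheme rather than the exact comparison used in this sketch.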

Source: PubMed
