The Molecular Signatures Database (MSigDB) hallmark gene set collection

Arthur Liberzon, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P Mesirov, Pablo Tamayo

Abstract

The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of "hallmark" gene sets as part of MSigDB. Each hallmark in this collection consists of a "refined" gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.
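To make the notion of redundancy across gene sets concrete, the following is a minimal sketch that flags pairs of gene sets with high membership overlap using the Jaccard index. It is an illustration only and not the authors' procedure: the function names, the 0.5 threshold, and the toy gene sets are all hypothetical assumptions introduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: number of shared genes divided by genes in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundant_pairs(gene_sets, threshold=0.5):
    """Return (name1, name2, overlap) for set pairs whose overlap meets the threshold.

    The threshold is an arbitrary illustrative choice, not a published cutoff.
    """
    pairs = []
    # Sort by name so the output order is deterministic.
    for (n1, g1), (n2, g2) in combinations(sorted(gene_sets.items()), 2):
        score = jaccard(g1, g2)
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs


# Hypothetical toy gene sets (not taken from MSigDB).
toy_sets = {
    "SET_A": {"GLI1", "PTCH1", "HHIP", "SMO"},
    "SET_B": {"GLI1", "PTCH1", "HHIP", "SUFU"},
    "SET_C": {"MYC", "CCND1"},
}

for name1, name2, score in redundant_pairs(toy_sets):
    print(f"{name1} and {name2} overlap with Jaccard index {score:.2f}")
```

Membership overlap captures only one kind of redundancy; it says nothing about heterogeneity within a set or about whether the genes in a set display coherent expression, which would require expression data.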

Keywords: gene expression; gene set enrichment analysis; gene sets.

Figures

Figure 1
Analysis of Hedgehog signaling in medulloblastoma. The figure shows ssGSEA scores ranked by their degree of association (IC) with the Hedgehog versus photoreceptor phenotype for: A) the 50 hallmarks and B) the Hedgehog hallmark and 9 of its top-scoring founder gene sets. The IC scores, p-values, and FDRs appear on the right side of the heat maps. Black and grey denote the Hedgehog and photoreceptor medulloblastoma subtypes, respectively.
Figure 2
Ranks of gene sets grouped by biological themes. The horizontal axis denotes rankings of gene sets enriched in the GBM data with respect to necrosis. The biological themes are on the right side of the graph. The vertical bars indicate ranks of gene sets. Black bars denote the 245 significantly enriched sets. Gray bars stand for the gene sets that were not enriched significantly. The uncategorized gene sets are not shown. The rows indicate 11 biological themes. The red box shows gene sets that are pushed down the list by high scoring gene sets representing hypoxia/glycolysis, EMT, and NFkB signaling.
Figure 3
Matching hallmark enrichment scores to phenotypes defined by protein levels. The top row of the heat maps shows Reverse Phase Protein Array (RPPA) profiles of selected proteins sorted in descending order from left to right. The chosen protein expression profiles are, from top to bottom: A) MYC (c-Myc-R-C), B) ESR1 (ER-alpha-R-V), C) AR (AR-R-V), D) BCL2 (Bcl-2-M-V), E) CDH2 (N-cadherin-R-V), F) SMAD3 (Smad3-R-V), G) STAT3 pY705 (STAT3_pY705-R-V), H) STAT5A (STAT5-alpha-R-V) and I) KDR scores.

Source: PubMed
