Opportunities and challenges for transcriptome-wide association studies

Michael Wainberg, Nasa Sinnott-Armstrong, Nicholas Mancuso, Alvaro N Barbeira, David A Knowles, David Golan, Raili Ermel, Arno Ruusalepp, Thomas Quertermous, Ke Hao, Johan L M Björkegren, Hae Kyung Im, Bogdan Pasaniuc, Manuel A Rivas, Anshul Kundaje

Abstract

Transcriptome-wide association studies (TWAS) integrate genome-wide association studies (GWAS) and gene expression datasets to identify gene-trait associations. In this Perspective, we explore properties of TWAS as a potential approach to prioritize causal genes at GWAS loci, by using simulations and case studies of literature-curated candidate causal genes for schizophrenia, low-density-lipoprotein cholesterol and Crohn's disease. We explore risk loci where TWAS accurately prioritizes the likely causal gene as well as loci where TWAS prioritizes multiple genes, some likely to be non-causal, owing to sharing of expression quantitative trait loci (eQTL). TWAS is especially prone to spurious prioritization with expression data from non-trait-related tissues or cell types, owing to substantial cross-cell-type variation in expression levels and eQTL strengths. Nonetheless, TWAS prioritizes candidate causal genes more accurately than simple baselines. We suggest best practices for causal-gene prioritization with TWAS and discuss future opportunities for improvement. Our results showcase the strengths and limitations of using eQTL datasets to determine causal genes at GWAS loci.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Fig. 1 |. TWAS, like GWAS, frequently has multiple significant associations per locus.
a, An overview of TWAS. Briefly, TWAS involves: (i) training a predictive model of expression from genotype on a reference panel such as GTEx; (ii) using this model to predict expression for individuals in the GWAS cohort; and (iii) associating this predicted expression with the trait. b,c, Manhattan plots of GWAS (b) and Fusion TWAS (c) for LDL cholesterol, using GWAS summary statistics from the Global Lipids Genetics Consortium and liver expression from the STARNET cohort (Supplementary Note). GWAS has multiple hits per locus, owing to LD, and TWAS has multiple hits per locus, owing to co-regulation (which can also be driven in part by LD; described below), as explored in the main text. Clusters of multiple adjacent TWAS hit genes are highlighted in red. d, Three scenarios in which co-regulation can lead to multiple hits per locus, and the estimated percentage of non-causal hit genes subject to each scenario; each scenario is presented in a case study later in the text. To estimate the percentages, we grouped hits into 2.5-megabase clumps and made the approximation that genes that were not the top hit in multi-hit clumps were non-causal; we then calculated the percentage of these genes with total or predicted expression r² ≥ 0.2 or at least one shared variant with the top hit in their clump, aggregating genes across the LDL/liver and Crohn’s disease/whole-blood TWAS. The full distributions of the total and predicted expression correlations and number of shared variants are shown in Supplementary Fig. 1, separated by study.
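The three steps in panel a can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the Fusion pipeline: it uses individual-level genotypes rather than GWAS summary statistics, unlinked variants (no LD), a ridge regression standing in for the expression model, and a simple correlation test for the association step. All sizes, effect sizes, and names (`simulate_genotypes`, `true_w`, `lam`) are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_ref, n_gwas, n_snps = 500, 5000, 20

def simulate_genotypes(n, maf, rng):
    """Draw unlinked allele counts (0/1/2) and standardize each variant."""
    g = rng.binomial(2, maf, size=(n, maf.size)).astype(float)
    return (g - g.mean(axis=0)) / g.std(axis=0)

maf = rng.uniform(0.2, 0.5, n_snps)
true_w = np.zeros(n_snps)
true_w[[3, 11]] = [0.6, -0.4]          # two assumed eQTLs for one gene

# (i) Train an expression model on the reference panel.
#     Ridge regression is an assumption made here for brevity.
G_ref = simulate_genotypes(n_ref, maf, rng)
expr_ref = G_ref @ true_w + rng.normal(0.0, 1.0, n_ref)
lam = 10.0                              # assumed ridge penalty
w_hat = np.linalg.solve(G_ref.T @ G_ref + lam * np.eye(n_snps),
                        G_ref.T @ expr_ref)

# (ii) Predict expression for individuals in the GWAS cohort.
G_gwas = simulate_genotypes(n_gwas, maf, rng)
pred_expr = G_gwas @ w_hat

# (iii) Associate predicted expression with the trait.
#       The trait is simulated so that true expression has a causal effect.
true_expr = G_gwas @ true_w + rng.normal(0.0, 1.0, n_gwas)
trait = 0.3 * true_expr + rng.normal(0.0, 1.0, n_gwas)
r = np.corrcoef(pred_expr, trait)[0, 1]
z = r * math.sqrt((n_gwas - 2) / (1.0 - r ** 2))   # t statistic, ~normal for large n
p = math.erfc(abs(z) / math.sqrt(2.0))             # two-sided normal P value

print(f"TWAS z = {z:.2f}, P = {p:.2e}")
```

Only the genetically predicted component of expression enters the association test, which is why the reference panel can be much smaller than the GWAS cohort.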
Fig. 2 |. Co-regulation strongly predicts TWAS hit strength at the SORT1 locus.
a, Fusion Manhattan plot of the SORT1 locus. b, Expression correlation (corr.) with SORT1 versus TWAS P value, for each gene in the SORT1 locus. Chr, chromosome.
Fig. 3 |. Correlated predicted expression can cause non-causal hits even in the absence of correlated total expression.
a, For nearby genes, Fusion-predicted expression correlations tend to be higher than total expression correlations, for example, at the SORT1 locus. b, Fusion Manhattan plot of the IRF2BP2 locus, where RP4–781K5.7 is a likely non-causal hit due to predicted expression correlation with IRF2BP2. c, Details of the two genes’ Fusion expression models: a line between a variant’s rs number and a gene indicates that the variant is included in the gene’s expression model with either a positive weight (blue) or a negative weight (orange); the thickness of the line increases with the magnitude of the weight; red arcs indicate LD. Pink rs numbers are GWAS hits (genome-wide significant or sub-significant), whereas gray rs numbers are not. For clarity, four variants with weights less than 0.05 in magnitude for IRF2BP2 (rs2175594, P = 0.02, weight = +0.03; rs2439500, P = 0.2, weight = +0.01; rs11588636, P = 0.3, weight = −0.03; and rs780256, P = 0.9, weight = −0.03) and five variants for RP4–781K5.7 (rs478425, P = 0.01, weight = +0.02; rs633269, P = 0.02, weight = +0.01; rs881070, P = 0.06, weight = −0.02; rs673283, P = 0.1, weight = +0.004; and rs9659229, P = 0.1, weight = −0.04) are not shown. d, Estimated causal probability for each significant gene from Fusion at the SORT1 and IRF2BP2 loci, according to TWAS gene-based fine-mapping with the FOCUS method.
Fig. 4 |. Sharing of GWAS variants between expression models can contribute to non-causal hits even without correlated predicted expression.
a, Fusion Manhattan plot of the NOD2 locus. b, Details of the expression models of NOD2 and BRD7; as in Fig. 3c, a line between a variant’s rs number and a gene indicates that the variant is included in the gene’s expression model with either a positive weight (blue) or a negative weight (orange), with the thickness of the line increasing with the magnitude of the weight. Red arcs indicate LD.
Fig. 5 |. Co-regulation scenarios in TWAS that may lead to non-causal hits, from least to most general.
a, Correlated expression across individuals: the causal gene has correlated total expression with another gene, which may become a non-causal TWAS hit. Co-reg, co-regulation. b, Correlated predicted expression across individuals: even if total expression correlation is low, predicted expression correlation may be high if the same variants (or variants in LD) regulate both genes and are included in both models. c, Sharing of GWAS hits: even if the two genes’ models include largely distinct variants, and predicted expression correlation is low, only a single shared GWAS hit variant (or variant in LD) is necessary for both genes to be TWAS hits. d, Both models include distinct GWAS hits: in the most general case, the GWAS hits driving the signal at the two genes may not be in LD with each other, for instance if the non-causal gene’s GWAS hit happens to regulate the causal gene as well, but this connection is missed by the expression modeling (a false negative), or if the causal gene’s GWAS hit acts via a coding mechanism (not shown).
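Scenario b can be reproduced in a small simulation. The sketch below assumes two hypothetical genes, A (causal) and B (non-causal), that share one eQTL and each have one private eQTL, with unlinked variants, ordinary least squares as the expression model, and arbitrary effect sizes. None of these choices comes from the paper; they are picked only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ref, n_gwas = 1000, 10000            # hypothetical cohort sizes

def genotypes(n, rng, maf=0.3, n_variants=3):
    """Unlinked allele counts, standardized by the expected mean and s.d."""
    g = rng.binomial(2, maf, size=(n, n_variants)).astype(float)
    return (g - 2 * maf) / np.sqrt(2 * maf * (1 - maf))

# Variant columns: [shared eQTL, A-specific eQTL, B-specific eQTL].
w_a = np.array([0.5, 0.2, 0.0])        # assumed true weights for gene A
w_b = np.array([0.5, 0.0, 0.2])        # assumed true weights for gene B

def expression(G, rng):
    """Total expression = genetic component + independent noise per gene."""
    a = G @ w_a + rng.normal(size=len(G))
    b = G @ w_b + rng.normal(size=len(G))
    return a, b

def assoc_z(x, y):
    """Correlation-based association statistic (approximately normal)."""
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((len(x) - 2) / (1 - r ** 2))

# Reference panel: total expression correlation and fitted expression models.
G_ref = genotypes(n_ref, rng)
a_ref, b_ref = expression(G_ref, rng)
total_corr = np.corrcoef(a_ref, b_ref)[0, 1]
w_a_hat = np.linalg.lstsq(G_ref, a_ref, rcond=None)[0]
w_b_hat = np.linalg.lstsq(G_ref, b_ref, rcond=None)[0]

# GWAS cohort: only gene A affects the trait.
G_gwas = genotypes(n_gwas, rng)
a_gwas, _ = expression(G_gwas, rng)
trait = 0.3 * a_gwas + rng.normal(size=n_gwas)

pred_a, pred_b = G_gwas @ w_a_hat, G_gwas @ w_b_hat
pred_corr = np.corrcoef(pred_a, pred_b)[0, 1]
z_a, z_b = assoc_z(pred_a, trait), assoc_z(pred_b, trait)

print(f"total expression corr = {total_corr:.2f}")
print(f"predicted expression corr = {pred_corr:.2f}")
print(f"TWAS z: causal A = {z_a:.1f}, non-causal B = {z_b:.1f}")
```

Because prediction discards the non-genetic noise that dilutes total expression correlation, the two predicted expression vectors are far more correlated than the measured ones, and gene B inherits much of gene A's association signal.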
Fig. 6 |. Most candidate causal genes drop out after switching to a tissue with a less clear mechanistic relationship to the trait, owing to a lack of sufficient expression or sufficiently heritable expression.
Fusion TWAS P values at nine LDL/liver and four Crohn’s disease/whole-blood multi-hit loci, using expression from tissues with a clear (top row) or less clear or absent (bottom row) mechanistic relationship to the trait. Candidate causal genes are labeled and colored red.

Source: PubMed
