Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data

Joshua C Denny, Lisa Bastarache, Marylyn D Ritchie, Robert J Carroll, Raquel Zink, Jonathan D Mosley, Julie R Field, Jill M Pulley, Andrea H Ramirez, Erica Bowton, Melissa A Basford, David S Carrell, Peggy L Peissig, Abel N Kho, Jennifer A Pacheco, Luke V Rasmussen, David R Crosslin, Paul K Crane, Jyotishman Pathak, Suzette J Bielinski, Sarah A Pendergrass, Hua Xu, Lucia A Hindorff, Rongling Li, Teri A Manolio, Christopher G Chute, Rex L Chisholm, Eric B Larson, Gail P Jarvik, Murray H Brilliant, Catherine A McCarty, Iftikhar J Kullo, Jonathan L Haines, Dana C Crawford, Daniel R Masys, Dan M Roden

Abstract

Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10⁻⁶ (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
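The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below also illustrates how a false discovery rate cutoff turns into a P-value threshold. It assumes the Benjamini-Hochberg procedure, which the abstract does not name, and the P-values in `toy_p` are hypothetical and are not data from the study. The test count is only an upper bound, assuming every SNP was tested against every phenotype.

```python
# Figures quoted in the abstract.
N_SNPS = 3144
N_PHENOTYPES = 1358
REPLICATED, POWERED = 51, 77

# Upper bound on SNP-phenotype tests, assuming every pair was tested.
max_tests = N_SNPS * N_PHENOTYPES
replication_rate = REPLICATED / POWERED


def bh_threshold(p_values, q=0.1):
    """Largest P-value accepted by Benjamini-Hochberg at FDR q (0.0 if none).

    Assumption: the source does not say which FDR procedure was used.
    """
    m = len(p_values)
    threshold = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        # The k-th smallest P-value passes if it is at most q * k / m.
        if p <= q * rank / m:
            threshold = p
    return threshold


# Hypothetical P-values for illustration only.
toy_p = [0.001, 0.004, 0.02, 0.05, 0.2, 0.5]

print(max_tests)                  # number of SNP-phenotype pairs
print(f"{replication_rate:.0%}")  # replication rate as a percentage
print(bh_threshold(toy_p))        # P-value cutoff implied by FDR < 0.1
```

With far more tests than in this toy list, the same procedure yields a much smaller cutoff, which is the sense in which the abstract pairs P < 4.6 × 10⁻⁶ with a false discovery rate below 0.1.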

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1
PheWAS replication of NHGRI Catalog SNP-phenotype associations. (a) Each point represents the −log₁₀(P) of a single SNP-phenotype association tested with PheWAS. This study is restricted to SNP-phenotype associations that achieved genome-wide significance (P ≤ 5 × 10⁻⁸) in at least one prior GWAS that included individuals of European ancestry. Numbers in parentheses beside each phenotype represent the sample size within the PheWAS data set. The vertical blue line represents P = 0.05. Binary traits refer to all adequately powered binary traits in the NHGRI Catalog with exact matches to a PheWAS phenotype. For example, 5/5 catalog SNPs associated with rheumatoid arthritis were replicated at P < 0.05 in PheWAS, and 9/15 SNPs associated with type 2 diabetes were replicated. Continuous traits are those numerically defined traits in the NHGRI Catalog that are related to PheWAS diseases (e.g., “iron deficiency anemia” was the PheWAS trait paired with the “serum iron level” catalog trait). (b) Replication rates of SNP-phenotype associations at different bins of statistical power. Association count refers to the number of SNP-phenotype associations replicated or not replicated at each bin of statistical power (e.g., all tested associations with power <0.1, power 0.1–0.2). The black line represents a linear regression weighted using the number of associations in each bin (y = 0.64x, r² = 0.96). (c) Replication rate of NHGRI Catalog associations by number of unique publications citing the original SNP-phenotype association. Association count refers to the number of SNP-phenotype associations (among either adequately powered binary or continuous traits) with the corresponding number of publications. (d) Replication rate of NHGRI Catalog associations by discovery P-value. The dashed line indicates P = 5 × 10⁻⁸.
Figure 2
GWAS and PheWAS associations in the genome. Each diamond represents a unique phenotype association at each SNP. Red diamonds represent associations in the NHGRI Catalog only (including phenotypes not present in the PheWAS catalog), green diamonds represent NHGRI Catalog associations replicated by PheWAS (P < 0.05), and blue diamonds represent new phenotype associations identified by PheWAS (P < 4.6 × 10⁻⁶, or an FDR < 0.1). Numbers to the right and left indicate chromosomes.
Figure 3
PheWAS plots for four SNPs. Each panel represents 1,358 phenotypes tested for association with a particular SNP, using logistic regression assuming an additive genetic model adjusted for age, sex, study site and the first three principal components. Phenotypes are grouped along the x axis by categorization within the PheWAS code hierarchy. The upper red lines indicate P = 4.6 × 10⁻⁶ (FDR = 0.1 for entire PheWAS); lower blue lines indicate P = 0.05; dashed lines are a single-SNP Bonferroni correction (P = 0.05/1,358). Diamonds encircling phenotype circles represent known NHGRI Catalog associations. (a) PheWAS associations for rs12203592 in IRF4, previously associated with hair and eye color, freckling and progressive supranuclear palsy. (b) PheWAS associations for rs2853676 in TERT, previously associated with glioma. (c) PheWAS associations for rs4977574 near CDKN2BAS at chr9p21, previously associated with myocardial infarction, and in LD with carotid stenosis. (d) PheWAS associations for rs660895 near HLA-DRB1, previously associated with rheumatoid arthritis. Results and plots for all SNPs included in the present study are available at http://phewascatalog.org/.
Figure 4
Risk variants for skin phenotypes have different pleiotropy patterns. Association odds ratios are graphed on the x axis and P-values (numbers next to the bars) are from the PheWAS analysis for that SNP. All SNPs use the minor allele as the coded allele, except rs2853676 (TERT). Darker colored bars represent significant associations, calculated as P = 0.05 divided by the number of associations displayed, or 0.05/(6 phenotypes × 6 SNPs) = 1.4 × 10⁻³. Tests for heterogeneity revealed significant heterogeneity among the six phenotypes (I² = 59–94%, all P < 0.05) and among the six SNPs (I² = 23–83%, all P < 0.05). Bars oriented leftward toward “protect” represent SNPs in which the coded allele favors decreased prevalence of disease, and bars oriented rightward toward “risk” represent coded alleles favoring increased prevalence of disease.
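The Bonferroni thresholds quoted in the captions for Figures 3 and 4 follow directly from dividing 0.05 by the number of tests. A minimal sketch of that arithmetic is given below; the function and variable names are illustrative and do not come from the study.

```python
ALPHA = 0.05  # nominal significance level used in the captions


def bonferroni(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Figure 3: one SNP tested against 1,358 phenotypes.
fig3_threshold = bonferroni(ALPHA, 1358)

# Figure 4: 6 phenotypes x 6 SNPs displayed.
fig4_threshold = bonferroni(ALPHA, 6 * 6)

print(f"{fig3_threshold:.1e}")
print(f"{fig4_threshold:.1e}")
```

The Figure 4 value rounds to the 1.4 × 10⁻³ stated in its caption. The single-SNP threshold for Figure 3 is less strict than the PheWAS-wide cutoff of 4.6 × 10⁻⁶, because it corrects only for the phenotypes tested against one SNP.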

Source: PubMed
