GAWMerge expands GWAS sample size and diversity by combining array-based genotyping and whole-genome sequencing

Ravi Mathur, Fang Fang, Nathan Gaddis, Dana B Hancock, Michael H Cho, John E Hokanson, Laura J Bierut, Sharon M Lutz, Kendra Young, Albert V Smith, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Edwin K Silverman, Grier P Page, Eric O Johnson

Abstract

Genome-wide association studies (GWAS) have made impactful discoveries for complex diseases, often by amassing very large sample sizes. Yet GWAS of many diseases remain underpowered, especially for non-European ancestries. One cost-effective way to increase sample size is to combine existing cohorts, which may be small or case-only, with public controls. This approach is limited by the need for a large overlap in variants across genotyping arrays and by the scarcity of non-European controls. We developed and validated a protocol, Genotyping Array-WGS Merge (GAWMerge), for combining genotypes from arrays and whole-genome sequencing (WGS). The protocol ensures complete variant overlap and allows diverse WGS resources such as Trans-Omics for Precision Medicine (TOPMed) to be used as public controls. It involves phasing, imputation, and filtering. Using smoking-related cohorts, we illustrated its ability to control technology-driven artifacts and type-I error, and to recover known disease-associated signals across technologies, independent datasets, and ancestries. GAWMerge enables genetic studies to leverage existing cohorts to validly increase sample size and enhance discovery for understudied traits and ancestries.

Trial registration: ClinicalTrials.gov NCT00292552.

Conflict of interest statement

E.K.S. has received institutional grant support from GlaxoSmithKline and Bayer. M.H.C. has received grant support from GSK and Bayer, and consulting or speaking fees from Illumina, Genentech, and AstraZeneca. All other authors have no competing interests.

© 2022. The Author(s).

Figures

Fig. 1. Overview of the protocol to use whole-genome sequencing (WGS) data as public control in GWAS.
*The quality control (QC) of the case and public control data is conducted independently according to the steps outlined in the methods.
Fig. 2. Evaluation design for GAWMerge.
Evaluation design for (a) technical comparison, (b) type-I error assessment, and (c) known GWAS hits. *The samples with European ancestry in COPDGene were evenly divided into two subsets. EA1 includes all COPD cases and enough COPD controls to match the COPD prevalence in ECLIPSE. EA2 contains the remaining COPD-free samples.
Fig. 3. Meta-analysis results from evaluation for type-I error.
The Manhattan plot (a) shows no signal, as expected, while the QQ-plot (b) shows no inflation.
Fig. 4. Meta-analysis results for replication of GWAS hits for COPD.
The Manhattan plot (a) shows the replicated signals, while the QQ-plot (b) shows inflation due to the true signal.

Source: PubMed
