Accounting for population structure in genetic studies of cystic fibrosis

Hanley Kingston, Adrienne M Stilp, William Gordon, Jai Broome, Stephanie M Gogarten, Hua Ling, John Barnard, Shannon Dugan-Perez, Patrick T Ellinor, Stacey Gabriel, Soren Germer, Richard A Gibbs, Namrata Gupta, Kenneth Rice, Albert V Smith, Michael C Zody, Cystic Fibrosis Genome Project, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Scott M Blackman, Garry Cutting, Michael R Knowles, Yi-Hui Zhou, Margaret Rosenfeld, Ronald L Gibson, Michael Bamshad, Alison Fohner, Elizabeth E Blue

Abstract

CFTR F508del (c.1521_1523delCTT, p.Phe508delPhe) is the most common pathogenic allele underlying cystic fibrosis (CF), and its frequency varies in a geographic cline across Europe. We hypothesized that genetic variation associated with this cline is overrepresented in a large cohort (N > 5,000) of persons with CF who underwent whole-genome sequencing and that this pattern could result in spurious associations between variants correlated with both the F508del genotype and CF-related outcomes. Using principal-component (PC) analyses, we showed that variation in the CFTR region disproportionately contributes to a PC explaining a relatively high proportion of genetic variance. Variation near CFTR was correlated with population structure among persons with CF, and this correlation was driven by a subset of the sample inferred to have European ancestry. We performed genome-wide association studies comparing persons with CF with one versus two copies of the F508del allele; this allowed us to identify genetic variation associated with the F508del allele and to determine that standard PC-adjustment strategies eliminated the significant association signals. Our results suggest that PC adjustment can adequately prevent spurious associations between genetic variants and CF-related traits and is therefore an effective tool to control for population structure, even when population structure is confounded with disease severity and a common pathogenic variant.
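The confounding mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes two discrete groups rather than a cline, uses ordinary least squares rather than a relatedness-adjusted model, and every function name and parameter value is hypothetical. A null variant whose frequency differs between groups appears associated with a trait that depends only on group membership, and adding the top PCs as covariates shrinks that signal.

```python
import numpy as np


def simulate(n_per_group=500, n_snps=200, seed=0):
    """Simulate two groups that differ in allele frequencies and trait mean.

    All values here are illustrative assumptions, not study parameters.
    """
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1], n_per_group)
    base = rng.uniform(0.2, 0.8, n_snps)
    shift = rng.uniform(-0.15, 0.15, n_snps)
    # Per-person allele frequencies, shape (n_individuals, n_snps).
    freqs = np.where(group[:, None] == 0, base - shift, base + shift)
    geno = rng.binomial(2, freqs).astype(float)
    # Null test variant: frequency differs by group, no effect on the trait.
    test_snp = rng.binomial(2, np.where(group == 0, 0.2, 0.6)).astype(float)
    # The trait depends on group membership only.
    trait = 1.0 * group + rng.normal(0.0, 1.0, group.size)
    return geno, test_snp, trait


def top_pcs(geno, k):
    """Return the first k principal-component scores of a genotype matrix."""
    centered = geno - geno.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


def snp_t_statistic(trait, snp, covars=None):
    """t statistic for the SNP effect in a linear model with optional covariates."""
    n = trait.size
    columns = [np.ones(n), snp]
    if covars is not None:
        columns.append(covars)
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, trait, rcond=None)
    resid = trait - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta[1] / np.sqrt(cov[1, 1])


if __name__ == "__main__":
    geno, test_snp, trait = simulate()
    pcs = top_pcs(geno, k=2)
    print("baseline t:", snp_t_statistic(trait, test_snp))
    print("PC-adjusted t:", snp_t_statistic(trait, test_snp, pcs))
```

In this toy setting the first PC separates the two groups, so conditioning on it removes most of the shared dependence between the test variant and the trait.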

Keywords: CFTR F508del; genome-wide association study; population structure.

Conflict of interest statement

M.B. is the editor-in-chief and J.X.C. (member of the Cystic Fibrosis Genome Project) is the deputy editor of HGG Advances. The authors declare no other competing interests.

© 2022 The Author(s).

Figures

Figure 1
Population structure within the entire CFGP (n = 4,939). Pairwise principal-component (PC) plots are shown for PCs 1–4, with frequency distributions and the percentage of variance explained by each PC on the diagonal. Ancestry estimates indicate the ancestry with the highest estimated proportion using Somalier. Abbreviations: AFR, sub-Saharan African; AMR, Native American; EAS, East Asian; EUR, European; SAS, South Asian.
Figure 2
Correlation between PCs and genomic position. (A and B) Correlations between PCs (y axis) and genomic position (x axis) are shown for (A) the CFGP (n = 4,939) and (B) CFGP participants with estimated European ancestry >80% (n = 4,567). The number of PCs shown is the number used to calculate the genetic relatedness matrix and, for the total CFGP dataset, used in the PC-adjusted GWAS analysis. Color-coded regions include 7q31.2 (CFTR, pink) and three regions that have previously shown evidence of long-range LD: 2q21.1-2q22.1 (LCT, teal), 6p22.3-6p21.2 (the major histocompatibility complex, orange), and the 8p23 inversion polymorphism (purple).
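A plot like the one described for Figure 2 rests on a per-variant correlation between genotype and each PC. The following is a minimal sketch of that calculation, assuming a dense individuals-by-variants genotype matrix and precomputed PC scores; the function name is hypothetical and this is not the software used in the study.

```python
import numpy as np


def pc_variant_correlation(geno, pcs):
    """Pearson correlation between each PC and each variant's genotype.

    geno: array of shape (n_individuals, n_variants), allele counts.
    pcs:  array of shape (n_individuals, n_pcs), PC scores.
    Returns an array of shape (n_pcs, n_variants); monomorphic variants
    (zero genotype variance) are reported as NaN.
    """
    g = geno - geno.mean(axis=0)
    p = pcs - pcs.mean(axis=0)
    g_norm = np.linalg.norm(g, axis=0)
    p_norm = np.linalg.norm(p, axis=0)
    g_norm[g_norm == 0] = np.nan  # avoid dividing by zero for monomorphic sites
    return (p.T @ g) / (p_norm[:, None] * g_norm[None, :])


if __name__ == "__main__":
    pcs = np.array([[-1.0], [0.0], [1.0], [2.0]])
    geno = np.array([[0, 3, 1], [1, 2, 1], [2, 1, 1], [3, 0, 1]], dtype=float)
    print(pc_variant_correlation(geno, pcs))
```

Plotting the absolute value of each row against variant position would highlight regions, such as those color-coded in the figure, where a PC tracks local rather than genome-wide variation.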
Figure 3
GWASs for CFTR F508del heterozygosity versus homozygosity. (Top) The baseline model adjusted for site and relatedness. (Bottom) The PC-adjusted model. Association signals are measured as -log10(p values). The plot is truncated at p = 1 × 10⁻¹⁰, as the peak at CFTR on chr7 reaches p < 1 × 10⁻³⁰⁰ under both models. The genome-wide significance level, p < 5 × 10⁻⁸, is indicated by the horizontal line.

Source: PubMed
