A global reference for human genetic variation

1000 Genomes Project Consortium, Adam Auton, Lisa D Brooks, Richard M Durbin, Erik P Garrison, Hyun Min Kang, Jan O Korbel, Jonathan L Marchini, Shane McCarthy, Gil A McVean, Gonçalo R Abecasis

Abstract

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Conflict of interest statement

D.M.A. is affiliated with Vertex Pharmaceuticals; E.A. is on the speaker’s bureau for Illumina; P.A. is an advisor to Illumina and Ancestry.com; D.R.B., B.B., M.B., R.K.C., A.C., M.E., S.H., S.K., L.M., J.P. and R.S. are affiliated with Illumina; J.K.B. is affiliated with Ancestry.com; A.C. is on the Science Advisory Board of Biogen Idec and the scientific advisory board of Affymetrix; A.W.C. is affiliated with DNAnexus; D.C. is affiliated with Personalis; C.J.D., J.G., J.P.S., T.W., B.W. and Y.Z. are affiliated with Affymetrix; E.T.D. is an advisor for DNAnexus; F.M.D.L.V. is employed by Real Time Genomics; M.A.D. is affiliated with SynapDx; P.D. is a co-founder and director of Genomics, and a partner in Peptide Groove; R.D. is a founder of Congenica and a consultant for Dovetail; E.E.E. is on the scientific advisory board of DNAnexus, and is a consultant for Kunming University of Science and Technology as part of the 1000 China Talent Program; P.F. is a member of the scientific advisory board of Omicia; M.G. is an advisor to Bina and DNAnexus; F.C.L.H. is affiliated with ThermoFisher Scientific; N.H. is affiliated with Life Technologies; C.L. is a scientific advisor for BioNano Genomics; H.Y.K.L. is affiliated with Bina Technologies, which is part of Roche Sequencing; E.R.M. holds shares in Life Technologies; and G.M. is a co-founder of Genomics and a partner in Peptide Groove.

Figures

Figure 1. Population sampling.
a, Polymorphic variants within sampled populations. The area of each pie is proportional to the number of polymorphisms within a population. Pies are divided into four slices, representing variants private to a population (darker colour unique to population), private to a continental area (lighter colour shared across continental group), shared across continental areas (light grey), and shared across all continents (dark grey). Dashed lines indicate populations sampled outside of their ancestral continental region. b, The number of variant sites per genome. c, The average number of singletons per genome.
Figure 2. Population structure and demography.
a, Population structure inferred using a maximum likelihood approach with 8 clusters. b, Changes to effective population sizes over time, inferred using PSMC. Lines represent the within-population median PSMC estimate, smoothed by fitting a cubic spline passing through bin midpoints.
Figure 3. Population differentiation.
a, Variants found to be rare (<0.5%) within the global sample, but common (>5%) within a population. b, Genes showing strong differentiation between pairs of closely related populations. The vertical axis gives the maximum obtained value of the FST-based population branch statistic (PBS), with selected genes coloured to indicate the population in which the maximum value was achieved.
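The caption names the population branch statistic without defining it. As a minimal sketch, assuming the commonly used definition (Yi et al., Science 2010) rather than anything stated in this caption, each pairwise FST is converted to a branch length T = -log(1 - FST), and the PBS for a focal population A against populations B and C is (T_AB + T_AC - T_BC) / 2. The function names below are hypothetical.

```python
import math


def fst_to_branch_length(fst: float) -> float:
    """Convert a pairwise FST value to a branch length T = -log(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def population_branch_statistic(fst_ab: float, fst_ac: float, fst_bc: float) -> float:
    """PBS for focal population A, given the three pairwise FST values.

    Assumed definition: PBS_A = (T_AB + T_AC - T_BC) / 2.
    """
    t_ab = fst_to_branch_length(fst_ab)
    t_ac = fst_to_branch_length(fst_ac)
    t_bc = fst_to_branch_length(fst_bc)
    return (t_ab + t_ac - t_bc) / 2.0
```

A large PBS indicates that allele frequencies in the focal population have diverged from both comparison populations more than those two have diverged from each other.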
Figure 4. Imputation and eQTL discovery.
a, Imputation accuracy as a function of allele frequency for six populations. The inset compares imputation accuracy between phase 3 and phase 1, using all samples (solid lines) and intersecting samples (dashed lines). b, The average number of tagging variants (r² > 0.8) as a function of physical distance for common (top), low frequency (middle), and rare (bottom) variants. c, The proportion of top eQTL variants that are SNPs and indels, as discovered in 69 samples from each population. d, The percentage of eQTLs in TFBS, having performed discovery in the first population, and fine mapped by including an additional 69 samples from a second population (*P < 0.01, **P < 0.001, ***P < 0.0001, McNemar’s test). The diagonal represents the percentage of eQTLs in TFBS using the original discovery sample.
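To illustrate the tagging criterion in panel b, the following sketch computes the squared correlation between two biallelic sites from phased haplotypes coded 0/1 and counts how many candidate sites tag a target at the 0.8 threshold. It uses the standard haplotype-based definition of r² as an assumption; the caption does not say how the statistic was computed, and the function names are hypothetical.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation between two biallelic sites.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per haplotype.
    Returns 0.0 if either site is monomorphic (r^2 is undefined there).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def count_tagging_variants(target, candidates, threshold=0.8):
    """Count candidate sites whose r^2 with the target exceeds the threshold."""
    return sum(1 for c in candidates if r_squared(target, c) > threshold)
```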
Extended Data Figure 1. Summary of the callset generation pipeline.
Boxes indicate steps in the process and numbers indicate the corresponding section(s) within the Supplementary Information.
Extended Data Figure 2. Power of discovery and heterozygote genotype discordance.
a, The power of discovery within the main data set for SNPs and indels identified within an overlapping sample of 284 genomes sequenced to high coverage by Complete Genomics (CG), and against a panel of >60,000 haplotypes constructed by the Haplotype Reference Consortium (HRC). To provide a measure of uncertainty, one curve is plotted for each chromosome. b, Improved power of discovery in phase 3 compared to phase 1, as assessed in a sample of 170 Complete Genomics genomes that are included in both phase 1 and phase 3. c, Heterozygote discordance in phase 3 for SNPs, indels, and SVs compared to 284 Complete Genomics genomes. d, Heterozygote discordance for phase 3 compared to phase 1 within the intersecting sample. e, Sensitivity to detect Complete Genomics SNPs as a function of sequencing depth. f, Heterozygote genotype discordance as a function of sequencing depth, as compared to Complete Genomics data.
Extended Data Figure 3. Variant counts.
a, The number of variants within the phase 3 sample as a function of alternative allele frequency. b, The average number of detected variants per genome with whole-sample allele frequencies <0.5% (grey bars), with the average number of singletons indicated by colours.
Extended Data Figure 4. The standardized number of variant sites per genome, partitioned by population and variant category.
For each category, z-scores were calculated by subtracting the mean number of sites per genome (calculated across the whole sample), and dividing by the standard deviation. From left: sites with a derived allele, synonymous sites with a derived allele, nonsynonymous sites with a derived allele, sites with a loss-of-function allele, sites with an HGMD disease mutation allele, sites with a ClinVar pathogenic variant, and sites carrying a GWAS risk allele.
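The standardization described in this caption can be sketched as follows. This is an illustration only: the caption does not say whether the population or sample standard deviation was used, so the population form (`statistics.pstdev`) is an assumption, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(sites_per_genome):
    """Standardize per-genome site counts for one variant category.

    The mean and standard deviation are taken across the whole sample,
    as the caption describes. Population SD is an assumption.
    """
    mu = mean(sites_per_genome)
    sigma = pstdev(sites_per_genome)
    if sigma == 0:
        raise ValueError("all genomes have the same count; z-scores are undefined")
    return [(x - mu) / sigma for x in sites_per_genome]
```

The resulting per-genome z-scores would then be grouped by population for plotting, which makes categories with very different absolute counts comparable on one axis.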
Extended Data Figure 5. Population structure as inferred using the admixture program for K = 5 to 12.
Extended Data Figure 6. Allelic sharing.
a, Genotype covariance (above diagonal) and sharing of f2 variants (below diagonal) between pairs of individuals. b, Quantification of average f2 sharing between populations. Each row represents the distribution of f2 variants shared between individuals from the population indicated on the left to individuals from each of the sampled populations. c, The average number of f2 variants per haploid genome. d, The inferred age of f2 variants, as estimated from shared haplotype lengths, with black dots indicating the median value.
Extended Data Figure 7. Unsmoothed PSMC curves.
a, The median PSMC curve for each population. b, PSMC curves estimated separately for all individuals within the 1000 Genomes sample. c, Unsmoothed PSMC curves comparing estimates from the low coverage data (dashed lines) to those obtained from high coverage PCR-free data (solid lines). Notable differences are confined to very recent time intervals, where the additional rare variants identified by deep sequencing suggest larger population sizes.
Extended Data Figure 8. Genes showing very strong patterns of differentiation between pairs of closely related populations within each continental group.
Within each continental group, the maximum PBS statistic was selected from all pairwise population comparisons within the continental group against all possible out-of-continent populations. Note the x axis shows the number of polymorphic sites within the maximal comparison.
Extended Data Figure 9. Performance of imputation.
a, Performance of imputation in 6 populations using a subset of phase 3 as a reference panel (n = 2,445), phase 1 (n = 1,065), and the corresponding data within intersecting samples from both phases (n = 1,006). b, Performance of imputation from phase 3 by variant class.
Extended Data Figure 10. Decay of linkage disequilibrium as a function of physical distance.
Linkage disequilibrium was calculated around 10,000 randomly selected polymorphic sites in each population, having first thinned each population down to the same sample size (61 individuals). The plotted line represents a 5 kb moving average.
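The two processing steps in this caption, thinning each population to a common sample size and smoothing with a 5 kb moving average, can be sketched as below. The caption does not specify how individuals were chosen or how the window was positioned, so uniform random sampling, a centred window, and the 1 kb step are all assumptions, and the function names are hypothetical.

```python
import random


def thin_sample(sample_ids, size=61, seed=0):
    """Randomly draw `size` individuals without replacement (assumed uniform)."""
    rng = random.Random(seed)
    return rng.sample(list(sample_ids), size)


def moving_average(points, window_bp=5000, step_bp=1000):
    """Smooth (distance_bp, r2) pairs with a centred moving average.

    Returns a list of (centre_bp, mean_r2), skipping windows with no data.
    """
    pts = sorted(points)
    if not pts:
        return []
    half = window_bp / 2
    max_dist = pts[-1][0]
    smoothed = []
    centre = 0
    while centre <= max_dist:
        values = [r2 for d, r2 in pts if centre - half <= d <= centre + half]
        if values:
            smoothed.append((centre, sum(values) / len(values)))
        centre += step_bp
    return smoothed
```

Thinning to equal sample sizes matters because linkage disequilibrium estimates are biased upward in small samples, so unequal sizes would distort comparisons between populations.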

Source: PubMed
