A map of human genome variation from population-scale sequencing

1000 Genomes Project Consortium, Gonçalo R Abecasis, David Altshuler, Adam Auton, Lisa D Brooks, Richard M Durbin, Richard A Gibbs, Matt E Hurles, Gil A McVean

Abstract

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Figures

Figure 1. Properties of the variants found
a, Venn diagrams showing the numbers of SNPs identified in each pilot project in each population or analysis panel, subdivided according to whether the SNP was present in dbSNP release 129 (“Known”) or not (“Novel”). Exon analysis panel AFR is YRI+LWK, ASN is CHB+CHD+JPT, and EUR is CEU+TSI. Note that the scale for the exon project column is much larger than for the other pilots. b, The number of variants per Mb at different allele frequencies divided by the expectation under the neutral coalescent (1/i, where i is the variant allele count), thus giving an estimate of theta per megabase. Blue: low coverage SNPs, red: low coverage indels, black: low coverage genotyped large deletions, green: exon SNPs. The spikes at the right ends of the lines correspond to excess variants for which all samples differed from the reference (approximately 1 per 30 kb), consistent with errors in the reference sequence. c, Fraction of variants in each allele frequency class that were novel. Novelty was determined by comparison to dbSNP release 129 for SNPs and small indels, dbVar (June 2010) for deletions, and two published genomes for larger indels. d, Size distribution and novelty of variants discovered in the low coverage project. SNPs are shown in blue, deletions with respect to the reference sequence in red, and insertions or duplications with respect to the reference in green. The fraction of variants in each size bin that were novel is shown by the purple line, and is defined relative to dbSNP (SNPs and indels), dbVar (deletions, duplications, mobile element insertions), dbRIP and other studies (mobile element insertions), the Venter and Watson genomes (indels and deletions), and indels from split capillary reads (indels and deletions). To account for ambiguous placement of many indels, discovered indels were deemed to match known indels if they were within 25 bp of a known indel of the same size. To account for imprecise knowledge of the location of most deletions and duplications, discovered variants were deemed to match known variants if they had > 50% reciprocal overlap.
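The panel b normalization and the panel d matching rules can be illustrated with a minimal Python sketch. It is not the consortium's code. The function names, the half-open coordinate convention, and the choice to treat "within 25 bp" as an inclusive distance are all assumptions made for illustration.

```python
def theta_per_mb(sfs, length_mb):
    """Per-class estimate of theta per megabase.

    sfs maps a variant allele count i to the number of variants observed
    with that count. Under the neutral coalescent the expected number of
    such variants is proportional to 1/i, so dividing the observed count
    by 1/i (that is, multiplying by i) gives one theta estimate per class.
    """
    return {i: i * n / length_mb for i, n in sfs.items()}


def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions for half-open intervals [start, end)."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def matches_known_sv(discovered, known, threshold=0.5):
    """True if two (start, end) intervals have more than 50% reciprocal overlap."""
    return reciprocal_overlap(*discovered, *known) > threshold


def matches_known_indel(pos, size, known_pos, known_size, window=25):
    """True if a known indel of the same size lies within `window` bp.

    Treating the window as inclusive is an assumption of this sketch.
    """
    return size == known_size and abs(pos - known_pos) <= window
```

Under this sketch, a spectrum that follows the neutral expectation exactly yields the same theta value in every allele-count class, which is why departures from a flat line in panel b are informative.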
Figure 2. Variant discovery rates and genotype accuracy in the low coverage project
a, Rates of low coverage variant detection by allele frequency in CEU. Lines show the fraction of variants seen in overlapping samples in independent studies that were also found to be polymorphic in the low coverage project (in the same overlapping samples), as a function of allele count in the 60 low coverage samples. Note that we plot power against expected allele count in 60 samples, e.g. a variant present in, say, 2 copies in an overlap of 30 samples is expected to be present 4 times in 60 samples. The crosses on the right represent the average discovery fraction for all variants having more than 10 copies in the sample. Colours correspond to: (red) HapMap II sites, excluding sites also in HapMap 3 (43 overlapping samples); (blue) exon project sites (57 overlapping samples); (green) deletions from Conrad et al. (60 overlapping samples; deletions were classified as “found” if there was any overlap). b, Estimated rates of discovery of variants at different frequencies in the CEU (blue), a population related to the CEU with Fst = 1% (green) and across Europe as a whole (light blue). The inset shows a cartoon of the statistical model for population history and thus allele frequencies in related populations where an ancestral population gave rise to many equally related populations, one of which (blue circle) has samples sequenced. c, SNP genotype accuracy by allele frequency in the CEU low coverage project, measured by comparison to HapMap II genotypes at sites present in both call sets, excluding sites that were also in HapMap 3. Lines represent the average accuracy of homozygote reference (red), heterozygote (green) and homozygote alternative calls (blue) as a function of the alternative allele count in the overlapping set of 43 samples, and the overall genotype error rate (grey, at bottom of plot). The inset shows the number of each genotype class as a function of alternative allele count. d, Coverage and accuracy for the low coverage and exon projects as a function of depth threshold. For 41 CEU samples sequenced in both the exon and low coverage projects, on the x axis is shown the number of non-reference SNP genotype calls at HapMap II sites not in HapMap 3 that were called in the exon project target region, and on the y axis is shown the number of these calls that were not variant (i.e., are reference homozygote and thus incorrectly were called as variant) according to HapMap II. Each point plotted corresponds to a minimum depth threshold for called sites. Grey lines show constant error rates. The exon project calls (red) were made independently per sample, whereas the low coverage calls (blue), which were only slightly less accurate, were made using LD information that combined partial information across samples and sites in an imputation-based algorithm. The additional data added from point “1” to point “0” (upper right in the figure) for the low coverage project were completely imputed.
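Two pieces of arithmetic in this caption can be made concrete: the rescaling of allele counts to 60 samples in panel a, and the error rate implied by a point in panel d. The Python sketch below uses hypothetical function names, and reading the panel d error rate as the y value divided by the x value is an assumption based on the axis descriptions.

```python
def expected_allele_count(copies, overlap_samples, target_samples=60):
    """Scale an allele count seen in an overlap set up to the full sample size."""
    return copies * target_samples / overlap_samples


def false_variant_rate(non_reference_calls, calls_not_variant_in_truth):
    """Fraction of non-reference calls that the truth set reports as reference.

    Corresponds to y / x for a point in panel d, so points on a straight
    line through the origin share the same error rate.
    """
    if non_reference_calls <= 0:
        raise ValueError("non_reference_calls must be positive")
    return calls_not_variant_in_truth / non_reference_calls
```

With the caption's own example, `expected_allele_count(2, 30)` gives 4.0, matching the statement that 2 copies in 30 samples correspond to 4 expected copies in 60.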
Figure 3. The value of additional samples for variant discovery
The fraction of variants present in an individual that would not have been found in a sequenced reference panel, as a function of reference panel size and the sequencing strategy. The lines represent predictions for Synonymous (Syn), Nonsynonymous (NonSyn), and Loss of function (LOF) variant classes, broken down by sequencing category: full sequencing as for exons (Full) and low coverage sequencing (LowCov). The values were calculated from observed distributions of variants of each class in 321 East Asian samples (CHB, CHD and JPT populations) in the exon data, and power to detect variants at low allele counts in the reference panel from Figure 2a.
Figure 4. Imputation from the low coverage data
a, Accuracy of imputing variant genotypes using HapMap 3 sites to impute sites from the low coverage (LC) project into the trio fathers as a function of allele frequency. Accuracy of imputing genotypes from the HapMap II reference panels is also shown. Imputation accuracy for common variants was generally a few percent worse from the low coverage project than from HapMap, although error rates increase for less common variants. b, An example of imputation in a cis-eQTL for TIMM22, for which the original Illumina 300K genotype data gave a weak signal. Imputation using HapMap data made a small improvement, and imputation using low coverage haplotypes provided a much stronger signal.
Figure 5. Variation around genes
a, Diversity in genes calculated from the CEU low coverage genotype calls (upper) and diversity divided by divergence between humans and rhesus macaque (lower). Within each element averaged diversity is shown for the first and last 25 base pairs, with the remaining 150 positions sampled at fixed distances across the element (elements shorter than 150 base pairs were not considered). Note that estimates of diversity will be reduced compared to the true population value due to the reduced power for rare variants, but relative values should be little affected. b, Average autosomal diversity divided by divergence, as a function of genetic distance from coding transcripts, calculated at putatively neutral sites, i.e., excluding phastcons conserved noncoding sequences and all sites in coding exons but four-fold degenerate sites. c, Numbers of SNPs showing increasingly high levels of differentiation in allele frequency between the CEU and CHB+JPT (red), CEU and YRI (green) and CHB+JPT and YRI (blue). Lines indicate synonymous variants (dashed), nonsynonymous variants (dotted) and other variants (solid). The most highly differentiated genic SNPs were enriched for nonsynonymous variants, indicating local adaptation. d, The decay of population differentiation around genic SNPs showing extreme allele frequency differences between populations (difference in frequency of at least 0.8 between populations, thinned so there is no more than one per gene considered). For all such SNPs the highest allele frequency difference in bins of 0.01 cM away from the variant was recorded and averaged.
Figure 6. Recombination
a, Improved resolution of hotspot boundaries. The average recombination rate estimated from low coverage project data around recombination hotspots detected in HapMap II. Recombination hotspots were narrower, and in CEU (orange) and CHB+JPT (purple) more intense than previously estimated. b, The concentration of recombination in a small fraction of the genome, one line per chromosome. If recombination were uniformly distributed throughout the genome, then the lines on this figure would appear along the diagonal. Instead, most recombination occurs in a small fraction of the genome. Recombination rates in YRI (green) appeared to be less concentrated in recombination hotspots than CEU (orange) or CHB+JPT (purple). HapMap II estimates are shown in black. c, The relationship between genetic variation and recombination rates in the YRI population. The top plot shows average levels of diversity, measured as mean number of segregating sites per base, surrounding occurrences of the previously described hotspot motif (CCTCCCTNNCCAC, red line) and a closely related, but not recombinogenic DNA sequence (CTTCCCTNNCCAC, green line). The lighter red and green shaded areas give 95% confidence intervals on diversity levels. The bottom plot shows estimated mean recombination rates surrounding motif occurrences, with colours defined as in the top plot.
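A concentration curve of the kind described in panel b can be sketched as follows. This is an illustrative Python example, not the authors' method. The function name, the input layout (interval lengths with per-unit-length rates), and the axis orientation (sequence fraction first, recombination fraction second) are assumptions.

```python
def concentration_curve(lengths, rates):
    """Cumulative share of recombination versus cumulative share of sequence.

    lengths: physical length of each interval
    rates:   recombination rate per unit length in each interval
    Intervals are ranked from the highest rate to the lowest, so the curve
    rises steeply when a small part of the sequence carries most of the
    recombination and follows the diagonal when rates are uniform.
    """
    intervals = sorted(zip(lengths, rates), key=lambda lr: lr[1], reverse=True)
    total_len = sum(length for length, _ in intervals)
    total_rec = sum(length * rate for length, rate in intervals)
    points = [(0.0, 0.0)]
    cum_len = 0.0
    cum_rec = 0.0
    for length, rate in intervals:
        cum_len += length
        cum_rec += length * rate
        points.append((cum_len / total_len, cum_rec / total_rec))
    return points
```

For four equal-length intervals with identical rates the points fall on the diagonal, whereas rates of 7, 1, 1 and 1 place 70% of the recombination in the first 25% of the sequence.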

Source: PubMed
