Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding

Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jeffrey K Ichikawa, Clarence C Lee, Zheng Zhang, Swati S Ranade, Eileen T Dimalanta, Fiona C Hyland, Tanya D Sokolsky, Lei Zhang, Andrew Sheridan, Haoning Fu, Cynthia L Hendrickson, Bin Li, Lev Kotler, Jeremy R Stuart, Joel A Malek, Jonathan M Manning, Alena A Antipova, Damon S Perez, Michael P Moore, Kathleen C Hayashibara, Michael R Lyons, Robert E Beaudoin, Brittany E Coleman, Michael W Laptewicz, Adam E Sannicandro, Michael D Rhodes, Rajesh K Gottimukkala, Shan Yang, Vineet Bafna, Ali Bashir, Andrew MacBride, Can Alkan, Jeffrey M Kidd, Evan E Eichler, Martin G Reese, Francisco M De La Vega, Alan P Blanchard

Abstract

We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.
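The error correction mentioned in the abstract rests on two-base (color-space) encoding, in which each color call reports a pair of adjacent bases. The following is a minimal sketch, assuming the standard SOLiD color code (A=0, C=1, G=2, T=3, with each color the XOR of two adjacent base codes). The function names `encode_colors` and `classify_mismatches` are hypothetical and are not taken from the authors' pipeline.

```python
# Illustrative sketch of two-base (color-space) encoding; names are hypothetical.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Return one color per adjacent base pair (XOR of the 2-bit base codes)."""
    codes = [BASE_CODE[b] for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def classify_mismatches(ref_colors, read_colors):
    """Classify color differences between equal-length reference and read colors.

    'match'            : no color differs
    'single'           : one isolated color mismatch (consistent with a measurement error)
    'valid_adjacent'   : two adjacent mismatches consistent with a single-base change
    'invalid_adjacent' : two adjacent mismatches that no single-base change can produce
    'other'            : any other pattern
    """
    diffs = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "single"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # Changing one base leaves the XOR of its two flanking colors unchanged.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "valid_adjacent"
        return "invalid_adjacent"
    return "other"
```

For example, `encode_colors("ACGT")` gives `[1, 3, 1]` and `encode_colors("ATGT")` gives `[3, 1, 1]`, so the C-to-T change appears as two adjacent color mismatches that `classify_mismatches` labels `valid_adjacent`. A lone color mismatch cannot come from a base substitution under this code, which is why it can be treated as a likely measurement error.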

Figures

Figure 1.
Cumulative plot of sequence and clone coverage from uniquely placed fragments and uniquely placed mate pairs. The sequence coverage is derived from the fragment, 2 × 25 mate-paired, and 2 × 50 mate-paired libraries while the clone coverage is from only the mate-paired libraries (2 × 25 and 2 × 50).
Figure 2.
Uniquely placed mate pairs provide a more comprehensive sampling of the human genome than the unique placement of each of the tags independently. The coverage is separated by mate-paired data treated as single tags before pairing (mate pairs, unpaired; blue) and mate-paired data treated as mate pairs (mate pairs, paired; pink).
Figure 3.
Dependence of genotype calling on depth of sequence coverage. The NA18507 genotypes called by SOLiD at all HapMap loci are compared with the HapMap genotypes by SOLiD coverage per genome position (average 18× coverage). Coverage includes alleles representing the reference or a valid base change; i.e., alleles with single or invalid adjacent mismatches are not included. No prior information about SNP presence or SNP alleles was used in making SOLiD genotype calls. The number of HapMap loci with a given level of SOLiD coverage (“Count”) is shown, and the percentage of these loci for which SOLiD gives the same genotype as HapMap for homozygotes and heterozygotes is represented by the colored lines (graphed using the left-hand y-axis and referred to as “% Concordance”) using two genotyping algorithms: Consensus Caller and diBayes. diBayes is more sensitive at heterozygous SNP detection and yields a lower false-negative rate than Consensus Caller, but we did not attempt to estimate the false-positive rate of diBayes with validation data. SOLiD genotypes that differ from HapMap genotypes are nearly always heterozygous undercalls (i.e., the position is called homozygous for one of the two alleles) or called as N (insufficient evidence to make a confident genotype call).
Figure 4.
Length distributions of small and medium insertions and deletions under sequencing reads with respective concordances. Deletions are detected up to 500 bp and insertions up to 20 bp. A high prevalence of small indels, even-sized indels, and Alu-sized deletions (300–350 bp) is found in this genome. Larger indels (deletions of 12 bp or more and insertions of 4 bp or more) are called with more restrictive settings (see Methods) than smaller ones.
Figure 5.
Length distributions of large insertions and deletions identified between mate-paired tags. There is an abundance of insertions and deletions in the size range of Alus as well as a spike in the number of deletions in the size range of LINEs (6000 bp).
Figure 6.
The distribution of the 193 deletions identified in NA18507 with SOLiD by both the intra-read and inter-read approaches. (Inset) A 328-bp deletion detected using both the inter- and intra-read approaches. Four nonredundant molecules identify the deletion with the intra-read approach while 81 clones identify the deletion with the inter-read approach. This deletion has also been found in the Venter, Watson, and YH genomes.
Figure 7.
Copy number variations detected with SOLiD mate-paired reads in NA18507. (A) The size distribution of CNVs detected with SOLiD mate-paired reads. (B) Overlap of copy numbers computed from normalized SOLiD coverage and from Affymetrix array CGH (aCGH) (McCarroll and Altshuler 2007). Colors indicate CNV calls from aCGH. On the top of the figure are the numbers of SOLiD CNV calls that overlap with aCGH data at each copy number.
Figure 8.
Theoretical and actual detection of SNPs and indels at various levels of average sequence coverage. (A) The upper bound on the number of SNPs and intra-read indels that can be detected at various levels of coverage. This is calculated by assessing how much of the genome meets the coverage requirements for each type of variant, 2× coverage for homozygous SNPs, 4× coverage for heterozygous SNPs, and 6× coverage without considering the 3 bp on each end of the reads for intra-read indels. For small indels, two split reads are required to make a call, but due to the more restrictive manner of these calls, only about one in three reads (as found in simulations) can be used for this. (B) The actual number of SNPs and intra-read indels detected at various levels of average sequence coverage. (C) The number of insertions and deletions ≥200 bp detected between mate-paired reads at various average levels of sequence coverage.
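The upper bound in Figure 8A depends on how much of the genome reaches a minimum depth for each variant type. The caption derives this from observed coverage; as a rough illustration only, the sketch below assumes per-position depth follows a Poisson distribution around the average coverage, which is a simplification not stated in the source. The thresholds (2×, 4×, 6×) come from the caption; the function names are hypothetical.

```python
import math

# Minimum depth per variant type, as stated in the Figure 8 caption.
THRESHOLDS = {"homozygous SNP": 2, "heterozygous SNP": 4, "intra-read indel": 6}


def fraction_at_least(min_depth, mean_coverage):
    """P(depth >= min_depth), ASSUMING depth ~ Poisson(mean_coverage)."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage**k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


def callable_fractions(mean_coverage):
    """Fraction of positions meeting each threshold under the Poisson assumption."""
    return {
        name: fraction_at_least(depth, mean_coverage)
        for name, depth in THRESHOLDS.items()
    }


if __name__ == "__main__":
    for coverage in (2, 4, 8, 12, 18):
        fractions = callable_fractions(coverage)
        summary = ", ".join(f"{name}: {frac:.3f}" for name, frac in fractions.items())
        print(f"{coverage}x -> {summary}")
```

Real coverage is more uneven than a Poisson model predicts, and the sketch ignores the exclusion of 3 bp at each read end as well as the roughly one-in-three usable reads for indels, so these fractions should be read as optimistic.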

Source: PubMed
