Multi-platform discovery of haplotype-resolved structural variation in human genomes

Mark J P Chaisson, Ashley D Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J Gardner, Oscar L Rodriguez, Li Guo, Ryan L Collins, Xian Fan, Jia Wen, Robert E Handsaker, Susan Fairley, Zev N Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M Wenger, Alex R Hastie, Danny Antaki, Thomas Anantharaman, Peter A Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T Chuang, Christine C Lambert, Deanna M Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur Galeev, David U Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Ernest T Lam, Jong Eun Lee, Joyce Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine Viaud-Martinez, Sascha Meiers, Katherine M Munson, Fabio C P Navarro, Bradley J Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy W C Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian Stütz, Diana C J Spierings, Alistair Ward, AnneMarie E Welch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei Yakneen, Steven McCarroll, Goo Jun, Li Ding, Chong Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark B Gerstein, Pui-Yan Kwok, Peter M Lansdorp, Gabor T Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E Devine, Michael E Talkowski, Ryan E Mills, Tobias Marschall, Jan O Korbel, Evan E Eichler, Charles Lee

Abstract

The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios and define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome, 58 of which intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three- to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community, allowing us to make recommendations for maximizing structural variation sensitivity in future genome sequencing studies.

Conflict of interest statement

J.K., C.-S.C., C.C.L., and A.M.W. are employees and shareholders of Pacific Biosciences (aka PacBio); A.R.H., T.A., H.C., E.T.L., J.L., and A.W.C.P. are employees and shareholders of Bionano Genomics; D.M.C., W.H.H., P.M., S.K.-P., and W.X. are employees and shareholders of 10X Genomics; J.F. is an employee of Illumina; J.E.L. is an employee of DNALink; S.P.L. is an employee of TreeCode Sdn Bhd. P.F. is a member of the scientific advisory board (SAB) of Fabric Genomics, Inc., and Eagle Genomics, Ltd. E.E.E. is on the SAB of DNAnexus, Inc. and was a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program (2014–2016). C.L. was on the SAB of Bionano Genomics. All other authors declare no competing interests.

Figures

Fig. 1
Characteristics of SNV-based haplotypes obtained from different data sources. a Distribution of phased block lengths for the YRI child NA19240. Note that Strand-seq haplotypes span whole chromosomes and therefore one block per chromosome is shown. Vertical bars highlight N50 haplotype length: the minimum length haplotype block at which at least half of the phased bases are contained. For Illumina (IL) paired-end data, phased blocks cover <50% of the genome and hence the N50 cannot be computed. b Fraction of phase connection, i.e., pairs of consecutive heterozygous variants provided by each technology (averaged over all proband samples). c Pairwise comparisons of different phasings; colors encode switch error rates (averaged over all proband samples). For each row, a green box indicates the phasing of an independent technology with best agreement, with corresponding switch error rates given in green. d Each phased block is shown in a different color. The largest block is shown in cyan, i.e., all cyan regions belong to one block, even though interspaced by white areas (genomic regions where no variants are phased) or disconnected small blocks (different colors). e Fraction of heterozygous SNVs in the largest block shown in d. f Mismatch error rate of largest block compared to trio-based phasing, averaged over all chromosomes of all proband genomes (i.e., the empirical probability that any two heterozygous variants on a chromosome are phased correctly with respect to each other, in contrast to the switch error rate, which relays the probability that any two adjacent heterozygous variants are phased correctly). (*) Not available because trio phasing is used as reference for comparisons. (**) Not shown as population-based phasing does not output block boundaries; refer to Supplementary Material for an illustration of errors in population-based phasing
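The N50 haplotype length described in panel a can be illustrated with a minimal Python sketch. The function name `haplotype_n50` and the optional `total` argument are assumptions made for illustration, not part of the authors' pipeline. By default the total is the sum of the phased block lengths, as in the caption's definition. Passing a genome length instead is one plausible reading of why the caption says no N50 can be computed when phased blocks cover less than half of the genome.

```python
def haplotype_n50(block_lengths, total=None):
    """Return the N50 of phased block lengths, or None if undefined.

    block_lengths: lengths (in bases) of the phased haplotype blocks.
    total: reference total; defaults to the sum of the block lengths.
           Passing a genome length here is an illustrative assumption.
    """
    if total is None:
        total = sum(block_lengths)
    running = 0
    # Walk the blocks from longest to shortest and stop at the first
    # block where the cumulative length reaches half of the total.
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # The blocks never reach half of the total, so N50 is undefined.
    return None


blocks = [100, 200, 300, 400]            # toy block lengths, not real data
n50_phased = haplotype_n50(blocks)       # relative to phased bases
n50_sparse = haplotype_n50([100, 200], total=1000)  # covers < 50% of total
```

With the toy values, the two longest blocks (400 and 300) together cover at least half of the 1,000 phased bases, so the N50 is 300. In the second call the blocks cover only 300 of 1,000 bases and the function returns `None`.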
Fig. 2
Comparison and integration of indel and SV callsets on HG00733, HG00514, and NA19240. a Length distribution of deletions and insertions identified by PB (blue), IL (red) and BNG (brown), respectively, together with averaged length distribution of SVs discovered in the maternal genomes by the 1KG-P3 report (silver). b Number of SVs discovered by one or multiple genome platforms in the YRI child NA19240. c Overlap of IL indel discovery algorithms, with total number of indels found by each combination of IL algorithms (gray) and those that overlapped with a PB indel (blue) in the CHS child HG00514
Fig. 3
Characterization of simple and complex inversions. a Integration of inversions across platforms based on reciprocal overlap. Shown is an example of five orthogonal platforms intersecting at a homozygous inversion, with breakpoint ranges and supporting Strand-seq signature illustrated in bottom panels. b Size distribution of inversions included in the unified inversion list, subdivided by technology, with the total inversions (N) contributed by each listed. c Classification of Strand-seq inversions based on orthogonal phase support. Illustrative examples of simple (homozygous and heterozygous) and complex (inverted duplication) events are shown. Strand-seq inversions were identified based on read directionality (read count; reference reads in gray, inverted reads in purple), the relative ratio of reference to inverted reads within the locus (read ratio), and the haplotype structure of the inversion, with phased read data considered in terms of directionality (Ph; H1 alleles in red, H2 alleles in blue; alleles from reference reads are displayed above the ideogram and alleles from inverted reads are displayed below). ILL Illumina. liWGS long-insert whole-genome sequencing libraries. PB Pacific Biosciences. StS Strand-seq. BNG Bionano Genomics. SD segmental duplication. Ph phase data
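The reciprocal-overlap criterion mentioned in panel a can be sketched as follows. This is a generic illustration rather than the authors' implementation: the function name, the half-open coordinate convention, and the 50% threshold are all assumptions, since the caption does not state the cutoff that was used.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the reciprocal overlap of two half-open intervals.

    The value is the shared length divided by the longer interval's
    length, i.e. the smaller of the two overlap fractions.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


MIN_RECIPROCAL_OVERLAP = 0.5  # hypothetical threshold, not from the source

# Two toy inversion calls from different platforms (made-up coordinates).
score = reciprocal_overlap(100, 200, 150, 250)
same_event = score >= MIN_RECIPROCAL_OVERLAP
```

Taking the minimum of the two fractions means a small call nested inside a much larger one is not merged with it, even though the small call is fully covered.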
Fig. 4
Concordance of IL methods compared against total IL callset and PB callset using orthogonal technologies. Results by algorithm shown for a the deletion concordance for individual methods, b the union of all pairs of methods, and c the requirement that more than one caller agree on any call. Individual callers are shown as red points for comparison. Pairs and triples of combinations are in black points. The solid and dashed lines represent the 5% and 10% non-concordance rates (NCR), respectively. The top five combinations of methods in each plot below the 10% NCR, along with the individual plots, are labeled. d Overlap of IL-SV discovery algorithms, with total number of SVs found by each combination of IL algorithms (gray) and those that overlapped with the PB-SV calls (blue) in the YRI child NA19240. e PCA of the genotypes of concordant calls of each method: PC 1 versus 2 (left), PC 2 versus 3 (right). VH VariationHunter

Source: PubMed
