Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads

Satomi Mitsuhashi, Martin C Frith, Takeshi Mizuguchi, Satoko Miyatake, Tomoko Toyota, Hiroaki Adachi, Yoko Oma, Yoshihiro Kino, Hiroaki Mitsuhashi, Naomichi Matsumoto

Abstract

Tandemly repeated DNA is highly mutable and causes at least 31 diseases, but it is hard to detect pathogenic repeat expansions genome-wide. Here, we report robust detection of human repeat expansions from careful alignments of long but error-prone (PacBio and nanopore) reads to a reference genome. Our method is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we prioritize pathogenic expansions within the top 10 out of 700,000 tandem repeats in whole genome sequencing data. This may help to elucidate the many genetic diseases whose causes remain unknown.
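
As a rough illustration of the quantities involved, the sketch below converts a per-read length difference in a repeat region into a change in copy number, then ranks loci by how far the patient's largest expansion exceeds anything seen in controls. The function names, the input layout, the rounding, and the scoring rule are all assumptions made for illustration; this is not the tandem-genotypes implementation.

```python
from typing import Dict, List


def copy_number_change(read_span: int, ref_span: int, unit_len: int) -> int:
    """Change in repeat copies implied by one read, relative to the reference.

    read_span and ref_span are the lengths (in bases) of the repeat region
    in the read and in the reference; unit_len is the repeat unit length.
    Rounding to a whole number of units is an assumption of this sketch.
    """
    return round((read_span - ref_span) / unit_len)


def rank_loci(patient: Dict[str, List[int]],
              controls: List[Dict[str, List[int]]]) -> List[str]:
    """Order loci so that expansions not seen in controls come first.

    Each dict maps a locus name to per-read copy number changes.
    Hypothetical score: the patient's largest change minus the largest
    change in any control (0 if no control read covers the locus).
    """
    def score(locus: str) -> int:
        control_max = max(
            (c for ctrl in controls for c in ctrl.get(locus, [])),
            default=0,
        )
        return max(patient[locus], default=0) - control_max

    return sorted(patient, key=score, reverse=True)
```

With this scoring, a locus where the patient has reads some 120 copies longer than the reference, while controls stay near zero, ranks above a locus where patient and controls are expanded by similar amounts.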

Keywords: Long-read sequencing; Nanopore; PacBio; Repeat diseases; Tandem repeat.

Conflict of interest statement

Ethics approval and consent to participate

The Institutional Review Board of Yokohama City University of Medicine approved the experimental protocols (IRB approval number: A180800011). Written informed consent was obtained from the patient, in accordance with Japanese regulatory requirements. Experimental methods comply with the Helsinki Declaration.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Flow chart to predict and prioritize tandem repeat changes, using the tandem-genotypes and tandem-genotypes-join programs. Long reads are aligned to a reference genome using LAST. Dotted square: program developed in this study
Fig. 2
a–j Distribution of predicted change in repeat copy number, for nanopore reads from each of ten plasmids. Red arrows: expected copy number change. Forward (red) and reverse strand reads (blue) are shown separately. y-axis: read count, x-axis: change in copy number relative to the reference plasmid. Reference copy numbers in each plasmid are in Additional file 1: Table S7. Black arrows: reads in these peaks may actually have shortened repeats
Fig. 3
a–j Distribution of predicted change in repeat copy number, for nanopore reads of human DNA with inserted repeats. Reads covering each of ten disease-associated repeat loci were selected, and the repeat region in each read was replaced by the repeat region of a plasmid nanopore read. y-axis: read count, x-axis: change in copy number relative to the reference human genome. Forward (red) and reverse strand reads (blue) are shown separately. Red arrows: projected repeat copy changes
Fig. 4
Distribution of predicted change in repeat copy number, for PacBio (RSII) reads of cloned SCA10 loci from 3 patients. a tandem-genotypes. Forward (red) and reverse strand reads (blue) are shown separately. b RepeatHMM with straightforward parameters. c RepeatHMM with non-obvious parameters suggested by its authors. y-axis: read count, x-axis: change in copy number relative to the reference human genome. Red arrow: expected repeat copy change, from McFarland et al. [9]
Fig. 5
PCR results and Sanger sequencing of two tandem-repeat loci in human genome NA12878, with expansions relative to the reference human genome (hg38). Reads were aligned by MUSCLE [33]. The histograms show read counts (y-axis) for predicted copy number changes (x-axis), with nanopore (rel3) in blue and PacBio (SRR3197748) in red. a Expansion of an inexact TATAT repeat in an intron of PCDH15: actually insertion of an AluYb8 SINE. b Expansion of an intergenic GT repeat: actually deletion in the reference human genome. The bands marked by asterisks were sequenced and proved to be non-target amplification
Fig. 6
Alignments of DNA reads (vertical) to the reference human genome (horizontal). Diagonal lines indicate alignments, of the same strands (red) and opposite strands (blue). The vertical stripes indicate repeat annotations in the reference genome: tandem repeats (purple) and transposable elements (pink). a Six reads from a BAFME patient that cover the disease-causing SAMD12 AAAAT repeat locus. b Close-ups of three reads with ~ 5 k expansions. c Two examples of chimeric human reads (rel3) with expanded CAG repeats at the ATXN7 disease locus
Fig. 7
a Prioritization of predicted repeat copy number changes in a BAFME patient. The BAFME expansion (AAATA: SAMD12 intron) is ranked 4th out of 0.7 million tandem repeats annotated in rmsk.txt. Forward (red) and reverse strand reads (blue) are shown separately. Histograms are raw output of tandem-genotypes-plot. b Prioritization of predicted repeat copy number changes in whole genome nanopore reads (NA12878 rel3) plus chimeric human/plasmid reads with pathological triplet-repeat expansions (AR, ATN1, ATXN2, ATXN3, ATXN7, CACNA1A, and HTT). These pathological expansions are prioritized within the top 10 out of 0.7 million tandem repeats in the genome, when compared to three control datasets using tandem-genotypes-join. c Comparison to control datasets with tandem-genotypes-join (joined) effectively de-prioritized other repeats, versus only using a single sample (single). y-axis: prioritization ranking
Fig. 8
a Genome-wide distribution of predicted change in repeat copy number, for nanopore MinION (rel3) and PacBio (SRR3197748) reads from the same human (NA12878). Nanopore tends to have negative and PacBio positive predicted changes, especially for short repeat units. Read number = 10,000 (randomly sampled). Nanopore reads are shown in blue and PacBio in red. b–d Genome-wide distribution of predicted change in repeat copy number for nanopore MinION (rel3) and PacBio (SRR3197748) reads from the same human (NA12878), and nanopore PromethION (ERR2585112-5) from a different individual (NA19240). b Distributions for AG di-nucleotide repeats. c Distributions for GAT. d Distributions for CTT. CTT shows the most prominent strand bias in nanopore rel3 reads among all types of triplet repeat (all types are in Additional file 1: Figures S9, S10, S13–S16). PromethION shows less strand bias compared to rel3. y-axis: read count, x-axis: change in copy number relative to the reference human genome
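
The histograms in these figures plot read count against change in copy number, with forward and reverse strand reads kept apart. A minimal sketch of that tally follows; the `(strand, change)` input layout and the function name are assumptions for illustration, not the output format of tandem-genotypes or tandem-genotypes-plot.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def strand_histograms(reads: Iterable[Tuple[str, int]]) -> Dict[str, Counter]:
    """Count reads per copy number change, separately for each strand.

    reads yields (strand, change) pairs, where strand is "+" or "-" and
    change is the copy number change relative to the reference.
    """
    hist = {"+": Counter(), "-": Counter()}
    for strand, change in reads:
        hist[strand][change] += 1
    return hist
```

Keeping the two strands in separate counters is what makes a strand bias visible: if one strand's counts are shifted relative to the other at the same locus, the shift is more likely a sequencing artifact than a real difference in repeat length.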

References

    1. Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, Ramakrishnan S, Lavrenko V, Kakaradov B, Hou C, et al. Profiling of short-tandem-repeat disease alleles in 12,632 human whole genomes. Am J Hum Genet. 2017;101:700–715. doi: 10.1016/j.ajhg.2017.09.013.
    2. La Spada AR, Roling DB, Harding AE, Warner CL, Spiegel R, Hausmanowa-Petrusewicz I, Yee WC, Fischbeck KH. Meiotic stability and genotype-phenotype correlation of the trinucleotide repeat in X-linked spinal and bulbar muscular atrophy. Nat Genet. 1992;2:301–304. doi: 10.1038/ng1292-301.
    3. MacDonald ME, Ambrose CM, Duyao MP, Myers RH, Lin C, Srinidhi L, Barnes G, Taylor SA, James M, Groot N, et al. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell. 1993;72:971–983.
    4. Brook JD, McCurrach ME, Harley HG, Buckler AJ, Church D, Aburatani H, Hunter K, Stanton VP, Thirion JP, Hudson T, et al. Molecular basis of myotonic dystrophy: expansion of a trinucleotide (CTG) repeat at the 3′ end of a transcript encoding a protein kinase family member. Cell. 1992;68:799–808. doi: 10.1016/0092-8674(92)90154-5.
    5. Kremer EJ, Pritchard M, Lynch M, Yu S, Holman K, Baker E, Warren ST, Schlessinger D, Sutherland GR, Richards RI. Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p (CCG)n. Science. 1991;252:1711–1714. doi: 10.1126/science.1675488.
    6. Lemmers RJ, van der Vliet PJ, Klooster R, Sacconi S, Camano P, Dauwerse JG, Snider L, Straasheijm KR, van Ommen GJ, Padberg GW, et al. A unifying genetic model for facioscapulohumeral muscular dystrophy. Science. 2010;329:1650–1653. doi: 10.1126/science.1189044.
    7. Brais B, Bouchard JP, Xie YG, Rochefort DL, Chretien N, Tome FM, Lafreniere RG, Rommens JM, Uyama E, Nohira O, et al. Short GCG expansions in the PABP2 gene cause oculopharyngeal muscular dystrophy. Nat Genet. 1998;18:164–167. doi: 10.1038/ng0298-164.
    8. Musova Z, Mazanec R, Krepelova A, Ehler E, Vales J, Jaklova R, Prochazka T, Koukal P, Marikova T, Kraus J, et al. Highly unstable sequence interruptions of the CTG repeat in the myotonic dystrophy gene. Am J Med Genet A. 2009;149A:1365–1374. doi: 10.1002/ajmg.a.32987.
    9. McFarland KN, Liu J, Landrian I, Godiska R, Shanker S, Yu F, Farmerie WG, Ashizawa T. SMRT sequencing of long tandem nucleotide repeats in SCA10 reveals unique insight of repeat expansion structure. PLoS One. 2015;10:e0135906. doi: 10.1371/journal.pone.0135906.
    10. Ishiura H, Doi K, Mitsui J, Yoshimura J, Matsukawa MK, Fujiyama A, Toyoshima Y, Kakita A, Takahashi H, Suzuki Y, et al. Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy. Nat Genet. 2018;50:581–590. doi: 10.1038/s41588-018-0067-2.
    11. Nishikawa A, Mitsuhashi S, Miyata N, Nishino I. Targeted massively parallel sequencing and histological assessment of skeletal muscles for the molecular diagnosis of inherited muscle disorders. J Med Genet. 2017;54:104–110. doi: 10.1136/jmedgenet-2016-104073.
    12. Cummings BB, Marshall JL, Tukiainen T, Lek M, Donkervoort S, Foley AR, Bolduc V, Waddell LB, Sandaradura SA, O’Grady GL, et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci Transl Med. 2017;9:eaal5209.
    13. Ameur A, Kloosterman WP, Hestand MS. Single-molecule sequencing: towards clinical applications. Trends Biotechnol. 2018. doi: 10.1016/j.tibtech.2018.07.013.
    14. Ummat A, Bashir A. Resolving complex tandem repeats with long reads. Bioinformatics. 2014;30:3491–3498. doi: 10.1093/bioinformatics/btu437.
    15. Liu Q, Zhang P, Wang D, Gu W, Wang K. Interrogating the “unsequenceable” genomic trinucleotide repeat disorders by long-read sequencing. Genome Med. 2017;9:65. doi: 10.1186/s13073-017-0456-7.
    16. Frith MC, Khan S. A survey of localized sequence rearrangements in human DNA. Nucleic Acids Res. 2018;46:1661–1673. doi: 10.1093/nar/gkx1266.
    17. Hamada M, Ono Y, Asai K, Frith MC. Training alignment parameters for arbitrary sequencers with LAST-TRAIN. Bioinformatics. 2017;33:926–928.
    18. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36:338–345. doi: 10.1038/nbt.4060.
    19. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15:461–468. doi: 10.1038/s41592-018-0001-7.
    20. Cretu Stancu M, van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp S, de Ligt J, Pregno G, Giachino D, Mandrile G, Espejo Valle-Inclan J, et al. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat Commun. 2017;8:1326. doi: 10.1038/s41467-017-01343-4.
    21. Mizuguchi T, Toyota T, Adachi H, Miyake N, Matsumoto N, Miyatake S. Detecting a long insertion variant in SAMD12 by SMRT sequencing: implications of long-read whole-genome sequencing for repeat expansion diseases. J Hum Genet. 2018. doi: 10.1038/s10038-018-0551-7.
    22. De Coster W, De Roeck A, De Pooter T, D’Hert S, De Rijk P, Strazisar M, Sleegers K. Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome. bioRxiv. 2018. doi: 10.1101/434118.
    23. Höijer I, Tsai YC, Clark TA, Kotturi P, Dahl N, Stattin EL, Bondeson ML, Feuk L, Gyllensten U, Ameur A. Detailed analysis of HTT repeat elements in human blood using targeted amplification-free long-read sequencing. Hum Mutat. 2018;39:1262–1272. doi: 10.1002/humu.23580.
    24. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
    25. Sone J, Mitsuhashi S, Fujita A, Mizuguchi T, Mori K, Koike H, Hashiguchi A, Takashima H, Sugiyama H, Kohno Y, et al. Long-read sequencing identifies GGC repeat expansion in human-specific NOTCH2NLC associated with neuronal intranuclear inclusion disease. bioRxiv:515635. doi: 10.1101/515635.
    26. Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 2011;39:e23. doi: 10.1093/nar/gkq1212.
    27. Frith MC. Gentle masking of low-complexity sequences improves homology search. PLoS One. 2011;6:e28819. doi: 10.1371/journal.pone.0028819.
    28. Oma Y, Kino Y, Sasagawa N, Ishiura S. Intracellular localization of homopolymeric amino acid-containing proteins expressed in mammalian cells. J Biol Chem. 2004;279:21217–21222. doi: 10.1074/jbc.M309887200.
    29. Kino Y, Washizu C, Kurosawa M, Oma Y, Hattori N, Ishiura S, Nukina N. Nuclear localization of MBNL1: splicing-mediated autoregulation and repression of repeat-derived aberrant proteins. Hum Mol Genet. 2015;24:740–756. doi: 10.1093/hmg/ddu492.
    30. Oma Y, Kino Y, Toriumi K, Sasagawa N, Ishiura S. Interactions between homopolymeric amino acids (HPAAs). Protein Sci. 2007;16:2195–2204. doi: 10.1110/ps.072955307.
    31. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–D745. doi: 10.1093/nar/gkv1189.
    32. Morgulis A, Gertz EM, Schaffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006;22:134–141. doi: 10.1093/bioinformatics/bti774.
    33. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.

Source: PubMed
