Performance of neural network basecalling tools for Oxford Nanopore sequencing

Ryan R Wick, Louise M Judd, Kathryn E Holt

Abstract

Background: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish.

Results: Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences ('polishing') with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy.

Conclusions: Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT's Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

Keywords: Basecalling; Long-read sequencing; Oxford Nanopore.

Conflict of interest statement

In July 2018, Ryan Wick attended a hackathon in Bermuda at ONT’s expense. ONT also paid his travel, accommodation and registration to attend the London Calling (2017) and Nanopore Community Meeting (2017) events as an invited speaker.

Figures

Fig. 1
Read accuracy, consensus accuracy and speed performance for each basecaller version, plotted against the release date (version numbers specified in Additional file 2: Table S3). Accuracies are expressed as qscores (also known as Phred quality scores) on a logarithmic scale where Q10 = 90%, Q20 = 99%, Q30 = 99.9%, etc. Each basecaller was run using its default model, except for Guppy v2.2.3, which was also run with its included flip-flop model and our two custom-trained models.
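The qscore scale in the Fig. 1 caption follows the standard Phred relationship Q = -10 · log10(1 - accuracy). The snippet below is a minimal sketch of that conversion in both directions; the function names are illustrative and are not taken from the authors' analysis scripts.

```python
import math


def accuracy_to_qscore(accuracy):
    """Convert a fractional accuracy (0 <= accuracy < 1) to a Phred-style qscore."""
    if not 0.0 <= accuracy < 1.0:
        # An accuracy of exactly 1.0 would correspond to an infinite qscore.
        raise ValueError("accuracy must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def qscore_to_accuracy(qscore):
    """Convert a Phred-style qscore back to a fractional accuracy."""
    return 1.0 - 10.0 ** (-qscore / 10.0)


if __name__ == "__main__":
    for acc in (0.90, 0.99, 0.999):
        print(f"{acc:.1%} identity ~ Q{accuracy_to_qscore(acc):.1f}")
```

Because the scale is logarithmic, each additional 10 qscore points corresponds to a tenfold reduction in error rate.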
Fig. 2
Read and consensus accuracy from Guppy v2.2.3 for a variety of genomes using different models: the default RGRGR model, the included flip-flop model and the two custom models we trained for this study. Both custom models used the same training set, which focused primarily on K. pneumoniae, secondarily on the Enterobacteriaceae family and lastly on the Proteobacteria phylum.
Fig. 3
Consensus errors per basecaller for the K. pneumoniae benchmarking set, broken down by type. Dcm refers to errors occurring in the CCAGG/CCTGG Dcm motif. Homopolymer errors are changes in the length of a homopolymer three or more bases in length (in the reference). This plot is limited to basecallers/versions with less than 1.2% consensus error and excludes redundant results from similar versions.
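The two sequence contexts named in the Fig. 3 caption can be located in a reference with simple pattern matching. The sketch below is only an illustration of those definitions (homopolymers of three or more bases, and the CCAGG/CCTGG Dcm motif); the function names are hypothetical, and this is not the authors' error-classification code, which would also need an alignment of the consensus to the reference.

```python
import re

# CC(A/T)GG, matched with a lookahead so that every start position is reported.
DCM_RE = re.compile(r"(?=CC[AT]GG)")


def homopolymer_runs(reference, min_len=3):
    """Return (start, length, base) for each homopolymer of at least min_len bases."""
    # ([ACGT])\1{n,} matches one base followed by at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.end() - m.start(), m.group(1))
        for m in pattern.finditer(reference.upper())
    ]


def dcm_sites(reference):
    """Return the 0-based start position of each CCAGG or CCTGG motif."""
    return [m.start() for m in DCM_RE.finditer(reference.upper())]


if __name__ == "__main__":
    ref = "ACGTTTTACCAGGAAAC"
    print(homopolymer_runs(ref))
    print(dcm_sites(ref))
```

A consensus error could then be assigned to one of these categories by checking whether its reference coordinate falls inside a reported run or motif.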
Fig. 4
Consensus accuracy before (red) and after (blue) Nanopolish for the assemblies of the K. pneumoniae benchmarking set.


Source: PubMed
