Improved metagenomic analysis with Kraken 2

Derrick E Wood, Jennifer Lu, Ben Langmead

Abstract

Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

Keywords: Alignment-free methods; Metagenomics; Metagenomics classification; Microbiome; Minimizers; Probabilistic data structures.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Differences in operation between the two versions of Kraken. a Both versions of Kraken begin classifying a k-mer by computing its ℓ bp minimizer (highlighted in magenta). The default values of k and ℓ for each version are shown in the figure. b Kraken 2 applies a spaced seed mask of s spaces to the minimizer and calculates a compact hash code, which is then used as a search query in its compact hash table; the lowest common ancestor (LCA) taxon associated with the compact hash code is then assigned to the k-mer (see the “Methods” section for full details). In Kraken 1, the minimizer is used to accelerate the search for the k-mer, through the use of an offset index and a limited-range binary search; the association between k-mer and LCA is directly stored in the sorted list. c Kraken 2 also achieves lower memory usage than Kraken 1 by using fewer bits to store the LCA and storing a compact hash code of the minimizer rather than the full k-mer. d Impact on speed, memory usage, and prokaryotic genus F1-measure in Kraken 2 when changing k with respect to ℓ (ℓ = 31, s = 7 for all three graphs). e Impact on prokaryotic genus sensitivity and positive predictive value (PPV) when changing the number of minimizer spaces s (k = 35, ℓ = 31 for both graphs). In d and e, the data are from our parameter sweep results in Additional file 1: Table S2, and the default values of the independent variables for Kraken 2 are marked with a circle.
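The Kraken 2 lookup path described in panel b (minimizer, spaced seed mask, compact hash code, table query) can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not Kraken 2's actual implementation: the function names, the plain lexicographic minimizer ordering, the choice of masked positions, the BLAKE2b hash, the 16-bit code width, and the dictionary standing in for the compact hash table are all assumptions made for readability. The short values of k, ℓ (written `l` in the code), and s are likewise chosen only to keep the example small.

```python
import hashlib


def minimizer(kmer: str, l: int) -> str:
    """Return the lexicographically smallest l-mer contained in the k-mer.

    Assumption: plain lexicographic order on the forward strand only.
    """
    return min(kmer[i:i + l] for i in range(len(kmer) - l + 1))


def apply_spaced_seed(mini: str, s: int) -> str:
    """Mask s positions of the minimizer with a wildcard character.

    Assumption: every other position is masked, starting next to the last base.
    """
    chars = list(mini)
    pos = len(chars) - 2
    for _ in range(s):
        if pos < 0:
            break
        chars[pos] = "*"
        pos -= 2
    return "".join(chars)


def compact_hash(masked: str, bits: int = 16) -> int:
    """Hash the masked minimizer and keep only the top `bits` bits."""
    digest = hashlib.blake2b(masked.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - bits)


def classify_kmer(kmer: str, table: dict, l: int, s: int) -> int:
    """Look up the taxon stored for the k-mer's compact hash code (0 = no hit)."""
    code = compact_hash(apply_spaced_seed(minimizer(kmer, l), s))
    return table.get(code, 0)


# Hypothetical one-entry reference table: compact hash code -> LCA taxon ID.
reference_kmer = "ACGATTTT"
table = {compact_hash(apply_spaced_seed(minimizer(reference_kmer, 4), 1)): 562}

# This query differs from the reference only at the masked minimizer position,
# so it maps to the same compact hash code and receives the same taxon.
query_taxon = classify_kmer("ACTATTTT", table, l=4, s=1)
```

In this sketch the table is keyed by the hash of the masked minimizer rather than by the full k-mer, which mirrors the memory saving described in panel c, and the masked position is what lets a query with a single mismatch still find its reference entry.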
Fig. 2
Comparison between Kraken 2 and other sequence classification tools. a Processing speed (in millions of reads per minute) and memory usage (measured by maximum resident set size, in gigabytes) are shown for each classifier, as evaluated on 50 million paired-end simulated reads with 16 threads. Accuracy results are shown for b 40 prokaryotic genomes and c 10 viral genomes. Sensitivity, positive predictive value (PPV), and F1-measure are evaluated on a per-fragment basis at the genus rank, with 1000 reads simulated from each genome. The strains from which reads were simulated were excluded from the reference libraries for each classification tool. “Kraken 2X” is Kraken 2 using translated search against a protein database. Full results for these strain-exclusion experiments are available in Additional file 1: Table S1.
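The three accuracy measures named in this caption can be computed from per-fragment counts, as in the sketch below. It assumes commonly used definitions that this page does not spell out: sensitivity is the number of correctly classified fragments divided by all fragments, PPV is the number of correct classifications divided by all correct plus incorrect classifications, and F1 is the harmonic mean of the two. The function name and the counts are hypothetical and are not results from the paper.

```python
def genus_metrics(true_pos: int, false_pos: int, total_fragments: int):
    """Return (sensitivity, PPV, F1) from per-fragment counts at one rank.

    Assumed definitions:
      sensitivity = correct classifications / all fragments
      PPV         = correct classifications / (correct + incorrect)
      F1          = harmonic mean of sensitivity and PPV
    """
    sensitivity = true_pos / total_fragments if total_fragments else 0.0
    called = true_pos + false_pos
    ppv = true_pos / called if called else 0.0
    denom = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f1


# Hypothetical counts for one genome: 1000 simulated fragments,
# 800 classified to the correct genus, 50 to a wrong genus, 150 unclassified.
sens, ppv, f1 = genus_metrics(true_pos=800, false_pos=50, total_fragments=1000)
```

Unclassified fragments lower sensitivity but leave PPV unchanged under these definitions, which is why the two measures are reported separately.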

Source: PubMed
