Improved metagenomic analysis with Kraken 2
Derrick E Wood, Jennifer Lu, Ben Langmead, Derrick E Wood, Jennifer Lu, Ben Langmead
Abstract
Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
Keywords: Alignment-free methods; Metagenomics; Metagenomics classification; Microbiome; Minimizers; Probabilistic data structures.
Conflict of interest statement
The authors declare that they have no competing interests.
Figures
References
- Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721–1729. doi: 10.1101/gr.210641.116.
- Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015;16:236. doi: 10.1186/s12864-015-1419-2.
- Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:11257. doi: 10.1038/ncomms11257.
- Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46. doi: 10.1186/gb-2014-15-3-r46.
- Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19:198. doi: 10.1186/s13059-018-1568-0.
- Lindgreen S, Adair KL, Gardner PP. An evaluation of the accuracy and speed of metagenome analysis tools. Sci Rep. 2016;6:19233. doi: 10.1038/srep19233.
- Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking metagenomics tools for taxonomic classification. Cell. 2019;178:779–794. doi: 10.1016/j.cell.2019.07.010.
- Eyice Ö, et al. SIP metagenomics identifies uncultivated Methylophilaceae as dimethylsulphide degrading bacteria in soil and lake sediment. ISME J. 2015;9:2336. doi: 10.1038/ismej.2015.37.
- Merelli I, et al. Low-power portable devices for metagenomics analysis: fog computing makes bioinformatics ready for the Internet of Things. Futur Gener Comput Syst. 2018;88:467–478. doi: 10.1016/j.future.2018.05.010.
- Lu J, Salzberg SL. Removing contaminants from databases of draft genomes. PLoS Comput Biol. 2018;14:e1006277. doi: 10.1371/journal.pcbi.1006277.
- Donovan PD, Gonzalez G, Higgins DG, Butler G, Ito K. Identification of fungi in shotgun metagenomics datasets. PLoS One. 2018;13:e0192898. doi: 10.1371/journal.pone.0192898.
- Meiser A, Otte J, Schmitt I, Grande FD. Sequencing genomes from mixed DNA samples - evaluating the metagenome skimming approach in lichenized fungi. Sci Rep. 2017;7:14881. doi: 10.1038/s41598-017-14576-6.
- Knutson TP, Velayudhan BT, Marthaler DG. A porcine enterovirus G associated with enteric disease contains a novel papain-like cysteine protease. J Gen Virol. 2017;98:1305–1310. doi: 10.1099/jgv.0.000799.
- Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci. 2017;3:e104. doi: 10.7717/peerj-cs.104.
- Roberts M, Hayes W, Hunt B, Mount S, Yorke J. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20:3363–3369. doi: 10.1093/bioinformatics/bth408.
- Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191.
- Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics. 2018;35(3):421–32.
- Pettengill EA, Pettengill JB, Binet R. Phylogenetic analyses of Shigella and enteroinvasive Escherichia coli for the identification of molecular epidemiological markers: whole-genome comparative analysis does not support distinct genera designation. Front Microbiol. 2016;6:1573. doi: 10.3389/fmicb.2015.01573.
- Helgason E, et al. Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis—one species on the basis of genetic evidence. Appl Environ Microbiol. 2000;66:2627 LP–2622630. doi: 10.1128/AEM.66.6.2627-2630.2000.
- Gomila M, Peña A, Mulet M, Lalucat J, García-Valdés E. Phylogenomics and systematics in Pseudomonas. Front Microbiol. 2015;6:214. doi: 10.3389/fmicb.2015.00214.
- Parks DH, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996. doi: 10.1038/nbt.4229.
- Sichtig H, et al. FDA-ARGOS: a public quality-controlled genome database resource for infectious disease sequencing diagnostics and regulatory science research. bioRxiv. 2018;482059. 10.1101/482059.
- Stewart RD, et al. Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nat Commun. 2018;9:870. doi: 10.1038/s41467-018-03317-6.
- Pandey, P., Bender, M. A., Johnson, R. & Patro, R. A general-purpose counting filter: making every bit count. in Proc 2017 ACM Int Conf Manag Data 775–787 (2017). doi:10.1145/3035918.3035963
- Flajolet P, Fusy É, Gandouet O, Meunier F. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discret Math Theor Comput Sci Proc. 2007;AH:127–46.
- Appleby, A. SMHasher GitHub repository. at <
- Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2011;40:D136–D143. doi: 10.1093/nar/gkr1178.
- Břinda K, Sykulski M, Kucherov G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics. 2015;31:3584–3592. doi: 10.1093/bioinformatics/btv419.
- Church DM, et al. Extending reference assembly models. Genome Biol. 2015;16:13. doi: 10.1186/s13059-015-0587-3.
- The UniVec Database. at <
- Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 2006;13:1028–1040. doi: 10.1089/cmb.2006.13.1028.
- Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996;266:554–571. doi: 10.1016/S0076-6879(96)66035-2.
- Flajolet P, Martin GN. Probabilistic counting algorithms for data base applications. J Comput Syst Sci. 1985;31:182–209. doi: 10.1016/0022-0000(85)90041-8.
- Solis AD. Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins Struct Funct Bioinforma. 2015;83:2198–2216. doi: 10.1002/prot.24936.
- Holtgrewe M. Mason - a read simulator for second generation sequencing data. 2010.
- Segata N, et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9:811–814. doi: 10.1038/nmeth.2066.
- Kodama Y, et al. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2011;40:D54–D56. doi: 10.1093/nar/gkr854.
- Lawrence JG, Hatfull GF, Hendrix RW. Imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches. J Bacteriol. 2002;184:4891 LP–4894905. doi: 10.1128/JB.184.17.4891-4905.2002.
- Wood, D. E. Kraken 2 Manuscript Data. doi:10.5281/zenodo.3365797
- Wood, D. E. Kraken 2 Experiment GitHub repository. at <
- Wood, D. E. Kraken 2 GitHub repository. at <
Source: PubMed