Kraken: ultrafast metagenomic sequence classification using exact alignments

Derrick E Wood, Steven L Salzberg, Derrick E Wood, Steven L Salzberg

Abstract

Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.

Figures

Figure 1
Figure 1
The Kraken sequence classification algorithm. To classify a sequence, each k-mer in the sequence is mapped to the lowest common ancestor (LCA) of the genomes that contain that k-mer in a database. The taxa associated with the sequence’s k-mers, as well as the taxa’s ancestors, form a pruned subtree of the general taxonomy tree, which is used for classification. In the classification tree, each node has a weight equal to the number of k-mers in the sequence associated with the node’s taxon. Each root-to-leaf (RTL) path in the classification tree is scored by adding all weights in the path, and the maximal RTL path in the classification tree is the classification path (nodes highlighted in yellow). The leaf of this classification path (the orange, leftmost leaf in the classification tree) is the classification used for the query sequence.
Figure 2
Figure 2
Classification accuracy and speed comparison of classification programs for three simulated metagenomes. For each metagenome, genus precision and sensitivity are shown for five classifiers, and speed is shown for five programs (PhymmBL65 is simply a confidence-filtered version of PhymmBL’s results, and MetaPhlAn only classifies a subset of reads that map to one of its marker genes, as it is an abundance estimation program). Results shown are for: (a) the HiSeq metagenome, consisting of HiSeq reads (mean length μ = 92 bp) in equal proportion from ten bacterial sequencing projects; (b) the MiSeq metagenome, consisting of MiSeq reads (μ = 156 bp) in equal proportion from ten bacterial projects; and (c) the simBA-5 metagenome, consisting of simulated 100-bp reads with a high error rate from 1,967 bacterial and archaeal taxa. Note that the horizontal axes in all speed graphs have a logarithmic scale.
Figure 3
Figure 3
Classification accuracy and speed comparison of variants of Kraken for three simulated metagenomes. For each metagenome, genus precision and sensitivity are shown for five classifiers, and speed is shown for Kraken, along with a reduced memory version of Kraken (MiniKraken), quick execution versions of both (Kraken-Q and MiniKraken-Q), and Kraken run with a database containing draft and completed microbial genomes from GenBank (Kraken-GB). Results shown are for the same metagenomes used in Figure 2. Note that the scales of the axes differ from Figure 2, as the precision and speed of Kraken (and its variants) exceed that of the other classifiers used. (a) HiSeq metagenome. (b) MiSeq metagenome. (c) simBA-5 metagenome.
Figure 4
Figure 4
Taxonomic distribution of saliva microbiome reads classified by Kraken. Sequences from saliva samples collected from three individuals were classified by Kraken. The distribution of those reads that were classified by Kraken is shown.
Figure 5
Figure 5
Kraken database structure. Each k-mer to be queried against the database has a specific substring that is its minimizer. To search for a k-mer in the database, the positions in the database that contain k-mers with the same minimizer are examined. These positions are quickly found by examining the minimizer offset array for the start positions of records with the k-mer’s minimizer (orange) and the next possible minimizer (blue). Within a range of records associated with a given minimizer, records are sorted by lexicographical ordering of their k-mers, allowing a query to be completed by using a binary search over this range.

References

    1. Venter C, Remington K, Heidelberg J, Halpern A, Rusch D, Eisen J, Wu D, Paulsen I, Nelson K, Nelson W, Fouts D, Levy S, Knap A, Lomas M, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers Y-H, Smith H. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74. doi: 10.1126/science.1093857.
    1. Tyson G, Chapman J, Hugenholtz P, Allen E, Ram R, Richardson P, Solovyev V, Rubin E, Rokhsar D, Banfield J. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340.
    1. Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, Creasy HH, Earl AM, FitzGerald MG, Fulton RS, Giglio MG, Hallsworth-Pepin K, Lobos EA, Madupu R, Magrini V, Martin JC, Mitreva M, Muzny DM, Sodergren EJ, Versalovic J, Wollam AM, Worley KC, Wortman JR, Young SK, Zeng Q, Aagaard KM, Abolude OO, Allen-Vercoe E, Alm EJ, Alvarado L. et al.Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–214. doi: 10.1038/nature11234.
    1. Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
    1. Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6:673–676. doi: 10.1038/nmeth.1358.
    1. Huson D, Auch A, Qi J, Schuster S. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–386. doi: 10.1101/gr.5969107.
    1. Brady A, Salzberg S. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods. 2011;8:367. doi: 10.1038/nmeth0511-367.
    1. Rosen G, Garbarine E, Caseiro D, Polikar R, Sokhansanj B. Metagenome fragment classification using N-mer frequency profiles. Adv Bioinformatics. 2008;2008:1–12.
    1. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011;12:S4.
    1. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9:811–814. doi: 10.1038/nmeth.2066.
    1. Treangen T, Koren S, Sommer D, Liu B, Astrovskaya I, Ondov B, Darling A, Phillippy A, Pop M. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013;14:R2. doi: 10.1186/gb-2013-14-1-r2.
    1. Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics. 2013;29:2253–2260. doi: 10.1093/bioinformatics/btt389.
    1. Kindblom C, Davies JR, Herzberg MC, Svensäter G, Wickström C. Salivary proteins promote proteolytic activity in Streptococcus mitis biovar 2 and Streptococcus mutans. Mol Oral Microbiol. 2012;27:362–372. doi: 10.1111/j.2041-1014.2012.00650.x.
    1. Foweraker JE, Cooke NJ, Hawkey PM. Ecology of Haemophilus influenzae and Haemophilus parainfluenzae in sputum and saliva and effects of antibiotics on their distribution in patients with lower respiratory tract infections. Antimicrob Agents Chemother. 1993;37:804–809. doi: 10.1128/AAC.37.4.804.
    1. Könönen E, Saarela M, Karjalainen J, Jousimies-Somer H, Alaluusua S, Asikainen S. Transmission of oral Prevotella melaninogenica between a mother and her young child. Oral Microbiol Immunol. 1994;9:310–314. doi: 10.1111/j.1399-302X.1994.tb00077.x.
    1. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden T. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421.
    1. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29:2669–2677. doi: 10.1093/bioinformatics/btt476.
    1. Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40:D130–D135. doi: 10.1093/nar/gkr1079.
    1. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
    1. Roberts M, Hayes W, Hunt B, Mount S, Yorke J. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20:3363–3369. doi: 10.1093/bioinformatics/bth408.
    1. Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013;29:1718–1725. doi: 10.1093/bioinformatics/btt273.
    1. Holtgrewe M. Mason.
    1. Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy A, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides N. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500. doi: 10.1038/nmeth1043.
    1. Ondov B, Bergman N, Phillippy A. Interactive metagenomic visualization in a web browser. BMC Bioinformatics. 2011;12:385. doi: 10.1186/1471-2105-12-385.
    1. Kraken homepage.
    1. Kraken GitHub repository. .

Source: PubMed

3
Se inscrever