Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy

Qiong Wang, George M Garrity, James M Tiedje, James R Cole

Abstract

The Ribosomal Database Project (RDP) Classifier, a naïve Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order taxonomy proposed in Bergey's Taxonomic Outline of the Prokaryotes (2nd ed., release 5.0, Springer-Verlag, New York, NY, 2004). It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. The majority of classifications (98%) were of high estimated confidence (≥95%) and high accuracy (98%). In addition to being tested with the corpus of 5,014 type strain sequences from Bergey's outline, the RDP Classifier was tested with a corpus of 23,095 rRNA sequences as assigned by the NCBI into their alternative higher-order taxonomy. The results from leave-one-out testing on both corpora show that the overall accuracies at all levels of confidence for near-full-length and 400-base segments were 89% or above down to the genus level, and the majority of the classification errors appear to be due to anomalies in the current taxonomies. For shorter rRNA segments, such as those that might be generated by pyrosequencing, the error rate varied greatly over the length of the 16S rRNA gene, with segments around the V2 and V4 variable regions giving the lowest error rates. The RDP Classifier is suitable both for the analysis of single rRNA sequences and for the analysis of libraries of thousands of sequences. Another related tool, RDP Library Compare, was developed to facilitate microbial-community comparison based on 16S rRNA gene sequence libraries. It combines the RDP Classifier with a statistical test to flag taxa differentially represented between samples. The RDP Classifier and RDP Library Compare are available online at http://rdp.cme.msu.edu/.
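As a rough illustration of how a word-based naïve Bayesian classifier can attach a confidence estimate to a genus assignment, the sketch below scores a query by its overlapping words and then repeats the assignment on random word subsets. The word size of 8, the 0.5 pseudocount, the 100 trials, and the one-eighth subset size are assumptions that the abstract does not state, and the class name, method names, and toy sequences are hypothetical. It is not the RDP Classifier's actual code.

```python
import math
import random
from collections import Counter, defaultdict

K = 8  # assumed word size; not given in the abstract


def words(seq, k=K):
    """Return the set of overlapping k-base words in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class NaiveBayesTaxonClassifier:
    """Toy genus-level classifier (hypothetical name, illustrative only)."""

    def __init__(self, k=K):
        self.k = k
        self.n_seqs = 0                               # training sequences seen
        self.word_seq_count = Counter()               # sequences containing each word
        self.genus_seq_count = Counter()              # sequences per genus
        self.genus_word_count = defaultdict(Counter)  # per-genus word counts

    def train(self, labeled):
        """labeled: iterable of (genus, sequence) pairs."""
        for genus, seq in labeled:
            self.n_seqs += 1
            self.genus_seq_count[genus] += 1
            for w in words(seq, self.k):
                self.word_seq_count[w] += 1
                self.genus_word_count[genus][w] += 1

    def _log_score(self, genus, query_words):
        """Sum of log P(word | genus), smoothed by a corpus-wide word prior."""
        m_total = self.genus_seq_count[genus]
        score = 0.0
        for w in query_words:
            prior = (self.word_seq_count[w] + 0.5) / (self.n_seqs + 1)
            score += math.log((self.genus_word_count[genus][w] + prior) / (m_total + 1))
        return score

    def _best_genus(self, query_words):
        return max(self.genus_seq_count, key=lambda g: self._log_score(g, query_words))

    def classify(self, seq, trials=100, rng=None):
        """Return (genus, confidence); confidence is the bootstrap agreement rate."""
        rng = rng or random.Random(0)
        all_words = sorted(words(seq, self.k))
        if not all_words:
            raise ValueError("query is shorter than the word size")
        best = self._best_genus(all_words)
        subset_size = max(1, len(all_words) // self.k)  # assumed: about 1/8 of the words
        agree = 0
        for _ in range(trials):
            sample = rng.choices(all_words, k=subset_size)  # sample with replacement
            if self._best_genus(sample) == best:
                agree += 1
        return best, agree / trials


# Toy usage with made-up sequences (not real 16S rRNA data).
clf = NaiveBayesTaxonClassifier()
clf.train([
    ("GenusA", "ACGTACGTTTGACCATGACGGATTAC"),
    ("GenusB", "GGGCCCTTTAAAGGGCCCAATTGGCC"),
])
genus, confidence = clf.classify("ACGTACGTTTGACCATGACGGATTAC")
print(genus, confidence)
```

The confidence here is simply the fraction of resampled word subsets that reproduce the full-sequence assignment, which is one plausible way to obtain a per-assignment estimate of the kind the abstract describes.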

Figures

FIG. 1.
Overall classification accuracy by query size (exhaustive leave-one-out testing using the Bergey corpus). Numbers are percentages of tests correctly classified.
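For readers unfamiliar with leave-one-out testing, a minimal sketch follows: each sequence is held out in turn, a model is built from the rest, and the held-out sequence is classified. The `fit` and `predict` callables and the majority-label stand-in are hypothetical placeholders, not the authors' test harness.

```python
from collections import Counter


def leave_one_out_accuracy(corpus, fit, predict):
    """Percentage of held-out items classified correctly.

    corpus:  list of (label, sequence) pairs
    fit:     callable taking a training list and returning a model
    predict: callable taking (model, sequence) and returning a label
    """
    if not corpus:
        raise ValueError("corpus is empty")
    correct = 0
    for i, (label, seq) in enumerate(corpus):
        training = corpus[:i] + corpus[i + 1:]  # everything except item i
        model = fit(training)
        if predict(model, seq) == label:
            correct += 1
    return 100.0 * correct / len(corpus)


# Stand-in classifier for demonstration only: always predicts the
# most common training label and ignores the sequence.
def fit_majority(training):
    return Counter(label for label, _ in training).most_common(1)[0][0]


def predict_majority(model, seq):
    return model


toy_corpus = [("A", "ACGT"), ("A", "ACGA"), ("A", "ACGC"), ("B", "TTTT")]
accuracy = leave_one_out_accuracy(toy_corpus, fit_majority, predict_majority)
print(accuracy)
```

Refitting the model for every held-out item, as done here for clarity, scales quadratically with corpus size; a practical implementation on thousands of sequences would more likely subtract the held-out item's counts from a model trained once.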
FIG. 2.
(A) Classification accuracy rate for the Bergey corpus with sequence segments of 100 bases, moving 25 bases at a time. The gray bars on the x axis define the hypervariable regions. The average classification accuracy rate at the genus level was 70% over all 100-base regions. (B) Average bootstrap confidence estimate for each segment.
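The segment scheme in this caption (100-base windows advanced 25 bases at a time) can be sketched as follows. The function name is hypothetical, and dropping a trailing partial window is an assumption that the caption does not address.

```python
def sliding_segments(seq, size=100, step=25):
    """Yield (start, segment) for each full-length window along seq.

    Windows that would run past the end of the sequence are skipped
    (an assumption; the caption does not say how the tail is handled).
    """
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# Toy usage: a made-up 200-base sequence, not real 16S rRNA data.
toy_seq = "ACGT" * 50
segments = list(sliding_segments(toy_seq))
starts = [start for start, _ in segments]
print(starts)
```

Each segment could then be classified separately and the per-position results averaged across the corpus to produce curves like those in panels A and B.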
FIG. 3.
Phylogenetic analysis of the Alicyclobacillaceae, including the genera Sulfobacillus and Alicyclobacillus. Sequences for each of the 11 species type strains of Alicyclobacillaceae available in release 5.0 of Bergey's taxonomic outline, along with additional sequences for four Alicyclobacillaceae species type strains that became available after the release of the outline (marked with an asterisk), and two Bacillus species type strains were analyzed using the weighted neighbor-joining method (5). The tree is rooted using Escherichia coli sequence J01695 as the outgroup. Bootstrap confidence estimates above 85% are shown. Three misclassifications made by the RDP Classifier are highlighted, with the original (release 5.0) description appended with a corrected description.
S. disulfidooxidansᵀ became A. disulfidooxidansᵀ. In 2005, S. disulfidooxidans was formally reclassified as a new combination, A. disulfidooxidans (14).
S. thermosulfidooxidans VKM 1269ᵀ became “A. tolerans K1.” In release 5, sequence Z21979 was listed as coming from the type strain (VKM 1269) of S. thermosulfidooxidans. This agrees with the original publication for the sequence (25). The same group later reported that the sequence was probably from S. thermosulfidooxidans strain K1, not VKM 1269 (15). Two independent sequences (X91080 and AB089844) for the type strain of S. thermosulfidooxidans are available. They are nearly identical to each other (0.2% difference) and 19% different from Z21979. In 2005, S. thermosulfidooxidans strain K1 was reclassified as the type strain of a new species, A. tolerans (14). In our analysis, however, the sequence for K1 given in the naming paper (accession number AF137502) is 8% different from that for Z21979. Although still listed in the GenBank record as from A. tolerans strain K1, Z21979 is most probably not from strain K1 and not even from a member of the species A. tolerans.
A. acidoterrestrisᵀ became Bacillus sp. Sequence X60602 was published in 1991 as from the type strain (DSM 3922) for A. acidoterrestris (1). The GenBank record also lists this as from strain DSM 3922, but a strain mix-up in the culture supplied by the DSMZ was reported in 1992 (27). The type strain (ATCC 49025, listed by the DSMZ as equivalent to DSM 3922) was resequenced in 2005 (AY573797 [9]). These two sequences are 14% different, and X60602 is now described by GenBank as from a Bacillus sp.

Source: PubMed
