Ribosomal Database Project: data and tools for high throughput rRNA analysis

James R Cole, Qiong Wang, Jordan A Fish, Benli Chai, Donna M McGarrell, Yanni Sun, C Titus Brown, Andrea Porras-Alfaro, Cheryl R Kuske, James M Tiedje

Abstract

The Ribosomal Database Project (RDP; http://rdp.cme.msu.edu/) provides the research community with aligned and annotated rRNA gene sequence data, along with tools that allow researchers to analyze their own rRNA gene sequences in the RDP framework. RDP data and tools are used in fields as diverse as human health, microbial ecology, environmental microbiology, nucleic acid chemistry, taxonomy and phylogenetics. In addition to aligned and annotated collections of bacterial and archaeal small subunit rRNA genes, RDP now includes a collection of fungal large subunit rRNA genes. RDP tools, including Classifier and Aligner, have been updated to work with this new fungal collection. The use of high-throughput sequencing to characterize environmental microbial populations has exploded in the past several years, and as sequencing technologies have improved, the sizes of environmental datasets have increased. With release 11, RDP is providing an expanded set of tools to facilitate analysis of high-throughput data, including both single-stranded and paired-end reads. In addition, most tools are now available as open-source packages for download and local use by researchers who have high-volume needs or who would like to develop custom analysis pipelines.

Figures

Figure 1.
Gene coverage: number of sequences from RDP release 11.1 covering the indicated positions on the reference sequence. (A) Bacterial SSU rRNA gene. Positions relative to Escherichia coli sequence GenBank accession J01695.1. Gray bars indicate variable regions (1). (B) Archaeal SSU rRNA gene. Positions relative to E. coli sequence GenBank accession J01695.1. (C) Fungal LSU rRNA gene. Positions relative to S. cerevisiae GenBank accession NC_001144.5 LSU gene. D1 and D2 indicate hypervariable regions initially used for discrimination among Fusarium spp. (2). The D2 region is among the most highly variable eukaryotic LSU regions in terms of both length and structure (3). Such high diversity may improve the performance of the RDP Classifier when discriminating between closely related genera. Gene coverage charts are available online and updated with each incremental RDP release.
Figure 2.
Multiple sequence alignment of partial bacterial 16S rRNA sequences corresponding to the region between common V6 variable region amplification primers (15). Uppercase columns correspond to modeled positions. Lowercase columns correspond to regions where hypervariability in size and structure precludes assignment of homologous residues. These columns are normally ‘masked out’ before phylogenetic analysis. (A) Using the new RDP 11 alignment model. This matches the alignment for this region obtained with full-length sequences. (B) Using the RDP 10 alignment model. The alignment of the full-length sequences is almost identical in this V6 region between the two models, except that one G-U pair in RDP 11 appears as inserts in the RDP 10 alignment. Bases highlighted in green are canonical base pairs matching the conserved secondary structure. From top to bottom, the GenBank accessions are AB006164, AB006178, AB021164, AB015577, AB003932 and AB004715.
Figure 3.
Accumulation curves showing (A) taxon size and (B) intra-taxon distance. All aligned sequences in RDP release 11.1 in each of the three RDP collections were clustered as described. The average distance between pairs of sequences in a taxon is shown in (B). The shapes of the phylum curves, and to a lesser extent the class curves, for archaea and fungi are likely influenced by the small number of taxa and the skewed representation of sequences in these taxa.
Figure 4.
Comparing per base error rates for three paired-end read assembly tools. The error rates were calculated using assembled reads filtered by either read Q score (Assembler and original PANDAseq; 38) or delta Q score (mothur; 39). Recommended read Q score of 27 for Assembler and base Q score (deltaq) of 6 for mothur are marked. (A) Sample M_20130714 and (B) Sample M_20130819.
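The comparison in Figure 4 rests on two quantities: a read-level Q score used as a filter, and a per-base error rate computed on the reads that pass. The Python sketch below illustrates that idea under stated assumptions. The caption does not define "read Q score", so the sketch assumes it is the Phred-scaled mean per-base error probability; the function names and the toy reads are hypothetical and are not taken from RDP Assembler, PANDAseq or mothur.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def read_q_score(quals):
    """Read-level Q score (assumed definition): the Phred-scaled
    mean of the per-base error probabilities."""
    mean_err = sum(phred_to_error(q) for q in quals) / len(quals)
    return -10 * math.log10(mean_err)


def per_base_error_rate(reads, references, min_read_q):
    """Fraction of mismatched bases among reads passing the Q filter.

    reads: list of (sequence, per-base Phred scores) tuples.
    references: known true sequences, same order and length as reads.
    """
    mismatches = 0
    bases = 0
    for (seq, quals), ref in zip(reads, references):
        if read_q_score(quals) < min_read_q:
            continue  # read rejected by the quality filter
        mismatches += sum(1 for a, b in zip(seq, ref) if a != b)
        bases += len(seq)
    return mismatches / bases if bases else float("nan")


# Toy data: one clean high-quality read, one low-quality read with an error.
reads = [("ACGT", [30, 30, 30, 30]), ("ACGA", [10, 10, 10, 10])]
references = ["ACGT", "ACGT"]
unfiltered = per_base_error_rate(reads, references, min_read_q=0)
filtered = per_base_error_rate(reads, references, min_read_q=27)
```

With these toy reads, the threshold of 27 (the value marked for Assembler in the figure) rejects the low-quality read, so the error rate over the remaining bases falls relative to the unfiltered case.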

References

    1. Neefs JM, Van de Peer Y, De Rijk P, Chapelle S, De Wachter R. Compilation of small ribosomal subunit RNA structures. Nucleic Acids Res. 1993;21:3025–3049.
    2. Guadet J, Julien J, Lafay JF, Brygoo Y. Phylogeny of some Fusarium species, as determined by large-subunit rRNA sequence comparison. Mol. Biol. Evol. 1989;6:227–242.
    3. Schnare MN, Damberger SH, Gray MW, Gutell RR. Comprehensive comparison of structural characteristics in Eukaryotic cytoplasmic large subunit (23S-like) ribosomal RNA. J. Mol. Biol. 1996;256:701–719.
    4. Liu K-L, Porras-Alfaro A, Kuske CR, Eichorst S, Xie G. Accurate, rapid taxonomic classification of fungal large subunit rRNA genes. Appl. Environ. Microbiol. 2012;78:1523–1533.
    5. Nakamura Y, Cochrane G, Karsch-Mizrachi I, International Nucleotide Sequence Database Collaboration. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 2013;41:D21–D24.
    6. Cochrane G, Alako B, Amid C, Bower L, Cerdeño-Tárraga A, Cleland I, Gibson R, Goodgame N, Jang M, Kay S, et al. Facing growth in the European Nucleotide Archive. Nucleic Acids Res. 2013;41:D30–D35.
    7. Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat. Biotechnol. 2011;29:415–420.
    8. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2013;41:D8–D20.
    9. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. 2011;27:2194–2200.
    10. Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2012;40:D136–D143.
    11. Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012;40:D57–D63.
    12. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935.
    13. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D'Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Müller KM, et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 2002;3:2.
    14. Huse SM, Welch DM, Morrison HG, Sogin ML. Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ. Microbiol. 2010;12:1889–1898.
    15. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc. Natl Acad. Sci. U.S.A. 2006;103:12115–12120.
    16. Parte A. LPSN—List of Prokaryotic Names with Standing in Nomenclature. Nucleic Acids Res. 2014;42:D613–D616.
    17. Munoz R, Yarza P, Ludwig W, Euzéby J, Amann R, Schleifer KH, Glöckner FO, Rosselló-Móra R. Release LTPs104 of the All-Species Living Tree. Syst. Appl. Microbiol. 2011;34:169–170.
    18. Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiol. Today. 2006;33:152–155.
    19. Larsen N, Olsen GJ, Maidak BL, McCaughey MJ, Overbeek R, Macke TJ, Marsh TL, Woese CR. The ribosomal database project. Nucleic Acids Res. 1993;21:3021–3023.
    20. Cole JR, Chai B, Farris RJ, Wang Q, Kulam SA, McGarrell DM, Garrity GM, Tiedje JM. The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res. 2005;33:D294–D296.
    21. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
    22. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 2007;73:5261–5267.
    23. Claesson MJ, O'Sullivan O, Wang Q, Nikkilä J, Marchesi JR, Smidt H, de Vos WM, Ross RP, O'Toole PW. Comparative analysis of pyrosequencing and a phylogenetic microarray for exploring microbial community structures in the human distal intestine. PLoS One. 2009;4:e6669.
    24. Sul WJ, Cole JR, Jesus EC, Wang Q, Farris RJ, Fish JA, Tiedje JM. Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering. Proc. Natl Acad. Sci. U.S.A. 2011;108:14637–14642.
    25. Werner JJ, Koren O, Hugenholtz P, DeSantis TZ, Walters WA, Caporaso JG, Angenent LT, Knight R, Ley RE. Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys. ISME J. 2012;6:94–103.
    26. Newton IL, Roeselers G. The effect of training set on the classification of honey bee gut microbiota using the Naïve Bayesian Classifier. BMC Microbiol. 2012;12:221.
    27. Myers G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM. 1999;46:1–13.
    28. Bruno WJ, Socci ND, Halpern AL. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 2000;17:189–197.
    29. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh TL, Garrity GM, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–D145.
    30. Cole JR, Wang Q, Chai B, Tiedje JM. The Ribosomal Database Project: sequences and software for high-throughput rRNA analysis. In: de Bruijn FJ, editor. Handbook of Molecular Microbial Ecology I: Metagenomics and Complementary Approaches. Hoboken, NJ: J. Wiley & Sons, Inc.; 2011. pp. 313–324.
    31. Fish JA, Chai B, Wang Q, Sun Y, Brown CT, Tiedje JM, Cole JR. FunGene: the functional gene pipeline and repository. Front. Terr. Microbiol. 2013;4:291.
    32. McDonald D, Clemente JC, Kuczynski J, Rideout JR, Stombaugh J, Wendel D, Wilke A, Huse S, Hufnagle J, Meyer F, et al. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience. 2012;1:7.
    33. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461.
    34. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics. 2012;28:3150–3152.
    35. Loewenstein Y, Portugaly E, Fromer M, Linial M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. 2008;24:i41–i49.
    36. Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W. ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res. 2009;37:e76.
    37. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 2009;75:7537–7541.
    38. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD. PANDAseq: paired-end assembler for illumina sequences. BMC Bioinformatics. 2012;13:31.
    39. Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl. Environ. Microbiol. 2013;79:5112–5120.

Source: PubMed
