SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes

Elmar Pruesse, Jörg Peplies, Frank Oliver Glöckner

Abstract

Motivation: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes such as ribosomal RNA (rRNA), for which millions of sequences are already publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements.

Results: In this study, we present the SILVA Incremental Aligner (SINA), which is used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high-throughput performance demands. SINA was evaluated in comparison with the commonly used high-throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences, respectively. SINA achieved higher accuracy than PyNAST and mothur in all performed benchmarks.
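
The k-mer search step mentioned above can be illustrated with a minimal sketch. It is not SINA's implementation: the function names (`kmers`, `select_references`), the default k-mer length, the number of references returned and the scoring by count of shared distinct k-mers are all assumptions chosen for illustration. The idea shown is only that a candidate sequence is compared against the ungapped reference sequences, and the references sharing the most k-mers are selected for the subsequent alignment step.

```python
def kmers(seq, k=8):
    """Return the set of distinct k-mers in seq (uppercased, U treated as T)."""
    s = seq.upper().replace("U", "T")
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def select_references(candidate, references, k=8, n=3):
    """Return the names of the n references sharing the most k-mers with candidate.

    `references` maps a name to an aligned (possibly gapped) sequence.
    Gap characters '-' and '.' are stripped before k-mers are extracted.
    Scoring by shared distinct k-mers is an assumption of this sketch.
    """
    cand = kmers(candidate, k)
    scores = {}
    for name, aligned in references.items():
        ungapped = aligned.replace("-", "").replace(".", "")
        scores[name] = len(cand & kmers(ungapped, k))
    # Sort by descending score; ties are broken by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n]


if __name__ == "__main__":
    refs = {
        "a": "AC-GTAC--GTAC",
        "b": "TTTTTTTTTT",
        "c": "ACGTTTGAAA",
    }
    print(select_references("ACGUACGUAC", refs, k=4, n=2))
```

In this toy input, reference `a` shares all four distinct 4-mers with the candidate once its gaps are removed, so it ranks first.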

Availability: Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.

Figures

Fig. 1.
The alignment of the selected reference sequences is converted from RC-MSA representation (top) to PO-MSA representation (bottom)
Fig. 2.
SINA alignment accuracy decreases almost linearly with the shared fractional identity of candidate and reference when using one reference sequence (red line). Using larger numbers of reference sequences markedly increases accuracy
Fig. 3.
An alternative implementation which used simple column-profiles built from the selected reference sequences showed overall lower accuracy. Increasing the number of reference sequences quickly led to a degradation in accuracy
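
The conversion described in the caption of Fig. 1, from a row-column MSA (RC-MSA) to a partial order MSA (PO-MSA), can be sketched as follows. This is a simplified illustration in the spirit of partial order graphs, not the data structure SINA actually uses: the function name `rc_to_po`, the representation of a node as a `(column, base)` pair and the rule that identical bases in the same column are merged into one node are assumptions of this sketch.

```python
def rc_to_po(rows):
    """Convert gapped, equal-length alignment rows into a partial order graph.

    Returns (nodes, edges). A node is a (column, base) pair, so identical
    bases in the same column collapse into a single node (an assumption of
    this sketch). An edge links two nodes that are consecutive non-gap
    positions in at least one row.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows must have the same length")
    nodes = set()
    edges = set()
    for row in rows:
        prev = None
        for col, base in enumerate(row.upper()):
            if base in "-.":
                continue  # gaps contribute no node; the edge skips over them
            node = (col, base)
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))
            prev = node
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = rc_to_po(["ACGT", "A-GT", "ATGT"])
    print(sorted(nodes))
    print(sorted(edges))
```

In the toy alignment, the second column yields two alternative nodes (`C` and `T`), and the gapped row adds an edge that bypasses that column entirely. A candidate sequence aligned against such a graph can follow any path through it, which is how a graph of this kind differs from a simple column profile.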

Source: PubMed
