SUPER-FOCUS: a tool for agile functional analysis of shotgun metagenomic data

Genivaldo Gueiros Z Silva, Kevin T Green, Bas E Dutilh, Robert A Edwards

Abstract

Summary: Analyzing the functional profile of a microbial community from unannotated shotgun sequencing reads is one of the important goals in metagenomics. Functional profiling has valuable applications in biological research because it identifies the abundances of the functional genes of the organisms present in the original sample, answering the question of what they can do. Currently available tools do not scale well with increasing data volumes, which is important because both the number and lengths of the reads produced by sequencing platforms keep increasing. Here, we introduce SUPER-FOCUS, SUbsystems Profile by databasE Reduction using FOCUS, an agile homology-based approach using a reduced reference database to report the subsystems present in metagenomic datasets and profile their abundances. SUPER-FOCUS was tested with over 70 real metagenomes, and the results show that it accurately predicts the subsystems present in the profiled microbial communities and is up to 1000 times faster than other tools.

Availability and implementation: SUPER-FOCUS was implemented in Python, and its source code and the tool website are freely available at https://edwards.sdsu.edu/SUPERFOCUS.

Contact: redwards@mail.sdsu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
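
Illustrative sketch (not the SUPER-FOCUS source code): the abundance profiling mentioned in the Summary can be pictured as counting, for each subsystem, the reads whose best homology hit maps to it, and reporting each count as a percentage of all assigned reads. The function name subsystem_profile, the (read, subsystem) input format and the example subsystem names are assumptions made for this example only; the actual tool may count and normalize differently.

```python
from collections import Counter


def subsystem_profile(read_hits):
    """Return percent abundance per subsystem.

    read_hits: iterable of (read_id, subsystem_name) pairs, assuming one
    best hit per read (a hypothetical input format for this sketch).
    """
    # Count how many reads were assigned to each subsystem.
    counts = Counter(subsystem for _, subsystem in read_hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize counts to percentages of all assigned reads.
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy input: four reads with made-up level 1 subsystem assignments.
hits = [
    ("read1", "Carbohydrates"),
    ("read2", "Carbohydrates"),
    ("read3", "Respiration"),
    ("read4", "Virulence"),
]
profile = subsystem_profile(hits)
for name, percent in sorted(profile.items()):
    print(f"{name}\t{percent:.1f}")
```

Reads without any hit are simply absent from the input here, so the percentages are relative to assigned reads only; that normalization choice is also an assumption of the sketch.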

© The Author 2015. Published by Oxford University Press.

Figures

Fig. 1.
Workflow of the SUPER-FOCUS program
Fig. 2.
Representation of a subsystem structure (Levels 1–3 classifications and Function)
Fig. 3.
Percent classification sensitivity (A) and precision (B) of level 1 subsystems and speed of RAPSearch2 and SUPER-FOCUS using different databases and parameter modes. This analysis was based on a comparison of 50 HMP metagenomes, where blastx assignments using DB_100 were considered to be the true answer
Fig. 4.
Percentage of level 3 subsystems present in all the testing set metagenomes predicted by SUPER-FOCUS
Fig. 5.
Confusion matrix displaying the percentage of correct assignments in each level 1 subsystem for the 50 HMP metagenomes. (a) Shows the RAPSearch2 assignments in the sensitive mode to DB_100. (b) Shows the SUPER-FOCUS assignments in the sensitive mode to DB_100
Fig. 6.
Classification sensitivity using level 1 classifications and speed comparison of 50 HMP metagenomes using RAPSearch2 and SUPER-FOCUS with different databases and modes, after removing Eukaryota and viral assignments. blastx assignments using DB_100 were considered to be the true answer
Fig. 7.
Percent classification sensitivity (a) and precision (b) of level 1 subsystems and speed comparison of three viromes using RAPSearch2 and SUPER-FOCUS with different databases and modes. blastx assignments using DB_100 were considered to be the true answer
Fig. 8.
Run time comparison for the three marine viromes using SUPER-FOCUS, RTMg, MEGAN and MG-RAST
Fig. 9.
Run time comparison for the one big data metagenome using SUPER-FOCUS, RTMg, MEGAN and MG-RAST
Fig. 10.
Comparison of the level 1 subsystems profile of one big data metagenome using SUPER-FOCUS, RTMg, MEGAN, MG-RAST and blastx, with the blastx assignments considered to be the true answer
Fig. 11.
Box plots displaying the percent sensitivity (A and C) and precision (B and D) of RAPSearch2 (A and B) and blastx (C and D) annotation of the 20 coral metagenomes. RAPSearch2 was tested in the fast and sensitive modes
Fig. 12.
Hierarchical clustering of the taxonomic (A) and functional (B) annotations of 20 coral metagenomes. Genus level taxonomic annotation was performed using FOCUS. Functional annotation of level 3 subsystems was performed using SUPER-FOCUS using blastx and DB_98

Source: PubMed
