Metagenomic biomarker discovery and explanation

Nicola Segata, Jacques Izard, Levi Waldron, Dirk Gevers, Larisa Miropolsky, Wendy S Garrett, Curtis Huttenhower

Abstract

This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency, and effect size estimation. The method addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, a central problem in metagenomics. We extensively validate the method on several microbiomes and provide a convenient online interface at http://huttenhower.sph.harvard.edu/lefse/.

Figures

Figure 1
LEfSe mines a wide range of high-throughput genetic data to find biologically relevant features characterizing one or more experimental conditions. The inputs to the system are the specifications of the biological hypothesis under investigation (conditions and inter-condition sample groupings), the high-dimensional data obtained experimentally, and, optionally, prior knowledge from literature or databases used to define known relationships between features (used for meaningful hierarchical organization of the discovered biomarkers) or samples (used for testing biological consistency of potential biomarkers). LEfSe is a three-step algorithm (detailed in Figure 6). (a) LEfSe first provides the list of features that are differential among conditions of interest with statistical and biological significance, ranking them according to the effect size. (b) For problems with known hierarchical structure, either phylogenetic or functional, we then provide a mapping of the differences to taxonomic or functional trees. (c) Finally, the system produces a histogram visualizing the raw data within the specified problem structure for each relevant feature. While LEfSe has been developed primarily for metagenomic data containing taxon or gene abundances, it can be used for biomarker discovery in any setting where prior biological knowledge regarding the structure of a comparison is coupled with statistically significant differences in high-dimensional genomic features. KEGG, Kyoto Encyclopedia of Genes and Genomes; WGS, whole genome shotgun.
Figure 2
LEfSe results on human microbiomes. (a-c) Mucosal body site analysis. Mucosal microbial communities are diverse, while non-mucosal body sites are characterized by several clades, including the Actinobacteria. The analysis reported here is carried out on initial data from the Human Microbiome Project [55,56] assigning the main body sites to mucosal and non-mucosal classes, and using the body sites as subclasses. These graphical outputs were generated by the publicly available LEfSe visualization modules applied on the analysis results and integrating microbial taxonomic prior knowledge [58]. (a) Histogram of the LDA scores computed for features differentially abundant between mucosal and non-mucosal body sites. LEfSe scores can be interpreted as the degree of consistent difference in relative abundance between features in the two classes of analyzed microbial communities. The histogram thus identifies which clades among all those detected as statistically and biologically differential explain the greatest differences between communities. (b) Taxonomic representation of statistically and biologically consistent differences between mucosal and non-mucosal body sites. Differences are represented in the color of the most abundant class (red indicating non-mucosal, yellow non-significant). Each circle's diameter is proportional to the taxon's abundance. This representation, here employing the Ribosomal Database Project (RDP) taxonomy [58], simultaneously highlights high-level trends and specific genera - for example, multiple differentially abundant sibling taxa consistent with the variation of the parent clade. (c) Histogram of the Actinomycetales relative abundances (in the [0,1] interval) in mucosal and non-mucosal body sites. Subclasses (specific body sites) are differentially colored and the mean and median relative abundance of the Actinomycetales are indicated with solid and dashed lines, respectively. (d,e) Aerobiosis analysis. 
The cladograms report the taxa (highlighted by small circles and by shading) showing different abundance values (according to LEfSe) in the three O2-dependent classes as described in Results; for each taxon, the color denotes the class with higher median for both the small circles and the shading. (d) The strict (all classes differential) version of LEfSe detects 13 biomarkers whereas (e) the non-strict (at least one class differential) version of LEfSe detects 60 microbial biomarkers with abundance differential under aerobic, anaerobic, or microaerobic conditions. Additional file 2 reports the non-strict version of LEfSe focused on the Firmicutes phylum, highlighting several low-O2 specific genera within Ruminococcaceae and Lachnospiraceae.
Figure 3
Comparison between Rag2-/- (control) and T-bet-/- × Rag2-/- (case) mice highlighting that, at the phylum level, Firmicutes are enriched in T-bet-/- × Rag2-/- mice, whereas Actinobacteria are enriched in Rag2-/- mice. In agreement with previous culture-based studies, Bifidobacterium species are underabundant in T-bet-/- × Rag2-/- mice [68], and LEfSe highlights several additional genus-level clades, including the specifically depleted Roseburia and Papillibacter within the otherwise overabundant Firmicutes.
Figure 4
LEfSe highlights pathways consistently differential between bacterial microbiomes and viromes within diverse environmental subclasses. (a) Using the SEED [71] catalog of functional pathways, LEfSe reports Nucleoside and nucleotide metabolism and Respiration to differ consistently between bacterial microbiomes and viromes across environmental samples described in [70]. The former is significant using the strictest all-subclasses test, the latter in the more lenient one-subclass test. (b) A two-level cladogram reporting the significant pathway differences as visualized using the SEED hierarchy (see Additional file 3 for the three-level cladogram and detailed differences). (c) Metastats [45] reports four additional pathways differential among these data (Carbohydrates, DNA metabolism, Membrane transport and Nitrogen metabolism). Using only the KW test portion of LEfSe (α = 0.05), we obtain results consonant with Metastats (excluding Nitrogen metabolism). However, as shown here, an overview of the abundance histograms of these subsystems demonstrates them to be less consistent across environments (for example, Coral and Hyper-saline subclasses in the Carbohydrates, Membrane transport and Nitrogen metabolism) and to lose significance within individual subclasses (as for the DNA metabolism subsystem).
Figure 5
Comparison of LEfSe and the KW test alone for false positive and negative rates in synthetic data. Both tests used α = 0.05 in all cases, and the three artificial datasets comprise 100 samples, each in two classes, each with two subclasses of cardinality 25. The samples consist of 1,000 synthetic features taking the place of microbial taxa, pathways, and so on; half are negative (not biomarkers) and the other half positive. (a) LEfSe and KW false positive and negative rates at increasing values of the difference between class means. Negative features are normally distributed with parameters (μ = 10,000, σ = 100) across classes; positive features contain classes with increasingly different means. (b) Performance as standard deviation varies within classes (rather than the difference between means, fixed at 2,000). (c) Performance as standard deviation increases within inconsistent subclasses. Negative features have subclasses sampled from the same normal distribution (and thus not representing consistent biomarkers). Positive features are distributed as in (b). In all cases, LEfSe sacrifices a small number of false negatives in order to achieve a false positive rate near zero, with the goal of ensuring that biomarkers of large effect size will be both reproducible and biologically interpretable.
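The synthetic design of panel (a) can be sketched in a few lines of Python. This is only an illustration of the layout described in the caption, not the authors' generator: the function name `make_dataset`, the class and subclass labels, the seed, and the choice to add the mean difference to the second class are all assumptions made for the example.

```python
import random


def make_dataset(mean_diff, sigma=100.0, n_features=1000, per_subclass=25, seed=0):
    """Build one synthetic dataset: 2 classes x 2 subclasses x per_subclass samples.

    The first half of the features are negatives drawn from N(10000, sigma) in
    every sample. The second half are positives whose mean in class "B" is
    shifted by mean_diff (assumption: the shift is additive and applied to "B").
    """
    rng = random.Random(seed)
    classes = [c for c in ("A", "B") for _ in range(2 * per_subclass)]
    subclasses = [f"{c}{k}" for c in ("A", "B") for k in (1, 2)
                  for _ in range(per_subclass)]
    features, is_positive = [], []
    for j in range(n_features):
        positive = j >= n_features // 2
        row = [
            rng.gauss(10_000 + (mean_diff if positive and c == "B" else 0.0), sigma)
            for c in classes
        ]
        features.append(row)
        is_positive.append(positive)
    return features, classes, subclasses, is_positive


# One dataset with the class means 2,000 apart, as fixed in panel (b).
features, classes, subclasses, is_positive = make_dataset(mean_diff=2000)
print(len(features), len(features[0]), sum(is_positive))
```

Sweeping `mean_diff` (panel a) or `sigma` (panel b) over a grid and counting how many negatives are called significant gives the false positive rate for whichever test is applied to each feature row.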
Figure 6
Schematic representation of the statistical and computational steps implemented in LEfSe. Input data consist of a collection of m samples (columns) each made up of n numerical features (rows, typically normalized per-sample, red representing high values and green low). These samples are labeled with a class (taking two or more possible values) that represents the main biological comparison under investigation; they may also have one or more subclass labels reflecting within-class groupings. (a) Step 1 analyzes all features, testing whether values in different classes are differentially distributed. (b) Features violating the null hypothesis are further analyzed in step 2, which tests whether all pairwise comparisons between subclasses in different classes significantly agree with the class-level trend. (c) The resulting subset of vectors is used to build an LDA model from which the relative difference among classes is used to rank the features. The final output thus consists of a list of features that are discriminative with respect to the classes, consistent with the subclass grouping within classes, and ranked according to the effect size with which they differentiate classes.
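Steps 1 and 2 of the schematic can be illustrated for a single feature with the standard library alone. This is a simplified sketch under stated assumptions, not the LEfSe implementation: it handles exactly two classes, uses large-sample approximations without tie or continuity corrections (chi-square with one degree of freedom for the Kruskal-Wallis test, normal for the rank-sum test), and omits the LDA ranking of step 3. All function names are hypothetical.

```python
import math
from itertools import product


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def kruskal_wallis_two_class(a, b):
    """Step 1: KW p-value for two groups (chi-square approximation, 1 df)."""
    r = ranks(list(a) + list(b))
    n = len(r)
    ra, rb = sum(r[:len(a)]), sum(r[len(a):])
    h = 12 / (n * (n + 1)) * (ra ** 2 / len(a) + rb ** 2 / len(b)) - 3 * (n + 1)
    return math.erfc(math.sqrt(max(h, 0.0) / 2))


def rank_sum_test(a, b):
    """Two-sided rank-sum p-value (normal approximation) and direction.

    Direction is +1 when a tends to be larger than b, otherwise -1.
    """
    r = ranks(list(a) + list(b))
    na, nb = len(a), len(b)
    u = sum(r[:na]) - na * (na + 1) / 2
    z = (u - na * nb / 2) / math.sqrt(na * nb * (na + nb + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2)), (1 if z > 0 else -1)


def is_consistent_biomarker(feature, cls, sub, alpha=0.05):
    """Apply step 1, then step 2 over every cross-class pair of subclasses."""
    c0, c1 = sorted(set(cls))
    a = [x for x, c in zip(feature, cls) if c == c0]
    b = [x for x, c in zip(feature, cls) if c == c1]
    if kruskal_wallis_two_class(a, b) >= alpha:
        return False  # step 1: no class-level difference
    _, trend = rank_sum_test(a, b)
    subs0 = sorted({s for s, c in zip(sub, cls) if c == c0})
    subs1 = sorted({s for s, c in zip(sub, cls) if c == c1})
    for s0, s1 in product(subs0, subs1):
        xa = [x for x, s in zip(feature, sub) if s == s0]
        xb = [x for x, s in zip(feature, sub) if s == s1]
        p, direction = rank_sum_test(xa, xb)
        if p >= alpha or direction != trend:
            return False  # step 2: this subclass pair disagrees with the trend
    return True


cls = ["A"] * 10 + ["B"] * 10
sub = ["a1"] * 5 + ["a2"] * 5 + ["b1"] * 5 + ["b2"] * 5
separated = list(range(1, 11)) + list(range(101, 111))
print(is_consistent_biomarker(separated, cls, sub))
```

Requiring every cross-class subclass pair to pass corresponds to the strict behavior described in the caption; a feature whose class-level difference is driven by a single subclass is rejected at step 2 even though it passes step 1.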

References

    1. Golub TR. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.
    1. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA. Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 2002;359:572–577. doi: 10.1016/S0140-6736(02)07746-2.
    1. Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung JA, Chiew YE, Haviv I. Australian Ovarian Cancer Study Group. Gertig D, DeFazio A, Bowtell DD. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res. 2008;14:5198–5208. doi: 10.1158/1078-0432.CCR-08-0196.
    1. Wei X, Li K-C. Exploring the within- and between-class correlation distributions for tumor classification. Proc Natl Acad Sci USA. 2010;107:6737–6742. doi: 10.1073/pnas.0910140107.
    1. De Filippo C, Cavalieri D, Di Paola M, Ramazzotti M, Poullet JB, Massart S, Collini S, Pieraccini G, Lionetti P. Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proc Natl Acad Sci USA. 2010;107:14691–14696. doi: 10.1073/pnas.1005963107.
    1. Turnbaugh PJ, Bäckhed F, Fulton L, Gordon JI. Diet-induced obesity is linked to marked but reversible alterations in the mouse distal gut microbiome. Cell Host Microbe. 2008;3:213–223. doi: 10.1016/j.chom.2008.02.015.
    1. Ley RE, Peterson DA, Gordon JI. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell. 2006;124:837–848. doi: 10.1016/j.cell.2006.02.017.
    1. Manichanh C, Rigottier-Gois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, Nalin R, Jarrin C, Chardon P, Marteau P, Roca J, Dore J. Reduced diversity of faecal microbiota in Crohn's disease revealed by a metagenomic approach. Gut. 2006;55:205–211. doi: 10.1136/gut.2005.073817.
    1. Sokol H, Seksik P, Furet JP, Firmesse O, Nion-Larmurier I, Beaugerie L, Cosnes J, Corthier G, Marteau P, Doré J. Low counts of Faecalibacterium prausnitzii in colitis microbiota. Inflamm Bowel Dis. 2009;15:1183–1189. doi: 10.1002/ibd.20903.
    1. Ordovas JM, Mooser V. Metagenomics: the role of the microbiome in cardiovascular diseases. Curr Opin Lipidol. 2006;17:157–161. doi: 10.1097/.
    1. Zhang L, Henson BS, Camargo PM, Wong DT. The clinical value of salivary biomarkers for periodontal disease. Periodontology 2000. 2009;51:25–37. doi: 10.1111/j.1600-0757.2009.00315.x.
    1. Zhang L, Farrell JJ, Zhou H, Elashoff D, Akin D, Park NH, Chia D, Wong DT. Salivary transcriptomic biomarkers for detection of resectable pancreatic cancer. Gastroenterology. 2010;138:949–957. doi: 10.1053/j.gastro.2009.11.010. e947.
    1. NIH HMP Working Group. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, Baker CC, Di Francesco V, Howcroft TK, Karp RW, Lunsford RD, Wellington CR, Belachew T, Wright M, Giblin C, David H, Mills M, Salomon R, Mullins C, Akolkar B, Begg L, Davis C, Grandison L, Humble M, Khalsa J. et al. The NIH Human Microbiome Project. Genome Res. 2009;19:2317–2323.
    1. Hamady M, Fraser-Liggett CM, Turnbaugh PJ, Ley RE, Knight R, Gordon JI. The Human Microbiome Project. Nature. 2007;449:804–810. doi: 10.1038/nature06244.
    1. Magrini V, Turnbaugh PJ, Ley RE, Mardis ER, Mahowald MA, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–1031. doi: 10.1038/nature05414.
    1. Duncan SH, Lobley GE, Holtrop G, Ince J, Johnstone AM, Louis P, Flint HJ. Human colonic microbiota associated with diet, obesity and weight loss. Int J Obesity (Lond). 2008;32:1720–1724. doi: 10.1038/ijo.2008.155.
    1. Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, Gordon JI. The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med. 2009;1:6ra14. doi: 10.1126/scitranslmed.3000322.
    1. Gao Z, Tseng C-H, Strober BE, Pei Z, Blaser MJ. Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PLoS ONE. 2008;3:e2719. doi: 10.1371/journal.pone.0002719.
    1. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM. Comparative metagenomics of microbial communities. Science. 2005;308:554–557. doi: 10.1126/science.1107851.
    1. Solovyev VV, Allen EE, Ram RJ, Rokhsar DS, Chapman J, Richardson PM, Tyson GW, Rubin EM, Banfield JF, Hugenholtz P. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340.
    1. Lecuit M, Lortholary O. Immunoproliferative small intestinal disease associated with Campylobacter jejuni. Med Mal Infect. 2005;35(Suppl 2):S56–58.
    1. Relman DA, Schmidt TM, MacDermott RP, Falkow S. Identification of the uncultured bacillus of Whipple's disease. N Engl J Med. 1992;327:293–301. doi: 10.1056/NEJM199207303270501.
    1. Oakley BB, Fiedler TL, Marrazzo JM, Fredricks DN. Diversity of human vaginal bacterial communities and associations with clinically defined bacterial vaginosis. Appl Environ Microbiol. 2008;74:4898–4909. doi: 10.1128/AEM.02884-07.
    1. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001;98:5116–5121. doi: 10.1073/pnas.091062498.
    1. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3.
    1. Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer. 2008;8:37–49. doi: 10.1038/nrc2294.
    1. Swan KA, Curtis DE, McKusick KB, Voinov AV, Mapa FA, Cancilla MR. High-throughput gene mapping in Caenorhabditis elegans. Genome Res. 2002;12:1100–1105.
    1. Wooley JC, Ye Y. Metagenomics: facts and artifacts, and computational challenges. J Comput Sci Technol. 2009;25:71–81.
    1. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–484. doi: 10.1038/nature07540.
    1. Pedrós-Alió C. Marine microbial diversity: can it be determined? Trends Microbiol. 2006;14:257–263. doi: 10.1016/j.tim.2006.04.007.
    1. Sogin ML, Morrison HG, Huber JA, Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci USA. 2006;103:12115–12120. doi: 10.1073/pnas.0605127103.
    1. Gobet A, Quince C, Ramette A. Multivariate Cutoff Level Analysis (MultiCoLA) of large community data sets. Nucleic Acids Res. 2010;38:e155. doi: 10.1093/nar/gkq545.
    1. Dethlefsen L, McFall-Ngai M, Relman DA. An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature. 2007;449:811–818. doi: 10.1038/nature06245.
    1. Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–386. doi: 10.1101/gr.5969107.
    1. Mitra S, Gilbert JA, Field D, Huson DH. Comparison of multiple metagenomes using phylogenetic networks based on ecological indices. ISME J. 2010;4:1236–1242. doi: 10.1038/ismej.2010.51.
    1. Mitra S, Klar B, Huson DH. Visual and statistical comparison of metagenomes. Bioinformatics. 2009;25:1849–1855. doi: 10.1093/bioinformatics/btp341.
    1. Parks DH, Beiko RG. Identifying biologically relevant differences between metagenomic communities. Bioinformatics. 2010;26:715–721. doi: 10.1093/bioinformatics/btq041.
    1. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71:8228–8235. doi: 10.1128/AEM.71.12.8228-8235.2005.
    1. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi: 10.1186/1471-2105-9-386.
    1. Kristiansson E, Hugenholtz P, Dalevi D. ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics. 2009;25:2737–2738. doi: 10.1093/bioinformatics/btp508.
    1. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–7541. doi: 10.1128/AEM.01541-09.
    1. Goll J, Rusch D, Tanenbaum DM, Thiagarajan M, Li K, Methé BA, Yooseph S. METAREP: JCVI Metagenomics Reports - an open source tool for high-performance comparative metagenomics. Bioinformatics. 2010;26:2631–2632. doi: 10.1093/bioinformatics/btq455.
    1. Jolliffe IT. Principal Component Analysis. New York: Springer-Verlag; 1986.
    1. Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–338.
    1. White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 2009;5:e1000352. doi: 10.1371/journal.pcbi.1000352.
    1. Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86. doi: 10.1186/gb-2010-11-8-r86.
    1. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 2010;Chapter 19:Unit 19.10.1-21.
    1. LEfSe.
    1. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47:583–621. doi: 10.2307/2280779.
    1. Wilcoxon F. Individual comparisons by ranking methods. Biometrics. 1945;1:80–83. doi: 10.2307/3001968.
    1. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18:50–60. doi: 10.1214/aoms/1177730491.
    1. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugenics. 1936;7:179–188. doi: 10.1111/j.1469-1809.1936.tb02137.x.
    1. Dal Bello F, Hertel C. Oral cavity as natural reservoir for intestinal lactobacilli. Syst Appl Microbiol. 2006;29:69–76. doi: 10.1016/j.syapm.2005.07.002.
    1. Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R. Bacterial community variation in human body habitats across space and time. Science. 2009;326:1694–1697. doi: 10.1126/science.1177486.
    1. Human Microbiome Project clinical sampling protocol.
    1. Turner JR. Intestinal mucosal barrier function in health and disease. Nat Rev Immunol. 2009;9:799–809. doi: 10.1038/nri2653.
    1. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–145. doi: 10.1093/nar/gkn879.
    1. Hilbert F, Scherwitzel M, Paulsen P, Szostak MP. Survival of Campylobacter jejuni under conditions of atmospheric oxygen tension with the support of Pseudomonas spp. Appl Environ Microbiol. 2010;76:5911–5917. doi: 10.1128/AEM.01532-10.
    1. Godon J-J, Morinière J, Moletta M, Gaillac M, Bru V, Delgènes J-P. Rarity associated with specific ecological niches in the bacterial world: the 'Synergistes' example. Environ Microbiol. 2005;7:213–224. doi: 10.1111/j.1462-2920.2004.00693.x.
    1. Shah SA, Simpson SJ, Brown LF, Comiskey M, de Jong YP, Allen D, Terhorst C. Development of colonic adenocarcinomas in a mouse model of ulcerative colitis. Inflamm Bowel Dis. 1998;4:196–202.
    1. Pizarro T. Mouse models for the study of Crohn's disease. Trends Mol Med. 2003;9:218–222. doi: 10.1016/S1471-4914(03)00052-2.
    1. Panwala CM, Jones JC, Viney JL. A novel model of inflammatory bowel disease: mice deficient for the multiple drug resistance gene, mdr1a, spontaneously develop colitis. J Immunol. 1998;161:5733–5744.
    1. Wirtz S, Neurath MF. Mouse models of inflammatory bowel disease. Adv Drug Delivery Rev. 2007;59:1073–1083. doi: 10.1016/j.addr.2007.07.003.
    1. Sartor RB. Mechanisms of disease: pathogenesis of Crohn's disease and ulcerative colitis. Nat Clin Pract Gastroenterol Hepatol. 2006;3:390–407. doi: 10.1038/ncpgasthep0528.
    1. Garrett WS, Lord GM, Punit S, Lugo-Villarino G, Mazmanian SK, Ito S, Glickman JN, Glimcher LH. Communicable ulcerative colitis induced by T-bet deficiency in the innate immune system. Cell. 2007;131:33–45. doi: 10.1016/j.cell.2007.08.017.
    1. Garrett WS, Gallini CA, Yatsunenko T, Michaud M, DuBois A, Delaney ML, Punit S, Karlsson M, Bry L, Glickman JN, Gordon JI, Onderdonk AB, Glimcher LH. Enterobacteriaceae act in concert with the gut microbiota to induce spontaneous and maternally transmitted colitis. Cell Host Microbe. 2010;8:292–300. doi: 10.1016/j.chom.2010.08.004.
    1. Veiga P, Gallini CA, Beal C, Michaud M, Delaney ML, DuBois A, Khlebnikov A, van Hylckama Vlieg JE, Punit S, Glickman JN, Onderdonk A, Glimcher LH, Garrett WS. Bifidobacterium animalis subsp. lactis fermented milk product reduces inflammation by altering a niche for colitogenic microbes. Proc Natl Acad Sci USA. 2010;107:18132–18137. doi: 10.1073/pnas.1011737107.
    1. Masaaki O, Yoshimi B, Kai-P L, Nobuko M. Metascardovia criceti Gen. Nov., Sp. Nov., from hamster dental plaque. Microbiol Immunol. 2007;51:747–754.
    1. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F. Functional metagenomic profiling of nine biomes. Nature. 2008;452:629–632. doi: 10.1038/nature06810.
    1. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V. et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–5702. doi: 10.1093/nar/gki866.
    1. Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, Sobral B, Stevens R, White O, Di Francesco V. National Institute of Allergy and Infectious Diseases bioinformatics resource centers: new assets for pathogen informatics. Infect Immun. 2007;75:3212–3219. doi: 10.1128/IAI.00105-07.
    1. Krebs CJ. Ecology: The Experimental Analysis of Distribution and Abundance. Benjamin Cummings; 2008.
    1. Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–181. doi: 10.1093/dnares/dsm018.
    1. Tatusov RL. A genomic perspective on protein families. Science. 1997;278:631–637. doi: 10.1126/science.278.5338.631.
    1. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001;29:22–28. doi: 10.1093/nar/29.1.22.
    1. Turroni F, Foroni E, Pizzetti P, Giubellini V, Ribbera A, Merusi P, Cagnasso P, Bizzarri B, de'Angelis GL, Shanahan F, van Sinderen D, Ventura M. Exploring the diversity of the bifidobacterial population in the human intestinal tract. Appl Environ Microbiol. 2009;75:1534–1545. doi: 10.1128/AEM.02216-08.
    1. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics. 2005;21:3017–3024. doi: 10.1093/bioinformatics/bti448.
    1. Suzuki Y, Nei M. False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol Biol Evol. 2004;21:914–921. doi: 10.1093/molbev/msh098.
    1. Boulesteix A-L. Over-optimism in bioinformatics research. Bioinformatics. 2010;26:437–439. doi: 10.1093/bioinformatics/btp648.
    1. 2020 visions. Nature. 2010;463:26–32.
    1. Hamady M, Knight R. Microbial community profiling for human microbiome projects: tools, techniques, and challenges. Genome Res. 2009;19:1141–1152. doi: 10.1101/gr.085464.108.
    1. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6:e1000667. doi: 10.1371/journal.pcbi.1000667.
    1. Ritchie MD. Using prior knowledge and genome-wide association to identify pathways involved in multiple sclerosis. Genome Med. 2009;1:65. doi: 10.1186/gm65.
    1. Tintle N, Lantieri F, Lebrec J, Sohns M, Ballard D, Bickeböller H. Inclusion of a priori information in genome-wide association analysis. Genet Epidemiol. 2009;33(Suppl 1):S74–80.
    1. Lin W-Y, Lee W-C. Incorporating prior knowledge to facilitate discoveries in a genome-wide association study on age-related macular degeneration. BMC Res Notes. 2010;3:26. doi: 10.1186/1756-0500-3-26.
    1. Reeder J, Knight R. The 'rare biosphere': a reality check. Nat Methods. 2009;6:636–637. doi: 10.1038/nmeth0909-636.
    1. Taylor MW, Schupp PJ, Dahllof I, Kjelleberg S, Steinberg PD. Host specificity in marine sponge-associated bacteria, and potential implications for marine microbial diversity. Environ Microbiol. 2003;6:121–130. doi: 10.1046/j.1462-2920.2003.00545.x.
    1. Tamames J, Abellán JJ, Pignatelli M, Camacho A, Moya A. Environmental distribution of prokaryotic taxa. BMC Microbiol. 2010;10:85. doi: 10.1186/1471-2180-10-85.
    1. Kassen R. The experimental evolution of specialists, generalists, and the maintenance of diversity. J Evol Biol. 2002;15:173–190. doi: 10.1046/j.1420-9101.2002.00377.x.
    1. Frank DN, Pace NR, Peterson DA, Gordon JI. Metagenomic approaches for defining the pathogenesis of inflammatory bowel diseases. Cell Host Microbe. 2008;3:417–427. doi: 10.1016/j.chom.2008.05.001.
    1. Young C, Sharma R, Handfield M, Mai V, Neu J. Biomarkers for infants at risk for necrotizing enterocolitis: clues to prevention? Pediatric Res. 2009;65:91R–97R. doi: 10.1203/PDR.0b013e31819dba7d.
    1. Asikainen S, Doğan B, Turgut Z, Paster BJ, Bodur A, Oscarsson J. Specified species in gingival crevicular fluid predict bacterial diversity. PLoS ONE. 2010;5:e13589. doi: 10.1371/journal.pone.0013589.
    1. Wong D, Zhang L, Farrell J, Zhou H, Elashoff D, Gao K, Paster B. Salivary biomarkers for pancreatic cancer detection. J Clin Oncol. 2009;27:4630.
    1. Culligan EP, Hill C, Sleator RD. Probiotics and gastrointestinal disease: successes, problems and future prospects. Gut Pathog. 2009;1:19. doi: 10.1186/1757-4749-1-19.
    1. Preidis GA, Versalovic J. Targeting the human microbiome with antibiotics, probiotics, and prebiotics: gastroenterology enters the metagenomics era. Gastroenterology. 2009;136:2015–2031. doi: 10.1053/j.gastro.2009.01.072.
    1. Borody TJ, Warren EF, Leis S, Surace R, Ashman O. Treatment of ulcerative colitis using fecal bacteriotherapy. J Clin Gastroenterol. 2003;37:42–47. doi: 10.1097/00004836-200307000-00012.
    1. Khoruts A, Dicksved J, Jansson JK, Sadowsky MJ. Changes in the composition of the human fecal microbiome after bacteriotherapy for recurrent Clostridium difficile-associated diarrhea. J Clin Gastroenterol. 2010;44:354–360.
    1. Manichanh C, Reeder J, Gibert P, Varela E, Llopis M, Antolin M, Guigo R, Knight R, Guarner F. Reshaping the gut microbiome with bacterial transplantation and antibiotic intake. Genome Res. 2010;20:1411–1419. doi: 10.1101/gr.107987.110.
    1. You D, Franzos MA. Successful treatment of fulminant Clostridium difficile infection with fecal bacteriotherapy. Ann Intern Med. 2008;148:632–633.
    1. Chang Y-w, Lin C-j. Feature ranking using linear SVM. J Machine Learning Res. 2008;3:53–64.
    1. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73:5261–5267. doi: 10.1128/AEM.00062-07.
    1. Bell TC, Cleary JG, Witten IH. Text Compression. Prentice-Hall, Inc; 1990.
    1. HMP Data Analysis and Coordination Center.
    1. Mo Bio PowerSoil kit.
    1. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007;8:R143. doi: 10.1186/gb-2007-8-7-r143.
    1. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–7196. doi: 10.1093/nar/gkm864.
    1. Schloss PD. A high-throughput DNA sequence aligner for microbial ecology studies. PLoS ONE. 2009;4:e8230. doi: 10.1371/journal.pone.0008230.
    1. Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E, Methé B, DeSantis TZ. Human Microbiome Consortium. Petrosino JF, Knight R, Birren BW. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011;21:494–504. doi: 10.1101/gr.112730.110.
    1. Garrity GM, Lilburn TG, Cole JR, Harrison SH, Euzeby J, Tindall BJ. Taxonomic Outline of the Bacteria and Archaea. 2007.
    1. Sequence Read Archive: SRP002012 Human Microbiome Project 454 Clinical Production Pilot (PPS)
    1. Hothorn TH, Hornik K, van De Wiel MA, Zeileis A. Implementing a class of permutation tests: the coin package. J Stat Software. 2008;28:1–23.
    1. Venables WN, Ripley BD. Modern Applied Statistics with S. 4. Springer; 2002.
    1. rpy2.
    1. Hunter JD. Matplotlib: a 2D graphics environment. Computing Sci Eng. 2007;9:90–95.

Source: PubMed
