phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data

Paul J McMurdie, Susan Holmes, Paul J McMurdie, Susan Holmes

Abstract

Background: the analysis of microbial communities through dna sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data.

Results: Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research.

Conclusions: The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Example of a phylogenetic sequencing…
Figure 1. Example of a phylogenetic sequencing workflow.
A diagram of an experimental and analysis workflow for amplicon or shotgun phylogenetic sequencing. The intended role for phyloseq is indicated.
Figure 2. Analysis workflow using phyloseq.
Figure 2. Analysis workflow using phyloseq.
The workflow starts with the results of OTU clustering and independently-measured sample data (Input, top left), and ends at various analytic procedures available in R for inference and validation. In between are key functions for preprocessing and graphics. Rounded rectangles and diamond shapes represent functions and data objects, respectively, further described in Figure 3.
Figure 3. The “phyloseq” class.
Figure 3. The “phyloseq” class.
The phyloseq class is an experiment-level data storage class defined by the phyloseq package for representing phylogenetic sequencing data. Most functions in the phyloseq package expect an instance of this class as their primary argument. See the phyloseq manual for a complete list of functions.
Figure 4. Graphic functions of the phyloseq…
Figure 4. Graphic functions of the phyloseq package.
The phyloseq class is an experiment-level data storage class defined by the phyloseq package for representing phylogenetic sequencing data. Most functions in the phyloseq package expect an instance of this class as their primary argument. See the phyloseq manual The Global Patterns and Enterotypes datasets are included with the phyloseq package. The Global Patterns data was preprocessed such that each sample was transformed to the same total read depth, and OTUs were trimmed that were not observed at least 3 times in 20% of samples or had a coefficient of variation ≤ 3.0 across all samples. For the plot_tree and plot_bar subplots, only the Bacteroidetes phylum is shown. Each subplot title indicates the plot function that produced it. Complete details for reproducing this figure are provided in File S2. All of these functions return a ggplot object that can be further customized/modified by tools in the ggplot2 package . See additional descriptions of each function in the body text, and at the phyloseq homepage .
Figure 5. plot_ordination display methods included in…
Figure 5. plot_ordination display methods included in phyloseq.
Each panel uses a “Bacteroidetes-only” subset of the preprocessed “Global Patterns” dataset that was also used in Figure 4. The coordinates are derived from an unconstrained correspondence analysis . Different panels illustrate different displays of the ordination results using the type argument to the plot_ordination function. (Top Left) Example of a samples-only display, with the “SampleType” mapped to the color aesthetic, and a filled-polygon layer to emphasize plot regions where sample types co-occur. (Top Left Insert) A “scree” plot of the eigenvalues associated with each axis, which indicates the proportion of total variability represented in each axis. (Top Right) Biplot representation in which samples and OTUs ordination results are overlaid. Clumps of OTUs appear to co-occur with different sample types, and some correlation with taxonomic phylum is also evident. (Middle) An OTUs-only plot that has been faceted (separated into panels) by class, with a two-dimensional density estimate overlain in blue. This view shows clearly a lack of association between the Sphingobacteria and Flavobacteria classes with fecal samples, which appear to be enriched in a subset of the Bacteroidia (relative to other OTUs in this Bacteroidetes-only dataset). Meanwhile, subsets of Bacteroidia appear to be enriched within multiple sample types. (Bottom) The “split” type for this graphic, in which both samples-only and OTUs-only plots are created, and shown side-by-side with one legend and shared vertical axis. Both the “biplot” and “split” options allow dual projections of both OTU- and sample-space.

References

    1. Metzker ML (2010) Sequencing technologies - the next generation. Nature Reviews Genetics 11: 31–46.
    1. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods 5: 235–237.
    1. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
    1. Liu Z, DeSantis TZ, Andersen GL, Knight R (2008) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120.
    1. DeSantis TZ, Hugenholtz P, Keller K, Brodie EL, Larsen N, et al. (2006) NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Research 34: W394–9.
    1. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environ-mental Microbiology 72: 5069–5072.
    1. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research 37: D141–5.
    1. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, et al. (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research 35: 7188–7196.
    1. Li W, Godzik A (2006) CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658–1659.
    1. Huang Y, Niu B, Gao Y, Fu L, Li W (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26: 680–682.
    1. Caporaso J, Kuczynski J, Stombaugh J, Bittinger K, Bushman F, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature methods 7: 335–336.
    1. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology 75: 7537–7541.
    1. Giongo A, Crabb DB, Davis-Richardson AG, Chauliac D, Mobberley JM, et al. (2010) PANGEA: pipeline for analysis of next generation amplicons. The ISME Journal 4: 852–861.
    1. Kunin V (2010) PyroTagger: A fast, accurate pipeline for analysis of rRNA amplicon pyrosequence data. The Open Journal
    1. Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12: 356.
    1. 8th Annual Biotechnology and Bioinformatics Symposium (2011) The Genboree Microbiome Toolset and the Analysis of 16S rRNA Microbial Sequences. .
    1. QIIME EC2 image documentation. Available: . Accessed 2013 March 22.
    1. University of Colorado Boulder Knight Lab. n3phele bioinformatics in the cloud. Available: . Accessed 2013 March 22.
    1. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, et al. (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9: 386.
    1. Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, et al. (1998) Shotgun sequencing of the human genome. Science 280: 1540–1542.
    1. Fleischmann R, Adams M, White O, Clayton R, Kirkness E, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus inuenzae Rd. Science 269: 496–512.
    1. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the sargasso sea. Science 304: 66–74.
    1. Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, et al. (2011) PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS computational biology 7: e1001061.
    1. R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
    1. Stroustrup B (2000) The C++ programming language. ISBN 0201700735. Addison-Wesley Pro-fessional, 3rd edition.
    1. Chambers J (2008) Software for data analysis: programming with R. Springer Verlag.
    1. Simpson GL. CRAN Task View: Analysis of Ecological and Environmental Data. Available: . Accessed 2013 March 22.
    1. Chakerian J, Holmes S (2010) distory: Distances between trees.
    1. Schliep KP (2011) phangorn: phylogenetic analysis in R. Bioinformatics 27: 592–593.
    1. Kembel SW, Cowan PD, Helmus MR, Cornwell WK, Morlon H, et al. (2010) Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26: 1463–1464.
    1. McMurdie PJ, Holmes S (2012) phyloseq: A Bioconductor Package for Handling and Analysis of High-Throughput Phylogenetic Sequence Data. Pacific Symposium on Biocomputing 17: 235–246.
    1. Hardle W, Ronz B, editors (2002) Sweave. Dynamic generation of statistical reports using literate data analysis. Compstat 2002, Proceedings in Computational Statistics.
    1. Xie Y (2012) knitr: A general-purpose package for dynamic report generation in R. R package version 0.8
    1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.
    1. Beck D, Settles M, Foster JA (2011) OTUbase: an R infrastructure package for operational taxo-nomic unit data. Bioinformatics
    1. OTUbase Bioconductor Release Page. (2012) Available: . Accessed 2013 March 22.
    1. McDonald D, Clemente JC, Kuczynski J (2012) The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Giga Science
    1. McMurdie PJ, Holmes S. Package manual for phyloseq. Available: . Accessed 2013 March 22.
    1. The phyloseq Homepage. Available: . Accessed 2013 March 22.
    1. R Development Core Team (2012) Writing R Extensions. Comprehensive R Archive Network (CRAN).
    1. Wickham H, Danenberg P, Eugster M. roxygen2: In-source documentation for R. R package version 2.2.2. Available: . Accessed 2013 March 22.
    1. Faith D, Minchin P (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69: 57–68.
    1. Anderson MJ, Ellingsen KE, McArdle BH (2006) Multivariate dispersion as a measure of beta diversity. Ecology Letters 9: 683–693.
    1. Hamady M, Lozupone C, Knight R (2009) Fast unifrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and phylochip data. The ISME Journal
    1. Lozupone CA, Hamady M, Kelley ST, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73: 1576–1585.
    1. Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71: 8228–8235.
    1. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108: 4516–4522.
    1. Greenacre MJ (1984) Theory and Applications of Correspondence Analysis. London: Academic Press.
    1. Ter Braak CJF (1986) Canonical Correspondence Analysis: A new eigenvector technique for multivariate direct gradient analysis. Ecology 67: 1167.
    1. Hill M, Gauch H (1980) Detrended Correspondence Analysis, an improved ordination technique. Vegetatio 42: 47–58.
    1. Wollenberg AL (1977) Redundancy analysis an alternative for canonical correlation analysis. Psychometrika 42: 207–219.
    1. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24: 417–441.
    1. Pavoine S, Dufour A, Chessel D (2004) From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology 228: 523–537.
    1. Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325–338.
    1. Minchin PR (1987) An evaluation of the relative robustness of techniques for ecological ordination. Vegetatio 69: 89–107.
    1. Thioulouse J (2011) Simultaneous analysis of a sequence of paired ecological tables: A comparison of several methods. Annals of Applied Statistics 5: 2300–2325.
    1. Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer New York.
    1. Wilkinson L, Wills G (2005) The Grammar Of Graphics. Statistics and Computing. Springer, 2nd edition.
    1. Rajaram S, Oono Y (2010) NeatMap–non-clustering heat map alternatives in R. BMC Bioinformatics 11: 45.
    1. Csardi G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Systems 1695.
    1. Tufte ER (2001) The visual display of quantitative information, Graphics Press, Cheshire, Con-necticut, chapter 9 Aesthetics and Technique in Data Graphical Design. 2nd edition, p. 178.
    1. Greenacre M (2007) Correspondence analysis in practice. Chapman & Hall.
    1. Pinto AJ, Raskin L (2012) PCR Biases Distort Bacterial and Archaeal Community Structure in Pyrosequencing Datasets. PLoS ONE 7: e43093.
    1. Sanders HL (1968) Marine benthic diversity: A comparative study. The American Naturalist 102: 243–282.
    1. Holmes S, Alekseyenko A, Timme A, Nelson T, Pasricha PJ, et al. (2011) Visualization and statisti-cal comparisons of microbial communities using R packages on phylochip data. Pacific Symposium on Biocomputing 142–153.
    1. Allison DB, Cui X, Page GP, Sabripour M (2006) Microarray Data Analysis: from Disarray to Consolidation and Consensus. Nat Rev Genet 7: 55–65.
    1. Holmes S, McMurdie PJ (2012) Statistical analysis challenges in the microbiome. To appear PNAS: The Social Biology of Microbial Communities forum on Microbial Threats
    1. Nelson T, Pasricha P, Holmes S, Spormann A (2010) Shifts in luminal and mucosal microbial communities associated with an experimental model of irritable bowel syndrome. Gastroenterology
    1. Efron B, Tibshirani R (1993) An introduction to the bootstrap, volume 57. Chapman & Hall/CRC.
    1. Holmes S (2003) Bootstrapping phylogenetic trees: theory and methods. Statistical Science 241–255.
    1. Westfall PH, Young SS (1993) Resampling-Based Multiple Testing. Examples and Methods for P-Value Adjustment. Wiley-Interscience
    1. Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S (2010) multtest: Resampling-based multiple hypothesis testing. R package version 2.4.0
    1. Ioannidis JPA (2005) Why most published research findings are false. PLoS medicine 2: e124.
    1. Merali Z (2010) Computational science: Error, why scientific programming does not compute. Nature 467: 775–777.
    1. Peng RD (2011) Reproducible research in computational science. Science 334: 1226–1227.
    1. Ince DC, Hatton L, Graham-Cumming J (2012) The case for open computer programs. Nature 482: 485–488.
    1. Carey VJ, Stodden V (2010) Reproducible Research Concepts and Tools for Cancer Bioinformatics. In: Ochs MF, Casagrande JT, Davuluri RV, editors, Biomedical Informatics for Cancer Research, Boston, MA: Springer US. pp. 149–175.
    1. Knight R, Jansson J, Field D, Fierer N, Desai N, et al. (2012) Unlocking the potential of metage-nomics through replicated experimental design. Nature biotechnology 30: 513–520.
    1. Human Microbiome Project Consortium (2012) Structure, function and diversity of the healthy human microbiome. Nature 486: 207–214.
    1. Donoho DL (2010) An invitation to reproducible computational research. Biostatistics (Oxford, England) 11: 385–388.
    1. Peng RD (2009) Reproducible research and Biostatistics. Biostatistics (Oxford, England) 10: 405–408.
    1. Gentleman R, Temple Lang D (2004) Statistical analyses and reproducible research. Bioconductor Project Working Papers 2.
    1. Pérez F, Granger BE (2007) IPython: a System for Interactive Scientific Computing. Comput Sci Eng 9: 21–29.
    1. Allaire J, Horner J, Marti V, Porte N The markdown package: Markdown rendering for R. R package version 0.5.4. Available: . Accessed 2013 March 22.
    1. Gentleman R (2005) Reproducible research: a bioinformatics case study. Statistical applications in genetics and molecular biology 4: Article2.
    1. The phyloseq Demo Repository. Available: . Accessed 2013 March 22.
    1. Barnes N (2010) Publish your computer code: it is good enough. Nature 467: 753.
    1. Copeland WK, Krishnan V, Beck D, Settles M, Foster JA, et al. (2012) mcaGUI: microbial commu-nity analysis R-Graphical User Interface (GUI). Bioinformatics (Oxford, England) 28: 2198–2199.
    1. Wickham H (2007) Reshaping data with the reshape package. Journal of Statistical Software 21: 1–20.
    1. Wickham H (2011) The split-apply-combine strategy for data analysis. Journal of Statistical Software 40: 1–29.
    1. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, et al. (2011) Enterotypes of the human gut microbiome. Nature 473: 174–180.
    1. Oksanen J, Blanchet FG, Kindt R, Legendre P, O'Hara RB, et al. (2011) vegan: Community Ecology Package. R package version 1.17–10

Source: PubMed

3
Suscribir