phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data
Paul J McMurdie, Susan Holmes
Abstract
Background: The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high-throughput microbiome census data.
Results: Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open-source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research.
Conclusions: The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Source: PubMed