Waste not, want not: why rarefying microbiome data is inadmissible
Paul J McMurdie, Susan Holmes, Paul J McMurdie, Susan Holmes
Abstract
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
References
- Shendure J, Lieberman Aiden E (2012) The expanding scope of DNA sequencing. Nature Biotechnology 30: 1084–1094.
- Shendure J, Ji H (2008) Next-generation DNA sequencing. Nature Biotechnology 26: 1135–1145.
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5: 621–628.
- Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
- Wilson KH, Wilson WJ, Radosevich JL, DeSantis TZ, Viswanathan VS, et al. (2002) High-Density Microarray of Small-Subunit Ribosomal DNA Probes. Appl Environ Microbiol 68: 2535–2541.
- Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, et al. (2008) Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genetics 4: e1000255.
- Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annual Review of Genetics 38: 525–552.
- Allison DB, Cui X, Page GP, Sabripour M (2006) Microarray Data Analysis: from Disarray to Consolidation and Consensus. Nature Reviews Genetics 7: 55–65.
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18: 1509–1517.
- Lu J, Tomfohr JK, Kepler TB (2005) Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics 6: 165.
- Robinson MD, Smyth GK (2007) Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics (Oxford, England) 9: 321–332.
- Cameron AC, Trivedi P (2013) Regression analysis of count data, volume 53. Cambridge University Press.
- Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 11: R106.
- Yu D, Huber W, Vitek O (2013) Shrinkage estimation of dispersion in Negative Binomial models for RNASeq experiments with small sample size. Bioinformatics (Oxford, England) 29: 1275–1282.
- Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G (2013) High throughput sequencing methods and analysis for microbiome research. Journal of Microbiological Methods 95: 401–414.
- Segata N, Boernigen D, Tickle TL, Morgan XC, Garrett WS, et al. (2013) Computational meta'omics for microbial community studies. Molecular Systems Biology 9: 666.
- Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, et al. (2013) Advancing Our Understanding of the Human Microbiome Using QIIME. Methods in Enzymology 531: 371–444.
- Hughes JB, Hellmann JJ (2005) The application of rarefaction techniques to molecular inventories of microbial diversity. Methods in Enzymology 397: 292–308.
- Koren O, Knights D, González A, Waldron L, Segata N, et al. (2013) A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Computational Biology 9: e1002863.
- Sanders HL (1968) Marine benthic diversity: A comparative study. The American Naturalist 102: 243–282.
- Gotelli NJ, Colwell RK (2001) Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology Letters 4: 379–391.
- Mao CX, Colwell RK (2005) Estimation of Species Richness: Mixture Models, the Role of Rare Species, and Inferential Challenges. Ecology 86: 1143–1153.
- Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71: 8228–8235.
- Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R (2011) UniFrac: an effective distance metric for microbial community comparison. The ISME Journal 5: 169–172.
- Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods 5: 235–237.
- Liu Z, DeSantis TZ, Andersen GL, Knight R (2008) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120.
- Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME Journal 4: 17–27.
- Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, et al. (2012) Human gut microbiome viewed across age and geography. Nature 486: 222–227.
- Caporaso J, Kuczynski J, Stombaugh J, Bittinger K, Bushman F, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7: 335–336.
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology 75: 7537–7541.
- Gilbert JA, Field D, Swift P, Newbold L, Oliver A, et al. (2009) The seasonal structure of microbial communities in the Western English Channel. Environmental Microbiology 11: 3132–3139.
- McMurdie PJ, Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8: e61217.
- Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, et al. (2010) Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS ONE 5: e15216.
- Price LB, Liu CM, Johnson KE, Aziz M, Lau MK, et al. (2010) The effects of circumcision on the penis microbiome. PLoS ONE 5: e8422.
- Kembel SW, Jones E, Kline J, Northcutt D, Stenson J, et al. (2012) Architectural design influences the diversity and structure of the built environment microbiome. The ISME Journal 6: 1469–1479.
- Flores GE, Bates ST, Caporaso JG, Lauber CL, Leff JW, et al. (2013) Diversity, distribution and sources of bacteria in residential kitchens. Environmental Microbiology 15: 588–596.
- Kang DW, Park JG, Ilhan ZE, Wallstrom G, Labaer J, et al. (2013) Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS ONE 8: e68322.
- Segata N, Haake SK, Mannon P, Lemon KP, Waldron L, et al. (2012) Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biology 13: R42.
- White JR, Nagarajan N, Pop M (2009) Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology 5: e1000352.
- Paulson JN, Stine OC, Bravo HC, Pop M (2013) Differential abundance analysis for microbial marker-gene surveys. Nature Methods 10: 1200–1202.
- Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140.
- Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325–338.
- Oksanen J, Blanchet FG, Kindt R, Legendre P, O'Hara RB, et al... (2011) vegan: Community Ecology Package. R package version 1.17-10.
- Anderson M (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology 26: 32–46.
- Bray JR, Curtis JT (1957) An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecological Monographs 27: 325.
- Witten DM (2011) Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics 5: 2493–2518.
- Lozupone CA, Hamady M, Kelley ST, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73: 1576–1585.
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108: 4516–4522.
- Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data: An Introduction to Cluster Analysis, JohnWiley & Sons, chapter 2.
- Reynolds A, Richards G, Iglesia B, Rayward-Smith V (2006) Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms 5: 475–504.
- Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S (2010) multtest: Resampling-based multiple hypothesis testing. R package version 2.4.0.
- Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 57: 289–300.
- Allaire J, Horner J, Marti V, Porte N (2014) markdown: Markdown rendering for R. R package version 0.6.4.
- Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2013) cluster: Cluster Analysis Basics and Extensions.
- Revolution Analytics (2011) foreach: Foreach looping construct for R. R package version 1.3.2.
- Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer New York.
- Wickham H (2011) The split-apply-combine strategy for data analysis. Journal of Statistical Software 40: 1–29.
- Wickham H (2007) Reshaping data with the reshape package. Journal of Statistical Software 21: 1–20.
- Sing T, Sander O, Beerenwinkel N, Lengauer T (2005) ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England) 21: 3940–3941.
- R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
- Hastie TJ, Pregibon D (1992) Generalized linear models. In: Chambers JM, Hastie TJ, editors, Statistical Models in S, Chapman & Hall/CRC, chapter 6..
- Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, et al. (2012) A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and crosscomparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Research 40: 10084–10097.
- Bullard J, Purdom E, Hansen K, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.
- Sun J, Nishiyama T, Shimizu K, Kadota K (2013) TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 14: 219.
- Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-Seq data. BMC Bioinformatics 14: 91.
- Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422.
- Ozer HG, Parvin JD, Huang K (2012) DFI: gene feature discovery in RNA-Seq experiments from multiple sources. BMC Genomics 13 Suppl 8: S11.
- Bourgon R, Gentleman R, Huber W (2010) Independent filtering increases detection power for highthroughput experiments. Proceedings of the National Academy of Sciences 107: 9546–9551.
- Chao A, Chazdon RL, Colwell RK, Shen TJ (2005) A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters 8: 148–159.
- Schloss PD (2008) Evaluating different approaches that test whether microbial communities have the same structure. The ISME Journal 2: 265–275.
- Gentleman R, Temple Lang D (2004) Statistical analyses and reproducible research. Bioconductor Project Working Papers 1: 1–38.
- Peng RD (2011) Reproducible research in computational science. Science 334: 1226–1227.
- Donoho DL (2010) An invitation to reproducible computational research. Biostatistics (Oxford, England) 11: 385–388.
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.
- Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334: 105–108.
Source: PubMed