Waste not, want not: why rarefying microbiome data is inadmissible

Paul J McMurdie, Susan Holmes, Paul J McMurdie, Susan Holmes

Abstract

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. A minimal example of the…
Figure 1. A minimal example of the effect of rarefying on statistical power.
Hypothetical abundance data in its original (Top-Left) and rarefied (Top-Right) form, with corresponding formal test results for differentiation (Bottom).
Figure 2. Graphical summary of the two…
Figure 2. Graphical summary of the two simulation frameworks.
Both Simulation A (clustering) and Simulation B (differential abundance) are represented. All simulations begin with real microbiome count data from a survey experiment referred to here as “the Global Patterns dataset” . Tables of integers with multiple columns represent an abundance count matrix (“OTU table”), while a single-column of integers represents a multinomial of OTU counts/proportions. In both simulation illustrations an effect size is explained and given an example value of 10 for easy mental computation, but its meaning is different for each simulation. Note that effect size is altogether different than library size, the latter being equivalent to both the column sums and the number of reads per sample. A grey highlight indicates count values for which an effect has been applied in Simulation B. Protocol S1 includes the complete source code used to compute the example values shown here, as well as the full simulations discussed below.
Figure 3. Examples of overdispersion in microbiome…
Figure 3. Examples of overdispersion in microbiome data.
Common-Scale Variance versus Mean for Microbiome Data. Each point in each panel represents a different OTU's mean/variance estimate for a biological replicate and study. The data in this figure come from the Global Patterns survey and the Long-Term Dietary Patterns study , with results from additional studies included in Protocol S1. (Right) Variance versus mean abundance for rarefied counts. (Left) Common-scale variances and common-scale means, estimated according to Equations 6 and 7 from Anders and Huber , implemented in the DESeq package (Text S1). The dashed gray line denotes the σ2 = μ case (Poisson; φ = 0). The cyan curve denotes the fitted variance estimate using DESeq , with method = ‘pooled’, sharingMode = ‘fit-only’, fitType = ‘local’.
Figure 4. Clustering accuracy in simulated two-class…
Figure 4. Clustering accuracy in simulated two-class mixing.
Partitioning around medoids, clustering accuracy (vertical axis) that results following different normalization and distance methods. Points denote the mean values of replicates, with a vertical bar representing one standard deviation above and below. Normalization method is indicated by both shade and shape, while panel columns and panel rows indicate the distance metric and median library size (), respectively. The horizontal axis is the effect size, which in this context is the ratio of target to non-target values in the multinomials that were used to simulate microbiome counts. Each multinomial is derived from two microbiomes that have negligible overlapping OTUs (Fecal and Ocean microbiomes in the Global Patterns dataset [48]). Higher values of effect size indicate an easier clustering task. For simulation details and precise definitions of abbreviations see Simulation A of the Methods section.
Figure 5. Normalization by rarefying only, dependency…
Figure 5. Normalization by rarefying only, dependency on library size threshold.
Unlike the analytical methods represented in Figure 4, here rarefying is the only normalization method used, but at varying values of the minimum library size threshold, shown as library-size quantile (horizontal axis). Panel columns, panel rows, and point/line shading indicate effect size (ES), median library size (), and distance method applied after rarefying, respectively. Because discarded samples cannot be accurately clustered, the line is the maximum achievable accuracy.
Figure 6. Performance of differential abundance detection…
Figure 6. Performance of differential abundance detection with and without rarefying.
Performance summarized here by the “Area Under the Curve” (AUC) metric of a Receiver Operator Curve (ROC) (vertical axis). Briefly, the AUC value varies from 0.5 (random) to 1.0 (perfect), incorporating both sensitivity and specificity. The horizontal axis indicates the effect size, shown as the actual multiplication factor applied to OTU counts in the test class to simulate a differential abundance. Each curve traces the respective normalization method's mean performance of that panel, with a vertical bar indicating a standard deviation in performance across all replicates and microbiome templates. The right-hand side of the panel rows indicates the median library size, , while the darkness of line shading indicates the number of samples per simulated experiment. Color shade and shape indicate the normalization method. See Methods section for the definitions of each normalization and testing method. For all methods, detection among multiple tests was defined using a False Discovery Rate (Benjamini-Hochberg [52]) significance threshold of 0.05.

References

    1. Shendure J, Lieberman Aiden E (2012) The expanding scope of DNA sequencing. Nature Biotechnology 30: 1084–1094.
    1. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nature Biotechnology 26: 1135–1145.
    1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5: 621–628.
    1. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
    1. Wilson KH, Wilson WJ, Radosevich JL, DeSantis TZ, Viswanathan VS, et al. (2002) High-Density Microarray of Small-Subunit Ribosomal DNA Probes. Appl Environ Microbiol 68: 2535–2541.
    1. Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, et al. (2008) Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genetics 4: e1000255.
    1. Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annual Review of Genetics 38: 525–552.
    1. Allison DB, Cui X, Page GP, Sabripour M (2006) Microarray Data Analysis: from Disarray to Consolidation and Consensus. Nature Reviews Genetics 7: 55–65.
    1. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18: 1509–1517.
    1. Lu J, Tomfohr JK, Kepler TB (2005) Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics 6: 165.
    1. Robinson MD, Smyth GK (2007) Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics (Oxford, England) 9: 321–332.
    1. Cameron AC, Trivedi P (2013) Regression analysis of count data, volume 53. Cambridge University Press.
    1. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biology 11: R106.
    1. Yu D, Huber W, Vitek O (2013) Shrinkage estimation of dispersion in Negative Binomial models for RNASeq experiments with small sample size. Bioinformatics (Oxford, England) 29: 1275–1282.
    1. Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G (2013) High throughput sequencing methods and analysis for microbiome research. Journal of Microbiological Methods 95: 401–414.
    1. Segata N, Boernigen D, Tickle TL, Morgan XC, Garrett WS, et al. (2013) Computational meta'omics for microbial community studies. Molecular Systems Biology 9: 666.
    1. Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, et al. (2013) Advancing Our Understanding of the Human Microbiome Using QIIME. Methods in Enzymology 531: 371–444.
    1. Hughes JB, Hellmann JJ (2005) The application of rarefaction techniques to molecular inventories of microbial diversity. Methods in Enzymology 397: 292–308.
    1. Koren O, Knights D, González A, Waldron L, Segata N, et al. (2013) A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Computational Biology 9: e1002863.
    1. Sanders HL (1968) Marine benthic diversity: A comparative study. The American Naturalist 102: 243–282.
    1. Gotelli NJ, Colwell RK (2001) Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology Letters 4: 379–391.
    1. Mao CX, Colwell RK (2005) Estimation of Species Richness: Mixture Models, the Role of Rare Species, and Inferential Challenges. Ecology 86: 1143–1153.
    1. Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71: 8228–8235.
    1. Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R (2011) UniFrac: an effective distance metric for microbial community comparison. The ISME Journal 5: 169–172.
    1. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods 5: 235–237.
    1. Liu Z, DeSantis TZ, Andersen GL, Knight R (2008) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120.
    1. Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME Journal 4: 17–27.
    1. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, et al. (2012) Human gut microbiome viewed across age and geography. Nature 486: 222–227.
    1. Caporaso J, Kuczynski J, Stombaugh J, Bittinger K, Bushman F, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7: 335–336.
    1. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology 75: 7537–7541.
    1. Gilbert JA, Field D, Swift P, Newbold L, Oliver A, et al. (2009) The seasonal structure of microbial communities in the Western English Channel. Environmental Microbiology 11: 3132–3139.
    1. McMurdie PJ, Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8: e61217.
    1. Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, et al. (2010) Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS ONE 5: e15216.
    1. Price LB, Liu CM, Johnson KE, Aziz M, Lau MK, et al. (2010) The effects of circumcision on the penis microbiome. PLoS ONE 5: e8422.
    1. Kembel SW, Jones E, Kline J, Northcutt D, Stenson J, et al. (2012) Architectural design influences the diversity and structure of the built environment microbiome. The ISME Journal 6: 1469–1479.
    1. Flores GE, Bates ST, Caporaso JG, Lauber CL, Leff JW, et al. (2013) Diversity, distribution and sources of bacteria in residential kitchens. Environmental Microbiology 15: 588–596.
    1. Kang DW, Park JG, Ilhan ZE, Wallstrom G, Labaer J, et al. (2013) Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS ONE 8: e68322.
    1. Segata N, Haake SK, Mannon P, Lemon KP, Waldron L, et al. (2012) Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biology 13: R42.
    1. White JR, Nagarajan N, Pop M (2009) Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology 5: e1000352.
    1. Paulson JN, Stine OC, Bravo HC, Pop M (2013) Differential abundance analysis for microbial marker-gene surveys. Nature Methods 10: 1200–1202.
    1. Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140.
    1. Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325–338.
    1. Oksanen J, Blanchet FG, Kindt R, Legendre P, O'Hara RB, et al... (2011) vegan: Community Ecology Package. R package version 1.17-10.
    1. Anderson M (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology 26: 32–46.
    1. Bray JR, Curtis JT (1957) An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecological Monographs 27: 325.
    1. Witten DM (2011) Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics 5: 2493–2518.
    1. Lozupone CA, Hamady M, Kelley ST, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73: 1576–1585.
    1. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108: 4516–4522.
    1. Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data: An Introduction to Cluster Analysis, JohnWiley & Sons, chapter 2.
    1. Reynolds A, Richards G, Iglesia B, Rayward-Smith V (2006) Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms 5: 475–504.
    1. Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S (2010) multtest: Resampling-based multiple hypothesis testing. R package version 2.4.0.
    1. Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 57: 289–300.
    1. Allaire J, Horner J, Marti V, Porte N (2014) markdown: Markdown rendering for R. R package version 0.6.4.
    1. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2013) cluster: Cluster Analysis Basics and Extensions.
    1. Revolution Analytics (2011) foreach: Foreach looping construct for R. R package version 1.3.2.
    1. Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer New York.
    1. Wickham H (2011) The split-apply-combine strategy for data analysis. Journal of Statistical Software 40: 1–29.
    1. Wickham H (2007) Reshaping data with the reshape package. Journal of Statistical Software 21: 1–20.
    1. Sing T, Sander O, Beerenwinkel N, Lengauer T (2005) ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England) 21: 3940–3941.
    1. R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
    1. Hastie TJ, Pregibon D (1992) Generalized linear models. In: Chambers JM, Hastie TJ, editors, Statistical Models in S, Chapman & Hall/CRC, chapter 6..
    1. Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, et al. (2012) A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and crosscomparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Research 40: 10084–10097.
    1. Bullard J, Purdom E, Hansen K, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.
    1. Sun J, Nishiyama T, Shimizu K, Kadota K (2013) TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 14: 219.
    1. Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-Seq data. BMC Bioinformatics 14: 91.
    1. Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422.
    1. Ozer HG, Parvin JD, Huang K (2012) DFI: gene feature discovery in RNA-Seq experiments from multiple sources. BMC Genomics 13 Suppl 8: S11.
    1. Bourgon R, Gentleman R, Huber W (2010) Independent filtering increases detection power for highthroughput experiments. Proceedings of the National Academy of Sciences 107: 9546–9551.
    1. Chao A, Chazdon RL, Colwell RK, Shen TJ (2005) A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters 8: 148–159.
    1. Schloss PD (2008) Evaluating different approaches that test whether microbial communities have the same structure. The ISME Journal 2: 265–275.
    1. Gentleman R, Temple Lang D (2004) Statistical analyses and reproducible research. Bioconductor Project Working Papers 1: 1–38.
    1. Peng RD (2011) Reproducible research in computational science. Science 334: 1226–1227.
    1. Donoho DL (2010) An invitation to reproducible computational research. Biostatistics (Oxford, England) 11: 385–388.
    1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.
    1. Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334: 105–108.

Source: PubMed

3
订阅