Hypothesis testing and power calculations for taxonomic-based human microbiome data

Patricio S La Rosa, J Paul Brooks, Elena Deych, Edward L Boone, David J Edwards, Qin Wang, Erica Sodergren, George Weinstock, William D Shannon

Abstract

This paper presents new biostatistical methods for the analysis of microbiome data based on a fully parametric approach using all the data. The Dirichlet-multinomial distribution allows the analyst to calculate power and sample sizes for experimental design, perform tests of hypotheses (e.g., compare microbiomes across groups), and estimate parameters describing microbiome properties. A fully parametric model for these data has an advantage over alternative non-parametric approaches such as bootstrapping and permutation testing: it retains more of the information contained in the data. This paper details the statistical approaches for several tests of hypothesis and power/sample size calculations, and applies them for illustration to taxonomic abundance distribution and rank abundance distribution data using HMP Jumpstart data on 24 subjects for saliva, subgingival, and supragingival samples. Software for running these analyses is available.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Description of Dirichlet-multinomial parameters.
Intuitive description of the meaning of the overdispersion parameter. The four plots show the taxa frequencies for each of five hypothetical samples (dashed lines) with 12 taxa in each sample, and the corresponding weighted average across the five samples given by the vector of taxa frequencies (solid line). The plots on the left show the taxa frequencies of five samples drawn from a multinomial distribution, and the plots on the right show the taxa frequencies of five samples drawn from a Dirichlet-multinomial distribution. The top row of plots is for samples with a smaller number of sequence reads, while the bottom row is for samples with a larger number of sequence reads. As the number of reads increases under the multinomial distribution, each sample's taxa frequencies converge onto the mean, while under the Dirichlet-multinomial an increased number of reads is still associated with the same variability between the individual samples.
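The behaviour described in this caption can be reproduced with a small simulation. The sketch below is illustrative only and is not the authors' software: it assumes the common parameterization in which the Dirichlet parameters are the mean taxa frequencies scaled by (1 - theta) / theta, and all function names, the 12-taxon mean vector, and theta = 0.1 are hypothetical choices made for the example.

```python
import random


def sample_multinomial(n_reads, probs, rng):
    """Draw taxa counts for one sample from Multinomial(n_reads, probs)."""
    counts = [0] * len(probs)
    for taxon in rng.choices(range(len(probs)), weights=probs, k=n_reads):
        counts[taxon] += 1
    return counts


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_multinomial(n_reads, mean_probs, theta, rng):
    """Draw taxa counts for one sample from a Dirichlet-multinomial.

    Assumed parameterization: alpha_j = mean_probs[j] * (1 - theta) / theta,
    with overdispersion theta strictly between 0 and 1.
    """
    scale = (1.0 - theta) / theta
    alpha = [p * scale for p in mean_probs]
    return sample_multinomial(n_reads, sample_dirichlet(alpha, rng), rng)


def spread_of_first_taxon(draw, n_samples, n_reads):
    """Sample variance, across samples, of the observed frequency of taxon 0."""
    freqs = [draw(n_reads)[0] / n_reads for _ in range(n_samples)]
    mean = sum(freqs) / n_samples
    return sum((f - mean) ** 2 for f in freqs) / (n_samples - 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical mean frequencies for 12 taxa (weights 12, 11, ..., 1).
    pi = [w / 78 for w in range(12, 0, -1)]
    for n_reads in (100, 2000):
        mult = spread_of_first_taxon(
            lambda n: sample_multinomial(n, pi, rng), 100, n_reads)
        dm = spread_of_first_taxon(
            lambda n: sample_dirichlet_multinomial(n, pi, 0.1, rng), 100, n_reads)
        print(f"reads={n_reads}: multinomial var={mult:.6f}, DM var={dm:.6f}")
```

Under these assumptions the between-sample variance of the multinomial draws shrinks roughly in proportion to 1/reads, whereas the Dirichlet-multinomial variance levels off at a value governed by theta.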
Figure 2. Definition of effect size.
Illustration of a small and a large effect size when comparing two groups.
Figure 3. Comparison of two metagenomic groups using a taxa composition data analysis approach.
Taxa frequency means at class level obtained from subgingival plaque samples (blue curve) and from supragingival plaque samples (red curve): a) The mean of all taxa frequencies found in each group; b) The mean of taxa frequencies whose weighted average across both groups is larger than 1%. The remaining taxa are pooled into an additional taxon labeled as ‘Pooled taxa’.
Figure 4. Comparison of three metagenomic groups using a taxa composition data analysis approach.
Taxa frequency means at class level obtained from saliva samples (black line), subgingival plaque samples (blue line), and supragingival plaque samples (red line): a) The mean of all taxa frequencies found in each group; b) The mean of taxa frequencies whose weighted average across all three groups is larger than 1%. The remaining taxa are pooled into an additional taxon labeled as ‘Pooled taxa’.
Figure 5. Comparison of two metagenomic groups using rank abundance distribution data.
Mean ranked taxa frequencies at class level obtained from subgingival plaque samples (blue curve) and from supragingival plaque samples (red curve): a) The mean of all ranked taxa frequencies found in each group; b) The mean of ranked taxa frequencies whose weighted average across both groups is larger than 1%. The remaining taxa are pooled into an additional taxon labeled as ‘Pooled taxa’.
Figure 6. Comparison of three metagenomic groups using rank abundance distribution data.
Mean ranked taxa frequencies at class level obtained from subgingival plaque samples (blue curve) and from supragingival plaque samples (red curve): a) The mean of all ranked taxa frequencies found in each group; b) The mean of ranked taxa frequencies whose weighted average across both groups is larger than 1%. The remaining taxa are pooled into an additional taxon labeled as ‘Pooled taxa’.
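The pooling step mentioned in the captions for Figures 3 to 6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the "weighted average" of a taxon is its total read count over all samples in all groups divided by the total number of reads, and the function name, data layout, and example counts are hypothetical.

```python
def pool_rare_taxa(groups, taxa, threshold=0.01, pooled_label="Pooled taxa"):
    """Pool taxa whose overall frequency is at or below `threshold`.

    groups: dict mapping a group name to a list of samples, where each
            sample is a list of read counts aligned with `taxa`.
    Returns (new_taxa, new_groups) with the pooled taxon as the last column.
    """
    totals = [0] * len(taxa)
    for samples in groups.values():
        for counts in samples:
            for j, count in enumerate(counts):
                totals[j] += count
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("no reads in any sample")

    # Assumed weighting: every read counts equally across samples and groups.
    keep = [j for j in range(len(taxa)) if totals[j] / grand_total > threshold]
    keep_set = set(keep)

    new_taxa = [taxa[j] for j in keep] + [pooled_label]
    new_groups = {}
    for name, samples in groups.items():
        pooled_samples = []
        for counts in samples:
            pooled = sum(c for j, c in enumerate(counts) if j not in keep_set)
            pooled_samples.append([counts[j] for j in keep] + [pooled])
        new_groups[name] = pooled_samples
    return new_taxa, new_groups


if __name__ == "__main__":
    # Made-up counts for three taxa in two groups, for illustration only.
    taxa = ["taxon_a", "taxon_b", "taxon_c"]
    groups = {
        "subgingival": [[50, 49, 1], [60, 40, 0]],
        "supragingival": [[30, 70, 0]],
    }
    print(pool_rare_taxa(groups, taxa))
```

Pooling keeps each sample's total read count unchanged, so the pooled table can be passed to the same count-based analyses as the full table.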

Source: PubMed