A GLM-based latent variable ordination method for microbiome samples

Michael B Sohn, Hongzhe Li

Abstract

Distance-based ordination methods, such as principal coordinates analysis (PCoA), are widely used in the analysis of microbiome data. However, when populations differ in dispersion, these methods can lead to misinterpretation of the compositional differences between samples from different populations. To account for the high sparsity and overdispersion of microbiome data, we propose a GLM-based Ordination Method for Microbiome Samples (GOMMS). The method uses a zero-inflated quasi-Poisson (ZIQP) latent factor model, and an EM algorithm based on the quasi-likelihood is developed to estimate its parameters. GOMMS performs comparably to the distance-based approach when dispersion effects are negligible and consistently better when dispersion effects are strong, where the distance-based approach sometimes yields undesirable results. The estimated latent factors from GOMMS can be used to associate the microbiome community with covariates or outcomes using standard multivariate tests, and such associations can be investigated in future confirmatory experiments. We illustrate the method in simulations and an analysis of microbiome samples from the nasopharynx and oropharynx.
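The dispersion effect described in the abstract can be illustrated with a small simulation. The sketch below is not GOMMS and does not fit a ZIQP model; it only draws zero-inflated negative binomial counts for two populations that share the same taxon means but differ 10-fold in dispersion, then applies classical PCoA to Bray-Curtis distances. The choice of Bray-Curtis, the function names, and all parameter values (30 samples per group, 50 taxa, a zero-inflation probability of 0.3) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_zinb(mean, dispersion, zero_prob, n_samples):
    """Zero-inflated negative binomial counts, shape (n_samples, n_taxa).

    The NB variance is mean + dispersion * mean**2 (assumed parameterisation).
    """
    size = 1.0 / dispersion
    prob = size / (size + mean)  # NumPy's success probability, gives E[count] = mean
    counts = rng.negative_binomial(size, prob, size=(n_samples, mean.size))
    structural_zero = rng.random(counts.shape) < zero_prob
    return np.where(structural_zero, 0, counts).astype(float)

def bray_curtis(x):
    """Pairwise Bray-Curtis dissimilarity between the rows of x."""
    num = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    den = (x[:, None, :] + x[None, :, :]).sum(axis=2)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def pcoa(dist, n_axes=2):
    """Classical PCoA: eigendecomposition of the double-centred squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

# Hypothetical settings: identical means, 10-fold difference in dispersion.
n = 30
taxon_means = rng.uniform(5.0, 100.0, size=50)
low = simulate_zinb(taxon_means, dispersion=0.1, zero_prob=0.3, n_samples=n)
high = simulate_zinb(taxon_means, dispersion=1.0, zero_prob=0.3, n_samples=n)

dist = bray_curtis(np.vstack([low, high]))
coords = pcoa(dist)  # rows 0..n-1: low dispersion, rows n..2n-1: high dispersion

pairs = np.triu_indices(n, k=1)
within_low = dist[:n, :n][pairs].mean()
within_high = dist[n:, n:][pairs].mean()
print(f"mean within-group distance, low dispersion:  {within_low:.3f}")
print(f"mean within-group distance, high dispersion: {within_high:.3f}")
```

Plotting `coords` with one marker style per population shows how the two groups are laid out in the ordination even though their mean compositions are identical; any apparent separation or difference in spread reflects dispersion rather than composition.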

Keywords: 16S sequencing; Factor models; Microbiome; Zero-inflated models.

© 2017, The International Biometric Society.

Figures

Figure 1.
A graphical representation of a factor model for an example involving p orthogonal factors and m observed variables.
Figure 2.
Simulation results where data are generated from NB distributions. (a) There is a difference in dispersions but not in means between the two populations. (b) There is a difference in both means and dispersions between the two populations. The difference in dispersions between the two populations is 10-fold. White circles represent samples in the population with higher dispersion.
Figure 3.
Simulation results where data are generated from ZINB distributions with taxon-specific overdispersion parameters. Four scenarios are presented, where each population is represented by a different color and white circles represent samples in the population with higher dispersion.
Figure 4.
Comparisons of bacterial community compositions from three methods, where circles and triangles represent samples from the nasopharynx and oropharynx, respectively. Left panel: comparison between nasopharynx and oropharynx among the nonsmokers. Dark gray represents samples from the right side of the nasopharynx and oropharynx, and white represents samples from the left side. Right panel: comparison between smokers and nonsmokers. Dark gray represents samples of smokers and white represents samples of nonsmokers.
Figure 5.
Elapsed time since last smoke vs. the estimated factor loadings for oropharyngeal samples of the smokers. τ is Kendall’s rank correlation coefficient and p is its corresponding p-value.

Source: PubMed
