phyloseq: a Bioconductor package for handling and analysis of high-throughput phylogenetic sequence data

Paul J McMurdie, Susan Holmes

Abstract

We present a detailed description of a new Bioconductor package, phyloseq, for the integrated handling and analysis of taxonomically clustered phylogenetic sequencing data in conjunction with related data types. The phyloseq package integrates abundance data, phylogenetic information, and covariates so that exploratory transformations and plots, confirmatory testing, and diagnostic plots can be carried out seamlessly. The package is built on the S4 object-oriented framework of the R language, so that once the data have been input the user can easily transform, plot, and analyze them. We present some examples that highlight the methods and the ease with which existing packages can be leveraged.

Figures

Fig. 1
Classes and inheritance in the phyloseq package. Core data classes are shown with grey fill and rounded corners. The class name and its slots are shown with red- or blue-shaded text, respectively. Inheritance is indicated graphically by arrows. Lines without arrows indicate that a higher-order object contains a slot with the associated class as one of its components.
Fig. 2
Example of a default plot method for summarizing an object of class otuSamTax. Each phyloseq class has a specialized plot method for summarizing its data. In this case, relative abundance is shown quantitatively in a barplot stacked by phylum. Different taxa within a stack are distinguished by alternating shades of gray. The OTU identifier of any taxon that makes up a large enough fraction of the total community, 5% in this case, is labeled on the corresponding bar segment. Several diversity/richness indices are also shown.
Fig. 3
NMDS ordination graphic generated by wunifracMDS. The NMDS coordinates are generated by metaMDS(), with the weighted-UniFrac distance matrix as its argument and two dimensions specified by default. A separate analysis using adonis() also found no compelling association between the weighted-UniFrac distances and the gender (p = 0.29) or diet (p = 0.9) of subjects in the study.
Fig. 4
Redundancy analysis and Constrained Correspondence Analysis. (Left) Redundancy analysis applied to a thresholded, rank-transformed abundance table that had been trimmed such that only the phyla accounting for the top 99% of taxa are included. (Right) Original trimmed abundance table (no transformation or threshold) subjected to Constrained Correspondence Analysis (CCA), constrained on a subject’s diet and gender.
Fig. 5
Enlarged RDA and CCA plots emphasizing the taxa (species) coordinates. Graphics were produced with calcplotrda() or calcplotcca() convenience wrappers in phyloseq, which utilize analysis and graphics tools from the vegan and ggplot2 packages, respectively. Only the phyla accounting for the top 99% of taxa are included.
Fig. 6
Example of phylogenetic sequence data before and after basic clustering with the tipglom() function. (Left) Standard phylogram produced using the default plotting function and no OTU clustering. (Right) Annotated phylogram after OTU clustering with tipglom(). Different symbols next to each tip indicate the different samples in which the OTU was observed. The number inside each symbol indicates how many individuals of that OTU were observed in the corresponding sample.
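
The workflow described in the Fig. 3 caption (an NMDS ordination computed from a sample-by-sample distance matrix, followed by a separate permutation test against a covariate) can be sketched directly with the vegan package. This is a minimal illustration under stated assumptions, not the authors' code: Bray-Curtis dissimilarity stands in for weighted UniFrac (which needs a phylogenetic tree), vegan's built-in dune data stand in for the study's abundance table and covariates, and adonis2(), the current vegan interface for this test, is used in place of adonis().

```r
# Sketch only: Bray-Curtis stands in for weighted UniFrac, and vegan's
# built-in dune data stand in for the study's abundance table and covariates.
library(vegan)

data(dune)      # abundance table: 20 sites (rows) x 30 species (columns)
data(dune.env)  # per-site covariates, including the factor Management

set.seed(1)     # NMDS uses random starts and adonis2 uses permutations

# Pairwise dissimilarity matrix between samples
d <- vegdist(dune, method = "bray")

# NMDS in two dimensions, computed from the precomputed distance matrix
ord <- metaMDS(d, k = 2, trace = FALSE)
coords <- ord$points  # one row of NMDS coordinates per sample

# Separate permutation test: are the distances associated with a covariate?
fit <- adonis2(d ~ Management, data = dune.env, permutations = 199)
p_value <- fit[1, "Pr(>F)"]
```

The coordinates in coords can then be plotted and colored by any covariate, while p_value plays the role of the p-values reported in the caption.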

Source: PubMed
