Strains, functions and dynamics in the expanded Human Microbiome Project

Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A Brantley Hall, Arthur Brady, Heather H Creasy, Carrie McCracken, Michelle G Giglio, Daniel McDonald, Eric A Franzosa, Rob Knight, Owen White, Curtis Huttenhower, Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A Brantley Hall, Arthur Brady, Heather H Creasy, Carrie McCracken, Michelle G Giglio, Daniel McDonald, Eric A Franzosa, Rob Knight, Owen White, Curtis Huttenhower

Abstract

The characterization of baseline microbial and functional diversity in the human microbiome has enabled studies of microbiome-related disease, diversity, biogeography, and molecular function. The National Institutes of Health Human Microbiome Project has provided one of the broadest such characterizations so far. Here we introduce a second wave of data from the study, comprising 1,631 new metagenomes (2,355 total) targeting diverse body sites with multiple time points in 265 individuals. We applied updated profiling and assembly methods to provide new characterizations of microbiome personalization. Strain identification revealed subspecies clades specific to body sites; it also quantified species with phylogenetic diversity under-represented in isolate genomes. Body-wide functional profiling classified pathways into universal, human-enriched, and body site-enriched subsets. Finally, temporal analysis decomposed microbial variation into rapidly variable, moderately variable, and stable subsets. This study furthers our knowledge of baseline human microbial diversity and enables an understanding of personalized microbiome function and dynamics.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. Personalization, niche association, and reference…
Figure 1. Personalization, niche association, and reference genome coverage in strain-level metagenomic profiles.
a, Mean phylogenetic divergences between strains of species with sufficient coverage at each targeted body site (minimum 2 strain pairs). b, Individuals tended to retain personalized strains, as visualized by a principal coordinates analysis (PCoA) plot for Actinomyces sp. oral taxon 448, in which lines connect samples from the same individual. c, Quantification of niche association (Methods; only species with sufficient coverage in at least five samples at two or more body sites). Higher values indicate greater phylogenetic separation between body sites. d, PCoA showing niche association of Haemophilus parainfluenzae, showing subspecies specialization to three different body sites. e, PCoA for Eubacterium siraeum. f, Coverage of human-associated strains by the current 16,903 reference genome set (Methods). Top 25 species by mean relative abundance when present (>0.1% relative abundance) are shown (minimum prevalence of 50 samples). Sample counts in Supplementary Table 2, and distance matrices are available from Extended Data Table 1b. PowerPoint slide
Figure 2. Core and distinguishing functions of…
Figure 2. Core and distinguishing functions of human body site microbiomes.
a, In total, 28 metabolic pathways were core at all 6 major body sites (‘supercore’ pathways). An, anterior nares; Bm, buccal mucosa; Pf, posterior fornix; S, stool; Sp, supragingival plaque; Td, tongue dorsum. Two supercore pathways and b, 17 additional pathways were core in multiple body areas and enriched among human-associated taxa (‘human microbiome-enriched’ pathways). c, 21 pathways were considerably more abundant at 1 body site than at sites from all other body areas (‘body site-enriched’ pathways). Heat map values reflect the first quartile of relative abundance (heat maps are expanded in Extended Data Fig. 3). In pathway bar plots, total (community) abundance is log-scaled, and the contributions of the top seven genera are proportionally scaled within the total. ‘Other’ encompasses contributions from additional, known genera; ‘unclassified’ encompasses contributions of unknown taxonomy. PowerPoint slide
Figure 3. Temporal dynamics of individual species…
Figure 3. Temporal dynamics of individual species and microbial pathways at each targeted body site.
a, Jaccard similarity is maximal between technical replicates and decreases with time, although within-subject similarity always exceeds between-subject similarity. b, Gaussian process decomposition of the variance in species abundances (each point is one species; filtering criteria in Methods) into three biologically relevant components based on their characteristic timescales (Methods). Technical noise was estimated (Supplementary Table 5) but not visualized. Species with high inference uncertainty (s.e.m. on the ternary diagram > 0.2) are grey and the inference is biased towards the centre of the diagrams (Methods). Labelled version in Extended Data Fig. 5. c, Same as b, but for abundances of all core pathways. d, Illustrative time series showing dynamics at different locations within the ternary plots (Extended Data Fig. 4d–f for real examples). Sample counts in Supplementary Table 1. PowerPoint slide
Figure 4. Assembly and annotation of body-wide…
Figure 4. Assembly and annotation of body-wide human microbiomes.
ad, Tukey boxplots of total assembly size, maximum and average contig lengths, and gene counts for single and co-assemblies (sample sizes in Supplementary Table 1). e, Rarefaction curves of gene families (open reading frame (ORF) clusters at 90% sequence similarity) from predicted genes generated from single assemblies for the targeted body sites (points), with a power-law fit (lines). The size of the HMP1 WMS dataset at each body site is also shown (open circles). The rarefaction trajectory was robust to changes in the sequence similarity threshold (for the 188 posterior fornix samples, the number of gene families ranged between only 1,131,796 to 1,271,891 for similarities 70–95%). Colouring is as in axis labels of a. Sample counts in Supplementary Table 1. PowerPoint slide
Extended Data Figure 1. Extended body-wide metagenomic…
Extended Data Figure 1. Extended body-wide metagenomic taxonomic profiles in HMP1-II.
a, The combined HMP1-II datasets include a total of 2,355 metagenomes (724 previously published and 1,631 new, including 252 technical replicates). These span the project’s six targeted body sites (anterior nares, buccal mucosa, supragingival plaque, tongue dorsum, stool, and posterior fornix) in addition to at least 20 samples each from 3 additional sites, of the 18 total sampled sites: retroauricular crease, palatine tonsils, and subgingival plaque. Metagenomes are now available for at least one body site for a total of 265 individuals. b, PCoA using Bray–Curtis distances among all microbes at the species level. c, Relative abundances of the most prevalent and abundant microbes (bacterial, viral, eukaryotic, and archaeal) among all body sites, as profiled by MetaPhlAn2. Prevalent eukaryotic microbes are shown at the genus level. d, Taxonomic profiles do not vary more between sequencing centres, batches, or clinical centres than they do among individuals within body sites. Ordinations show Bray–Curtis principal coordinates of species-level abundances at each body site. Within-site ecological structure is as expected, with no divergence associated with technical variables along the first two ordination axes.
Extended Data Figure 2. Geographic, temporal, and…
Extended Data Figure 2. Geographic, temporal, and biogeographic strain variation.
a, Mean distance (Kimura two-parameter) among strains from subjects either within or between three-digit zip codes (the finest degree of geographic information available). Data and sample sizes are in Supplementary Table 2. b, Mean strain divergences between different visits for the same subject and body site compared to the mean distance between the same visits for the same subject and body site (technical replicates) for each species. cu, PCoA plots based on the Kimura two-parameter distance are shown for Escherichia coli (c), Actinomyces johnsonii (d), and all species (that is, those shown Fig. 1b ; eu), sorted in descending order of their niche-association score (Methods). Distance matrices used to generate these PCoAs are publicly available (Extended Data Table 1b).
Extended Data Figure 3. Core and distinguishing…
Extended Data Figure 3. Core and distinguishing functions of human body site microbiomes.
This figure extends Fig. 3, with additional details and examples. a, 28 metabolic pathways were core (>75% prevalent) at all major body sites. We refer to these as ‘supercore’ pathways. b, Pathways core to greater numbers of body sites tended to have broader taxonomic ranges, with supercore pathways among the most broadly distributed (Tukey boxplots). c, 19 pathways (including two supercore pathways, starred in a) were core in multiple body areas and specifically enriched among taxa inhabiting the human microbiome (annotated to <10% of non-human-associated genera). Human-microbiome-enriched pathways include specific MetaCyc-defined variants of more broadly defined or distributed processes, for example, peptidoglycan biosynthesis (PWY-6471). d, ‘Site-enriched’ pathways are considerably more abundant at one body site than at sites from all other body areas. Black dots indicate the site where each site-enriched pathway achieved its peak abundance. Heat map values reflect the first quartile of relative abundance in a particular body site (coordinated with the percentile cutoff for a core pathway). eg, Additional examples of the three pathway classes enumerated in a, c, and d, respectively. In each example, total (community) abundance is log-scaled, and the contributions of the top seven genera are proportionally scaled within the community total. ‘Other’ encompasses pathway contributions from genera outside of the top seven, and ‘unclassified’ encompasses pathway contributions from unidentified members of the community.
Extended Data Figure 4. Sampling interval distribution,…
Extended Data Figure 4. Sampling interval distribution, parameter fits for simulated samples, and examples of microbial species abundance dynamics and corresponding Gaussian process fits.
a, Distributions of differences in time between samples at each targeted body site. Technical replicates are shown as Δt = 0. b, Parameter fits for simulated samples with U = 0, B = 0, T = 0.95, N = 0.05, and varying l (see Methods). Simulated samples were drawn with the real sample distribution and count from each site, to show how limitations in sampling at certain sites alter the fidelity of the fits. c, Parameter fits for five simulated samples with each of the three pure components (coloured red, green and blue), as well as all even mixtures of pairs of them (for example, yellow points are even mixtures of U and T), and even mixtures of all three (black), for differing levels of technical noise (N) and fixed l = 0.5. Uncertain inferences are more desaturated. df, Three examples of taxonomic profiles fit with the Gaussian process model are shown on plots designed to allow a direct comparison between the data and the fit Gaussian process, and allow the different dynamics to be visualized despite the limit of only up to three time points per person. Each example was chosen as an exemplar of one of the three non-technical components in the model. Insets denote confidence deciles of the MCMC samples. The abundance of Fusobacterium periodonticum in the tongue dorsum shows strong time-varying behaviour (d), Bacteroides stercoris in stool shows mostly inter-individual differences (e), and Gemella haemolysans in the buccal mucosa is dominated by biological noise (f). The plots show the absolute difference in arcsine square-root transformed microbial abundance (|Δx|) between pairs of samples from the same person against the difference in time between samples (points). A Gaussian-smoothed estimate of the standard deviation of the points is also shown (blue line, bandwidth three months), along with the expected difference from the fit Gaussian process (red line). The standard deviation of differences between technical replicates (points with Δt = 0 months) is also shown as the line stub at the origin, directly visualizing the level of technical noise. Biological noise is visible here as the difference between technical noise and the variance of the remaining points extrapolated to the origin. The time-varying component is visible as a gradual increase in the variance of the differences over time (that is, gradually increasing red and blue lines). Finally, inter-individual differences are visible by comparing the limit of the variance of the data with the variance of differences between subjects (green line).
Extended Data Figure 5. Gaussian process decomposition…
Extended Data Figure 5. Gaussian process decomposition of temporal variance for metagenomic species abundances.
The posterior mean of the decomposition of variance (Methods) is shown for each species (Supplementary Table 5), coloured by phylum. Uncertainty in the estimate was assessed by the square root of the mean squared distance on the ternary plot of MCMC samples from the posterior mean, and is codified with larger points indicating more certain estimates.
Extended Data Figure 6. Assembly annotation specificity…
Extended Data Figure 6. Assembly annotation specificity for single and co-assemblies.
a, Tukey boxplot of the percentage of proteins in each functional specificity category. b, An example Venn diagram for one set of single-sample supragingival plaque assemblies and their combined co-assembly (co-assembly on bottom left), showing counts of shared genes (computed via strict alignment) between all combinations of assemblies; the co-assembly by itself contains 96.9% of all detected genes. c, Tukey boxplot of the number of Gene Ontology (GO) terms (generated using a GO Slim with around 1,700 terms) shared between single and co-assemblies, unique to the co-assembly, or unique to one of the single assemblies, generated from a random selection of 250 assemblies across 6 body sites. Co-assemblies capture GO terms that are not in individual assemblies.
Extended Data Figure 7. Sequencing statistics and…
Extended Data Figure 7. Sequencing statistics and assembly quality assessments.
a, Tukey boxplots of total raw reads per sample among body sites upon retrieval from the SRA. b, Percentages of human reads marked by BMTagger per body site. c, Percentages of non-human (bacterial) reads aligned to assemblies showing assembly effectiveness (Methods; read and contig mapping to assemblies and reference genomes). d, Comparison of the number of unique Pfam domains detected in each sample by HUMAnN2 and in the assemblies, coloured by body site. Pfam domains in HUMAnN2 were considered ‘detected’ if UniRef50 sequences annotated with the domain were present in a sample at >10 reads per kilobase (around 1× coverage). Pfam domains were detected in the assemblies if they were found on a single contig by Attributor (Methods). e, Number of Pfam domains that were detected in at least 75% of samples (core’ domains) by each method, for each targeted body site. Pfams domains are stratified by unknown function.
Extended Data Figure 8. Metagenomic features abundances…
Extended Data Figure 8. Metagenomic features abundances significantly associated with host phenotype.
a, b, Significant associations of nontrivial effect size (FDR < 0.1 and |β| > 0.01) in a multivariate linear model (significance and coefficients in Supplementary Table 5) between taxon abundances (a) and pathway abundances (b). All detected associations are independent of all other metadata, including whether the subject was breastfed, the subject’s broad dietary characterization, temperature, introitus pH, posterior fornix pH, gender, age, ethnicity, study day processed, sequencing centre, clinical centre, number of quality bases, percentage of human reads, systolic blood pressure, diastolic blood pressure, pulse, whether the subject had given birth, HMP1/HMP1-II, and BMI (group sizes in Extended Data Table 1; see Methods). Non-significant associations here should not be considered evidence of no association.
Extended Data Figure 9. Updated associations in…
Extended Data Figure 9. Updated associations in HMP1-II.
a, HMP cohort subjects reported whether they were breastfed as infants. Remarkably, overall phylum Firmicutes abundances were lower even during adulthood (subjects’ current ages were 18–40) in individuals who had been historically breastfed. b, c, Differences associated with infant breastfeeding persisted in other clades and body sites, for example, oral Neisseria (b), although age-linked associations differed among taxa (for example, overall oral Neisseria decrease with age) (c). df, Examples of associations significant in the original HMP1 metagenome set that were retained in the larger HMP1-II dataset include: d, Bacteroides vulgatus in stool is significantly more abundant in Asian people compared to those of other ethnicities. e, Lactobacillus crispatus in the posterior fornix is negatively associated with vaginal pH. f, Bacteroides are significantly more abundant in individuals who have been breastfed as infants. Boxplot whiskers are defined by Tukey’s method.

References

    1. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature486, 207–214 (2012)
    1. Lloyd-Price J, Abu-Ali G, Huttenhower C. The healthy human microbiome. Genome Med. 2016;8:51. doi: 10.1186/s13073-016-0307-y.
    1. Gensollen T, Iyer SS, Kasper D L, Blumberg RS. How colonization by microbiota in early life shapes the immune system. Science. 2016;352:539–544. doi: 10.1126/science.aad9378.
    1. Honda K, Littman DR. The microbiota in adaptive immune homeostasis and disease. Nature. 2016;535:75–84. doi: 10.1038/nature18848.
    1. Qin J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65. doi: 10.1038/nature08821.
    1. Li J, et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 2014;32:834–841. doi: 10.1038/nbt.2942.
    1. Beaumont M, et al. Heritable components of the human fecal microbiome are associated with visceral fat. Genome Biol. 2016;17:189. doi: 10.1186/s13059-016-1052-7.
    1. Falony G, et al. Population-level analysis of gut microbiome variation. Science. 2016;352:560–564. doi: 10.1126/science.aad3503.
    1. Zhernakova A, et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352:565–569. doi: 10.1126/science.aad3369.
    1. Si J, You H J, Yu J, Sung J, Ko G. Prevotella as a hub for vaginal microbiota under the influence of host genetics and their association with obesity. Cell Host Microbe. 2017;21:97–105. doi: 10.1016/j.chom.2016.11.010.
    1. Gonzalez, A. et al. Migraines are correlated with higher levels of nitrate-, nitrite-, and nitric oxide-reducing oral microbes in the American Gut Project cohort. mSystems10.1128/mSystems.00105-16 (2016)
    1. Oh J, et al. Biogeography and individuality shape function in the human skin metagenome. Nature. 2014;514:59–64. doi: 10.1038/nature13786.
    1. The Human Microbiome Project Consortium. A framework for human microbiome research. Nature486, 215–221 (2012)
    1. Truong DT, Tett A, Pasolli E, Huttenhower C, Segata N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 2017;27:626–638. doi: 10.1101/gr.216242.116.
    1. Schloissnig S, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493:45–50. doi: 10.1038/nature11711.
    1. Luo C, et al. ConStrains identifies microbial strains in metagenomic datasets. Nat. Biotechnol. 2015;33:1045–1052. doi: 10.1038/nbt.3319.
    1. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980;16:111–120. doi: 10.1007/BF01731581.
    1. Franzosa EA, et al. Identifying personal microbiomes using metagenomic codes. Proc. Natl Acad. Sci. USA. 2015;112:E2930–E2938. doi: 10.1073/pnas.1423854112.
    1. Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011;39:D19–D21. doi: 10.1093/nar/gkq1019.
    1. Truong DT, et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods. 2015;12:902–903. doi: 10.1038/nmeth.3589.
    1. Hoffmann C, et al. Archaea and fungi of the human gut microbiome: correlations with diet and bacterial residents. PLoS ONE. 2013;8:e66019. doi: 10.1371/journal.pone.0066019.
    1. Schwiertz A, et al. Microbiota and SCFA in lean and overweight healthy subjects. Obesity. 2010;18:190–195. doi: 10.1038/oby.2009.167.
    1. Pride DT, et al. Evidence of a robust resident bacteriophage population revealed through analysis of the human salivary virome. ISME J. 2012;6:915–926. doi: 10.1038/ismej.2011.169.
    1. Abubucker S, et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput. Biol. 2012;8:e1002358. doi: 10.1371/journal.pcbi.1002358.
    1. Caspi R, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 2014;42:D459–D471. doi: 10.1093/nar/gkt1103.
    1. Leonardi R, Zhang YM, Rock C O, Jackowski S. Coenzyme A: back in action. Prog. Lipid Res. 2005;44:125–153. doi: 10.1016/j.plipres.2005.04.001.
    1. Khakh BS, Burnstock G. The double life of ATP. Sci. Am. 2009;301:84–92. doi: 10.1038/scientificamerican1209-84.
    1. Morkbak AL, Poulsen SS, Nexo E. Haptocorrin in humans. Clin. Chem. Lab. Med. 2007;45:1751–1759. doi: 10.1515/CCLM.2007.343.
    1. Roy CC, Kien CL, Bouthillier L, Levy E. Short-chain fatty acids: ready for prime time? Nutr. Clin. Pract. 2006;21:351–366. doi: 10.1177/0115426506021004351.
    1. Schreiber F, et al. Denitrification in human dental plaque. BMC Biol. 2010;8:24. doi: 10.1186/1741-7007-8-24.
    1. Flint HJ, Scott KP, Duncan S H, Louis P, Forano E. Microbial degradation of complex carbohydrates in the gut. Gut Microbes. 2012;3:289–306. doi: 10.4161/gmic.19897.
    1. Faith JJ, et al. The long-term stability of the human gut microbiota. Science. 2013;341:1237439. doi: 10.1126/science.1237439.
    1. Flores GE, et al. Temporal variability is a personalized feature of the human microbiome. Genome Biol. 2014;15:531. doi: 10.1186/s13059-014-0531-y.
    1. Ding T, Schloss PD. Dynamics and associations of microbial community types across the human body. Nature. 2014;509:357–360. doi: 10.1038/nature13178.
    1. Turnbaugh PJ, et al. The human microbiome project. Nature. 2007;449:804–810. doi: 10.1038/nature06244.
    1. Shafquat A, Joice R, Simmons S L, Huttenhower C. Functional and phylogenetic assembly of microbial communities in the human microbiome. Trends Microbiol. 2014;22:261–266. doi: 10.1016/j.tim.2014.01.011.
    1. Gajer P, et al. Temporal dynamics of the human vaginal microbiota. Sci. Transl. Med. 2012;4:132ra52. doi: 10.1126/scitranslmed.3003605.
    1. Peng Y, Leung HC, Yiu S M, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–1428. doi: 10.1093/bioinformatics/bts174.
    1. Finn RD, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44:D279–D285. doi: 10.1093/nar/gkv1344.
    1. Aagaard K, et al. The Human Microbiome Project strategy for comprehensive sampling of the human microbiome and why it matters. FASEB J. 2013;27:1012–1022. doi: 10.1096/fj.12-220806.
    1. Caporaso JG, et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods. 2010;7:335–336. doi: 10.1038/nmeth.f.303.
    1. Huang K, et al. MetaRef: a pan-genomic database for comparative and community microbial genomics. Nucleic Acids Res. 2014;42:D617–D624. doi: 10.1093/nar/gkt1078.
    1. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. doi: 10.1093/bioinformatics/btu739.
    1. Caspi R, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 2016;44:D471–D480. doi: 10.1093/nar/gkv1164.
    1. Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2015;43:D261–D269. doi: 10.1093/nar/gku1223.
    1. Baba T, et al. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol. Syst. Biol. 2006;2(2006):0008.
    1. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res.45, D158–D169 (2017)
    1. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992;7:457–472. doi: 10.1214/ss/1177011136.
    1. Morgan XC, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13:R79. doi: 10.1186/gb-2012-13-9-r79.
    1. Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155. doi: 10.1093/nar/gks678.
    1. Luo R, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2015;1:18. doi: 10.1186/2047-217X-1-18.
    1. Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13:R122. doi: 10.1186/gb-2012-13-12-r122.
    1. Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 2012;19:455–477. doi: 10.1089/cmb.2012.0021.
    1. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. doi: 10.1101/gr.074492.107.
    1. Pell J, et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl Acad. Sci. USA. 2012;109:13272–13277. doi: 10.1073/pnas.1121464109.
    1. Mende DR, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE. 2012;7:e31386. doi: 10.1371/journal.pone.0031386.
    1. Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010;38:e132. doi: 10.1093/nar/gkq275.
    1. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288. doi: 10.1093/bioinformatics/btm098.
    1. Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28:125–126. doi: 10.1093/bioinformatics/btr595.
    1. Finn RD, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–D230. doi: 10.1093/nar/gkt1223.
    1. Haft DH, et al. TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res. 2013;41:D387–D395. doi: 10.1093/nar/gks1234.
    1. Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195.
    1. Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1998;6:175–182.
    1. Orvis J, et al. Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics. 2010;26:1488–1492. doi: 10.1093/bioinformatics/btq167.
    1. Galens K, et al. The IGS standard operating procedure for automated prokaryotic annotation. Stand. Genomic Sci. 2011;4:244–251. doi: 10.4056/sigs.1223234.
    1. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461.
    1. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359.
    1. Roager HM, et al. Colonic transit time is related to bacterial metabolism and mucosal turnover in the gut. Nat. Microbiol. 2016;1:16093. doi: 10.1038/nmicrobiol.2016.93.

Source: PubMed

3
Se inscrever