Microbiome profiling by illumina sequencing of combinatorial sequence-tagged PCR products

Gregory B Gloor, Ruben Hummelen, Jean M Macklaim, Russell J Dickson, Andrew D Fernandes, Roderick MacPhee, Gregor Reid, Gregory B Gloor, Ruben Hummelen, Jean M Macklaim, Russell J Dickson, Andrew D Fernandes, Roderick MacPhee, Gregor Reid

Abstract

We developed a low-cost, high-throughput microbiome profiling method that uses combinatorial sequence tags attached to PCR primers that amplify the rRNA V6 region. Amplified PCR products are sequenced using an Illumina paired-end protocol to generate millions of overlapping reads. Combinatorial sequence tagging can be used to examine hundreds of samples with far fewer primers than is required when sequence tags are incorporated at only a single end. The number of reads generated permitted saturating or near-saturating analysis of samples of the vaginal microbiome. The large number of reads allowed an in-depth analysis of errors, and we found that PCR-induced errors composed the vast majority of non-organism derived species variants, an observation that has significant implications for sequence clustering of similar high-throughput data. We show that the short reads are sufficient to assign organisms to the genus or species level in most cases. We suggest that this method will be useful for the deep sequencing of any short nucleotide region that is taxonomically informative; these include the V3, V5 regions of the bacterial 16S rRNA genes and the eukaryotic V9 region that is gaining popularity for sampling protist diversity.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Expected amplified product size using… — **Figure 1. Expected amplified product size using constant regions flanking eubacterial variable regions.**

Figure 2. Conceptual workflow of the data… — **Figure 2. Conceptual workflow of the data analysis.**
PCR products derived from the eubacterial V6 rRNA region were sequenced on a single paired-end Illumina run. Reads were filtered for quality, overlapped and clustered as outlined in the text. Only reads with 0 mismatches in the overlapping region were used for further analysis.

Figure 3. The proportion of reads in… — **Figure 3. The proportion of reads in the 25 most abundant OTUs clustered at 92% identity as a function of the number of differences with the seed ISU.**
The red line shows the plot for the concatenated primer sequences, and the blue line shows the plot for the OTU containing the most abundant ISU.

Figure 4. Neighbour-joining tree derived from Levenshtein… — **Figure 4. Neighbour-joining tree derived from Levenshtein distance between the 108 most abundant ISU sequences.**
ISUs clustered into OTUs at 95% identity are connected with red branches and ISU sequences clustered at 92% identity are connected with green branches. The seed sequence for each 95% identity OTU cluster is identified by a red dot.

Figure 5. Quality scores for all overlapped… — **Figure 5. Quality scores for all overlapped 120 bp composite reads.**
The scores a log-odds score of the likelihood of error in the base call, higher scores represent lower likelihoods of error . They are expected to decrease with distance from the left or right sequencing primer, and to be highest in the region of perfect overlap because scores are additive.

Figure 6. The frequency of each nucleotide… — **Figure 6. The frequency of each nucleotide observed at each position in the left and right primers derived from the Illumina dataset.**
There are million sequences, and the difference in frequency between the correct and altered nucleotide is relatively constant. Note that the errors are at the same frequency at each end of the primers.

Figure 7. The sequence variation in OTU… — **Figure 7. The sequence variation in OTU 0 and OTU 1.**
The plot shows the number of times that each nucleotide occurred at each position in two example OTUs.

Figure 8. Boxplot summaries of the difference… — **Figure 8. Boxplot summaries of the difference between the frequency of the most in common residue at each position and the frequency of each sequence variant.**
The OTU numbers are given at the top of the graph.

Figure 9. Plot of the reproducibility between… — **Figure 9. Plot of the reproducibility between and within samples.**
The black-filled circles plot within-sample variation, and the red circles plot the between-sample variation for the GTCGC tag. The count of sequences composing OTUs clustered at 95% identity for samples containing the GTCGC tag and the GTCG N-1 tag are in black. This shows the technical replication of the data when amplified from the same sample in the same tube. The open red circles plot the correspondence for between-sample OTU counts.

**Figure 10. An example rarefaction curve.**
The top panel shows rarefaction curves generated for sample 1 by resampling with replacement either all OTUs or ISUs, or OTUs and ISUs where at least 3 reads were observed. The bottom panel shows the rarefaction curve and the 95% and 99% confidence interval for all OTUs in sample 1. Rarefaction curves for all 272 samples are given in Supplementary Figure S2.

Figure 11. Correspondence between Chao1, ACE and… — **Figure 11. Correspondence between Chao1, ACE and rarefaction curves for the 272 samples.**
The X and Y axes show the fraction of species that were found in each sample for the two estimates. Red-filled circles highlight those samples where the limit rarefaction value was less than 0.97.

Figure 12. Plot of the number of… — **Figure 12. Plot of the number of distinct ISU or OTU classes in each sample as a function of the number of reads.**
The number of ISU classes increases with the number of reads, but the number of OTU classes becomes constant above 20000–30000 reads.

**Figure 13. DGGE analysis of selected samples.**
Panel A shows representative PCR amplicons from 3 of 20 clinical samples (Subjects 40, 48 and 89) were electrophoresed on a denaturing gradient gel. Bands were excised, sequenced and identified as in the Materials and Methods. Bands are labeled as follows: le = *Leptotrichia amnionii*; in = *Lactobacillus iners*; ga = *Gardnerella vaginalis*; cr = *Lactobacillus crispatus*; pr = *Prevotella amnii* (also named *P. amniotica*). Panel B shows a Venn diagram of the organisms identified by Illumina sequencing of the V6 rRNA region and by sequencing DGGE bands amplified from the V3 rRNA region.

References

1. Andersson AF, Lindberg M, Jakobsson H, Bäckhed F, Nyrén P, et al. Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS One. 2008;3:e2836.
1. Lauber CL, Hamady M, Knight R, Fierer N. Pyrosequencing-based assessment of soil ph as a predictor of soil bacterial community structure at the continental scale. Appl Environ Microbiol. 2009;75:5111–20.
1. Polymenakou PN, Lampadariou N, Mandalakis M, Tselepides A. Phylogenetic diversity of sediment bacteria from the southern cretan margin, eastern mediterranean sea. Syst Appl Microbiol. 2009 Feb;32:17–26.
1. Hamady M, Knight R. Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res. 2009;19:1141–52.
1. Amaral-Zettler LA, McCliment EA, Ducklow HW, Huse SM. A method for studying protistan diversity using massively parallel sequencing of v9 hypervariable regions of small-subunit ribosomal rna genes. PLoS One. 2009;4:e6372.
1. Schellenberg J, Links MG, Hill JE, Dumonceaux TJ, Peters GA, et al. Pyrosequencing of the chaperonin-60 universal target as a tool for determining microbial community composition. Appl Environ Microbiol. 2009;75:2889–98.
1. Nocker A, Burr M, Camper AK. Genotypic microbial community profiling: a critical technical review. Microb Ecol. 2007;54:276–89.
1. Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J. Metagenomic pyrosequencing and microbial identification. Clin Chem. 2009;55:856–66.
1. Reeder J, Knight R. The ‘rare biosphere’: a reality check. Nat Methods. 2009;6:636–7.
1. Quince C, Lanzén A, Curtis TP, Davenport RJ, Hall N, et al. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods. 2009;6:639–41.
1. Pawlowski J, Lecroq B. Short rdna barcodes for species identification in foraminifera. J Eukaryot Microbiol. 2010;57:197–205.
1. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. Global patterns of 16s rrna diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A 2010
1. Hummelen R, Fernandes AD, Macklaim JM, Dickson RJ, Changalucha J, et al. Deep sequencing of the vaginal microbiota of women with hiv. PLoS ONE. 2010;5:e12078.
1. Huse SM, Dethlefsen L, Huber JA, Mark Welch D, Welch DM, et al. Exploring microbial diversity and taxonomy using ssu rrna hypervariable tag sequencing. PLoS Genet. 2008;4:e1000255.
1. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. The ribosomal database project: improved alignments and new tools for rrna analysis. Nucleic Acids Res. 2009;37:D141–5.
1. Wang Y, Qian PY. Conservative fragments in bacterial 16s rrna genes and primer design for 16s ribosomal dna amplicons in metagenomic studies. PLoS One. 2009;4:e7401.
1. Laurikari V. Tre the free and portable approximate regex matching library. 2010. Available: . Accessed 2010, May 28.
1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–9.
1. Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, et al. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res. 2008;18:1638–42.
1. Whiteford N, Skelly T, Curtis C, Ritchie ME, Löhr A, et al. Swift: primary data analysis for the illumina solexa sequencing platform. Bioinformatics. 2009;25:2194–9.
1. Frank DN. Barcrawl and bartab: software tools for the design and implementation of barcoded primers for highly multiplexed dna sequencing. BMC Bioinformatics. 2009;10:362.
1. Engels WR. Contributing software to the internet: the amplify program. Trends Biochem Sci. 1993;18:448–50.
1. Lahr DJG, Katz LA. Reducing the impact of pcr-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity dna polymerase. Biotechniques. 2009;47:857–66.
1. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, et al. Blast+: architecture and applications. BMC Bioinformatics. 2009;10:421.
1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. Genbank. Nucleic Acids Res. 2010;38:D46–51.
1. Altschul SF, Gish W. Local alignment statistics. Methods Enzymol. 1996;266:460–80.
1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods. 2008;5:621–8.
1. Hurlbert S. The nonconcept of species diversity: a critique and alternative parameters. Ecology. 1971;52:577–586.
1. Smith EP, Van Belle G. Nonparametric estimation of species richness. Biometrics. 1984;40:119–129.
1. Efron B. Nonparametric estimates of standard error: The jackknife, the bootstrap and other methods. Biometrika. 1981;68:589.
1. Colwell R. EstimateS: Statistical estimation of species richness and shared species from samples. 2009. Available: . Accessed 2010 April 6.
1. Chao A. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics. 1984;11:265–270.
1. Chao A, Lee SM. Estimating the number of classes via sample coverage. J American Stat Soc. 1992;87:210–217.
1. Oksanen J, Kindt R, Legendre P, B O. vegan: Community ecology package version 18-3. r package. 2010. Available: . Accessed 2010 May 1.
1. Srinivasan S, Fredricks DN. The human vaginal bacterial biota and bacterial vaginosis. Interdisciplinary perspectives on infectious diseases. 2008;2008:750479.
1. Shi Y, Chen L, Tong J, Xu C. Preliminary characterization of vaginal microbiota in healthy chinese women using cultivation-independent methods. J Obstet Gynaecol Res. 2009;35:525–32.
1. Rodrigue S, Materna AC, Timberlake SC, C BM, Malmstrom RR, et al. Unlocking short read sequencing for metagenomics. PLoS ONE. 2010;5:e11840.
1. Ravel J, Gajer P, Abdo Z, Schneider GM, Koenig SSK, et al. Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci U S A. 2010 doi/10.1073/pnas.100611107.
1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
1. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The sanger fastq file format for sequences with quality scores, and the solexa/illumina fastq variants. Nucleic Acids Res. 2010;38:1767–71.

Source: PubMed

Microbiome profiling by illumina sequencing of combinatorial sequence-tagged PCR products

Abstract

Conflict of interest statement

Figures

References

Sponsors and Collaborators

Medical Conditions

Drug Interventions