Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads

Andrea K Bartram, Michael D J Lynch, Jennifer C Stearns, Gabriel Moreno-Hagelsieb, Josh D Neufeld

Abstract

Microbial communities host unparalleled taxonomic diversity. Adequate characterization of environmental and host-associated samples remains a challenge for microbiologists, despite the advent of 16S rRNA gene sequencing. To increase the depth of sampling for diverse bacterial communities, we developed a method for sequencing and assembling millions of paired-end reads from the 16S rRNA gene (spanning the V3 region; ∼200 nucleotides) by using an Illumina Genome Analyzer. To confirm reproducibility and to identify a suitable computational pipeline for data analysis, sequence libraries were prepared in duplicate for both a defined mixture of DNAs from known cultured bacterial isolates (>1 million postassembly sequences) and an Arctic tundra soil sample (>6 million postassembly sequences). The Illumina 16S rRNA gene libraries represent a substantial increase in the number of sequences over all extant next-generation sequencing approaches (e.g., 454 pyrosequencing), while the assembly of paired-end 125-base reads offers a methodological advantage by incorporating an initial quality control step for each 16S rRNA gene sequence. This method incorporates indexed primers to enable the characterization of multiple microbial communities in a single flow cell lane, may be modified readily to target other variable regions or genes, and demonstrates unprecedented and economical access to DNAs from organisms that exist at low relative abundances.
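
To illustrate why assembling a read pair can act as a quality control step, the following is a minimal sketch rather than the authors' pipeline. It assumes the forward read and the reverse-complemented reverse read must agree exactly across their overlap, and that the amplicon is longer than a single read. The function names and the `min_overlap` default are hypothetical.

```python
# Complement table for the four bases plus the ambiguity code N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assemble_pair(forward: str, reverse: str, min_overlap: int = 10):
    """Merge one read pair into a single sequence, or return None.

    Assumption (not from the source): the overlap must match exactly,
    and the longest matching overlap is preferred.
    """
    rc = reverse_complement(reverse)
    max_overlap = min(len(forward), len(rc))
    # Try the longest candidate overlap first and shrink toward min_overlap.
    for overlap in range(max_overlap, min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    # No acceptable overlap: the pair is rejected.
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGTTACGATCGA"
    fwd = fragment[:20]
    rev = reverse_complement(fragment[10:])
    print(assemble_pair(fwd, rev) == fragment)
```

Pairs that return `None` would simply be discarded. A production assembler would typically also weigh per-base quality scores in the overlap, which this sketch omits.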

Figures

Fig. 1.
Overview of the Illumina 16S rRNA gene sequencing method and generated library data. (A) The schema indicates a PCR (20 cycles) and gel purification of ∼330-base PCR products, including the conserved 16S rRNA gene primer-binding region. (B) Informatics pipeline for generating clusters and taxonomic affiliations. (C) Resulting taxonomic affiliations for the replicate control libraries (C1 and C2) and the Sanger sequencing-based library (CL). (D) Taxonomic affiliations for the Alert tundra duplicate libraries (AT1 and AT2) and the Sanger sequencing-based library (ATS).
Fig. 2.
Quality (Q) scores for all 125-base sequence reads. The Q score is an integer mapping of P, the probability that the corresponding base call is incorrect, with higher Q scores indicating lower error rates. The magnitude of sequence overlap for each assembled read was characterized, and the mean (μ) and standard deviation (±σ) were plotted relative to sequence length. The region of potential read overlap as presented does not explicitly calculate the additive Q score at each position, as the range of overlap varied due to the large range of V3 lengths.
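
As a small illustration of the mapping between P and Q, the sketch below assumes the standard Phred definition, Q = -10 log10(P). The caption does not state which scoring scheme was used, so this is an assumption, and the function names are hypothetical.

```python
import math


def phred_q(p_error: float) -> int:
    """Map a base-call error probability to an integer Phred-style Q score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return int(round(-10.0 * math.log10(p_error)))


def error_probability(q: int) -> float:
    """Invert the mapping: recover the error probability from a Q score."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(p, phred_q(p))
```

Under this definition, each increase of 10 in Q corresponds to a tenfold drop in the probability of an incorrect base call.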
Fig. 3.
Rank-abundance curves for duplicate control libraries (A) and Alert Arctic tundra libraries (B). The data shown are the raw data and also the data clustered using CD-HIT at a cutoff of 97%. Note that the Alert Illumina library was considered as separate replicates (AT1 and AT2) and also as a composite library (ATCL), which represents the combined replicates.
Fig. 4.
Effect of library size on phylotype coverage. Randomly subsampled libraries were drawn in triplicate from combined AT libraries and used to calculate Good's coverage estimates. Averages for triplicates were plotted with standard deviations.
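
The subsampling described in this caption can be sketched as follows. The sketch assumes the usual form of Good's coverage, C = 1 - n1/N, where n1 is the number of phylotypes seen exactly once and N is the number of sequences sampled. The input format (one phylotype label per sequence), the function names, and the fixed random seed are illustrative assumptions, not details from the source.

```python
import random
from collections import Counter
from statistics import mean, stdev


def goods_coverage(labels):
    """Good's coverage, 1 - (singleton phylotypes / total sequences)."""
    n = len(labels)
    if n == 0:
        raise ValueError("at least one sequence is required")
    counts = Counter(labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n


def subsampled_coverage(labels, size, replicates=3, seed=0):
    """Mean and standard deviation of coverage over random subsamples.

    Draws `replicates` subsamples of `size` sequences without replacement.
    """
    rng = random.Random(seed)
    values = [goods_coverage(rng.sample(labels, size)) for _ in range(replicates)]
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Hypothetical library: phylotype "a" dominates, "c" and "d" are rare.
    library = ["a"] * 50 + ["b"] * 10 + ["c", "d"]
    for size in (10, 30, 62):
        print(size, subsampled_coverage(library, size))
```

Plotting the mean coverage against subsample size, with the standard deviation as error bars, gives a curve of the kind the caption describes.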
Fig. 5.
Taxonomic affiliations at the levels of phylum, class, and order for consecutive abundance ranks of sequence data clustered at 97% with CD-HIT. Predominant taxa are represented in the bottom row, and singletons are at the top for each taxonomic level. Full details of RDP affiliations are summarized in Tables S3, S4, and S5 in the supplemental material.

Source: PubMed
