DADA2: High-resolution sample inference from Illumina amplicon data

Benjamin J Callahan, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, Susan P Holmes

Abstract

We present the open-source software package DADA2 for modeling and correcting Illumina-sequenced amplicon errors (https://github.com/benjjneb/dada2). DADA2 infers sample sequences exactly and resolves differences of as little as one nucleotide. In several mock communities, DADA2 identified more real variants and output fewer spurious sequences than other methods. We applied DADA2 to vaginal samples from a cohort of pregnant women, revealing a diversity of previously undetected Lactobacillus crispatus variants.

Figures

Figure 1. Sequence variants inferred by DADA2 compared to the OTUs constructed by UPARSE
The merged sequences output by DADA2 are plotted for three Illumina amplicon datasets: (a) Balanced, (b) HMP, and (c) Extreme. Frequency is plotted on the y-axis, and Hamming distance to the closest more-abundant sequence on the x-axis. Shapes represent accuracy (Methods). When variants are well separated from other members of the community, the sequence variants inferred by DADA2 largely coincide with the OTUs output by UPARSE (black). However, DADA2 resolves additional variation (blue), especially within UPARSE's OTU radius (dashed line), while outputting fewer spurious sequences (One Off and Other).
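The x-axis quantity in Figure 1, the Hamming distance from each sequence to the closest more-abundant sequence, can be illustrated with a minimal Python sketch. This is not DADA2's code: the function names and the toy abundance table are hypothetical, and the sketch assumes that only sequences of equal length are compared, which the caption does not specify.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def distance_to_closest_more_abundant(
    abundance: Dict[str, int],
) -> Dict[str, Optional[int]]:
    """For each sequence, the smallest Hamming distance to any strictly
    more abundant sequence, or None if no such sequence exists.

    Assumption (not from the paper): sequences of a different length
    are skipped rather than aligned.
    """
    result: Dict[str, Optional[int]] = {}
    for seq, count in abundance.items():
        distances = [
            hamming(seq, other)
            for other, other_count in abundance.items()
            if other_count > count and len(other) == len(seq)
        ]
        result[seq] = min(distances) if distances else None
    return result


# Hypothetical toy abundance table (sequence -> read count).
toy = {
    "ACGTACGT": 1000,
    "ACGTACGA": 120,
    "ACGAACGA": 15,
    "TTGTACGT": 15,
}

for seq, dist in distance_to_closest_more_abundant(toy).items():
    print(seq, toy[seq], dist)
```

In this toy table the most abundant sequence has no more-abundant neighbor, so its distance is undefined (None), and sequences with tied abundance are not compared against each other.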
Figure 2. Lactobacillus crispatus sequence variants in the human vaginal community during pregnancy
DADA2 identified six Lactobacillus crispatus 16S rRNA sequence variants that were present in multiple samples and accounted for a significant fraction of all reads (L1: 19.7%, L2: 11.1%, L3: 6.5%, L4: 3.1%, L5: 1.3%, L6: 0.4%). (a) The frequency of L1–L6 in each sample. Black bars at the bottom link samples from the same subject. The frequency of (b) L1 vs. L2, and (c) L1 vs. L3, by sample. The dashed line indicates a total frequency of 1.


Source: PubMed
