Fast identification and removal of sequence contamination from genomic and metagenomic datasets

Robert Schmieder, Robert Edwards, Robert Schmieder, Robert Edwards

Abstract

High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants is a necessary and required step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available as standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we conducted an analysis of possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) metagenomes with as high as 64% contaminating sequences. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Alignment sensitivity of BWA-SW for…
Figure 1. Alignment sensitivity of BWA-SW for human sequences.
Query coverage and alignment identity values ranged from 90% to 100%. The sensitivity shows how many sequences could be aligned back to the reference. The simulated datasets contained 28,612,955 reads for 200 bp, 11,444,886 reads for 500 bp, and 5,722,210 reads for 1,000 bp.
Figure 2. Repeats causing alignment problems for…
Figure 2. Repeats causing alignment problems for BWA-SW.
The query coverage was set to 95%, with identity set to 99%, 97% and 94% for error rates of 0%, 2% and 5%, respectively. The numbers above the bars show the number of unaligned sequences of each category for the given thresholds. The values shown in parenthesis represent the percentage of unaligned sequences. The simulated datasets contained 28,612,955 reads for 200 bp, 11,444,886 reads for 500 bp, and 5,722,210 reads for 1,000 bp.
Figure 3. DeconSeq web interface.
Figure 3. DeconSeq web interface.
Screenshots of the DeconSeq web interface at different steps of the data processing. The user can either input a data ID to access already processed data (A) or input a new sequence file and select the database (B). After processing the data, the results are shown including the input information (C), Coverage vs. Identity plots for “remove” databases (D) and “retain” databases (E), classification of input data into “clean”, “contamination”, and “both” (F), and download options (G).
Figure 4. Coverage vs. Identity plots generated…
Figure 4. Coverage vs. Identity plots generated by DeconSeq.
The plots show the number of matching reads for different query coverage and alignment identity values. The size of each dot in the plots is defined by the number of matching reads with exactly this coverage and identity value. Red dots represent matching reads against the “remove” databases and blue dots against “retain” databases. The column and row sums at the top and right of each plot allow an easier identification of the number of sequences that match for a particular threshold value. The plots for matching reads against the “remove” databases do not show matching reads that additionally have a match against the “retain” databases (A). Results for reads matching against both databases are shown in a second plot where dots for a single read are connected by lines. If the match against the “remove” database is more similar, then the line is colored red, otherwise blue. In B, for example, the majority of sequences is more similar to the “retain” databases and in C the majority is more similar to the “remove” databases.
Figure 5. Result of human DNA contamination…
Figure 5. Result of human DNA contamination identified in 202 metagenomes.
All seven human genome sequences were used as “remove” databases and depending on the metagenome type (viral or microbial), the viral or bacterial genomes were selected as “retain” database. 145 (72%) of the metagenomes contained at least one possible contamination sequence using a threshold of 95% query coverage and 94% alignment identity.
Figure 6. Flowchart of DeconSeq for the…
Figure 6. Flowchart of DeconSeq for the identification of possible contaminant sequences.

References

    1. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. Comparative metagenomics of microbial communities. Science. 2005;308:554–557.
    1. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician's guide to metagenomics. Microbiology and Molecular Biology Reviews. 2008;72:557–578.
    1. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452:629–632.
    1. Rosen GL, Sokhansanj BA, Polikar R, Bruns MA, Russell J, et al. Signal processing for metagenomics: extracting information from the soup. Current Genomics. 2009;10:493–510.
    1. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
    1. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Computational Biology. 2010;6:e1000667.
    1. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, et al. The human microbiome project. Nature. 2007;449:804–810.
    1. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, et al. The NIH human microbiome project. Genome Research. 2009;19:2317–2323.
    1. Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nature Methods. 2009;6:S6–S12.
    1. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46.
    1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
    1. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. The sequence of the human genome. Science. 2001;291:1304–1351.
    1. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. The diploid genome sequence of an individual human. PLoS Biology. 2007;5:e254.
    1. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876.
    1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59.
    1. Wang J, Wang W, Li R, Li Y, Tian G, et al. The diploid genome sequence of an asian individual. Nature. 2008;456:60–65.
    1. Ahn S, Kim T, Lee S, Kim D, Ghang H, et al. The first korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Research. 2009;19:1622–1629.
    1. Li Y, Wang J. Faster human genome sequencing. Nat Biotech. 2009;27:820–821.
    1. Collins FS, Barker AD. Mapping the cancer genome. pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies. Scientific American. 2007;296:50–57.
    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410.
    1. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402.
    1. Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, et al. Database indexing for production MegaBLAST searches. Bioinformatics. 2008;24:1757–1764.
    1. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008;18:1851–1858.
    1. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760.
    1. Li R, Yu C, Li Y, Lam T, Yiu S, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966–1967.
    1. Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PloS One. 2009;4:e7767.
    1. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10:R25.
    1. Eid J, Fehr A, Gray J, Luong K, Lyle J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–138.
    1. McCarthy A. Third generation DNA sequencing: pacific biosciences' single molecule real time technology. Chemistry & Biology. 2010;17:675–676.
    1. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595.
    1. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics. 2010;11:473–483.
    1. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology. 1981;147:195–197.
    1. Ning Z, Cox AJ, Mullikin JC. SSAHA: a fast search method for large DNA databases. Genome Research. 2001;11:1725–1729.
    1. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Research. 2002;12:656–664.
    1. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, et al. Versatile and open software for comparing large genomes. Genome Biology. 2004;5:R12.
    1. Ferragina P, Manzini G. Opportunistic data structures with applications. 2000. 390 In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE Computer Society. Available: .
    1. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.
    1. Huse S, Huber J, Morrison H, Sogin M, Welch D. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology. 2007;8:R143.
    1. Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB. Annotating non-coding regions of the genome. Nat Rev Genet. 2010;11:559–571.
    1. Hugenholtz P, Tyson GW. Microbiology: metagenomics. Nature. 2008;455:481–483.
    1. Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nature Methods. 2007;4:495–500.
    1. Willner D, Thurber RV, Rohwer F. Metagenomic signatures of 86 microbial and viral metagenomes. Environmental Microbiology. 2009;11:1752–1766.
    1. Willner D, Furlan M, Haynes M, Schmieder R, Angly FE, et al. Metagenomic analysis of respiratory tract DNA viral communities in cystic fibrosis and non-cystic fibrosis individuals. PloS One. 2009;4:e7370.
    1. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–484.
    1. Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nature Reviews Genetics. 2009;10:241–251.
    1. Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Meth. 2010;7:365–371.
    1. Li R, Li Y, Zheng H, Luo R, Zhu H, et al. Building the sequence map of the human pan-genome. Nature Biotechnology. 2010;28:57–63.
    1. Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mammalian Genome: Official Journal of the International Mammalian Genome Society. 2009;20:327–338.
    1. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
    1. Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of solexa read mapping. BMC Bioinformatics. 2008;9:128.
    1. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, et al. The human genome browser at UCSC. Genome Research. 2002;12:996–1006.
    1. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011
    1. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2010;38:1767–1771.
    1. Schmieder R, Lim YW, Rohwer F, Edwards R. TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics. 2010;11:341.

Source: PubMed

3
Iratkozz fel