Using populations of human and microbial genomes for organism detection in metagenomes

Sasha K Ames, Shea N Gardner, Jose Manuel Marti, Tom R Slezak, Maya B Gokhale, Jonathan E Allen

Abstract

Identifying causative disease agents in human patients from shotgun metagenomic sequencing (SMS) presents a powerful tool to apply when other targeted diagnostics fail. Numerous technical challenges remain, however, before SMS can move beyond the role of research tool. Accurately separating the known and unknown organism content remains difficult, particularly when SMS is applied as a last resort. The true amount of human DNA that remains in a sample after screening against the human reference genome and filtering nonbiological components left from library preparation has previously been underreported. In this study, we create the most comprehensive collection of microbial and reference-free human genetic variation available in a database optimized for efficient metagenomic search by extracting sequences from GenBank and the 1000 Genomes Project. The results reveal new human sequences found in individual Human Microbiome Project (HMP) samples. Individual samples contain up to 95% human sequence, and 4% of the individual HMP samples contain 10% or more human reads. Left unidentified, human reads can complicate and slow further analysis, lead to inaccurately labeled microbial taxa, and ultimately raise privacy concerns as more human genome data is collected.

© 2015 Ames et al.; Published by Cold Spring Harbor Laboratory Press.

Figures

Figure 1.
Average percentage of reads identified as human sequence in HMP samples, using LMAT-Ref, LMAT-GenBank, or LMAT-Grand by body site.
Figure 2.
Histogram showing how often different amounts of human reads are found across the collection of sequencer runs. The x-axis displays human read abundance in sequencer runs in bins of 2%. The y-axis shows, on a log scale, the percentage of sequencer runs with the amount of human reads specified on the x-axis. The highest fraction of human reads in a sequencer run is 94%, found in a single run.
Figure 3.
Sensitive BLAST search based assignment of reads from an HMP sample reported to have a high abundance of newly labeled human reads. The left panel shows the distribution of taxonomic assignments after reads were binned into clusters of similar reads. The right panel shows the raw abundance based on read counts for each read assignment. Taxonomic assignments with a 0% abundance label reflect percentages <1%.
Figure 4.
Fraction of shared genus (left) and species (right) calls. ROC curve shown using different minimum abundance thresholds to make organism calls. Different taxonomy calling methods are shown. HMP DACC, MetaPhlAn, and LMAT taxonomy calls with different database types: LMAT-RefSeq (RefSeq), LMAT-ML (ML), and LMAT-ML-Human (ML+humanNoprune).

Source: PubMed
