A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples
Samia N Naccache, Scot Federman, Narayanan Veeraraghavan, Matei Zaharia, Deanna Lee, Erik Samayoa, Jerome Bouquet, Alexander L Greninger, Ka-Cheung Luk, Barryett Enge, Debra A Wadford, Sharon L Messenger, Gillian L Genrich, Kristen Pellegrino, Gilda Grard, Eric Leroy, Bradley S Schneider, Joseph N Fair, Miguel A Martínez, Pavel Isa, John A Crump, Joseph L DeRisi, Taylor Sittler, John Hackett Jr, Steve Miller, Charles Y Chiu
Abstract
Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7-500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.
© 2014 Naccache et al.; Published by Cold Spring Harbor Laboratory Press.
Source: PubMed