A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples

Samia N Naccache, Scot Federman, Narayanan Veeraraghavan, Matei Zaharia, Deanna Lee, Erik Samayoa, Jerome Bouquet, Alexander L Greninger, Ka-Cheung Luk, Barryett Enge, Debra A Wadford, Sharon L Messenger, Gillian L Genrich, Kristen Pellegrino, Gilda Grard, Eric Leroy, Bradley S Schneider, Joseph N Fair, Miguel A Martínez, Pavel Isa, John A Crump, Joseph L DeRisi, Taylor Sittler, John Hackett Jr, Steve Miller, Charles Y Chiu

Abstract

Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7-500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.

© 2014 Naccache et al.; Published by Cold Spring Harbor Laboratory Press.

Figures

Figure 1.
The SURPI pipeline for pathogen detection. (A) A schematic overview of the SURPI pipeline. Raw NGS reads are preprocessed by removal of adapter, low-quality, and low-complexity sequences, followed by computational subtraction of human reads using SNAP. In fast mode, viruses and bacteria are identified by SNAP alignment to viral and bacterial nucleotide databases. In comprehensive mode, reads are aligned using SNAP to all nucleotide sequences in the NCBI nt collection, enabling identification of bacteria, fungi, parasites, and viruses. For pathogen discovery of divergent microorganisms, unmatched reads and contigs generated from de novo assembly are then aligned to a viral protein database or all protein sequences in the NCBI nr collection using RAPSearch. SURPI reports include a list of all classified reads with taxonomic assignments, a summary table of read counts, and both viral and bacterial genomic coverage maps. (B) Relative proportion of NGS reads classified as human, bacterial, viral, or other in different clinical sample types. (C) The SNAP nucleotide aligner (Zaharia et al. 2011). SNAP aligns reads by generating a hash table of sequences of length “s” from the reference database and then comparing the hash index with “n” seeds of length “s” generated from the query sequence, producing a match based on the edit distance “d.” (D) The RAPSearch protein similarity search tool (Zhao et al. 2012). RAPSearch aligns translated nucleotide queries to a protein database using a compressed amino acid alphabet at the level of chemical similarity for greatly increased processing speed.
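The seed-and-hash idea described in panel C can be illustrated with a toy sketch. This is not SNAP's actual implementation: the function names (`build_index`, `edit_distance`, `align_read`), the evenly spaced seed placement, and the plain Levenshtein check are assumptions made purely for illustration. Only the parameters "s" (seed length), "n" (number of seeds), and "d" (maximum edit distance) come from the caption.

```python
from collections import defaultdict


def build_index(reference: str, s: int) -> dict:
    """Map every length-s substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - s + 1):
        index[reference[i:i + s]].append(i)
    return index


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def align_read(read: str, reference: str, index: dict, s: int, n: int, d: int):
    """Return (start, distance) of the best candidate within distance d, or None."""
    max_offset = len(read) - s
    if max_offset < 0:
        return None
    # Hypothetical seed placement: up to n seeds spread evenly across the read.
    offsets = sorted({round(k * max_offset / max(n - 1, 1)) for k in range(n)})
    candidates = set()
    for off in offsets:
        for pos in index.get(read[off:off + s], ()):
            start = pos - off  # implied start of the whole read on the reference
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        dist = edit_distance(read, reference[start:start + len(read)])
        if dist <= d and (best is None or dist < best[1]):
            best = (start, dist)
    return best
```

The point of the hash lookup is that the expensive edit-distance check runs only at the few reference positions nominated by seed hits, rather than at every position in the database.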
Figure 2.
SURPI aligners (SNAP and RAPSearch) are comparable to other tested aligners for detection of human, bacterial, and viral reads from in silico-generated query data sets. ROC curves were generated to evaluate the ability of four nucleotide aligners (SNAP, BWA, BT2, and BLASTn) to correctly detect in silico-generated NGS reads when mapped against the human DB (A), bacterial DB (B), or viral nucleotide DB (C). The accuracy of detection was assessed using Youden’s index and the F1 score. Sensitivity or the true positive rate (TPR) (y-axis) is plotted against 1-specificity or the false positive rate (FPR) (x-axis). (D) Detection of reads corresponding to four viral genomes [norovirus, Zaire ebolavirus, influenza A(H1N1)pdm09, and HIV-1] by nucleotide alignment. (E) Detection of reads corresponding to three divergent viruses (TMAdV, BASV, and bat influenza H17N10, a novel influenza strain) by nucleotide alignment. (F) Detection of reads corresponding to three divergent viruses (TMAdV, BASV, and bat influenza H17N10) by translated nucleotide (protein) alignment using the RAPSearch and BLASTx aligners. The sequences of these viruses were removed from the nucleotide and protein viral reference databases prior to alignment. The lower shaded panels are magnifications of the corresponding shaded boxed regions in the upper panels.
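For readers unfamiliar with the accuracy metrics named in this caption, the following minimal sketch computes a ROC point, Youden's index (J = TPR − FPR), and the F1 score from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the study's data.

```python
def roc_point(tp: int, fp: int, tn: int, fn: int) -> tuple:
    """Return (TPR, FPR): sensitivity and 1 - specificity."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def youden_index(tp: int, fp: int, tn: int, fn: int) -> float:
    """Youden's J = sensitivity + specificity - 1 = TPR - FPR."""
    tpr, fpr = roc_point(tp, fp, tn, fn)
    return tpr - fpr


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for one aligner at one threshold.
tpr, fpr = roc_point(tp=90, fp=5, tn=95, fn=10)
j = youden_index(tp=90, fp=5, tn=95, fn=10)
f1 = f1_score(tp=90, fp=5, fn=10)
```

Evaluating these quantities at each alignment threshold yields the points of a ROC curve. One common use of Youden's index is to pick the threshold that maximizes J.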
Figure 3.
SURPI aligners (SNAP and RAPSearch) are significantly faster than other tested aligners and scale better with larger data sets. Timing performance was benchmarked on a single computational server using in silico query data sets of increasing size. The breaks (zigzag lines) represent computational times that are off-scale. Some of the computational times were estimated (asterisks). (A) Performance time for alignment of reads to the human DB. (B) Performance time for SNAP alignment of reads to the entire 42-Gb NCBI nt DB. The z-axis denotes the approximate number of remaining reads following computational subtraction against the human DB. SNAP performance times were benchmarked separately on local and cloud servers. (C) Performance times for translated nucleotide alignment to the viral protein DB using RAPSearch and BLASTx.
Figure 4.
SURPI aligners (SNAP and RAPSearch) are comparable to other tested aligners for detection of viral reads in clinical NGS data sets. ROC curves were generated to evaluate the ability of nucleotide and translated nucleotide (protein) aligners to detect reads corresponding to three target viruses: (A) respiratory syncytial virus (RSV) from stool; (B) influenza A(H1N1)pdm09 from a nasal swab; and (C) Sin Nombre hantavirus from serum. Sensitivity or the true positive rate (TPR) (y-axis) is plotted against 1-specificity or the false positive rate (FPR) (x-axis). For each aligner, reads assigned to the correct viral genus were used for generating the ROC curve. The shaded panels are magnifications of the corresponding shaded regions in the upper panels (A–C, nucleotide alignment) or overlapping larger panel (C, translated nucleotide alignment).
Figure 5.
The SURPI pipeline correctly identifies viral species in clinical NGS data sets. Data sets corresponding to clinical samples or sample pools harboring target viral pathogens were analyzed using SURPI. Pie charts show detected viruses derived from the output summary tables. Target viruses are color-coded in yellow or orange; other viruses are color-coded ranked by their relative abundance in shades of blue, followed by shades of purple. Coverage maps of the “best hit” viral genome in fast mode (red) and comprehensive mode (pink, overlaid by red) display automated SURPI output corresponding to the detected target viral genome (blue text). The read coverage (y-axis, log scale) and de novo assembled contigs (black lines) are plotted as a function of nucleotide position along the genome (x-axis). Percent coverage achieved using SURPI in fast mode (“FAST”), in comprehensive mode (“COMPREHENSIVE”), and by de novo assembly (“ASSEMBLY”), as well as the actual coverage from all reads in the data set (“ALL”) are shown. (A) Coverage plots of HIV-1 spiked at titers of 10^2 to 10^4 copies/mL. The number of mapped reads and percent coverage are plotted against the viral copy number (inset). Coverage plots of SaV and HPeV-1 (B), HPV-18 (C), HHV-3 (D), and HCV-1b (E). (F) Coverage plot mapping SURPI-classified genus-level Mastadenovirus reads (red/pink) to the SAdV-18 genome, or Mastadenovirus reads (red/pink) and all specific TMAdV reads (gray) to the TMAdV genome. (G) Coverage plots mapping SURPI-classified family-level Rhabdoviridae reads (pink) or all specific BASV reads (gray) to the BASV genome.
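As a rough illustration of the "percent coverage" figure quoted in these plots, the sketch below computes the fraction of genome positions covered by at least one mapped read. It assumes that definition of coverage, and it assumes reads are supplied as 0-based half-open `(start, end)` intervals. The function name and input format are hypothetical and do not reflect SURPI's own code.

```python
def percent_coverage(genome_length: int, read_intervals) -> float:
    """Percent of genome positions covered by at least one read interval."""
    covered = 0
    current_end = 0  # rightmost position already counted
    for start, end in sorted(read_intervals):
        start = max(start, current_end, 0)  # skip positions already counted
        end = min(end, genome_length)       # clip reads running past the genome end
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / genome_length


# Hypothetical example: three reads on a 1,000-nt genome, two of them overlapping.
example = percent_coverage(1000, [(0, 100), (50, 150), (300, 400)])
```

Sorting the intervals first lets overlapping reads be merged in a single pass, so each position is counted only once regardless of read depth.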
Figure 6.
The SURPI pipeline correctly identifies bacterial and parasitic species in clinical NGS data sets. Three NGS data sets corresponding to clinical samples or sample pools and found to harbor target pathogenic bacteria or parasites were analyzed using SURPI in comprehensive mode. Pie charts represent the breakdown of SURPI-classified pathogen reads by family. (A) Serum from an individual with acute hemorrhagic fever in the Democratic Republic of the Congo (DRC), Africa, was analyzed by unbiased NGS. NGS reads identified as Plasmodium by SURPI are mapped to the 14 chromosomes of Plasmodium falciparum clone 3D7, including multiple hits to telomeric ends by reads corresponding to the var gene (Gardner et al. 2002). (B) Serum from a patient who died from a critical febrile illness in Tanzania, Africa (Crump et al. 2013) was analyzed using NGS. SURPI generates a coverage map corresponding to the “best hit” bacterial genome, Haemophilus influenzae. (C) SURPI was used to classify the diversity of bacterial species in 22 clinical samples, 11 from colorectal tumors and 11 from normal tissue (Castellarin et al. 2012). For the top 10 bacterial species, the fold-increase in the average normalized abundance between normal and diseased tissue is plotted in rank order from most to least abundant.
Figure 7.
Speed of SURPI and feasibility for real-time clinical analysis. (A) Timing performance for SURPI in fast mode (red) and comprehensive mode (blue) was benchmarked on a single computational server across 12 NGS data sets representing a variety of infectious diseases and sample types. Processing end-to-end-times are plotted against the number of reads (inset), along with regression trend lines corresponding to SURPI processing in fast and comprehensive modes. (B) A serum sample from a returning traveler with an acute febrile illness was analyzed using NGS, resulting in SURPI detection of human herpesvirus 7 (HHV-7) infection (inset, coverage plot) in a clinically relevant 48-h timeframe.

References

    1. Akobeng AK 2007. Understanding diagnostic tests 3: receiver operating characteristic curves. Acta Paediatr 96: 644–647
    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ 1990. Basic local alignment search tool. J Mol Biol 215: 403–410
    1. Barnes GL, Uren E, Stevens KB, Bishop RF 1998. Etiology of acute gastroenteritis in hospitalized children in Melbourne, Australia, from April 1980 to March 1993. J Clin Microbiol 36: 133–138
    1. Bhaduri A, Qu K, Lee CS, Ungewickell A, Khavari PA 2012. Rapid identification of non-human sequences in high-throughput sequencing datasets. Bioinformatics 28: 1174–1175
    1. Bloch KC, Glaser C 2007. Diagnostic approaches for patients with suspected encephalitis. Curr Infect Dis Rep 9: 315–322
    1. Borozan I, Wilson S, Blanchette P, Laflamme P, Watt SN, Krzyzanowski PM, Sircoulomb F, Rottapel R, Branton PE, Ferretti V 2012. CaPSID: a bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes. BMC Bioinformatics 13: 206.
    1. Briese T, Paweska JT, McMullan LK, Hutchison SK, Street C, Palacios G, Khristova ML, Weyer J, Swanepoel R, Egholm M, et al. 2009. Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog 5: e1000455.
    1. Castellarin M, Warren RL, Freeman JD, Dreolini L, Krzywinski M, Strauss J, Barnes R, Watson P, Allen-Vercoe E, Moore RA, et al. 2012. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res 22: 299–306
    1. Chen EC, Yagi S, Kelly KR, Mendoza SP, Tarara RP, Canfield DR, Maninger N, Rosenthal A, Spinner A, Bales KL, et al. 2011. Cross-species transmission of a novel adenovirus associated with a fulminant pneumonia outbreak in a new world monkey colony. PLoS Pathog 7: e1002155.
    1. Chiu CY 2013. Viral pathogen discovery. Curr Opin Microbiol 16: 468–478
    1. Collins FS, Hamburg MA 2013. First FDA authorization for next-generation sequencer. N Engl J Med 369: 2369–2371
    1. Crump JA, Morrissey AB, Nicholson WL, Massung RF, Stoddard RA, Galloway RL, Ooi EE, Maro VP, Saganda W, Kinabo GD, et al. 2013. Etiology of severe non-malaria febrile illness in Northern Tanzania: a prospective cohort study. PLoS Negl Trop Dis 7: e2324.
    1. Delwart EL 2007. Viral metagenomics. Rev Med Virol 17: 115–131
    1. Denno DM, Shaikh N, Stapp JR, Qin X, Hutter CM, Hoffman V, Mooney JC, Wood KM, Stevens HJ, Jones R, et al. 2012. Diarrhea etiology in a pediatric emergency department: a case control study. Clin Infect Dis 55: 897–904
    1. Dimon MT, Wood HM, Rabbitts PH, Arron ST 2013. IMSA: integrated metagenomic sequence analysis for identification of exogenous reads in a host genomic background. PLoS ONE 8: e64546.
    1. Dunne WM Jr, Westblade LF, Ford B 2012. Next-generation and whole-genome sequencing in the diagnostic clinical microbiology laboratory. Eur J Clin Microbiol Infect Dis 31: 1719–1726
    1. Firth C, Lipkin WI 2013. The genomics of emerging pathogens. Annu Rev Genomics Hum Genet 14: 281–300
    1. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, et al. 2002. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498–511
    1. Gargis AS, Kalman L, Berry MW, Bick DP, Dimmock DP, Hambuch T, Lu F, Lyon E, Voelkerding KV, Zehnbauer BA, et al. 2012. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 30: 1033–1036
    1. Grard G, Fair JN, Lee D, Slikas E, Steffen I, Muyembe JJ, Sittler T, Veeraraghavan N, Ruby JG, Wang C, et al. 2012. A novel rhabdovirus associated with acute hemorrhagic fever in central Africa. PLoS Pathog 8: e1002924.
    1. Gremme G, Steinbiss S, Kurtz S 2013. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans Comput Biol Bioinformatics 10: 645–656
    1. Greninger AL, Chen EC, Sittler T, Scheinerman A, Roubinian N, Yu G, Kim E, Pillai DR, Guyard C, Mazzulli T, et al. 2010. A metagenomic analysis of pandemic influenza A (2009 H1N1) infection in patients from North America. PLoS ONE 5: e13381.
    1. Knope K, Whelan P, Smith D, Johansen C, Moran R, Doggett S, Sly A, Hobby M, Kurucz N, Wright P, et al. 2013. Arboviral diseases and malaria in Australia, 2010-11: annual report of the National Arbovirus and Malaria Advisory Committee. Commun Dis Intell Q Rep 37: E1–E20
    1. Kollef KE, Schramm GE, Wills AR, Reichley RM, Micek ST, Kollef MH 2008. Predictors of 30-day mortality and hospital costs in patients with ventilator-associated pneumonia attributed to potentially antibiotic-resistant gram-negative bacteria. Chest 134: 281–287
    1. Kostic AD, Ojesina AI, Pedamallu CS, Jung J, Verhaak RG, Getz G, Meyerson M 2011. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol 29: 393–396
    1. Kostic AD, Gevers D, Pedamallu CS, Michaud M, Duke F, Earl AM, Ojesina AI, Jung J, Bass AJ, Tabernero J, et al. 2012. Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res 22: 292–298
    1. Langmead B, Salzberg SL 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357–359
    1. Li H, Durbin R 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760
    1. Loman NJ, Constantinidou C, Chan JZ, Halachev M, Sergeant M, Penn CW, Robinson ER, Pallen MJ 2012. High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol 10: 599–606
    1. Louie JK, Hacker JK, Gonzales R, Mark J, Maselli JH, Yagi S, Drew WL 2005. Characterization of viral agents causing acute respiratory infection in a San Francisco University Medical Center Clinic during the influenza season. Clin Infect Dis 41: 822–828
    1. MacConaill L, Meyerson M 2008. Adding pathogens by genomic subtraction. Nat Genet 40: 380–382
    1. Martin M 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17: 1
    1. Morse SS, Mazet JA, Woolhouse M, Parrish CR, Carroll D, Karesh WB, Zambrana-Torrelio C, Lipkin WI, Daszak P 2012. Prediction and prevention of the next pandemic zoonosis. Lancet 380: 1956–1965
    1. Naeem R, Rashid M, Pain A 2013. READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation. Bioinformatics 29: 391–392
    1. Niu B, Zhu Z, Fu L, Wu S, Li W 2011. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics 27: 1704–1705
    1. Nunez JJ, Fritz CL, Knust B, Buttke D, Enge B, Novak MG, Kramer V, Osadebe L, Messenger S, Albarino CG, et al. 2014. Hantavirus infections among overnight visitors to Yosemite National Park, California, USA, 2012. Emerg Infect Dis 20: 386–393
    1. Pruitt KD, Tatusova T, Maglott DR 2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61–D65
    1. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y 2012. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13: 341.
    1. Schmieder R, Edwards R 2011. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27: 863–864
    1. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I 2009. ABySS: a parallel assembler for short read sequence data. Genome Res 19: 1117–1123
    1. Sommer DD, Delcher AL, Salzberg SL, Pop M 2007. Minimus: a fast, lightweight genome assembler. BMC Bioinformatics 8: 64.
    1. Swei A, Russell BJ, Naccache SN, Kabre B, Veeraraghavan N, Pilgard MA, Johnson BJ, Chiu CY 2013. The genome sequence of Lone Star virus, a highly divergent bunyavirus found in the Amblyomma americanum tick. PLoS ONE 8: e62083.
    1. Tong S, Li Y, Rivailler P, Conrardy C, Castillo DA, Chen LM, Recuenco S, Ellison JA, Davis CT, York IA, et al. 2012. A distinct lineage of influenza A virus from bats. Proc Natl Acad Sci 109: 4269–4274
    1. Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M 2011. Next generation sequence assembly with AMOS. Curr Protoc Bioinformatics 33: 11.8.1–11.8.18
    1. van Gageldonk-Lafeber AB, Heijnen ML, Bartelds AI, Peters MF, van der Plas SM, Wilbrink B 2005. A case-control study of acute respiratory tract infection in general practice patients in The Netherlands. Clin Infect Dis 41: 490–497
    1. Wang Q, Jia P, Zhao Z 2013. VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS ONE 8: e64465.
    1. Ward KN, Kalima P, MacLeod KM, Riordan T 2002. Neuroinvasion during delayed primary HHV-7 infection in an immunocompetent adult with encephalitis and flaccid paralysis. J Med Virol 67: 538–541
    1. Wilson MR, Naccache SN, Samayoa E, Biagtan M, Bashir H, Yu G, Salamat SM, Somasekar S, Federman S, Miller S, et al. 2014. Actionable diagnosis of neuroleptospirosis by next-generation sequencing. N Engl J Med doi: 10.1056/NEJMoa1401268
    1. Wylie KM, Mihindukulasuriya KA, Sodergren E, Weinstock GM, Storch GA 2012. Sequence analysis of the human virome in febrile and afebrile children. PLoS ONE 7: e27735.
    1. Xu B, Liu L, Huang X, Ma H, Zhang Y, Du Y, Wang P, Tang X, Wang H, Kang K, et al. 2011. Metagenomic analysis of fever, thrombocytopenia and leukopenia syndrome (FTLS) in Henan Province, China: discovery of a new bunyavirus. PLoS Pathog 7: e1002369.
    1. Yu G, Greninger AL, Isa P, Phan TG, Martinez MA, de la Luz Sanchez M, Contreras JF, Santos-Preciado JI, Parsonnet J, Miller S, et al. 2012. Discovery of a novel polyomavirus in acute diarrheal samples from children. PLoS ONE 7: e49449.
    1. Zaharia M, Bolosky WJ, Curtis K, Fox A, Patterson D, Shenker S, Stoica I, Karp RM, Sittler T 2011. Faster and more accurate sequence alignment with SNAP. arXiv 1111.5572
    1. Zaki AM, van Boheemen S, Bestebroer TM, Osterhaus AD, Fouchier RA 2012. Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. N Engl J Med 367: 1814–1820
    1. Zhao Y, Tang H, Ye Y 2012. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 28: 125–126
    1. Zhao G, Krishnamurthy S, Cai Z, Popov VL, Travassos da Rosa AP, Guzman H, Cao S, Virgin HW, Tesh RB, Wang D 2013. Identification of novel viruses using VirusHunter–an automated data analysis pipeline. PLoS ONE 8: e78470.
    1. Zweig MH, Campbell G 1993. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 39: 561–577

Source: PubMed
