VarGenius executes cohort-level DNA-seq variant calling and annotation and allows to manage the resulting data through a PostgreSQL database

F Musacchia, A Ciolfi, M Mutarelli, A Bruselles, R Castello, M Pinelli, S Basu, S Banfi, G Casari, M Tartaglia, V Nigro, TUDP, Raffaele Castello, Annalaura Torella, Gaia Esposito, Francesco Musacchia, Margherita Mutarelli, Gerarda Cappuccio, Michele Pinelli, Giorgia Mancano, Silvia Maitz, Nicola Brunetti-Pierri, Giancarlo Parenti, Angelo Selicorni, Sandro Banfi, Vincenzo Nigro, Giorgio Casari

Abstract

Background: Targeted resequencing has become the most widely used and cost-effective approach for identifying causative mutations of Mendelian diseases, for both diagnostic and research purposes. Owing to very rapid technological progress, NGS laboratories are expanding their capabilities to handle a growing number of analyses. Several open source tools are available to build a generic variant calling pipeline, but a tool able to execute multiple analyses simultaneously while also organizing and categorizing the samples is still missing.

Results: Here we describe VarGenius, a Linux-based command-line software able to execute customizable pipelines for the analysis of multiple targeted resequencing datasets using parallel computing. VarGenius provides a database to store the output of the analysis (calling quality statistics, variant annotations, internal allelic variant frequencies) and sample information (personal data, genotypes, phenotypes). VarGenius can also perform the "joint analysis" of hundreds of samples with a single command, drastically reducing the time needed to configure and execute the analysis. VarGenius executes the standard pipeline of the Genome Analysis Toolkit (GATK) Best Practices (GBP) for germline variant calling, annotates the variants using Annovar, and generates a user-friendly output that displays the results through a web page. VarGenius has been tested on a parallel computing cluster of 52 machines with 120 GB of RAM each. Under this configuration, a 50 M whole exome sequencing (WES) analysis for a family (trio or quartet) was executed in about 7 h, a joint analysis of 30 WES in about 24 h, and the parallel analysis of 34 single samples from a 1 M panel in about 2 h.

Conclusions: We developed VarGenius, a "master" tool that addresses the increasing demand for heterogeneous NGS analyses and allows maximum flexibility for downstream analyses. It paves the way to a different kind of analysis, centered on cohorts rather than on singletons. Patient and variant information is stored in the database, and any output file can be accessed programmatically. VarGenius can be used for routine analyses by biomedical researchers with basic Linux skills, while giving computational biologists the flexibility to develop their own algorithms for the comparison and analysis of data. The software is freely available at: https://github.com/frankMusacchia/VarGenius.
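
As an illustration of what programmatic access to such a database might look like, the following minimal sketch finds the samples that carry a given variant. It is not the VarGenius schema or API: the table names (samples, variants, genotypes) are borrowed from the schema figure caption, while every column name, the example rows and the helper samples_with_variant are hypothetical. An in-memory SQLite database stands in for PostgreSQL so that the sketch is self-contained.

```python
import sqlite3

# ASSUMPTION: SQLite (in memory) stands in for the PostgreSQL database.
# Table names follow the schema caption; all columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples   (sampleid INTEGER PRIMARY KEY, samplename TEXT);
CREATE TABLE variants  (varid INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER,
                        ref TEXT, alt TEXT);
CREATE TABLE genotypes (sampleid INTEGER, varid INTEGER, gt TEXT);
""")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(1, "proband"), (2, "mother"), (3, "father")])
conn.execute("INSERT INTO variants VALUES (1, 'chr1', 12345, 'A', 'G')")
conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?)",
                 [(1, 1, "0/1"), (2, 1, "0/1"), (3, 1, "0/0")])


def samples_with_variant(conn, chrom, pos, ref, alt):
    """Return (sample name, genotype) pairs for samples carrying the variant.

    Homozygous-reference ('0/0') and missing ('./.') genotypes are excluded.
    """
    query = """
        SELECT s.samplename, g.gt
        FROM variants v
        JOIN genotypes g ON g.varid = v.varid
        JOIN samples s ON s.sampleid = g.sampleid
        WHERE v.chrom = ? AND v.pos = ? AND v.ref = ? AND v.alt = ?
          AND g.gt NOT IN ('0/0', './.')
        ORDER BY s.samplename
    """
    return conn.execute(query, (chrom, pos, ref, alt)).fetchall()


carriers = samples_with_variant(conn, "chr1", 12345, "A", "G")
print(carriers)
```

With a real PostgreSQL instance the same join could be issued through any PostgreSQL client library; the parameter placeholder syntax would then depend on the driver used.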

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
VarGenius flowchart: the sequential steps that VarGenius can run to execute different pipelines. Dark gray indicates a mandatory step, medium gray an optional one, and light gray the input and output of the pipeline
Fig. 2
VarGenius files and data management: the sample sheet data (containing FASTQ paths, analysis, sample and read file names, sequencing type, target file and user id) are imported into the database. VarGenius automatically chooses different settings for two predefined pipelines: one for exomes and one for amplicon panels. The different tasks (quality check, alignment, refinement, variant calling, variant filtering and output production) are submitted as Torque jobs using the qsub command and scheduled on the cluster
Fig. 3
VarGenius database schema: analysis and sample information is managed at three different levels using the analyzes, samples and readfiles tables, which are also used to keep track of the steps executed. The variants table contains the information about the variants and their allelic frequencies. The statistics table contains the information about each variant in each analysis (quality score, depth, etc.). The genotypes table contains the genotype information obtained with GATK HaplotypeCaller, while the annotation table contains more specific information about the called variants (gene, transcript, exon, nucleotide and amino acid substitutions). The last three tables (transcripts, genes and phenotypes) contain the information used to build the gene annotation
Fig. 4
HTML page with results. This page is given as an example: it is the first page of the web site produced and shows how the results are organized. It contains links to download the output files and tables showing quality check statistics
Fig. 5
Example of a global percentage of target coverage plot. This plot shows the percentage of the target covered in samples obtained with the same target file and can be used to evaluate a sequencing run. VarGenius prints this kind of plot at different levels of coverage (1X, 10X, 20X, 40X, 80X, 100X). The figure shows the coverage of the target at 20X. Three different colors indicate the kinship of the samples (probands, mothers and fathers)
Fig. 6
This figure details the start-and-stop method in VarGenius. Each task takes its input from the output folder of the previous task. Thus, the refinement task takes its input from the alignment task and puts its output in the refine_out folder
Fig. 7
An example query to our database to identify which samples have a specific variant
Fig. 8
Example of a table of total reads on the X and Y chromosomes. This table is displayed in the HTML web page and shows, for each sample in an analysis, the total reads on both chromosome X and Y and their relative percentages
Fig. 9
An example of a segregation test table. This table is displayed in the HTML web page and matches the number of variants found in a sample with a particular genotype against all other samples. It may be used to compare calls in probands and their parents
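
The start-and-stop idea described in the Fig. 6 caption, where each task reads from the output folder of the previous one, can be sketched as follows. This is only an illustration under stated assumptions: the task order follows the Fig. 2 caption, refine_out is the only folder name taken from the source, and every other folder name, the fastq_in folder and the helper functions are hypothetical. Chaining every task strictly to its predecessor is also a simplification.

```python
from pathlib import Path

# Ordered (task, output folder) pairs. Only "refine_out" comes from the
# source text; all other folder names are HYPOTHETICAL placeholders.
TASKS = [
    ("quality_check", "qc_out"),
    ("alignment", "align_out"),
    ("refinement", "refine_out"),
    ("variant_calling", "varcall_out"),
    ("variant_filtering", "varfilt_out"),
    ("output_production", "final_out"),
]
TASK_NAMES = [name for name, _ in TASKS]


def output_folder(task, workdir):
    """Folder where the given task writes its results."""
    return Path(workdir) / TASKS[TASK_NAMES.index(task)][1]


def input_folder(task, workdir, raw_dir="fastq_in"):
    """Folder the task reads from: the previous task's output folder.

    The first task reads from a (hypothetical) raw data folder.
    """
    idx = TASK_NAMES.index(task)
    if idx == 0:
        return Path(workdir) / raw_dir
    return Path(workdir) / TASKS[idx - 1][1]


def plan(start, stop, workdir):
    """List (task, input folder, output folder) from start to stop inclusive."""
    first, last = TASK_NAMES.index(start), TASK_NAMES.index(stop)
    if first > last:
        raise ValueError("start task must not come after stop task")
    return [
        (name, input_folder(name, workdir), output_folder(name, workdir))
        for name in TASK_NAMES[first:last + 1]
    ]


for task, src, dst in plan("alignment", "refinement", "/data/run1"):
    print(f"{task}: {src} -> {dst}")
```

Because every task derives its input location from the task before it, an analysis can be resumed from any intermediate step as long as the previous output folder is still in place.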


Source: PubMed
