Collective judgment predicts disease-associated single nucleotide variants

Emidio Capriotti, Russ B Altman, Yana Bromberg

Abstract

Background: In recent years the number of human genetic variants deposited in publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. Non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the steady increase in annotated data offers an opportunity to improve prediction accuracy.

Results: Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthews correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine learning-based approach that discriminates between disease-associated and polymorphic variants (Meta-SNP). In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation than the best-performing single method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor.
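As a minimal sketch of the two summary statistics quoted above, the following computes overall accuracy and the MCC from the four cells of a binary confusion matrix (disease-associated treated as the positive class). The function name and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (overall accuracy, Matthews correlation coefficient)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # The MCC denominator is zero when any row or column of the
    # confusion matrix is empty; by convention the MCC is then 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts, for illustration only.
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
print(f"accuracy={acc:.2f} MCC={mcc:.2f}")
```

Unlike accuracy, the MCC uses all four cells, so it stays informative when the disease and polymorphism classes are unevenly represented.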

Conclusions: Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships. Careful combination of predictions from various resources is therefore a good strategy for the selection of high-reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC. The Meta-SNP server is freely accessible at http://snps.biofold.org/meta-snp.
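The agreement-based subsets can be illustrated with a short sketch that labels a variant by how the four component predictors vote. The labels Consensus, Majority and Tie follow the subset names used in the figure captions; the function name, the boolean encoding, and the reading of Majority as a 3-versus-1 split are assumptions made for illustration, not details confirmed by the abstract.

```python
def agreement_class(calls):
    """Classify a variant by the agreement among four binary predictions.

    calls: four booleans, one per predictor (True = predicted disease).
    """
    if len(calls) != 4:
        raise ValueError("expected exactly four predictions")
    disease_votes = sum(bool(c) for c in calls)
    if disease_votes in (0, 4):
        return "Consensus"  # all four predictors agree
    if disease_votes == 2:
        return "Tie"        # half of the predictors disagree with the other half
    return "Majority"       # assumed: three predictors agree, one dissents


# Hypothetical predictions in the order PANTHER, PhD-SNP, SIFT, SNAP.
print(agreement_class([True, True, False, False]))
```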

Figures

Figure 1
Illustrating orthogonality of the component methods. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) trees visualize the similarity between PANTHER, PhD-SNP, SIFT and SNAP according to the overlap (panel A) and the correlation (panel B) between the predictions in Table 2. The trees were drawn using the drawtree package [31].
Figure 2
Venn diagram of prediction overlaps. Overlap between the predictions returned by PANTHER (blue), PhD-SNP (red), SIFT (grey) and SNAP (green), generated using Venny [32].
Figure 3
The overlap in sequence profile-based feature distributions is most visible for the hardest-to-predict set of variants. Distributions of the frequencies of the wild-type (Fwt) and mutant (Fm) residues and conservation indices (CI) for disease-related (red) and polymorphic (blue) nsSNVs were computed from sequence profiles. The distributions are calculated on the SV-2009 dataset (panels A, B, C) and its subsets: Consensus (panels D, E, F), Majority (panels G, H, I) and Tie (panels J, K, L). Distributions of all profile features overlap most for the Tie set and least for the Consensus set.
Figure 4
Meta-SNP is more accurate in predicting disease-associated nsSNVs than all of its components for all data sets. (A) Receiver operating characteristic (ROC) curves for all the prediction algorithms show that Meta-SNP is a better predictor than all of its component methods. (B) Of all subsets of SV-2009, however, Meta-SNP performs best on the Consensus set, followed by Majority and Tie subsets.
Figure 5
Comparison between Meta-SNP, CONDEL and PhD-SNP. (A) Performances of CONDEL, PhD-SNP and Meta-SNP on the NSV-2012 dataset. (B) Meta-SNP accuracy on the NSV-2012 dataset and its three subsets. (C) Accuracy of Meta-SNP in terms of TPR and TNR improves as a function of increasing RI. Note that there are only 25 disease-causing variants at RI = 9, resulting in an artifact of the curve: an unexpected drop in accuracy. TPR, TNR and RI are defined in Methods. DB is the fraction of the NSV-2012 dataset with an RI greater than or equal to a given threshold.
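The quantities plotted in panel C of Figure 5 can be sketched as follows. The abstract page does not reproduce the Methods, so this sketch assumes the standard definitions TPR = TP / (TP + FN) and TNR = TN / (TN + FP), and takes the RI of each prediction as a given integer. The function name, the record layout and the toy data are hypothetical.

```python
def coverage_and_rates(records, threshold):
    """Return (DB, TPR, TNR) for predictions with RI >= threshold.

    records: list of (ri, is_disease, predicted_disease) tuples.
    DB is the fraction of all records retained at this threshold.
    TPR or TNR is None when the retained subset has no variants
    of the corresponding true class.
    """
    kept = [r for r in records if r[0] >= threshold]
    db = len(kept) / len(records)
    tp = sum(1 for _, truth, pred in kept if truth and pred)
    fn = sum(1 for _, truth, pred in kept if truth and not pred)
    tn = sum(1 for _, truth, pred in kept if not truth and not pred)
    fp = sum(1 for _, truth, pred in kept if not truth and pred)
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return db, tpr, tnr


# Toy data, for illustration only.
toy = [(9, True, True), (8, False, False), (3, True, False),
       (2, False, True), (5, True, True)]
for t in (0, 5):
    print(t, coverage_and_rates(toy, t))
```

Raising the threshold trades coverage (DB) for reliability; with very few variants left in a bin, as noted for RI = 9, the rates become noisy.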

References

    1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
    2. Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, Abeysinghe S, Krawczak M, Cooper DN. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat. 2003;21(6):577–581. doi: 10.1002/humu.10212.
    3. Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011;32(4):358–368. doi: 10.1002/humu.21445.
    4. Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat Biotechnol. 2012;30(11):1095–1106. doi: 10.1038/nbt.2422.
    5. Capriotti E, Nehrt NL, Kann MG, Bromberg Y. Bioinformatics for personal genome interpretation. Brief Bioinform. 2012;13(4):495–512. doi: 10.1093/bib/bbr070.
    6. Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011;27(13):1741–1748. doi: 10.1093/bioinformatics/btr295.
    7. Amberger J, Bocchini CA, Scott AF, Hamosh A. McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 2009;37:D793–D796. doi: 10.1093/nar/gkn665.
    8. Schaefer C, Bromberg Y, Achten D, Rost B. Disease-related mutations predicted to impact protein function. BMC Genomics. 2012;13(Suppl 4):S11. doi: 10.1186/1471-2164-13-S4-S11.
    9. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005;37(4):413–417. doi: 10.1038/ng1537.
    10. Kraft P, Hunter DJ. Genetic risk prediction--are we there yet? N Engl J Med. 2009;360(17):1701–1703. doi: 10.1056/NEJMp0810107.
    11. Bromberg Y, Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007;35(11):3823–3835. doi: 10.1093/nar/gkm238.
    12. Capriotti E, Calabrese R, Fariselli P, Martelli PL, Casadio R. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat. 2009;30(8):1237–1244. doi: 10.1002/humu.21047.
    13. Capriotti E, Altman RB. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics. 2011.
    14. Capriotti E, Altman RB. Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics. 2011;12(Suppl 4):S3. doi: 10.1186/1471-2105-12-S4-S3.
    15. Capriotti E, Arbiza L, Casadio R, Dopazo J, Dopazo H, Marti-Renom MA. Use of estimated evolutionary strength at the codon level improves the prediction of disease-related protein mutations in humans. Hum Mutat. 2008;29(1):198–204. doi: 10.1002/humu.20628.
    16. Capriotti E, Calabrese R, Casadio R. Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics. 2006;22(22):2729–2734. doi: 10.1093/bioinformatics/btl423.
    17. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31(13):3812–3814. doi: 10.1093/nar/gkg509.
    18. Thomas PD, Kejariwal A. Coding single-nucleotide polymorphisms associated with complex vs. Mendelian disease: evolutionary evidence for differences in molecular effects. Proc Natl Acad Sci USA. 2004;101(43):15398–15403. doi: 10.1073/pnas.0404380101.
    19. Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25(21):2744–2750. doi: 10.1093/bioinformatics/btp528.
    20. Thusberg J, Vihinen M. Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat. 2009;30(5):703–714. doi: 10.1002/humu.20938.
    21. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002;30(17):3894–3900. doi: 10.1093/nar/gkf493.
    22. Mottaz A, David FP, Veuthey AL, Yip YL. Easy retrieval of single amino-acid polymorphisms and phenotype information using SwissVar. Bioinformatics. 2010;26(6):851–852. doi: 10.1093/bioinformatics/btq028.
    23. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–311. doi: 10.1093/nar/29.1.308.
    24. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389.
    25. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–1288. doi: 10.1093/bioinformatics/btm098.
    26. Pei J, Grishin NV. AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics. 2001;17(8):700–712. doi: 10.1093/bioinformatics/17.8.700.
    27. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software: An Update. SIGKDD Explorations. 2009;11.
    28. Kawabata T, Ota M, Nishikawa K. The Protein Mutant Database. Nucleic Acids Res. 1999;27(1):355–357. doi: 10.1093/nar/27.1.355.
    29. Schaefer C, Meier A, Rost B, Bromberg Y. SNPdbe: constructing an nsSNP functional impacts database. Bioinformatics. 2012;28(4):601–602. doi: 10.1093/bioinformatics/btr705.
    30. Gonzalez-Perez A, Lopez-Bigas N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet. 2011;88(4):440–449. doi: 10.1016/j.ajhg.2011.03.004.
    31. DrawTree.
    32. VENNY. An interactive tool for comparing lists with Venn Diagrams.

Source: PubMed
