New insights into the pathogenicity of non-synonymous variants through multi-level analysis

Hong Sun, Guangjun Yu

Abstract

Precise classification of non-synonymous single nucleotide variants (SNVs) is a fundamental goal of clinical genetics. Next-generation sequencing technology is effective for establishing the basis of genetic diseases. However, identifying the variants that cause genetic diseases remains a challenge. We analyzed human non-synonymous SNVs from a multilevel perspective to characterize pathogenicity. We showed that computational tools, although each has its own strengths and weaknesses, tend to depend too heavily on the degree of conservation. For mutations at non-degenerate sites, the amino acid sites of pathogenic substitutions show a distinct distribution across the classes of protein domains compared with the sites of benign substitutions. Overlooked disease susceptibility of genes explains in part the failures of computational tools. The more pathogenic sites observed in a gene, the more likely the gene is to be expressed at high abundance or in a highly tissue-specific manner, and to have a high node degree in the protein-protein interaction network. For some false-negative mutations, the loss of function may arise not from a defective protein but from a release from the epigenetically repressed state, which should not occur in multiple biological conditions. Our work adds to the current knowledge of the pathogenicity of non-synonymous SNVs and should thereby benefit the field of clinical genetics.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Overreliance on the degree of conservation in pathogenicity predictions. (a) Percentage of highly conserved positions among variants with false-positive (FP) predictions. (b,c) Percentage of lowly conserved positions among variants with false-negative (FN) predictions, for pathogenic variants annotated by ClinVar (b) and for DM variants annotated by HGMD (c). The dashed lines show the proportion of positions with a high (a) or low (b,c) conservation score over all positions of benign (a) or pathogenic/DM (b,c) variants. The vertebrate phastCons score cutoff separating high from low conservation is set at 0.5. The observed excess of such positions is evaluated by p-values from Pearson’s chi-squared test with respect to the proportion among all annotated positions of benign/pathogenic/DM variants. Significance is indicated as * for p < 0.05, ** for p < 10^-5 and *** for p < 10^-10.
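The test described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a one-sample (goodness-of-fit) Pearson chi-squared test with one degree of freedom against the background proportion, and it assumes that a score at or above the 0.5 cutoff counts as "high"; the caption does not state either detail, and the function names are hypothetical.

```python
import math

PHASTCONS_CUTOFF = 0.5  # cutoff from the caption; treating ">=" as "high" is an assumption


def excess_conservation_test(scores, background_prop):
    """Pearson chi-squared test (1 df) of the share of highly conserved
    positions against a background proportion (assumed test design)."""
    n_total = len(scores)
    n_high = sum(s >= PHASTCONS_CUTOFF for s in scores)
    n_low = n_total - n_high
    expected_high = n_total * background_prop
    expected_low = n_total * (1.0 - background_prop)
    chi2 = ((n_high - expected_high) ** 2 / expected_high
            + (n_low - expected_low) ** 2 / expected_low)
    # For one degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def significance_stars(p_value):
    """Map a p-value to the star notation used in the caption."""
    if p_value < 1e-10:
        return "***"
    if p_value < 1e-5:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Made-up example: 80 of 100 positions are highly conserved,
# compared with a background proportion of 0.5.
example_scores = [0.9] * 80 + [0.1] * 20
chi2, p = excess_conservation_test(example_scores, 0.5)
print(chi2, p, significance_stars(p))
```

The example scores are invented for illustration and are not data from the study.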
Figure 2
The maximum (a) and coefficient of variation (b) of the prediction scores assigned to the four types of nucleotides at the non-degenerate sites corresponding to the four groups of variants annotated by ClinVar and to DM variants annotated by HGMD. Wilcoxon tests were used to test the significance of the differences between groups of variants. Significant differences were observed between pathogenic variants and benign variants for all the computational tools.
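The two per-site summaries named in the Figure 2 caption can be sketched as follows. This assumes the coefficient of variation is the sample standard deviation divided by the absolute mean, and that one score per nucleotide is available at the site; how the tools score the reference allele is not stated in the caption, and the function name is hypothetical.

```python
import statistics


def site_score_summary(scores_by_nucleotide):
    """Return (maximum, coefficient of variation) of the prediction scores
    assigned to the nucleotides at one site.

    The CV is computed as sample SD / |mean| (an assumed definition)."""
    values = list(scores_by_nucleotide.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    cv = sd / abs(mean) if mean != 0 else float("nan")
    return max(values), cv


# Made-up scores for one site; not data from the study.
site = {"A": 0.2, "C": 0.4, "G": 0.6, "T": 0.8}
max_score, cv = site_score_summary(site)
print(max_score, round(cv, 4))
```

A site where every nucleotide receives a similar score has a low coefficient of variation, whereas a site where the tool discriminates strongly between substitutions has a high one.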
Figure 3
Characterization of pathogenicity at the gene level. Cumulative probability distributions of the maximum expression level (a) and the tissue specificity of expression (b) among the 53 normal human tissues examined, and of the ranked protein-protein interaction network degree (c), for all genes, ClinVar genes and HGMD genes. (d,e,f,g,h,i) Analysis of ClinVar genes and HGMD genes according to the number of pathogenic variants found in the gene. Cumulative probability distributions of the maximum expression level (d,g), the tissue specificity of expression (e,h) and the ranked protein-protein interaction network degree (f,i) for genes carrying at least one site of pathogenic variants (n > 0) and genes carrying more than 20 sites of pathogenic variants. ClinVar genes are defined as genes that contain pathogenic variant(s) annotated by the ClinVar database, and HGMD genes are defined as genes that contain DM variant(s) annotated by the HGMD database. Kolmogorov-Smirnov tests were used to test the significance of the differences between gene groups. P-values from pairwise comparisons are shown.
Figure 4
Percentage of cell types in PolyComb state for the non-synonymous sites. Sites of pathogenic variants annotated by ClinVar and DM variants annotated by HGMD are frequently found in repressed PolyComb state (a) as well as in weak repressed PolyComb state (b). For most computational models analyzed in this study, sites of variants with false-negative predictions (red) are more frequently found in repressed or weak repressed PolyComb state compared to the sites of variants with true-positive predictions (blue) for pathogenic variants annotated by ClinVar (c) and DM variants annotated by HGMD (d). Wilcoxon tests were used to test the significance of the differences. P-values are shown.

Source: PubMed
