Genetic variation in an individual human exome

Pauline C Ng, Samuel Levy, Jiaqi Huang, Timothy B Stockwell, Brian P Walenz, Kelvin Li, Nelson Axelrod, Dana A Busam, Robert L Strausberg, J Craig Venter

Abstract

There is much interest in characterizing the variation in a human individual, because this may elucidate what contributes significantly to a person's phenotype, thereby enabling personalized genomics. We focus here on the variants in a person's 'exome,' which is the set of exons in a genome, because the exome is believed to harbor much of the functional variation. We provide an analysis of the approximately 12,500 variants that affect the protein-coding portion of an individual's genome. We identified approximately 10,400 nonsynonymous single nucleotide polymorphisms (nsSNPs) in this individual, of which approximately 15-20% are rare in the human population. We predict that approximately 1,500 nsSNPs affect protein function, and these tend to be heterozygous, rare, or novel. Of the approximately 700 coding indels, approximately half have lengths that are a multiple of three, which causes insertions or deletions of amino acids in the corresponding protein rather than introducing frameshifts. Coding indels also occur frequently at the termini of genes, so even if an indel causes a frameshift, an alternative start or stop site in the gene can still be used to make a functional protein. In summary, we reduced the set of approximately 12,500 nonsilent coding variants by approximately 8-fold to a set of variants that are most likely to have major effects on their proteins' functions. This is our first glimpse of an individual's exome and a snapshot of the current state of personalized genomics. The majority of coding variants in this individual are common and appear to be functionally neutral. Our results also indicate that some variants can be used to improve the current NCBI human reference genome. As more genomes are sequenced, many rare variants and non-SNP variants will be discovered. We present an approach to analyzing the coding variation in humans by proposing multiple bioinformatic methods to home in on possible functional variation.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. The allele frequencies of heterozygous and homozygous nsSNPs in HuRef.
For heterozygous nsSNPs, the minor allele frequency is plotted. For homozygous nsSNPs, the frequency of the allele observed in HuRef is plotted.
Figure 2. The percentage of nsSNPs predicted to affect protein function, by category.
A higher fraction of heterozygous, novel, and rare nsSNPs are predicted to affect function compared to homozygous and common nsSNPs. Rare nsSNPs have allele frequencies ≤ 0.05.
Figure 3. The size distribution of coding indels.
Coding indels are predominantly the size of 3n, where n is an integer. 3n coding indels do not cause frameshifts, whereas non-3n coding indels do.
Figure 4. Location of coding indels.
On the x-axis is the relative protein location of the coding indel, which is the first amino acid position of the indel divided by the protein length. A relative protein location near zero indicates that the indel is located near the N-terminus of the protein and a relative protein location near one indicates that the indel is located near the C-terminus of the protein. Indels occur frequently at the N- and C-termini of proteins.
Figure 5. An example of a homozygous indel located near an exon boundary.
The HuRef assembly has a homozygous insertion of A at chr11: 44881936. This insertion resides inside a coding exon of the gene TP53I11, but is near a 2 bp intron. With this new base inserted, a single amino acid is introduced into the protein sequence, which is a more likely scenario than a 2 bp intron.
Figure 6. The Ka/Ks ratios of Commonly-Affected genes and Rarely-Affected genes.
Commonly-Affected genes have a higher Ka/Ks ratio than Rarely-Affected genes, which suggests that Commonly-Affected genes are under weaker selection.
Figure 7. A summary of the nonsilent coding variants and their observed trends.
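
The two indel quantities described in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names and the input values in the usage lines are hypothetical, and the only rules taken from the text are that an indel whose length is a multiple of three (3n) does not cause a frameshift, and that the relative protein location is the first amino acid position of the indel divided by the protein length.

```python
def is_frameshift(indel_length: int) -> bool:
    """Return True if an indel of this length (in bp) shifts the reading frame.

    Per the Figure 3 caption: 3n indels do not cause frameshifts,
    whereas non-3n indels do.
    """
    if indel_length <= 0:
        raise ValueError("indel length must be a positive number of base pairs")
    return indel_length % 3 != 0


def relative_protein_location(first_aa_position: int, protein_length: int) -> float:
    """Return the first amino acid position of an indel divided by protein length.

    Per the Figure 4 caption: values near 0 are close to the N-terminus,
    values near 1 are close to the C-terminus. Positions are assumed to be
    1-based (an assumption of this sketch, not stated in the source).
    """
    if protein_length <= 0:
        raise ValueError("protein length must be positive")
    if not 1 <= first_aa_position <= protein_length:
        raise ValueError("position must lie within the protein")
    return first_aa_position / protein_length


# Hypothetical inputs, chosen only to exercise the functions.
print(is_frameshift(3))                    # 3 bp indel: in-frame
print(is_frameshift(1))                    # 1 bp indel: frameshift
print(relative_protein_location(5, 100))   # near the N-terminus
print(relative_protein_location(98, 100))  # near the C-terminus
```

Keeping the two calculations as separate functions reflects that the captions treat indel size and indel position as independent properties of a coding indel.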

Source: PubMed
