Statistical guidance for experimental design and data analysis of mutation detection in rare monogenic mendelian diseases by exome sequencing

Degui Zhi, Rui Chen

Abstract

Recently, whole-genome sequencing, especially exome sequencing, has successfully led to the identification of causal mutations for rare monogenic Mendelian diseases. However, it is unclear whether this approach can be generalized and effectively applied to other Mendelian diseases with high locus heterogeneity. Moreover, the current exome sequencing approach is limited by false positives and false negatives in mutation detection caused by sequencing errors and other artifacts, and the impact of these limitations on experimental design has not been systematically analyzed. To address these questions, we present a statistical modeling framework to calculate the power (the probability of identifying truly disease-causing genes) under various inheritance models and experimental conditions, providing guidance for both proper experimental design and data analysis. Based on our model, we found that the exome sequencing approach is well-powered for mutation detection in recessive, but not dominant, Mendelian diseases with high locus heterogeneity. A disease gene responsible for as little as 5% of the disease population can be readily identified by sequencing just 200 unrelated patients. Based on these results, for identifying rare Mendelian disease genes, we propose that a viable approach is to combine, sequence, and analyze patients with the same disease together, leveraging the statistical framework presented in this work.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The calculated power of exome sequencing for rare monogenic Mendelian diseases for various parameter combinations.
Figure 2. Genes underlying highly heterogeneous diseases can be identified by sequencing a moderate sized sample.
The calculated power for varying degrees of genetic heterogeneity (R), ranging from 0.01 to 1, is shown. Upper panel: power of Tr for detecting a recessive gene; Middle panel: power difference Tr-Ta for detecting a recessive gene; Lower panel: power of Td for detecting a dominant gene. Other parameters are fixed to the default values: number of mutations m = 300; total number of genes M = 20,000; sensitivity of detecting mutations Ps = 0.8; and the mutation probability equals the genome-wide average, w = 1. See Tables S2, S3 and S4 for denser sampling of R values. Note that power does not always increase monotonically with sample size (zigzag line patterns). Power can drop as the sample size increases because the distribution of the test statistic is discrete: the significance cutoff changes in discrete steps, so the actual test size can be very small (not close to 0.05), as shown in Table S1.
Figure 3. High sensitivity of detecting mutations is required to achieve a useful power.
The power for varying sensitivities of mutation detection, ranging from 0.1 to 1, is shown. Other parameters are fixed to the default values: number of mutations m = 300; total number of genes M = 20,000; genetic heterogeneity R = 0.05; and the mutation probability equals the genome-wide average, w = 1. See Tables S5 and S6 for denser coverage of sensitivities of mutation detection.
Figure 4. Strict filtering of false positives has limited impact on recessive diseases but dramatically reduces the power of detecting dominant disease genes.
The power for varying filtering efficiencies, ranging from 5 to 500, is shown. Upper panel: power of Tr for recessive data; Lower panel: power of Td for dominant data. Other parameters are fixed to the default values: genetic heterogeneity R = 0.05; total number of genes M = 20,000; sensitivity of detecting mutations Ps = 0.8; and the mutation probability equals the genome-wide average, w = 1. See Tables S7 and S8 for denser coverage of filtering efficiencies.
Figure 5. Power is low for long genes.
The power for varying relative mutation probabilities, ranging from 0.1 to 10 times the genome average, is shown. Upper panel: power of Tr for recessive data; Lower panel: power of Td for dominant data. Other parameters are fixed to the default values: number of mutations m = 300; genetic heterogeneity R = 0.05; total number of genes M = 20,000; and sensitivity of detecting mutations Ps = 0.8. See Tables S9 and S10 for denser coverage of relative mutation probabilities.
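
The note under Figure 2 about non-monotonic power can be illustrated with a minimal sketch. It is not the Tr, Ta, or Td statistic of the paper, whose exact definitions are not given here. It assumes a simplified one-sided exact binomial test on the number of patients carrying a candidate mutation in a given gene, with a Bonferroni correction over M genes. The function names, the null probability p0 = m/M, and the alternative probability p1 are hypothetical choices made only for illustration; the values of m, M, Ps, and R are the defaults quoted in the captions.

```python
from math import comb


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p); returns 0 when k > n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, p0, alpha):
    """Smallest count k with P(X >= k | p0) <= alpha; n + 1 if no such k."""
    for k in range(n + 1):
        if binom_tail(n, p0, k) <= alpha:
            return k
    return n + 1


def power(n, p0, p1, alpha):
    """Return (critical count, achieved test size, power) for n patients."""
    k = critical_count(n, p0, alpha)
    return k, binom_tail(n, p0, k), binom_tail(n, p1, k)


if __name__ == "__main__":
    # Default values quoted in the figure captions.
    m, M, Ps, R = 300, 20_000, 0.8, 0.05
    alpha = 0.05 / M                  # assumed Bonferroni correction over genes
    p0 = m / M                        # assumed chance of a background hit per gene
    p1 = R * Ps + (1 - R * Ps) * p0   # assumed hit probability for a causal gene
    for n in range(100, 201, 10):
        k, size, pw = power(n, p0, p1, alpha)
        print(f"n={n:3d}  cutoff={k:2d}  size={size:.2e}  power={pw:.3f}")
```

Because the cutoff is an integer, the achieved size is always at or below the nominal level and jumps whenever the cutoff increases by one, which is the mechanism the caption attributes the zigzag pattern to. The sketch uses floating-point binomial coefficients and is only intended for sample sizes of a few hundred.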

Source: PubMed
