Methods for high-density admixture mapping of disease genes

Nick Patterson, Neil Hattangadi, Barton Lane, Kirk E Lohmueller, David A Hafler, Jorge R Oksenberg, Stephen L Hauser, Michael W Smith, Stephen J O'Brien, David Altshuler, Mark J Daly, David Reich

Abstract

Admixture mapping (also known as "mapping by admixture linkage disequilibrium," or MALD) has been proposed as an efficient approach to localizing disease-causing variants that differ in frequency (because of either drift or selection) between two historically separated populations. Near a disease gene, patient populations descended from the recent mixing of two or more ethnic groups should have an increased probability of inheriting the alleles derived from the ethnic group that carries more disease-susceptibility alleles. The central attraction of admixture mapping is that, since gene flow has occurred recently in modern populations (e.g., in African and Hispanic Americans in the past 20 generations), it is expected that admixture-generated linkage disequilibrium should extend for many centimorgans. High-resolution marker sets are now becoming available to test this approach, but progress will require (a) computational methods to infer ancestral origin at each point in the genome and (b) empirical characterization of the general properties of linkage disequilibrium due to admixture. Here we describe statistical methods to estimate the ancestral origin of a locus on the basis of the composite genotypes of linked markers, and we show that this approach accurately estimates states of ancestral origin along the genome. We apply this approach to show that strong admixture linkage disequilibrium extends, on average, for 17 cM in African Americans. Finally, we present power calculations under varying models of disease risk, sample size, and proportions of ancestry. Studying approximately 2,500 markers in approximately 2,500 patients should provide power to detect many regions contributing to common disease. A particularly important result is that the power of an admixture mapping study to detect a locus will be nearly the same for a wide range of mixture scenarios: the mixture proportion should be 10%–90% from both ancestral populations.

Figures

Figure 1
Schematic of how a disease locus will appear in an admixture scan. Around the locus, there should be an unusually high proportion of ancestry from one of the parental populations, because of patients inheriting high-risk alleles from that group. The peak can be identified not only in a case-control comparison but also in a comparison of the estimate of ancestry in cases at that point in their genome with the rest of their genomes. The width of the peak of association is determined by the number of generations since admixture.
Figure 2
Data from three African American samples (Smith et al. 2004), used to reconstruct ancestry along chromosome 22, based on genotypes at 52 SNPs (the positions of the SNPs are indicated by black lines). Each individual has segments of the genome of both European and African ancestry, randomly distributed over the chromosomes. At any locus, an individual can have only 0, 1, or 2 European ancestry alleles. Individual 3, for example, is confidently estimated to have 1 European-ancestry allele between 0 and 25 cM, 2 between 35 and 40 cM, and 1 between 45 and 70 cM. A higher density of markers is clearly needed to resolve ancestry in some places, highlighting the importance of including more SNPs in the map. To test for disease association on the basis of such data, one would search for genomic segments where the estimated number of European ancestry alleles, summed over samples, is greater than the genomewide average.
Figure 3
Quantitative assessment of the ability of the MCMC to detect regions of the genome with high or low levels of European ancestry. From the 442 patients with MS, we identified subsets of individuals carrying at least one copy of an allele that has a much higher frequency in Europeans than in Africans, thereby defining five populations that we knew were enriched for European ancestry at that point in their genomes. For the analyses, we conditioned on genotypes at the following five polymorphisms: DRB1*1501 in human leukocyte antigen (HLA) (n=57 individuals), rs7349 (n=125), rs1002587 (n=129), rs1205817 (n=177), and rs737802 (n=141). We tested for association through use of the locus-genome statistic and a disease model of twofold increased risk due to European ancestry (ψ1=2; ψ2=4). Peaks of highly significant ancestry association were identified in all five examples, with widths of 10–15 cM (where the width is defined as the log likelihood being within 1 of the maximum). Positions of the highly informative markers used for inference are indicated by triangles at the bottom of each figure.
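The peak-width definition used in this caption (the interval over which the log likelihood stays within 1 of its maximum) can be illustrated with a minimal sketch. The function name `peak_width`, the grid of positions, and the score values are hypothetical and are not taken from the study; the sketch works on a discrete grid without interpolation and assumes the scores are already on the log scale used for the figure.

```python
def peak_width(positions_cm, log_likelihoods, drop=1.0):
    """Return (left, right, width) of the contiguous interval around the
    maximum in which the log likelihood is within `drop` of that maximum.

    Hypothetical helper: grid-based, no interpolation between grid points.
    """
    if not positions_cm or len(positions_cm) != len(log_likelihoods):
        raise ValueError("positions and scores must be non-empty and equal in length")

    # Index of the highest score (the peak of association).
    best = max(range(len(log_likelihoods)), key=log_likelihoods.__getitem__)
    threshold = log_likelihoods[best] - drop

    # Walk outward from the peak while the score stays above the threshold.
    left = best
    while left > 0 and log_likelihoods[left - 1] >= threshold:
        left -= 1
    right = best
    while right < len(log_likelihoods) - 1 and log_likelihoods[right + 1] >= threshold:
        right += 1

    return positions_cm[left], positions_cm[right], positions_cm[right] - positions_cm[left]


# Illustrative values only (not data from the paper).
grid = [0, 5, 10, 15, 20, 25, 30]
scores = [0.0, 2.0, 3.5, 4.0, 3.2, 2.9, 1.0]
print(peak_width(grid, scores))
```

On a finer grid the reported width approaches the width of the continuous curve; on a coarse grid it is rounded inward to the nearest grid points.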
Figure 4
A, Estimates of percent European ancestry for 718 African American individuals, based on empirical data collected at our laboratory. We compare the estimates of ancestry from the MCMC with estimates made through use of a simple maximum-likelihood approach using a subset of 186 unlinked markers that were chosen to have the highest information content (Smith et al. [in this issue]) while spaced at least 10 cM apart. The close correlation provides confidence that the MCMC accurately estimates unknown parameters. B, Comparison of Mi with λi estimates (the SEs are shown in gray). Individuals with high Mi often have low λi values, which may be due to these individuals often having one European parent, resulting in an Mi near 50% but a low λi because the chromosome from the European parent never crosses over. Such individuals should ideally be excluded from an African American admixture mapping study (i.e., samples' parents should not have entirely European American or West African ancestry), because chromosomes that do not cross over between ancestries contribute no power to a study.
Figure 5
Difference between the true values of Mi, λi, pAj, and pBj and the estimates from the MCMC. These results are obtained by simulating data sets in which 1,000 samples are genotyped at 2,147 markers from the map described by Smith et al. ([in this issue]). In the simulations, we set Mi=20%±12% and λi=6±2, to match the values observed empirically in African Americans, and we assume no disease locus. The difference between the true value and the estimate (divided by the SE estimated by the MCMC) is, on average, close to 0, indicating that the estimates are unbiased. Compared with normal theory, the residuals are larger than expected, indicating that the MCMC slightly underestimates the SEs, although this does not appear to cause false positives (table 2).
Figure 6
Simulations to assess the power of the method to detect a disease locus at which a population A–ancestry allele confers 1-, 1.3-, 1.5-, 1.7-, and 2-fold multiplicative increased risk. The ancestry of the samples was assumed to be Mi∼20%±12% and λi∼6±2, and the markers are 2,147 from the map described in the accompanying article by Smith et al. ([in this issue]). For the simulations, we picked a “typical” locus from the map (chromosome 8, position 131 cM), where the estimated information about ancestry provided by nearby markers (estimated as described by Smith et al. [in this issue]) is 67% of the maximum. For each of the five risk models and sample sizes of 250, 500, 750, 1,000, and 2,000 (assuming equal numbers of cases and controls), 20 simulations were performed. The number of simulations that pass the genomewide threshold of significance (LOD >2) was plotted for the main locus-genome statistic (we used a hypothesis of equally likely risk models of ψ1=0.5, 1.3, 1.5, and 2.0, with ψ2=ψ1² in the locus-genome tests for association). These simulations demonstrate that even relative risks due to ancestry of as little as 1.3 can be detected by admixture mapping with 2,000 cases and controls. The significance threshold we use (LOD >2) is quite stringent, so, in practice, many simulations that do not formally exceed this significance threshold will produce large enough scores (LOD >0) that they would be followed up by studying a higher density of markers at the strongest peaks of association. Extraction of substantially more information by genotyping a higher density of markers should bring real disease loci above the genomewide threshold of significance.
Figure 7
Effect of map quality on the power to detect a disease locus. Using the 2,147 markers from the map described by Smith et al. ([in this issue]), we performed 100 simulations with 200 cases and 200 controls and a multiplicative risk model of 2 due to a European-ancestry allele. We performed the simulations for six loci where the information extractions, according to our theoretical calculation (described in detail by Smith et al. [in this issue]), were 0.5, 0.6, 0.7, 0.8, 0.9, and 1. The inverse of information extraction should be the same as the increase in sample size that is necessary to detect a disease locus there (as compared with perfect information). For example, at the Duffy locus on chromosome 1 (the rightmost data point in this figure), an allele distinguishes essentially perfectly between West African and European ancestry, and information extraction is 1. Our simulations show that genomewide scores, in practice, increase faster than would be expected on the basis of the theoretical power calculation (dashed line). Thus, although the average locus in the map has claimed 71% information extraction, the mean association score from simulations is ∼50% of the Duffy locus. The power loss compared with theory is due, we believe, to the fact that there is less certainty about allele frequencies at loci where there is lower information extraction, so the MCMC is less certain about declaring an association.
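The theoretical relationship stated in this caption, that the required increase in sample size equals the inverse of information extraction, can be tabulated with a short sketch. The function name `sample_size_inflation` is hypothetical. The sketch reproduces only the theoretical expectation; the caption itself reports that simulated scores fall off faster than this at loci with lower information extraction.

```python
def sample_size_inflation(information_extraction):
    """Theoretical fold increase in sample size relative to perfect
    information, taken as 1 / information extraction (hypothetical helper)."""
    if not 0 < information_extraction <= 1:
        raise ValueError("information extraction must lie in (0, 1]")
    return 1.0 / information_extraction


# The six information-extraction levels simulated for Figure 7.
for info in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"information extraction {info:.1f}: "
          f"{sample_size_inflation(info):.2f}-fold more samples (theory)")
```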
Figure 8
Comparison of the power of sib-pair linkage mapping, haplotype association mapping, and admixture mapping. A, Power as a function of sample size. These charts present the number of case-control or sib-sib pairs that are expected to be required to detect a disease locus. To set thresholds for genomewide significance, we assume that 300,000 independent markers have been tested for haplotype mapping (including the real risk allele) and that there is perfect information extraction for linkage and admixture mapping, with all samples having a proportion of population A ancestry (for example, European ancestry in African Americans) of Mi=20%. These represent idealized scenarios, so that, in practice, 1.2- to 2-fold more samples would be required than are shown here (see the “Methods” section). For simplicity, we assume that the allele that is being studied is the only one at the locus that increases risk for the disease (with all other alleles conferring equal and lower risk). These results show that, for low-penetrance risk alleles (1.3-fold, 1.5-fold, and 2-fold increased risk due to the allele rather than ancestry) that differ substantially in frequency across populations, admixture mapping requires many fewer samples than linkage mapping (although usually more samples than haplotype-based association mapping). B, Power as a function of number of genotypes. These charts correspond to the same scenarios but report the number of genotypes required rather than the number of samples. The advantages of admixture mapping are most apparent in this comparison, since many fewer markers are required for a whole-genome admixture scan than a whole-genome association scan.
Figure 9
Number of samples required to detect a disease locus where population A ancestry, on average, increases risk, as a function of the proportion of ancestry in each sample. Individuals with population A ancestry between 10% and 90% provide the most power. The power for admixture mapping contributed by a typical African American sample (20% European ancestry; 80% African ancestry) corresponds to a percent population A of 0.2 (European ancestry confers increased risk) or 0.8 (African ancestry confers increased risk). Fewer samples are required if the less common (European) ancestry confers increased risk (e.g., a disease such as MS rather than prostate cancer), although the effect is slight (only 1.2- to 1.3-fold more samples are required to achieve the same power; see fig. 10). We note that this graph assumes perfect information extraction and the same Mi for the two parents of each sample. Deviations from these assumptions (in particular, the imperfect information extraction in real maps such as that described by Smith et al. [in this issue]) mean that the number of samples required for a practical study would be about twice as high as shown.
Figure 10
Number of samples necessary to detect a disease locus under the ideal assumption of perfect information about ancestry and the same Mi in both parents. The number of samples necessary to detect an association in African Americans is estimated by averaging the power for a given risk model and percentage of ancestry (given by the curves in fig. 9) over the percentages of ancestry seen in African Americans: Mi∼20%±12% as described in the text.
Figure B1
Top, Prior distributions generating base ethnicity probabilities Mi and Poisson crossover rates λi for individual i. These parameters are for the autosomes. Similarly, we generate MXi and λXi for the X chromosome. The distribution of MXi is dependent on the random variable Mi. Bottom, Allele counts, [nA0(j),nA1(j)], for marker j in a sample from population A and similar counts, [nB0(j),nB1(j)], for population B. We also have parameters τ(A) and τ(B) modeling divergence between our modern samples and the actual parental populations of our admixed sample. We generate “true” allele frequencies for the modern populations and then allele frequencies p(j) for the parental populations. M, λ, and the allele frequencies p(j) now drive a Lander-Green HMM (Lander and Green 1987) that estimates ancestry at every point of the genome. Ancestries Eij form a (hidden) Markov chain, and outputs Oij are observable genotypes generated from the Eij using the probabilities pA(j),pB(j).
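The hidden Markov structure described in this caption can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it treats a single haploid chromosome with two ancestry states (A and B), holds M, λ, and the allele frequencies fixed rather than sampling them by MCMC, and ignores the τ divergence parameters. The function name `ancestry_posterior`, the transition model (no crossover between adjacent markers with probability exp(−λd) for distance d in Morgans, otherwise ancestry redrawn with probability M of A), and all numeric values are assumptions made for illustration.

```python
import math


def ancestry_posterior(alleles, positions_morgans, p_a, p_b, m, lam):
    """Posterior probability of population A ancestry at each marker on one
    haploid chromosome, by a normalised forward-backward pass.

    alleles           observed allele (0 or 1) at each marker
    positions_morgans increasing marker positions, in Morgans
    p_a, p_b          frequency of allele 1 in populations A and B per marker
    m                 prior probability that a segment has A ancestry
    lam               crossover rate per Morgan since admixture
    All names are hypothetical; this is a simplified haploid sketch.
    """
    n = len(alleles)

    def emit(j):
        # Probability of the observed allele under A and under B ancestry.
        ea = p_a[j] if alleles[j] == 1 else 1.0 - p_a[j]
        eb = p_b[j] if alleles[j] == 1 else 1.0 - p_b[j]
        return ea, eb

    def trans(j):
        # Transition matrix from marker j-1 to marker j (rows: from A, from B).
        stay = math.exp(-lam * (positions_morgans[j] - positions_morgans[j - 1]))
        jump = 1.0 - stay
        return ((stay + jump * m, jump * (1.0 - m)),
                (jump * m, stay + jump * (1.0 - m)))

    # Forward pass, normalised at each marker to avoid underflow.
    ea, eb = emit(0)
    a, b = m * ea, (1.0 - m) * eb
    fwd = [(a / (a + b), b / (a + b))]
    for j in range(1, n):
        t = trans(j)
        ea, eb = emit(j)
        fa, fb = fwd[-1]
        a = (fa * t[0][0] + fb * t[1][0]) * ea
        b = (fa * t[0][1] + fb * t[1][1]) * eb
        fwd.append((a / (a + b), b / (a + b)))

    # Backward pass, also normalised.
    bwd = [(1.0, 1.0)] * n
    for j in range(n - 2, -1, -1):
        t = trans(j + 1)
        ea, eb = emit(j + 1)
        na, nb = bwd[j + 1]
        a = t[0][0] * ea * na + t[0][1] * eb * nb
        b = t[1][0] * ea * na + t[1][1] * eb * nb
        bwd[j] = (a / (a + b), b / (a + b))

    # Combine the two passes into a posterior for state A at each marker.
    return [fa * ba / (fa * ba + fb * bb) for (fa, fb), (ba, bb) in zip(fwd, bwd)]


# Illustrative call: three tightly linked markers, each carrying the allele
# that is common in population A (values are made up for the example).
print(ancestry_posterior([1, 1, 1], [0.00, 0.01, 0.02],
                         [0.9, 0.9, 0.9], [0.1, 0.1, 0.1], m=0.2, lam=6.0))
```

A diploid version would track 0, 1, or 2 population A alleles per locus, as in the ancestry reconstructions of Figure 2, and the full method additionally integrates over uncertainty in M, λ, and the allele frequencies.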
Figure B2
The log10 Bayes factor, shown iteration by iteration, on a real data set, where we believe there is no evidence for a causal allele. No long-term structure is visible.
Figure B3
Correlation coefficients, as we vary the “lag” between iterations, for two statistics from an iteration of the MCMC. Our first statistic, “llike,” is used for monitoring purposes. If E is the full set of ethnicities, we compute the sum of the log likelihood of E and the log likelihood of our observations conditional on E. This is a statistic sensitive to bad behavior of the MCMC. We also show the correlation structure of our genomewide score. Some long-term structure is evident, although the correlation is small.
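The lagged correlation described in this caption can be computed for any per-iteration statistic with a sketch such as the following. The function name `autocorrelation` and the example trace are hypothetical. The estimator shown (lagged covariance divided by the lag-0 sum of squares) is one common convention, and the caption does not specify which one was used.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a per-iteration MCMC statistic at `lag`
    (hypothetical helper; normalised by the lag-0 sum of squares)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    total = sum(d * d for d in deviations)
    if total == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations that are `lag` iterations apart.
    lagged = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return lagged / total


# Illustrative trace only (not output from the paper's MCMC).
trace = [1.0, -1.0, 1.0, -1.0]
print([autocorrelation(trace, k) for k in range(3)])
```

Values near 0 at all positive lags indicate that successive iterations are close to independent for that statistic; slowly decaying values indicate the kind of long-term structure mentioned in the caption.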

Source: PubMed
