Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs

Joshua M Korn, Finny G Kuruvilla, Steven A McCarroll, Alec Wysoker, James Nemesh, Simon Cawley, Earl Hubbell, Jim Veitch, Patrick J Collins, Katayoon Darvishi, Charles Lee, Marcia M Nizzari, Stacey B Gabriel, Shaun Purcell, Mark J Daly, David Altshuler

Abstract

Accurate and complete measurement of single nucleotide (SNP) and copy number (CNV) variants, both common and rare, will be required to understand the role of genetic variation in disease. We present Birdsuite, a four-stage analytical framework instantiated in software for deriving integrated and mutually consistent copy number and SNP genotypes. The method sequentially assigns copy number across regions of common copy number polymorphisms (CNPs), calls genotypes of SNPs, identifies rare CNVs via a hidden Markov model (HMM), and generates an integrated sequence and copy number genotype at every locus (for example, including genotypes such as A-null, AAB and BBB in addition to AA, AB and BB calls). Such genotypes more accurately depict the underlying sequence of each individual, reducing the rate of apparent mendelian inconsistencies. The Birdsuite software is applied here to data from the Affymetrix SNP 6.0 array. Additionally, we describe a method, implemented in PLINK, to utilize these combined SNP and CNV genotypes for association testing with a phenotype.

Figures

Figure 1
Overview of Birdsuite. In step 1, Canary estimates copy number across regions of known common copy number polymorphisms (‘CNP genotyping’). In step 2, Birdseed assigns canonical SNP genotypes (AA, AB or BB) to samples estimated by Canary to have two copies of a SNP locus (‘SNP genotyping’). Additionally, it calculates probe-specific means and variances. In step 3, Birdseye estimates the likelihood of rare or de novo copy number variants, using probe-specific means and variances informed by Canary and Birdseed, and combining data across multiple probes in the region (‘CNV discovery’). In step 4, Fawkes combines copy number information for each sample at each locus with allele-specific information to assign a comprehensive SNP genotype, including noncanonical genotypes such as A-null or AAB.
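To make the idea of an integrated genotype in step 4 concrete, the following minimal sketch lists the allele-specific genotypes that are possible at a biallelic SNP for a given total copy number. It is an illustration only, not Fawkes itself: the function name `integrated_genotypes` and the label `null` for zero copies are assumptions introduced here, while labels such as A-null, AAB and BBB follow the text.

```python
def integrated_genotypes(copy_number):
    """List allele-specific genotypes at a biallelic SNP for a total copy number.

    Hypothetical helper for illustration; the label "null" for zero copies
    is an assumption, not a name taken from Birdsuite.
    """
    if copy_number < 0:
        raise ValueError("copy number must be non-negative")
    if copy_number == 0:
        return ["null"]
    labels = []
    for n_b in range(copy_number + 1):
        n_a = copy_number - n_b
        label = "A" * n_a + "B" * n_b
        if copy_number == 1:
            # A single remaining copy is written with an explicit null allele.
            label += "-null"
        labels.append(label)
    return labels


if __name__ == "__main__":
    for cn in range(4):
        print(cn, integrated_genotypes(cn))
```

With two copies the function returns only the canonical calls AA, AB and BB; the noncanonical genotypes appear once the copy number differs from two.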
Figure 2
Schematic of how a CNP is processed through Canary, illustrated with data from chromosome 4. (a) Histogram of raw data from a single copy number probe. (b) Cross-correlation matrix of neighboring probe intensity profiles across 88 HapMap samples (a depicts probe 13). For SNPs, the intensity used is the sum of the intensities for the A and B alleles. The high and consistent correlation in the center indicates copy number variation as opposed to random noise, demarcating the boundaries of the CNP. (c) Heat map depicting normalized intensities for adjacent probes across the 88 samples; red indicates low intensity, and yellow high intensity. Copy number probes are denoted by a pink dot below the x axis. Probes used to summarize a CNP are marked with a black line at the bottom. (d) Summarized intensity measurements (across 88 samples) for this CNP, overlaid with the Gaussian clusters that Canary fit to the data, and colored by copy number. (e) Canary genotypes across six batches of HapMap data for a multiallelic CNP. The x axis shows batch plus jitter, and the y axis shows sample intensity. The prior expectation of where the different copy number genotype classes should lie is also shown along the y axis. The ‘mix’ batches are designed such that samples are not separated by ancestral group. (f) Same as e, except showing a YRI-specific simple deletion CNP for which batch effects are evident.
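Panel d describes Gaussian clusters fit to summarized intensities and colored by copy number. The sketch below shows how a sample could be assigned to a copy number class once such clusters are available, by comparing weighted Gaussian densities. It assumes the clusters have already been fit; the cluster weights, means and variances in the usage example are made-up numbers, and the function names are hypothetical rather than part of Canary.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def assign_copy_number(intensity, clusters):
    """Pick the most probable copy number class for one summarized intensity.

    clusters maps copy number -> (weight, mean, variance); all values are
    assumed to come from a previous fitting step that is not shown here.
    Returns (best_copy_number, posterior_probability).
    """
    log_joint = {
        cn: math.log(w) + gaussian_logpdf(intensity, m, v)
        for cn, (w, m, v) in clusters.items()
    }
    # Normalize in log space to avoid underflow.
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    posterior = {cn: math.exp(v - log_norm) for cn, v in log_joint.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]


if __name__ == "__main__":
    # Illustrative clusters only; not values from the paper.
    example_clusters = {0: (0.1, 0.2, 0.01), 1: (0.3, 0.6, 0.01), 2: (0.6, 1.0, 0.01)}
    print(assign_copy_number(0.62, example_clusters))
```

A posterior close to 1 corresponds to a sample sitting well inside one cluster; samples between clusters receive a lower posterior and could be left uncalled.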
Figure 3
Schematic of how a single SNP is processed through Birdsuite and evaluation of mendelian inconsistencies. (a) Raw data for the SNP, plotting allele-A intensity versus allele-B intensity. Unlike most SNPs on the array, this SNP does not form three discrete clusters because it lies in the common CNP depicted in Figure 2. (b) Same as a, colored by CNP genotype (as determined by Canary in Fig. 2d). (c) Birdseed uses those samples with two copies at this locus to determine allele-specific probe characteristics (‘clusters’) for the SNP. (d) Copy-variant clusters imputed on the basis of the two-copy models. These are in turn used to aid Birdseye in the search for undiscovered CNVs, and by Fawkes to assign allele-specific copy number genotypes such as A/null (lower right green region) or BBBB (upper left magenta region). (e) Plot showing the rate of mendelian inconsistency (MI) in SNPs that overlap a known CNP for 91 children using Birdseed alone (sans Canary) versus using the entire Birdsuite. Only copy-normal calls were used to test for an MI; the MI rate is the number of inconsistencies divided by the number of tests. (f) Histogram of the MI rate using Birdsuite divided by the MI rate using Birdseed alone, calculated using all autosomal SNPs. The MI rate decreases for all 91 samples, indicating that a considerable percentage of inconsistencies are due to inherited, de novo or somatic copy number variation.
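The MI rate in panels e and f is defined as the number of inconsistencies divided by the number of tests, with only copy-normal calls tested. A minimal sketch of that calculation for father, mother and child trios follows. The function names and the trio representation are assumptions made for illustration, and any call outside AA, AB or BB is simply skipped rather than tested.

```python
from itertools import product

CANONICAL = {"AA", "AB", "BB"}


def is_consistent(father, mother, child):
    """True if the child's diploid genotype can be formed from one allele of each parent."""
    possible = {"".join(sorted(a + b)) for a, b in product(father, mother)}
    return "".join(sorted(child)) in possible


def mi_rate(trios):
    """Mendelian inconsistency rate over (father, mother, child) genotype strings.

    Only trios in which all three calls are copy-normal (AA, AB or BB) count
    as a test, mirroring the rule stated in the caption.
    """
    tests = 0
    inconsistencies = 0
    for father, mother, child in trios:
        if not {father, mother, child} <= CANONICAL:
            continue  # noncanonical or missing call: not tested
        tests += 1
        if not is_consistent(father, mother, child):
            inconsistencies += 1
    return inconsistencies / tests if tests else float("nan")


if __name__ == "__main__":
    example = [
        ("AA", "BB", "AB"),      # consistent
        ("AA", "BB", "AA"),      # apparent inconsistency
        ("AB", "AB", "BB"),      # consistent
        ("AA", "A-null", "AA"),  # skipped: mother is not copy-normal
    ]
    print(mi_rate(example))
```

The second trio illustrates why copy number matters: if the mother in fact carries B-null and the child A-null, forcing both into diploid calls (BB and AA) produces an apparent inconsistency that disappears once the deletion is genotyped.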
Figure 4
Discovery of unknown or de novo copy number variation using Birdseye. (a) Raw data from a copy number probe, with one sample (arrow) colored green (top left). Raw data from a neighboring SNP, with the same sample (arrow) colored green (top right). Although the sample is relatively low in intensity, one would not have confidence calling a deletion on the basis of these data alone. A view across a larger region surrounding these two probe locations (bottom). A point is placed at the estimated copy number for this sample at each queried locus (without taking into account neighboring probes). With enough probes to support the evidence of a deletion, the HMM transitions to call a heterozygous deletion in this sample across an 85-kb region (blue line). (b,c) In addition to calling the deletion in the sample shown in a, Birdseye determines the relative log-likelihood of the identical deletion in each parent of this sample. Owing to strong evidence against this deletion in the parents, the region represents a de novo event in the child. (d) Data from an in silico gender-mixing experiment. Sensitivity and breakpoint accuracy to discover simulated deletions of varying size (left). A deletion was considered discovered only if the lod score for the deletion was above 2. Sensitivity to discover the simulated deletions plotted against expected number of false-positive discoveries per genome (right). Points are placed at lod thresholds of 5, 2, 1 and 0.
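Panel a makes the point that a single low-intensity probe is not enough to call a deletion, whereas a run of supporting probes lets the HMM switch state. The sketch below reproduces that behavior with a generic Viterbi decoder over copy number states with Gaussian emissions. It is not Birdseye's model: the shared variance, the `stay_prob` transition parameter, the uniform initial distribution and the state means in the usage example are all assumptions chosen for illustration, whereas the text describes probe-specific means and variances.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_copy_number(intensities, state_means, var=0.01, stay_prob=0.99):
    """Most likely copy number path for one sample across ordered probes.

    state_means maps copy number -> expected intensity (illustrative values).
    A shared variance and a single stay probability are simplifying assumptions.
    """
    if not intensities:
        return []
    states = sorted(state_means)
    if len(states) < 2:
        raise ValueError("need at least two copy number states")
    log_stay = math.log(stay_prob)
    log_switch = math.log((1 - stay_prob) / (len(states) - 1))

    # Uniform initial distribution over states.
    score = {
        s: -math.log(len(states)) + gaussian_logpdf(intensities[0], state_means[s], var)
        for s in states
    }
    back = []
    for x in intensities[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this probe.
            best_prev = max(
                states,
                key=lambda p: score[p] + (log_stay if p == s else log_switch),
            )
            trans = log_stay if best_prev == s else log_switch
            new_score[s] = score[best_prev] + trans + gaussian_logpdf(x, state_means[s], var)
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    means = {1: 0.6, 2: 1.0}  # made-up expected intensities for one and two copies
    print(viterbi_copy_number([1.0, 1.0, 0.6, 1.0, 1.0], means))
    print(viterbi_copy_number([1.0, 0.98, 1.02, 0.62, 0.58, 0.61, 0.6, 0.63, 1.01, 0.99], means))
```

With these settings the first call keeps the single low probe at two copies, because the cost of switching state twice outweighs the evidence from one probe, while the second call reports one copy across the five consecutive low probes.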

Source: PubMed
