The ClinSeq Project: piloting large-scale genome sequencing for research in genomic medicine

Leslie G Biesecker, James C Mullikin, Flavia M Facio, Clesson Turner, Praveen F Cherukuri, Robert W Blakesley, Gerard G Bouffard, Peter S Chines, Pedro Cruz, Nancy F Hansen, Jamie K Teer, Baishali Maskeri, Alice C Young, NISC Comparative Sequencing Program, Teri A Manolio, Alexander F Wilson, Toren Finkel, Paul Hwang, Andrew Arai, Alan T Remaley, Vandana Sachdev, Robert Shamburek, Richard O Cannon, Eric D Green

Abstract

ClinSeq is a pilot project to investigate the use of whole-genome sequencing as a tool for clinical research. By piloting the acquisition of large amounts of DNA sequence data from individual human subjects, we are fostering the development of hypothesis-generating approaches for performing research in genomic medicine, including the exploration of issues related to the genetic architecture of disease, implementation of genomic technology, informed consent, disclosure of genetic information, and archiving, analyzing, and displaying sequence data. In the initial phase of ClinSeq, we are enrolling roughly 1000 participants; the evaluation of each includes obtaining a detailed family and medical history, as well as a clinical evaluation. The participants are being consented broadly for research on many traits and for whole-genome sequencing. Initially, Sanger-based sequencing of 300-400 genes thought to be relevant to atherosclerosis is being performed, with the resulting data analyzed for rare, high-penetrance variants associated with specific clinical traits. The participants are also being consented to allow the contact of family members for additional studies of sequence variants to explore their potential association with specific phenotypes. Here, we present the general considerations in designing ClinSeq, preliminary results based on the generation of an initial 826 Mb of sequence data, the findings for several genes that serve as positive controls for the project, and our views about the potential implications of ClinSeq. The early experiences with ClinSeq illustrate how large-scale medical sequencing can be a practical, productive, and critical component of research in genomic medicine.

Figures

Figure 1.
A spatial conceptualization of research studies in genomic medicine. There are three key “dimensions” to consider when applying genomics to clinical research: genome breadth (the fraction of the genome that is interrogated), number of subjects or participants, and the associated clinical data about those individuals (including its depth, breadth, and rigor). While the ideal study would acquire whole-genome sequences from large numbers of extensively phenotyped subjects, this is currently impractical. Single-gene studies can involve a few or numerous subjects and extensive clinical data, but by definition involve the examination of only a single gene and thus occupy one wall of this space. The individual genomes that have recently been sequenced (Levy et al. 2007; Bentley et al. 2008; Wang et al. 2008; Wheeler et al. 2008) provide nearly complete genome breadth, but with limited clinical data; further, their limited subject numbers place them on another wall of this space. The 1000 Genomes Project (http://www.1000genomes.org/) is providing large subject numbers and extensive genome breadth, but no clinical data—positioning it on the floor of this space. ClinSeq aims to reside in the center of this space, having attributes of substantial subject size (n = 1000 initially), moderate genome breadth (∼400 genes initially, with plans for expanding this breadth), and substantial clinical data.
Figure 2.
ClinSeq sample and data flow. DNA samples and clinical data emanate from the initial participant enrollment and clinical evaluation, and then flow through the indicated clinical and research processes (see text for details). Note the separate acquisition and handling of DNA samples for clinical and research purposes, respectively, with the former handled by a CLIA laboratory prior to any results being returned to participants. Further, variants identified in putative disease-causing genes must first be reviewed by a data-monitoring board before being reported back to participants.
Figure 3.
Snapshot of ClinSeq sequence coverage. This “heat map” provides an overview of the targeted sequence coverage for 27 genes selected at random from the set of 140 genes with completed PCR primer design. The figure illustrates the range and variability in the yield of sequence data for a subset of the analyzed genes. These 27 genes are being sequenced using 343 amplimers (of 2444 total) represented by columns; the data are shown for 326 enrolled participants (of 586 total) represented by rows. The colors represent the percent sequence coverage (see scale on right) of the corresponding PCR products at or above a threshold of phred Q20, with white indicating the absence of data at this time. Such heat-map results are used to monitor overall quality of the ClinSeq sequencing pipeline, whereas more direct quality measures (see text) are used to assess the suitability of individual sequence data for inclusion in subsequent analyses and/or return to participants. The complete heat map showing all amplimers and all participants for the data set discussed here is provided in Supplemental Figure S1.
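The per-cell value in such a heat map can be illustrated with a minimal sketch. It assumes that "percent sequence coverage at or above phred Q20" means the share of targeted bases in one amplimer, for one participant, whose phred score is 20 or higher. The function name, the input layout, and the example scores are hypothetical and are not taken from the ClinSeq pipeline.

```python
def percent_q20_coverage(quality_scores, threshold=20):
    """Return the percentage of bases with phred score >= threshold.

    quality_scores: per-base phred scores for one participant x amplimer
    cell (assumed here to span the full targeted region of the amplimer).
    Returns None when there is no data, matching the white cells
    described in the caption.
    """
    if not quality_scores:
        return None
    passing = sum(1 for q in quality_scores if q >= threshold)
    return 100.0 * passing / len(quality_scores)


# Hypothetical scores, for illustration only.
example_scores = [35, 40, 12, 22, 20, 19, 38, 41]
coverage = percent_q20_coverage(example_scores)
print(f"{coverage:.1f}% of bases at or above Q20")
```

Applying the function to every participant (row) and amplimer (column) would yield a matrix of the kind the heat map displays.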
Figure 4.
Distribution of variant counts. An estimated total of 1984 variants (corrected for false-positives from the observed 2107 autosomal ROI variants) were identified in ROIs among 250 ClinSeq participants (see text for details). The number of times each variant was detected is depicted, in each case broken down relative to its presence or absence in dbSNP. Note that the x-axis is discontinuous beyond a count of 25, and the allele counts greater than 25 are presented in bins of 25; also note that the y-axis uses a logarithmic scale. The data show that 966 variants are unique (i.e., a minor allele count of 1), and, in fact, comprise about half of the variants detected in this data set (792 not in dbSNP and 174 in dbSNP).
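The binning and the "about half" statement in this caption can be checked with a short sketch. The bin boundaries above 25 (26-50, 51-75, and so on) are an assumption about how the x-axis is laid out, and the helper names are hypothetical; only the totals 792, 174, and 1984 come from the caption.

```python
from collections import Counter


def bin_label(allele_count, cutoff=25, width=25):
    """Label a variant's allele count the way the caption describes:
    counts up to the cutoff are kept individually, larger counts are
    grouped into bins of `width` (assumed to start at cutoff + 1)."""
    if allele_count <= cutoff:
        return str(allele_count)
    lower = ((allele_count - cutoff - 1) // width) * width + cutoff + 1
    return f"{lower}-{lower + width - 1}"


def variant_histogram(variants):
    """Count variants per (bin label, in_dbsnp) pair.

    variants: iterable of (allele_count, in_dbsnp) tuples.
    """
    return Counter((bin_label(count), in_dbsnp) for count, in_dbsnp in variants)


# Figures quoted in the caption.
singletons_not_in_dbsnp = 792
singletons_in_dbsnp = 174
estimated_total_variants = 1984

singletons = singletons_not_in_dbsnp + singletons_in_dbsnp
fraction = singletons / estimated_total_variants
print(f"{singletons} singletons, {fraction:.1%} of the estimated total")
```

The two singleton counts sum to the 966 unique variants stated in the caption, and their share of the estimated 1984 variants is just under one half.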
Figure 5.
Pedigree for Clinical Case 1 (Box 1). Standard pedigree nomenclature is used. Abbreviations used: d. 70, died in his/her 70s (the cause of death may also be specified, e.g., MI, myocardial infarction, or cancer); CABG 60, coronary artery bypass graft in his/her 60s; HL 20, hyperlipidemia of unknown type diagnosed in his/her 20s; and HL 2, hyperlipidemia of unknown type diagnosed at age 2. For the patients who have undergone mutation testing, the LDLR mutation status is indicated; c.564G>T/+ indicates heterozygosity for the mutation.
Figure 6.
Single axial slice of the coronary calcium scan from the patient described in Clinical Case 1 (Box 1) that shows severe calcification of the left anterior descending coronary artery (red arrow), the portion of the circumflex coronary artery within the imaging plane (white arrow), and the aortic root around the origin of the left main coronary artery (yellow arrow).
Figure 7.
Pedigree for Clinical Case 2 (Box 1). Abbreviations used are similar to those in Figure 5.

Source: PubMed
