Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly

Ernest T Lam, Alex Hastie, Chin Lin, Dean Ehrlich, Somes K Das, Michael D Austin, Paru Deshpande, Han Cao, Niranjan Nagarajan, Ming Xiao, Pui-Yan Kwok

Abstract

We describe genome mapping on nanochannel arrays. In this approach, specific sequence motifs in single DNA molecules are fluorescently labeled, and the DNA molecules are uniformly stretched in thousands of silicon channels on a nanofluidic device. Fluorescence imaging allows the construction of maps of the physical distances between occurrences of the sequence motifs. We demonstrate the analysis, individually and as mixtures, of 95 bacterial artificial chromosome (BAC) clones that cover the 4.7-Mb human major histocompatibility complex region. We obtain accurate, haplotype-resolved, sequence motif maps hundreds of kilobases in length, resulting in a median coverage of 114× for the BACs. The final sequence motif map assembly contains three contigs. With an average distance of 9 kb between labels, we detect 22 haplotype differences. We also use the sequence motif maps to provide scaffolds for de novo assembly of sequencing data. Nanochannel genome mapping should facilitate de novo assembly of sequencing reads from complex regions in diploid organisms, haplotype and structural variation analysis and comparative genomics.

Figures

Figure 1
Nanochannel arrays. (a) In a microfluidic environment, long (>100 kb) DNA fragments (in green in bottom panel) are in the coiled ball form and clog the entrance of the nanochannel array, as it is energetically unfavorable for the molecules to uncoil and enter the nanochannels. (b) A gradient region is placed in front of the nanochannels. Here, the physical confinement is sufficiently dense that the molecules are forced to flow past the pillars, where they uncoil and stream into the nanochannels without clogging. (c) Fabrication of the nanochannel array using interference lithography to produce 120-nm channels in silicon, followed by tuning to a smaller diameter with material deposition and capping with a glass cover to allow for fluorescence imaging. (d) A profile scanning electron microscopy (s.e.m.) image of 45-nm channels. (e) An s.e.m. image of the 45-nm channels patterned on the silicon substrate before bonding to the glass.
Figure 2
Genome mapping. (a) Nick-labeling by Nt.BspQI and DNA polymerase is accomplished by top-strand DNA cleavage (blue arrow), one nucleotide 3′ from the recognition sequence (in bold italics), followed by incorporation of fluorescent nucleotide analogs (in red) with concomitant DNA strand displacement. (b) The DNA molecule is stained with YOYO-1 and loaded into the port of a nanoarray flowcell (left panel). The DNA molecules are introduced into the region with pillars and micrometer-scale relaxation channels by an electric field where they unwind and linearize (top right panel). Finally, the DNA molecules are pushed by a low-voltage electrical pulse, and they enter the 45-nm nanochannels, where they are stretched uniformly to 85% of the length of perfectly linear B-DNA (bottom right panel). The DNA is visualized as blue linear structures in the nanochannels, with green labels marking the Nt.BspQI nick sites. (c) The length of the DNA molecules and the positions of nick labels on each DNA molecule are determined after automated image capture. The fragment size profile of a 183-kb BAC is shown, with the narrow peak width indicating uniform DNA linearization. (d) The DNA molecules are clustered into groups (representing individual BACs) based on nick-labeling pattern similarity. As BAC molecules can enter the nanochannels in either orientation, each BAC is represented by two clusters with opposite orientations (top panel). After combining the two clusters, histogram plots of nick-labeled DNA (bottom panel) are used to define the locations of Nt.BspQI sites. n ≈ 100 molecules.
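The orientation step in panel (d) can be illustrated with a minimal sketch. It assumes each molecule has already been reduced to a length and a list of label positions in kb, and it scores orientations with a simple nearest-label distance. That score is a stand-in chosen for illustration, not the clustering method used in the study, and the names `flip`, `mismatch`, and `orient` as well as the example positions are hypothetical.

```python
def flip(labels, length):
    """Return label positions as read from the opposite end of the molecule."""
    return sorted(length - p for p in labels)


def mismatch(labels, template):
    """Sum of distances (kb) from each label to its nearest template label."""
    return sum(min(abs(p - t) for t in template) for p in labels)


def orient(labels, length, template):
    """Return the labels in whichever orientation lies closer to the template."""
    forward = sorted(labels)
    reverse = flip(labels, length)
    if mismatch(forward, template) <= mismatch(reverse, template):
        return forward
    return reverse


# Hypothetical label pattern (kb) for a 183-kb molecule, and the same
# pattern as it would appear on a molecule that entered the channel reversed.
template = [10.0, 35.0, 60.0, 150.0]
molecule = [33.5, 123.0, 148.5, 173.0]

print(orient(molecule, 183.0, template))
```

Once every molecule is brought into a common orientation this way, the two clusters per BAC collapse into one, and label positions can be pooled into a histogram as the caption describes.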
Figure 3
Genome mapping of mixtures of 95 BACs from the PGF and COX libraries. (a) Image of a single field of view (FOV 73 × 73 µm) containing a mixture of nick-labeled DNA molecules in the nanoarray. This FOV is part of 108 FOVs shown in the bottom part of the panel (outlined in green). Each FOV can accommodate up to 250 kb of a DNA molecule from top to bottom. The images of four FOVs are stitched together so that longer molecules (up to 1 Mb) in a single channel can be analyzed whole. In all, there are 27 sets of four vertical FOVs per array scan. (b) The distribution of the DNA molecules imaged on the nanoarray by length. The majority of the molecules are 100–170 kb in length as expected from the BAC-clone sizes. (c) After clustering of DNA molecules based on nick-labeling patterns, consensus maps with overlapping patterns are assembled into contiguous-sequence motif maps. In this example, three overlapping consensus maps (each ~150 kb long) are assembled into a 300-kb map.
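The stated capacity of a field of view can be checked with a short calculation. The sketch below takes the 73-µm FOV height from this caption and the 85% stretch from Figure 2b, and it assumes the textbook B-DNA rise of about 0.34 nm per base pair, a value not given in the source. The function name `um_to_kb` is hypothetical.

```python
RISE_NM_PER_BP = 0.34  # assumed canonical B-DNA rise per base pair
STRETCH = 0.85         # fraction of full B-DNA length (Figure 2b)
FOV_UM = 73.0          # height of one field of view in micrometers


def um_to_kb(length_um, stretch=STRETCH, rise_nm=RISE_NM_PER_BP):
    """Convert a measured molecule length in micrometers to kilobases."""
    base_pairs = length_um * 1000.0 / (rise_nm * stretch)
    return base_pairs / 1000.0


print(f"one FOV:   {um_to_kb(FOV_UM):.0f} kb")
print(f"four FOVs: {um_to_kb(4 * FOV_UM):.0f} kb")
```

Under these assumptions one FOV spans roughly 250 kb and four stitched FOVs roughly 1 Mb, which agrees with the figures quoted in the caption.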
Figure 4
Sequence motif map of the MHC region. (a) Alignment of the in silico reference sequence motif map for the PGF library (black line with the Nt.BspQI sites marked with black dots) and the map of the same region produced by genome mapping (blue line with blue dots). Where there are motif variations between COX and PGF, the COX motif is represented with red lines and red dots. Asterisks mark the gaps in the Nt.BspQI map produced by genome mapping. Gene locations and the location of the variable RCCX module are noted. Additional loci of special interest are marked with boxes and are discussed in detail in the text. The green bar from ~400 kb to 1,000 kb represents the region assembled from sequence data displayed in Figure 5. (b,c) Discrepancies between the reference Nt.BspQI map and that produced by genome mapping. (b) The reference Nt.BspQI maps of the region (a,i) indicate that the COX genome (gray line with red dots) has a 4-kb deletion as compared with the PGF genome (gray line with blue dots), with a 7-kb and an 11-kb fragment between two neighboring sites in the COX and PGF genomes, respectively. The map of the same region produced by genome mapping from both libraries (histogram plot in black) shows the same haplotype for both COX and PGF genomes, with an 11-kb fragment between the corresponding two sites. (c) An Nt.BspQI site identified in the region (a,iii) (arrow) is found in the PGF genome (blue histogram plot) by genome mapping, splitting the 24-kb fragment in the reference map (black line) into 7-kb and 17-kb fragments. The COX reference map (red line) and the COX map produced by genome mapping (red histogram plot) are also displayed to show that the COX genome has the 24-kb fragment and a haplotype variation in the adjacent region.
Figure 5
De novo sequence assembly of the MHC region. DNA of 95 BACs from the PGF and COX libraries was sequenced and the sequence reads were assembled into contigs (arrows). The contigs were aligned to the Nt.BspQI map produced by genome mapping, providing information on the relationship and orientation of contigs together with the location and size of each gap between contigs. Shown are in silico sequence motif maps of the contigs (green dots in arrows) and of the reference sequence (black dots) of a 575-kb region marked in green in Figure 4a.
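An in silico sequence motif map of a contig can be sketched as a search for the enzyme's recognition motif on both strands. The example below assumes the Nt.BspQI recognition sequence is GCTCTTC, which is not stated in the source and should be verified against the enzyme's documentation. It reports the start of each motif occurrence and ignores the small offset to the actual nick, which is negligible at kilobase resolution. The function names and the toy sequence are hypothetical.

```python
MOTIF = "GCTCTTC"  # assumed Nt.BspQI recognition sequence; verify before use


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def motif_sites(seq, motif=MOTIF):
    """0-based start positions of the motif on either strand, in order."""
    seq = seq.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in targets]


def fragment_sizes(sites, total_length):
    """Distances between consecutive sites, including both contig ends."""
    edges = [0] + list(sites) + [total_length]
    return [b - a for a, b in zip(edges, edges[1:])]


# Toy contig with one top-strand and one bottom-strand occurrence.
contig = "AAAA" + "GCTCTTC" + "TTTTTT" + "GAAGAGC" + "CC"
sites = motif_sites(contig)
print(sites, fragment_sizes(sites, len(contig)))
```

The resulting list of inter-site distances is the quantity that would be compared against the map produced by genome mapping in order to place and orient a contig.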
Figure 6
Haplotype resolution and structural variation detected by genome mapping. (a) Single-site variation resulting from the creation or destruction of an Nt.BspQI site can be identified by genome mapping. The region in Figure 4a, ii shows that the PGF genome (blue line) contains an extra Nt.BspQI site not found in the COX genome (red line), with the maps generated by genome mapping (blue and red histogram plots) showing the expected pattern. (b) Shifting of a site relative to others in two haplotypes may be due to a double mutation or an inversion event. In Figure 4a, vi, the 21-kb region is split into 12- and 9-kb fragments in the COX genome (red line and red histogram plot) but into 14- and 7-kb fragments in the PGF genome (blue line and blue histogram plot). (c) Insertions can be identified and localized by genome mapping for haplotype resolution. In Figure 4a, v, the PGF genome has a 5-kb insertion that also includes an Nt.BspQI site (blue line, blue histogram plot) when compared to the COX genome (red line, red histogram plot). (d) A 30-kb duplication at the RCCX locus (Fig. 4a, iv) is identified and localized in both the reference map (gray line) and that produced by genome mapping (blue histogram plot).
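The site shift in panel (b) can be expressed as a comparison of two site lists. The sketch below places the 21-kb region on a common coordinate starting at its left flanking site, using the fragment sizes given in the caption, and reports sites in one haplotype that have no counterpart in the other. The 1-kb tolerance and the function name `unmatched_sites` are assumptions made for illustration, not parameters from the study.

```python
def unmatched_sites(map_a, map_b, tol_kb=1.0):
    """Sites in map_a with no site in map_b within tol_kb (assumed tolerance)."""
    return [a for a in map_a if all(abs(a - b) > tol_kb for b in map_b)]


# Site positions in kb from the left flanking site (Figure 6b):
# COX splits the 21-kb region into 12 + 9 kb, PGF into 14 + 7 kb.
cox = [0.0, 12.0, 21.0]
pgf = [0.0, 14.0, 21.0]

print(unmatched_sites(cox, pgf))
print(unmatched_sites(pgf, cox))
```

Each haplotype reports one internal site that the other lacks, while the flanking sites match, which is the signature of a shifted site as opposed to a simple gain or loss.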

Source: PubMed
