Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin

Matthew W Snyder, Martin Kircher, Andrew J Hill, Riza M Daza, Jay Shendure, Matthew W Snyder, Martin Kircher, Andrew J Hill, Riza M Daza, Jay Shendure

Abstract

Nucleosome positioning varies between cell types. By deep sequencing cell-free DNA (cfDNA), isolated from circulating blood plasma, we generated maps of genome-wide in vivo nucleosome occupancy and found that short cfDNA fragments harbor footprints of transcription factors. The cfDNA nucleosome occupancies correlate well with the nuclear architecture, gene structure, and expression observed in cells, suggesting that they could inform the cell type of origin. Nucleosome spacing inferred from cfDNA in healthy individuals correlates most strongly with epigenetic features of lymphoid and myeloid cells, consistent with hematopoietic cell death as the normal source of cfDNA. We build on this observation to show how nucleosome footprints can be used to infer cell types contributing to cfDNA in pathological states such as cancer. Since this strategy does not rely on genetic differences to distinguish between contributing tissues, it may enable the noninvasive monitoring of a much broader set of clinical conditions than currently possible.

Copyright © 2016 Elsevier Inc. All rights reserved.

Figures

Figure 1. Origins and characteristics of cfDNA…
Figure 1. Origins and characteristics of cfDNA fragments in human plasma
(A) Schematic overview of cfDNA fragmentation. Apoptotic or necrotic cell death results in near-complete digestion of native chromatin. Protein-bound DNA fragments, typically associated with histones or TFs, preferentially survive digestion and are released into the circulation, while naked DNA is lost. Fragments can be recovered from peripheral blood plasma following proteinase treatment. In healthy individuals, cfDNA is primarily derived from myeloid and lymphoid cell lineages, but contributions from one or more additional tissues may be present in certain medical conditions. (B) Fragment length of cfDNA observed with conventional sequencing library preparation, inferred from alignment of paired-end reads. A reproducible peak in fragment length at 167 bp (green dashed line) is consistent with association with chromatosomes. Additional peaks evidence ~10.4 bp periodicity, corresponding to the helical pitch of DNA on the nucleosome core. Enzymatic end-repair during library preparation removes 5’ and 3’ overhangs and may obscure true cleavage sites. (C) Dinucleotide composition of 167 bp fragments and flanking genomic sequence in conventional libraries. Observed dinucleotide frequencies in the BH01 library were compared to expected frequencies from simulated fragments. (D) Fragment length of cfDNA in single-stranded sequencing library preparation. No enzymatic end-repair is performed to template molecules during library preparation. Short fragments of 50–120 bp are highly enriched compared to conventional libraries. While ~10.4 bp periodicity remains, its phase is shifted by ~3 bp. (E) Dinucleotide composition of 167 bp fragments and flanking genomic sequence in single-stranded library IH02, calculated as in (C). The apparent difference in the background level of bias between BH01 and IH02 relate to differences between the simulations, rather than the real libraries (data not shown). See also Figure S1 and Table S1.
Figure 2. Genome-wide determination of nucleosome positions…
Figure 2. Genome-wide determination of nucleosome positions from cfDNA fragmentation patterns
(A) Schematic of inference of nucleosome positioning. A per-base windowed protection score (WPS) is calculated by subtracting the number of fragment endpoints within a 120 bp window from the number of fragments completely spanning the window. High WPS values indicate increased protection of DNA from digestion; low values indicate that DNA is unprotected. Peak calls identify contiguous regions of elevated WPS. (B) Strongly positioned nucleosomes at a well-studied alpha-satellite array. Coverage, fragment endpoints, and WPS values from sample CH01 are shown for long fragment (120 bp window; 120–180 bp fragments) or short fragment (16 bp window; 35–80 bp fragments) bins at a pericentromeric locus on chromosome 12. Nucleosome calls from CH01 (middle, blue boxes) are regularly spaced across the locus. Nucleosome calls from two published callsets (Gaffney et al., 2012; Pedersen et al., 2014) (middle, purple and black boxes) are also displayed. (C) Inferred nucleosome positioning around a DHS site. Coverage, fragment endpoints, WPS values, and nucleosome calls are shown as in (B). The hypersensitive region (gray shading), is marked by reduced coverage in the long fragment bin. Nucleosome calls adjacent to the DHS site are spaced more widely than typical adjacent pairs, consistent with accessibility of the intervening sequence to regulatory proteins including TFs. Coverage of short fragments, which may be associated with such proteins, is increased at the DHS site, which overlaps with several annotated TFBSs (not shown). (D) Distances between adjacent peaks by sample. Distances are measured between adjacent peak centers. (E) Comparison of peak calls between samples. For each pair of samples, the distances between each peak call in the sample with fewer peaks and the nearest peak call in the other sample are shown. Negative and positive numbers indicate the nearest peak is upstream or downstream, respectively. (F) Distances between adjacent peaks, sample CH01. The dotted black line indicates the mode of the distribution (185 bp). See also Figure S2.
Figure 3. Nucleosome positioning and spacing correlates…
Figure 3. Nucleosome positioning and spacing correlates with genomic features
(A) Aggregate, adjusted windowed protection scores (WPS; 120 bp window) around 22,626 transcription start sites (TSS). TSS are aligned at the 0 position after adjusting for strand and direction of transcription. Aggregate WPS is tabulated for both real data and simulated data by summing per-TSS WPS at each position relative to the centered TSS. The values plotted represent the difference between the real and simulated aggregate WPS (see Experimental Procedures for details). (B) Aggregate, adjusted WPS around 22,626 start codons. (C and D) Aggregate, adjusted WPS around 224,910 splice donor (C) and 224,910 splice acceptor (D) sites. (E) Nucleosome spacing in A/B compartments. Median nucleosome spacing in non-overlapping 100 kb bins, each containing ~500 nucleosome calls, is calculated genome-wide. A/B compartment predictions, also with 100 kb resolution, are shown for GM12878. Compartments A and B are associated with open and closed chromatin, respectively. (F) Nucleosome spacing and A/B compartments on chromosomes 7 and 11. A/B segmentation (red and blue bars) largely recapitulates chromosomal G-banding (ideograms, gray bars). Median nucleosome spacing (black dots) is calculated in 100 kb bins. See also Figure S3.
Figure 4. Short cfDNA fragments footprint CTCF…
Figure 4. Short cfDNA fragments footprint CTCF and other TF binding sites
Clustered FIMO predictions were intersected with ChIP-seq data to obtain confident sets of binding site predictions for various TFs. Aggregate, adjusted WPS were calculated for both the long (120–180 bp) and short (35–80 bp) fractions of cfDNA fragments. Higher WPS values indicate greater nucleosome or TF protection, respectively. (A) Aggregate, adjusted WPS for 93,530 predicted CTCF binding sites for the long (top) and short (bottom) cfDNA fractions. Binding of CTCF results in strong positioning of neighboring nucleosomes. (B) Aggregate, adjusted WPS, calculated for 93,530 predicted CTCF sites as in (A) and magnified for detail, for 35–80 bp cfDNA fragments. The pink shading indicates the larger 52 bp CTCF binding motif, and the black box shows the location of the 17 bp motif used for FIMO predictions. (C) Density of −1 to +1 nucleosome spacing around CTCF sites derived from clustered FIMO predictions (purely motif-based: 518,632 sites), a subset of these predictions overlapping with ENCODE ChIP-seq peaks (93,530 sites), and a further subset active across 19 cell lines (23,723 sites). Flanking nucleosome spacing at the least stringent set of sites (motif-based) mirrors the genome-wide average (~185 bp), while spacing at the most stringent set of sites is highly enriched for greater distances (~260 bp), consistent with active CTCF binding and repositioning of adjacent nucleosomes. (D) Mean WPS calculated for the long (top) and short (bottom) cfDNA fractions at the sets of CTCF sites in (C). (E and F) Aggregate, adjusted WPS calculated for both long (top) and short (bottom) cfDNA fractions at predicted binding sites for ETS (210,798 sites) (E) and MAFK (32,159 sites) (F). For both factors, short fraction WPS is consistent with TF-conferred protection of the binding site, whereas long fraction WPS evidences regular, local positioning of surrounding nucleosomes. See also Figure S4.
Figure 5. Inference of mixtures of cell…
Figure 5. Inference of mixtures of cell types contributing to cell-free DNA in healthy states and cancer
(A) The distribution of nucleosome spacing for peaks flanking DHS sites in 116 callsets is bimodal, plausibly corresponding to widened nucleosome spacing at active DHS sites due to intervening TF binding (~190 bp → 260 bp). Lymphoid or myeloid callsets have the largest proportions of DHS sites with widened nucleosome spacing, consistent with hematopoietic cell death as the dominant source of cfDNA in healthy individuals. (B and C) Partitioning adjusted WPS scores around TSS into five gene expression bins (quintiles) defined for NB-4 (an acute promyelocytic leukemia cell line) reveals differential nucleosome spacing and positioning. (B) Highly expressed genes show strong nucleosome phasing within the transcript body. Upstream of the TSS, −1 nucleosomes are well-positioned across expression bins, but −2 and −3 nucleosomes are well-positioned only for medium to highly expressed genes. (C) For medium to highly expressed genes, a short fragment WPS peak is observed between the TSS and the −1 nucleosome, plausibly footprinting some or all of the transcription preinitiation complex at transcriptionally active genes. (D) Median nucleosome spacing in the transcript body is negatively correlated with gene expression in NB-4 (ρ = −0.17, n = 19,677 genes). Spacing in genes with low or no expression is 193 bp, while spacing in expressed genes ranges from 186 to 193 bp. (E) To deconvolve multiple contributions, intensities from fast Fourier transformation (FFT) quantified the specific frequency contributions in the long fragment WPS for 10 kb windows downstream of each TSS. Shown are correlation trajectories for RNA expression in 76 cell lines and primary tissues at different frequencies. Correlations are strongest for intensities in the 193–199 bp frequency range. (F) The ranks of correlation for 76 RNA expression datasets with average intensity in the 193–199 bp frequency range for various cfDNA libraries are shown, categorized by type and listed from highest (top row) to lowest rank (bottom row). Correlation values and full cell line or tissue names are provided in Table S3. All of the strongest correlations for all three healthy samples (BH01, IH01 and IH02; first three columns) are with lymphoid and myeloid cell lines or with bone marrow. In contrast, cfDNA samples obtained from stage IV cancer patients (IC15, IC17, IC20, IC35, IC37; last five columns) show top correlations with various cancer cell lines, e.g. IC17 (hepatocellular carcinoma, HCC) showing highest correlations with HepG2 (HCC cell line), and IC35 (breast ductal carcinoma, DC) with MCF7 (metastatic breast adenocarcinoma cell line). When comparing cell line/tissue ranks observed for the cancer samples to each of the three healthy samples and averaging the rank changes, maximum rank changes are over two-fold higher than those observed from comparing the three healthy samples with each other and averaging rank changes (‘Control’). For example, for IC15 (small cell lung carcinoma, SCLC) the rank of SCLC-21H (SCLC cell line) increased by an average of 31 positions, for IC20 (squamous cell lung carcinoma, SCC) SK-BR-3 (metastatic breast adenocarcinoma cell line) increased by an average rank of 21, and for IC37 (colorectal adenocarcinoma, AC) HepG2 increased by 24 ranks. See also Figure S5 and Table S4.
Figure 6. Comparison of nucleosome callsets
Figure 6. Comparison of nucleosome callsets
(A) Distance between nucleosome peak calls across three published data sets (Gaffney et al. 2012, Pedersen et al. 2014, Schep et al. 2015) and calls produced in this study. Previously published callsets lack one defined mode at the canonical ~185 bp nucleosome spacing, possibly due to sparse sampling or wide call ranges. In contrast, all the nucleosome calls from cfDNA show one well-defined mode, the magnitude of which increases with the number of fragments examined. The callset produced from simulation has a lower mode (166 bp) and a wider distribution. (B) Number of calls in each set. The densest cfDNA-derived callset contains nearly 13M nucleosome calls. (C–G) Comparison of peak locations between samples. For each pair of samples, the distribution of distances between each peak call in the sample with fewer peaks and the nearest peak call in the other sample is shown. Negative numbers indicate the nearest peak is upstream; positive numbers indicate the nearest peak is downstream. Concordance between callsets increases with the number of cfNDA fragments examined. (H), As in (G), comparing other callsets to the matched simulation of CA01. See also Table S5.

Source: PubMed

3
Abonneren