A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters

Serge Saxonov, Paul Berg, Douglas L Brutlag

Abstract

A striking feature of the human genome is the dearth of CpG dinucleotides (CpGs), interrupted occasionally by CpG islands (CGIs), regions with a relatively high content of the dinucleotide. CGIs are generally associated with promoters, and genes whose promoters are especially rich in CpGs tend to be expressed in most tissues. However, all working definitions of what constitutes a CGI rely on ad hoc thresholds. Here we take a direct and comprehensive approach, surveying the locations of all CpGs in the human genome, and find that promoters segregate naturally into two classes by CpG content. Seventy-two percent of promoters belong to the class with high CpG content (HCG), and 28% are in the class whose CpG content is characteristic of the overall genome (low CpG content, LCG). The enrichment of CpGs in the HCG class is symmetric and peaks around the core promoter. The broad-based expression of genes with HCG promoters is not a consequence of a correlation with CpG content, because within the HCG class the breadth of expression is independent of CpG content. The overall depletion of CpGs throughout the genome is thought to be a consequence of the methylation of some germ-line CpGs and their susceptibility to mutation. A comparison of the frequencies of inferred deamination mutations at CpG and GpC dinucleotides in the two classes of promoters, using SNPs in human-chimpanzee sequence alignments, shows that CpGs mutate at a lower frequency in the HCG promoters, suggesting that CpGs in the HCG class are hypomethylated in the germ line.

Figures

Fig. 1.
Patterns of CpG occurrence with respect to gene features. The measures were made on overlapping segments aligned with respect to the TSS and identified by the distance of the midpoint from the TSS. The analysis included all (15,880) RefSeq genes for which the TSS was annotated differently from the start of the coding region. (A) To compare CpG presence in exons and introns as well as coding and noncoding sequences, the normalized CpG fraction was computed on overlapping 99-bp segments downstream of the TSS. Sequences were filtered according to whether they were in introns or exons; exons were further split into coding and noncoding (5′ and 3′ UTRs) sets. Exons carry a consistently higher level of CpGs than introns. The CpG content of noncoding exons is only slightly above that of introns, suggesting that coding potential is responsible for maintaining the higher CpG levels in exons. (B and C) Patterns of CpG occurrence (B) and GC content (C) around transcription start sites. Normalized CpG fraction and GC content were computed in 50-bp overlapping segments across 4-kb regions centered at the TSS.
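As an illustration of the windowed measure described in the Fig. 1 caption, the following Python sketch computes a normalized CpG fraction on overlapping segments, each identified by its midpoint. The caption does not give the normalization formula; the sketch assumes the common observed-over-expected form, with the expected CpG fraction taken as (GC content / 2)². The function names, the step size, and the toy sequence are hypothetical and are not taken from the paper.

```python
def normalized_cpg_fraction(seq: str) -> float:
    """Observed CpG fraction divided by an assumed expectation of (GC/2)^2."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    # Count CpG dinucleotides over all n - 1 adjacent positions.
    cpg_count = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    observed = cpg_count / (n - 1)
    gc_content = (seq.count("C") + seq.count("G")) / n
    expected = (gc_content / 2) ** 2  # assumption: C and G equally frequent
    if expected == 0:
        return 0.0  # no C or G in the segment, so no CpG is possible
    return observed / expected


def sliding_windows(seq: str, width: int, step: int):
    """Yield (midpoint, normalized CpG fraction) for overlapping segments."""
    for start in range(0, len(seq) - width + 1, step):
        midpoint = start + width // 2
        yield midpoint, normalized_cpg_fraction(seq[start:start + width])


if __name__ == "__main__":
    toy = "CGCGATATCGGCTTAACGCG"  # hypothetical sequence, not real data
    for midpoint, value in sliding_windows(toy, width=8, step=4):
        print(midpoint, round(value, 3))
```

For a real analysis the segment width would be 99 bp or 50 bp as in the caption, and midpoints would be expressed relative to the TSS rather than to the start of the sequence.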
Fig. 2.
Distribution of promoters with respect to CpG properties. (A and B) Histograms of normalized CpG fractions (A) and GC content (B) of 3-kb regions around TSSs. The y axis counts the number of promoters with the given CpG or GC content in the 3 kb centered at each promoter's TSS. Two Gaussian curves were fitted to the distribution in A with means of 0.23 and 0.61, σ values of 0.07 and 0.14, and weights of 4,430 and 11,450, respectively. The intersection of the two curves, at 0.35, is the decision boundary we used to separate promoters and their genes into classes LCG and HCG. See Table 6, which is published as supporting information on the PNAS web site, for a full listing of the TSSs in the two classes, along with their RefSeq IDs and chromosome locations. (C and D) Plotting the normalized CpG fraction (C) and GC content (D) separately for the two classes.
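The decision boundary in the Fig. 2 caption can be checked from the fitted parameters alone. The sketch below finds the point between the two means where the weighted Gaussian curves cross, by equating their log densities and solving the resulting quadratic. It assumes that each reported weight scales a unit-area Gaussian, which the caption does not state explicitly; the function names are hypothetical.

```python
import math


def gaussian_intersection(mu1, sigma1, w1, mu2, sigma2, w2):
    """Return the crossing point of two weighted Gaussians between their means.

    Each curve is assumed to be w / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2)).
    Equating the logs of the two curves gives a*x^2 + b*x + c = 0.
    """
    a = 1.0 / (2 * sigma2 ** 2) - 1.0 / (2 * sigma1 ** 2)
    b = mu1 / sigma1 ** 2 - mu2 / sigma2 ** 2
    c = (mu2 ** 2 / (2 * sigma2 ** 2)
         - mu1 ** 2 / (2 * sigma1 ** 2)
         + math.log((w1 * sigma2) / (w2 * sigma1)))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear.
        if abs(b) < 1e-12:
            raise ValueError("curves do not cross at a single point")
        roots = [-c / b]
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            raise ValueError("curves do not cross")
        r = math.sqrt(disc)
        roots = [(-b + r) / (2 * a), (-b - r) / (2 * a)]
    lo, hi = sorted((mu1, mu2))
    between = [x for x in roots if lo <= x <= hi]
    if not between:
        raise ValueError("no crossing between the two means")
    return between[0]


def classify_promoter(normalized_cpg, boundary):
    """Assign a promoter to class HCG or LCG by its normalized CpG fraction."""
    return "HCG" if normalized_cpg >= boundary else "LCG"


if __name__ == "__main__":
    # Parameters as reported in the caption.
    boundary = gaussian_intersection(0.23, 0.07, 4430, 0.61, 0.14, 11450)
    print(round(boundary, 2))
    print(classify_promoter(0.20, boundary), classify_promoter(0.60, boundary))
```

With the caption's means, σ values, and weights, the crossing point falls very close to the reported boundary of 0.35, which supports reading the weights as scale factors on normalized curves.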
Fig. 3.
A microarray analysis of tissue distribution of genes in class LCG and class HCG. (A) Tissue distributions of genes in the two classes were significantly different (P = 1.6 × 10⁻⁵²). The fraction of genes expressed in only a few tissues was higher in the LCG class, whereas the fraction of universally expressed genes was higher in the HCG class. For plotting convenience we show distributions of genes grouped in 16 larger bins of size 5. (B) We partitioned class HCG into thirds by CpG content. One-third of promoters had normalized CpG fractions between 0.350 and 0.563, the next third was between 0.563 and 0.683, and the last third comprised all of the promoters with normalized CpG at >0.683. The tissue distributions of genes in the three HCG partitions were similar to each other and different from class LCG. (C) We quantified that conclusion by measuring dissimilarities between distributions by using χ² values (P values in parentheses).
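To make the χ² comparison in panel C concrete, the sketch below computes a Pearson χ² statistic for two binned tissue-count distributions treated as a 2 × k contingency table. This is one standard way to measure dissimilarity between two histograms; the caption does not specify the exact procedure used, so the choice of test is an assumption. The function name and the toy counts are hypothetical and are not data from the paper.

```python
def chi_square_statistic(counts_a, counts_b):
    """Pearson chi-square statistic for two histograms over the same bins."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both distributions must use the same bins")
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    total = total_a + total_b
    if total_a == 0 or total_b == 0:
        raise ValueError("each distribution needs at least one count")
    statistic = 0.0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # an empty bin contributes nothing
        # Expected counts if both classes shared one underlying distribution.
        expected_a = total_a * column / total
        expected_b = total_b * column / total
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic


if __name__ == "__main__":
    # Hypothetical gene counts per tissue-breadth bin, for illustration only.
    lcg_like = [40, 25, 20, 15]
    hcg_like = [15, 20, 25, 40]
    print(round(chi_square_statistic(lcg_like, hcg_like), 3))
```

Under this assumption, a P value would follow from the χ² distribution with k - 1 degrees of freedom, where k is the number of non-empty bins. A larger statistic indicates more dissimilar distributions, which matches the caption's use of χ² to show that the HCG partitions resemble each other more than they resemble class LCG.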

Source: PubMed
