ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia

Stephen G Landt, Georgi K Marinov, Anshul Kundaje, Pouya Kheradpour, Florencia Pauli, Serafim Batzoglou, Bradley E Bernstein, Peter Bickel, James B Brown, Philip Cayting, Yiwen Chen, Gilberto DeSalvo, Charles Epstein, Katherine I Fisher-Aylor, Ghia Euskirchen, Mark Gerstein, Jason Gertz, Alexander J Hartemink, Michael M Hoffman, Vishwanath R Iyer, Youngsook L Jung, Subhradip Karmakar, Manolis Kellis, Peter V Kharchenko, Qunhua Li, Tao Liu, X Shirley Liu, Lijia Ma, Aleksandar Milosavljevic, Richard M Myers, Peter J Park, Michael J Pazin, Marc D Perry, Debasish Raha, Timothy E Reddy, Joel Rozowsky, Noam Shoresh, Arend Sidow, Matthew Slattery, John A Stamatoyannopoulos, Michael Y Tolstorukov, Kevin P White, Simon Xi, Peggy J Farnham, Jason D Lieb, Barbara J Wold, Michael Snyder

Abstract

Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.

Figures

Figure 1.
Overview of ChIP-seq workflow and antibody characterization procedures. (A) Steps for which specific ENCODE guidelines are presented in this document are indicated in red. For other steps, standard ENCODE protocols exist that should be validated and optimized for each new cell line/tissue type or sonicator. (*) A commonly used but optional step. (B) Flowchart for characterization of new antibodies or antibody lots. (C) Flowchart for use of antibody characterization assays.
Figure 2.
Representative results from antibody characterization assays. (A) Immunoblot analyses of antibodies against SIN3B that (left) pass quality control (Santa Cruz sc13145) and (right) fail quality control (Santa Cruz sc996). Lanes contain nuclear extract from GM12878 cells (G) and K562 cells (K). Arrows indicate band of expected size of 133 kDa. Molecular weights (MW) are in kilodaltons. (B) Immunoblot analysis of an antibody against TBLR1 (Abcam ab24550) that passes quality control and can be used for immunoprecipitation. Immunoprecipitations (IPs) were performed from nuclear lysates of K562 cells. Arrow indicates band of expected size (56 kDa) that is detected in the input lysate (lane 1) and is efficiently (cf. lanes 3 and 2) and specifically (absent in lane 4) immunoprecipitated. (*) IgG light and heavy chains. (C) Immunofluorescence analyses of antibodies that pass (left) and fail (right) quality control. (D) Immunoprecipitation/mass spectrometry analysis of an antibody against SP1 (Santa Cruz sc-17824). Whole-cell lysates (WCL) of K562, GM12878, and HepG2 were immunoprecipitated, and a band of expected size (∼106 kDa) was detected on a Western blot with SP1 primary antibody. The immunoprecipitation was repeated in K562 WCL, separated on a gel, stained with Coomassie Blue, and the band previously detected on the Western blot was excised and analyzed by mass spectrometry. Peptides were identified using MASCOT (Matrix Science) with probability-based matching at P < 0.05. Subsequent analysis was performed in Scaffold (Proteome Software, Inc.) at 0.0% protein FDR and 0.0% peptide FDR. SP1 protein was detected (along with common contaminants that are often obtained in control experiments) (data not shown) and is highlighted in bold and light blue. (E) Histogram depicting motif fold-enrichment (blue) for all transcription factors for which ENCODE ChIP-seq data is available (85 factors). 
Enrichments are relative to all DNase-accessible sites and were corrected for sequence bias using shuffled motifs. Motif searches were conducted with a matching stringency of 4–6. Where multiple data sets are available for a factor, the data set with the highest enrichment was counted. Data sets that meet the ENCODE standard of fourfold enrichment (indicated by blue line) were found for 60% of factors. Motif representation, as a percentage of all analyzed peaks, is shown in red for all factors for which a data set exists that exceeds the enrichment standard. A total of 96% of these data sets meet the ENCODE standard of >10% motif representation (red line). All calculations were carried out on peaks identified by IDR analysis (0.01 cut-off).
Figure 3.
Peak counts depend on sequencing depth. (A) Number of peaks called with PeakSeq (0.01% FDR cut-off) for 11 ENCODE ChIP-seq data sets. (B) Called peak numbers for 11 ChIP-seq data sets as a function of the number of uniquely mapped reads used for peak calling. (Inset) Called peak data for the MAFK data set from HepG2 cells, currently the most deeply sequenced ENCODE ChIP-seq data set (displayed separately due to the significantly larger number of reads relative to the other data sets). Data sets are indicated by cell line and transcription factor (e.g., cell line HepG2, transcription factor MAFK). (C) Fold-enrichment for newly called peaks as a function of sequencing depth. For each incremental addition of 2.5 million uniquely mapped reads, the median fold-enrichment for newly called peaks as compared with an IgG control data set sequenced to identical depth is plotted.
Figure 4.
Criteria for assessing the quality of a ChIP-seq experiment. (A) Library complexity. Individual reads mapping to the plus (red) or minus strand (blue) are represented. (B) Distribution of functional regulatory elements with respect to the strength of the ChIP-seq signal. ChIP-seq was performed against myogenin, a major regulator of muscle differentiation, in differentiated mouse myocytes. While many extensively characterized muscle regulatory elements exhibit strong myogenin binding, a large number of known functional sites are at the low end of the binding strength continuum. (C) Number of called peaks vs. ChIP enrichment. Except in special cases, successful experiments identify thousands to tens of thousands of peaks for most TFs and, depending on the peak finder used, numbers in the hundreds or low thousands indicate a failure. Peaks were called using MACS with default thresholds. (D) Generation of a cross-correlation plot. Reads are shifted in the direction of the strand they map to by an increasing number of base pairs and the Pearson correlation between the per-position read count vectors for each strand is calculated. (Read coverage is represented as a wigglegram, not to the same scale in the top and bottom panels.) (E) Two cross-correlation peaks are usually observed in a ChIP experiment, one corresponding to the read length (“phantom” peak) and one to the average fragment length of the library. (F) Correlation between the fraction of reads within called regions and the relative cross-correlation coefficient for 1052 human ChIP-seq experiments. (G) The absolute and relative heights of the two peaks are useful determinants of the success of a ChIP-seq experiment. A high-quality IP is characterized by a ChIP peak that is much higher than the “phantom” peak, whereas in failed experiments the ChIP peak is often very small or absent.
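The cross-correlation procedure described in panel D can be illustrated with a minimal Python sketch. It is not the consortium's implementation: the function name `strand_cross_correlation`, its parameters, and the toy coordinates are hypothetical, reads are reduced to their 5'-end positions on a single chromosome, and only the relative shift between the two strands is modeled. NumPy is assumed to be available.

```python
import numpy as np


def strand_cross_correlation(plus_5p, minus_5p, chrom_len, max_shift):
    """Return Pearson correlations between strand count vectors for shifts 0..max_shift.

    plus_5p / minus_5p: 0-based 5'-end coordinates of reads on each strand
    (hypothetical inputs; a real pipeline would extract them from alignments).
    """
    # Per-position read counts for each strand, truncated to the chromosome length.
    plus = np.bincount(plus_5p, minlength=chrom_len)[:chrom_len].astype(float)
    minus = np.bincount(minus_5p, minlength=chrom_len)[:chrom_len].astype(float)

    cc = np.empty(max_shift + 1)
    for shift in range(max_shift + 1):
        # Moving the strands toward each other by a total of `shift` bp
        # pairs plus-strand position i with minus-strand position i + shift.
        a = plus[: chrom_len - shift]
        b = minus[shift:]
        cc[shift] = np.corrcoef(a, b)[0, 1]
    return cc


# Toy data: five binding sites, each with one read per strand, 100 bp apart.
sites = [200, 530, 910, 1400, 1777]
cc = strand_cross_correlation(
    [s - 50 for s in sites], [s + 50 for s in sites], chrom_len=2000, max_shift=200
)
print("shift with highest correlation:", int(np.argmax(cc)))
```

In real data the profile would be expected to show the two maxima described in panel E; the toy input has no read-length artifact, so only the fragment-length peak appears.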
Figure 5.
Quality control of ChIP-seq data sets in practice. EGR1 ChIP-seq was performed in K562 cells in two replicates. ChIP enriched regions were identified using MACS. However, the cross-correlation plot profiles (A) indicated that both IPs were suboptimal, with one being unacceptable. In agreement with this judgment, ChIP enrichment (C) and peak number (D) also indicated failure. The ChIP-seq assays were repeated (B), with all quality control metrics improving significantly (B,D), and many additional EGR1 peaks were identified as a result. (E) Representative browser snapshot of the four EGR1 ChIP-seq experiments, showing the much stronger peaks obtained with the second set of replicates. (F) Distribution of EGR1 motifs relative to the bioinformatically defined peak position of EGR1-occupied regions derived from ChIP-seq data in K562 cells. Regions are ranked by their confidence scores as called by SPP.
Figure 6.
The irreproducible discovery rate (IDR) framework for assessing reproducibility of ChIP-seq data sets. (A–C) Reproducibility analysis for a pair of high-quality RAD21 ChIP-seq replicates. (D–F) The same analysis for a pair of low-quality SPT20 ChIP-seq replicates. (A,D) Scatter plots of signal scores of peaks that overlap in each pair of replicates. (B,E) Scatter plots of ranks of peaks that overlap in each pair of replicates. Note that low ranks correspond to high signal and vice versa. (C,F) The estimated IDR as a function of different rank thresholds. (A,B,D,E) Black data points represent pairs of peaks that pass an IDR threshold of 1%, whereas the red data points represent pairs of peaks that do not pass the IDR threshold of 1%. The RAD21 replicates show high reproducibility with ∼30,000 peaks passing an IDR threshold of 1%, whereas the SPT20 replicates show poor reproducibility with only six peaks passing the 1% IDR threshold.
Figure 7.
Analysis of ENCODE data sets using the quality control guidelines. (A–C) Thresholds and distribution of quality control metric values in human ENCODE transcription-factor ChIP-seq data sets. (A) NSC, (B) RSC, (C) NRF. (D) IDR pipeline for assessing ChIP-seq quality using replicate data sets. (E,F) Thresholds and distribution of IDR pipeline quality control metrics in human ENCODE transcription factor ChIP-seq data sets. (Dashed lines) Current ENCODE thresholds for the given metric, which are NSC > 1.05 (A); RSC > 0.8 (B); NRF > 0.8 (C); N1/N2 ≥ 2 (where N1 refers to the replicate with higher N) (E); Np/Nt ≥ 2 (F).
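As a rough illustration of how the NSC, RSC, and NRF thresholds quoted in this caption might be applied, the following Python sketch checks a data set against them. The names `QCMetrics`, `non_redundant_fraction`, and `check_thresholds` are hypothetical. The caption does not define NRF, so the sketch assumes it is the fraction of mapped reads that occupy distinct (chromosome, position, strand) locations. The IDR ratios in panels E and F are left out.

```python
from dataclasses import dataclass

# Thresholds as quoted in the Figure 7 caption (strict inequalities).
NSC_MIN = 1.05
RSC_MIN = 0.8
NRF_MIN = 0.8


@dataclass
class QCMetrics:
    nsc: float  # normalized strand cross-correlation coefficient
    rsc: float  # relative strand cross-correlation coefficient
    nrf: float  # non-redundant fraction


def non_redundant_fraction(reads):
    """Assumed definition: distinct (chrom, position, strand) tuples / total reads."""
    reads = list(reads)
    if not reads:
        raise ValueError("no reads supplied")
    return len(set(reads)) / len(reads)


def check_thresholds(m):
    """Report, per metric, whether the value exceeds the quoted threshold."""
    return {
        "NSC": m.nsc > NSC_MIN,
        "RSC": m.rsc > RSC_MIN,
        "NRF": m.nrf > NRF_MIN,
    }


# Toy example: four reads, two of which share the same location.
toy_reads = [("chr1", 10, "+"), ("chr1", 10, "+"), ("chr1", 20, "-"), ("chr2", 10, "+")]
metrics = QCMetrics(nsc=1.10, rsc=0.9, nrf=non_redundant_fraction(toy_reads))
print(check_thresholds(metrics))
```

Returning a per-metric dictionary rather than a single pass/fail flag keeps each criterion visible, which suits metrics that are meant to be read together rather than collapsed into one verdict.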

Source: PubMed
