Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape

Aurélien Griffon, Quentin Barbier, Jordi Dalino, Jacques van Helden, Salvatore Spicuglia, Benoit Ballester

Abstract

The large collections of ChIP-seq data rapidly accumulating in public data warehouses provide genome-wide binding site maps for hundreds of transcription factors (TFs). However, the extent of the regulatory occupancy space in the human genome has not yet been fully assessed by integrating public ChIP-seq data sets and combining them with the ENCODE TF maps. To enable genome-wide identification of regulatory elements, we have collected, analysed and retained 395 available ChIP-seq data sets merged with ENCODE peaks, covering a total of 237 TFs. This enhanced repertoire complements and refines current genome-wide occupancy maps by increasing the human genome regulatory search space by 14% compared to ENCODE alone, and also increases the complexity of the regulatory dictionary. As a direct application, we used this unified binding repertoire to annotate variant enhancer loci (VELs) defined from the H3K4me1 mark in two cancer cell lines (MCF-7, CRC) and observed enrichments of specific TFs involved in key biological functions related to cancer development and proliferation. These TF enrichments within VELs provide a direct annotation of non-coding regions detected in cancer genomes. Finally, the full catalogue is available online together with the TF enrichment analysis tool (http://tagc.univ-mrs.fr/remap/).

© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

Figures

Figure 1.
ChIP-seq binding pattern of 395 data sets. (A) A genome browser example of complex ChIP-seq binding patterns of the 395 data sets at the SMAD4/ELAC1 promoters, and a detailed view of the redundant peaks for a FOXA1 site. The genome tracks shown correspond to the ChIP-seq peak summits (black vertical lines), the 100 vertebrates conservation track from UCSC and the condensed ENCODE TF bindings. (B) Co-binding correlation patterns of the 395 data sets are clustered and shown as a heatmap, with blue to red indicating low to high correlation between co-localized data sets. Co-binding relationships between TFs and cell types across all data sets are observable. Co-localization clusters are highlighted with coloured bars and (C) some clustered data sets are shown in detail (e.g. ESR1 in MCF-7 cells).
Figure 2.
ChIP-seq peaks and CRMs. (A) A schematic diagram of the three types of regulatory regions: all peaks, non-redundant peaks and CRMs. Peaks for similar TFs overlapping the same regions were merged into single peaks defined as non-redundant. For each genomic region bound by at least two different TFs, those bindings were regrouped into CRMs. (B) Proportion of single and combined binding sites observed after identification of CRMs. The vertical barplot corresponds to the proportion of CRMs found in each combinatorial binding category across all identified CRMs. (C) Genomic distribution of single or combined binding sites in six different genomic regions. The percentage of binding sites in each category is shown on the vertical axis, for the overall genome, singletons and each combinatorial binding complexity from 2 to 50+ TFs. (D) Distribution of CRMs at TSS (±2.5 kb) for increasing levels of combinatorial binding complexity from 2 to 50+ TFs. (E) Proportion of our regulatory catalogue covering different types of genomic features. Percentages of elements recovered are shown for CRMs only (green) and both CRMs and singletons (blue). (F) The WebLogo position weight matrix diagrams for CTCF identified across the diverse databases, showing subtle position-specific differences. (G) DNA sequence constraint around the peak summits of FOXA1, CTCF, CEBPA and NFYB, plotted as observed-expected GERP scores (22).
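The grouping described in panel (A) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names `merge_intervals`, `non_redundant_peaks` and `call_crms` are hypothetical, coordinates are assumed to be half-open (BED-like), and "overlap" is assumed to mean any shared base, which may differ from the criterion used in the paper.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def non_redundant_peaks(peaks):
    """Merge overlapping peaks of the same TF on the same chromosome.

    peaks: iterable of (chrom, start, end, tf) tuples.
    """
    by_key = defaultdict(list)
    for chrom, start, end, tf in peaks:
        by_key[(chrom, tf)].append((start, end))
    merged = []
    for (chrom, tf), intervals in by_key.items():
        for start, end in merge_intervals(intervals):
            merged.append((chrom, start, end, tf))
    return sorted(merged)


def call_crms(nr_peaks, min_tfs=2):
    """Cluster overlapping non-redundant peaks into regions.

    Regions bound by at least `min_tfs` distinct TFs are reported as CRMs,
    the others as singletons (assumed any-overlap criterion).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, tf in nr_peaks:
        by_chrom[chrom].append((start, end, tf))
    crms, singletons = [], []
    for chrom in sorted(by_chrom):
        clusters = []  # each cluster is [start, end, set_of_tfs]
        for start, end, tf in sorted(by_chrom[chrom]):
            if clusters and start < clusters[-1][1]:
                clusters[-1][1] = max(clusters[-1][1], end)
                clusters[-1][2].add(tf)
            else:
                clusters.append([start, end, {tf}])
        for start, end, tfs in clusters:
            region = (chrom, start, end, tuple(sorted(tfs)))
            (crms if len(tfs) >= min_tfs else singletons).append(region)
    return crms, singletons


# Toy input (made-up coordinates, for illustration only).
peaks = [
    ("chr1", 100, 200, "FOXA1"),
    ("chr1", 150, 260, "FOXA1"),
    ("chr1", 240, 300, "CTCF"),
    ("chr1", 1000, 1100, "CEBPA"),
]
nr = non_redundant_peaks(peaks)
crms, singletons = call_crms(nr)
print(nr)
print(crms)
print(singletons)
```

In this toy input the two FOXA1 peaks collapse into one non-redundant peak, which then overlaps the CTCF peak and forms a CRM, while the isolated CEBPA peak remains a singleton.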
Figure 3.
Comparison with ENCODE and integration with public data. (A) Comparisons of public regulatory regions versus ENCODE regions. The vertical barplots correspond to the proportion of TFBS from the integrative analysis of public data that can be recapitulated in the ENCODE CRMs and singletons. ‘No overlap’ corresponds to potential novel regulatory regions. Overlap analyses are performed both ways. (B) A genome browser example of binding patterns from public data only, and complemented patterns with the public and ENCODE merge. The genome tracks shown correspond to the ChIP-seq peak summits (black vertical lines) and the 100 vertebrates conservation track from UCSC. (C) Venn diagrams of TFs, CRMs and regulatory features (CRMs and singletons) between the public set and ENCODE. (D) Genomic distribution of single or combined binding sites in six different genomic regions. The percentage of binding sites in each category is shown on the vertical axis for singletons and each combinatorial binding complexity from 2 to 100+ TFs. (E) Saturation analysis of the ReMap data with increasing numbers of TFs. The plot is generated from the merge of both public and ENCODE TFBS catalogues. It illustrates the saturation of CRMs identified by TF ChIP-seq as additional factors are analysed across the multi-cell integrative analysis. We calculate CRM counts across the genome from an increasing number of randomly selected TFs. The distribution of CRM counts over 100 TF selections is plotted as a boxplot along the x-axis. We continue to do this for all incremental steps up to and including all TFs. A lowess line smoothing the medians of the CRM counts is highlighted in orange.
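The resampling behind panel (E) can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the paper: `count_crms` and `saturation_curve` are hypothetical names, a CRM is assumed to be any cluster of overlapping peaks from at least two distinct TFs, the "100 TFs selection" is read as 100 random draws per step, and the lowess smoothing is left out (only the raw counts and their medians are produced).

```python
import random
from statistics import median


def count_crms(peaks, min_tfs=2):
    """Count clusters of overlapping peaks bound by >= min_tfs distinct TFs.

    peaks: list of (chrom, start, end, tf) with half-open coordinates.
    """
    count = 0
    prev_chrom, cur_end, tfs = None, None, set()
    for chrom, start, end, tf in sorted(peaks):
        if chrom == prev_chrom and start < cur_end:
            # Peak overlaps the current cluster: extend it.
            cur_end = max(cur_end, end)
            tfs.add(tf)
        else:
            # Close the previous cluster and start a new one.
            if len(tfs) >= min_tfs:
                count += 1
            prev_chrom, cur_end, tfs = chrom, end, {tf}
    if len(tfs) >= min_tfs:
        count += 1
    return count


def saturation_curve(peaks_by_tf, step=1, n_draws=100, seed=0):
    """For each subset size k, draw n_draws random TF subsets and count CRMs.

    peaks_by_tf: dict mapping TF name -> list of (chrom, start, end).
    Returns a dict mapping k -> list of CRM counts (one per draw).
    """
    all_tfs = sorted(peaks_by_tf)
    if not all_tfs:
        return {}
    rng = random.Random(seed)
    sizes = list(range(step, len(all_tfs) + 1, step))
    if not sizes or sizes[-1] != len(all_tfs):
        sizes.append(len(all_tfs))  # always include the full TF set
    curve = {}
    for k in sizes:
        counts = []
        for _ in range(n_draws):
            chosen = rng.sample(all_tfs, k)
            peaks = [(c, s, e, tf) for tf in chosen for (c, s, e) in peaks_by_tf[tf]]
            counts.append(count_crms(peaks))
        curve[k] = counts
    return curve


# Toy input (made-up coordinates, for illustration only).
peaks_by_tf = {
    "A": [("chr1", 0, 100)],
    "B": [("chr1", 50, 150)],
    "C": [("chr1", 500, 600)],
}
curve = saturation_curve(peaks_by_tf, step=1, n_draws=100, seed=0)
medians = {k: median(v) for k, v in curve.items()}
print(medians)
```

Each list in `curve` corresponds to one boxplot in the figure, and the per-step medians are the values a smoother such as lowess would be fitted to.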
Figure 4.
Network representations of TF co-localization across the genome. (A) In this filtered TF co-localization network, nodes indicate individual TFs and node colours indicate subnetworks identified by applying a partitioning algorithm; edge colours depict the percentage of overlap between TFBS, and edge weights depict the co-localization specificity between two TFs. Overlapping binding sites were computed using the IntervalStats tool, and co-localization specificity was determined by identifying outliers based on the percentages of significantly overlapping sites. (B) Highlighted subnetworks of highly connected and strongly specific TFs with functional annotations. Barplots represent Gene Ontology Biological Process enrichments calculated by DAVID (x-axis = −log10 Benjamini score).
Figure 5.
Specific TF signatures in VELs. (A) UCSC browser views of the H3K4me1 profile of a normal mammary epithelial cell line (MCF-10A) and a breast cancer cell line (MCF-7), illustrating an example of a gained (left, in red) and a lost (right, in blue) VEL. (B) Similar view of the H3K4me1 profile for a primary CRC (CRC V400) and a normal colon epithelium crypt (C104). (C, D) TFs specifically enriched within regions defined as gained or lost VELs in (C) MCF-7 and (D) CRC cell lines.

References

    1. Hoffman M.M., Ernst J., Wilder S.P., Kundaje A., Harris R.S., Libbrecht M., Giardine B., Ellenbogen P.M., Bilmes J.A., Birney E., et al. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res. 2013;41:827–841.
    2. Negre N., Brown C.D., Ma L., Bristow C.A., Miller S.W., Wagner U., Kheradpour P., Eaton M.L., Loriaux P., Sealfon R., et al. A cis-regulatory map of the Drosophila genome. Nature. 2011;471:527–531.
    3. Shen Y., Yue F., McCleary D.F., Ye Z., Edsall L., Kuan S., Wagner U., Dixon J., Lee L., Lobanenkov V.V., et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488:116–120.
    4. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
    5. Neph S., Vierstra J., Stergachis A.B., Reynolds A.P., Haugen E., Vernot B., Thurman R.E., John S., Sandstrom R., Johnson A.K., et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012;489:83–90.
    6. Hnisz D., Abraham B.J., Lee T.I., Lau A., Saint-André V., Sigova A.A., Hoke H.A., Young R.A. Super-enhancers in the control of cell identity and disease. Cell. 2013;155:934–947.
    7. Chapuy B., McKeown M.R., Lin C.Y., Monti S., Roemer M.G.M., Qi J., Rahl P.B., Sun H.H., Yeda K.T., Doench J.G., et al. Discovery and characterization of super-enhancer-associated dependencies in diffuse large B cell lymphoma. Cancer Cell. 2013;24:777–790.
    8. Andersson R., Gebhard C., Miguel-Escalada I., Hoof I., Bornholdt J., Boyd M., Chen Y., Zhao X., Schmidl C., Suzuki T., et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507:455–461.
    9. Liu T., Ortiz J.A., Taing L., Meyer C.A., Lee B., Zhang Y., Shin H., Wong S.S., Ma J., Lei Y., et al. Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol. 2011;12:R83.
    10. Marinov G.K., Kundaje A., Park P.J., Wold B.J. Large-scale quality analysis of published ChIP-seq data. G3 (Bethesda). 2014;4:209–223.
    11. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359.
    12. Zhang Y., Liu T., Meyer C.A., Eeckhoute J., Johnson D.S., Bernstein B.E., Nusbaum C., Myers R.M., Brown M., Li W., et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137.
    13. Salmon-Divon M., Dvinge H., Tammoja K., Bertone P. PeakAnalyzer: genome-wide annotation of chromatin binding and modification loci. BMC Bioinformatics. 2010;11:415.
    14. Landt S.G., Marinov G.K., Kundaje A., Kheradpour P., Pauli F., Batzoglou S., Bernstein B.E., Bickel P., Brown J.B., Cayting P., et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012;22:1813–1831.
    15. Kharchenko P.V., Tolstorukov M.Y., Park P.J. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 2008;26:1351–1359.
    16. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842.
    17. Shin H., Liu T., Manrai A.K., Liu X.S. CEAS: cis-regulatory element annotation system. Bioinformatics. 2009;25:2605–2606.
    18. Thomas-Chollier M., Herrmann C., Defrance M., Sand O., Thieffry D., van Helden J. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012;40:e31.
    19. Wang J., Zhuang J., Iyer S., Lin X.-Y., Greven M.C., Kim B.-H., Moore J., Pierce B.G., Dong X., Virgil D., et al. : a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 2013;41:D171–D176.
    20. Mathelier A., Zhao X., Zhang A.W., Parcy F., Worsley-Hunt R., Arenillas D.J., Buchman S., Chen C.-Y., Chou A., Ienasescu H., et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2014;42:D142–D147.
    21. Flicek P., Aken B.L., Ballester B., Beal K., Bragin E., Brent S., Chen Y., Clapham P., Coates G., Fairley S., et al. Ensembl's 10th year. Nucleic Acids Res. 2010;38:D557–D562.
    22. Cooper G.M., Stone E.A., Asimenos G., Green E.D., Batzoglou S., Sidow A., NISC Comparative Sequencing Program. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–913.
    23. Chikina M.D., Troyanskaya O.G. An effective statistical evaluation of ChIPseq dataset similarity. Bioinformatics. 2012;28:607–613.
    24. Bastian M., Heymann S., Jacomy M. Gephi: an open source software for exploring and manipulating networks. Intl. AAAI Conf. Weblogs Social Media. 2009.
    25. Blondel V.D., Guillaume J.-L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008;2008:P10008.
    26. Akhtar-Zaidi B., Cowper-Sal Lari R., Corradin O., Saiakhova A., Bartels C.F., Balasubramanian D., Myeroff L., Lutterbaugh J., Jarrar A., Kalady M.F. Epigenomic enhancer profiling defines a signature of colon cancer. Science. 2012;336:736–739.
    27. Choe M.K., Hong C.-P., Park J., Seo S.H., Roh T.-Y. Functional elements demarcated by histone modifications in breast cancer cells. Biochem. Biophys. Res. Commun. 2012;418:475–482.
    28. Barrett T., Troup D.B., Wilhite S.E., Ledoux P., Rudnev D., Evangelista C., Kim I.F., Soboleva A., Tomashevsky M., Edgar R. NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res. 2007;35:D760–D765.
    29. Parkinson H., Kapushesky M., Shojatalab M., Abeygunawardena N., Coulson R., Farne A., Holloway E., Kolesnykov N., Lilja P., Lukk M., et al. ArrayExpress–a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007;35:D747–D750.
    30. Cuddapah S., Jothi R., Schones D.E., Roh T.-Y., Cui K., Zhao K. Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res. 2009;19:24–32.
    31. Heintzman N.D., Hon G.C., Hawkins R.D., Kheradpour P., Stark A., Harp L.F., Ye Z., Lee L.K., Stuart R.K., Ching C.W., et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009;459:108–112.
    32. Moorman C., Sun L.V., Wang J., de Wit E., Talhout W., Ward L.D., Greil F., Lu X.-J., White K.P., Bussemaker H.J., et al. Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster. Proc. Natl. Acad. Sci. U.S.A. 2006;103:12027–12032.
    33. Foley J.W., Sidow A. Transcription-factor occupancy at HOT regions quantitatively predicts RNA polymerase recruitment in five human cell lines. BMC Genom. 2013;14:720.
    34. Lee B.-K., Bhinge A.A., Battenhouse A., Liu Z., McDaniell R.M., Song L., Ni Y., Birney E., Lieb J.D., Furey T.S. Cell-type specific and combinatorial usage of diverse transcription factors revealed by genome-wide binding studies in multiple human cells. Genome Res. 2011;22:9–24.
    35. Xie D., Boyle A.P., Wu L., Zhai J., Kawli T., Snyder M. Dynamic trans-acting factor colocalization in human cells. Cell. 2013;155:713–724.
    36. Vlieghe D., Sandelin A., Bleser P.J., Vleminckx K., Wasserman W.W., van Roy F., Lenhard B. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 2006;34:D95–D97.
    37. Schmidt D., Wilson M.D., Ballester B., Schwalie P.C., Brown G.D., Marshall A., Kutter C., Watt S., Martinez-Jimenez C.P., Mackay S., et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science (New York, NY). 2010;328:1036–1040.
    38. Consortium E.P. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011;9:e1001046.
    39. Kazemian M., Pham H., Wolfe S.A., Brodsky M.H., Sinha S. Widespread evidence of cooperative DNA binding by transcription factors in Drosophila development. Nucleic Acids Res. 2013;41:8237–8252.
    40. Cheng Q., Kazemian M., Pham H., Blatti C., Celniker S.E., Wolfe S.A., Brodsky M.H., Sinha S. Computational identification of diverse mechanisms underlying transcription factor-DNA occupancy. PLoS Genet. 2013;9:e1003571.
    41. Gerstein M.B., Kundaje A., Hariharan M., Landt S.G., Yan K.-K., Cheng C., Mu X.J., Khurana E., Rozowsky J., Alexander R., et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100.
    42. Wang J., Zhuang J., Iyer S., Lin X., Whitfield T.W., Greven M.C., Pierce B.G., Dong X., Kundaje A., Cheng Y., et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012;22:1798–1812.
    43. Vaquerizas J.M., Kummerfeld S.K., Teichmann S.A., Luscombe N.M. A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 2009;10:252–263.
    44. Hawkins S.M., Loomans H.A., Wan Y.-W., Ghosh-Choudhury T., Coffey D., Xiao W., Liu Z., Sangi-Haghpeykar H., Anderson M.L. Expression and functional pathway analysis of nuclear receptor NR2F2 in ovarian cancer. J. Clin. Endocrinol. Metab. 2013;98:E1152–E1162.
    45. Su X., Chakravarti D., Cho M.S., Liu L., Gi Y.J., Lin Y.-L., Leung M.L., El-Naggar A., Creighton C.J., Suraokar M.B., et al. TAp63 suppresses metastasis through coordinate regulation of Dicer and miRNAs. Nature. 2010;467:986–990.
    46. Schaab C., Geiger T., Stoehr G., Cox J., Mann M. Analysis of high accuracy, quantitative proteomics data in the MaxQB database. Mol. Cell Proteom. 2012;11:M111.014068.
    47. Tang W., Dodge M., Gundapaneni D., Michnoff C., Roth M., Lum L. A genome-wide RNAi screen for Wnt/beta-catenin pathway components identifies unexpected roles for TCF transcription factors in cancer. Proc. Natl. Acad. Sci. U.S.A. 2008;105:9697–9702.
    48. Slattery M.L., Folsom A.R., Wolff R., Herrick J., Caan B.J., Potter J.D. Transcription factor 7-like 2 polymorphism and colon cancer. Cancer Epidemiol. Biomarkers Prev. 2008;17:978–982.
    49. Shaulian E., Karin M. AP-1 as a regulator of cell life and death. Nat. Cell Biol. 2002;4:E131–E136.
    50. Teng L., He B., Gao P., Gao L., Tan K. Discover context-specific combinatorial transcription factor interactions by integrating diverse ChIP-Seq data sets. Nucleic Acids Res. 2013;42 gkt1105–e24.
    51. Ernst J., Kellis M. Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome Res. 2013;23:1142–1154.
    52. Mendoza-Parra M.-A., Van Gool W., Mohamed Saleem M.A., Ceschin D.G., Gronemeyer H. A quality control system for profiles obtained by ChIP sequencing. Nucleic Acids Res. 2013;41:e196.
    53. Khurana E., Fu Y., Colonna V., Mu X.J., Kang H.M., Lappalainen T., Sboner A., Lochovsky L., Chen J., Harmanci A., et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science (New York, NY). 2013;342:1235587.
    54. Ritchie G.R.S., Dunham I., Zeggini E., Flicek P. Functional annotation of noncoding sequence variants. Nat. Methods. 2014;11:294–296.
    55. Whyte W.A., Orlando D.A., Hnisz D., Abraham B.J., Lin C.Y., Kagey M.H., Rahl P.B., Lee T.I., Young R.A. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell. 2013;153:307–319.
    56. Lovén J., Hoke H.A., Lin C.Y., Lau A., Orlando D.A., Vakoc C.R., Bradner J.E., Lee T.I., Young R.A. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell. 2013;153:320–334.
    57. Zhao Z., Tavoosidana G., Sjölinder M., Göndör A., Mariano P., Wang S., Kanduri C., Lezcano M., Sandhu K.S., Singh U., et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet. 2006;38:1341–1347.
    58. Jin F., Li Y., Dixon J.R., Selvaraj S., Ye Z., Lee A.Y., Yen C.-A., Schmitt A.D., Espinoza C.A., Ren B. A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature. 2013;503:290–294.
    59. Li G., Ruan X., Auerbach R.K., Sandhu K.S., Zheng M., Wang P., Poh H.M., Goh Y., Lim J., Zhang J., et al. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell. 2012;148:84–98.
    60. Zhang Y., Wong C.-H., Birnbaum R.Y., Li G., Favaro R., Ngan C.Y., Lim J., Tai E., Poh H.M., Wong E., et al. Chromatin connectivity maps reveal dynamic promoter-enhancer long-range associations. Nature. 2013;504:306–310.

Source: PubMed
