Inferring tumour purity and stromal and immune cell admixture from expression data

Kosuke Yoshihara, Maria Shahmoradgoli, Emmanuel Martínez, Rahulsimham Vegesna, Hoon Kim, Wandaliz Torres-Garcia, Victor Treviño, Hui Shen, Peter W Laird, Douglas A Levine, Scott L Carter, Gad Getz, Katherine Stemke-Hale, Gordon B Mills, Roel G W Verhaak

Abstract

Infiltrating stromal and immune cells form the major fraction of normal cells in tumour tissue and not only perturb the tumour signal in molecular studies but also have an important role in cancer biology. Here we describe 'Estimation of STromal and Immune cells in MAlignant Tumours using Expression data' (ESTIMATE), a method that uses gene expression signatures to infer the fraction of stromal and immune cells in tumour samples. ESTIMATE scores correlate with DNA copy number-based tumour purity across samples from 11 different tumour types, profiled on Agilent or Affymetrix platforms or by RNA sequencing, and available through The Cancer Genome Atlas. The prediction accuracy is further corroborated using 3,809 transcriptional profiles available elsewhere in the public domain. The ESTIMATE method allows consideration of tumour-associated normal cells in genomic and transcriptomic studies. An R library is available at https://sourceforge.net/projects/estimateproject/.

Figures

Figure 1. An overview of the ESTIMATE algorithm.
The ESTIMATE algorithm uses gene expression data to output the estimated levels of infiltrating stromal and immune cells and the estimated tumour purity. Infiltrating stromal- and immune cell-related genes were identified through five gene-filtering steps.
Figure 2. Stromal and immune scores for tumour cell and stromal fractions of tumour samples.
Stromal and immune scores were generated using expression data sets obtained from tumour cell- or stromal cell-enriched samples. (a,b) Heatmaps display the stromal score (upper row) and immune score (lower row) per sample (each column) for ovarian cancer samples after (a) microbead-based cell sorting and (b) laser-capture microdissection (red=high, blue=low score). (c,d) Box and whisker plots display reduced (c) stromal and (d) immune scores for the tumour cell-enriched samples (tumour part) after laser-capture microdissection compared with matched stromal cell-enriched (ovary, breast) or bulk tumour samples (lung). Each box shows the median (thick line) and the quartiles (thin lines). Whiskers extend to 1.5 times the interquartile range (IQR) from the lower or the upper quartile.
Figure 3. The association between tumour purity variables in TCGA’s ovarian cancer data set.
(a–d) Scatterplots of tumour purity against (a) stromal, (b) immune and (c) ESTIMATE scores, and (d) of stromal against immune scores, in the TCGA ovarian cancer data set. TCGA ovarian cancer samples used in the gene selection (n=28) were not included in the figure. Dashed lines denote the median values of the stromal and immune scores. (e) The association between tumour purity and stromal- or immune-dominant pattern. Samples were divided into four subgroups based on the medians of the stromal and immune scores. (f) The ROC curves for four cutoff values in the TCGA ovarian cancer data set. N=417.
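The median-based grouping described for panel (e) can be illustrated with a minimal Python sketch. The function name `assign_quadrants`, the label strings and the example scores are hypothetical and not taken from the ESTIMATE package; the sketch also assumes that a score exactly equal to the median is treated as "low", which the caption does not specify.

```python
from statistics import median


def assign_quadrants(stromal, immune):
    """Label each sample by whether its scores exceed the cohort medians.

    Assumption: scores equal to the median are labelled "low".
    """
    stromal_median = median(stromal)
    immune_median = median(immune)
    labels = []
    for s, i in zip(stromal, immune):
        s_level = "high" if s > stromal_median else "low"
        i_level = "high" if i > immune_median else "low"
        labels.append(f"stromal-{s_level}/immune-{i_level}")
    return labels


# Made-up scores for four samples (illustration only).
stromal_scores = [100, -200, 300, -50]
immune_scores = [500, 50, -100, 800]
quadrants = assign_quadrants(stromal_scores, immune_scores)
```

Each sample falls into exactly one of four groups, so purity can then be compared across the groups.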
Figure 4. Evaluation of ESTIMATE algorithm.
The accuracy of the ESTIMATE algorithm was evaluated by the AUC when tumour samples were divided into high- and low-purity groups on the basis of DNA copy number-based tumour purity. (a,b) The ROC curves for four cutoff values in (a) the Agilent, Affymetrix, RNAseq and RNAseqV2 data sets and (b) the validation data set. (c) An example of ESTIMATE applied to a new Affymetrix sample, with an ESTIMATE-predicted tumour purity of 0.58. The black dot and grey dashed lines show the ESTIMATE tumour purity and the 95% prediction interval, respectively. The grey dots represent the background distribution based on 955 samples from the TCGA Affymetrix data set.
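The evaluation described in this caption can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the cutoff values and the example data are hypothetical, and the sketch assumes that a higher ESTIMATE score indicates lower purity. The AUC is computed through its rank-based (Mann-Whitney) interpretation rather than by tracing the ROC curve.

```python
def auc(scores_pos, scores_neg):
    """Probability that a positive sample outscores a negative one (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def auc_at_cutoff(estimate_scores, purities, cutoff):
    """Split samples at a purity cutoff and score how well ESTIMATE separates them.

    Assumption: low-purity samples are the "positive" class, because a higher
    ESTIMATE score is taken to mean more stromal and immune admixture.
    """
    low = [s for s, p in zip(estimate_scores, purities) if p < cutoff]
    high = [s for s, p in zip(estimate_scores, purities) if p >= cutoff]
    if not low or not high:
        raise ValueError("cutoff leaves one purity group empty")
    return auc(low, high)


# Made-up data: copy number-based purities and ESTIMATE scores for six samples.
purities = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
scores = [-1000, -500, 200, 0, 900, 1500]
# Hypothetical cutoffs; the caption does not state the four values used.
results = {cutoff: auc_at_cutoff(scores, purities, cutoff) for cutoff in (0.65, 0.75)}
```

Repeating the calculation over several cutoffs gives one AUC per cutoff, matching the one-curve-per-cutoff layout of panels (a) and (b).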
Figure 5. Correlation of scores with histological findings.
Scatterplots of stromal, immune and ESTIMATE scores and ABSOLUTE-based tumour purity versus the following histological findings: percentage of stromal cells (upper left corner), percentage of infiltrating lymphocytes (upper right corner), and percentage of tumour cells (bottom panels). Twenty-eight TCGA ovarian cancer samples used in the gene selection were excluded from this analysis.
Figure 6. Unique distribution of stromal and immune scores.
(a,b) Distinct distributions of (a) stromal and (b) immune scores across different tumour types were observed in the RNAseqV2 and Affymetrix platform data sets. Numbers in parentheses indicate the sample size of each data set.


Source: PubMed
