Capturing heterogeneity in gene expression studies by surrogate variable analysis
Jeffrey T Leek, John D Storey
Abstract
It has been shown unambiguously that genetic, environmental, demographic, and technical factors can have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Unmodeled heterogeneity can reduce power, induce unwanted dependence across genes, and introduce spurious signal into many genes. These problems arise even in well-designed, randomized studies. We introduce "surrogate variable analysis" (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
Conflict of interest statement
Competing interests. The authors have declared that no competing interests exist.
Source: PubMed