Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods
Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, Chunyu Liu
Abstract
The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.
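The probe effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or data: the array sizes, the noise levels, and the helper names `standardize_probes` and `mean_pairwise_sample_correlation` are all assumptions made for illustration. It simulates unrelated samples that share only a probe-specific baseline, then compares the mean sample-to-sample Pearson correlation before and after each probe is z-scored across samples.

```python
import numpy as np


def standardize_probes(expr):
    """Z-score each probe (row) across samples (columns)."""
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # constant probes become all zeros instead of dividing by zero
    return (expr - mean) / sd


def mean_pairwise_sample_correlation(expr):
    """Average Pearson correlation over all distinct pairs of samples (columns)."""
    corr = np.corrcoef(expr, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)
    return corr[upper].mean()


# Hypothetical simulation settings, chosen only for illustration.
rng = np.random.default_rng(0)
n_probes, n_samples = 2000, 8

# Each probe has its own baseline intensity, shared by every sample.
probe_effect = rng.normal(8.0, 2.0, size=(n_probes, 1))
# Sample-specific noise is independent, so the samples are biologically unrelated.
noise = rng.normal(0.0, 0.5, size=(n_probes, n_samples))
expr = probe_effect + noise

raw_r = mean_pairwise_sample_correlation(expr)
std_r = mean_pairwise_sample_correlation(standardize_probes(expr))

print(f"mean correlation, raw data:            {raw_r:.3f}")
print(f"mean correlation, probe-standardized:  {std_r:.3f}")
```

Under these assumptions the raw correlation is high even though the samples share no signal beyond the probe baseline, and it drops sharply once each probe is standardized. The standardized value tends to fall slightly below zero, because centering each probe across a small number of samples forces the values in every row to sum to zero.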
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Source: PubMed