Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods

Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, Chunyu Liu, Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, Chunyu Liu

Abstract

The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. PVCA results in VAS data.
Figure 1. PVCA results in VAS data.
The contribution of each factor to the overall variation was estimated by PVCA. All the effects, including batch effects, Profile effects, interaction between batch and Profile effects, and residuals, were estimated for their contribution to the overall variation. A. Data without batch adjustment. B. Data processed by ComBat_p as batch adjustment tool/model. C. Data processed by ComBat_n. D. Data processed by PAMR. E. Data processed by DWD. F. Data processed by Ratio_G. G. Data processed by SVA.
Figure 2. PVCA results in SMRI data.
Figure 2. PVCA results in SMRI data.
The contribution of each factor to the overall variation was estimated by PVCA, A. Data without batch adjustment. B. Data preprocessed by RMA with ComBat_n as batch adjustment tool/model. C. Data preprocessed by RMA with ComBat_p. D. Data preprocessed by RMA with PAMR. E. Data preprocessed by RMA with DWD. F. Data preprocessed by RMA with Ratio_G. G. Data preprocessed by RMA with SVA.
Figure 3. Distribution of SMRI ICCs after…
Figure 3. Distribution of SMRI ICCs after transformation.
Boxplots of the distribution of z-scores transformed from intraclass correlation coefficients of probe set expression levels between three SMRI technical replicates. The methods are listed along the X axis. The Y axis is the distributions of all probe sets' ICC z-scores. The top of the box represents the top of the third quartile, the bottom of the box represents the bottom of the first quartile, the middle bar is the median value, box whiskers extend to 1.5 times the interquartile range from the box and circles are possible outliers. Δmedian indicates the median difference of z-score distributions between RMA data and data that has been processed with both RMA and the batch-adjustment method. Except for SVA, all batch adjustment methods significantly increased z-scores (p

Figure 4. Correlation between the nominal fold…

Figure 4. Correlation between the nominal fold changes and observed fold changes in AAS data.

Figure 4. Correlation between the nominal fold changes and observed fold changes in AAS data.
Correlation between the nominal fold changes and observed fold changes in RMA data and data after batch adjustment programs. We simulated 1200 genes out of 10000 genes as differentially expressed, with log 2 fold change range −1.58, −1.32, −1, −0.58, −0.26, −0.14 and 0.14, 0.26, 0.58, 1, 1.32, 1.58, responding to fold changes that range −3, −2.5, −2, −1.5, −1.2, −1.1 and 1.1,1.2,1.5,2,2.5,3 to reflect the approximate number of differentially expressed genes in the real data. The regression slopes were shown in colors by different program. Correlation coefficients (r2) were shown in legend, separately.

Figure 5. ROC curves in AAS data.

Figure 5. ROC curves in AAS data.

ROC curves are graphical representations of both specificity…

Figure 5. ROC curves in AAS data.
ROC curves are graphical representations of both specificity and sensitivity that take into account both differentially and non-differentially expressed genes. ComBat_p and ComBat_n performed almost identically, so their curves overlap each other almost completely.
Figure 4. Correlation between the nominal fold…
Figure 4. Correlation between the nominal fold changes and observed fold changes in AAS data.
Correlation between the nominal fold changes and observed fold changes in RMA data and data after batch adjustment programs. We simulated 1200 genes out of 10000 genes as differentially expressed, with log 2 fold change range −1.58, −1.32, −1, −0.58, −0.26, −0.14 and 0.14, 0.26, 0.58, 1, 1.32, 1.58, responding to fold changes that range −3, −2.5, −2, −1.5, −1.2, −1.1 and 1.1,1.2,1.5,2,2.5,3 to reflect the approximate number of differentially expressed genes in the real data. The regression slopes were shown in colors by different program. Correlation coefficients (r2) were shown in legend, separately.
Figure 5. ROC curves in AAS data.
Figure 5. ROC curves in AAS data.
ROC curves are graphical representations of both specificity and sensitivity that take into account both differentially and non-differentially expressed genes. ComBat_p and ComBat_n performed almost identically, so their curves overlap each other almost completely.

References

    1. Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. Nature Genetics. 1999;21:33–37.
    1. Lockhart DJ, Dong HL, Byrne MC, Follettie MT, Gallo MV, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology. 1996;14:1675–1680.
    1. Schena M, Shalon D, Davis RW, Brown PO. Quantitative Monitoring of Gene-Expression Patterns with a Complementary-DNA Microarray. Science. 1995;270:467–470.
    1. Schena M, Shalon D, Heller R, Chai A, Brown PO, et al. Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proceedings of the National Academy of Sciences of the United States of America. 1996;93:10614–10619.
    1. Sims AH. Bioinformatics and breast cancer: what can high-throughput genomic approaches actually tell us? Journal of Clinical Pathology. 2009;62:879–885.
    1. Kerr MK. Design considerations for efficient and effective microarray studies. Biometrics. 2003;59:822–828.
    1. Lander ES. Array of hope. Nature Genetics. 1999;21:3–4.
    1. Fare TL, Coffey EM, Dai HY, He YDD, Kessler DA, et al. Effects of atmospheric ozone on microarray data quality. Analytical Chemistry. 2003;75:4672–4675.
    1. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research. 2002;30:207–210.
    1. Scherer A, editor. Batch Effects and Noise in Microarray Experiments: Sources and Solutions. John Wiley and Sons. 2009.
    1. Benito M, Parker J, Du Q, Wu JY, Xang D, et al. Adjustment of systematic microarray data biases. Bioinformatics. 2004;20:105–114.
    1. Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, et al. The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets - improving meta-analysis and prediction of prognosis. -Bmc Medical Genomics. 2008;1
    1. Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. Plos Genetics. 2007;3:1724–1735.
    1. Luo J, Schumacher M, Scherer A, Sanoudou D, Megherbi D, et al. A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics Journal. 2010:S48–S61.
    1. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127.
    1. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences of the United States of America. 2001;98:31–36.
    1. Alter O, Brown PO, Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America. 2000;97:10101–10106.
    1. Torrey EF, Webster M, Knable M, Johnston N, Yolken RH. The Stanley Foundation brain collection and Neuropathology Consortium. Schizophrenia Research. 2000;44:151–155.
    1. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264.
    1. Boedigheimer MJ, Wolfinger RD, Bass MB, Bushel PR, Chou JW, et al. Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories. BMC Genomics. 2008;9
    1. McCall MN, Irizarry RA. Consolidated strategy for the analysis of microarray spike-in data. Nucleic Acids Research. 2008;36
    1. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1:30–46.
    1. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, et al. Multiple-laboratory comparison of microarray platforms (vol 2, pg 345, 2005). Nature Methods. 2005;2:477–477.
    1. Hanley JA, Mcneil BJ. A Method of Comparing the Areas under Receiver Operating Characteristic Curves Derived from the Same Cases. Radiology. 1983;148:839–843.
    1. Kang HM, Ye C, Eskin E. Accurate Discovery of Expression Quantitative Trait Loci Under Confounding From Spurious and Genuine Regulatory Hotspots. Genetics. 2008;180:1909–1925.
    1. Scharpf RB, Ruczinski I, Carvalho B, Doan B, Chakravarti A, et al. A multilevel model to address batch effects in copy number estimation using SNP arrays. Biostatistics. 2011;12:33–50.
    1. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, et al. The connectivity map: Using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313:1929–1935.
    1. Kitchen RR, Sabine VS, Sims AH, Macaskill EJ, Renshaw L, et al. Correcting for intra-experiment variation in Illumina BeadChip data is necessary to generate robust gene-expression profiles. Bmc Genomics. 2010;11
    1. Shi LM, Campbell G, Jones WD, Campagne F, Wen ZN, et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology. 2010:S5–S16.
    1. Affymetrix Expression Console Software website. Jan.28 Available: . Accessed 2011.
    1. Zhang DD, Cheng LJ, Badner JA, Chen C, Chen Q, et al. Genetic Control of Individual Differences in Gene-Specific Methylation in Human Brain. American Journal of Human Genetics. 2010;86:411–419.
    1. R.Sokal R, Rolf J. Biometry - Principles and Practice of Statistics in Biological Research. 1995. 877 3rd edition. W. H. Freeman and Co.
    1. Ogle DH. NCStats: Helper Functions for Statistics at Northland College. R package version. 2010;02-0
    1. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–3941.
    1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004;5:R80.

Source: PubMed

3
Předplatit