limma powers differential expression analyses for RNA-sequencing and microarray studies

Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, Gordon K Smyth, Matthew E Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, Gordon K Smyth

Abstract

limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Figures

**Figure 1.**
Schematic of the major components that are central to any *limma* analysis. For each gene g, we have a vector of gene expression values (yg) and a design matrix X that relates these values to some coefficients of interest (βg). The *limma* package includes statistical methods that (i) facilitate information borrowing using empirical Bayes methods to obtain posterior variance estimators (), (ii) incorporate observation weights (wgj where j refers to sample) to allow for variations in data quality, (iii) allow variance modelling to accommodate technical or biological heterogeneity that may be present and (iv) pre-processing methods such as variance stabilization to reduce noise. These methods all help improve inference at both the gene and gene set level in small experiments.

**Figure 2.**
The *limma* workflow. The diagram shows the main steps in a gene expression analysis, along with individual functions that might be used and the corresponding classes used to store data or results. Online documentation pages are available both for each individual function and for each major step.

**Figure 3.**
Example diagnostic plots produced by *limma*. (A) Plot of variability versus count size for RNA-seq data, generated by voom with plot=TRUE. This plot shows that technical variability decreases with count size. Total variability asymptotes to biological variability as count sizes increases. (B) Mean-difference plot produced by the plotMA function for a two-colour microarray. The plot highlights negative (NC), constant (DR) and differentially expressed (D03, D10, U03, U10) spike-in controls. Regular probes are non-highlighted. (C) Multidimensional scaling (MDS) plot of a set of 30 microarrays, generated by plotMDS. All arrays are biologically identical and the plot reveals strong batch effects. Distances represent leading log2-fold changes between samples.

**Figure 4.**
Example plots displaying results from DE and gene set analyses. (A) Volcano plot showing fold changes and posterior odds of DE for a particular comparison (RUNX1 over-expression versus wild-type in this case), generated by volcanoplot. Probes with P < 0.00001 are highlighted in red. (B) Venn diagram showing overlap in the number of DE genes for three comparisons from the same study as (A), generated by the vennDiagram function. (C) Gene set enrichment plot produced by barcodeplot. The central bar orders differentially expressed genes by significance from up to down upon Pax5 restoration in an RNA-seq experiment (7). The vertical bars mark genes that are induced (red) or repressed (blue) upon the transition from large cycling pre-B cells to small resting pre-B cells during normal B cell development according to the published literature (47). The plot shows a strong positive concordance between Pax5 restoration and the large to small cell transition. The roast function can be used to assign statistical significance to this correlation.

References

1. Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J., et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
1. Smyth G. Limma: linear models for microarray data. In: Gentleman R., Carey V., Dudoit S., Irizarry R., Huber W., editors. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer; 2005. pp. 397–420.
1. Peart M., Smyth G., Van Laar R., Bowtell D., Richon V., Marks P., Holloway A., Johnstone R. Identification and functional significance of genes regulated by structurally different histone deacetylase inhibitors. Proc. Natl. Acad. Sci. U.S.A. 2005;102:3697–3702.
1. Caiazzo M., Dell'Anno M.T., Dvoretskova E., Lazarevic D., Taverna S., Leo D., Sotnikova T.D., Menegon A., Roncaglia P., Colciago G., et al. Direct generation of functional dopaminergic neurons from mouse and human fibroblasts. Nature. 2011;476:224–227.
1. Hubert F., Kinkel S., Crewther P., Cannon P., Webster K., Link M., Uibo R., O'Bryan M., Meager A., Forehan S., et al. Aire-deficient c57bl/6 mice mimicking the common human 13-base pair deletion mutation present with only a mild autoimmune phenotype. J. Immunol. 2009;182:3902–3918.
1. Mannsperger H.A., Gade S., Henjes F., Beissbarth T., Korf U. Rppanalyzer: analysis of reverse-phase protein array data. Bioinformatics. 2010;26:2202–2203.
1. Liu G.J., Cimmino L., Jude J.G., Hu Y., Witkowski M.T., McKenzie M.D., Kartal-Kaess M., Best S.A., Tuohey L., Liao Y., et al. Pax5 loss imposes a reversible differentiation block in B progenitor acute lymphoblastic leukemia. Genes Dev. 2014;28:1337–1350.
1. Su Z., Labaj P.P., Li S., Thierry-Mieg J., Thierry-Mieg D., Shi W., Wang C., Schroth G.P., Setterquist R.A., Thompson J.F., et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat. Biotechnol. 2014;32:903–914.
1. Pickrell J.K., Marioni J.C., Pai A.A., Degner J.F., Engelhardt B.E., Nkadori E., Veyrieras J.B., Stephens M., Gilad Y., Pritchard J.K. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772.
1. Law C.W., Chen Y., Shi W., Smyth G.K. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29.
1. Ritchie M., Diyagama D., Neilson J., Van Laar R., Dobrovic A., Holloway A., Smyth G. Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics. 2006;7:261.
1. Yang Y.H., Dudoit S., Luu P., Speed T.P. Normalization for cDNA microarray data. In: Bittner ML, Chen Y, Dorsel AN, Dougherty ER, editors. Microarrays: Optical Technologies and Informatics. Vol. 4266. San Jose, CA: Proceedings of SPIE; 2001. pp. 141–152.
1. Michaud J., Simpson K., Escher R., Buchet-Poyau K., Beissbarth T., Carmichael C., Ritchie M., Schütz F., Cannon P., Liu M., et al. Integrative analysis of runx1 downstream pathways and target genes. BMC Genomics. 2008;9:363.
1. Efron B., Morris C. Stein's estimation rule and its competitors—an empirical Bayes approach. J. Am. Stat. Assoc. 1973;68:117–130.
1. Morris C.N. Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 1983;78:47–55.
1. Smyth G. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004;3 Article 3.
1. Sartor M.A., Tomlinson C.R., Wesselkamper S.C., Sivaganesan S., Leikauf G.D., Medvedovic M. Intensity-based hierarchical bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics. 2006;7:538.
1. Phipson B., Lee S., Majewski I.J., Alexander W.S., Smyth G.K. Technical Report. Melbourne, Australia: Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research; 2013. Empirical Bayes in the presence of exceptional cases, with application to microarray data. .
1. Phipson B. Ph.D. Thesis. University of Melbourne: Department of Mathematics and Statistics; 2013. Empirical Bayes modelling of expression profiles and their associations.
1. Robinson M., McCarthy D., Smyth G. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140.
1. Rapaport F., Khanin R., Liang Y., Pirun M., Krek A., Zumbo P., Mason C.E., Socci N.D., Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14:R95.
1. Silver J., Ritchie M., Smyth G. Microarray background correction: maximum likelihood estimation for the normal–exponential convolution. Biostatistics. 2009;10:352–363.
1. Ritchie M., Silver J., Oshlack A., Holmes M., Diyagama D., Holloway A., Smyth G. A comparison of background correction methods for two-colour microarrays. Bioinformatics. 2007;23:2700–2707.
1. Shi W., Oshlack A., Smyth G. Optimizing the noise versus bias trade-off for Illumina Whole Genome Expression Beadchips. Nucleic Acids Res. 2010;38:e204.
1. Martin B.J., Altman D.G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327:307–310.
1. Cleveland W.S. Visualizing Data. Murray Hill, NJ: AT&T Bell Laboratories; 1993.
1. Dudoit S., Yang Y.H., Callow M.J., Speed T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin. 2002;12:111–140.
1. Bolstad B.M., Irizarry R.A., Åstrand M., Speed T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193.
1. Liao Y., Smyth G.K., Shi W. featureCounts: an efficient general-purpose read summarization program. Bioinformatics. 2014;30:923–930.
1. Anders S., Pyl P.T., Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–169.
1. Li B., Dewey C. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
1. Liao Y., Smyth G.K., Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013;41:e108.
1. Kooperberg C., Fazzio T.G., Delrow J.J., Tsukiyama T. Improved background correction for spotted DNA microarrays. J. Comput. Biol. 2002;9:55–66.
1. Shi W., De Graaf C., Kinkel S., Achtman A., Baldwin T., Schofield L., Scott H., Hilton D., Smyth G. Estimating the proportion of microarray probes expressed in an RNA sample. Nucleic Acids Res. 2010;38:2168–2176.
1. Yang Y.H., Dudoit S., Luu P., Lin D.M., Peng V., Ngai J., Speed T.P. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002;30:e15.
1. Yang Y.H., Thorne N.P. Normalization for two-color cDNA microarray data. In: Goldstein D.R., editor. Science and Statistics: A Festschrift for Terry Speed. Vol. 40. Institute of Mathematical Statistics Lecture Notes – Monograph Series; 2003. pp. 403–418.
1. Smyth G.K., Altman N.S. Separate-channel analysis of two-channel microarrays: recovering inter-spot information. BMC Bioinformatics. 2013;14:165.
1. Smyth G., Speed T. Normalization of cDNA microarray data. Methods. 2003;31:265–273.
1. Oshlack A., Emslie D., Corcoran L., Smyth G. Normalization of boutique two-color microarrays with a high proportion of differentially expressed probes. Genome Biol. 2007;8:R2.
1. Wu D., Hu Y., Tong S., Williams B.R., Smyth G.K., Gantier M.P. The use of miRNA microarrays for the analysis of cancer samples with global miRNA decrease. RNA. 2013;19:876–888.
1. Robinson M.D., Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25.
1. Hansen K.D., Irizarry R.A., Zhijin W. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012;13:204–216.
1. Ritchie M.E. Ph.D. Thesis. University of Melbourne: Department of Medical Biology; 2004. Quantitative quality control and background correction for two-colour microarray data.
1. Smyth G., Michaud J., Scott H. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005;21:2067–2075.
1. McCarthy D.J., Smyth G.K. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25:765–771.
1. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57:289–300.
1. Hoffmann R., Seidl T., Neeb M., Rolink A., Melchers F. Changes in gene expression profiles in developing b cells of murine bone marrow. Genome Res. 2002;12:98–111.
1. Mosig M.O., Lipkin E., Khutoreskaya G., Tchourzyna E., Soller M., Friedmann A. A whole genome scan for quantitative trait loci affecting milk protein percentage in Israeli-Holstein cattle, by means of selective milk DNA pooling in a daughter design, using an adjusted false discovery rate criterion. Genetics. 2001;157:1683–1698.
1. Nettleton D., Hwang J.G., Caldo R.A., Wise R.P. Estimating the number of true null hypotheses from a histogram of p values. J. Agric. Biol. Environ. Stat. 2006;11:337–356.
1. Langaas M., Lindqvist B., Ferkingstad E. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. R. Stat. Soc. Ser. B. 2005;67:555–572.
1. Majewski I.J., Ritchie M.E., Phipson B., Corbin J., Pakusch M., Ebert A., Busslinger M., Koseki H., Hu Y., Smyth G.K., et al. Opposing roles of polycomb repressive complexes in hematopoietic stem and progenitor cells. Blood. 2010;116:731–739.
1. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29.
1. Young M., Wakefield M., Smyth G., Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11:R14.
1. Tarca A.L., Bhatti G., Romero R. A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLOS ONE. 2013;8:e79217.
1. Wu D., Lim E., Vaillant F., Asselin-Labat M., Visvader J., Smyth G. ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics. 2010;26:2176–2182.
1. Wu D., Smyth G. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 2012;40:e133.
1. Wu D., Pang Y., Wilkerson M.D., Wang D., Hammerman P.S., Liu J.S. Gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity. Br. J. Cancer. 2013;109:1599–608.
1. Langsrud Ø. Rotation tests. Stat. Comput. 2005;15:53–60.
1. Lim E., Wu D., Pal B., Bouras T., Asselin-Labat M., Vaillant F., Yagita H., Lindeman G., Smyth G., Visvader J. Transcriptome analyses of mouse and human mammary cell subpopulations reveal multiple conserved genes and pathways. Breast Cancer Res. 2010;12:R21.
1. Lim E., Vaillant F., Wu D., Forrest N., Pal B., Hart A., Asselin-Labat M., Gyorki D., Ward T., Partanen A., et al. Aberrant luminal progenitors as the candidate target population for basal tumor development in BRCA1 mutation carriers. Nat. Med. 2009;15:907–913.
1. Asselin-Labat M., Vaillant F., Sheridan J., Pal B., Wu D., Simpson E., Yasuda H., Smyth G., Martin T., Lindeman G., et al. Control of mammary stem cell function by steroid hormone signalling. Nature. 2010;465:798–802.
1. Mootha V.K., Lindgren C.M., Eriksson K.F., Subramanian A., Sihag S., Lehar J., Puigserver P., Carlsson E., Ridderstrale M., Laurila E., et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 2003;34:267–273.
1. Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S., et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005;102:15545–15550.
1. Liberzon A., Subramanian A., Pinchback R., Thorvaldsdóttir H., Tamayo P., Mesirov J.P. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–1740.
1. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2014.
1. Wettenhall J.M., Smyth G.K. limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics. 2004;20:3705–3706.
1. Wettenhall J.M., Simpson K.M., Satterley K., Smyth G.K. affylmGUI: a graphical user interface for linear modeling of single channel microarray data. Bioinformatics. 2006;22:897–899.
1. Xia X., McClelland M., Wang Y. Webarray: an online platform for microarray data analysis. BMC Bioinformatics. 2005;6:306.
1. Psarros M., Heber S., Sick M., Thoppae G., Harshman K., Sick B. RACE: Remote Analysis Computation for gene Expression data. Nucleic Acids Res. 2005;33:W638–W643.
1. Rainer J., Sanchez-Cabo F., Stocker G., Sturn A., Trajanoski Z. CARMAweb: comprehensive R- and Bioconductor-based web service for microarray data analysis. Nucleic Acids Res. 2006;34:W498–W503.
1. Lemoine S., Combes F., Servant N., Le Crom S. Goulphar: rapid access and expertise for standard two-color microarray normalization methods. BMC Bioinformatics. 2006;7:467.
1. Rehrauer H., Zoller S., Schlapbach R. MAGMA: analysis of two-channel microarrays made easy. Nucleic Acids Res. 2007;35:W86–W90.
1. Diaz-Uriarte R., Alibes A., Morrissey E.R., Canada A., Rueda O.M., Neves M.L. Asterias: integrated analysis of expression and aCGH data using an open-source, web-based, parallelized software suite. Nucleic Acids Res. 2007;35:W75–W80.
1. De Groot P.J., Reiff C., Mayer C., Muller M. NuGO contributions to GenePattern. Genes Nutr. 2008;3:143–146.
1. Petryszak R., Burdett T., Fiorelli B., Fonseca N.A., Gonzalez-Porta M., Hastings E., Huber W., Jupp S., Keays M., Kryvych N., et al. Expression Atlas update—a database of gene and transcript expression from microarray and sequencing-based functional genomics experiments. Nucleic Acids Res. 2014;42:D926–D932.
1. Choi J. Guide: a desktop application for analysing gene expression data. BMC Genomics. 2013;14:688.
1. Leisch F. Sweave: dynamic generation of statistical reports using literate data analysis. In: Härdle W., Rönz B., editors. Compstat 2002—Proceedings in Computational Statistics. Heidelberg: Physica Verlag; 2002. pp. 575–580.
1. Xie Y. Dynamic Documents with R and knitr. Boca Raton, FL: CRC Press; 2013.
1. Gentleman R. Reproducible research: a bioinformatics case study. Stat. Appl. Genet. Mol. Biol. 2005;4 Article 2.
1. Brusniak M.Y., Bodenmiller B., Campbell D., Cooke K., Eddes J., Garbutt A., Lau H., Letarte S., Mueller L.N., Sharma V., et al. Corra: computational framework and tools for LC-MS discovery and targeted mass spectrometry-based proteomics. BMC Bioinformatics. 2008;9:542.
1. Lun A.T., Smyth G.K. De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res. 2014;42:e95.
1. Phipson B., Oshlack A. DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging. Genome Biol. 2014;15:465.

Source: PubMed

limma powers differential expression analyses for RNA-sequencing and microarray studies

Abstract

Figures

References

赞助者和合作者

医疗条件

药物干预