voom: Precision weights unlock linear model analysis tools for RNA-seq read counts

Charity W Law, Yunshun Chen, Wei Shi, Gordon K Smyth, Charity W Law, Yunshun Chen, Wei Shi, Gordon K Smyth

Abstract

New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.

Figures

Figure 1
Figure 1
Mean-variance relationships. Gene-wise means and variances of RNA-seq data are represented by black points with a LOWESS trend. Plots are ordered by increasing levels of biological variation in datasets. (a) voom trend for HBRR and UHRR genes for Samples A, B, C and D of the SEQC project; technical variation only. (b) C57BL/6J and DBA mouse experiment; low-level biological variation. (c) Simulation study in the presence of 100 upregulating genes and 100 downregulating genes; moderate-level biological variation. (d) Nigerian lymphoblastoid cell lines; high-level biological variation. (e)Drosophila melanogaster embryonic developmental stages; very high biological variation due to systematic differences between samples. (f) LOWESS voom trends for datasets (a)–(e). HBRR, Ambion’s Human Brain Reference RNA; LOWESS, locally weighted regression; UHRR, Stratagene’s Universal Human Reference RNA.
Figure 2
Figure 2
voom mean-variance modeling.(a) Gene-wise square-root residual standard deviations are plotted against average log-count. (b) A functional relation between gene-wise means and variances is given by a robust LOWESS fit to the points. (c) The mean-variance trend enables each observation to map to a square-root standard deviation value using its fitted value for log-count. LOWESS, locally weighted regression.
Figure 3
Figure 3
Type I error rates in the absence of true differential expression. The bar plots show the proportion of genes with P<0.01 for each method (a) when the library sizes are equal and (b) when the library sizes are unequal. The red line shows the nominal type I error rate of 0.01. Results are averaged over 100 simulations. Methods that control the type I error at or below the nominal level should lie below the red line.
Figure 4
Figure 4
Power to detect true differential expression. Bars show the total number of genes that are detected as statistically significant (FDR < 0.1) (a) with equal library sizes and (b) with unequal library sizes. The blue segments show the number of true positives while the red segments show false positives. 200 genes are genuinely differentially expressed. Results are averaged over 100 simulations. Height of the blue bars shows empirical power. The ratio of the red to blue segments shows empirical FDR. FDR, false discovery rate.
Figure 5
Figure 5
False discovery rates. The number of false discoveries is plotted for each method versus the number of genes selected as differentially expressed. Results are averaged over 100 simulations (a) with equal library sizes and (b) with unequal library sizes. voom has the lowest FDR at any cutoff in either scenario. FDR, false discovery rate.
Figure 6
Figure 6
False discovery rates evaluated from SEQC spike-in data. The number of false discoveries is plotted for each method versus the number of genes selected as differentially expressed. voom has the lowest false discovery rate overall.
Figure 7
Figure 7
Computing times of RNA-seq methods. Bars show time in seconds required for the analysis of one simulated dataset on a MacBook laptop. Methods are ordered from quickest to most expensive.
Figure 8
Figure 8
MA plot of male vs female comparison with male- and female-specific genes highlighted. The MA plot was produced by the limma plotMA function, and is a scatterplot of log-fold-change versus average log-cpm for each gene. Genes on the male-specific region of the Y chromosome genes are highlighted blue and are consistently upregulated in males, while genes on the X chromosome reported to escape X inactivation are highlighted red and are generally down in males. log-cpm, log-counts per million.
Figure 9
Figure 9
Multidimensional scaling plot ofDrosophila melanogaster embryonic stages. Distances are computed from the log-cpm values. The 12 successive embryonic developmental stages are labeled 1 to 12, from earliest to latest.
Figure 10
Figure 10
Number of genes associated with eachDrosophila melanogaster embryonic stage. The number of genes whose peak estimated expression occurs at each of the stages is recorded.
Figure 11
Figure 11
Expression trends for genes that peak at eachDrosophila melanogaster embryonic stage. Panels (1) to (12) correspond to the 12 successive developmental stages. Each panel displays the fitted expression trends for the top ten genes that achieve their peak expression during that stage. In particular, panel (1) shows genes that are most highly expressed at the first stage and panel (12) shows genes most highly expressed at the last stage. Panels (7) and (8) are notable because they show genes with marked peaks at 12–14 hours and 14–16 hours respectively.

References

    1. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001;98:5116–5121. doi: 10.1073/pnas.091062498.
    1. Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics. 2003;19:2448–2455. doi: 10.1093/bioinformatics/btg345.
    1. Smyth G. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article 3.
    1. Cui X, Hwang JG, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005;6:59–75. doi: 10.1093/biostatistics/kxh018.
    1. Smyth G, Michaud J, Scott H. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005;21:2067–2075. doi: 10.1093/bioinformatics/bti270.
    1. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
    1. Wu D, Lim E, Vaillant F, Asselin-Labat M, Visvader J, Smyth G. ROAST rotation gene set tests for complex microarray experiments. Bioinformatics. 2010;26:2176–2182. doi: 10.1093/bioinformatics/btq401.
    1. Wu D, Smyth G. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 2012;40:e133–e133. doi: 10.1093/nar/gks461.
    1. Smyth G. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor. Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W, editor. New York: Springer; 2005. Limma: linear models for microarray data; pp. 397–420.
    1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. doi: 10.1038/nrg2484.
    1. Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, Mckernan KJ, Grimmond SM. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods. 2008;5:613–619. doi: 10.1038/nmeth.1223.
    1. Robinson M, McCarthy D, Smyth G. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
    1. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106.
    1. Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol. 2010;11:220. doi: 10.1186/gb-2010-11-12-220.
    1. Perkins TT, Kingsley RA, Fookes MC, Gardner PP, James KD, Yu L, Assefa SA, He M, Croucher NJ, Pickard DJ, Maskell DJ, Parkhill J, Choudhary J, Thomson NR, Dougan G. A strand-specific RNA-seq analysis of the transcriptome of the typhoid bacillus Salmonella Typhi . PLoS Genet. 2009;5:e1000569. doi: 10.1371/journal.pgen.1000569.
    1. Han X, Wu X, Chung WY, Li T, Nekrutenko A, Altman NS, Chen G, Ma H. Transcriptome of embryonic and neonatal mouse cortex by high-throughput RNA sequencing. Proc Natl Acad Sci. 1274;106:1–12746.
    1. Parikh A, Miranda ER, Katoh-Kurasawa M, Fuller D, Rot G, Zagar L, Curk T, Sucgang R, Chen R, Zupan B, Loomis WF, Kuspa A, Shaulsky G. Conserved developmental transcriptomes in evolutionarily divergent species. Genome Biol. 2010;11:R35. doi: 10.1186/gb-2010-11-3-r35.
    1. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9:321–332.
    1. Zhou YH, Xia K, Wright FA. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011;27:2672–2678. doi: 10.1093/bioinformatics/btr449.
    1. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887. doi: 10.1093/bioinformatics/btm453.
    1. Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422. doi: 10.1186/1471-2105-11-422.
    1. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics. 2013;14:232–243. doi: 10.1093/biostatistics/kxs033.
    1. Lund S, Nettleton D, McCarthy D, Smyth G. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012;11:Article 8.
    1. Srivastava S, Chen L. A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res. 2010;38:e170. doi: 10.1093/nar/gkq670.
    1. Auer PL, Doerge RW. A two-stage Poisson model for testing RNA-seq data. Stat Appl Genet Mol Biol. 2011;10:Article 26.
    1. Li J, Witten D, Johnstone I, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012;13:523–538. doi: 10.1093/biostatistics/kxr031.
    1. Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics. 2012;13:484. doi: 10.1186/1471-2164-13-484.
    1. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91. doi: 10.1186/1471-2105-14-91.
    1. Ritchie M, Diyagama D, Neilson J, Van Laar R, Dobrovic A, Holloway A, Smyth G. Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics. 2006;7:261. doi: 10.1186/1471-2105-7-261.
    1. McCullagh P, Nelder JA. Generalized Linear Models. Boca Raton: Chapman & Hall/CRC; 1989.
    1. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61:439–447.
    1. Carroll RJ, Ruppert D. A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. J Am Stat Assoc. 1982;77:878–882. doi: 10.1080/01621459.1982.10477901.
    1. Nelder JA, Pregibon D. An extended quasi-likelihood function. Biometrika. 1987;74:221–232. doi: 10.1093/biomet/74.2.221.
    1. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288–4297. doi: 10.1093/nar/gks042.
    1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
    1. Sartor MA, Tomlinson CR, Wesselkamper SC, Sivaganesan S, Leikauf GD, Medvedovic M. Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics. 2006;7:538. doi: 10.1186/1471-2105-7-538.
    1. Sequencing Quality Control (SEQC) Project. [ ]
    1. Ambion FirstChoice Human Brain Reference RNA. [ ]
    1. Baker SC, Bauer SR, Beyer RP, Brenton JD, Bromley B, Burrill J, Causton H, Conley MP, Elespuru R, Fero M, Foy C, Fuscoe J, Gao X, Gerhold DL, Gilles P, Goodsaid F, Guo X, Hackett J, Hockett RD, Ikonomi P, Irizarry RA, Kawasaki ES, Kaysser-Kranich T, Kerr K, Kiser G, Koch WH, Lee KY, Liu C, Liu ZL, Lucas A. et al.The external RNA controls consortium: a progress report. Nature Methods. 2005;2:731–734. doi: 10.1038/nmeth1005-731.
    1. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872.
    1. Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics. 2013;14:254. doi: 10.1186/1471-2105-14-254.
    1. Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, Repping S, Pyntikova T, Ali J, Bieri T, Chinwalla A, Delehaunty A, Delehaunty K, Du H, Fewell G, Fulton L, Fulton R, Graves T, Hou S-F, Latrielle P, Leonard S, Mardis E, Maupin R, McPherson J, Miner T, Nash W, Nguyen C, Ozersky P, Pepin K, Rock S. et al.The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature. 2003;423:825–837. doi: 10.1038/nature01722.
    1. Gonzalez JR, Esnaola M. tweeDEseqCountData: RNA-seq count data employed in the vignette of the tweeDEseq package. [ ]
    1. Carrel L, Willard HF. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature. 2005;434:400–404. doi: 10.1038/nature03479.
    1. Graveley BR, Brooks N, Carlson JW, Duff MO, Landolin JM, Yang L, Artieri G, van Baren MJ, Boley N, Booth BW, Brown JB, Cherbas L, Davis CA, Dobin A, Li R, Lin W, Malone JH, Mattiuzzo NR, Miller D, Sturgill D, Tuch BB, Zaleski C, Zhang D, Blanchette M, Dudoit S. The developmental transcriptome of Drosophila melanogaster . Nature. 2011;471:473–479. doi: 10.1038/nature09715.
    1. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J. et al.Landscape of transcription in human cells. Nature. 2012;489:101–108. doi: 10.1038/nature11233.
    1. Gonzalez-Porta M, Frankish A, Rung J, Harrow J, Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 2013;14:R70. doi: 10.1186/gb-2013-14-7-r70.
    1. Bera AK, Bilias Y. Rao’s score, Neyman’s C α and Silvey’s LM tests: an essay on historical developments and some new results. J Stat Plann Inference. 2001;97:9–44. doi: 10.1016/S0378-3758(00)00343-8.
    1. Pregibon D. In: GLIM 82 Proceedings of the International Conference on Generalised Linear Models. Gilchrist R, editor. New York: Springer; 1982. Score tests in GLIM with applications; pp. 87–97.
    1. Phipson B, Lee S, Majewski IJ, Alexander WS, Smyth GK. Empirical Bayes in the presence of exceptional cases, with application to microarray data. 2013. [ ]
    1. Oehlert GW. A note on the delta method. Am Statistician. 1992;46:27–29.
    1. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc. 1979;74:829–836. doi: 10.1080/01621459.1979.10481038.
    1. Oshlack A, Emslie D, Corcoran L, Smyth G. Normalization of boutique two-color microarrays with a high proportion of differentially expressed probes. Genome Biol. 2007;8:R2. doi: 10.1186/gb-2007-8-1-r2.
    1. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25. doi: 10.1186/gb-2010-11-3-r25.
    1. Gale WA, Sampson G. Good-Turing frequency estimation without tears. J Quant Linguist. 1995;2:217–237. doi: 10.1080/09296179508590051.
    1. Law CW, Chen Y, Shi W, Smyth GK. Supplementary information for ‘Voom: precision weights unlock linear model analysis tools for RNA-seq read counts’. [ ]
    1. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK. et al.The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnol. 2006;24:1151–1161. doi: 10.1038/nbt1239.
    1. Liao Y, Smyth GK, Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 2013;41:e108. doi: 10.1093/nar/gkt214.
    1. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general-purpose read summarization program. Bioinformatics. 2013. [ ]
    1. Shi W, Liao Y. Rsubread: a super fast, sensitive and accurate read aligner for mapping next-generation sequencing reads. [ ]
    1. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185.
    1. Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011;12:449. doi: 10.1186/1471-2105-12-449.
    1. Frazee A, Langmead B, Leek J. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. [ ]
    1. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007;23:257–258. doi: 10.1093/bioinformatics/btl567.
    1. Carlson M. org.Dm.eg.db: Genome wide annotation for Fly. [ ]
    1. Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney S K Hitzemann R. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One. 2011;6:e17820. doi: 10.1371/journal.pone.0017820.
    1. Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth GK, Tierney L, Yang JY, Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80.
    1. Bioconductor: open source software for bioinformatics. [ ]
    1. Comprehensive R Archive Network. [ ]
    1. Auer P, Doerge RW. TSPM.R: R code for a two-stage Poisson model for testing RNA-seq data. 2011. [ ~doerge/software/TSPM.R]
    1. Smyth GK. limma: Linear Models for Microarray Data. [ ]

Source: PubMed

3
订阅