Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Michael I Love, Wolfgang Huber, Simon Anders, Michael I Love, Wolfgang Huber, Simon Anders

Abstract

In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html webcite.

Figures

Figure 1
Figure 1
Shrinkage estimation of dispersion. Plot of dispersion estimates over the average expression strength (A) for the Bottomly et al. [16] dataset with six samples across two groups and (B) for five samples from the Pickrell et al. [17] dataset, fitting only an intercept term. First, gene-wise MLEs are obtained using only the respective gene’s data (black dots). Then, a curve (red) is fit to the MLEs to capture the overall trend of dispersion-mean dependence. This fit is used as a prior mean for a second estimation round, which results in the final MAP estimates of dispersion (arrow heads). This can be understood as a shrinkage (along the blue arrows) of the noisy gene-wise estimates toward the consensus represented by the red line. The black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). For clarity, only a subset of genes is shown, which is enriched for dispersion outliers. Additional file 1: Figure S1 displays the same data but with dispersions of all genes shown. MAP, maximum a posteriori; MLE, maximum-likelihood estimate.
Figure 2
Figure 2
Effect of shrinkage on logarithmic fold change estimates. Plots of the (A) MLE (i.e., no shrinkage) and (B) MAP estimate (i.e., with shrinkage) for the LFCs attributable to mouse strain, over the average expression strength for a ten vs eleven sample comparison of the Bottomly et al. [16] dataset. Small triangles at the top and bottom of the plots indicate points that would fall outside of the plotting window. Two genes with similar mean count and MLE logarithmic fold change are highlighted with green and purple circles. (C) The counts (normalized by size factors sj) for these genes reveal low dispersion for the gene in green and high dispersion for the gene in purple. (D) Density plots of the likelihoods (solid lines, scaled to integrate to 1) and the posteriors (dashed lines) for the green and purple genes and of the prior (solid black line): due to the higher dispersion of the purple gene, its likelihood is wider and less peaked (indicating less information), and the prior has more influence on its posterior than for the green gene. The stronger curvature of the green posterior at its maximum translates to a smaller reported standard error for the MAP LFC estimate (horizontal error bar). adj., adjusted; LFC, logarithmic fold change; MAP, maximum a posteriori; MLE, maximum-likelihood estimate.
Figure 3
Figure 3
Stability of logarithmic fold changes.DESeq2 is run on equally split halves of the data of Bottomly et al. [16], and the LFCs from the halves are plotted against each other. (A) MLEs, i.e., without LFC shrinkage. (B) MAP estimates, i.e., with shrinkage. Points in the top left and bottom right quadrants indicate genes with a change of sign of LFC. Red points indicate genes with adjusted P value <0.1. The legend displays the root-mean-square error of the estimates in group I compared to those in group II. LFC, logarithmic fold change; MAP, maximum a posteriori; MLE, maximum-likelihood estimate; RMSE, root-mean-square error.
Figure 4
Figure 4
Hypothesis testing involving non-zero thresholds. Shown are plots of the estimated fold change over average expression strength (“minus over average”, or MA-plots) for a ten vs eleven comparison using the Bottomly et al. [16] dataset, with highlighted points indicating low adjusted P values. The alternate hypotheses are that logarithmic (base 2) fold changes are (A) greater than 1 in absolute value or (B) less than 1 in absolute value. adj., adjusted.
Figure 5
Figure 5
Variance stabilization and clustering after rlog transformation. Two transformations were applied to the counts of the Hammer et al. [26] dataset: the logarithm of normalized counts plus a pseudocount, i.e. f(Kij)= log2(Kij/sj+1), and the rlog. The gene-wise standard deviation of transformed values is variable across the range of the mean of counts using the logarithm (A), while relatively stable using the rlog (B). A hierarchical clustering on Euclidean distances and complete linkage using the rlog (D) transformed data clusters the samples into the groups defined by treatment and time, while using the logarithm-transformed counts (C) produces a more ambiguous result. sd, standard deviation.
Figure 6
Figure 6
Sensitivity and precision of algorithms across combinations of sample size and effect size.DESeq2 and edgeR often had the highest sensitivity of those algorithms that controlled the FDR, i.e., those algorithms which fall on or to the left of the vertical black line. For a plot of sensitivity against false positive rate, rather than FDR, see Additional file 1: Figure S8, and for the dependence of sensitivity on the mean of counts, see Additional file 1: Figure S9. Note that EBSeq filters low-count genes (see main text for details).
Figure 7
Figure 7
Benchmark of false positive calling. Shown are estimates of P(P value<0.01) under the null hypothesis. The FPR is the number of P values less than 0.01 divided by the total number of tests, from randomly selected comparisons of five vs five samples from the Pickrell et al. [17] dataset, with no known condition dividing the samples. Type-I error control requires that the tool does not substantially exceed the nominal value of 0.01 (black line). EBSeq results were not included in this plot as it returns posterior probabilities, which unlike P values are not expected to be uniformly distributed under the null hypothesis. FPR, false positive rate.
Figure 8
Figure 8
Sensitivity estimated from experimental reproducibility. Each algorithm’s sensitivity in the evaluation set (box plots) is evaluated using the calls of each other algorithm in the verification set (panels with grey label).
Figure 9
Figure 9
Precision estimated from experimental reproducibility. Each algorithm’s precision in the evaluation set (box plots) is evaluated using the calls of each other algorithm in the verification set (panels with grey label).

References

    1. Lönnstedt I, Speed T. Replicated microarray data. Stat Sinica. 2002;12:31–46.
    1. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887. doi: 10.1093/bioinformatics/btm453.
    1. McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288–4297. doi: 10.1093/nar/gks042.
    1. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:106. doi: 10.1186/gb-2010-11-10-r106.
    1. Zhou Y-H, Xia K, Wright FA. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011;27:2672–2678. doi: 10.1093/bioinformatics/btr449.
    1. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics. 2013;14:232–243. doi: 10.1093/biostatistics/kxs033.
    1. Hardcastle T, Kelly K. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422. doi: 10.1186/1471-2105-11-422.
    1. Van De Wiel MA, Leday GGR, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013;14:113–128. doi: 10.1093/biostatistics/kxs031.
    1. Boer JM, Huber WK, Sültmann H, Wilmer F, von Heydebreck A, Haas S, Korn B, Gunawan B, Vente A, Füzesi L, Vingron M, Poustka A. Identification and classification of differentially expressed genes in renal cell carcinoma by expression profiling on a global human 31,500-element cDNA array. Genome Res. 2001;11:1861–1870.
    1. DESeq2. []
    1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80.
    1. McCullagh P, Nelder JA. Monographs on Statistics & Applied Probability. London, UK: Chapman & Hall/CRC; 1989. Generalized linear models.
    1. Hansen KD, Irizarry RA, Wu Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012;13:204–216. doi: 10.1093/biostatistics/kxr054.
    1. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-seq data. BMC Bioinformatics. 2011;12:480. doi: 10.1186/1471-2105-12-480.
    1. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:1–25.
    1. Bottomly D, Walter NAR, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-seq and microarrays. PLoS ONE. 2011;6:17820. doi: 10.1371/journal.pone.0017820.
    1. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872.
    1. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York City, USA: Springer; 2009.
    1. Bi Y, Davuluri R. NPEBseq: nonparametric empirical Bayesian-based procedure for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:262. doi: 10.1186/1471-2105-14-262.
    1. Feng J, Meyer CA, Wang Q, Liu JS, Liu XS, Zhang Y. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics. 2012;28:2782–2788. doi: 10.1093/bioinformatics/bts515.
    1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57:289–300.
    1. Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci USA. 2010;107:9546–9551. doi: 10.1073/pnas.0914005107.
    1. McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25:765–771. doi: 10.1093/bioinformatics/btp053.
    1. Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Stat Methods Med Res. 2013;22:519–536. doi: 10.1177/0962280211428386.
    1. Cook RD. Detection of influential observation in linear regression. Technometrics. 1977;19:15–18. doi: 10.2307/1268249.
    1. Hammer P, Banck MS, Amberg R, Wang C, Petznick G, Luo S, Khrebtukova I, Schroth GP, Beyerlein P, Beutler AS. mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome Res. 2010;20:847–860. doi: 10.1101/gr.101204.109.
    1. Frazee A, Langmead B, Leek J. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011;12:449. doi: 10.1186/1471-2105-12-449.
    1. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2012;31:46–53. doi: 10.1038/nbt.2450.
    1. Glaus P, Honkela A, Rattray M. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics. 2012;28:1721–1728. doi: 10.1093/bioinformatics/bts260.
    1. Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012;22:2008–2017. doi: 10.1101/gr.133744.111.
    1. Sammeth M. Complete alternative splicing events are bubbles in splicing graphs. J Comput Biol. 2009;16:1117–1140. doi: 10.1089/cmb.2009.0108.
    1. Pagès H, Bindreither D, Carlson M, Morgan M: SplicingGraphs: create, manipulate, visualize splicing graphs, and assign RNA-seq reads to them2013. Bioconductor package []
    1. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139–140. doi: 10.1093/bioinformatics/btp616.
    1. Zhou X, Lindsay H, Robinson MD. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014;42:e91. doi: 10.1093/nar/gku310.
    1. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, Haag JD, Gould MN, Stewart RM, Kendziorski C. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29:1035–1043. doi: 10.1093/bioinformatics/btt087.
    1. Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:29. doi: 10.1186/gb-2014-15-2-r29.
    1. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2:193–218. doi: 10.1007/BF01908075.
    1. Witten DM. Classification and clustering of sequencing data using a Poisson model. Ann Appl Stat. 2011;5:2493–2518. doi: 10.1214/11-AOAS493.
    1. Irizarry RA, Wu Z, Jaffee HA. Comparison of affymetrix GeneChip expression measures. Bioinformatics. 2006;22:789–794. doi: 10.1093/bioinformatics/btk046.
    1. Asangani IA, Dommeti VL, Wang X, Malik R, Cieslik M, Yang R, Escara-Wilke J, Wilder-Romans K, Dhanireddy S, Engelke C, Iyer MK, Jing X, Wu Y-M, Cao X, Qin ZS, Wang S, Feng FY, Chinnaiyan AM. Therapeutic targeting of BET bromodomain proteins in castration-resistant prostate cancer. Nature. 2014;510:278–282. doi: 10.1038/nature13229.
    1. Stark R, Brown G: DiffBind: differential binding analysis of ChIP-seq peak data2013. Bioconductor package []
    1. Ross-Innes CS, Stark R, Teschendorff AE, Holmes KA, Ali HR, Dunning MJ, Brown GD, Gojis O, Ellis IO, Green AR, Ali S, Chin S-F, Palmieri C, Caldas C, Carroll JS. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature. 2012;481:389–393.
    1. Robinson DG, Chen W, Storey JD, Gresham D. Design and analysis of bar-seq experiments. G3 (Bethesda) 2013;4:11–18. doi: 10.1534/g3.113.008565.
    1. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014;10:1003531. doi: 10.1371/journal.pcbi.1003531.
    1. Vasquez J, Hon C, Vanselow JT, Schlosser A, Siegel TN. Comparative ribosome profiling reveals extensive translational complexity in different Trypanosoma brucei life cycle stages. Nucleic Acids Res. 2014;42:3623–3637. doi: 10.1093/nar/gkt1386.
    1. Zhou Y, Zhu S, Cai C, Yuan P, Li C, Huang Y, Wei W. High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells. Nature. 2014;509:487–491. doi: 10.1038/nature13166.
    1. Cox DR, Reid N. Parameter orthogonality and approximate conditional inference. J R Stat Soc Ser B Methodol. 1987;49:1–39.
    1. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2007;9:321–332. doi: 10.1093/biostatistics/kxm030.
    1. Pawitan Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood. New York City, USA: Oxford University Press; 2001.
    1. Armijo L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac J Math. 1966;16:1–3. doi: 10.2140/pjm.1966.16.1.
    1. Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol. 2011;10:1–28.
    1. Abramowitz M, Stegun I. Handbook of Mathematical Functions. New York, USA: Dover Publications; 1965.
    1. Newton M, Kendziorski C, Richmond C, Blattner F, Tsui K. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001;8:37–52. doi: 10.1089/106652701300099074.
    1. Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18:96–104. doi: 10.1093/bioinformatics/18.suppl_1.S96.
    1. Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002;18:105–110. doi: 10.1093/bioinformatics/18.suppl_1.S105.
    1. Park MY: Generalized linear models with regularization. PhD thesis.Stanford University, Department of Statistics; 2006.
    1. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
    1. Cule E, Vineis P, De Iorio M. Significance testing in ridge regression for genetic data. BMC Bioinformatics. 2011;12:372. doi: 10.1186/1471-2105-12-372.
    1. Cook RD, Weisberg S. Residuals and Influence in Regression. New York, USA: Chapman and Hall/CRC; 1982.
    1. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. Software for computing and annotating genomic ranges. PLoS Comput Biol. 2013;9:1003118. doi: 10.1371/journal.pcbi.1003118.
    1. Pagès H, Obenchain V, Morgan M: GenomicAlignments: Representation and manipulation of short genomic alignments2013. Bioconductor package []
    1. Anders S, Pyl PT, Huber W. HTSeq - A Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166. doi: 10.1093/bioinformatics/btu638.
    1. Delhomme N, Padioleau I, Furlong EE, Steinmetz LM. easyRNASeq: a Bioconductor package for processing RNA-seq data. Bioinformatics. 2012;28:2532–2533. doi: 10.1093/bioinformatics/bts477.
    1. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–930. doi: 10.1093/bioinformatics/btt656.
    1. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg S. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:36. doi: 10.1186/gb-2013-14-4-r36.
    1. DESeq2paper. []

Source: PubMed

3
購読する