Differential expression analysis for sequence count data

Simon Anders, Wolfgang Huber, Simon Anders, Wolfgang Huber

Abstract

High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

Figures

Figure 1
Figure 1
Dependence of the variance on the mean for condition A in the fly RNA-Seq data. (a) The scatter plot shows the common-scale sample variances (Equation (7)) plotted against the common-scale means (Equation (6)). The orange line is the fit w(q). The purple lines show the variance implied by the Poisson distribution for each of the two samples, that is, s^jq^i,A. The dashed orange line is the variance estimate used by edgeR. (b) Same data as in (a), with the y-axis rescaled to show the squared coefficient of variation (SCV), that is all quantities are divided by the square of the mean. In (b), the solid orange line incorporated the bias correction described in Supplementary Note C in Additional file 1. (The plot only shows SCV values in the range [0, 0.2]. For a zoom-out to the full range, see Supplementary Figure S9 in Additional file 1.)
Figure 2
Figure 2
Type-I error control. The panels show empirical cumulative distribution functions (ECDFs) for P values from a comparison of one replicate from condition A of the fly RNA-Seq data with the other one. No genes are truly differentially expressed, and the ECDF curves (blue) should remain below the diagonal (gray). Panel (a): top row corresponds to DESeq, middle row to edgeR and bottom row to a Poisson-based χ2 test. The right column shows the distributions for all genes, the left and middle columns show them separately for genes below and above a mean of 100. Panel (b) shows the same data, but zooms into the range of small P values. The plots indicate that edgeR and DESeq control type I error at (and in fact slightly below) the nominal rate, while the Poisson-based χ2 test fails to do so. edgeR has an excess of small P values for low counts: the blue line lies above the diagonal. This excess is, however, compensated by the method being more conservative for high counts. All methods show a point mass at p = 1, this is due to the discreteness of the data, whose effect is particularly evident at low counts.
Figure 3
Figure 3
Testing for differential expression between conditions A and B: Scatter plot of log2 ratio (fold change) versus mean. The red colour marks genes detected as differentially expressed at 10% false discovery rate when Benjamini-Hochberg multiple testing adjustment is used. The symbols at the upper and lower plot border indicate genes with very large or infinite log fold change. The corresponding volcano plot is shown in Supplementary Figure S8 in Additional file 2.
Figure 4
Figure 4
Distribution of hits through the dynamic range. The density of common-scale mean values qi for all genes in the fly data (gray line, scaled down by a factor of seven), and for the hits reported by DESeq (red line) and by edgeR at a false discovery rate of 10% (dark blue line: with tag-wise dispersion estimation; light blue line: common dispersion mode).
Figure 5
Figure 5
Sample clustering for the neural cell data of Kasowski et al. [18]. A common variance function was estimated for all samples and used to apply a variance-stabilizing transformation. The heat map shows a false colour representation of the Euclidean distance matrix (from dark blue for zero distance to orange for large distance), and the dendrogram represents a hierarchical clustering. Two GNS samples were derived from the same patient (marked with '(*)') and show the highest degree of similarity. The two other GNS samples (including one with atypically large cells, marked '(L)') are as dissimilar from the former as the two NS samples.
Figure 6
Figure 6
Application to ChIP-Seq data. Shown are ECDF curves for P values resulting from comparisons of Pol-II ChIP-Seq data between replicates of the same individual (first and second column) and between two different individuals (third and forth column). The upper row corresponds to an analysis with DESeq ('D'), the lower row to one based on Poisson GLMs ('P'). If no true differential occupation exists (that is, when comparing replicates), the ECDF (blue) should stay below the diagonal (gray), which corresponds to uniform P values. In the first column, two replicates from HapMap individual GM12878 (A1) were compared against two further replicates from the same individual (A2). Similarly, in the second column, two replicates from individual GM12891 (B1) were compared against two further replicates from the same individual (B2). For DESeq, no excess of low P values was seen, as expected when comparing replicates. In contrast, the Poisson GLM analysis produced strong enrichments of small P values; this is a reflection of overdispersion in the data, that is, the variance in the data was larger than what the Poisson GLM assumes (see also Section Choice of distribution). The third column compares two replicates from individual GM12878 (A1) against two from the other individual (B1). True occupation differences are expected, and both methods result in enrichment of small P values. The forth column shows the comparison of four replicates of GM12878 (A1 combined with A2) against four replicates of GM12891 (B1, B2); increased sample size leads to higher detection power and hence smaller P values.
Figure 7
Figure 7
Noise estimates for the data of Nagalakshmi et al. [1]. The data allow assessment of technical variability (between library preparations from aliquots of the same yeast culture) and biological variability (between two independently grown cultures). The blue curves depict the squared coefficient of variation at the common scale, (q)/q2 (see Equation (9)) for technical replicates, the red curves for biological replicates (solid lines, dT data set, dashed lines, RH data set). The data density is shown by the histogram in the top panel. The purple area marks the range of the shot noise for the range of size factors in the data set. One can see that the noise between technical replicates follows closely the shot noise limit, while the noise between biological replicates exceeds shot noise already for low count values.

References

    1. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320:1344–1349. doi: 10.1126/science.1158441.
    1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
    1. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007;4:651–657. doi: 10.1038/nmeth1068.
    1. Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, Clark TA, Schweitzer AC, Blume JE, Wang X, Darnell JC, Darnell RB. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008;456:464–469. doi: 10.1038/nature07488.
    1. Smith AM, Heisler LE, Mellor J, Kaper F, Thompson MJ, Chee M, Roth FP, Giaever G, Nislow C. Quantitative phenotyping via deep barcode sequencing. Genome Res. 2009;19:1836–1842. doi: 10.1101/gr.093955.109.
    1. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18:1509–1517. doi: 10.1101/gr.079558.108.
    1. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010;26:136–138. doi: 10.1093/bioinformatics/btp612.
    1. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–2887. doi: 10.1093/bioinformatics/btm453.
    1. Whitaker L. On the Poisson law of small numbers. Biometrika. 1914;10:36–71. doi: 10.1093/biomet/10.1.36.
    1. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
    1. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9:321–332. doi: 10.1093/biostatistics/kxm030.
    1. Cameron AC, Trivedi PK. Regression Analysis of Count Data. Cambridge University Press; 1998.
    1. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25. doi: 10.1186/gb-2010-11-3-r25.
    1. Loader C. Local Regression and Likelihood. Springer; 1999.
    1. McCullagh P, Nelder JA. Generalized Linear Models. 2. Chapman & Hall/CRC; 1989.
    1. locfit: Local regression, likelihood and density estimation.
    1. Agresti A. Categorical Data Analysis. 2. Wiley; 2002.
    1. Engström P, Tommei D, Stricker S, Smith A, Pollard S, Bertone P. Transcriptional characterization of glioblastoma stem cell lines using tag sequencing. 2010. in press .
    1. Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, Zhao Y, Hirst M, Marra MA. Next-generation tag sequencing for cancer gene expression profiling. Genome Res. 2009;19:1825–1835. doi: 10.1101/gr.094482.109.
    1. Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M. Variation in transcription factor binding among humans. Science. 2010;328:232–235. doi: 10.1126/science.1183621.
    1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995;57:289–300.
    1. Bullard J, Purdom E, Hansen K, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94. doi: 10.1186/1471-2105-11-94.
    1. Bloom JS, Khan Z, Kruglyak L, Singh M, Caudy AA. Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics. 2009;10:221. doi: 10.1186/1471-2164-10-221.
    1. Smyth GK. In: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Gentleman R, Carey V, Dudoit S, R Irizarry WH, editor. New York: Springer; 2005. Limma: linear models for microarray data. pp. 397–420. full_text.
    1. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3.
    1. Lönnstedt I, Speed T. Replicated microarray data. Stat Sin. 2002;12:31–46.
    1. R: A Language and Environment for Statistical Computing.
    1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80.
    1. Bliss CI, Fisher RA. Fitting the negative binomial distribution to biological data. Biometrics. 1953;9:176–200. doi: 10.2307/3001850.
    1. Clark SJ, Perry JN. Estimation of the negative binomial parameter κ by maximum quasi-likelihood. Biometrics. 1989;45:309–316. doi: 10.2307/2532055.
    1. Lawless JF. Negative binomial and mixed Poisson regression. Can J Stat. 1987;15:209–225. doi: 10.2307/3314912.
    1. Saha K, Paul S. Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics. 2005;61:179–285. doi: 10.1111/j.0006-341X.2005.030833.x.
    1. Fast and accurate computation of binomial probabilities. (Note: This is a copy of the original paper, which is no longer available online.)
    1. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
    1. HTSeq: Analysing high-throughput sequencing data with Python.
    1. DESeq.

Source: PubMed

Nadchodzące badania kliniczne

Subskrybuj