Power analysis and sample size estimation for RNA-Seq differential expression

Travers Ching, Sijia Huang, Lana X Garmire, Travers Ching, Sijia Huang, Lana X Garmire

Abstract

It is crucial for researchers to optimize RNA-seq experimental designs for differential expression detection. Currently, the field lacks general methods to estimate power and sample size for RNA-Seq in complex experimental designs, under the assumption of the negative binomial distribution. We simulate RNA-Seq count data based on parameters estimated from six widely different public data sets (including cell line comparison, tissue comparison, and cancer data sets) and calculate the statistical power in paired and unpaired sample experiments. We comprehensively compare five differential expression analysis packages (DESeq, edgeR, DESeq2, sSeq, and EBSeq) and evaluate their performance by power, receiver operator characteristic (ROC) curves, and other metrics including areas under the curve (AUC), Matthews correlation coefficient (MCC), and F-measures. DESeq2 and edgeR tend to give the best performance in general. Increasing sample size or sequencing depth increases power; however, increasing sample size is more potent than sequencing depth to increase power, especially when the sequencing depth reaches 20 million reads. Long intergenic noncoding RNAs (lincRNA) yields lower power relative to the protein coding mRNAs, given their lower expression level in the same RNA-Seq experiment. On the other hand, paired-sample RNA-Seq significantly enhances the statistical power, confirming the importance of considering the multifactor experimental design. Finally, a local optimal power is achievable for a given budget constraint, and the dominant contributing factor is sample size rather than the sequencing depth. In conclusion, we provide a power analysis tool (http://www2.hawaii.edu/~lgarmire/RNASeqPowerCalculator.htm) that captures the dispersion in the data and can serve as a practical reference under the budget constraint of RNA-Seq experiments.

Keywords: RNA-Seq; bioinformatics; power analysis; sample size; simulation.

© 2014 Ching et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.

Figures

FIGURE 1.
FIGURE 1.
Power curves based on the number of samples per condition for the six public data sets and five RNA-Seq differential expression analysis packages. Library sizes were estimated from the gene counts of the real data sets. Per-gene dispersion was estimated through the Cox–Reid adjusted profile likelihood. (A) Power curves relative to sample size and differential expression methods in six public data sets. The four unpaired-sample data sets (Bottomly, Bullard, Huang, M–P) were analyzed with edgeR, DESeq, DESeq2, EBSeq, and sSeq. The paired-sample data sets (Tuch and Qian) were analyzed with edgeR, DESeq, DESeq2, and sSeq. Note that EBSeq is not included as it is currently not adapted to analyzing paired-sample data. (B) Heatmap of averaged power over the differential expression methods in six public data sets.
FIGURE 2.
FIGURE 2.
Performance comparison with receiver operator characteristics (ROC) curves and other metrics for the six public data sets and five RNA-Seq differential expression analysis packages. Sensitivity and 1 − specificity were estimated in each simulation for n = 4 samples per condition. The simulations were conducted as in Figure 1. (A) ROC curve comparison. True positive rate (TPR) versus false positive rate (FPR) was plotted. (B) Other performance metrics. Area under the curve (AUC) was measured up to FPR = 0.5 of the ROC curves in A. Matthew correlation coefficient (MCC) and F-measure were measured at the threshold of α = 0.05.
FIGURE 3.
FIGURE 3.
Paired versus single-factor power analysis of paired-sample data sets (Qian and Tuch). The data sets were evaluated with pairing information (paired analysis, solid line) or without pairing information (single-factor analysis, dashed line), using the standard analysis pipelines for the respective packages as in Figure 1. Note that EBSeq is not included as it is currently not adapted to analyzing paired-sample data.
FIGURE 4.
FIGURE 4.
Power of protein coding genes versus long noncoding RNA (lincRNA) transcripts. The comparison was made using the Huang data set, which used ribosomal RNA removal for RNA library construction. The transcriptome was separated into protein coding genes (solid line) or lincRNA (dashed line) categories. Power was estimated in each simulation for these two categories, using the standard analysis pipelines for the respective packages as in Figure 1.
FIGURE 5.
FIGURE 5.
Optimization of power given a budget constraint. The cost of RNA-Seq per sample is dependent on the cost of constructing the RNA-Seq library and the cost of single-end sequencing under the multiplex arrangement, where multiple samples could be barcoded to share one lane of the HiSeq flow cell. Both sequencing depth and sample size are variables under the budget constraint. (A) Power curves relative to samples, exemplified by increasing budgets of $3000, $5000, and $10,000 among five RNA-Seq differential expression analysis packages. (B) Optimal powers achieved for given budget constraints. (C) Biological replicates required to obtain optimal powers for given budget constraints. (D) Sequencing depths required to obtain optimal powers for given budget constraints.

References

    1. Aban IB, Cutter GR, Mavinga N. 2008. Inferences and power analysis concerning two negative binomial distributions with an application to MRI lesion counts data. Comput Stat Data Anal 53: 820–833
    1. Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biol 11: R106.
    1. Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R. 2011. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One 6: e17820.
    1. Bullard JH, Purdom E, Hansen KD, Dudoit S. 2010. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.
    1. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. 2013. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics 29: 656–657
    1. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. 2011. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev 25: 1915–1927
    1. Chen Z, Liu J, Ng HKT, Nadarajah S, Kaufman HL, Yang JY, Deng Y. 2011. Statistical methods on detecting differentially expressed genes for RNA-seq data. BMC Syst Biol 5(Suppl 3): S1
    1. Frazee A, Langmead B, Leek J. 2011. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12: 449.
    1. Garmire LX, Garmire DG, Huang W, Yao J, Glass CK, Subramaniam S. 2011. A global clustering algorithm to identify long intergenic non-coding RNA—with applications in mouse macrophages. PLoS One 6: e24051.
    1. Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher J-P. 2013. Calculating sample size estimates for RNA sequencing data. J Comput Biol 20: 970–978
    1. Huang R, Jaritz M, Guenzl P, Vlatkovic I, Sommer A, Tamir IM, Marks H, Klampfl T, Kralovics R, Stunnenberg HG. 2011. An RNA-Seq strategy to detect the complete coding and non-coding transcriptome including full-length imprinted macro ncRNAs. PLoS One 6: e27288.
    1. Jiang H, Wong WH. 2009. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25: 1026–1032
    1. Kim D, Salzberg SL. 2011. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol 12: R72.
    1. Kvam VM, Liu P, Si Y. 2012. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 99: 248–256
    1. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C. 2013. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29: 1035–1043
    1. Li CI, Su PF, Shyr Y. 2013. Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data. BMC Bioinformatics 14: 357.
    1. Liu Y, Zhou J, White KP. 2014. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30: 301–304
    1. Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. bioRxiv10.1101/002832
    1. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18: 1509–1517
    1. McCarthy DJ, Chen Y, Smyth GK. 2012. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40: 4288–4297
    1. McCulloch CE. 1997. Maximum likelihood algorithms for generalized linear mixed models. J Am Stat Assoc 92: 162–170
    1. McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV. 2011. RNA-seq: technical variability and sampling. BMC Genomics 12: 293.
    1. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. 2010. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464: 773–777
    1. Morin RD, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh TJ, McDonald H, Varhol R, Jones SJM, Marra MA. 2008. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 45: 81–94
    1. Mutch DM, Berger A, Mansourian R, Rytz A, Roberts M-A. 2002. The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data. BMC Bioinformatics 3: 17.
    1. Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, Uhlén M, Nielsen J. 2012. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res 40: 10084–10097
    1. Pham TV, Jimenez CR. 2012. An accurate paired sample test for count data. Bioinformatics 28: i596–i602
    1. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK. 2010. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768–772
    1. Qian F, Chung L, Zheng W, Bruno V, Alexander RP, Wang Z, Wang X, Kurscheid S, Zhao H, Fikrig E, et al.2013. Identification of genes critical for resistance to infection by West Nile virus using RNA-Seq analysis. Viruses 5: 1664–1681
    1. Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. 2013. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol 14: R95.
    1. Rau A, Gallopin M, Celeux G, Jaffrezic F. 2013. Data-based filtering for replicated high-throughput transcriptome sequencing experiments. Bioinformatics 29: 2146–2152
    1. Robinson MD, Oshlack A. 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11: R25.
    1. Robinson MD, Smyth GK. 2007. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23: 2881–2887
    1. Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139–140
    1. Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM. 2012. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics 13: 484.
    1. Soneson C, Delorenzi M. 2013. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14: 91.
    1. Srivastava S, Chen L. 2010. A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res 38: e170.
    1. Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A. 2011. Differential expression in RNA-seq: a matter of depth. Genome Res 21: 2213–2223
    1. Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105–1111
    1. Tuch BB, Laborde RR, Xu X, Gu J, Chung CB, Monighetti CK, Stanley SJ, Olsen KD, Kasperbauer JL, Moore EJ. 2010. Tumor transcriptome sequencing reveals allelic expression imbalances associated with copy number alterations. PLoS One 5: e9317.
    1. Vijay N, Poelstra JW, Künstner A, Wolf JBW. 2013. Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Mol Ecol 22: 620–634
    1. Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57–63
    1. Wang L, Feng Z, Wang X, Wang X, Zhang X. 2010. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26: 136–138
    1. Wu H, Wang C, Wu Z. 2013. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 14: 232–243
    1. Yu D, Huber W, Vitek O. 2013. Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29: 1275–1282

Source: PubMed

3
Prenumerera