RnaSeqSampleSize: real data based sample size estimation for RNA sequencing

Shilin Zhao, Chung-I Li, Yan Guo, Quanhu Sheng, Yu Shyr

Abstract

Background: One of the most important and often neglected components of a successful RNA sequencing (RNA-Seq) experiment is sample size estimation. A few negative binomial model-based methods have been developed to estimate sample size based on the parameters of a single gene. However, thousands of genes are quantified and tested for differential expression simultaneously in RNA-Seq experiments. Thus, additional issues should be carefully addressed, including the false discovery rate for multiple statistical tests and the wide range of read counts and dispersions across genes.

Results: To solve these issues, we developed a sample size and power estimation method named RnaSeqSampleSize, based on the distributions of gene average read counts and dispersions estimated from real RNA-Seq data. Datasets from previous, similar experiments, such as those from The Cancer Genome Atlas (TCGA), can be used as a point of reference. Read counts and their dispersions were estimated from the reference's distribution; using that information, we estimated and summarized the power and sample size. RnaSeqSampleSize is implemented in the R language and can be installed from the Bioconductor website. A user-friendly web graphical interface is provided at http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/ .
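The idea of averaging power over a reference distribution of read counts and dispersions can be illustrated with a minimal Python sketch. It is not the RnaSeqSampleSize implementation: per-gene power is approximated here with a normal (Wald) approximation on the log fold change under a negative binomial model, swapped in for the package's own test for simplicity, and no false discovery rate adjustment is applied. The function names and the `reference_genes` values are hypothetical toy inputs, not TCGA estimates.

```python
from math import log, sqrt
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()


def gene_power(n, mean_count, dispersion, fold_change=2.0, alpha=0.05):
    """Approximate power for one gene with n samples per group.

    Assumption: Wald test on the log fold change, where the variance of a
    group's log mean under a negative binomial model is (1/mean + dispersion)/n.
    """
    var_control = (1.0 / mean_count + dispersion) / n
    var_treated = (1.0 / (mean_count * fold_change) + dispersion) / n
    se = sqrt(var_control + var_treated)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(abs(log(fold_change)) / se - z_crit)


def average_power(n, reference_genes, **kwargs):
    """Mean power over (mean_count, dispersion) pairs taken from reference data."""
    return mean(gene_power(n, mu, phi, **kwargs) for mu, phi in reference_genes)


def smallest_sample_size(target_power, reference_genes, max_n=1000, **kwargs):
    """Smallest per-group n whose average power reaches target_power, else None."""
    for n in range(2, max_n + 1):
        if average_power(n, reference_genes, **kwargs) >= target_power:
            return n
    return None


# Toy (mean_count, dispersion) pairs standing in for a reference dataset.
reference_genes = [(5, 1.0), (20, 0.5), (100, 0.3), (500, 0.1), (2000, 0.05)]

n_needed = smallest_sample_size(0.8, reference_genes)
print("per-group sample size for average power >= 0.8:", n_needed)
```

In this sketch, genes with low counts and high dispersion pull the average power down, which is why a single "typical" gene can give a misleading sample size.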

Conclusions: RnaSeqSampleSize provides a convenient and powerful way to estimate power and sample size for an RNA-Seq experiment. It also offers several unique features, including estimation for genes or pathways of interest, power curve visualization, and parameter optimization.

Keywords: Power analysis; RNA-Seq; Sample size; Simulation.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
RnaSeqSampleSize package workflow
Fig. 2
Read count and dispersion distributions greatly influence the estimated sample size and power. (a) The read count and dispersion distributions for all genes from the TCGA Rectum adenocarcinoma (READ) dataset. The red lines indicate read counts of 1 and 10, and the green line indicates the 95% quantile of all gene dispersions. (b) The estimated sample size required to achieve 0.8 power for different combinations of read counts and dispersions
Fig. 3
Sample size estimation with real data. (a) The read count distribution for all genes from the TCGA Breast Invasive Carcinoma (BRCA) and Rectum adenocarcinoma (READ) datasets; (b) The dispersion distribution for all genes from the TCGA BRCA and READ datasets; (c) The power distribution based on the count and dispersion distributions in the TCGA BRCA dataset when the sample size equals 71. The red lines indicate the mean value of the power distribution; (d) The power distribution based on the count and dispersion distributions in the TCGA READ dataset when the sample size equals 71. The red lines indicate the mean value of the power distribution
Fig. 4
Sample size estimation for genes of interest. (a) The read count distribution for genes in three KEGG pathways in the TCGA READ dataset; (b) The dispersion distribution for genes in three KEGG pathways in the TCGA READ dataset; (c) The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Calcium signaling pathway when the sample size equals 71. The red lines indicate the mean value of the power distribution; (d) The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Proteasome pathway when the sample size equals 71. The red lines indicate the mean value of the power distribution
Fig. 5
Power curve visualization and parameter optimization by RnaSeqSampleSize. (a) Power curves for balanced (same sample size in the two groups) and unbalanced (different sample sizes in the two groups) experimental designs. The power curves indicate that the balanced design (red line) achieves the highest power for the same total number of samples; (b) Optimization of parameters in sample size estimation. The dispersion and fold change were set to 0.5 and 2, respectively. A power matrix covering different pairs of sample numbers and read counts was generated. The power distribution indicates that the number of samples plays a more significant role than the read count in determining the power, and suggests that at least 96 samples should be used in RNA-Seq experiments with these parameters to reach 0.8 power
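The balanced versus unbalanced comparison in Fig. 5a can be illustrated with a small Python sketch. It uses a normal (Wald) approximation on the log fold change under a negative binomial model, which is an assumption made here for brevity and not the package's own calculation, so it will not reproduce the figure's values. The function name, the mean count of 10, and the three designs are hypothetical choices; the dispersion of 0.5 and fold change of 2 follow the caption.

```python
from math import log, sqrt
from statistics import NormalDist


def two_group_power(n_control, n_treated, mean_count=10.0, dispersion=0.5,
                    fold_change=2.0, alpha=0.05):
    """Approximate power of a two-group comparison with unequal group sizes.

    Assumption: the variance of each group's log mean is
    (1/mean + dispersion) / n, as in a negative binomial Wald test.
    """
    std_normal = NormalDist()
    se = sqrt((1.0 / mean_count + dispersion) / n_control
              + (1.0 / (mean_count * fold_change) + dispersion) / n_treated)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(abs(log(fold_change)) / se - z_crit)


# Three designs that all use 40 samples in total.
designs = {
    "balanced 20/20": (20, 20),
    "unbalanced 30/10": (30, 10),
    "unbalanced 10/30": (10, 30),
}
powers = {label: two_group_power(*sizes) for label, sizes in designs.items()}
for label, power in powers.items():
    print(f"{label}: power = {power:.3f}")
```

Under this approximation the smaller group dominates the standard error, so moving samples from one group to the other at a fixed total lowers the power for these strongly unbalanced splits.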


Source: PubMed
