A comparative study of techniques for differential expression analysis on RNA-Seq data

Zong Hong Zhang, Dhanisha J Jhaveri, Vikki M Marshall, Denis C Bauer, Janette Edson, Ramesh K Narayanan, Gregory J Robinson, Andreas E Lundberg, Perry F Bartlett, Naomi R Wray, Qiong-Yi Zhao

Abstract

Recent advances in next-generation sequencing technology allow high-throughput cDNA sequencing (RNA-Seq) to be widely applied in transcriptomic studies, in particular for detecting differentially expressed genes between groups. Many software packages have been developed for the identification of differentially expressed genes (DEGs) between treatment groups based on RNA-Seq data. However, there is a lack of consensus on how to approach an optimal study design and choice of suitable software for the analysis. In this comparative study we evaluate the performance of three of the most frequently used software tools: Cufflinks-Cuffdiff2, DESeq and edgeR. A number of important parameters of RNA-Seq technology were taken into consideration, including the number of replicates, sequencing depth, and balanced vs. unbalanced sequencing depth within and between groups. We benchmarked results relative to sets of DEGs identified through either quantitative RT-PCR or microarray. We observed that edgeR performs slightly better than DESeq and Cuffdiff2 in terms of the ability to uncover true positives. Overall, DESeq or taking the intersection of DEGs from two or more tools is recommended if the number of false positives is a major concern in the study. In other circumstances, edgeR is slightly preferable for differential expression analysis at the expense of potentially introducing more false positives.
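
The recommendation to take the intersection of DEGs from two or more tools, and the benchmarking against a reference set, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the gene identifiers, the per-tool call sets, the benchmark sets, and the helper names `intersect_degs` and `score_against_benchmark` are hypothetical and are not taken from the study or from any of the three packages.

```python
def intersect_degs(*deg_sets):
    """Return the genes called differentially expressed by every tool."""
    if not deg_sets:
        return set()
    return set.intersection(*(set(s) for s in deg_sets))


def score_against_benchmark(called, benchmark_positive, benchmark_negative):
    """Count true/false positives and derive TPR and FPR for one call set."""
    called = set(called)
    tp = len(called & benchmark_positive)
    fp = len(called & benchmark_negative)
    tpr = tp / len(benchmark_positive) if benchmark_positive else 0.0
    fpr = fp / len(benchmark_negative) if benchmark_negative else 0.0
    return {"tp": tp, "fp": fp, "tpr": tpr, "fpr": fpr}


# Toy, made-up calls (hypothetical gene IDs, not results from the paper).
calls = {
    "Cuffdiff2": {"g1", "g2", "g5"},
    "DESeq": {"g1", "g2", "g3"},
    "edgeR": {"g1", "g2", "g3", "g4", "g6"},
}
# Hypothetical benchmark, standing in for a qRT-PCR or microarray reference.
truth_pos = {"g1", "g2", "g3", "g4"}
truth_neg = {"g5", "g6", "g7", "g8"}

# Consensus of two tools versus a single tool on its own.
consensus = intersect_degs(calls["DESeq"], calls["edgeR"])
consensus_score = score_against_benchmark(consensus, truth_pos, truth_neg)
edger_score = score_against_benchmark(calls["edgeR"], truth_pos, truth_neg)

print("consensus:", sorted(consensus), consensus_score)
print("edgeR only:", edger_score)
```

In this toy setup the consensus set gives up one true positive but contains no false positives, while the single-tool set recovers every benchmark gene at the cost of one false positive. That is the kind of trade-off the abstract describes, shown here only on invented data.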

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The workflow of differential expression analysis for RNA-Seq data.
Figure 2. The effects of replicates for detecting DEGs based on ROC curves.
ROC curves for evaluating the performance of Cuffdiff2, DESeq and edgeR on 1 to 2 technical replicates based on the MAQC dataset (A–C), 1 to 4 biological replicates based on the K_N dataset (D–F), and 1 to 20 biological replicates based on the LCL2 dataset (G–I).
Figure 3. The effects of biological replicates on the differential expression analysis.
The numbers of differentially expressed genes identified by each of three tools under different numbers of biological replicates based on the K_N dataset (A) and the LCL2 dataset (B).
Figure 4. The effects of sequencing depth for detecting DEGs.
ROC curves for evaluating the performance of Cuffdiff2, DESeq and edgeR with different sequencing depths based on the K_N subsets (A–C) and the LCL3 simulated dataset (D–F).
Figure 5. The effects of sequencing depth on the differential expression analysis.
The numbers of differentially expressed genes identified by Cuffdiff2, DESeq and edgeR are shown based on the K_N subsets (A) and the LCL3 simulated dataset (B).
Figure 6. The effects of balanced and unbalanced depths of reads for detecting DEGs based on ROC curves.
ROC curves for evaluating the performance of Cuffdiff2, DESeq and edgeR on balanced and on unbalanced depths of reads based on the K_N dataset (A–C) and the LCL3 simulated dataset (D–F).
Figure 7. The performance of the three tools.
ROC curves of the three tools are shown based on the MAQC (A), K_N (B) and LCL2 (C) datasets. Venn diagrams are used to show the intersection of the numbers of the differentially expressed genes identified by three tools compared with the benchmarks based on MAQC (D), K_N (E) and LCL2 (F).

Source: PubMed
