Comparison of normalization methods for the analysis of metagenomic gene abundance data

Mariana Buongermino Pereira, Mikael Wallroth, Viktor Jonsson, Erik Kristiansson, Mariana Buongermino Pereira, Mikael Wallroth, Viktor Jonsson, Erik Kristiansson

Abstract

Background: In shotgun metagenomics, microbial communities are studied through direct sequencing of DNA without any prior cultivation. By comparing gene abundances estimated from the generated sequencing reads, functional differences between the communities can be identified. However, gene abundance data is affected by high levels of systematic variability, which can greatly reduce the statistical power and introduce false positives. Normalization, which is the process where systematic variability is identified and removed, is therefore a vital part of the data analysis. A wide range of normalization methods for high-dimensional count data has been proposed but their performance on the analysis of shotgun metagenomic data has not been evaluated.

Results: Here, we present a systematic evaluation of nine normalization methods for gene abundance data. The methods were evaluated through resampling of three comprehensive datasets, creating a realistic setting that preserved the unique characteristics of metagenomic data. Performance was measured in terms of the methods ability to identify differentially abundant genes (DAGs), correctly calculate unbiased p-values and control the false discovery rate (FDR). Our results showed that the choice of normalization method has a large impact on the end results. When the DAGs were asymmetrically present between the experimental conditions, many normalization methods had a reduced true positive rate (TPR) and a high false positive rate (FPR). The methods trimmed mean of M-values (TMM) and relative log expression (RLE) had the overall highest performance and are therefore recommended for the analysis of gene abundance data. For larger sample sizes, CSS also showed satisfactory performance.

Conclusions: This study emphasizes the importance of selecting a suitable normalization methods in the analysis of data from shotgun metagenomics. Our results also demonstrate that improper methods may result in unacceptably high levels of false positives, which in turn may lead to incorrect or obfuscated biological interpretation.

Keywords: False discovery rate; Gene abundances; High-dimensional data; Normalization; Shotgun metagenomics; Systematic variability.

Conflict of interest statement

Ethics approval and consent to participate

The study does has not directly involved humans, animals or plants.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Fig. 1
True positive rate analysis for group size 10+10. True positive rate at a fixed false positive rate of 0.01 (y-axis) for nine normalization methods and three metagenomic datasets (x-axis). The results were based on resampled data consisting of two groups with 10 samples in each, 10% DAGs with an average fold-change of 3. The DAGs were added in (a) equal proportion between the groups (‘balanced’) and in (b) in only one of the groups (’unbalanced’). The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare)
Fig. 2
Fig. 2
True positive rate analysis for group size 3+3. True positive rate at a fixed false positive rate of 0.01 (y-axis) for nine normalization methods and three metagenomic datasets (x-axis). The results were based on resampled data consisting of two groups with 3 samples in each, 10% DAGs with an average fold-change of 3. The DAGs were added in (a) equal proportion between the groups (‘balanced’) and in (b) only one of the groups (‘unbalanced’). The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare)
Fig. 3
Fig. 3
True and false positive rates for increasing unbalanced effects. (a) True positive rate at a fixed false positive rate of 0.01 (y-axis) and (b) false positive rate at a fix true positive rate of 0.50 (y-axis) for different distributions of effects between groups: balanced (’B’) with 10% of effects divided equally between the two groups, lightly-unbalanced (‘LU’) with effects added 75%-25% in each group, unbalanced (’U’) with all effects added to only one group, and heavily-unbalanced (‘HU’) with 20% of effects added to only one group (x-axis). The results were based on resampled data consisting of two groups with 10 samples in each and an average fold-change of 3. Three metagenomic datasets were used Human gut I, Human gut II and Marine. The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare)
Fig. 4
Fig. 4
Quantile-quantile plots for p-values of non-DAGs. Data quantiles for the Human gut I dataset (y-axis) were plotted against the theoretical quantiles of the uniform distribution (x-axis) for nine normalization methods. The results were based on resampled data consisting of two groups with 10 samples in each, an average fold-change of 3, for balanced (‘B’) case where 10% effects were equally distributed in the two groups and heavily-unbalanced (‘HU’) case where 20% effects were added in only one group. Each dashed line is one of the 100 iterations. Lines deviating from the diagonal indicates biased p-values. The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare)
Fig. 5
Fig. 5
True false discovery rate at an estimated false discovery rate of 0.05 (y-axis) for different distribution of effects between groups (x-axis): balanced (‘B’) with 10% of effects divided equally between the two groups, lightly-unbalanced (‘LU’) with effects added 75–25% in each group, unbalanced (‘U’) with all effects added to only one group, and heavily-unbalanced (’HU’) with 20% of effects added to only one group. The results were based on resampled data consisting of two groups with 10 samples in each, and an average fold-change of 3. P-values where adjusted using Benjamini-Hochberg correction. Three metagenomic datasets were used Human gut I, Human gut II and Marine. The following methods are included in the figure trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare)

References

    1. Sharpton TJ. An introduction to the analysis of shotgun metagenomic data, Front Plant Sci. 2014;5:209. doi: 10.3389/fpls.2014.00209.
    1. Schloss PD, Handelsman J. Metagenomics for studying unculturable microorganisms: cutting the Gordian knot, Genome Biol. 2005;6:229. doi: 10.1186/gb-2005-6-8-229.
    1. Kim Y, Koh IS, Rho M. Deciphering the human microbiome using next-generation sequencing data and bioinformatics approaches. Methods. 2015;79-80:52–9. doi: 10.1016/j.ymeth.2014.10.022.
    1. Kristiansson E, Fick J, Janzon A, Grabic R, Rutgersson C, So H, Larsson DGJ. Pyrosequencing of Antibiotic-Contaminated River Sediments Reveals High Levels of Resistance and Gene Transfer Elements. PloS ONE. 2011;6(2):17038. doi: 10.1371/journal.pone.0017038.
    1. Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, Liang S, Zhang W, Guan Y, Shen D, Peng Y, Zhang D, Jie Z, Wu W, Qin Y, Xue W, Li J, Han L, Lu D, Wu P, Dai Y, Sun X, Li Z, Tang A, Zhong S, Li X, Chen W, Xu R, Wang M, Feng Q, Gong M, Yu J, Zhang Y, Zhang M, Hansen T, Sanchez G, Raes J, Falony G, Okuda S, Almeida M, LeChatelier E, Renault P, Pons N, Batto JM, Zhang Z, Chen H, Yang R, Zheng W, Yang H, Wang J, Ehrlich SD, Nielsen R, Pedersen O, Kristiansen K. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490(7418):55–60. doi: 10.1038/nature11450.
    1. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, Djahanschiri B, Zeller G, Mende DR, Alberti A, Cornejo-Castillo FM, Costea PI, Cruaud C, D’Ovidio F, Engelen S, Ferrera I, Gasol JM, Guidi L, Hildebrand F, Kokoszka F, Lepoivre C, Lima-Mendez G, Poulain J, Poulos BT, Royo-Llonch M, Sarmento H, Vieira-Silva S, Dimier C, Picheral M, Searson S, Kandels-Lewis S, Bowler C, de Vargas C, Gorsky G, Grimsley N, Hingamp P, Iudicone D, Jaillon O, Not F, Ogata H, Pesant S, Speich S, Stemmann L, Sullivan MB, Weissenbach J, Wincker P, Karsenti E, Raes J, Acinas SG, Bork P, Boss E, Bowler C, Follows M, Karp-Boss L, Krzic U, Reynaud EG, Sardet C, Sieracki M, Velayoudon D. Structure and function of the global ocean microbiome. Science. 2015;348(6237):1261359. doi: 10.1126/science.1261359.
    1. Boulund F, Sjögren A, Kristiansson E. Tentacle: distributed quantification of genes in metagenomes. GigaScience. 2015;4(1):40. doi: 10.1186/s13742-015-0078-1.
    1. Österlund T, Jonsson V, Kristiansson E. HirBin: high-resolution identification of differentially abundant functions in metagenomes. BMC Genomics. 2017;18(1):316. doi: 10.1186/s12864-017-3686-6.
    1. Bengtsson-Palme J. Strategies for Taxonomic and Functional Annotation of Metagenomes. In: Nagarajan M, editor. Metagenomics: Perspectives, Methods and Applications. Cambridge: Academic Press; 2018.
    1. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):1000667. doi: 10.1371/journal.pcbi.1000667.
    1. Jonsson V, Österlund T, Nerman O, Kristiansson E. Variability in metagenomic count data and its influence on the identification of differentially abundant genes. J Comput Biol. 2017;24(4):311–26. doi: 10.1089/cmb.2016.0180.
    1. Boulund F, Pereira MB, Jonsson V, Kristiansson E. Computational and statistical considerations in the analysis of metagenomic data. In: Nagarajan M, editor. Metagenomics: Perspectives, Methods and Applications. Cambridge: Academic Press; 2018.
    1. McMurdie PJ, Holmes S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol. 2014;10(4):1003531. doi: 10.1371/journal.pcbi.1003531.
    1. Morgan JL, Darling AE, Eisen JA. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS ONE. 2010;5(4):10209. doi: 10.1371/journal.pone.0010209.
    1. Manor O, Borenstein E. MUSiCC: a marker genes based framework for metagenomic normalization and accurate profiling of gene abundances in the microbiome, Genome Biol. 2015;16(1):53. doi: 10.1186/s13059-015-0610-8.
    1. Hansen KD, Irizarry RA, Wu Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012;13(2):204–16. doi: 10.1093/biostatistics/kxr054.
    1. Mitra S, Klar B, Huson DH. Visual and statistical comparison of metagenomes. Bioinformatics. 2009;25(15):1849–55. doi: 10.1093/bioinformatics/btp341.
    1. White JR, Nagarajan N, Pop M. Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples. PLoS Comput Biol. 2009;5(4):1000352. doi: 10.1371/journal.pcbi.1000352.
    1. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94. doi: 10.1186/1471-2105-11-94.
    1. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys, Nat Methods. 2013;10(12):1200–2. doi: 10.1038/nmeth.2658.
    1. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data, Genome Biol. 2010;11:25. doi: 10.1186/gb-2010-11-3-r25.
    1. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:106. doi: 10.1186/gb-2010-11-10-r106.
    1. Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C, Zaneveld JR, Vázquez-Baeza Y, Birmingham A, Hyde ER, Knight R. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5(1):27. doi: 10.1186/s40168-017-0237-y.
    1. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–93. doi: 10.1093/bioinformatics/19.2.185.
    1. Choi H, Kim S, Fermin D, Tsou C-C, Nesvizhskii AI. QPROT: Statistical method for testing differential expression using protein-level intensity data in label-free quantitative proteomics. J Proteomics. 2015;129(1):121–6. doi: 10.1016/j.jprot.2015.07.036.
    1. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloe D, Le Gall C, Schaeffer B, Le Crom S, Guedj M, Jaffrezic F. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinformatics. 2013;14(6):671–83. doi: 10.1093/bib/bbs046.
    1. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17. doi: 10.1101/gr.079558.108.
    1. R Core Team . R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2017.
    1. Robinson MD, McCarthy DJ, Smyth GK. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. doi: 10.1093/bioinformatics/btp616.
    1. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8.
    1. Kristiansson E, Hugenholtz P, Dalevi D. ShotgunFunctionalizeR: An R-package for functional comparison of metagenomes. Bioinformatics. 2009;25(20):2737–8. doi: 10.1093/bioinformatics/btp508.
    1. Jonsson V, Österlund T, Nerman O, Kristiansson E. Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics. BMC Genomics. 2016;17:78. doi: 10.1186/s12864-016-2386-y.
    1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological) 1995;57(1):298–300.
    1. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, Jensen LJ, Von Mering C, Bork P. eggNOG v3.0: Orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2012;40(Database issue):284–9. doi: 10.1093/nar/gkr1060.
    1. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, Magris M, Hidalgo G, Baldassano RN, Anokhin AP, Heath AC, Warner B, Reeder J, Kuczynski J, Caporaso JG, Lozupone CA, Lauber C, Clemente JC, Knights D, Knight R, Gordon JI. Human gut microbiome viewed across age and geography. Nature. 2012;486(7402):222–7.
    1. Meyer F, Paarmann D, Souza MD, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA. The Metagenomics RAST Server: A Public Resource for the Automatic Phylogenetic and Functional Analysis of Metagenomes. BMC Bioinformatics. 2008;9:386. doi: 10.1186/1471-2105-9-386.
    1. Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, Rattei T, Mende DR, Sunagawa S, Kuhn M, Jensen LJ, Von Mering C, Bork P. EGGNOG 4.5: A hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 2016;44(D1):286–93. doi: 10.1093/nar/gkv1248.
    1. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10):1002195. doi: 10.1371/journal.pcbi.1002195.
    1. Kultima JR, Sunagawa S, Li J, Chen W, Chen H, Mende DR, Arumugam M, Pan Q, Liu B, Qin J, Wang J, Bork P. MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. PLoS ONE. 2012;7(10):47656. doi: 10.1371/journal.pone.0047656.
    1. Pereira MB, Wallroth M, Jonsson V, Kristiansson E. Gene abundance data used for the comparison of normalization methods in shotgun metagenomics. 2017. . Accessed 01 Sept 2017.
    1. Sohn MB, Du R, An L. A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics. 2015;31(14):2269–75. doi: 10.1093/bioinformatics/btv165.
    1. Paulson JN. Normalization and differential abundance analysis of metagenomic biomarker-gene surveys. PhD thesis, University of Maryland. 2015. 10.13016/M2Q63C. .
    1. Parks DH, Tyson GW, Hugenholtz P, Beiko RG. STAMP: Statistical analysis of taxonomic and functional profiles. Bioinformatics. 2014;30(21):3123–4. doi: 10.1093/bioinformatics/btu494.
    1. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003;100(16):9440–5. doi: 10.1073/pnas.1530509100.
    1. Hawinkel S, Mattiello F, Bijnens L, Thas O. A broken promise: microbiome differential abundance methods do not control the false discovery rate. Brief Bioinform. 2017;20:1684–96.
    1. Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, Xu Z, Ursell LK, Lauber C, Zhou H, Song SJ, Huntley J, Ackermann GL, Berg-Lyons D, Holmes S, Caporaso JG, Knight R. Advancing Our Understanding of the Human Microbiome Using QIIME. Methods Enzymol. 2013;531:371–444. doi: 10.1016/B978-0-12-407863-5.00019-8.
    1. Koren O, Knights D, Gonzalez A, Waldron L, Segata N, Knight R, Huttenhower C, Ley RE. A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets. PLoS Comput Biol. 2013;9(1):1002863. doi: 10.1371/journal.pcbi.1002863.
    1. Hughes JB, Hellmann JJ. The application of rarefaction techniques to molecular inventories of microbial diversity. Methods Enzymol. 2005;397:292–308. doi: 10.1016/S0076-6879(05)97017-1.
    1. Karlsson FH, Tremaroli V, Nookaew I, Bergstrom G, Behre CJ, Fagerberg B, Nielsen J, Backhed F. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature. 2013;498(7452):99–103. doi: 10.1038/nature12198.
    1. Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, Coelho LP, Arumugam M, Tap J, Nielsen HB, Rasmussen S, Brunak S, Pedersen O, Guarner F, de Vos WM, Wang J, Li J, Dore J, Ehrlich SD, Stamatakis A, Bork P. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013;10(12):1196–9. doi: 10.1038/nmeth.2693.
    1. Chen S-Y, Tsai C-N, Lee Y-S, Lin C-Y, Huang K-Y, Chao H-C, Lai M-W, Chiu C-H. Intestinal microbiome in children with severe and complicated acute viral gastroenteritis, Sci Rep. 2017;7:46130. doi: 10.1038/srep46130.
    1. Wang H-L, Sun L. Comparative metagenomics reveals insights into the deep-sea adaptation mechanism of the microorganisms in Iheya hydrothermal fields. World J Microbiol Biotechnol. 2017;33(86):1–17.
    1. Ericsson AC, Personett AR, Turner G, Dorfmeyer RA, Franklin CL. Variable colonization after reciprocal fecal microbiota transfer between mice with low and high richness microbiota. Front Microbiol. 2017;8(2):196.
    1. SEQC/MAQC-III Consortium A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32:903. doi: 10.1038/nbt.2957.
    1. McMurdie PJ, Holmes S. Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE. 2013;8(4):61217. doi: 10.1371/journal.pone.0061217.

Source: PubMed

3
Se inscrever