Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing

Zaka Wing-Sze Yuen, Akanksha Srivastava, Runa Daniel, Dennis McNevin, Cameron Jack, Eduardo Eyras

Abstract

DNA methylation plays a fundamental role in the control of gene expression and genome integrity. Although there are multiple tools that enable its detection from Nanopore sequencing, their accuracy remains largely unknown. Here, we present a systematic benchmarking of tools for the detection of CpG methylation from Nanopore sequencing using individual reads, control mixtures of methylated and unmethylated reads, and bisulfite sequencing. We found that the tools show a tradeoff between false positives and false negatives and a high dispersion with respect to the expected methylation frequency values. We describe various strategies to improve the accuracy of these tools, including a consensus approach, METEORE (https://github.com/comprna/METEORE), which combines the predictions from two or more tools and shows improved accuracy over individual tools. Snakemake pipelines are also provided for reproducibility and to enable the systematic application of our analyses to other datasets.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Analysis pipeline for 5mC detection at CpG sites from nanopore sequencing.
The diagram describes the approach used to test the tools Nanopolish, Tombo, DeepSignal, Megalodon, Guppy, and DeepMod. Snakemake pipelines and command lines used are available at https://github.com/comprna/METEORE. A Snakemake pipeline was not developed for Megalodon and DeepMod as they can be run with a single command. Excluding DeepMod, all tools produce predictions per individual read and per CG site. In addition, all tools predict the methylation frequency at each genome site from fast5 input files. Methods that currently only accept single-read fast5 format are indicated.
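The relation between per-read predictions and the per-site methylation frequency can be sketched as follows. This is a minimal illustration, not the METEORE pipeline: the `(chrom, pos, is_methylated)` tuple layout and the function name `site_frequencies` are assumptions made for the example, and each tool has its own output format and its own rule for turning a per-read score into a call.

```python
from collections import defaultdict

def site_frequencies(read_calls):
    """Aggregate per-read calls into a per-site methylation frequency.

    read_calls: iterable of (chrom, pos, is_methylated) tuples, one per read
    covering a CpG site (hypothetical layout, chosen for this sketch).
    Returns {(chrom, pos): (frequency, coverage)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [methylated reads, total reads]
    for chrom, pos, is_methylated in read_calls:
        counts[(chrom, pos)][0] += int(is_methylated)
        counts[(chrom, pos)][1] += 1
    return {site: (m / n, n) for site, (m, n) in counts.items()}

# Made-up calls: four reads over one site, two reads over another.
calls = [
    ("chr1", 100, True), ("chr1", 100, True), ("chr1", 100, False),
    ("chr1", 100, True), ("chr1", 250, False), ("chr1", 250, False),
]
freqs = site_frequencies(calls)
```

Here the site at position 100 has three methylated reads out of four, so its frequency is 0.75 at coverage 4.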
Fig. 2. Accuracy analysis per CpG site on control mixture dataset 1.
a Violin plots showing the predicted methylation frequencies (y axis) for each control mixture set with a given proportion of methylated reads (x axis). The Pearson’s correlation (r), coefficient of determination (r²), and root mean square error (RMSE) are given for each tool. b For each method, we indicate the proportion of sites predicted outside a 10% window around the expected methylation proportion, i.e., each predicted site in the m% dataset was classified as “outside” if its predicted percentage methylation was outside the interval [(m − 5)%, (m + 5)%] for intermediate methylation values, or outside the intervals [0, 5%] or [95%, 100%] for the fully unmethylated or fully methylated sets, respectively. The percentage is indicated on top of each bar, except for 100%. c Empirical cumulative distribution function (ECDF) plot showing the number of true negatives (y axis) for each tool according to different thresholds for the predicted methylation frequency below which a site was called unmethylated (x axis), using the dataset of 100 fully unmethylated sites. d ECDF plot showing the number of true positives (y axis) for each tool according to different thresholds for the predicted methylation frequency above which a site was called fully methylated (x axis), using the dataset of 100 fully methylated sites.
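The "outside the window" rule of panel b and the RMSE of panel a can be written out as a short sketch. The function names and the example values are illustrative assumptions; whether the interval ends are treated as inside or outside is not stated in the caption, and they are taken as inside here.

```python
import math

def is_outside(predicted_pct, expected_pct, half_width=5.0):
    """True if a site's predicted methylation % falls outside the window."""
    if expected_pct == 0:            # fully unmethylated set: [0, 5%]
        low, high = 0.0, half_width
    elif expected_pct == 100:        # fully methylated set: [95%, 100%]
        low, high = 100.0 - half_width, 100.0
    else:                            # intermediate sets: [(m - 5)%, (m + 5)%]
        low, high = expected_pct - half_width, expected_pct + half_width
    # Interval ends are counted as inside (an assumption of this sketch).
    return not (low <= predicted_pct <= high)

def proportion_outside(predicted, expected_pct):
    """Fraction of sites whose prediction falls outside the window."""
    return sum(is_outside(p, expected_pct) for p in predicted) / len(predicted)

def rmse(predicted, expected_pct):
    """Root mean square error of predictions against the expected value."""
    return math.sqrt(sum((p - expected_pct) ** 2 for p in predicted) / len(predicted))

# Made-up predictions for four sites in the 50% mixture.
predicted = [48.0, 52.0, 60.0, 41.0]
outside_fraction = proportion_outside(predicted, 50)
error = rmse(predicted, 50)
```

With these values, two of the four sites (60% and 41%) lie outside [45%, 55%], so the proportion outside is 0.5.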
Fig. 3. Model accuracy at the individual read level and per-site accuracy analysis on control mixture dataset 2.
a Receiver operating characteristic (ROC) curves showing the false-positive rate (x axis) and true positive rate (y axis) for the predictions at individual read levels for the five methods tested, using reads from 0 and 100% methylated sets. b Precision–recall (PR) curves showing the recall (x axis) and precision (y axis) for the predictions at individual read levels for the five methods tested, using reads from 0 and 100% methylated sets. c ROC curves for METEORE for the random forest (RF) model (parameters: max_depth = 3 and n_estimator = 10) combining two methods, as well as combining the five methods. The curves were built from the average of a tenfold cross-validation with mixture dataset 1. Similar plots for another RF model using default parameters and a regression (REG) model are shown in Supplementary Fig. 3. d PR curves for the same models as in c. e Violin plots showing the predicted methylation frequencies (y axis) for each control mixture set with a given proportion of methylated reads (x axis) from the mixture dataset 2 for the five tested tools plus METEORE combining Megalodon and DeepSignal using a random forest (RF) or a regression (REG) model. The Pearson’s correlation (r) and coefficient of determination (r²) are given for each tool. f We indicate the proportion of sites predicted outside a window around the expected methylation proportion, i.e., a site in the m% dataset was “outside” if the predicted percentage methylation was outside the interval [(m − 5)%, (m + 5)%] for intermediate methylation sets, or outside the intervals [0, 5%] or [95%, 100%] for the fully unmethylated or fully methylated sets, respectively. The percentage is indicated on top of each bar, except for 100%.
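How a read-level ROC curve of the kind shown in panel a is built from scores and known labels can be sketched in plain Python. This is a generic threshold sweep, not the code used for the figure; the function names, the score scale, and the example values are assumptions.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs obtained by sweeping a threshold over scores.

    scores: per-read methylation scores (higher = more likely methylated).
    labels: 1 for reads from the 100% methylated set, 0 for the 0% set.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A read is called methylated when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores for three methylated and three unmethylated reads.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

For this toy input, eight of the nine methylated/unmethylated read pairs are ranked correctly, which gives an area of 8/9.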
Fig. 4. Comparison of CpG methylation predictions from nanopore with whole-genome bisulfite sequencing (WGBS).
a Distribution of methylation calls (n = 1743) from nanopore (y axis) across three WGBS methylation bins: no or low methylation (0.0–0.3) (n = 793), intermediate methylation (0.3–0.7) (n = 264), and high or full methylation (0.7–1.0) (n = 686). In the boxplots, the lower and upper boundaries of the box are the first and third quartiles of the data, respectively, with the median indicated by a thick black line. The lower and upper whiskers extend to 1.5 times the interquartile range. The outliers are represented by the black dots. b Pearson’s correlation (r) (y axis) between methylation frequencies calculated from nanopore reads and WGBS at sites by each of the tested tools (combining predictions from both strands) at each level of minimal input coverage, i.e., minimum number of nanopore reads considered per site as reported from the BAM file (x axis). c Mean reported coverage (using the coverage reported by each tool for all sites) at each value of minimum input coverage in b. METEORE (RF) is the combination of Megalodon and DeepSignal using a random forest (parameters: max_depth = 3 and n_estimator = 10), and METEORE (REG) is the combination of Megalodon and DeepSignal using a regression model.
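The binning of panel a and the coverage-filtered correlation of panel b can be sketched as below. The bin boundaries follow the caption, but which bin a value of exactly 0.3 or 0.7 belongs to is an assumption, as are the function names, the `(nanopore, wgbs, coverage)` tuple layout, and the example values.

```python
import math

def wgbs_bin(freq):
    """Assign a WGBS methylation frequency to one of the three bins."""
    if freq < 0.3:
        return "low"            # no or low methylation (0.0-0.3)
    if freq < 0.7:
        return "intermediate"   # intermediate methylation (0.3-0.7)
    return "high"               # high or full methylation (0.7-1.0)

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def r_at_min_coverage(sites, min_cov):
    """Correlation using only sites with at least min_cov nanopore reads."""
    kept = [(n, w) for n, w, cov in sites if cov >= min_cov]
    return pearson_r([n for n, _ in kept], [w for _, w in kept])

# Made-up sites: (nanopore frequency, WGBS frequency, nanopore coverage).
sites = [
    (0.10, 0.05, 12), (0.50, 0.45, 30), (0.90, 0.95, 25), (0.60, 0.10, 3),
]
r_all = r_at_min_coverage(sites, 1)
r_filtered = r_at_min_coverage(sites, 10)
```

In this toy input the only discordant site has coverage 3, so raising the minimum coverage to 10 removes it and the correlation increases.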
Fig. 5. Comparison of CpG methylation predictions from nanopore with whole-genome bisulfite sequencing (WGBS) along Cas9-targeted regions.
Locally estimated scatterplot smoothing (LOESS) line plots of methylation frequency predictions (left y axes) from WGBS Illumina and from nanopore data using seven tools: Nanopolish, DeepSignal, Megalodon, Tombo, Guppy, DeepMod, and METEORE random forest (RF) and regression (REG) models. The plots include the nanopore read coverage (right y axes), shown as a light gray area. The panels below show the GC content of the region, using a window size of 50 bases. a–c Three of our ten target regions and d one of the eight target regions from Gilpatrick et al. The depicted regions are a chr6:392228-401463, which covers the first and second introns of gene IRF4; b chr1:159199780-159212236, which covers the genes ACKR1 and CADM3; c chr2:1480363-1494141, which covers the gene TPO; and d chr3:49352525-49366169, which covers the gene GPX1. A zoomed-in view of the LOESS line plots with the individual methylation calls and logos for the sequence context around the CpG site in four CpG islands (CGIs) of our target regions is shown in Supplementary Fig. 15.
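A GC-content track with a window size of 50 bases can be computed along the lines of the sketch below. The caption does not say whether the windows overlap; consecutive non-overlapping windows are assumed here, and the function name and the toy sequence are illustrative.

```python
def gc_content(seq, window=50):
    """GC fraction in consecutive non-overlapping windows of `window` bases.

    The last window may be shorter than `window` if the sequence length is
    not a multiple of it.
    """
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / len(chunk))
    return fractions

# Toy 120-base sequence: 50 bases of GC, 50 of AT, then 20 of mixed content.
region = "GC" * 25 + "AT" * 25 + "ACGT" * 5
profile = gc_content(region)
```

The three windows of this toy sequence have GC fractions of 1.0, 0.0, and 0.5.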

References

    1. Greenberg MVC, Bourc’his D. The diverse roles of DNA methylation in mammalian development and disease. Nat. Rev. Mol. Cell Biol. 2019;20:590–607. doi: 10.1038/s41580-019-0159-6.
    2. Kader F, Ghai M. DNA methylation and application in forensic sciences. Forensic Sci. Int. 2015;249:255–265. doi: 10.1016/j.forsciint.2015.01.037.
    3. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat. Rev. Genet. 2012;13:484–492. doi: 10.1038/nrg3230.
    4. Yong W-S, Hsu F-M, Chen P-Y. Profiling genome-wide DNA methylation. Epigenetics Chromatin. 2016;9:26. doi: 10.1186/s13072-016-0075-3.
    5. Raiber E-A, Hardisty R, van Delft P, Balasubramanian S. Mapping and elucidating the function of modified bases in DNA. Nat. Rev. Chem. 2017;1:0069. doi: 10.1038/s41570-017-0069.
    6. Grunau C, Clark S, Rosenthal A. Bisulfite genomic sequencing: systematic investigation of critical experimental parameters. Nucleic Acids Res. 2001;29:e65. doi: 10.1093/nar/29.13.e65.
    7. Ehrich M, Zoll S, Sur S, Van Den Boom D. A new method for accurate assessment of DNA quality after bisulfite treatment. Nucleic Acids Res. 2007;35:e29. doi: 10.1093/nar/gkl1134.
    8. Simpson JT, et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods. 2017;14:407–410. doi: 10.1038/nmeth.4184.
    9. Laszlo AH, et al. Detection and mapping of 5-methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc. Natl Acad. Sci. USA. 2013;110:18904–18909. doi: 10.1073/pnas.1310240110.
    10. Rand AC, et al. Mapping DNA methylation with high-throughput nanopore sequencing. Nat. Methods. 2017;14:411–413. doi: 10.1038/nmeth.4189.
    11. Yuen ZW-S, Srivastava A, Jack C, Eyras E. Systematic benchmarking of tools for CpG methylation detection from Nanopore sequencing. 2021. doi: 10.5281/zenodo.4748319.
    12. Oxford Nanopore Technologies. GitHub—Megalodon (Oxford Nanopore Technologies, 2020).
    13. Ni P, et al. DeepSignal: detecting DNA methylation state from nanopore sequencing reads using deep-learning. Bioinformatics. 2019;35:4586–4595. doi: 10.1093/bioinformatics/btz276.
    14. Oxford Nanopore Technologies. GitHub (Oxford Nanopore Technologies, 2020).
    15. Stoiber M, et al. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. 2017. doi: 10.1101/094672.
    16. Liu Q, et al. Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nat. Commun. 2019;10:2449. doi: 10.1038/s41467-019-10168-2.
    17. Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2522. doi: 10.1093/bioinformatics/bts480.
    18. Gilpatrick T, et al. Targeted nanopore sequencing with Cas9-guided adapter ligation. Nat. Biotechnol. 2020;38:433–438.
    19. Dunham I, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
    20. Chen P-Y, Feng S, Joo JWJ, Jacobsen SE, Pellegrini M. A comparative analysis of DNA methylation across human embryonic stem cell lines. Genome Biol. 2011;12:R62. doi: 10.1186/gb-2011-12-7-r62.
    21. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191.
    22. Liu Q, Georgieva DC, Egli D, Wang K. NanoMod: a computational tool to detect DNA modifications using Nanopore long-read sequencing data. BMC Genom. 2019;20:78. doi: 10.1186/s12864-018-5372-8.
    23. McIntyre ABR, et al. Single-molecule sequencing detection of N6-methyladenine in microbial reference materials. Nat. Commun. 2019;10:579. doi: 10.1038/s41467-019-08289-9.
    24. Oxford Nanopore Technologies. Rerio GitHub (Oxford Nanopore Technologies, 2020).
    25. Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi: 10.1023/A:1010933404324.
    26. Pedregosa F, et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
    27. Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004.
    28. O’Shea JP, et al. pLogo: a probabilistic approach to visualizing sequence motifs. Nat. Methods. 2013;10:1211–1212. doi: 10.1038/nmeth.2646.
    29. Labun K, et al. CHOPCHOP v3: expanding the CRISPR web toolbox beyond genome editing. Nucleic Acids Res. 2019;47:W171–W174. doi: 10.1093/nar/gkz365.
    30. Integrated DNA Technologies. CRISPR-Cas9 Guide RNA Design Checker (Integrated DNA Technologies, 2019).
    31. Robinson JT, et al. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754.
    32. Danecek P, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008.
    33. Oxford Nanopore Technologies. Evaluation of Read-mapping Characteristics from a Cas-Mediated PCR-Free Enrichment (Oxford Nanopore Technologies, 2019).
    34. R Core Team. R: a Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2020).

Source: PubMed
