Random forest versus logistic regression: a large-scale benchmark experiment

Raphael Couronné, Philipp Probst, Anne-Laure Boulesteix

Abstract

Background and goal: The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has since become a standard classification approach competing with logistic regression (LR) in many innovation-friendly scientific fields.

Results: In this context, we present a large-scale benchmark experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.
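
The study itself was run in R; for illustration only, the following Python/scikit-learn sketch (not the authors' code) shows the per-dataset comparison logic: fit RF with default parameters and LR, then compare accuracy, AUC and Brier score under cross-validation. The built-in breast-cancer data merely stands in for one of the 243 OpenML datasets.

    # Illustrative sketch, not the authors' implementation.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    X, y = load_breast_cancer(return_X_y=True)
    # "neg_brier_score" is the negated Brier score (higher is better).
    scoring = ["accuracy", "roc_auc", "neg_brier_score"]

    results = {}
    for name, model in [
        ("RF", RandomForestClassifier()),           # default hyperparameters, as in the study
        ("LR", LogisticRegression(max_iter=1000)),  # plain logistic regression
    ]:
        cv = cross_validate(model, X, y, cv=5, scoring=scoring)
        results[name] = {m: cv[f"test_{m}"].mean() for m in scoring}

    # Per-dataset performance difference, e.g. for accuracy:
    delta_acc = results["RF"]["accuracy"] - results["LR"]["accuracy"]

Leaving both learners at their defaults mirrors the study's design choice: the comparison targets the out-of-the-box behavior of RF rather than a tuned variant.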

Conclusion: RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI = [0.022, 0.038]) for the accuracy, 0.041 (95% CI = [0.031, 0.053]) for the area under the curve, and −0.027 (95% CI = [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmark experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests, which may yield improved accuracy compared to the original version with default values.
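
One plausible way to obtain such confidence intervals is a percentile bootstrap over the per-dataset differences. The sketch below uses simulated placeholder values, not the study's results; only the resampling mechanics are illustrated.

    # Percentile-bootstrap 95% CI for the mean difference across datasets.
    # The delta values are made-up placeholders, NOT the study's data.
    import numpy as np

    rng = np.random.default_rng(0)
    delta = rng.normal(0.03, 0.05, size=243)  # hypothetical per-dataset accRF - accLR

    boot_means = np.array([
        rng.choice(delta, size=delta.size, replace=True).mean()
        for _ in range(10_000)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])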

Keywords: Classification; Comparison study; Logistic regression; Prediction.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Example of partial dependence plots. Plot of the PDP for the three simulated datasets. Each line is related to a dataset. Left: visualization of the dataset. Right: the partial dependence for the variable X1. First dataset: β0 = 1, β1 = 5, β2 = −2 (linear); second dataset: β0 = 1, β1 = 1, β2 = −1, β3 = 3 (interaction); third dataset: β0 = −2, β4 = 5 (non-linear)
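
To make the construction of such a curve concrete, this hedged Python sketch (a stand-in, not the authors' simulation code) generates data from the third, non-linear setting of the caption, logit = β0 + β4·X1², and computes the partial dependence of a fitted RF on X1 with scikit-learn; the uniform design for X is an assumption.

    # Minimal sketch under stated assumptions; requires scikit-learn >= 1.1
    # for the "grid_values" key of the partial_dependence result.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import partial_dependence

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(2000, 2))         # two features X1, X2
    logit = -2 + 5 * X[:, 0] ** 2                  # non-linear in X1; X2 irrelevant
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # Bernoulli outcome

    rf = RandomForestClassifier().fit(X, y)
    pdp = partial_dependence(rf, X, features=[0], kind="average")
    grid, avg = pdp["grid_values"][0], pdp["average"][0]  # the PDP curve for X1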

Fig. 2
Selection of datasets. Flowchart representing the criteria for selection of the datasets

Fig. 3
Main results of the benchmark experiment. Boxplots of the performance for the three considered measures on the 243 considered datasets. Top: boxplots of the performance of LR (dark) and RF (white) for each performance measure. Bottom: boxplots of the difference in performance Δperf = perfRF − perfLR

Fig. 4
Influence of n and p: subsampling experiment based on dataset ID = 310. Top: boxplots of the performance (acc) of RF (dark) and LR (white) for N = 50 sub-datasets extracted from the OpenML dataset with ID = 310 by randomly picking n′ ≤ n observations and p′ < p features. Bottom: boxplots of the difference in performance Δacc = accRF − accLR between RF and LR. p′ ∈ {1, 2, 3, 4, 5, 6}, n′ ∈ {5e2, 1e3, 5e3, 1e4}. Performance is evaluated through 5-fold cross-validation repeated twice
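
A hedged sketch of this subsampling scheme follows: draw n′ rows and p′ columns at random and recompute Δacc on each sub-dataset. A synthetic array stands in for OpenML dataset 310, and the function name is illustrative.

    # Illustrative subsampling sketch, not the authors' code.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def subsample_delta_acc(X, y, n_sub, p_sub, rng):
        rows = rng.choice(X.shape[0], size=n_sub, replace=False)
        cols = rng.choice(X.shape[1], size=p_sub, replace=False)
        Xs, ys = X[np.ix_(rows, cols)], y[rows]
        acc_rf = cross_val_score(RandomForestClassifier(), Xs, ys, cv=5).mean()
        acc_lr = cross_val_score(LogisticRegression(max_iter=1000), Xs, ys, cv=5).mean()
        return acc_rf - acc_lr  # delta_acc = accRF - accLR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20_000, 6))               # stand-in for the real dataset
    y = (X[:, 0] + rng.normal(size=20_000) > 0).astype(int)
    deltas = [subsample_delta_acc(X, y, n_sub=500, p_sub=3, rng=rng) for _ in range(50)]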

Fig. 5
Subgroup analyses. Top: for each of the four selected meta-features n, p, p/n and Cmax, boxplots of Δacc for different thresholds used as criteria for dataset selection. Bottom: distribution of the four meta-features (log scale), with the chosen thresholds displayed as vertical lines. Note that outliers are not shown here, for better readability. For a corresponding figure including the outliers, as well as the results for auc and brier, see Additional file 1
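
The subgroup analysis itself reduces to a filter-and-summarize step over a table with one row per benchmark dataset; the following small pandas sketch uses a hypothetical table whose values, thresholds and column names are made up.

    # Hypothetical illustration of the subgroup analysis, not the study's data.
    import pandas as pd

    meta = pd.DataFrame({
        "n":         [500, 1200, 80, 40000],
        "p":         [6, 30, 1000, 12],
        "Cmax":      [0.52, 0.90, 0.61, 0.70],    # majority-class proportion
        "delta_acc": [0.01, 0.05, -0.02, 0.03],
    })
    meta["p_over_n"] = meta["p"] / meta["n"]

    for threshold in (100, 500, 1000):             # illustrative thresholds
        subgroup = meta[meta["n"] > threshold]     # restrict to n > threshold
        print(threshold, len(subgroup), subgroup["delta_acc"].median())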

Fig. 6
Plot of the partial dependence for the 4 considered meta-features: log(n), log(p), log(p/n) and Cmax. The log scale was chosen for 3 of the 4 features to obtain more uniform distributions (see Fig. 5, where the distributions are plotted on a log scale). For each plot, the black line denotes the median of the individual partial dependences, and the lower and upper curves of the grey regions represent the 25%- and 75%-quantiles, respectively. The estimated MSE is 0.00382, via 5-fold CV repeated 4 times
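
Conceptually, this meta-learning step fits a regression RF with Δacc as the response and the four meta-features as covariates, then computes partial dependences for each covariate. A self-contained sketch with simulated meta-data (none of the numbers are the study's values):

    # Self-contained sketch of the meta-learning step; all data are simulated.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import partial_dependence

    rng = np.random.default_rng(0)
    m = 243                                        # one row per benchmark dataset
    logn = rng.uniform(3, 10, m)                   # log(n)
    logp = rng.uniform(1, 5, m)                    # log(p)
    cmax = rng.uniform(0.5, 0.95, m)               # Cmax
    X_meta = np.column_stack([logn, logp, logp - logn, cmax])  # 3rd col = log(p/n)
    delta_acc = 0.05 - 0.01 * logn + rng.normal(0, 0.02, m)    # fake response

    reg = RandomForestRegressor().fit(X_meta, delta_acc)
    pdp = partial_dependence(reg, X_meta, features=[0], kind="average")

Passing kind="both" would additionally return the individual per-observation curves, whose median and quartiles correspond to the black line and grey region displayed in the figure.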

