Machine learning algorithm validation with a limited sample size

Andrius Vabalas, Emma Gowen, Ellen Poliakoff, Alexander J Casson

Abstract

Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias from data dimensionality, hyper-parameter space and number of CV folds were explored, and the validation methods were also compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies according to the validation method used.
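As an illustration of the difference the abstract describes, the following minimal sketch (our own scikit-learn example on pure-noise data, not the authors' code) scores an RBF-kernel SVM twice: once with hyper-parameters tuned on the same K folds that produce the reported accuracy, and once with nested CV where tuning is confined to an inner loop. Because the features carry no class information, any estimate above roughly 50% reflects overfitting of the validation procedure itself.

    # Minimal sketch, assuming scikit-learn and numpy; data are pure Gaussian noise,
    # so the true classification accuracy is 50%.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 50))              # 100 samples, 50 noise features
    y = np.repeat([0, 1], 50)                       # balanced labels, independent of X

    param_grid = {"C": 2.0 ** np.arange(-2, 3), "gamma": 2.0 ** np.arange(-4, 1)}
    inner = KFold(n_splits=5, shuffle=True, random_state=1)
    outer = KFold(n_splits=5, shuffle=True, random_state=2)

    # K-fold CV estimate: hyper-parameters are chosen on the same folds that are
    # reported, so the best mean score drifts above chance.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner).fit(X, y)
    print("K-fold CV estimate:", search.best_score_)

    # Nested CV estimate: tuning happens inside each outer training fold only,
    # and the outer folds remain untouched test data.
    nested = cross_val_score(GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner),
                             X, y, cv=outer)
    print("Nested CV estimate:", nested.mean())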

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Relationship between log10 transformed sample size and reported accuracy.
Data are from the 55 studies in the survey which applied ML methods in autism research. A: Relationship between log10 transformed sample size and accuracy, with a dark blue regression line and a light blue area showing 95% confidence intervals. B: Classifiers used in the studies. C: Relationship between reported accuracy and log10 transformed sample size by year; the bottom scatter plots show the studies published in each year. Year 2019 includes studies up to 18/04/2019, when the literature search was performed. N—sample size. D: Relationship between reported accuracy and log10 transformed sample size by the modality of data used in the study.
Fig 2. Validation methods.
A: Train/Test Split. B: K-Fold CV. C: Nested CV. D: Partially nested CV. ACC—overall accuracy of the model, ACC_i—accuracy in a single CV fold.
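For panel A, a minimal sketch of the Train/Test Split scheme is given below (illustrative scikit-learn code under our own assumptions, not the authors' implementation): feature selection and hyper-parameter tuning see only the training split, and the held-out test set is scored exactly once to give ACC.

    # Sketch of the Train/Test Split scheme (panel A), assuming scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=50, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)

    # RFE feature selection and C tuning use the training split only.
    pipe = Pipeline([("rfe", RFE(SVC(kernel="linear"), n_features_to_select=10)),
                     ("svm", SVC(kernel="rbf"))])
    grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5).fit(X_tr, y_tr)

    # The held-out test set is used once, giving the reported ACC.
    print("test accuracy:", grid.score(X_te, y_te))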
Fig 3. Gaussian noise classification accuracy distributions with different validation approaches.
K-Fold, Nested, Train/Test Split and two types of partially nested validation methods used. Thick lines show mean validation accuracy and dash-dot lines show 95% confidence intervals for 50 runs. A: SVM-RFE feature selection and SVM classification. B: t-test feature selection and logistic regression classification.
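The pooled-versus-fold-wise distinction behind these distributions can be reproduced with a few lines; the sketch below is illustrative (scikit-learn, our own parameter choices, not the authors' simulation code). Running SVM-RFE on all samples before K-fold CV inflates accuracy even on pure noise, whereas refitting the selector inside each training fold does not.

    # Sketch: SVM-RFE feature selection on pooled data vs. inside each CV fold,
    # on Gaussian-noise data where true accuracy is 50%. Assumes scikit-learn.
    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(42)
    X = rng.standard_normal((50, 100))          # N = 50 samples, 100 noise features
    y = np.repeat([0, 1], 25)
    cv = KFold(n_splits=10, shuffle=True, random_state=0)

    selector = RFE(SVC(kernel="linear"), n_features_to_select=10)

    # Biased: the selector sees the pooled training and validation samples before CV.
    X_pooled = selector.fit_transform(X, y)
    print("pooled selection:   ",
          cross_val_score(SVC(kernel="linear"), X_pooled, y, cv=cv).mean())

    # Unbiased: the selector is refitted on the training portion of every fold.
    pipe = Pipeline([("rfe", RFE(SVC(kernel="linear"), n_features_to_select=10)),
                     ("svm", SVC(kernel="linear"))])
    print("fold-wise selection:", cross_val_score(pipe, X, y, cv=cv).mean())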
Fig 4. Other factors apart from sample size influencing overfitting when K-Fold CV is used.
SVM-RFE and t-test feature selection, SVM classification, and sample size fixed at N = 100. A: Feature number manipulated from 20 to 200. B: Parameter tuning grid size manipulated from 2 × 2 to 20 × 20 with C = 2^j, where j varied from 2 to 20, and γ = 2^i, where i varied from −2 to −20. C: Number of CV folds varied from two-fold to leave-one-out. Thick dashed lines show fitted 5th order polynomial trend.
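How a k × k grid of this kind can be constructed is sketched below (illustrative scikit-learn code; the even spacing of the exponents is our assumption, since the caption only gives their ranges). Enlarging the grid gives the tuning step more opportunities to fit noise in the folds.

    # Sketch of a k x k SVM tuning grid with C = 2^j and gamma = 2^i, assuming
    # exponents spread evenly over the ranges stated in the caption.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    k = 5                                   # grid size; varied from 2 to 20 in the figure
    param_grid = {"C":     [2.0 ** j for j in np.linspace(2, 20, k)],
                  "gamma": [2.0 ** i for i in np.linspace(-2, -20, k)]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)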
Fig 5. Other factors apart from sample size influencing overfitting when K-Fold CV is used.
SVM-RFE and t-test feature selection, logistic regression classification, and sample size fixed at N = 100. A: Feature number manipulated from 20 to 200. B: Parameter tuning grid size manipulated from 2 × 2 to 200 × 2 with penalty set to L1, L2 and C = e^i, where i varied from −4 to 4. Thick lines show fitted 5th order polynomial trend. C: Number of CV folds varied from two-fold to leave-one-out. Thick dashed lines show fitted 5th order polynomial trend.
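The corresponding logistic-regression grid can be written analogously; the sketch below is again illustrative (the even spacing of the exponent i is our assumption).

    # Sketch of an n x 2 logistic-regression tuning grid with C = e^i, i in [-4, 4],
    # and L1/L2 penalties; assumes scikit-learn's liblinear solver.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    n_c = 10                                # varied from 2 to 200 in the figure
    param_grid = {"C": np.exp(np.linspace(-4, 4, n_c)),
                  "penalty": ["l1", "l2"]}
    search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)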
Fig 6. K-Fold CV with different feature-to-sample ratios.
Sample size ranged from 14 to 446 and feature number was set accordingly to keep feature-to-sample ratios at 20, 10, 3, 2, 1, 1/2 and 1/3. A: SVM-RFE feature selection and SVM classifier. B: t-test feature selection and logistic regression classifier.
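As a worked example of these ratios: at N = 100, a feature-to-sample ratio of 3 corresponds to 300 features, while a ratio of 1/3 corresponds to roughly 33 features.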
Fig 7. Classification with discriminable data using K-Fold CV, Nested CV and Train/Test Split validation methods.
A: Comparison of different validation methods. Dash-dot lines show 95% confidence intervals for 50 runs. B: Size of the 95% confidence intervals. The inset plot shows a more detailed view of the confidence intervals for K-fold CV in the sample size range of 20 to 200 (in the inset plot sample sizes were N = 20, 22, …, 198, 200; in the main plot N = 20, 40, …, 980, 1000).
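To show what discriminable data means here, the sketch below (our own illustration, with an arbitrary effect size rather than the authors' settings) adds a mean shift to a few features of one class and reports the mean and spread of K-fold accuracies; with small N the fold-to-fold spread, and hence the confidence interval width, remains large.

    # Sketch: simulate discriminable data by shifting the class-1 mean of a few
    # features, then look at the spread of K-fold CV accuracies. Assumes scikit-learn;
    # the shift of 1.0 on 5 features is an arbitrary illustrative effect size.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n, p = 100, 50
    X = rng.standard_normal((n, p))
    y = np.repeat([0, 1], n // 2)
    X[y == 1, :5] += 1.0                    # genuine class signal in the first 5 features

    scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=10)
    print("mean accuracy:", scores.mean(), " fold-to-fold SD:", scores.std())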
Fig 8. Illustrative examples of why models overfit.
A: SVM-RBF decision boundary. Red and blue circles/crosses show data points from two classes; red and blue areas show the decision boundary learned by the SVM-RBF. Left: Classifier trained on both the train data points (circles) and the validation data points (crosses). Right: Classifier trained only on the train data points (circles). B: Two-sample t-test feature selection performed both on pooled and on independent train and validation data. The Y axis shows the mean t-statistic for the 10 selected features, with the pool of candidate features ranging from 20 to 100.
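The selection leak behind panel B can be reproduced directly; the sketch below is illustrative (our own split sizes on a pure-noise dataset, not the authors' code). When the top features are ranked on pooled train-plus-validation data, they still look separable on the validation samples; when the ranking uses the training samples only, that apparent separability disappears.

    # Sketch: two-sample t-test feature selection on pooled vs. train-only data,
    # evaluated on held-out validation samples. Pure-noise data; assumes scipy/numpy.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    X = rng.standard_normal((60, 100))              # 60 samples, 100 noise features
    y = np.repeat([0, 1], 30)
    train = np.r_[0:20, 30:50]                      # 20 samples per class for training
    val = np.r_[20:30, 50:60]                       # 10 samples per class held out

    def abs_t(rows):
        """Absolute two-sample t-statistic per feature within the given rows."""
        t, _ = ttest_ind(X[rows][y[rows] == 0], X[rows][y[rows] == 1])
        return np.abs(t)

    for name, rows in [("pooled train+validation", np.r_[train, val]),
                       ("train only             ", train)]:
        top10 = np.argsort(abs_t(rows))[-10:]       # pick the 10 'most discriminative' features
        # How separable do those features look on the held-out validation samples?
        print(name, "-> mean |t| on validation:", abs_t(val)[top10].mean())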

