Predicting sample size required for classification performance

Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula, Long H Ngo

Abstract

Background: Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.

Methods: We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness-of-fit measures. As a control, we used an un-weighted fitting method.
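The weighted fit described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes scipy's `curve_fit`, an inverse power law of the form a + b·x^c, a hypothetical set of learning-curve points, and a simple weighting scheme that favors points from larger training sets (the paper's exact weights may differ).

```python
# Sketch of weighted inverse-power-law learning-curve fitting.
# Assumptions: scipy available; model form a + b*x**c (c < 0, so
# performance rises toward the plateau a); hypothetical data points.
import numpy as np
from scipy.optimize import curve_fit

def inv_power_law(x, a, b, c):
    # a: asymptotic performance; b, c shape the approach to the plateau
    return a + b * np.power(x, c)

# Hypothetical learning-curve points: (training set size, accuracy)
sizes = np.array([20, 40, 60, 80, 100, 120, 140, 160], dtype=float)
acc = np.array([0.62, 0.70, 0.74, 0.77, 0.79, 0.80, 0.81, 0.815])

# Smaller sigma = more weight; here points from larger training sets
# are trusted more, as they have lower variance.
sigma = 1.0 / sizes

params, cov = curve_fit(inv_power_law, sizes, acc,
                        p0=(0.9, -1.0, -0.5), sigma=sigma, maxfev=10000)
a, b, c = params

# Extrapolate: predicted accuracy at a larger annotation budget
pred_500 = inv_power_law(500.0, a, b, c)
print(f"plateau ~ {a:.3f}, predicted accuracy at n=500 ~ {pred_500:.3f}")
```

In practice the fitted covariance matrix `cov` can be propagated to obtain the confidence interval on the extrapolated performance, which is how the predicted intervals in the paper's figures arise.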

Results: A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute error and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).
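The 0.01 threshold above refers to two standard goodness-of-fit measures comparing predicted and observed classifier performance. A minimal sketch, using hypothetical predicted/observed accuracy values (not data from the paper):

```python
# Goodness-of-fit measures used to validate learning-curve predictions:
# mean absolute error (MAE) and root mean squared error (RMSE).
import numpy as np

def mae(pred, obs):
    return float(np.mean(np.abs(pred - obs)))

def rmse(pred, obs):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

# Hypothetical observed vs. predicted accuracies at held-out sample sizes
observed  = np.array([0.810, 0.820, 0.825, 0.830])
predicted = np.array([0.815, 0.818, 0.827, 0.828])

print(f"MAE  = {mae(predicted, observed):.5f}")
print(f"RMSE = {rmse(predicted, observed):.5f}")
# Both fall below the 0.01 criterion used in the paper.
```

Since RMSE squares the residuals before averaging, it penalizes large individual prediction errors more heavily than MAE; RMSE is always greater than or equal to MAE on the same residuals.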

Conclusions: This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.

Figures

Figure 1: Generic learning curve.
Figure 2: Progression of online curve fitting for the learning curve of the dataset D2-RAND.
Figure 3: Progression of confidence interval width and MAE for predicted values.
Figure 4: RMSE for predicted values on the three datasets.
Figure 5: Progression of confidence interval widths for the observed values (training set) and the predicted values.


Source: PubMed
