Estimation of required sample size for external validation of risk models for binary outcomes

Menelaos Pavlou, Chen Qu, Rumana Z Omar, Shaun R Seaman, Ewout W Steyerberg, Ian R White, Gareth Ambler

Abstract

Risk-prediction models for health outcomes are used in practice as part of clinical decision-making, and it is essential that their performance be externally validated. An important aspect of the design of a validation study is the choice of an adequate sample size. In this paper, we investigate the sample size requirements for validation studies with binary outcomes to estimate measures of predictive performance: the C-statistic for discrimination, and the calibration slope and calibration in the large for calibration. We aim for sufficient precision in the estimated measures. In addition, we investigate the sample size required to achieve sufficient power to detect a difference from a target value. Under normality assumptions on the distribution of the linear predictor, we obtain simple estimators for sample size calculations based on the measures above. Simulation studies show that these estimators perform well for common values of the C-statistic and outcome prevalence when the linear predictor is marginally Normal, and that their performance deteriorates only slightly when the normality assumptions are violated. We also propose estimators that do not rely on normality assumptions but instead require specification of the marginal distribution of the linear predictor and the use of numerical integration; these also performed very well under marginal normality. Our sample size equations require a target standard error (SE) together with the anticipated C-statistic and outcome prevalence. The sample size requirement varies with the prognostic strength of the model, the outcome prevalence, the choice of performance measure and the study objective. For example, to achieve an SE < 0.025 for the C-statistic, 60-170 events are required if the true C-statistic and outcome prevalence lie between 0.64-0.85 and 0.05-0.3, respectively. For the calibration slope and calibration in the large, achieving an SE < 0.15 would require 40-280 and 50-100 events, respectively. Our estimators may also be used for survival outcomes when the proportion of censored observations is high.
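
The sketch below is not the paper's closed-form estimators; it is a generic simulation-based alternative in the same spirit, included only for illustration. It assumes the linear predictor of the model under validation is marginally Normal with hypothetical, hand-tuned parameters mu and sd, and it uses NumPy, SciPy, scikit-learn and statsmodels; all function names and numerical inputs are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.special import expit
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm

rng = np.random.default_rng(2024)

def simulate(n, mu, sd):
    """One simulated validation data set: Normal linear predictor, Bernoulli outcome."""
    eta = rng.normal(mu, sd, n)
    y = rng.binomial(1, expit(eta))
    return eta, y

def implied_performance(mu, sd, n_big=200_000):
    """Approximate C-statistic and prevalence implied by (mu, sd); tune (mu, sd)
    by hand until these roughly match the anticipated values."""
    eta, y = simulate(n_big, mu, sd)
    return roc_auc_score(y, eta), y.mean()

def performance(eta, y):
    """C-statistic, calibration slope and calibration in the large (CITL)."""
    c = roc_auc_score(y, eta)
    slope = sm.GLM(y, sm.add_constant(eta),
                   family=sm.families.Binomial()).fit().params[1]
    citl = sm.GLM(y, np.ones(len(eta)), offset=eta,
                  family=sm.families.Binomial()).fit().params[0]
    return c, slope, citl

def empirical_se(n, mu, sd, n_rep=500):
    """Empirical SE of each measure over n_rep simulated validation sets of size n."""
    est = np.array([performance(*simulate(n, mu, sd)) for _ in range(n_rep)])
    return est.std(axis=0, ddof=1)          # SEs of (C-statistic, slope, CITL)

# Hypothetical inputs: mu = -2.6, sd = 1.3 give roughly C = 0.75 and prevalence = 0.1.
print(implied_performance(mu=-2.6, sd=1.3))
print(empirical_se(n=1000, mu=-2.6, sd=1.3))
```

In practice one would increase n until all three empirical SEs fall below the chosen precision targets (for example 0.025 for the C-statistic and 0.15 for the two calibration measures, as quoted above).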

Keywords: C-statistic; Sample size calculation; calibration; discrimination; prediction model.

Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
Standard error of the estimated C-statistic (a), calibration slope (b) and calibration in the large (c) as the true value of the C-statistic varies and the number of events is fixed at 100, corresponding to sample sizes of 2000, 1000, 500, 334 and 250 for outcome prevalences of 0.05, 0.1, 0.2, 0.3 and 0.4, respectively. SE: standard error.
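
The sample sizes quoted in the caption follow from total sample size = events / prevalence; a minimal check (hypothetical snippet):

```python
import math

events = 100
for prevalence in (0.05, 0.1, 0.2, 0.3, 0.4):
    # total sample size implied by 100 events at each outcome prevalence
    print(prevalence, math.ceil(events / prevalence))   # 2000, 1000, 500, 334, 250
```
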
Figure 2.
Number of events required to achieve a target standard error of: (a) SE = 0.025 for the estimated C-statistic (width of 95% CI = 0.1), (b) SE = 0.15 for the estimated calibration slope (width of 95% CI = 0.6), or (c) SE = 0.15 for the estimated calibration in the large, as the true value of the C-statistic and the outcome prevalence vary. SE: standard error.
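
The CI widths in parentheses follow from width of a 95% CI ≈ 2 × 1.96 × SE (about 0.098 for SE = 0.025 and 0.59 for SE = 0.15). As a rough cross-check of panel (a) only, the sketch below uses the classical Hanley & McNeil (1982) approximation to the variance of an estimated C-statistic, not the paper's estimator, and increases the sample size until the approximate SE meets the target; the anticipated C-statistic of 0.75 and prevalence of 0.1 are hypothetical inputs.

```python
import math

def hanley_mcneil_se(c, n_events, n_nonevents):
    """Hanley & McNeil (1982) approximation to the SE of an estimated C-statistic."""
    q1, q2 = c / (2 - c), 2 * c**2 / (1 + c)
    var = (c * (1 - c) + (n_events - 1) * (q1 - c**2)
           + (n_nonevents - 1) * (q2 - c**2)) / (n_events * n_nonevents)
    return math.sqrt(var)

def events_for_target_se(c, prevalence, target_se=0.025):
    """Smallest number of events at which the approximate SE meets the target
    (fractional event counts are allowed in the approximation for simplicity)."""
    n = 50
    while hanley_mcneil_se(c, n * prevalence, n * (1 - prevalence)) > target_se:
        n += 1
    return math.ceil(n * prevalence)

print(events_for_target_se(c=0.75, prevalence=0.1))   # roughly 135 events here
```
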
Figure 3.
Number of events required to detect a difference of magnitude d between 0.03 and 0.1 from a target value of C0 = 0.72 (C1 = C0 + d).
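
A corresponding power calculation can be sketched with a generic normal approximation for the C-statistic, again using the Hanley & McNeil SE rather than the paper's method. C0 = 0.72 comes from the caption; d = 0.05, prevalence = 0.1, 80% power and a two-sided 5% test are hypothetical choices, and the SE is evaluated at C1 for simplicity.

```python
import math
from scipy.stats import norm

def hanley_mcneil_se(c, n_events, n_nonevents):
    """Hanley & McNeil (1982) approximation to the SE of an estimated C-statistic."""
    q1, q2 = c / (2 - c), 2 * c**2 / (1 + c)
    var = (c * (1 - c) + (n_events - 1) * (q1 - c**2)
           + (n_nonevents - 1) * (q2 - c**2)) / (n_events * n_nonevents)
    return math.sqrt(var)

def events_to_detect(c0, d, prevalence, power=0.80, alpha=0.05):
    """Events needed so a two-sided test of H0: C = C0 detects C1 = C0 + d with
    the requested power: require (z_{1-alpha/2} + z_power) * SE <= d."""
    c1 = c0 + d
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n = 100
    while z * hanley_mcneil_se(c1, n * prevalence, n * (1 - prevalence)) > d:
        n += 1
    return math.ceil(n * prevalence)

print(events_to_detect(c0=0.72, d=0.05, prevalence=0.1))
```
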


Source: PubMed
