Adequate sample size for developing prediction models is not simply related to events per variable

Emmanuel O Ogundimu, Douglas G Altman, Gary S Collins

Abstract

Objectives: The choice of an adequate sample size for a Cox regression analysis is generally based on a rule of thumb, derived from simulation studies, of a minimum of 10 events per variable (EPV). One simulation study suggested scenarios in which the 10 EPV rule can be relaxed. The effect of a range of binary predictors with varying prevalence, reflecting clinical practice, has not yet been fully investigated.
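
As a rough illustration of the rule of thumb, EPV can be read as the number of outcome events divided by the number of candidate predictor parameters. The sketch below is a minimal Python example of that arithmetic; the function names and the numbers (120 events, 15 parameters) are hypothetical and are not taken from the study.

```python
import math


def events_per_variable(n_events: int, n_parameters: int) -> float:
    """EPV = number of outcome events / number of candidate predictor parameters."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def min_events_required(n_parameters: int, target_epv: float = 10.0) -> int:
    """Smallest number of events that reaches the target EPV."""
    return math.ceil(target_epv * n_parameters)


# Hypothetical example: 120 observed events, 15 candidate parameters.
print(events_per_variable(120, 15))   # 8.0, below the conventional 10 EPV
print(min_events_required(15))        # 150 events needed for EPV >= 10
print(min_events_required(15, 20))    # 300 events needed for EPV >= 20
```

Under these assumed numbers, moving from an EPV of 10 to an EPV of 20 doubles the number of events required for the same set of predictors.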

Study design and setting: We conducted an extended resampling study using a large general-practice data set, comprising over 2 million anonymized patient records, to examine the EPV requirements for prediction models with low-prevalence binary predictors developed using Cox regression. The performance of the models was then evaluated using an independent external validation data set. We investigated both fully specified models and models derived using variable selection.

Results: Our results indicated that an EPV rule of thumb should be data driven and that EPV ≥ 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model.

Conclusion: Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy.

Keywords: Cox model; Events per variable; External validation; Predictive modeling; Resampling study; Sample size.

Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.

Figures

Fig. 1
Number of events per variable and average percent relative bias for the variables in the data set.
Fig. 2
Ratio of model variance to sample variance for the variables in the data set.
Fig. 3
Proportion of simulations in which the 95% confidence interval about the simulated regression coefficient includes the “true” value for the variables in the data set.
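
The three quantities named in the figure captions can be computed from repeated coefficient estimates roughly as sketched below. This is a minimal Python sketch using conventional definitions from the simulation literature; the abstract does not state the exact formulas the authors used, so the definitions, the function names, and the toy numbers are all assumptions.

```python
import statistics


def percent_relative_bias(estimates, true_value):
    """100 * (mean estimate - true value) / true value (assumed definition)."""
    return 100.0 * (statistics.fmean(estimates) - true_value) / true_value


def variance_ratio(model_variances, estimates):
    """Mean model-based variance divided by the empirical (sample) variance
    of the estimates across resamples (assumed definition)."""
    return statistics.fmean(model_variances) / statistics.variance(estimates)


def coverage(estimates, std_errors, true_value, z=1.96):
    """Proportion of Wald 95% intervals that contain the true value."""
    hits = sum(
        1
        for b, se in zip(estimates, std_errors)
        if b - z * se <= true_value <= b + z * se
    )
    return hits / len(estimates)


# Toy, made-up estimates of one log hazard ratio from four resamples.
estimates = [0.4, 0.5, 0.6, 0.7]
std_errors = [0.1, 0.1, 0.1, 0.1]
true_value = 0.5

print(percent_relative_bias(estimates, true_value))
print(variance_ratio([se ** 2 for se in std_errors], estimates))
print(coverage(estimates, std_errors, true_value))
```

In this reading, a variance ratio near 1 and a coverage near 0.95 would indicate that the model-based standard errors describe the resampling variability well.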

Source: PubMed
