Sample Size Guidelines for Logistic Regression from Observational Studies with Large Population: Emphasis on the Accuracy Between Statistics and Parameters Based on Real Life Clinical Data

Mohamad Adam Bujang, Nadiah Sa'at, Tg Mohd Ikhwan Tg Abu Bakar Sidik, Lim Chien Joo

Abstract

Background: Different study designs and population sizes may require different sample sizes for logistic regression. This study aims to propose sample size guidelines for logistic regression based on observational studies with large populations.

Methods: We estimated the minimum required sample size by using real clinical data to evaluate how closely the derived statistics matched the actual population parameters. The Nagelkerke r-squared and the coefficients derived from the samples were compared with their respective parameters.
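
As an illustration of the statistic being compared, the following is a minimal sketch of the standard Nagelkerke r-squared (the Cox and Snell r-squared rescaled by its maximum attainable value). It assumes the binary outcomes and the fitted probabilities are already available; the function names and the toy data are hypothetical and are not taken from the study.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y (0/1) under probabilities p.

    Probabilities of exactly 0 or 1 are not handled (log of zero).
    """
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def nagelkerke_r2(y, p):
    """Nagelkerke r-squared from outcomes y and fitted probabilities p."""
    n = len(y)
    base_rate = sum(y) / n
    if base_rate in (0.0, 1.0):
        raise ValueError("both outcome classes must be present")
    # Null model: every subject gets the overall event proportion.
    ll_null = log_likelihood(y, [base_rate] * n)
    ll_model = log_likelihood(y, p)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell

# Hypothetical toy data, for illustration only.
y = [0, 0, 1, 1]
print(nagelkerke_r2(y, [0.5, 0.5, 0.5, 0.5]))      # no better than the null model
print(nagelkerke_r2(y, [0.01, 0.01, 0.99, 0.99]))  # close to a perfect fit
```

Under this sketch, the same function would be applied once to the model fitted on the whole population and once to the model fitted on each sample, and the two values compared.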

Results: With a minimum sample size of 500, the differences between the sample estimates and the population parameters were sufficiently small. Based on an audit of a medium-sized population, the differences were within ± 0.5 for coefficients and ± 0.02 for Nagelkerke r-squared. For a large population, the differences were within ± 1.0 for coefficients and ± 0.02 for Nagelkerke r-squared.
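
The tolerance check described above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the coefficient values are made up for the example rather than taken from the study.

```python
def max_abs_difference(sample_values, population_values):
    """Largest absolute gap between paired sample and population values."""
    if len(sample_values) != len(population_values):
        raise ValueError("both sequences must have the same length")
    return max(abs(s - p) for s, p in zip(sample_values, population_values))

def within_tolerance(sample_values, population_values, tolerance):
    """True if every sample estimate lies within +/- tolerance of its parameter."""
    return max_abs_difference(sample_values, population_values) <= tolerance

# Hypothetical coefficients, for illustration only.
population_coefs = [0.80, -1.20, 0.35]
sample_coefs = [1.10, -0.90, 0.50]

print(within_tolerance(sample_coefs, population_coefs, tolerance=0.5))   # True
print(within_tolerance(sample_coefs, population_coefs, tolerance=0.25))  # False
```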

Conclusions: For observational studies with a large population that use logistic regression in the analysis, a minimum sample size of 500 is necessary to derive statistics that represent the parameters. The other recommended rules of thumb are an events per variable (EPV) of 50 and the formula n = 100 + 50i, where i is the number of independent variables in the final model.
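
The three rules of thumb can be worked through numerically, as in the minimal sketch below. The function name and the `event_rate` input are assumptions introduced for illustration. In particular, converting an EPV of 50 into a total sample size by dividing the required number of events by an assumed event proportion is a simplification made here, not a step stated in the abstract.

```python
import math

def sample_size_rules(n_predictors, event_rate):
    """Sample size suggested by each rule of thumb.

    n_predictors: number of independent variables in the final model (i).
    event_rate:   assumed proportion of subjects with the outcome event;
                  a hypothetical input used only to turn EPV into a total n.
    """
    if n_predictors < 1:
        raise ValueError("n_predictors must be at least 1")
    if not 0.0 < event_rate < 1.0:
        raise ValueError("event_rate must lie strictly between 0 and 1")

    events_needed = 50 * n_predictors  # EPV of 50
    return {
        "floor": 500,                                    # minimum of 500
        "formula": 100 + 50 * n_predictors,              # n = 100 + 50i
        "epv50": math.ceil(events_needed / event_rate),  # assumed conversion
    }

# Example: 4 independent variables, 25% of subjects assumed to have the event.
print(sample_size_rules(n_predictors=4, event_rate=0.25))
# {'floor': 500, 'formula': 300, 'epv50': 800}
```

The sketch reports the three values separately rather than combining them, since the abstract does not say how the rules should be reconciled when they disagree.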

Keywords: logistic regression; observational studies; sample size.

Conflict of interest statement

All authors declare no conflict of interest.

Figures

Figure 1: The comparison of differences of coefficients between results derived from parameters and statistics based on various sample sizes
Figure 2: The comparison of differences of Nagelkerke r-squared between results derived from parameters and statistics based on various sample sizes
Figure 3: The comparison of differences of coefficients between results derived from parameters and statistics based on various sample sizes tested with larger sample

Source: PubMed
