Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research

Riko Kelter

Abstract

Background: Null hypothesis significance testing (NHST) is among the most frequently employed methods in the biomedical sciences. However, the problems of NHST and p-values have been discussed widely, and various Bayesian alternatives have been proposed. Some of these proposals focus on equivalence testing, which tests an interval hypothesis instead of a precise (point) hypothesis. An interval hypothesis comprises a small range of parameter values instead of a single null value, an idea that goes back to Hodges and Lehmann. Because researchers can always expect to observe some (although often negligibly small) effect size, interval hypotheses are more realistic for biomedical research. However, the selection of the equivalence region (the interval boundaries) often seems arbitrary, and several Bayesian approaches to equivalence testing coexist.

Methods: A new proposal is made for determining the equivalence region of Bayesian equivalence tests based on objective criteria such as the type I error rate and power. Existing approaches to Bayesian equivalence testing in the two-sample setting are discussed, with a focus on the Bayes factor and the region of practical equivalence (ROPE). A simulation study derives the results necessary to apply the new method in the two-sample setting, which is among the most frequently used designs in biomedical research.
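To illustrate the kind of simulation study described above, the sketch below estimates the type I error rate of a ROPE-based equivalence decision in the two-sample setting with the default equivalence region R=[−0.1,0.1]. This is a minimal sketch and not the paper's implementation: the normal approximation to the posterior of the standardized effect size and the 95% credible-interval decision rule are simplifying assumptions made here for brevity.

```python
# Minimal sketch: Monte Carlo estimate of the type I error rate of a
# ROPE-based Bayesian equivalence test for two groups (true effect size 0,
# i.e. inside the equivalence region). The approximations are assumptions
# of this illustration, not the paper's exact models.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rope = (-0.1, 0.1)      # default equivalence region R = [-0.1, 0.1]
n, n_sim = 100, 2000    # per-group sample size, number of simulated datasets
rejections = 0

for _ in range(n_sim):
    # data generated under the interval null: both groups share the same mean
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(0.0, 1.0, n)
    # approximate posterior of the standardized effect size delta:
    # normal around the observed Cohen's d (simplifying assumption)
    s_pooled = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    d_obs = (x.mean() - y.mean()) / s_pooled
    post_sd = np.sqrt(1 / n + 1 / n)
    lo, hi = stats.norm.interval(0.95, loc=d_obs, scale=post_sd)
    # ROPE decision rule: reject equivalence only if the 95% credible
    # interval lies entirely outside the equivalence region
    if hi < rope[0] or lo > rope[1]:
        rejections += 1

print(f"estimated type I error rate: {rejections / n_sim:.3f}")
```

Replacing the zero true effect with a small, medium or large effect size (for example d = 0.2, 0.5 or 0.8) turns the same loop into a power estimate, mirroring the setup of the power analyses reported below.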

Results: Bayesian Hodges-Lehmann tests for statistical equivalence differ in their sensitivity to the prior modeling, their power, and their associated type I error rates. The relationship between type I error rates, power and sample size for the existing Bayesian equivalence tests is identified in the two-sample setting. The results allow the equivalence region to be determined with the new method by incorporating such objective criteria. Importantly, the results show not only that prior selection influences the type I error rate and power, but also that this relationship is reversed between the Bayes factor based and the ROPE based equivalence tests.
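As a complement to the ROPE sketch above, the following example shows one way an interval-null Bayes factor BF01 for H0: δ ∈ [−0.1, 0.1] against its complement can be computed, namely from the prior and posterior odds of the equivalence region under an encompassing prior. The JZS-style Cauchy prior on the standardized effect size and the noncentral-t likelihood of the observed t statistic are assumptions of this illustration and may differ from the exact specifications used in the paper.

```python
# Minimal sketch: interval-null Bayes factor BF01 for H0: delta in rope
# versus its complement, via prior and posterior odds of the interval
# under an encompassing Cauchy prior (illustrative assumptions only).
import numpy as np
from scipy import stats
from scipy.integrate import quad

def interval_bf01(t, n1, n2, rope=(-0.1, 0.1), r=1 / np.sqrt(2)):
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)

    def unnorm_post(delta):
        # likelihood of the observed t given delta, times the Cauchy prior
        return stats.nct.pdf(t, df, delta * np.sqrt(n_eff)) * stats.cauchy.pdf(delta, 0, r)

    total, _ = quad(unnorm_post, -np.inf, np.inf)
    inside, _ = quad(unnorm_post, rope[0], rope[1])
    posterior_odds = inside / (total - inside)

    prior_inside = stats.cauchy.cdf(rope[1], 0, r) - stats.cauchy.cdf(rope[0], 0, r)
    prior_odds = prior_inside / (1 - prior_inside)
    return posterior_odds / prior_odds  # BF01 > 1 favours equivalence

# hypothetical example: t = 0.3 observed with 50 observations per group
print(interval_bf01(0.3, 50, 50))
```

Thresholding such a Bayes factor (for example, requiring BF01 > 3 before declaring equivalence) gives a decision rule whose type I error rate and power can be estimated by simulation in the same way as for the ROPE, which makes the two families of tests directly comparable.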

Conclusion: Based on these results, researchers can choose among the existing Bayesian Hodges-Lehmann tests for statistical equivalence and determine the equivalence region according to objective criteria, thereby improving the reproducibility of biomedical research.

Keywords: Bayes factor; Bayesian Biostatistics; Bayesian equivalence testing; Bayesian testing; Region of practical equivalence (ROPE); Student’s t-test.

Conflict of interest statement

The author declares that he has no competing interests.

© 2021. The Author(s).

Figures

Fig. 1 Influence of sample size n on the type I error rate attained by Bayesian equivalence approaches based on the Bayes factor (left) and the ROPE (right); the default equivalence region R=[−0.1,0.1] is used in all settings
Fig. 2 Power analysis for the Bayesian equivalence testing approaches based on the Bayes factor for small, medium and large effect sizes
Fig. 3 Power analysis for the Bayesian equivalence testing approaches based on the ROPE for an underlying small effect size
Fig. 4 Power analysis for the Bayesian equivalence testing approaches based on the ROPE for an underlying medium effect size
Fig. 5 Power analysis for the Bayesian equivalence testing approaches based on the ROPE for an underlying large effect size
Fig. 6 Influence of the equivalence region on the type I error rates for the Bayesian equivalence testing approaches based on the Bayes factor
Fig. 7 Influence of the equivalence region on the type I error rates for the Bayesian equivalence testing approaches based on the ROPE
Fig. 8 Total error rates for the Bayesian equivalence testing approaches based on the Bayes factor for small, medium and large effect size
Fig. 9 Total error rates for the Bayesian equivalence testing approaches based on the ROPE for an underlying large effect size
Fig. 10 Total error rates for the Bayesian equivalence testing approaches based on the ROPE for an underlying medium effect size
Fig. 11 Total error rates for the Bayesian equivalence testing approaches based on the ROPE for an underlying small effect size
Fig. 12 Differences between precise frequentist hypothesis testing and equivalence testing: Standard NHST for a sharp point null hypothesis H0:θ=0 against its alternative H1:θ≠0 (top left) or, in general, of H0:θ=θ0 against H1:θ≠θ0 (top right); TOST procedure for testing H0:θ<δL or θ>δU (shown in red) against H1:δL≤θ≤δU (shown in blue) (bottom left) or H0:θ<θ0−δL or θ>θ0+δU (shown in red) against H1:θ0−δL≤θ≤θ0+δU (shown in blue) (bottom right)
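
For comparison with the Bayesian approaches, the frequentist TOST procedure sketched in Fig. 12 can be written down compactly: equivalence is declared when both one-sided tests against the boundaries δL and δU reject at level α, which is equivalent to the 100(1−2α)% confidence interval for the difference lying inside [δL, δU]. The helper below is a hypothetical illustration (function name, boundaries and data are invented here), not code from the paper.

```python
# Minimal sketch of the two one-sided tests (TOST) procedure for two
# independent samples; equivalence bounds are on the raw mean difference.
import numpy as np
from scipy import stats

def tost_two_sample(x, y, delta_l, delta_u, alpha=0.05):
    n1, n2 = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    s2 = ((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(s2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    # one-sided test against the lower bound (H0: diff <= delta_l)
    p_lower = 1 - stats.t.cdf((diff - delta_l) / se, df)
    # one-sided test against the upper bound (H0: diff >= delta_u)
    p_upper = stats.t.cdf((diff - delta_u) / se, df)
    # equivalence is declared only if both one-sided tests reject
    return max(p_lower, p_upper) < alpha

rng = np.random.default_rng(0)
print(tost_two_sample(rng.normal(0, 1, 80), rng.normal(0, 1, 80), -0.3, 0.3))
```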
