Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies

Peter C Austin

Abstract

In a study comparing the effects of two treatments, the propensity score is the probability of assignment to one treatment conditional on a subject's measured baseline covariates. Propensity-score matching is increasingly being used to estimate the effects of exposures using observational data. In the most common implementation of propensity-score matching, pairs of treated and untreated subjects are formed whose propensity scores differ by at most a pre-specified amount (the caliper width). There has been little research into the optimal caliper width. We conducted an extensive series of Monte Carlo simulations to determine the optimal caliper width for estimating differences in means (for continuous outcomes) and risk differences (for binary outcomes). When estimating differences in means or risk differences, we recommend that researchers match on the logit of the propensity score using calipers of width equal to 0.2 of the standard deviation of the logit of the propensity score. When at least some of the covariates were continuous, either this value or one close to it minimized the mean square error of the resultant estimated treatment effect. It also eliminated at least 98% of the bias in the crude estimator, and it resulted in confidence intervals with approximately the correct coverage rates. Furthermore, the empirical type I error rate was approximately correct. When all of the covariates were binary, the choice of caliper width had a much smaller impact on the performance of estimation of risk differences and differences in means.
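
As an illustration of the recommended caliper, the following minimal Python sketch forms matched pairs on the logit of the propensity score using a caliper of 0.2 standard deviations. It is a sketch under stated assumptions, not the procedure used in the simulations: the function name `caliper_match`, the example propensity scores, the greedy 1:1 nearest-neighbour matching without replacement, and the use of the standard deviation of the logit over the whole sample are all hypothetical choices made for illustration.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def caliper_match(treated_ps, control_ps, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbour caliper matching without replacement.

    treated_ps, control_ps: estimated propensity scores (hypothetical inputs).
    Returns (pairs, width), where pairs is a list of
    (treated_index, control_index) tuples and width is the caliper
    on the logit scale.
    """
    t_logit = [logit(p) for p in treated_ps]
    c_logit = [logit(p) for p in control_ps]

    # Caliper width = caliper_sd * SD of the logit of the propensity score.
    # Assumption: the SD is computed over treated and untreated subjects together.
    width = caliper_sd * statistics.stdev(t_logit + c_logit)

    available = set(range(len(c_logit)))  # untreated subjects not yet matched
    pairs = []
    for i, t in enumerate(t_logit):
        if not available:
            break
        # Closest remaining untreated subject; ties broken by lowest index.
        j = min(available, key=lambda k: (abs(c_logit[k] - t), k))
        # Accept the pair only if it falls within the caliper.
        if abs(c_logit[j] - t) <= width:
            pairs.append((i, j))
            available.remove(j)
    return pairs, width


# Hypothetical propensity scores for three treated and four untreated subjects.
treated = [0.70, 0.55, 0.30]
control = [0.68, 0.50, 0.10, 0.05]
pairs, width = caliper_match(treated, control)
print(pairs)   # [(0, 0), (1, 1)]
```

In this toy example the third treated subject is left unmatched because no remaining untreated subject lies within the caliper. This is the trade-off the caliper width controls: narrower calipers give closer matches but discard more treated subjects.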

Copyright © 2010 John Wiley & Sons, Ltd.

Figures

Figure 1. Caliper width and reduction in bias: risk differences.
Figure 2. Caliper width and MSE: risk differences.
Figure 3. Caliper width and coverage of 95% confidence intervals: risk differences.
Figure 4. Caliper width and type I error rates.
Figure 5. Caliper width and reduction in bias: difference in means.
Figure 6. Caliper width and MSE: difference in means.
Figure 7. Caliper width and coverage of 95% confidence intervals: difference in means.
Figure 8. Relationship between caliper width and estimated treatment effect in case study.

Source: PubMed
