Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples

Peter C Austin

Abstract

The propensity score is a subject's probability of treatment, conditional on observed baseline covariates. Conditional on the true propensity score, treated and untreated subjects have similar distributions of observed baseline covariates. Propensity-score matching is a popular way of using the propensity score in the medical literature. With this approach, matched sets of treated and untreated subjects with similar values of the propensity score are formed. Inferences about treatment effect made using propensity-score matching are valid only if, in the matched sample, treated and untreated subjects have similar distributions of measured baseline covariates. In this paper we discuss the following methods for assessing whether the propensity-score model has been correctly specified: comparing means and prevalences of baseline characteristics using standardized differences; ratios comparing the variance of continuous covariates between treated and untreated subjects; comparison of higher-order moments and interactions; five-number summaries; and graphical methods such as quantile-quantile plots, side-by-side boxplots, and non-parametric density plots for comparing the distribution of baseline covariates between treatment groups. We describe methods to determine the sampling distribution of the standardized difference when the true standardized difference is equal to zero, thereby allowing one to determine the range of standardized differences that are plausible when the propensity-score model has been correctly specified. We highlight the limitations of some previously used methods for assessing the adequacy of the specification of the propensity-score model. In particular, methods based on comparing the distribution of the estimated propensity score between treated and untreated subjects are uninformative.
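
The abstract names standardized differences and variance ratios without giving formulas. As an illustration only, the following Python sketch uses the commonly cited pooled-standard-deviation form of the standardized difference; that this is the exact definition used in the paper is an assumption, and the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def std_diff_continuous(treated, untreated):
    """Standardized difference of means for a continuous covariate.

    Assumed form: (mean_t - mean_c) / sqrt((s_t^2 + s_c^2) / 2),
    where s^2 is the sample variance within each group.
    """
    pooled_sd = sqrt((variance(treated) + variance(untreated)) / 2)
    return (mean(treated) - mean(untreated)) / pooled_sd


def std_diff_binary(treated, untreated):
    """Standardized difference of prevalences for a 0/1 covariate.

    Assumed form: (p_t - p_c) / sqrt((p_t(1 - p_t) + p_c(1 - p_c)) / 2).
    Undefined (division by zero) if both prevalences are exactly 0 or 1.
    """
    p_t, p_c = mean(treated), mean(untreated)
    pooled_sd = sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
    return (p_t - p_c) / pooled_sd


def variance_ratio(treated, untreated):
    """Ratio of sample variances (treated over untreated); 1 means equal spread."""
    return variance(treated) / variance(untreated)


# Hypothetical matched sample: age (continuous) and diabetes (0/1).
age_treated = [62, 70, 58, 66, 74]
age_untreated = [60, 68, 59, 65, 73]
diabetes_treated = [1, 0, 1, 1, 0]
diabetes_untreated = [1, 0, 0, 1, 0]

print("age, standardized difference:", std_diff_continuous(age_treated, age_untreated))
print("age, variance ratio:", variance_ratio(age_treated, age_untreated))
print("diabetes, standardized difference:", std_diff_binary(diabetes_treated, diabetes_untreated))
```

Because the standardized difference is expressed in units of the pooled standard deviation, it can be compared across covariates measured on different scales, which is why it suits a balance table that mixes continuous and binary variables.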

Figures

Figure 1. Absolute standardized differences for baseline covariates comparing treated to untreated subjects in the original and the matched sample.
Figure 2. Relationship between sample size and the standard deviation of empirical sampling distribution of standardized difference.
Figure 3. Side-by-side boxplots and Q–Q plots for age.
Figure 4. Density plots and cumulative distribution functions for age.
Figure 5. Distribution of estimated propensity score in treated and untreated subjects in different matched samples.
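
The idea behind an empirical sampling distribution of the standardized difference when the true difference is zero can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes two independent groups of equal size drawn from the same standard normal distribution, whereas matched samples are not independent. The function names, group sizes, and number of replications are all hypothetical.

```python
import random
from math import sqrt
from statistics import stdev


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of numbers."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def std_diff(x, y):
    """Standardized difference using the pooled-SD form (an assumed definition)."""
    mean_x, var_x = mean_and_variance(x)
    mean_y, var_y = mean_and_variance(y)
    return (mean_x - mean_y) / sqrt((var_x + var_y) / 2)


def null_sd(n_per_group, n_sims=500, seed=0):
    """SD of simulated standardized differences when the true difference is 0."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        # Both groups come from the same distribution, so the true difference is 0.
        x = [rng.gauss(0, 1) for _ in range(n_per_group)]
        y = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diffs.append(std_diff(x, y))
    return stdev(diffs)


# Spread of the null distribution for increasing group sizes.
sds = {n: null_sd(n) for n in (25, 100, 400)}
for n, sd in sds.items():
    print(f"n per group = {n}: simulated SD = {sd:.3f}, sqrt(2/n) = {sqrt(2 / n):.3f}")
```

Under these assumptions the spread shrinks roughly like sqrt(2/n) as the group size n grows, so a standardized difference that would be unremarkable in a small matched sample can be implausibly large in a big one.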


Source: PubMed
