Improving the Use of Mortality Data in Public Health: A Comparison of Garbage Code Redistribution Models

Ta-Chou Ng, Wei-Cheng Lo, Chu-Chang Ku, Tsung-Hsueh Lu, Hsien-Ho Lin

Abstract

Objectives. To describe and compare 3 garbage code (GC) redistribution models: naïve Bayes classifier (NB), coarsened exact matching (CEM), and multinomial logistic regression (MLR).

Methods. We analyzed Taiwan Vital Registration data (2008–2016) using a 2-step approach. First, we used non-GC death records to evaluate 3 different prediction models (NB, CEM, and MLR), incorporating individual-level information on multiple causes of death (MCDs) and demographic characteristics. Second, we applied the best-performing model to GC death records to predict the underlying causes of death. We conducted additional simulation analyses for evaluating the predictive performance of models.

Results. When we did not account for MCDs, all 3 models presented high average misclassification rates in GC assignment (NB, 81%; CEM, 86%; MLR, 81%). In the presence of MCD information, NB and MLR exhibited significant improvement in assignment accuracy (19% and 17% misclassification rate, respectively). Furthermore, CEM without a variable selection procedure resulted in a substantially higher misclassification rate (40%).

Conclusions. Comparing potential GC redistribution approaches provides guidance for obtaining better estimates of cause-of-death distribution and highlights the significance of MCD information for vital registration system reform.
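The contrast between models fitted with and without MCD information can be illustrated with a small sketch. The code below is not the authors' implementation and does not use the Taiwan data: it fits a categorical naïve Bayes classifier with Laplace smoothing to synthetic records and reports the misclassification rate on held-out records. The cause labels, covariate levels, association strengths, and all function names are assumptions made only for illustration.

```python
import math
import random
from collections import Counter, defaultdict

CAUSES = ["cause_A", "cause_B", "cause_C"]  # hypothetical underlying causes


def simulate(n, rng):
    """Draw synthetic (age_group, sex, mcd_code) records with known causes."""
    records, labels = [], []
    for _ in range(n):
        k = rng.randrange(len(CAUSES))
        # With probability 0.4 the age group matches the cause index,
        # otherwise it is drawn at random (a weak signal). Sex is pure noise.
        age = k if rng.random() < 0.4 else rng.randrange(3)
        sex = rng.choice(["F", "M"])
        # With probability 0.9 the MCD code matches the cause index,
        # otherwise it is drawn at random (a strong signal).
        mcd = k if rng.random() < 0.9 else rng.randrange(len(CAUSES))
        records.append((f"age{age}", sex, f"mcd{mcd}"))
        labels.append(CAUSES[k])
    return records, labels


def fit_naive_bayes(records, labels, alpha=1.0):
    """Count class and per-class feature frequencies; alpha is the smoothing term."""
    class_counts = Counter(labels)
    n_features = len(records[0])
    feature_counts = [defaultdict(Counter) for _ in range(n_features)]
    levels = [set() for _ in range(n_features)]
    for x, y in zip(records, labels):
        for j, value in enumerate(x):
            feature_counts[j][y][value] += 1
            levels[j].add(value)
    return class_counts, feature_counts, levels, alpha


def predict(model, x):
    """Return the class with the highest log posterior for record x."""
    class_counts, feature_counts, levels, alpha = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for j, value in enumerate(x):
            num = feature_counts[j][c][value] + alpha
            den = n_c + alpha * len(levels[j])
            score += math.log(num / den)
        if score > best_score:
            best, best_score = c, score
    return best


def misclassification_rate(model, records, labels):
    """Share of records whose predicted cause differs from the true cause."""
    wrong = sum(predict(model, x) != y for x, y in zip(records, labels))
    return wrong / len(labels)


def drop_mcd(records):
    """Keep only the demographic covariates (age group and sex)."""
    return [r[:2] for r in records]


rng = random.Random(0)
train_x, train_y = simulate(2000, rng)
test_x, test_y = simulate(1000, rng)

model_without = fit_naive_bayes(drop_mcd(train_x), train_y)
model_with = fit_naive_bayes(train_x, train_y)

mr_without = misclassification_rate(model_without, drop_mcd(test_x), test_y)
mr_with = misclassification_rate(model_with, test_x, test_y)
print(f"MR without MCD: {mr_without:.2f}; MR with MCD: {mr_with:.2f}")
```

Because the synthetic MCD code is built to be far more informative than the demographic covariates, the second rate should come out lower than the first. The size of that gap reflects the simulation settings, not the 81% and 19% reported for NB above.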

Figures

FIGURE 1—
Model Performance for Multinomial Logistic Regression (MLR), Coarsened Exact Matching (CEM), and Naïve Bayes Classifier (NB), (a) Without and (b) With Multiple Causes of Death Data: Taiwan, 2008–2016
Note. The color bands represent 95% confidence intervals. BSE = backward sequential elimination; BSEJ = backward sequential elimination and joining.
FIGURE 2—
Misclassification Rate (MR) of Multinomial Logistic Regression (MLR), Coarsened Exact Matching (CEM), and Naïve Bayes Classifier (NB) by Relative Sample Size (log10N/K) Under Different Scenarios: Taiwan, 2008–2016
Note. The 9 scenarios are generated from the combination of proportion of missing data (PNA, valued 0%, 5%, or 10%) and number of redundant covariates (r, valued 0, 2, or 4). Each scenario contains 1000 iterations, where there are K = 20 target underlying-cause-of-death groups, 2 effective covariates, and a set of multiple causes of death.

Source: PubMed
