Measures of agreement between many raters for ordinal classifications

Kerrie P Nelson, Don Edwards

Abstract

Screening and diagnostic procedures often require a physician's subjective interpretation of a patient's test result using an ordered categorical scale to define the patient's disease severity. Because of the wide variability observed between physicians' ratings, many large-scale studies have been conducted to quantify agreement between multiple experts' ordinal classifications in common diagnostic procedures such as mammography. However, very few statistical approaches are available to assess agreement in these large-scale settings. Many existing summary measures of agreement rely on extensions of Cohen's kappa; these are prone to prevalence and marginal distribution issues, become increasingly complex for more than three experts, or are not easily implemented. Here we propose a model-based approach to assessing agreement in large-scale studies built on a framework of ordinal generalized linear mixed models. A summary measure of agreement is proposed for multiple experts assessing the same sample of patients' test results according to an ordered categorical scale. This measure avoids some of the key flaws associated with Cohen's kappa and its extensions. Simulation studies demonstrate the validity of the approach in comparison with commonly used agreement measures. The proposed methods are easily implemented using the software package R and are applied to two large-scale cancer agreement studies.
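The abstract does not give the estimator itself, but the modeling framework it describes can be sketched in R. Below is a minimal, hypothetical example assuming a data frame named ratings with columns score (an ordered factor), patient, and rater; it fits a cumulative probit mixed model with crossed random effects for patients and raters using the clmm function from the ordinal package, then computes an illustrative latent intraclass correlation. This is a sketch under those assumptions, not the paper's definition of the proposed agreement measure, which appears only in the full text.

## A minimal sketch, not the authors' exact estimator: fit a cumulative
## probit mixed model with crossed random effects for patients (u) and
## raters (v), then inspect the variance components.  The data frame
## `ratings`, with columns `score` (ordered factor), `patient`, and
## `rater`, is hypothetical.
library(ordinal)

fit <- clmm(score ~ 1 + (1 | patient) + (1 | rater),
            data = ratings, link = "probit")
summary(fit)  # reports the estimated variances sigma^2_u and sigma^2_v

## Illustrative latent intraclass correlation under the probit link
## (latent residual variance fixed at 1); shown only as an example of a
## variance-component-based summary, not the paper's kappa_m.
vc       <- VarCorr(fit)
sigma2_u <- as.numeric(vc$patient)
sigma2_v <- as.numeric(vc$rater)
rho <- sigma2_u / (sigma2_u + sigma2_v + 1)
rho

With many raters and patients, fitting a crossed random-effects ordinal model in this way yields the variance components on which a model-based agreement summary can be built.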

Keywords: Cohen's kappa; Fleiss' kappa; generalized linear mixed model; inter-rater agreement; ordinal categorical data.

Copyright © 2015 John Wiley & Sons, Ltd.

Figures

Figure 1. Plots of agreement measures, the proposed κ_m and Fleiss' κ_F, versus ρ for varying prevalence (extreme low or high, moderate, and equal in each category; the percentage of observations falling into each of the categories C_i, i = 1, ..., 5, in each prevalence case is presented in Table 2), with σ²_v set to 1 and σ²_u increasing in value.

Figure 2. Model-based observed agreement p_0 for probit and logistic ordinal GLMMs versus increasing ρ and varying prevalence of disease, for an ordinal classification scale with five categories (C = 5).

Source: PubMed
