Interrater reliability of sleep stage scoring: a meta-analysis

Yun Ji Lee, Jae Yong Lee, Jae Hoon Cho, Ji Ho Choi

Abstract

Study objectives: We evaluated the interrater reliability of manual polysomnography sleep stage scoring. We included all studies that employed Rechtschaffen and Kales rules or American Academy of Sleep Medicine standards. We sought the overall degree of agreement and the degree of agreement for each sleep stage.

Methods: The keywords were "Polysomnography (PSG)," "sleep staging," "Rechtschaffen and Kales (R&K)," "American Academy of Sleep Medicine (AASM)," "interrater (interscorer) reliability," and "Cohen's kappa." We searched PubMed, OVID Medline, EMBASE, the Cochrane library, KoreaMed, KISS, and the MedRIC. The exclusion criteria included automatic scoring and pediatric patients. We collected data on scorer histories, scoring rules, numbers of epochs scored, and the underlying diseases of the patients.

Results: A total of 101 publications were retrieved; 11 satisfied the selection criteria. The Cohen's kappa for manual, overall sleep scoring was 0.76, indicating substantial agreement (95% confidence interval, 0.71-0.81; P < .001). By sleep stage, the figures were 0.70, 0.24, 0.57, 0.57, and 0.69 for the W, N1, N2, N3, and R stages, respectively. The interrater reliabilities for stage N2 and N3 sleep were moderate, and that for stage N1 sleep was only fair.

Conclusions: We conducted a meta-analysis to generalize the variation in manual scoring of polysomnography and provide reference data for automatic sleep stage scoring systems. The reliability of manual scorers of polysomnography sleep stages was substantial. However, for certain stages, the results were poor; validity requires improvement.
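The agreement labels used in the Results (substantial, moderate, fair) appear consistent with the benchmark bands of Landis and Koch (1977), which are widely used to interpret κ. As an illustrative sketch only, the reported coefficients can be mapped to those bands as follows; the function name `landis_koch_label` and the dictionary `reported_kappa` are hypothetical names introduced here and are not part of the study.

```python
# Illustrative only: map a kappa value to the Landis and Koch (1977) bands.
# Assumption: the abstract's labels follow these bands; the abstract itself
# does not state which benchmark was used.
def landis_koch_label(kappa: float) -> str:
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Pooled kappa values as reported in the Results paragraph of the abstract.
reported_kappa = {
    "overall": 0.76,
    "W": 0.70,
    "N1": 0.24,
    "N2": 0.57,
    "N3": 0.57,
    "R": 0.69,
}

labels = {name: landis_koch_label(value) for name, value in reported_kappa.items()}
for name, label in labels.items():
    print(f"{name}: {reported_kappa[name]:.2f} -> {label}")
```

Under this assumption, the overall value falls in the substantial band, N2 and N3 in the moderate band, and N1 in the fair band, which matches the wording of the Results.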

Citation: Lee YJ, Lee JY, Cho JH, Choi JH. Interrater reliability of sleep stage scoring: a meta-analysis. J Clin Sleep Med. 2022;18(1):193-202.

Keywords: interrater reliability; meta-analysis; sleep stage scoring.

Conflict of interest statement

All authors have seen and approved the manuscript. Work for this study was performed in the Department of Otorhinolaryngology—Head and Neck Surgery, College of Medicine, Soonchunhyang University, Bucheon Hospital, Bucheon, Korea. This study was funded by the Soonchunhyang University Research Fund. The authors report no conflicts of interest.

© 2022 American Academy of Sleep Medicine.

Figures

Figure 1. Data matrix and formula for calculating the Cohen’s κ.
(A) The data matrix derived when sleep scoring sought to identify 5 sleep stage categories (W, N1, N2, N3, and R). Sij is the number of epochs. (B) The formula used to calculate the κ coefficient. (a) N is the total number of epochs scored. (b) Po is the observed agreement and Pc is the expected agreement. (c, d) Po and Pc are derived using these formulas.
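The calculation described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming that Sij counts the epochs assigned to stage i by one scorer and to stage j by the other, and that Pc is computed from the products of the row and column totals (the standard Cohen formulation). The function name `cohens_kappa` and the example counts are hypothetical and are not data from the included studies.

```python
# Minimal sketch of Cohen's kappa from a square agreement matrix.
# Assumption: matrix[i][j] = number of epochs scored as stage i by scorer A
# and as stage j by scorer B (hypothetical reading of Sij in Figure 1).
STAGES = ["W", "N1", "N2", "N3", "R"]


def cohens_kappa(matrix):
    k = len(matrix)
    if any(len(row) != k for row in matrix):
        raise ValueError("agreement matrix must be square")
    n = sum(sum(row) for row in matrix)  # N: total number of epochs scored
    if n == 0:
        raise ValueError("agreement matrix contains no epochs")

    # Po: observed agreement = share of epochs on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n

    # Pc: expected (chance) agreement from the marginal totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_c = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if p_c == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")

    return (p_o - p_c) / (1.0 - p_c)


# Hypothetical epoch counts for two scorers (rows and columns follow STAGES).
example_matrix = [
    [180, 15, 5, 0, 0],
    [20, 40, 30, 0, 10],
    [5, 25, 400, 30, 10],
    [0, 0, 35, 120, 0],
    [2, 8, 10, 0, 155],
]
example_kappa = cohens_kappa(example_matrix)
print(f"kappa = {example_kappa:.3f}")
```

A stage-specific κ can be obtained from the same function by collapsing the matrix to 2 × 2 (the stage of interest versus all other stages); whether the included studies used that approach is not stated here.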
Figure 2. Flow diagram of study selection.
Figure 3. Forest plot for overall interrater reliability.
CI = confidence interval.
Figure 4. Forest plot for interrater reliability of different sleep stages.
CI = confidence interval.
Figure 5. Funnel plot for overall interrater reliability.
Figure 6. Forest plot and funnel plot for interrater reliabilities by stage.
CI = confidence interval, REM = rapid eye movement.
Figure 7. Data matrix and formula for calculating the ICC.
(A) When the scorers (j = 1, 2, …, k) evaluate the PSG results of the patients (i = 1, 2, …, n), the data matrix can be filled in with target variables xij. Values of the target variables should fall along a continuous scale, such as the AHI and sleep stage (% or minutes). (B) The basic formula used to calculate the ICC in a 2-way random model. (a) Each measurement xij is assumed to be composed of a true component and a measurement error component. The model can be regarded as the sum of 5 terms: μ = mean of the patient's scores, ri = deviation from the mean for patient i, cj = bias of scorer j, rcij = interaction between patient deviation and scorer deviation, and eij = measurement error. (b) The ICC was calculated as a ratio of variance based on the results of an analysis of variance. The total variance is equal to the sum of the variance of interest (true score variance) and the error variance. The ICC is unitless and has a value between 0 and 1; an estimate of 1 indicates perfect reliability and 0 indicates no reliability. AHI = apnea-hypopnea index, ICC = intraclass correlation coefficient, PSG = polysomnography.
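The caption for Figure 7 describes the model but does not reproduce the formula itself. The sketch below therefore assumes the single-rater, absolute-agreement form usually written ICC(2,1) in Shrout and Fleiss (1979), computed from the mean squares of a 2-way analysis of variance without replication (in which the interaction and error terms cannot be separated). The function name `icc_2_1` and the example values are hypothetical and are not data from the included studies.

```python
# Minimal sketch of ICC(2,1) for a 2-way random model (assumed form, see above).
# x[i][j] = measurement for patient i (rows, n) by scorer j (columns, k),
# e.g. percentage of N3 sleep or AHI.
def icc_2_1(x):
    n = len(x)
    k = len(x[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in x):
        raise ValueError("need a rectangular matrix with n >= 2 and k >= 2")

    grand_mean = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(x[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for patients (rows), scorers (columns), and residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((v - grand_mean) ** 2 for row in x for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-patients mean square
    jms = ss_cols / (k - 1)               # between-scorers mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square

    denominator = bms + (k - 1) * ems + k * (jms - ems) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (bms - ems) / denominator


# Hypothetical N3 percentages: 4 patients (rows) scored by 3 scorers (columns).
example_scores = [
    [18.0, 20.5, 17.0],
    [9.5, 12.0, 8.0],
    [25.0, 24.0, 22.5],
    [14.0, 16.5, 13.0],
]
print(f"ICC(2,1) = {icc_2_1(example_scores):.3f}")
```

Note that this estimator can return a negative value when the residual mean square exceeds the between-patients mean square; such estimates are commonly interpreted as indicating no reliability.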

Source: PubMed
