Finding the experts in the crowd: Validity and reliability of crowdsourced measures of children's gradient speech contrasts

Daphna Harel, Elaine Russo Hitchcock, Daniel Szeredi, José Ortiz, Tara McAllister Byun

Abstract

Perceptual ratings aggregated across multiple nonexpert listeners can be used to measure covert contrast in child speech. Online crowdsourcing provides access to a large pool of raters, but for practical purposes, researchers may wish to use smaller samples. The ratings obtained from these smaller samples may not maintain the high levels of validity seen in larger samples. This study aims to measure the validity and reliability of crowdsourced continuous ratings of child speech, obtained through Visual Analog Scaling, and to identify ways to improve these measurements. We first assess overall validity and interrater reliability for measurements obtained from a large set of raters. Second, we investigate two rater-level measures of quality, individual validity and intrarater reliability, and examine the relationship between them. Third, we show that these estimates may be used to establish guidelines for the inclusion of raters, thus impacting the quality of results obtained when smaller samples are used.

Keywords: Child speech ratings; covert contrasts; reliability; validity.

Conflict of interest statement

The authors report no conflicts of interest.

Figures

Figure 1
Screenshot of VAS interface as it appeared to crowdsourced raters.
Figure 2
Correlation across tokens between F3-F2 distance and mean VAS click location, across all raters and presentation cycles. The size of the dot represents the standard deviation across click locations for that token.
Figure 3
Top panel: Correlation between intrarater reliability and individual validity. Bottom panel: Examples of individual patterns of click location relative to F3-F2 distance. Red triangles indicate the mean click location across the first four presentation cycles for each token; black dots are the individual click locations for each presentation of a token.
Figure 4
Boxplots representing the distribution of validity values obtained over 10,000 repeated samples with n = 9 raters, under three different levels of rater quality control.

Source: PubMed
