Online crowdsourcing for efficient rating of speech: a validation study

Tara McAllister Byun, Peter F Halpin, Daniel Szeredi

Abstract

Blinded listener ratings are essential for valid assessment of interventions for speech disorders, but collecting these ratings can be time-intensive and costly. This study evaluated the validity of speech ratings obtained through online crowdsourcing, a potentially more efficient approach. One hundred words produced by children with /r/ misarticulation were presented electronically for binary rating by 35 phonetically trained listeners and 205 naïve listeners recruited through the Amazon Mechanical Turk (AMT) crowdsourcing platform. Bootstrapping was used to compare different-sized samples of AMT listeners against a "gold standard" (the mode across all trained listeners) and an "industry standard" (the mode across bootstrapped samples of three trained listeners). There was strong overall agreement between trained and AMT listeners. The "industry standard" level of performance was matched by bootstrapped samples of n = 9 AMT listeners. These results support the hypothesis that valid ratings of speech data can be obtained efficiently through AMT. Researchers in communication disorders could benefit from increased awareness of this method.
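The bootstrap comparison described above can be illustrated with a small simulation. This is a hedged sketch, not the authors' actual analysis code: the listener data are simulated (each hypothetical AMT listener is assumed to agree with the gold-standard rating 85% of the time, a made-up figure), and ties in small panels are arbitrarily broken toward "correct". The procedure itself mirrors the abstract: repeatedly resample a panel of n AMT listeners, take the modal rating per item, and measure agreement with the gold standard.

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Simulated data (assumption: not the study's real ratings).
# 100 stimulus items; gold standard = mode across all trained listeners.
n_items = 100
gold = [random.randint(0, 1) for _ in range(n_items)]

# 205 simulated AMT listeners, each agreeing with gold ~85% of the time.
amt_pool = [[g if random.random() < 0.85 else 1 - g for g in gold]
            for _ in range(205)]

def majority(votes):
    """Mode of binary votes; ties broken toward 'correct' (an assumption)."""
    return 1 if sum(votes) * 2 >= len(votes) else 0

def bootstrap_agreement(pool, gold, n_listeners, n_runs=500):
    """Proportion of (run, item) pairs where the resampled panel's
    modal rating matches the gold-standard rating."""
    agree = 0
    for _ in range(n_runs):
        panel = random.sample(pool, n_listeners)
        for i, g in enumerate(gold):
            votes = [listener[i] for listener in panel]
            agree += (majority(votes) == g)
    return agree / (n_runs * len(gold))

agree_n3 = bootstrap_agreement(amt_pool, gold, 3)
agree_n9 = bootstrap_agreement(amt_pool, gold, 9)
print(f"n=3 panel agreement: {agree_n3:.3f}")
print(f"n=9 panel agreement: {agree_n9:.3f}")
```

Under these assumptions, larger panels yield higher agreement with the gold standard, which is the pattern the study exploits when it finds that n = 9 AMT listeners match the three-trained-listener "industry standard".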

Learning outcomes: Readers will be able to (a) discuss advantages and disadvantages of data collection through the crowdsourcing platform Amazon Mechanical Turk (AMT), and (b) describe the results of a validity study comparing samples of AMT listeners versus phonetically trained listeners in a speech-rating task.

Keywords: Crowdsourcing; Research methods; Speech perception; Speech rating; Speech sound disorders.

Copyright © 2014 Elsevier Inc. All rights reserved.

Figures

FIGURE 1
Distribution of stimulus items across F3 – F2 values and perceptually correct/incorrect rating categories. Classification as correct/incorrect reflects the mode across binary ratings assigned by three blinded certified clinicians.
FIGURE 2
Percentage of experienced listeners and AMT listeners rating a given stimulus item as correct.
FIGURE 3
Percentage of experienced listeners and AMT listeners rating a given stimulus item as correct, as a function of F3 – F2 distance. Shaded band represents a 95% confidence interval around the best-fit line.
FIGURE 4
Percentage of sampling runs in which the mode across bootstrapped AMT listener samples matched the gold standard rating. Size of circle represents number of listeners included in sample, with n ranging from 3 to 15.
FIGURE 5
Percent agreement with the gold standard as a function of the size of the AMT listener sample in the bootstrap analysis.

Source: PubMed
