Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception

Jeremy I Skipper, Virginie van Wassenhove, Howard C Nusbaum, Steven L Small

Abstract

Observing a speaker's mouth profoundly influences speech perception. For example, listeners perceive an "illusory" "ta" when the video of a face producing /ka/ is dubbed onto an audio /pa/. Here, we show how cortical areas supporting speech production mediate this illusory percept and audiovisual (AV) speech perception more generally. Specifically, cortical activity during AV speech perception occurs in many of the same areas that are active during speech production. We find that different perceptions of the same syllable, and the perception of different syllables, are associated with different distributions of activity in frontal motor areas involved in speech production. Activity patterns in these frontal motor areas resulting from the illusory "ta" percept are more similar to the activity patterns evoked by AV(/ta/) than to those evoked by AV(/pa/) or AV(/ka/). In contrast to the activity in frontal motor areas, stimulus-evoked activity for the illusory "ta" in auditory and somatosensory areas and in visual areas initially resembles activity evoked by AV(/pa/) and AV(/ka/), respectively. Ultimately, though, activity in these regions comes to resemble activity evoked by AV(/ta/). Together, these results suggest that AV speech elicits in the listener a motor plan for the production of the phoneme that the speaker might have been attempting to produce, and that feedback in the form of efference copy from the motor system ultimately influences the phonetic interpretation.
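To make the pattern-similarity logic of the abstract concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the simulated beta estimates, the region-of-interest size, the variable names, and the use of Pearson correlation across voxels are all illustrative assumptions.

```python
# Sketch (assumed data, not the authors' pipeline): compare the voxelwise
# activity pattern evoked by the McGurk stimulus (audio /pa/ + visual /ka/,
# perceived "ta") against the patterns evoked by congruent AV/pa/, AV/ka/,
# and AV/ta/ within a hypothetical frontal motor ROI.
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 500  # hypothetical number of voxels in the ROI

# Hypothetical beta estimates (one value per voxel) for each condition.
beta_mcgurk = rng.standard_normal(n_voxels)
betas_congruent = {
    "AV/pa/": rng.standard_normal(n_voxels),
    "AV/ka/": rng.standard_normal(n_voxels),
    # Simulated so that AV/ta/ shares structure with the McGurk pattern,
    # mirroring the result described in the abstract.
    "AV/ta/": 0.8 * beta_mcgurk + 0.2 * rng.standard_normal(n_voxels),
}

for label, beta in betas_congruent.items():
    r = np.corrcoef(beta_mcgurk, beta)[0, 1]
    print(f"r(McGurk, {label}) = {r:+.2f}")
# On the account above, r(McGurk, AV/ta/) exceeds the other two correlations
# in frontal motor areas.
```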

Conflict of interest statement

Conflict of Interest: None declared.

Figures

Figure 1
Neurally specified model of AV speech perception as presented in the text. A multisensory description, in the form of a hypothesis about the observed talker’s mouth movements and speech sounds (in STp areas), results in the specification (solid lines) of the motor goals of that hypothesis (in the POp, the suggested human homologue of macaque area F5, where mirror neurons have been found). These motor goals are mapped to a motor plan that can be used to reach that goal (in PMv and primary motor cortices [M1]). This results in the prediction, through efference copy (dashed lines), of the auditory and somatosensory states associated with executing those motor commands. Auditory (in STp areas) and somatosensory (in the SMG and primary and secondary somatosensory cortices [SI/SII]) predictions are compared with the current description of the sensory state of the listener. The result is an improvement in speech perception in AV contexts due to a reduction in ambiguity of the intended message of the observed talker.
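The predict-and-compare loop in this model can be expressed as a toy computation. The sketch below is purely conceptual and comes with heavy assumptions: a fixed linear map stands in for the efference-copy forward model, and the motor plans and sensory vectors are fabricated; none of this is from the paper's methods.

```python
# Conceptual sketch (an assumption, not the authors' implementation) of
# Figure 1: each candidate motor plan generates, via efference copy, a
# prediction of its sensory consequences; the plan whose prediction best
# matches the observed AV input disambiguates the percept.
import numpy as np

def forward_model(motor_plan: np.ndarray) -> np.ndarray:
    """Map a motor plan to predicted sensory consequences.

    Placeholder for the efference-copy pathway; a fixed random linear
    map is an illustrative assumption.
    """
    rng = np.random.default_rng(42)  # fixed seed -> same mapping each call
    W = rng.standard_normal((8, 8))
    return W @ motor_plan

# Hypothetical motor plans for three candidate syllables.
rng = np.random.default_rng(1)
plans = {s: rng.standard_normal(8) for s in ("/pa/", "/ta/", "/ka/")}

# Observed input: closest to the consequences of producing /ta/, plus noise.
observed = forward_model(plans["/ta/"]) + 0.1 * rng.standard_normal(8)

# Choose the hypothesis with the smallest sensory prediction error.
errors = {s: np.linalg.norm(observed - forward_model(p))
          for s, p in plans.items()}
print(min(errors, key=errors.get))  # -> "/ta/"
```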
Figure 2
Logical conjunction analyses. Orange indicates regions where activation associated with speaking syllables overlaps with activation associated with passively (A) listening to and watching the same congruent AV syllables; (B) watching only the video of these syllables without the accompanying audio track (V); and (C) listening to the syllables without the accompanying video track (A). Overlap images were created from images each thresholded at P < 0.05 corrected and then logically conjoined. Blue indicates additional regions activated by passive perception alone and not by speech production (P < 0.05 corrected).
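A logical conjunction of this kind reduces to a voxelwise AND of independently thresholded maps. The following Python is illustrative only; the image grid, the simulated statistics, and the threshold value are assumptions standing in for the corrected statistical images.

```python
# Sketch of a logical conjunction as in Figure 2: threshold the production
# map and the perception map separately, then keep voxels surviving in both.
import numpy as np

rng = np.random.default_rng(2)
shape = (64, 64, 32)  # hypothetical image grid

# Hypothetical voxelwise statistics for the two conditions.
t_production = rng.standard_normal(shape)
t_perception = rng.standard_normal(shape)

t_crit = 2.0  # stand-in for the P < 0.05 corrected threshold
production_mask = t_production > t_crit
perception_mask = t_perception > t_crit

overlap = production_mask & perception_mask           # "orange" voxels
perception_only = perception_mask & ~production_mask  # "blue" voxels
print(overlap.sum(), perception_only.sum())
```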
Figure 3
Correlation analyses. Correlation of the distribution of activation associated with passively listening to and watching the incongruent AV syllable made from an audio /pa/ and a visual /ka/ (denoted as ApVk) with the distributions of activation for AV/pa/ (i.e., “ApVk = AV/pa/” in gray), AV/ka/ (i.e., “ApVk = AV/ka/” in blue), or AV/ta/ (i.e., “ApVk = AV/ta/” in orange) in regions that overlap with speech production. The ApVk stimulus elicited the McGurk-MacDonald effect, perceived as “ta” in this group of participants. (A) Correlation analysis collapsed over the entire time course of activation in all frontal, auditory and somatosensory, and occipital regions that overlap with speech production (Friedman test on pairwise correlations, P values < 0.004; Nemenyi post hoc tests on resulting ranks, *P values < 0.002). This analysis was also conducted at each time point following stimulus onset in the frontal and the auditory and somatosensory regions that overlap with speech production (see Experimental Procedures). The entire time course of activation is shown for an example (B) motor region, PMv cortex in the right hemisphere; (C) auditory and somatosensory region, the SMG in the left hemisphere; and (D) visual region, the middle occipital gyrus in the right hemisphere (P values < 0.05).
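The group statistics named here (a Friedman test on pairwise correlations followed by Nemenyi post hoc tests on the ranks) can be sketched as below. The per-subject correlation values and group size are fabricated for illustration, and the Nemenyi step uses the third-party scikit-posthocs package rather than anything specified in the paper.

```python
# Sketch (assumed data) of the Figure 3 group statistics: Friedman test over
# per-participant pattern correlations, then a Nemenyi post hoc on the ranks.
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # pip install scikit-posthocs

rng = np.random.default_rng(3)
n_subjects = 12  # hypothetical group size

# Hypothetical per-subject correlations of ApVk with each congruent syllable.
r_pa = rng.normal(0.1, 0.1, n_subjects)
r_ka = rng.normal(0.1, 0.1, n_subjects)
r_ta = rng.normal(0.4, 0.1, n_subjects)

stat, p = stats.friedmanchisquare(r_pa, r_ka, r_ta)
print(f"Friedman chi2 = {stat:.2f}, P = {p:.4f}")

# Nemenyi post hoc: rows = subjects (blocks), columns = conditions.
data = np.column_stack([r_pa, r_ka, r_ta])
print(sp.posthoc_nemenyi_friedman(data))
```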
Figure 4
Analysis of the classification condition (i.e., run 5). Contrast (P < 0.05 corrected) of activation resulting from hearing the syllable made from an audio /pa/ and a visual /ka/ (denoted as ApVk) when it was classified in one of two ways. Blue and orange indicate regions showing differential activation when participants classified ApVk as “ka” or “ta,” respectively, in a 3AFC task. Activation when ApVk was classified as “ka” is seen in the middle and inferior frontal gyri and the insula. Activation when ApVk was classified as “ta” or as “ka” occupies spatially adjacent but distinct areas in the right inferior and superior parietal lobules, left somatosensory cortices, left PMv cortex, and left primary motor cortex.
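A contrast between the two classification outcomes can be sketched as a voxelwise paired comparison. The paired t-test below is an assumption, not necessarily the authors' statistical model; the data are simulated, and the uncorrected alpha is a stand-in for the corrected threshold the paper reports.

```python
# Sketch of the Figure 4 contrast: for each voxel, compare activity when
# ApVk was classified "ka" against when it was classified "ta", and
# threshold in both directions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_subjects, n_voxels = 12, 1000  # hypothetical

beta_ka = rng.standard_normal((n_subjects, n_voxels))
beta_ta = rng.standard_normal((n_subjects, n_voxels))

t, p = stats.ttest_rel(beta_ka, beta_ta, axis=0)
alpha = 0.05  # stand-in; the paper uses a corrected threshold

ka_gt_ta = (p < alpha) & (t > 0)  # "blue" voxels: classified-"ka" > "ta"
ta_gt_ka = (p < alpha) & (t < 0)  # "orange" voxels: classified-"ta" > "ka"
print(ka_gt_ta.sum(), ta_gt_ka.sum())
```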

Source: PubMed
