Expectation and surprise determine neural population responses in the ventral visual stream

Tobias Egner, Jim M Monti, Christopher Summerfield

Abstract

Visual cortex is traditionally viewed as a hierarchy of neural feature detectors, with neural population responses being driven by bottom-up stimulus features. Conversely, "predictive coding" models propose that each stage of the visual hierarchy harbors two computationally distinct classes of processing unit: representational units that encode the conditional probability of a stimulus and provide predictions to the next lower level; and error units that encode the mismatch between predictions and bottom-up evidence, and forward prediction error to the next higher level. Predictive coding therefore suggests that neural population responses in category-selective visual regions, like the fusiform face area (FFA), reflect a summation of activity related to prediction ("face expectation") and prediction error ("face surprise"), rather than a homogenous feature detection response. We tested the rival hypotheses of the feature detection and predictive coding models by collecting functional magnetic resonance imaging data from the FFA while independently varying both stimulus features (faces vs houses) and subjects' perceptual expectations regarding those features (low vs medium vs high face expectation). The effects of stimulus and expectation factors interacted, whereby FFA activity elicited by face and house stimuli was indistinguishable under high face expectation and maximally differentiated under low face expectation. Using computational modeling, we show that these data can be explained by predictive coding but not by feature detection models, even when the latter are augmented with attentional mechanisms. Thus, population responses in the ventral visual stream appear to be determined by feature expectation and surprise rather than by stimulus features per se.

Figures

Figure 1.
Experimental protocol and behavior. A, Each trial commenced with an intertrial interval during which a fixation cross was presented, varying in duration from 2 to 4 s, drawn from a uniform distribution of 1 s steps (i.e., 2, 3, 4 s). Then, a colored frame (green, yellow, or blue) was presented that briefly preceded (by 250 ms) the addition of either a face or house stimulus inside that frame (for 750 ms). B, It was the subjects' task to detect occasional inverted (upside-down) target stimuli, an example of which is shown here, by performing a speeded right index finger button press. Targets occurred on 10% of all trials, could be either faces or houses, were equally likely to occur in association with each frame color, and the probability of a target being an inverted face or an inverted house stimulus was equal (50%) across the different color frame conditions. C, Orthogonal to task demands, the experimental manipulations of interest concerned nontarget trials, independently varying stimulus features (faces vs houses) and expectation for stimulus features, by probabilistically pairing frame color with stimulus type, with levels of 0.25, 0.50, and 0.75 (low, medium, high) probability of encountering a face stimulus (represented by blue, yellow, and green frames, respectively, in the example depicted). D, Mean RTs (± SEM) for target detection, shown as a function of target type (inverted face vs inverted house) and face expectation condition.
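The trial structure described in this caption can be sketched in Python as follows. This is an illustrative sketch only, not the authors' stimulus code. The function and field names are hypothetical, the color-to-probability mapping follows the example depicted, and two simplifying assumptions are made: frame colors are drawn with equal probability, and targets are drawn independently on each trial with probability 0.10 (the caption gives only the overall 10% rate).

```python
import random

# Face probability signaled by each frame color (mapping from the depicted example).
FACE_PROB_BY_FRAME = {"blue": 0.25, "yellow": 0.50, "green": 0.75}
ITI_CHOICES_S = (2, 3, 4)   # intertrial interval, uniform over 1 s steps
FRAME_LEAD_MS = 250         # frame onset precedes the stimulus by 250 ms
STIMULUS_MS = 750           # face or house shown inside the frame
TARGET_RATE = 0.10          # assumption: independent draw on every trial


def make_trial(rng):
    """Draw one trial; `rng` is a random.Random instance."""
    frame = rng.choice(list(FACE_PROB_BY_FRAME))
    is_target = rng.random() < TARGET_RATE
    # Targets (inverted stimuli) are faces or houses with equal probability,
    # regardless of frame color; nontargets follow the cued face probability.
    p_face = 0.5 if is_target else FACE_PROB_BY_FRAME[frame]
    stimulus = "face" if rng.random() < p_face else "house"
    return {
        "iti_s": rng.choice(ITI_CHOICES_S),
        "frame": frame,
        "stimulus": stimulus,
        "inverted": is_target,
        "frame_lead_ms": FRAME_LEAD_MS,
        "stimulus_ms": STIMULUS_MS,
    }


def make_trials(n, seed=0):
    """Generate a reproducible list of n trials."""
    rng = random.Random(seed)
    return [make_trial(rng) for _ in range(n)]
```

Calling `make_trials(300)` returns a list of trial dictionaries. Because targets are sampled independently of frame color, task demands remain orthogonal to the expectation manipulation, as the caption describes.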
Figure 2.
Predicted FFA population response patterns based on predictive coding and feature detection models. A, Predictive coding argues that FFA population responses reflect the sum (right) of activity generated by representation units (face expectation, left) and error units (face surprise, middle). Note that the predicted pattern in the right-hand panel is based on the (hitherto untested) assumption that expectation and surprise contribute equally (50:50) to the FFA population response. Uneven ratios would result either in enhancing (if face surprise contributed more strongly) or attenuating (if face expectation contributed more strongly) this interaction pattern. B, Feature detection views suppose that the FFA population response is driven by stimulus features, with face stimuli eliciting stronger responses than house stimuli.
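The two predictions contrasted in this caption can be made concrete with a small Python sketch. It is a toy illustration under stated assumptions, not the authors' computational model: face expectation is taken to equal the cued face probability, face surprise is taken to be 1 minus that probability for face stimuli and 0 for house stimuli, and the feature detection response levels are arbitrary. All function and parameter names are hypothetical.

```python
FACE_PROB = {"low": 0.25, "medium": 0.50, "high": 0.75}


def face_expectation(p_face):
    """Representation-unit activity; assumed equal to the cued face probability."""
    return p_face


def face_surprise(p_face, stimulus):
    """Error-unit activity; assumed to be unpredicted face evidence only."""
    return 1.0 - p_face if stimulus == "face" else 0.0


def predictive_coding_response(p_face, stimulus, w_surprise=0.5):
    """Weighted sum of expectation and surprise (0.5 gives the 50:50 mix)."""
    return ((1.0 - w_surprise) * face_expectation(p_face)
            + w_surprise * face_surprise(p_face, stimulus))


def feature_detection_response(stimulus, face_level=1.0, house_level=0.25):
    """Stimulus-driven response; the two levels are arbitrary illustrative values."""
    return face_level if stimulus == "face" else house_level


def face_house_difference(p_face, w_surprise=0.5):
    """Face minus house response under the predictive coding sketch."""
    return (predictive_coding_response(p_face, "face", w_surprise)
            - predictive_coding_response(p_face, "house", w_surprise))
```

Under these assumptions the face minus house difference equals `w_surprise * (1 - p_face)`, so it shrinks as face expectation rises, which is the interaction pattern the caption describes. Raising `w_surprise` above 0.5 enlarges that interaction and lowering it attenuates it, in line with the caption's remark on uneven ratios. The feature detection sketch, by contrast, returns the same face and house responses at every expectation level.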
Figure 3.
Functional MRI and computational modeling data. A, FFA localizer group results in the fusiform gyrus are displayed on axial and coronal sections of a single-subject normalized brain. Data are shown at a false discovery rate-corrected threshold of p < 0.05 (t = 3.53) (left FFA peak: x = −44, y = −50, z = −22, cluster = 54 voxels; right FFA peak: x = 48, y = −52, z = −18, cluster = 52 voxels). B, Mean group activation estimates (β parameters ± SEM) for each condition of the main experimental protocol are shown for the group FFA peak (x = −44, y = −50, z = −22) defined in the localizer task. C–F, Observed (colored markers) and best-fit simulated (black lines) FFA BOLD responses to face and house stimuli based on a predictive coding model (C), a feature detection model (D), and feature + attention models (E, F), where face expectation could impose either an additive baseline shift with varying levels of face expectation (E) or a multiplicative gain to faces on FFA responses (F) (see Materials and Methods).
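The two feature + attention variants named in panels E and F can be sketched in Python to show why neither reproduces a face minus house difference that shrinks with face expectation. This is a hedged toy illustration, not the fitted models from the paper's Materials and Methods: the response levels, the `shift` and `gain` parameters, and the function names are all hypothetical.

```python
FACE_LEVEL = 1.0    # arbitrary stimulus-driven response to faces
HOUSE_LEVEL = 0.25  # arbitrary stimulus-driven response to houses


def additive_attention(p_face, stimulus, shift=0.5):
    """Feature response plus a baseline shift that scales with face expectation."""
    base = FACE_LEVEL if stimulus == "face" else HOUSE_LEVEL
    return base + shift * p_face


def multiplicative_attention(p_face, stimulus, gain=0.5):
    """Feature response with an expectation-dependent gain applied to faces only."""
    if stimulus == "face":
        return FACE_LEVEL * (1.0 + gain * p_face)
    return HOUSE_LEVEL


def difference(model, p_face):
    """Face minus house response for a given model and face probability."""
    return model(p_face, "face") - model(p_face, "house")
```

In this sketch the additive variant leaves the face minus house difference constant across expectation levels, and the multiplicative variant makes it grow with face expectation. Neither matches the reported pattern, in which the two responses are maximally differentiated under low face expectation and indistinguishable under high face expectation.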

Source: PubMed
