How does the brain solve visual object recognition?

James J DiCarlo, Davide Zoccolan, Nicole C Rust

Abstract

Mounting evidence suggests that 'core object recognition,' the ability to rapidly recognize objects despite substantial appearance variation, is solved in the brain via a cascade of reflexive, largely feedforward computations that culminate in a powerful neuronal representation in the inferior temporal cortex. However, the algorithm that produces this solution remains poorly understood. Here we review evidence ranging from individual neurons and neuronal populations to behavior and computational models. We propose that understanding this algorithm will require using neuronal and psychophysical data to sift through many computational models, each based on building blocks of small, canonical subnetworks with a common functional goal.

Copyright © 2012 Elsevier Inc. All rights reserved.

Figures

Figure 1. Core object recognition
is the ability to rapidly (…
Figure 2. Untangling object representations
(A) The response pattern of a population of visual neurons (e.g., retinal ganglion cells) to each image (three images shown) is a point in a very high-dimensional space in which each axis is the response level of one neuron. (B) All possible identity-preserving transformations of an object form a low-dimensional manifold of points in the population vector space, i.e., a continuous surface (represented here, for simplicity, as a one-dimensional trajectory; see red and blue lines). Neuronal populations in early visual areas (retinal ganglion cells, LGN, V1) contain object identity manifolds that are highly curved and tangled together (see red and blue manifolds in left panel). The solution to the recognition problem is conceptualized as a series of successive re-representations along the ventral stream (black arrow), culminating in a new population representation (IT) that allows easy separation of one namable object’s manifold (e.g., a car; see red manifold) from all other object identity manifolds (of which the blue manifold is just one example). Geometrically, this amounts to remapping the visual images so that the resulting object manifolds can be separated by a simple weighted summation rule, i.e., a hyperplane (black dashed line; see DiCarlo and Cox, 2007). (C) The vast majority of naturally experienced images are not accompanied by labels (e.g., “car”, “plane”) and are thus shown as black points. However, images arising from the same source (e.g., an edge or object) tend to be nearby in time (gray arrows). Recent evidence shows that the ventral stream uses this implicit temporal contiguity instruction to build IT neuronal tolerance, and we speculate that it does so via an unsupervised learning strategy termed cortical local subspace untangling (see text). Note that, under this hypothetical strategy, “shape coding” is not the explicit goal; instead, “shape” information emerges as the residual natural image variation that is not specified by naturally occurring temporal contiguity cues.
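The geometric picture in (B) can be sketched in a few lines of code. Below, two toy two-dimensional “object manifolds” (concentric rings, standing in for the tangled manifolds of early visual areas; all parameters are illustrative, not from the paper) are tested with a simple perceptron, i.e., the “weighted summation rule” described above. In the raw space, no hyperplane can separate the rings; after one nonlinear re-representation (here, appending the radius, a stand-in for a ventral-stream re-representation), the manifolds untangle and become linearly separable.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_manifolds(n=200):
    # Two object-identity manifolds, parameterized by an identity-preserving
    # variable (the angle theta stands in for, e.g., viewing angle).
    theta = rng.uniform(0, 2 * np.pi, n)
    r = np.where(np.arange(n) < n // 2, 1.0, 2.0)  # object A: r=1, object B: r=2
    labels = (r > 1.5).astype(int)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    return X, labels

def linear_accuracy(X, y, epochs=50, lr=0.1):
    # Online perceptron = a single hyperplane (simple weighted summation rule).
    Xb = np.column_stack([X, np.ones(len(X))])  # append a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = int(xi @ w > 0)
            w += lr * (yi - pred) * xi
    return ((Xb @ w > 0).astype(int) == y).mean()

X, y = make_manifolds()
acc_tangled = linear_accuracy(X, y)  # curved, tangled manifolds: no hyperplane works
# Nonlinear re-representation: append the radius as a third "neuron".
X_untangled = np.column_stack([X, np.linalg.norm(X, axis=1)])
acc_untangled = linear_accuracy(X_untangled, y)  # now linearly separable
print(f"tangled: {acc_tangled:.2f}  untangled: {acc_untangled:.2f}")
```

In the re-represented space the data are separable with a wide margin, so the perceptron convergence theorem guarantees the decoder reaches perfect accuracy, while no hyperplane achieves this in the raw space.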

Figure 3. The ventral visual pathway
Figure 3. The ventral visual pathway
(A) Ventral stream cortical area locations in the macaque monkey brain, and the flow of visual information from the retina. (B) Each area is plotted so that its size is proportional to its cortical surface area (Felleman and Van Essen, 1991). The approximate total number of neurons (both hemispheres) is shown in the corner of each area (M = million). The approximate dimensionality of each representation (number of projection neurons) is shown above each area, based on neuronal densities (Collins et al., 2010), the layer 2/3 neuronal fraction (O’Kusky and Colonnier, 1982), and the portion (color) dedicated to processing the central 10 deg of the visual field (Brewer et al., 2002). Approximate median response latency is listed on the right (Nowak and Bullier, 1997; Schmolesky et al., 1998).

Figure 4. IT single unit properties and their relationship to population performance
(A) Post-stimulus spike histogram from an example IT neuron in response to one object image (a chair) that was the most effective among 213 tested object images (Zoccolan et al., 2007). (B) Left: the mean responses of the same IT neuron to each of the 213 object images (based on spike rate in the gray time window in A). Object images are ranked according to their effectiveness in driving the neuron. As is typical, the neuron responded strongly to ~10% of object images (four example images of nearly equal effectiveness are shown) and was suppressed below background rate by other objects (two example images shown), with no obvious indication of what critical features triggered or suppressed its firing. Colors indicate highly effective (red), medium-effective (blue), and poorly effective (green) images. Right: data from a second study (a different IT neuron) using natural image patches to illustrate the same point (Rust and DiCarlo, unpublished). (C) Response profiles from an example IT neuron obtained by varying the position (elevation) of three objects of high (red), medium (blue), and low (green) effectiveness. While response magnitude is not preserved, the rank-order object identity preference is maintained across the entire range of tested positions. (D) To explain the data in C, each IT neuron (right panel) is conceptualized as having joint, separable tuning for shape (identity) variables and for identity-preserving variables (e.g., position). If a population of such IT neurons tiles that space of variables (left panel), the resulting population representation conveys untangled object identity manifolds (Fig. 2B, right), while still conveying information about other variables such as position and size (Li et al., 2009). (E) Direct tests of untangled object identity manifolds use simple decoders (e.g., linear classifiers) to measure cross-validated population performance on categorization tasks (adapted from Hung et al., 2005 and Rust and DiCarlo, 2010). Performance approaches ceiling with only a few hundred neurons (left panel), and the same population decode gives nearly perfect generalization across moderate changes in position (1.5 deg and 3 deg shifts), scale (0.5x/2x and 0.33x/3x), and context (right panel), consistent with previous work (Hung et al., 2005; right bar) and with the simulations in (D).
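The logic of panels (D) and (E) can be illustrated with a minimal simulation (all sizes, tuning shapes, and noise levels are toy assumptions, not the actual recordings): each model “IT neuron” has multiplicatively separable tuning for object identity and retinal position, and a linear one-vs-rest decoder trained on responses at one position is tested, cross-validated, at a shifted position.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_objects = 128, 4

# Separable tuning (cf. Fig. 4D): response = f(identity) * g(position)
identity_gain = rng.uniform(0.2, 1.0, (n_neurons, n_objects))  # shape preference
pref_pos = rng.uniform(-3, 3, n_neurons)                       # position tuning centers

def population_response(obj, pos, noise=0.05):
    g = np.exp(-0.5 * ((pos - pref_pos) / 4.0) ** 2)  # broad position tuning
    return identity_gain[:, obj] * g + noise * rng.standard_normal(n_neurons)

def trials(pos, n_rep=30):
    X = np.stack([population_response(o, pos)
                  for o in range(n_objects) for _ in range(n_rep)])
    y = np.repeat(np.arange(n_objects), n_rep)
    return X, y

def train_decoder(X, y, reg=1e-3):
    # One linear readout (hyperplane) per object, fit by regularized least squares.
    Xb = np.column_stack([X, np.ones(len(X))])
    Y = np.eye(n_objects)[y]  # one-hot targets
    W = np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ Y)
    return W

def accuracy(W, X, y):
    Xb = np.column_stack([X, np.ones(len(X))])
    return ((Xb @ W).argmax(axis=1) == y).mean()

Xtr, ytr = trials(pos=0.0)
W = train_decoder(Xtr, ytr)
Xte, yte = trials(pos=3.0)  # held-out trials at a shifted position
acc_same = accuracy(W, Xtr, ytr)
acc_shift = accuracy(W, Xte, yte)
print(f"same-position acc: {acc_same:.2f}  shifted-position acc: {acc_shift:.2f}")
```

Because each neuron’s rank-order object preference is preserved across position (the position term only rescales the identity tuning), the decoder trained at one position generalizes well to the shifted position, mirroring the population result in panel (E).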

Figure 5. Abstraction layers and their potential links
Here we highlight four potential abstraction layers (organized by anatomical spatial scale) and the approximate number of inputs, outputs, and elemental sub-units at each level of abstraction (M = million, K = thousand). We suggest possible computational goals (what is the “job” of each level of abstraction?), algorithmic strategies (how might it carry out that job?), and transfer function elements (mathematical forms to implement the algorithm). We raise the possibility (gray arrow) that local cortical networks termed “subspace untanglers” are a useful level of abstraction for connecting the math that captures the transfer functions emulated by cortical circuits (rightmost panel) to the most elemental type of population transformation needed to build a good object representation (see Fig. 2C), and ultimately to the full untangling of object identity manifolds (as hypothesized here).

Figure 6. Serial-chain discriminative models of object recognition
A class of biologically inspired models of object recognition aims to achieve a gradual untangling of object manifolds by stacking layers of neuronal units in a largely feedforward hierarchy. In this example, units in each layer process their inputs using either AND-like (red units) or OR-like (e.g., “MAX”; blue units) operations, and these operations are applied in parallel in alternating layers. The AND-like operation constructs some tuning for combinations of visual features (e.g., simple cells in V1), and the OR-like operation constructs some tolerance to changes in, e.g., position and size by pooling over AND-like units with identical feature tuning but receptive fields at slightly different retinal locations and sizes. This can produce a gradual increase in tolerance to variation in object appearance along the hierarchy (e.g., Fukushima, 1980; Riesenhuber and Poggio, 1999b; Serre et al., 2007a). AND-like and OR-like operations can each be formulated (Kouh and Poggio, 2008) as a variant of a standard LN neuronal model with nonlinear gain control mechanisms (e.g., a type of NLN model; see dashed frame).
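The alternating AND-like/OR-like scheme can be sketched as a compact NumPy toy model (template count, image size, and stimuli are illustrative, not from any published model): an AND-like stage computes normalized template matches at every image location, and an OR-like stage takes the MAX over locations, discarding position and thereby gaining position tolerance.

```python
import numpy as np

rng = np.random.default_rng(2)

def and_like(image, templates, patch=3):
    # AND-like (simple-cell-style) stage: each unit responds when its input
    # patch matches its template (normalized dot product at every location).
    H = image.shape[0] - patch + 1
    out = np.zeros((len(templates), H, H))
    for k, t in enumerate(templates):
        tn = t / (np.linalg.norm(t) + 1e-9)
        for i in range(H):
            for j in range(H):
                p = image[i:i + patch, j:j + patch].ravel()
                out[k, i, j] = p @ tn / (np.linalg.norm(p) + 1e-9)
    return out

def or_like(maps):
    # OR-like ("MAX") stage: pool each feature map over all positions,
    # discarding location and thereby gaining position tolerance.
    return maps.max(axis=(1, 2))

templates = [rng.standard_normal(9) for _ in range(8)]

img = np.zeros((12, 12))
img[2:5, 2:5] = rng.standard_normal((3, 3))        # an "object" feature
shifted = np.roll(img, shift=(3, 3), axis=(0, 1))  # same feature, shifted

f1 = or_like(and_like(img, templates))
f2 = or_like(and_like(shifted, templates))
print("position tolerant:", np.allclose(f1, f2))
```

Because the shifted image contains exactly the same patches at translated locations, the MAX over positions yields identical outputs for both images: the feature identity survives while its position is discarded, the one-layer analogue of the gradual tolerance buildup described above.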

Source: PubMed
