Hierarchical models in the brain

Karl Friston

Abstract

This paper describes a general model that subsumes many parametric models for continuous data. The model comprises hidden layers of state-space or dynamic causal models, arranged so that the output of one provides input to another. The ensuing hierarchy furnishes a model for many types of data, of arbitrary complexity. Special cases range from the general linear model for static data to generalised convolution models, with system noise, for nonlinear time-series analysis. Crucially, all of these models can be inverted using exactly the same scheme, namely, dynamic expectation maximization. This means that a single model and optimisation scheme can be used to invert a wide range of models. We present the model and a brief review of its inversion to disclose the relationships among, apparently, diverse generative models of empirical data. We then show that this inversion can be formulated as a simple neural network and may provide a useful metaphor for inference and learning in the brain.
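For reference, the hierarchical dynamic form at issue can be written compactly as follows (a sketch in the notation of the DEM literature, where superscripts index levels, the causes v link levels, the hidden states x link dynamics over time, and z, w are random fluctuations):

```latex
\begin{aligned}
y             &= g\big(x^{(1)}, v^{(1)}\big) + z^{(1)}\\
\dot{x}^{(1)} &= f\big(x^{(1)}, v^{(1)}\big) + w^{(1)}\\
              &\;\;\vdots\\
v^{(i-1)}     &= g\big(x^{(i)}, v^{(i)}\big) + z^{(i)}\\
\dot{x}^{(i)} &= f\big(x^{(i)}, v^{(i)}\big) + w^{(i)}
\end{aligned}
```

Removing the equations of motion and making g linear recovers the general linear model for static data; retaining nonlinear f and g with state noise w gives the generalised convolution models mentioned above.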

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Figure 1. Conditional dependencies of dynamic (right) and hierarchical (left) models, shown as directed Bayesian graphs.
The nodes of these graphs correspond to quantities in the model and the responses they generate. The arrows or edges indicate conditional dependencies between these quantities. The form of the models is provided, both in terms of their state-space equations (above) and in terms of the prior and conditional probabilities (below). The hierarchical structure of these models induces empirical priors; dynamical priors are mediated by the equations of generalised motion and structural priors by the hierarchical form, under which states in higher levels provide constraints on the level below.
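Under this graph, the joint density factorises level by level, which is what makes the priors empirical (a sketch, treating the response as the zeroth cause, v⁽⁰⁾ = y, and suppressing parameters):

```latex
p(y, x, v) \;=\; \prod_{i=1}^{m} p\big(v^{(i-1)} \mid x^{(i)}, v^{(i)}\big)\, p\big(x^{(i)} \mid v^{(i)}\big), \qquad v^{(0)} = y
```

Each conditional over causes is a Gaussian centred on the output function of the level above, so higher levels supply constraints (priors) on the level below, exactly as described in the caption.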
Figure 2. Image representations of the precision matrices encoding temporal dependencies among the generalised motion of random fluctuations.
Precisions in generalised coordinates (left) and over discrete samples in time (right) are shown for a roughness of γ = 4 and seventeen observations (with an order of n = 16). This corresponds to an autocorrelation function whose width is half a time bin. With this degree of temporal correlation, only a few (i.e., five or six) discrete local observations are specified with any precision.
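A short sketch of how such a precision matrix can be computed, using the standard result that, for a stationary process with autocorrelation ρ(h), the covariance between its i-th and j-th temporal derivatives is (−1)ʲ ρ⁽ⁱ⁺ʲ⁾(0). The specific parameterisation of the Gaussian autocorrelation in terms of the roughness γ is an assumption made here for illustration:

```python
# Sketch: precision of generalised motion from a Gaussian autocorrelation.
# Assumes rho(h) = exp(-gamma*h^2/4); only its Gaussian shape matters here.
import numpy as np
import sympy as sp

def generalised_precision(gamma=4.0, n=6):
    h = sp.symbols('h')
    rho = sp.exp(-gamma * h**2 / 4)
    # derivatives of the autocorrelation, evaluated at h = 0
    d = [sp.diff(rho, h, k).subs(h, 0) for k in range(2 * n)]
    # covariance among the first n derivatives of the process
    S = np.array([[float((-1)**j * d[i + j]) for j in range(n)]
                  for i in range(n)])
    return np.linalg.inv(S)          # precision in generalised coordinates

R = generalised_precision()          # cf. Figure 2, left panel
```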
Figure 3. Example of estimation under a mixed-effects or hierarchical linear model.
The inversion was cross-validated with expectation maximization (EM), where the M-step corresponds to restricted maximum likelihood (ReML). This example used a simple two-level model that embodies empirical shrinkage priors on the first-level parameters. These models are also known as parametric empirical Bayes (PEB) models (left). Causes were sampled from the unit normal density to generate a response, which was used to recover the causes, given the parameters. Slight differences in the hyperparameter estimates (upper right), due to a different hyperparameterisation, have little effect on the conditional means of the unknown causes (lower right), which are almost indistinguishable.
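To make the shrinkage-prior structure concrete, here is a minimal sketch (not the ReML code used in the figure) of the conditional density over first-level parameters in a two-level model y = Xb + e, with b ~ N(0, λb⁻¹I) and e ~ N(0, λe⁻¹I), treating the hyperparameters (precisions) as known:

```python
# Minimal PEB sketch: conditional density of first-level parameters under
# empirical shrinkage priors, given the precisions le (noise) and lb (prior).
import numpy as np

def peb_conditional(X, y, le, lb):
    n, p = X.shape
    P = le * X.T @ X + lb * np.eye(p)   # conditional precision
    C = np.linalg.inv(P)                # conditional covariance
    m = le * C @ X.T @ y                # conditional mean, shrunk towards 0
    return m, C

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 4))
b = rng.standard_normal(4)              # causes sampled from N(0, I)
y = X @ b + 0.5 * rng.standard_normal(32)
m, C = peb_conditional(X, y, le=1 / 0.25, lb=1.0)
```

In full EM/ReML, the M-step would update le and lb from the data; here they are fixed to illustrate the E-step alone.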
Figure 4. Example of Factor Analysis using a hierarchical model, in which the causes have deterministic and stochastic components.
Parameters and causes were sampled from the unit normal density to generate a response, which was then used for their estimation. The aim was to recover the causes without knowing the parameters, which is effected with reasonable accuracy (upper). The conditional estimates of the causes and parameters are shown in the lower panels, along with the increase in free-energy or log-evidence with the number of DEM iterations (lower left). Note that there is an arbitrary affine mapping between the conditional means of the causes and their true values, which we estimated post hoc to show the correspondence in the upper panel.
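The affine ambiguity noted here is generic to factor analysis: factor scores are recovered only up to a linear mapping. A brief sketch of the same post hoc alignment, using scikit-learn's FactorAnalysis as a stand-in for the hierarchical model's first level (all sizes illustrative):

```python
# Factor scores are identified only up to a linear mapping; regress the
# true causes on the estimates to exhibit the correspondence post hoc.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n, p, k = 200, 8, 2
W = rng.standard_normal((p, k))          # unknown loading matrix
Z = rng.standard_normal((n, k))          # true causes ~ N(0, I)
Y = Z @ W.T + 0.3 * rng.standard_normal((n, p))

Zhat = FactorAnalysis(n_components=k).fit_transform(Y)
A, *_ = np.linalg.lstsq(Zhat, Z, rcond=None)   # post hoc affine alignment
print(np.corrcoef((Zhat @ A).ravel(), Z.ravel())[0, 1])
```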
Figure 5. This schematic shows the linear convolution model used in the subsequent figure in terms of a directed Bayesian graph.
In this model, a simple Gaussian 'bump' function acts as a cause to perturb two coupled hidden states. Their dynamics are then projected to four response variables, whose time-courses are cartooned on the left. This figure also summarises the architecture of the implicit inversion scheme (right), in which precision-weighted prediction errors drive the conditional modes to optimise variational action. Critically, the prediction errors propagate their effects up the hierarchy (cf. Bayesian belief propagation or message passing), whereas the predictions are passed down the hierarchy. This sort of scheme can be implemented easily in neural networks (see the last section for a neurobiological treatment). This generative model uses a single cause v(1), two dynamic states and four outputs y1, …, y4. The lines denote the dependencies of the variables on each other, summarised by the equations (in this example, both equations were simple linear mappings). This is effectively a linear convolution model, mapping one cause to four outputs, which form the inputs to the recognition model (solid arrow). The inputs to the four data or sensory channels are also shown as an image in the insert.
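A sketch of the generative side of this figure (all numerical values are illustrative assumptions, not those used in the paper): a Gaussian bump cause v drives two coupled hidden states via dx/dt = Ax + Bv, which are mapped linearly to four response channels:

```python
# Toy linear convolution model: one cause -> two hidden states -> four outputs.
import numpy as np

rng = np.random.default_rng(3)
dt, T = 1.0, 32
A = np.array([[-0.25,  1.00],
              [-0.50, -0.25]])           # coupled hidden-state flow
B = np.array([1.0, 0.0])                 # how the cause enters the states
C = rng.standard_normal((4, 2))          # observer (output) mapping

t = np.arange(T)
v = np.exp(-0.25 * (t - 12.0) ** 2)      # Gaussian 'bump' cause
x = np.zeros(2)
Y = np.empty((T, 4))
for i in range(T):
    x = x + dt * (A @ x + B * v[i])      # Euler step (state noise omitted)
    Y[i] = C @ x + 0.1 * rng.standard_normal(4)   # noisy responses
```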
Figure 6. The predictions and conditional densities on the states and parameters of the linear convolution model of the previous figure.
Each row corresponds to a level, with causes on the left and hidden states on the right. In this case, the model has just two levels. The first (upper left) panel shows the predicted response and the error on this response (their sum corresponds to the observed data). For the hidden states (upper right) and causes (lower left), the conditional mode is depicted by a coloured line and the 90% conditional confidence intervals by the grey area; these are sometimes referred to as "tubes". Finally, the grey lines depict the true values used to generate the response. Here, we estimated the hyperparameters, parameters and states. This is an example of triple estimation, in which we infer the states of the system as well as the parameters governing its causal architecture. The hyperparameters correspond to the precision of random fluctuations in the response and the hidden states. The free parameters correspond to a single parameter from the state equation and one from the observer equation, which govern the dynamics of the hidden states and response, respectively. It can be seen that the true value of the causal state lies within the 90% confidence interval and that we could infer with substantial confidence that the cause was non-zero when it occurred. Similarly, the true parameter values lie within fairly tight confidence intervals (red bars in the lower right).
Figure 7. Ontology of models starting with a simple general linear model with two levels (the PCA model).
This ontology is one of many that could be constructed and is based on the fact that hierarchical dynamic models have several attributes that can be combined to create an infinite number of models, some of which are shown in the figure. These attributes include: (i) the number of levels or depth; (ii) for each level, linear or nonlinear output functions; (iii) with or without random fluctuations; (iv) static or dynamic; (v) for dynamic levels, linear or nonlinear equations of motion; (vi) with or without state noise; and, finally, (vii) with or without generalised coordinates.
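To make the combinatorial point concrete, a toy enumeration of the per-level attribute combinations (attribute names are ad hoc, not the paper's):

```python
# Counting per-level attribute combinations in the model ontology.
from itertools import product

attributes = {
    "output":   ("linear", "nonlinear"),
    "noise":    ("none", "random fluctuations"),
    "dynamics": ("static", "dynamic"),
    "motion":   ("linear", "nonlinear"),   # only meaningful if dynamic
    "states":   ("no state noise", "state noise"),
    "coords":   ("plain", "generalised"),
}
models = list(product(*attributes.values()))
print(len(models))   # 64 combinations per level, before choosing the depth
```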
Figure 8. Schematic detailing the neuronal architectures that encode an ensemble density on the states and parameters of one level in a hierarchical model.
This schematic shows the speculative cells of origin of forward driving connections that convey prediction error from a lower area to a higher area and the backward connections that are used to construct predictions. These predictions try to explain away input from lower areas by suppressing prediction error. In this scheme, the sources of forward connections are the superficial pyramidal cell population and the sources of backward connections are the deep pyramidal cell population. The differential equations relate to the optimisation scheme detailed in the main text and their constituent terms are placed alongside the corresponding connections. The state-units and their efferents are in black and the error-units in red, with causes on the left and hidden states on the right. For simplicity, we have assumed the output of each level is a function of, and only of, the hidden states. This induces a hierarchy over levels and, within each level, a hierarchical relationship between states, where hidden states predict causes.
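A minimal linear predictive-coding sketch of this message passing (a toy gradient scheme, not the paper's full DEM update): error units compute precision-weighted prediction errors, which are passed forward; state units integrate them, and predictions are passed back. The model y = Wv + noise with a zero-mean prior on v, and all values below, are illustrative assumptions:

```python
# Toy predictive coding: state units do gradient ascent on free-energy,
# driven by ascending precision-weighted prediction errors.
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((4, 2))          # generative (backward) mapping
Pi_y, Pi_v = 4.0, 1.0                    # precisions of data and prior
v_true = np.array([1.0, -0.5])
y = W @ v_true + 0.2 * rng.standard_normal(4)

mu = np.zeros(2)                         # state units (conditional mode)
for _ in range(500):
    eps_y = Pi_y * (y - W @ mu)          # ascending prediction error
    eps_v = Pi_v * (0.0 - mu)            # error on the (zero-mean) prior
    mu += 0.01 * (W.T @ eps_y + eps_v)   # descending predictions via W
```

At the fixed point, mu is the conditional mean, and the prediction errors are exactly balanced by the prior, mirroring the 'explaining away' described in the caption.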
Figure 9. Schematic detailing the neuronal architectures that encode an ensemble density on the states and parameters of hierarchical models.
This schematic shows how the neuronal populations of the previous figure may be deployed hierarchically within three cortical areas (or macro-columns). Within each area, the cells are shown in relation to the laminar structure of the cortex, which includes supra-granular (SG), granular (L4) and infra-granular (IG) layers.
Figure 10. The ensemble density and its mean-field partition.
q(ϑ) is the ensemble density and is encoded in terms of the sufficient statistics of its marginals. These statistics or variational parameters (e.g., mean or expectation) change to extremise free-energy, to render the ensemble density an approximate conditional density on the causes of sensory input. The mean-field partition corresponds to a factorization over the sets comprising the partition. Here, we have used three sets (neural activity, modulation and connectivity). Critically, the optimisation of the parameters of any one set depends on the parameters of the other sets. In this figure, we have focused on the means or expectations µ_i of the marginal densities, q(ϑ_i) = N(ϑ_i: µ_i, C_i).
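The circular dependency among the sets is the standard mean-field fixed point. Writing ϑ_{\i} for all sets but the i-th:

```latex
q(\vartheta) = \prod_i q(\vartheta_i), \qquad
F = \big\langle \ln p(y,\vartheta) - \ln q(\vartheta) \big\rangle_{q(\vartheta)}, \qquad
\ln q(\vartheta_i) = \big\langle \ln p(y,\vartheta) \big\rangle_{q(\vartheta_{\setminus i})} + \mathrm{const}
```

Extremising F with respect to each marginal therefore couples every set's sufficient statistics to the expectations of the others, as the caption states.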


Source: PubMed
