Model-based learning protects against forming habits

Claire M Gillan, A Ross Otto, Elizabeth A Phelps, Nathaniel D Daw

Abstract

Studies in humans and rodents have suggested that behavior can at times be "goal-directed" (that is, planned and purposeful) and at times "habitual" (that is, inflexible and automatically evoked by stimuli). This distinction is central to conceptions of pathological compulsion, as in drug abuse and obsessive-compulsive disorder. Evidence for the distinction has primarily come from outcome-devaluation studies, in which the sensitivity of a previously learned behavior to motivational change is used to assay the dominance of habits versus goal-directed actions. However, little is known about how habits and goal-directed control arise. Specifically, in the present study we sought to reveal the trial-by-trial dynamics of instrumental learning that promote, or protect against, the development of habits. In two complementary experiments with independent samples, participants completed a sequential decision task that dissociated two computational learning mechanisms, model-based and model-free. We then tested for habits by devaluing one of the rewards that had reinforced behavior. In each case, we found that individual differences in model-based learning predicted the participants' subsequent sensitivity to outcome devaluation, suggesting that an associative mechanism underlies a bias toward habit formation in healthy individuals.

Figures

Fig. 1
Experiment 1: Reinforcement-learning task. (A) Participants entered one of two start states on each trial, which were associated with the receipt of gold and silver coins, each worth 25¢. Participants had 2.5 seconds (s) to make a choice, costing 1¢, which would commonly (70 %) lead them to a certain second-stage state and rarely (30 %) lead them to the alternative second-stage state. No choices were made at the second stage; each second-stage state had a unique probability of reward that changed slowly over the course of the experiment. (B) Graph depicting a purely model-free learner, whose behavior is predicted solely by reinforcement history. (C) A purely model-based learner’s behavior, in contrast, is predicted by an interaction between reward and transition, such that behavior would mirror the model-free learner’s only when the transition from the initial choice to the outcome was common. Following rare transitions, a purely model-based learner would show the reverse pattern
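The dissociation in panels (B) and (C) can be illustrated with a small simulation. The sketch below is not the authors' task code: it assumes a simplified single-start-state version of the task, and all parameters (learning rate, softmax temperature, drift size, fixed rather than exactly matched reward schedules) are illustrative. It compares the stay probabilities of a purely model-free (TD) agent and a purely model-based planner, conditioned on the previous trial's reward and transition type.

```python
import math
import random

random.seed(0)

ALPHA, BETA = 0.7, 3.0   # illustrative learning rate and softmax inverse temperature
P_COMMON = 0.7           # common-transition probability, as in the task
N_TRIALS = 30000

def softmax_choice(q):
    """Pick first-stage action 0 or 1 from a two-option softmax."""
    p0 = 1.0 / (1.0 + math.exp(-BETA * (q[0] - q[1])))
    return 0 if random.random() < p0 else 1

def drift(p):
    """Slowly drifting reward probability, reflected into [.25, .75]."""
    return min(max(p + random.gauss(0.0, 0.025), 0.25), 0.75)

def simulate(agent):
    q = [0.0, 0.0]            # first-stage action values
    v = [0.0, 0.0]            # second-stage state values (model-based only)
    p_reward = [0.6, 0.4]
    stays, counts = {}, {}
    prev = None
    for _ in range(N_TRIALS):
        a = softmax_choice(q)
        common = random.random() < P_COMMON
        s = a if common else 1 - a        # action i commonly leads to state i
        r = 1 if random.random() < p_reward[s] else 0
        if agent == "model-free":
            q[a] += ALPHA * (r - q[a])    # TD(1): reward directly reinforces the chosen action
        else:
            v[s] += ALPHA * (r - v[s])    # learn state values, then plan with the known transitions
            q[0] = P_COMMON * v[0] + (1 - P_COMMON) * v[1]
            q[1] = P_COMMON * v[1] + (1 - P_COMMON) * v[0]
        if prev is not None:
            key = (prev[1], prev[2])      # (previous reward, previous transition common?)
            counts[key] = counts.get(key, 0) + 1
            stays[key] = stays.get(key, 0) + (a == prev[0])
        prev = (a, r, common)
        p_reward = [drift(p) for p in p_reward]
    return {k: stays[k] / counts[k] for k in stays}

mf = simulate("model-free")
mb = simulate("model-based")
```

In this sketch the model-free agent repeats rewarded actions regardless of transition type (a main effect of reward, panel B), whereas the model-based agent's repetition depends on the reward × transition interaction (panel C), reversing after rare transitions.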
Fig. 2
Experiment 1: Devaluation and consumption tests. (A) The 24-trial devaluation stage consisted of presentations of the first-stage choices only; that is, participants did not transition to the second stages and never learned the outcomes of their choices. This ensured that responding during the devaluation test depended only on prior learning. Participants were informed that the task would continue as before, but that they would no longer be shown the results of their choices. (B) After four trials of experience with the concealed trial outcomes, one type of coin was devalued by informing participants that the corresponding container was completely full. (C) This was followed by a consumption test, in which participants had 4 s to freely collect coins using their mouse. Next, they completed the 20 test trials, in which habits were quantified as the difference between the numbers of responses made to the valued and devalued states
Fig. 3
Experiment 1: Effect sizes (beta weights) from the logistic regression model (Table 1). Significant effects were observed for reward (model-free, p < .001), the Reward × Transition interaction (model-based, p = .020), and the predicted three-way interaction of reward, transition, and devaluation sensitivity (p = .003). rew = reward, trans = transition, dev = devaluation sensitivity
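The regression behind these beta weights can be sketched in a few lines. The code below is not the authors' analysis (which also includes the between-subjects devaluation-sensitivity moderator shown in the three-way interaction); it is a minimal illustration of effect-coding reward and transition as ±1 and recovering the reward (model-free) and reward × transition (model-based) weights from synthetic stay/switch data by gradient ascent on the logistic log-likelihood. The "true" coefficients and all fitting settings are made up.

```python
import math
import random

random.seed(1)

# synthetic data: stay ~ logistic(b_rew * rew + b_int * rew * trans),
# with previous reward and transition effect-coded as +/-1 (coefficients invented)
b_rew_true, b_int_true = 0.5, 0.8
trials = []
for _ in range(5000):
    rew = random.choice([-1, 1])     # previous trial rewarded?
    trans = random.choice([-1, 1])   # previous transition common?
    z = b_rew_true * rew + b_int_true * rew * trans
    stay = 1 if random.random() < 1.0 / (1.0 + math.exp(-z)) else 0
    trials.append((rew, trans, stay))

# maximum-likelihood fit by batch gradient ascent
w = [0.0, 0.0, 0.0]                  # intercept, reward, reward x transition
for _ in range(200):
    grad = [0.0, 0.0, 0.0]
    for rew, trans, stay in trials:
        x = (1.0, rew, rew * trans)
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i in range(3):
            grad[i] += (stay - p) * x[i]
    for i in range(3):
        w[i] += 0.001 * grad[i]      # small fixed step; converges at this problem size
```

Here w[1] plays the role of the "rew" (model-free) weight and w[2] the "rew × trans" (model-based) weight plotted in the figure.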
Fig. 4
Experiment 1: Model-based learning and habit formation. (A) Histogram displaying devaluation sensitivity in the entire sample in Experiment 1. Devaluation sensitivity is defined as the difference between the numbers of valued and devalued responses performed in the test stage, with larger numbers indicating greater sensitivity to devaluation. To illustrate the relationship between model-based learning and habit formation, a median split divides the sample into (B) habit (devaluation sensitivity < 1) and (C) goal-directed (devaluation sensitivity > 1) groups. Those who displayed habits at test showed a marked absence of the signature of model-based learning, p < .003
Fig. 5
Experiment 2: Reinforcement-learning task. (A) Participants entered the same starting state on each trial and had 2.5 s to make a choice between two fractal stimuli that always appeared in this state. One fractal commonly (70 %) led to one of the second-stage states and rarely (30 %) led to the other. In contrast to Experiment 1, each second-stage state was uniquely associated with a certain type of coin (gold or silver). (B) For the first 150 trials, reward probabilities (the chance of winning a coin in a given second-stage state) drifted slowly over time according to Gaussian random walks. For the next 50 trials, the reward probabilities stabilized at .9 and .1, for the second-stage states associated with the to-be-devalued and to-remain-valued outcomes, respectively. This served to systematically bias all participants toward making the action that would later be devalued. Devaluation was randomized across coin colors and reward drifts
Fig. 6
Experiment 2: Model-based learning and habit formation. (A) Histogram displaying devaluation sensitivity in the entire sample from Experiment 2. Here, devaluation sensitivity is defined as the proportion of valued choices (over total choices) made at the test stage, with larger numbers indicating greater sensitivity to devaluation. To illustrate the relationship between model-based learning and habit formation, a median split divides the sample into (B) habit (devaluation sensitivity < .6) and (C) goal-directed (devaluation sensitivity > .6) groups. Consistent with Experiment 1, the participants who displayed habits in Experiment 2 (i.e., failed to prefer valued over devalued choices) showed a reduction in the signature of model-based learning, p < .001
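The two devaluation-sensitivity measures used across the experiments reduce to simple arithmetic; a sketch follows, with hypothetical function names and the cutoffs taken from the figure captions for illustration.

```python
def devaluation_sensitivity_exp1(n_valued, n_devalued):
    """Experiment 1: difference between counts of valued and devalued test responses."""
    return n_valued - n_devalued

def devaluation_sensitivity_exp2(n_valued, n_total):
    """Experiment 2: proportion of valued choices over all test choices."""
    return n_valued / n_total

def median_split(scores, cutoff):
    """Illustrative grouping as in the captions (which use strict < and > around the cutoff)."""
    return ["habit" if s < cutoff else "goal-directed" for s in scores]
```

For example, a participant making 12 valued and 8 devalued responses in Experiment 1 scores 4; one making 15 valued choices out of 20 in Experiment 2 scores .75, placing both in the goal-directed group under the captions' cutoffs.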


Source: PubMed
