This post is based on the paper “Cognitive Control Predicts Use of Model-Based Reinforcement Learning” by A. Ross Otto, Anya Skatova, Seth Madlon-Kay, and Nathaniel D. Daw, Journal of Cognitive Neuroscience, February 2015, 27(2): 319–333, doi:10.1162/jocn_a_00709. The paper is difficult to understand, but it covers some interesting subject matter. Andy Clark alerted me to these authors in his book Surfing Uncertainty.
This paper makes the obvious assertion that dual-process theories of decision making abound, and that a recurring theme is that the systems rely differentially on automatic or habitual versus deliberative or goal-directed modes of processing. According to Otto et al., a popular refinement of this idea proposes that the two modes of choice arise from distinct strategies for learning the values of different actions, which operate in parallel. In this theory, habitual choices are produced by model-free reinforcement learning (RL), which learns which actions tend to be followed by rewards. In contrast, goal-directed choice is formalized by model-based RL, which reasons prospectively about the value of candidate actions using knowledge (a learned internal “model”) of the environment’s structure and the organism’s current goals. Whereas model-free choice merely requires retrieving the (directly learned) values of previous actions, model-based valuation requires a sort of mental simulation, carried out at decision time, of the likely consequences of candidate actions, using the learned internal model. Under this framework, at any given moment both the model-based and model-free systems can provide action values to guide choices, inviting a critical question: how does the brain determine which system’s preferences ultimately control behavior?
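To make the distinction concrete, here is a minimal Python sketch of my own, not the authors' task or code: the model-free chooser simply retrieves cached action values learned from past rewards, while the model-based chooser computes values at decision time by simulating each action's consequences with an internal model. All names, structure, and numbers are hypothetical.

```python
# Hypothetical one-step world: action "a" leads to state 1, action "b" to state 2.
TRANSITIONS = {"a": 1, "b": 2}       # the learned internal "model"
STATE_REWARD = {1: 0.2, 2: 0.8}      # current reward estimates for each state

# Model-free side: values are learned directly and simply retrieved at choice time.
q_values = {"a": 0.0, "b": 0.0}
ALPHA = 0.1  # learning rate

def model_free_update(action, reward):
    """TD-style update: nudge the cached value toward the received reward."""
    q_values[action] += ALPHA * (reward - q_values[action])

def model_free_choose():
    """Pure retrieval of cached values -- no simulation of consequences."""
    return max(q_values, key=q_values.get)

def model_based_choose():
    """Mental simulation at decision time: look up where each action leads
    in the internal model, and value it by the simulated outcome."""
    simulated = {a: STATE_REWARD[TRANSITIONS[a]] for a in TRANSITIONS}
    return max(simulated, key=simulated.get)
```

The key contrast is that `model_free_choose` never consults `TRANSITIONS`: if the world changes, its cached values are stale until relearned, whereas `model_based_choose` reflects the change immediately.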
The authors conducted two experiments. In Experiment 1, Otto et al. demonstrated that interference effects in the Stroop color-naming task relate to the expression of model-based choice in sequential decision making. They first measured participants’ susceptibility to interference in a version of the Stroop task, in which subjects respond to the ink color of a color word (e.g., “RED”) while ignoring its semantic meaning. In the Stroop task, cognitive control biases attentional allocation, strengthening attention to the task-relevant feature and/or inhibiting task-irrelevant features, which in turn permits the overriding of inappropriate, prepotent responses. Of key interest was the incongruency effect (IE): the additional time required to produce a correct response on incongruent (“RED” in blue type) compared to congruent (“RED” in red type) trials. Incongruent trials require inhibition of the prepotent color-reading response, thought to be a critical reflection of cognitive control. The authors then examined whether an individual’s IE predicted the expression of model-based strategies in sequential choice. The results suggest that greater susceptibility to Stroop interference (i.e., more slowing, interpreted as poorer cognitive control) predicts a smaller contribution of model-based RL. This suggests an underlying correspondence between cognitive control ability and the relative expression of the two RL systems. Experiment 2 examined individual differences in context processing more precisely, revealing how utilization of task-relevant contextual information, a more specific hallmark of cognitive control, predicts model-based tendencies in choice behavior.
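For the IE itself, the computation is just a difference in mean reaction times. A short sketch with made-up numbers (the paper's actual analysis is of course more involved):

```python
def incongruency_effect(incongruent_rts, congruent_rts):
    """IE = mean reaction time on incongruent trials minus mean on congruent trials.

    Larger values indicate more Stroop interference, interpreted here as
    poorer cognitive control. Times are in milliseconds.
    """
    mean = lambda xs: sum(xs) / len(xs)
    return mean(incongruent_rts) - mean(congruent_rts)

# Hypothetical reaction times for one participant (ms):
ie = incongruency_effect([700, 720, 710], [600, 620, 610])  # 100.0 ms of slowing
```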
In short, Otto et al. found that difficulty in producing appropriate actions in the face of prepotent, interfering tendencies was associated with diminished model-based choice signatures in a separate decision-making task. From their findings the authors also conclude that, rather than a simple selection between two sets of preferences, the interaction between the two systems may be more top-down or hierarchical in nature. In such an arbitration scheme, cognitive control might bolster the influence of model-based relative to model-free choice by actively boosting task-relevant representations or actively inhibiting model-free responses. Importantly, the authors suggest that whereas model-free action values are directly produced by learning and can simply be retrieved at choice time (consistent with prepotency), model-based values are typically viewed as computed at decision time, through a sort of mental simulation using the internal model.
Frankly, this paper seems old even though it was written in 2015. The idea that model-free learning involves no simulation while model-based learning uses simulation brings up the cognitive continuum (see post Cognitive Continuum) once again. In some situations, like the Stroop test, the cue validities are so large that there is a “prepotent” response; hence there may be only one simulation to run. However, it is not a dichotomy. As the number of cues increases and the validities moderate, we do more and more simulation. Otto et al.’s use of the IE as an individual factor may be related to Glockner and Jekel’s Parameter P (see post The Fog of the Blog), which involves individual sensitivity to cue validities. This is a further example that there may be only a few stable individual tendencies that account for large portions of the differences in how our individual brains make decisions. It also brings me to an idea from the Parallel Constraint Satisfaction model that continues to ring true with me: that the automatic system is for running simulations and making decisions, while the analytic system is tasked with getting more information to run better simulations.
Andy Clark in Surfing Uncertainty agrees with me and dismisses the dichotomy of model-free and model-based as unlikely to stand the test of time. Nevertheless, he does adapt it to the Predictive Processing framework by associating model-free responses with the bottom-up flow of sensory error messages, and model-based responses with the top-down predictions.
My idea, frankly, is that model-free and model-based are backwards. If I have a good model, everything is automatic. For instance, catching a fly ball is relatively easy if you can invoke the heuristic. The model works, and there is little use in trying to add more information or anything else except improving your speed. When your model is not so good, you have to keep tweaking and testing it.
I find Kenneth Hammond’s ideas of correspondence and coherence (see post Beyond Rationality Part 1) to be more useful. Correspondence is good when you get accurate results from multiple fallible indicators. This requires appropriate cues and measures of their validity. When your results are not accurate, or you do not believe that they will be, you may need to use coherence to improve your model. This may involve seeking new information or tweaking the cues. This fits nicely into the Parallel Constraint Satisfaction model. Deliberate processes are activated if the consistency of a resulting mental representation is below a threshold θ (see post Deliberate Construction in Parallel Constraint Satisfaction). I think this is interesting because it is the intuitive/automatic (so-called model-free) system requiring coherency of itself. Typically, we think of the deliberate/analytical system as being rational and coherent, while the intuitive system requires correspondence. In this situation, the intuitive/automatic system finds answers that are inconsistent, incoherent, and thus seeks input from the deliberate/analytical system.
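The threshold idea can be sketched in a few lines. This is a toy illustration with an invented consistency measure (agreement among cue directions), not Glockner and Jekel's actual model; THETA and the scoring rule are my own placeholders.

```python
THETA = 0.7  # hypothetical consistency threshold (theta in the PCS account)

def consistency(cue_signs):
    """Toy consistency score: the fraction of cues agreeing with the
    majority direction. 1.0 = perfectly coherent, 0.5 = maximally conflicted."""
    pos = sum(1 for s in cue_signs if s > 0)
    return max(pos, len(cue_signs) - pos) / len(cue_signs)

def needs_deliberation(cue_signs):
    """Deliberate processing is triggered when the intuitive system's
    mental representation falls below the consistency threshold."""
    return consistency(cue_signs) < THETA
```

So three cues pointing one way and one the other (consistency 0.75) stays automatic, while an even split (consistency 0.5) hands the problem to the deliberate system for more information.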