The Prediction Machine

This post is derived from the paper "Whatever next? Predictive brains, situated agents, and the future of cognitive science," Behavioral and Brain Sciences (2013) 36:3, written by Andy Clark. I stumbled upon this paper and its commentary several weeks ago and have been trying to figure out what to do with it. That search has led me to other papers. In the next three posts, I will try to give the high points of this idea of PEM, prediction error minimization. It provides an overall background that is compatible with Parallel Constraint Satisfaction.

Clark suggests that the brain's jobs are minimizing prediction error, selectively sampling sensory data, optimizing expected precisions, and minimizing the complexity of internal models. To accomplish these tasks, the brain has evolved into a bundle of cells that support perception and action by attempting to match incoming sensory inputs with top-down expectations, or predictions. This is done using a hierarchical model that minimizes prediction error within a bidirectional cascade of cortical processing. These four jobs map onto perception, action, attention, and model selection, respectively (and dare I say judgment and decision making).

According to Clark, predictive coding itself was first developed as a data compression strategy in signal processing. Consider a basic task such as image transmission: in most images, the value of one pixel regularly predicts the value of its nearest neighbors, with differences marking important features such as the boundaries between objects. That means that the code for a rich image can be compressed (for a properly informed receiver) by encoding only the "unexpected" variation: the cases where the actual value departs from the predicted one. What needs to be transmitted is therefore just the difference (a.k.a. the "prediction error") between the actual current signal and the predicted one. Descendants of this kind of compression technique are currently used in JPEGs, in various forms of lossless audio compression, and in motion-compressed coding for video. The information that needs to be communicated "upward" under all these regimes is just the prediction error: the divergence from the expected signal.
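To make the compression idea concrete, here is a minimal Python sketch (my own toy example, not from Clark's paper) using the simplest possible predictor: each pixel is predicted to equal its left-hand neighbor, so only the first pixel and the residual errors need to be transmitted.

```python
# Minimal predictive coding as compression: the receiver knows the
# predictor ("each pixel equals its left neighbor"), so the sender
# transmits only the prediction errors (residuals).

def encode(pixels):
    """Transmit the first pixel, then only the prediction errors."""
    errors = [pixels[0]]
    for prev, cur in zip(pixels, pixels[1:]):
        errors.append(cur - prev)  # residual: actual minus predicted
    return errors

def decode(errors):
    """Reconstruct the signal by adding each error to the running prediction."""
    pixels = [errors[0]]
    for e in errors[1:]:
        pixels.append(pixels[-1] + e)
    return pixels

row = [100, 100, 101, 101, 180, 181, 181]  # an "edge" at index 4
residuals = encode(row)
# residuals == [100, 0, 1, 0, 79, 1, 0] -- mostly zeros, except at the edge
assert decode(residuals) == row
```

Where the prediction holds, the residuals are zero and compress to almost nothing; the one large residual marks exactly the kind of "unexpected" feature, an object boundary, that carries the information.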

Thus prediction error is a proxy for sensory information itself. An example is the model of predictive coding in the visual cortex. At the lowest level, there is some pattern of energetic stimulation, derived by sensory receptors from ambient light patterns produced by the current visual scene. These signals are then processed via a multilevel cascade in which each level attempts to predict the activity at the level below it via backward connections. The backward connections allow the activity at one stage of processing to return as input to the previous stage. So long as this successfully predicts the lower-level activity, all is well, and no further action needs to happen. But where there is a mismatch, "prediction error" occurs, and the ensuing (error-indicating) activity is sent to the higher level. This automatically adjusts probabilistic representations at the higher level so that top-down predictions cancel prediction errors at the lower level, yielding rapid perceptual inference. At the same time, prediction error is used to adjust the structure of the model so as to reduce any discrepancy next time around, yielding learning on a slower timescale.
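The two timescales can be sketched in a few lines of Python. This is a toy scalar model of my own (nothing here is from Clark's paper): a single weight w generates the prediction w * v, the fast loop adjusts the higher-level estimate v until the top-down prediction cancels the error (perceptual inference), and the slow loop adjusts w itself across many trials (learning).

```python
# Toy scalar predictive coding (an illustrative assumption, not Clark's model).
# The generative model predicts the input x as w * v, where v is the
# higher level's estimate of the cause and w is the model's weight.

def perceive(x, w, steps=100, rate=0.1):
    """Fast timescale: adjust the estimate v until the prediction cancels the error."""
    v = 0.0
    for _ in range(steps):
        error = x - w * v          # residual sent "upward"
        v += rate * w * error      # refine the higher-level representation
    return v

def learn(pairs, w=0.5, rate=0.05, epochs=200):
    """Slow timescale: adjust the model weight w across many (cause, input) trials."""
    for _ in range(epochs):
        for v, x in pairs:
            error = x - w * v      # same error signal, different use
            w += rate * v * error  # reshape the generative model itself
    return w

# Perception: with a fixed model w = 2, the input x = 6 is explained by v near 3.
v = perceive(6.0, 2.0)
# Learning: trials generated by a true weight of 2 pull w from 0.5 toward 2.
w = learn([(1.0, 2.0), (2.0, 4.0)])
```

The same error signal drives both loops; only the rate and the target of the update differ, which is the point of the fast-inference/slow-learning division.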

Forward (up the hierarchy) connections between levels thus carry the "residual errors" separating the predictions from the actual lower-level activity, while backward (down the hierarchy) connections, which as Clark says do most of the "heavy lifting" in these models, carry the predictions themselves. The generative model providing the "top-down" predictions is here doing much of the more traditionally "perceptual" work, with the bottom-up driving signals really providing a kind of ongoing feedback on its activity (by fitting, or failing to fit, the cascade of downward-flowing predictions). This leads to the development of neurons that exhibit a "selectivity that is not intrinsic to the area but depends on interactions across levels of a processing hierarchy". For hierarchical predictive coding, context-sensitivity is fundamental.

To see this, Clark says, we need only reflect that the neuronal responses that follow an input may be expected to change quite profoundly according to the contextualizing information provided by a current winning top-down prediction. "When a neuron or population is predicted by top-down inputs it will be much easier to drive than when it is not." This is because the best overall fit between driving signal and expectations will often be found by inferring noise in the driving signal, and thus recognizing a stimulus as, say, the letter m in the context of the word "mother", even though the same bare stimulus, presented out of context or in most other contexts, would have been a better fit with the letter n. A unit normally responsive to the letter m might, under such circumstances, be successfully driven by an n-like stimulus.
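The m/n example can be phrased in Bayesian terms. The following Python sketch uses made-up numbers (the priors and likelihoods are my illustrative assumptions, not figures from the paper): the bare stimulus slightly favors n, but the context "mother" supplies a strong top-down prior for m, and the posterior flips.

```python
# Context as a prior: the same ambiguous stimulus is read differently
# depending on the top-down prediction. All probabilities are made up
# for illustration.

def posterior_m(prior_m, like_m, like_n):
    """P(m | stimulus) by Bayes' rule, over the two hypotheses m and n."""
    prior_n = 1.0 - prior_m
    evidence = prior_m * like_m + prior_n * like_n
    return prior_m * like_m / evidence

# Bare stimulus: the likelihood slightly favors n (0.6 vs 0.4),
# and with a flat prior the percept comes out as n.
p_isolated = posterior_m(prior_m=0.5, like_m=0.4, like_n=0.6)

# Inside "mother": the winning top-down prediction gives m a 0.95 prior,
# so the same stimulus is now confidently read as m.
p_in_context = posterior_m(prior_m=0.95, like_m=0.4, like_n=0.6)

assert p_isolated < 0.5 < p_in_context
```

The discrepancy between the n-favoring likelihood and the m-favoring prior is resolved by attributing the mismatch to noise in the driving signal, which is exactly the move described above.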

In a demonstration of the power of top-down expectations, neurons in the fusiform face area (FFA) respond just as strongly to non-face stimuli under high expectation of faces as they do to face stimuli. The suggestion, in short, is that the FFA (in many ways the paradigm case of a region performing complex feature detection) might be better treated as a face-expectation region than as a face-detection region: a result that the study's authors interpret as favoring a hierarchical predictive processing model.

Attention fits into this picture as a means of variably balancing the potent interactions between top-down and bottom-up influences by factoring in their degree of uncertainty. This is achieved by turning the "volume" on the error units up or down accordingly. Attention, if this is correct, is simply one means by which certain error-unit responses are given increased weight, hence becoming more apt to drive learning. This means that the precise mix of top-down and bottom-up influence is not fixed. Instead, the weight given to sensory prediction error is varied according to how reliable (how noisy or uncertain) the signal is taken to be. Thus we are not (not quite) slaves to our expectations. Successful perception requires the brain to minimize surprisal, but the agent is still able to see surprising things, at least in conditions where the brain assigns high reliability to the driving signal.
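Precision-weighting is easy to illustrate. In this toy Python sketch (my own numbers, not Clark's), the same surprising observation moves the current estimate only a little when its assigned precision is low, and a lot when attention marks the signal as reliable.

```python
# Precision-weighted prediction error: attention raises the gain on the
# error units, so the same sensory error moves the estimate more when
# the signal is judged reliable. Numbers are illustrative only.

def update(estimate, observation, precision, rate=0.5):
    """Nudge the estimate by the precision-weighted prediction error."""
    error = observation - estimate
    return estimate + rate * precision * error

prior_estimate = 10.0
surprising_input = 20.0

# Low precision (noisy, unattended signal): expectations dominate.
low = update(prior_estimate, surprising_input, precision=0.1)   # -> 10.5
# High precision (attended, reliable signal): the input drives perception.
high = update(prior_estimate, surprising_input, precision=1.0)  # -> 15.0

assert abs(low - prior_estimate) < abs(high - prior_estimate)
```

This is why we are "not quite" slaves to our expectations: when the driving signal is assigned high precision, even a surprising input gets through.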

In place of any real distinction between perception and belief, we now get variable differences in the mixture of top-down and bottom-up influence, and differences of temporal and spatial scale in the internal models that are making the predictions. Top-level (more belief-like) models correspond to increasingly abstract conceptions of the world, and these tend to capture or depend upon regularities at larger temporal and spatial scales. Lower-level (more perception-like) models capture or depend upon the kinds of scale and detail most strongly associated with specific kinds of perceptual contact.