Using Aperiodic Reinforcement for Directed
Self-Organization During Development
PR Montague P Dayan SJ Nowlan A Pouget TJ Sejnowski
CNL, The Salk Institute
10010 North Torrey Pines Rd.
La Jolla, CA 92037, USA
We present a local learning rule in which Hebbian learning is
conditional on an incorrect prediction of a reinforcement signal.
We propose a biological interpretation of such a framework and
display its utility through examples in which the reinforcement signal is cast as the delivery of a neuromodulator to its target.
Three examples are presented which illustrate how this framework can be applied to the development of the oculomotor system.
1 INTRODUCTION

Activity-dependent accounts of the self-organization of the vertebrate brain have relied ubiquitously on correlational (mainly Hebbian) rules to drive synaptic learning. In the brain, a major problem for any such unsupervised rule is that many different kinds of correlations exist at approximately the same time scales and each is effectively noise to the next. For example, relationships within and between the retinae among variables such as color, motion, and topography may mask one another and disrupt their appropriate segregation at the level of the thalamus or cortex.
It is known, however, that many of these variables can be segregated both within and between cortical areas, suggesting that certain sets of correlated inputs are somehow separated from the temporal noise of other inputs. Some form of supervised learning appears to be required. Unfortunately, detailed supervision and selection in a brain region is not a feasible mechanism for the vertebrate brain. The question thus arises: What kind of biological mechanism or signal could selectively bias synaptic learning toward a particular subset of correlations? One answer lies in the possible role played by diffuse neuromodulatory systems.
It is known that multiple diffuse modulatory systems are involved in the self-organization of cortical structures (e.g. Bear and Singer, 1986) and some of them appear to deliver reward and/or salience signals to the cortex and other structures to influence learning in the adult. Recent data (Ljungberg et al., 1992) suggest that this latter influence is qualitatively similar to that predicted by Sutton and Barto's (1981, 1987) classical conditioning theory. These systems innervate large expanses of cortical and subcortical turf through extensive axonal projections that originate in midbrain and basal forebrain nuclei and deliver such compounds as dopamine, serotonin, norepinephrine, and acetylcholine to their targets. The small number of neurons comprising these subcortical nuclei, relative to the extent of the territory their axons innervate, suggests that the nuclei report scalar signals to their target structures.
In this paper, these facts are synthesized into a single framework which relates the development of brain structures and conditioning in adult brains. We postulate a modification to Hebbian accounts of self-organization: Hebbian learning is conditional on an incorrect prediction of future delivered reinforcement from a diffuse neuromodulatory system. This reinforcement signal can be derived both from externally driven contingencies, such as proprioception from eye movements, as well as from internal pathways leading from cortical areas to subcortical nuclei.
The next section presents our framework and proposes a specific model for how predictions about future reinforcement could be made in the vertebrate brain utilizing the firing in a diffuse neuromodulatory system (figure 1). Using this model we illustrate the framework with three examples suggesting how mappings in the oculomotor system may develop. The first example shows how eye movement commands could become appropriately calibrated in the absence of visual experience (figure 3). The second example demonstrates the development of a mapping from a selected visual target to an eye movement which acquires the target. The third example describes how our framework could permit the development and alignment of multimodal maps (visual and auditory) in the superior colliculus. In this example, the transformation of auditory signals from head-centered to eye-centered coordinates results implicitly from the development of the mapping from parietal cortex onto the colliculus.
Δw_t = α x_t y_t r_t

where, all at time t, w_t is a connection weight, x_t an input measure, y_t an output measure, r_t a reinforcement measure, and α is the learning rate.
In this case, r can be driven either by external events in the world or by cortical projections (internal events), and it picks out those correlations between x and y about which the system learns. Learning is shut down if nothing occurs that is independently judged to be significant, i.e. events for which r is 0.
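A minimal sketch of this gated Hebbian rule (function and variable names are illustrative, not the paper's): the scalar reinforcement r simply multiplies the usual correlational term, so learning vanishes whenever r is 0.

```python
import numpy as np

def gated_hebb(w, x, y, r, alpha=0.1):
    """One step of reinforcement-gated Hebbian learning:
    delta_w = alpha * x * y * r.
    With r = 0, no learning occurs regardless of how strongly
    the pre- and postsynaptic activities x and y are correlated."""
    return w + alpha * np.outer(y, x) * r
```

Here w is the weight matrix from inputs x to outputs y, and r is the scalar reinforcement measure from the text.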
3 MAKING PREDICTIONS IN THE BRAIN

In our account of RL in the brain, the cortex is the structure that makes predictions of future reinforcement. This reinforcement is envisioned as the output of subcortical nuclei which deliver various neuromodulators to the cortex that permit Hebbian learning. Experiments have shown that several of these nuclei, which have access to cortical representations of complex sensory input, are necessary for instrumental and classical conditioning to occur (Ljungberg et al., 1992).
Figure 1 shows one TD scenario in which a pattern of activity in a region of cortex makes a prediction about future expected reinforcement. At time t, the prediction of future reward V_t is viewed as an excitatory drive from the cortex onto one or more subcortical nuclei (pathway B). The high degree of convergence in B ensures that this drive predicts only a scalar output of the nucleus R. Consider a pattern of activity in layer II which provides excitatory drive to R and concomitantly causes some output, say a movement, at time t + 1. This movement provides a separate source of excitatory drive r_{t+1} to the same nucleus through independent
Figure 1: Making predictions about future reinforcement. Layer I is an array of units that projects topographically onto layer II. (A) Weights from I onto II develop according to equation 3 and represent the value function V_t. (B) The weights from II onto R are fixed. The prediction of future reward by the weights onto II is a scalar because the highly convergent excitatory drive from II to the reinforcement nucleus (R) effectively sums the input. (C) External events in the world provide independent excitatory drive to the reinforcement nucleus. (D) Scalar signal which results from the output firing of R and is broadcast throughout layer II. This activity delivers to layer II the neuromodulator required for Hebbian learning.
The output firing of R is controlled by temporal changes in its excitatory input and habituates to constant or slowly varying input. This makes for learning in layer II according to equation 3 (see text).
connections conveying information from sensory structures such as stretch receptors (pathway C). Hence, at time t + 1, the excitatory input to R is the sum of the 'immediate reward' r_{t+1} and the new prediction of future reward V_{t+1}. If the reinforcement nucleus is driven primarily by changes in its input over some time window, then its output reflects the difference between the excitatory drive at times t + 1 and t, i.e. [(r_{t+1} + V_{t+1}) - V_t].
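Under the stated assumption of a nucleus driven by temporal change in its total excitatory input, its scalar output can be sketched directly (a hypothetical helper, not the paper's code):

```python
def nucleus_output(V_t, r_next, V_next):
    """Scalar output of a change-sensitive reinforcement nucleus.
    The total drive at time t+1 is r_{t+1} + V_{t+1}; the drive at
    time t was V_t. A nucleus that habituates to constant input
    reports only the difference between the two."""
    return (r_next + V_next) - V_t
```

An unpredicted reward (V_t = 0) produces a large positive output, while a fully predicted one produces none.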
The output is distributed throughout a region of cortex (pathway D) and permits Hebbian weight changes at the individual connections which determine the value function Vt. The example hinges on two assumptions: 1) Hebbian learning in the cortex is contingent upon delivery of the neuromodulator, and 2) the reinforcement nucleus is sensitive to temporal changes in its input and otherwise habituates to constant or slowly varying input.
Initially, before the system is capable of predicting future delivery of reinforcement correctly, the arrival of r_{t+1} causes a large learning signal because the prediction error [(r_{t+1} + V_{t+1}) - V_t] is large. This error drives weight changes at synaptic connections with correlated pre- and postsynaptic elements until the predictions come to approximate the actual future delivered reinforcement. Once these predictions become accurate, learning abates. At that point, the system has learned about whatever contingencies are currently controlling reinforcement delivery. For the case in which the delivery of reinforcement is not controlled by any predictable contingencies, Hebbian learning can still occur if the fluctuations of the prediction error have a positive mean.
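A toy illustration of this abatement (not the paper's simulation): a single scalar prediction V is trained against a constant reward of 1 delivered at a terminal step, so V_{t+1} = 0. The prediction error starts large and decays toward zero, at which point learning stops.

```python
alpha = 0.5
V = 0.0                      # prediction of future reward
errors = []
for _ in range(20):
    # terminal step: reward r_{t+1} = 1 arrives and V_{t+1} = 0
    delta = (1.0 + 0.0) - V  # prediction error (r_{t+1} + V_{t+1}) - V_t
    V += alpha * delta       # move the prediction toward the target
    errors.append(delta)
```

The first error is 1.0; each pass halves it, so the signal that gates Hebbian learning effectively disappears once the contingency is learned.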
Figure 2: Upper layer is a 64 by 64 input array with 3 by 3 center-surround filters at each position which projects topographically onto the middle layer. The middle layer projects randomly to four 4×4 motoneuron layers which code for an equilibrium eye position signal, for example, through setting equilibrium muscle tensions in the 4 muscles. Reinforcement signals originate from either eye movement (muscle 'stretch') or foveation. The eye is moved according to h = (r - l)g, v = (u - d)g, where r, l, u, d are respectively the average activities on the right, left, up, down motoneuron layers and g is a fixed gain parameter. h and v are linearly combined to give the eye position.
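The caption's eye-movement rule can be sketched as follows (layer activities and the gain are placeholders, not values from the paper):

```python
import numpy as np

def eye_position(right, left, up, down, g=1.0):
    """Equilibrium eye position from four motoneuron layers:
    h = (r - l) * g and v = (u - d) * g, where r, l, u, d are the
    average activities of the right, left, up, down layers."""
    h = (np.mean(right) - np.mean(left)) * g
    v = (np.mean(up) - np.mean(down)) * g
    return h, v
```

When antagonistic layers are balanced the net drive is zero and the eye does not move, which is the condition the calibration example below converges to.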
In the presence of multiple statistically independent sources of control of the reinforcement signal (pathways onto R), the system can separately 'learn away' the contingencies for each of these sources. This passage of control of reinforcement delivery can allow the development of connections in a region to be staged. Hence, control of reinforcement can be passed between contingencies without supervision. In this manner, a few nuclei can be used to deliver information globally about many different circumstances. We illustrate this point below with development of a sensorimotor mapping.
4.1 Learning to calibrate without sensory experience

Figure 2 illustrates the architecture for the next two examples. Briefly, cortical layers drive four 'motor' layers of units which each provide an equilibrium command to one of four extraocular muscles. The mapping from the cortical layers onto these four layers is random and sparse (15%-35% connectivity) and is plastic according to the learning rule described above. Two external events control the delivery of reinforcement: eye movement and foveation of high contrast objects in the visual input. The minimum eye movement necessary to cause a reinforcement is a change of two pixels in any direction (see figure 3).
Figure 3: Learning to calibrate eye movement commands. This example illustrates how a reinforcement signal could help to organize an appropriate balance in the sensorimotor mapping before visual experience. The dark bounding box represents the 64×64 pixel working area over which an 8×8 fovea can move. A Foveal position during the first 400 cycles of learning. The architecture is as in figure 2, but the weights onto the right/left and up/down pairs are not balanced. Random activity in the layer providing the drive to the motoneurons initially drives the eye to an extreme position at the upper right. From this position, no movement of the eye can occur and thus no reinforcement can be delivered from the proprioceptive feedback, causing all the weights to begin to decrease. With time, the weights onto the motoneurons become balanced and the eye moves. B Foveal position after 400 cycles of learning and after increasing the gain g to 10 times its initial value. After the weights onto antagonistic muscles become balanced, the net excursions of the eye are small, thus requiring an increase in g in order to allow the eye to explore its working range. C Size of foveal region relative to the working range of the eye. The fovea covered an 8×8 region of the working area of the eye and the learning rate α was varied from 0.08 to 0.25 without changing the result.

We begin by demonstrating how an unbalanced mapping onto the motoneuron layers can be automatically calibrated in the absence of visual experience. Imagine that the weights onto the right/left and up/down pairs are initially unbalanced, as might happen if one or more muscles are weak or the effective drives to each muscle are unequal. Figure 3, which shows the position of the fovea during learning, indicates that the initially unbalanced weights cause the eye to move immediately to an extreme position (figure 3, A).
Since the reinforcement is controlled only by eye movement and foveation, and neither is occurring in this state, r_{t+1} is roughly 0. This is despite the (randomly generated) activity in the motoneurons continually making predictions that reinforcement from eye movement should be delivered. Therefore all the weights begin to decrease, with those mediating the unbalanced condition decreasing fastest, until balance is achieved (see path A). Once the eye reaches equilibrium, further random noise causes no mean net eye movement, since the mappings onto each of the four motoneuron layers are balanced. The larger-amplitude eye movements shown in the center of figure 3 (labeled B) result from increasing the gain g (figure 2).
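A toy sketch of this decay dynamic (my own simplification under stated assumptions, not the paper's simulation): reinforcement is absent while the predictions expect it, so the prediction error is negative and the gated Hebbian rule scales every weight down, the larger (and hence more active) drive fastest in absolute terms, shrinking the imbalance in net drive.

```python
import numpy as np

alpha, error = 0.1, -0.5   # negative prediction error: expected reward never arrives
w = np.array([2.0, 1.0])   # e.g. right vs. left drive, initially unbalanced
gap0 = w[0] - w[1]         # initial imbalance in net drive
for _ in range(50):
    y = w * 1.0            # motoneuron output under tonic input x = 1
    # gated Hebbian update dw = alpha * x * y * error, weights kept non-negative
    w = np.clip(w + alpha * 1.0 * y * error, 0.0, None)
gap = w[0] - w[1]
```

Both weights shrink, but the difference between them, and with it the net drive (r - l)g pinning the eye at an extreme, shrinks toward zero.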
Figure 4: Development of foveation map. The map after 2000 learning cycles shows the approximate eye movement vector from stimulation of each position in the visual field. Lengths were normalized to the size of the largest movement. The undisplayed quadrants were qualitatively similar. Note that this scheme does not account for activity or contrast differences in the input and assumes that these have already been normalized. Learning rate = 0.12. Connectivity from the middle layer to the motoneurons was 35% and was randomized. Unlike the previous example, the weights onto the four layers of motoneurons were initially balanced.