4
Value, Euphoria, and the Do-It-Again Signal

Pleasure and value are different things. One can have pleasure without value and value without pleasure. They are signaled by different chemicals in the brain. In particular, the brain contains a signal for “better than expected,” which can be used to learn to predict value. The lack of expected reward and the lack of expected punishment are separate processes (disappointment, relief) that require additional neural mechanisms.

Euphoria and dysphoria

It is commonly assumed that pleasure is our interpretation of the things we have to do again. “People seek pleasure and avoid pain.” While this is often true, a better description is that people seek things that they recognize will have high value. As we will see below, pleasure is dissociable from value.

Although the saying is “pleasure and pain,” pain is not the opposite of pleasure. Pain is actually a sensory system, like any of the other sensory systems (visual, auditory, etc.).1 Pain measures damage to tissues and things that are likely to damage tissues. It includes specific receptors that project up to mid-brain sensory structures that project to cortical interpretation areas. The sensation of pain depends on cortical activation in response to the pain sensors, but this is true of the other sensory systems as well. The retina in your eyes detects photons, but you don’t see photons—you see the objects that are interpreted from photons. Pleasure, in contrast, is not a sensation, but an evaluation of a sensation.

Euphoria, from the Greek word φορία (phoria, meaning “bearing” or “feeling”) and the Greek root εὐ- (eu-, meaning “good”), is probably a better word than “pleasure” for the brain signal we are discussing. And, of course, euphoria has a clear antonym in dysphoria, meaning “discomfort,” deriving from the Greek root δυσ- (dys-, bad). The terms euphoria and dysphoria represent calculations in the brain of how good or bad something is.

Differences between euphoria and reinforcement

Watching animals move toward a reward, early psychologists were uncomfortable attributing emotions to the animal, uncomfortable saying that it “wanted” the reward or that it felt “pleasure” at getting it. Instead, they defined whatever made the animal more likely to approach the reward as reinforcement (because it reinforced the animal’s actions).

Experiments in the 1950s began to identify a difference between euphoria and reinforcement in humans as well. It was found that an electrode placed into a certain area of the brain (called the medial forebrain bundle) would, when stimulated, produce near-perfect reinforcement.2 Animals with these stimulating electrodes would forgo food, sex, and sleep to continue pressing levers to produce this brain stimulation.

From this, the medial forebrain bundle became popularly known as “the pleasure center.”3 But even in these early experiments, the difference between euphoria and reinforcement was becoming clear. Robert Heath and his colleagues implanted several stimulating electrodes into the brain of a patient (for the treatment of epilepsy), and then offered the patient different buttons to stimulate each of the electrodes. On stimulation from pressing one button, the patient reported orgasmic euphoria. On stimulation from the second button, the patient reported mild discomfort. However, the patient continued pressing the second button over and over again, much more than the first. Euphoria and reinforcement are different.4

It turns out that the key to the reinforcement from medial forebrain bundle stimulation is a brain neurotransmitter called dopamine.5 Dopamine is chemically constructed out of precursorsA by neurons in a pair of small areas deep in the base of the midbrain called the ventral tegmental area and the substantia nigra.B

An early theory of dopamine was that it was the representation of pleasure in the brain.10 Stimulating the medial forebrain bundle makes dopamine cells release their dopamine.11 (Remember that these stimulating electrodes were [incorrectly] hypothesized to be stimulating the “pleasure center.”) In animal studies, blocking dopamine blocked reinforcement.12 Although one could not definitively show that animals felt “pleasure,” the assumption was that reinforcement in animals translated to pleasure in humans. Most drugs of abuse produce a release of dopamine in the brain.13 And, of course, most drugs of abuse produce euphoria, at least on early use.C Since some of the largest pharmacological effects of drugs of abuse are on dopamine, it was assumed that dopamine was the “pleasure” signal. This theory turned out to be wrong.

In a remarkable set of experiments, Kent Berridge at the University of Michigan set out to test this theory that dopamine signaled pleasure.15 He first developed a way of measuring euphoria and dysphoria in animals—he watched their facial expressions. The idea that animals and humans share facial expressions was first proposed by Charles Darwin in his book The Expression of the Emotions in Man and Animals. Berridge and his colleagues used cameras tightly focused on the faces of animals, particularly rats, to identify that sweet tastes (such as sugar or saccharin) were accompanied by a licking of the lips, while bitter tastes (such as quinine) were accompanied by a projection of the tongue, matching the classic “yuck” face all parents have seen in their kids being forced to eat vegetables.

What is interesting is that Berridge and his colleagues (particularly Terry Robinson, also at the University of Michigan) were able to manipulate these expressions, but not with manipulations of dopamine. Manipulations of dopamine changed whether an animal would work for, learn to look for, or approach a reward, but if the reward was placed in the animal’s mouth or even right in front of the animal, it would eat the reward and show the same facial expressions. What did change the facial expressions were manipulations of the animal’s opioid system. The opioid system is another set of neurotransmitters in the brain, which we will see are very important to decision-making. They are mimicked by opiates—opium, morphine, heroin. Berridge and Robinson suggested that there is a distinction between “wanting something” and “liking it.” Dopamine affects the wanting, while opioids affect the liking.

Opioids

The endogenous opioid system includes three types of opioid signals and receptors in the nucleus accumbens, hypothalamus, amygdala, and other related areas.16 Neuroscientists have labeled them by three Greek letters (mu [μ], kappa [κ], and delta [δ]). Each of these receptors has a paired endogenous chemical (called a ligand) that attaches to it (endorphins associated with the μ-opioid receptors, dynorphin associated with the κ-opioid receptors, and enkephalins associated with the δ-opioid receptors). Current work suggests that activation of the μ-opioid receptors signals euphoria, and activation of the κ-opioid receptors signals dysphoria. The function of the δ-opioid receptors is less clear, but they seem to be involved in analgesia (the suppression of pain), the relief of anxiety, and the induction of craving. Chemicals that stimulate μ-opioid receptors (called agonists) are generally rewarding, reinforcing, and euphorigenic. The classic μ-opioid receptor agonists are the opiates of the opium poppy, particularly after they are refined into morphine or heroin. μ-opioid receptor agonists are reliably self-administered when given the opportunity, whether by humans or other animals. In contrast, chemicals that block μ-opioid receptors (called antagonists) are aversive and dysphoric and interfere with self-administration. On the opposite side, κ-opioid receptor agonists are also aversive and dysphoric.

When Kent Berridge and his colleagues manipulated the opioid system in their animals, they found that they could change the animals’ facial responses to positive (sweet) and negative (bitter) things. Stimulating κ-opioid receptors when animals were fed the sweet solution led to the yuck face, while stimulating μ-opioid receptors when animals were fed the bitter solution led to licking the lips. This hypothesis is further supported by pharmacological manipulations in humans, who can report their experiences directly. Naltrexone, a general opioid antagonist, has been shown to block the feeling of pleasure on eating sugar—subjects say that they can taste the sweetness, but they don’t “care” about it.17

If pleasure or euphoria in the brain really is activation of μ-opioid receptors, then this is a great example of the importance of recognizing our physical nature. Euphoria is not a mental construct divorced from the physical neural system—it is instantiated as a physical effect in our brains. That means that a chemical that accesses that physical effect (in this case the μ-opioid receptors) can create euphoria directly. This also brings up a concept we will return to again and again in this book—the idea of a vulnerability or failure mode. We did not evolve μ-opioid receptors to take heroin; we evolved μ-opioid receptors so that we could recognize things in our lives that have value and thus give us pleasure. But heroin stimulates the μ-opioid receptors directly and produces feelings of euphoria. Heroin accesses a potential failure mode of our brains—it tricks us into feeling euphoria when we shouldn’t. At least, it does so until the body gets used to it and dials down the μ-opioid receptors, leading to the user needing to take more and more heroin to achieve the same euphoria, until eventually even massive doses of heroin don’t produce euphoria anymore; they just relieve the persistent dysphoria left behind.

Reinforcement

So if euphoria and dysphoria are the opioid receptors, then what is reinforcement, and what is its negative twin, aversion? According to the original scientific definitions first put forward by the animal learning theorists (for example, by John Watson, Ivan Pavlov, and Edward Thorndike in the first decades of the 20th century, and Clark Hull and B. F. Skinner in the 1940s and 1950s), reinforcement is the process by which a reward (such as food to a hungry animal) increases responding, while aversion is the process by which a penalty decreases responding.18 (Notice that this definition is operational—stated in terms of directly observable behavioral phenomena. We will break open the decision-making mechanism in the next section and move beyond these simple behavioral limitations. For now, let us be content with observing behavior.)

When phrased in terms of reinforcement, the question becomes When does a reward reinforce behavior? In the 1960s, Leon Kamin showed that a stimulus reinforced a behavior only if another stimulus had not already reinforced it.19 He called this blocking because the first stimulus “blocked” the second. In its simplest form, if you trained an animal to push a lever to get a reward in response to a tone and then tried to train the animal to respond to both the tone and a light presented together, the light would never become associated with the reward. You can test this by asking whether the animal presses the lever in response to the light alone. It won’t. In contrast, if you had started training with both the tone and the light at the same time (that is, if you had not pretrained the animal with the tone), then it would have learned to respond to both the tone and the light when they were presented together. Learning occurs only when something has changed.

In 1972, Robert Rescorla and Allan Wagner proposed a mathematical theory based on the idea that reinforcement was about surprise and predictability.20 They proposed that an animal learned to predict that a reward was coming, and that it learned based on the difference between the expectation and the observation. The light was blocked by the tone because the tone already completely predicted the reward. This theory is supported by more complicated versions of Kamin’s blocking task, where the reward is changed during the tone + light training.21 If the delivered reward or punishment is larger after tone + light than after just the tone, then the animal will learn to associate the changed reward with the light. If the delivered reward is smaller after tone + light than after just the tone, then the animal will learn that the light is a “negative” predictor. Testing this gets complicated—one needs a third cue (say a different sound): train the animal to respond to the third cue, then present the third cue with the light, and the animal responds less, because the light predicts that it will receive less reward than the third cue alone would predict. Learning can also reappear if other things change, such as the actual reward given (apples instead of raisins), even when one controls for the value of the reward (by ensuring that the animal would work at similar levels for both rewards).
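For readers who like to see the arithmetic, here is a minimal sketch of the Rescorla-Wagner idea in Python. The cue names, learning rate, and trial counts are illustrative choices of mine, not values from the original paper; the point is only that when the prediction (the summed strengths of the cues present) already matches the reward, the error is zero and the light learns nothing.

```python
# A minimal sketch of the Rescorla-Wagner learning rule (illustrative
# parameters; not code from the 1972 paper).  The prediction on a trial is
# the summed associative strength of the cues that are present, and each
# present cue is nudged by the prediction error (reward minus prediction).

ALPHA = 0.2    # learning rate
REWARD = 1.0   # size of the reward on rewarded trials


def train(strengths, present_cues, reward, trials):
    """Run Rescorla-Wagner updates for a block of identical trials."""
    for _ in range(trials):
        prediction = sum(strengths[cue] for cue in present_cues)
        error = reward - prediction            # "surprise"
        for cue in present_cues:
            strengths[cue] += ALPHA * error
    return strengths


# Kamin blocking: pretrain the tone alone, then train tone + light together.
blocked = {"tone": 0.0, "light": 0.0}
train(blocked, ["tone"], REWARD, 100)            # tone comes to predict the reward
train(blocked, ["tone", "light"], REWARD, 100)   # nothing is surprising anymore
print(blocked)   # light stays near zero: it has been "blocked"

# Control: train tone + light together from the start.
control = {"tone": 0.0, "light": 0.0}
train(control, ["tone", "light"], REWARD, 100)
print(control)   # tone and light share the prediction (about 0.5 each)
```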

In the 1980s, two computer scientists, Richard Sutton and Andy Barto, laid out a new mathematical theory called temporal difference reinforcement learning, which built in part on the work of Rescorla and Wagner, but also on earlier work in control theory and “operations research”D from the 1950s, particularly that of Richard Bellman.22 Bellman originally showed that one can learn an optimal strategy by measuring the difference between the observed value and the expected value and learning to take actions to reduce that difference. Sutton and Barto recognized the similarity between Bellman’s operations research work and the Rescorla-Wagner psychology model. They derived an algorithm that was mathematically powerful enough that it could be used to actually train robots and other control systems.

There were two major differences between the Sutton and Barto equations and the Rescorla and Wagner theory. First, Rescorla and Wagner assumed that predictions about reward were derived from single, individual cues. In contrast, decisions in the Sutton and Barto theory were based on categorizations of the situation of the world (which they called the “state” of the world).23;E

Second, while Rescorla and Wagner’s theory learned to predict reward, Sutton and Barto’s learning system learned to predict value.24 Each potential situation was associated with a value. Actions of the agent (the computer program or the animal) and events in the world could change the situation. Once the expected value of each situation was known, the agent could decide the best thing to do (take the action leading to the situation with the highest expected value). The key was that Sutton and Barto showed that you could learn the value function (what the value of each situation was) at the same time that you were taking actions.F

Sutton and Barto called this the actor-critic architecture, because it required two components—an actor to take actions based on the current estimation of the value function and a critic to compare the current estimation of the value function with the observed outcomes. The critic calculated the difference between the value you actually got and the value you expected to get, which Sutton and Barto called value-prediction error and labeled as delta (δ), from the analogy to the common physics and mathematical use of the term Δ (delta) meaning “difference.”G
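A bare-bones version of that two-part architecture might look like the following Python sketch (my own illustrative simplification, not Sutton and Barto’s code): the critic computes δ as the difference between the value observed and the value expected, and that same δ both corrects the critic’s value estimates and adjusts the actor’s preferences.

```python
# A toy actor-critic sketch (illustrative names and parameters, assumed for
# this example rather than taken from Sutton & Barto).  The critic keeps a
# table of values for situations ("states"); the actor keeps preferences for
# actions in each situation.  Both learn from the same value-prediction error.

GAMMA = 0.9          # how much future value counts relative to immediate reward
ALPHA_CRITIC = 0.1
ALPHA_ACTOR = 0.1

values = {}       # critic: estimated value of each situation
preferences = {}  # actor: tendency to take an action in a situation


def step(state, action, reward, next_state):
    """Observe one transition and learn from it."""
    expected = values.get(state, 0.0)
    observed = reward + GAMMA * values.get(next_state, 0.0)
    delta = observed - expected                       # value-prediction error

    values[state] = expected + ALPHA_CRITIC * delta   # critic updates its estimate
    key = (state, action)
    preferences[key] = preferences.get(key, 0.0) + ALPHA_ACTOR * delta
    return delta


# Example: pressing the lever in the "cue" situation reliably leads to reward.
for _ in range(20):
    step("cue", "press lever", reward=1.0, next_state="after reward")
print(values["cue"], preferences[("cue", "press lever")])
```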

Value-prediction error

The concept of value-prediction error (δ) is easiest to understand by an example. Imagine that there is a soda machine in the hallway outside your office, and the soda is supposed to cost $1. You put your dollar in the machine, and you get two sodas out instead of one. Whatever the reason, this result was better than expected (δ, the difference between what you observed and what you expected, is positive) and you are likely to increase your willingness to put money into this machine. If, in contrast, you put your dollar in and get nothing out, then δ is negative and you are likely to decrease your willingness to put money into this machine. If, as expected, you put your dollar in and you get your soda out, your observations match your expectations, δ is zero, and you don’t need to learn anything about this soda machine. Notice that you still get the nutrients (such as they are) from the soda. You still get the pleasure of drinking the soda. You still know what actions to take to get a soda from the soda machine. But you don’t need to learn anything more about the machine.
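In these terms the soda machine is just a subtraction. Here is the same example as a few lines of Python, with numbers in arbitrary units chosen only for illustration:

```python
# The soda-machine example as a value-prediction error calculation
# (arbitrary units, chosen only for illustration).
expected = 1.0   # you expect one soda's worth of value for your dollar

outcomes = {"two sodas": 2.0, "nothing": 0.0, "one soda": 1.0}
for outcome, observed in outcomes.items():
    delta = observed - expected
    print(f"{outcome}: delta = {delta:+.1f}")
# two sodas: delta = +1.0  (better than expected: use this machine more)
# nothing:   delta = -1.0  (worse than expected: use this machine less)
# one soda:  delta = +0.0  (as expected: nothing new to learn)
```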

In what I still feel is one of the most remarkable discoveries in neuroscience, in the early 1990s, Wolfram Schultz and his colleagues found that the transient bursts of firing in dopamine neurons increase with unexpected increases in value, decrease with unexpected decreases in value, and generally track the prediction-error signals proposed by Rescorla and Wagner and by Sutton and Barto.25 At the time (the late 1980s and early 1990s), dopamine was primarily known as a reward or pleasure signal (remember, this is wrong!) in the animal behavior literature and as a motor-enabling signal in the Parkinson’s disease literature.H

Schultz was looking at dopamine in animals learning to make associations of stimuli to reward under the expectation that he would then go on to look at its role in Parkinson’s disease. The first step was to examine the firing of dopamine cells in response to a reward. In these first experiments, Schultz and his colleagues found that dopamine neurons would respond whenever the thirsty animal received a juice reward. But the reward had been delivered at random times; when Schultz and his colleagues provided a cue that predicted the reward, to their surprise, the dopamine neurons no longer responded to the reward itself. Instead, the cells responded to the cue that predicted the reward.36

These experimental results were unified with the theoretical work by Read Montague, Peter Dayan, and Terry Sejnowski, who showed that the unexpected-reward neurons reported by Wolfram Schultz and his colleagues were exactly the “delta” δ signal needed by the Sutton and Barto temporal difference learning algorithm. Over three years (1995, 1996, and 1997), they published three papers suggesting that the monkey dopamine cell data matched the value-prediction error signals needed for the temporal difference reinforcement learning theory.37

To really understand how dopamine tracks the delta signal, one can ask Why do the dopamine neurons increase their responding to the cue? Before the cue, the animal has no immediate expectation of reward. (For all it knows, the day’s session is about to end.) It is in a low-value situation. When the cue is shown to the animal, it realizes that it is going to get a reward soon, so it is now in a higher-value situation. The change from low-value “waiting” to high-value “going to get reward” is an unexpected change in value and produces a delta signal of “better than expected.” If the animal has learned that an instruction stimulus predicts the cue, saying that a cue will be arriving soon, then the dopamine signal will appear at the earlier instruction signal rather than at the cue (because the instruction signal is now the unexpected increase in value). If the cue then doesn’t arrive, you get a decrease in dopamine because the cue was expected but not delivered and you have an unexpected decrease in value.
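The migration of the dopamine burst from the reward to the cue falls directly out of the temporal difference equations. The sketch below is my own toy simulation, with made-up states and parameters, in the spirit of the Montague, Dayan, and Sejnowski account rather than their actual model; it treats the pre-cue waiting period as a fixed low-value state (because the cue arrives at unpredictable times), so on early trials the error appears at the reward, and after learning it appears at the cue instead.

```python
# Toy temporal difference simulation of a cue -> delay -> reward trial
# (made-up states and parameters; a sketch of the idea, not the published model).
# The pre-cue "waiting" state is held at value 0 because the cue arrives at
# unpredictable times, so the error at cue onset is simply V(cue) - 0.

GAMMA = 1.0
ALPHA = 0.1
STATES = ["cue", "delay1", "delay2", "reward"]
V = {s: 0.0 for s in STATES}


def run_trial():
    """Run one trial, returning the prediction errors at cue onset and at reward."""
    errors = {"at cue": GAMMA * V["cue"] - 0.0}       # jump out of the waiting state
    for i, state in enumerate(STATES):
        reward = 1.0 if state == "reward" else 0.0
        next_value = V[STATES[i + 1]] if i + 1 < len(STATES) else 0.0
        delta = reward + GAMMA * next_value - V[state]
        V[state] += ALPHA * delta
        if state == "reward":
            errors["at reward"] = delta
    return errors


first = run_trial()
for _ in range(500):
    last = run_trial()

print("first trial:   ", first)   # big error at the reward, none at the cue
print("after learning:", last)    # error has moved to the cue; the reward is predicted
```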

Subsequent experiments have found that the signal does track the δ value-prediction error signal remarkably well. If the reward is larger than the cue predicted, dopamine neurons increase their responding again, while if the reward is not delivered when expected, the dopamine neurons decrease their responding. Dopamine neurons have a small baseline firing rate of a few spikes per second. Positive responses entail a burst of spikes, while negative responses entail a cessation of spikes.38 This allows dopamine neurons to signal both “better-than-expected” and “worse-than-expected.” Dopamine even tracks the value-prediction error signals in Kamin blocking paradigms.39 These results have since been replicated in rats, monkeys, and humans.40

Reinforcement and aversion, disappointment and relief

In the computer models, penalties are generally written as negative rewards and aversion is written as negative reinforcement. However, the examples of “worse-than-expected” used in the experimental data described above are not actually punishments, nor do they produce actual aversion. Rather, they are examples of disappointment—a lack of delivery of an expected reward.41 When researchers have examined actual aversive events, the dopamine response to those events has turned out to be very complicated.42;I

As we will see below, aversion and disappointment are actually different phenomena and cannot work by a shared mechanism. Some researchers have suggested that there is another neurotransmitter signal analogous to dopamine that can serve as the aversion signal (in parallel to the reinforcement signal of dopamine),47 but no one has (as yet) found it.J Responses to punishments (like retreating from painful stimuli) entail a different neural circuitry than responses to positive rewards, in part because the responses to punishment are an evolutionarily older circuit.54 Learning not to do something (aversion) uses a different set of neural systems than learning to do something (reinforcement).55


Figure 4.1 Providing a positive thing (a reward, leading to euphoria) produces reinforcement, while the lack of delivery of an expected reward leads to disappointment. Similarly, providing a negative thing (a punishment, leading to dysphoria) produces aversion, while the lack of delivery of an expected punishment leads to relief.

From the very beginning of animal learning experiments, it has been known that an animal’s recognition that a cue no longer predicts reward does not mean that the animal has forgotten the association.56 This goes all the way back to Pavlov and his famous dogs learning to associate bells with food. Once the dogs had learned to salivate in response to the bell (in anticipation of the food), when the food was not presented to them, they showed frustration and anger, and they searched for the food. Eventually, however, they recognized that the bell no longer implied food and learned to ignore the bell. If, however, they were given food again just once after the bell was rung, they immediately started salivating the next time the bell rang. Because learning was much faster afterward than before, Pavlov knew that the dogs had not forgotten the original association.

This means that there must be a second process. In the literature, this process is usually called extinction because the lack of delivered reward has “extinguished” the response.K In fact, the exact same effects occur with aversion. If an animal is trained to associate a cue with a negative event (say an electric shock), and this association is extinguished (by providing the cue without the shock), then it can be reminded of the association (one pair of cue + shock) and the response will reappear without additional training. Disappointment is not the forgetting of reinforcement and relief is not the forgetting of aversion. Because both reinforcement and aversion have corresponding secondary processes (disappointment and relief), disappointment cannot be the same process as aversion and relief cannot be the same process as reinforcement. See Figure 4.1.

Neural mechanisms of extinction

At this point, scientists are not completely sure what the neural mechanisms are that underlie the secondary process of extinction that follows from disappointment or relief. There are two prominent theories to explain this secondary process. One is that there is a specific second inhibition process, such that neurons in this secondary area inhibit neurons in the primary association area.57 It is known, for example, that synapses in the amygdala are critical to the initial learning of an aversive association (lesions of the amygdala prevent such associations, the creation of such associations changes firing patterns in the amygdala, and the creation of such associations occurs with and depends on synaptic plasticity in the amygdala). However, neurons in the central prefrontal cortex (the infralimbic cortex in the rat, the central anterior cingulate cortex in the human) start firing after extinction and project to inhibitory neurons in the amygdala, which stop the firing of the original association cells. Yadin Dudai and his colleagues recently found that in humans, these areas are active when people with a phobia of snakes show courage and override their fear to perform a task in which they pull the snake toward themselves.58 However, this theory implies that there is only one level of recursion—that one can only have an association and a stopping of that association.

The other theory starts from the assumption that you are not provided a definition of the situation you are in, but instead that you have to infer that situation from the available cues.59 Look around you: there are millions of cues. How many of them are important to your ability to learn from this book? Around me right now, I am looking at windows, trees, an empty coffee cup, squirrels chasing each other. There is the sound of birds, a chickadee, a crow. There is the rumble of cars in the distance. There is the feel of the chair I am sitting in, the heat of my laptop (which gets uncomfortably hot when I work on it too long). Which of these cues are important? Most animal experiments are done in very limited environments with few obvious cues except for the tones and lights that are to be associated with the rewards and penalties. Animals didn’t evolve to live in these empty environments.

This second theory says that before one can associate a situation with a reward or a penalty, one needs to define the set of stimuli that will identify that situation. Any situation definition will attend to some cues that are present in the environment and ignore others. We (myself and my colleagues Adam Johnson, Steve Jensen, and Zeb Kurth-Nelson) called this process “situation-recognition” and suggested that disappointment and relief produce a narrowing of the definition of the situation such that the agent would begin to pay attention to additional cues and begin to differentiate the new situation from the old one.60 When animals no longer receive rewards or penalties (that is, they are disappointed or relieved), they begin “searching for the cue” that differentiates the two situations.61 Imagine the soda machine we talked about when we first encountered the value-prediction error δ signal. If you put your money in and get nothing out, you aren’t going to forget that putting money in soda machines can get you soda; you are going to look for what’s different about this machine. Maybe there’s a light that says “out of order” or “out of stock.” Maybe the machine is off. Once you identify the difference, your definition of the soda-machine-available situation has changed, and you can go find a soda machine that works.

The second theory suggests that the additional signals from the prefrontal cortex provide additional dimensions on which to categorize the situation. It provides explanations for a number of specific phenomena seen in the extinction literature. For example, there is extensive evidence that extinction is about the recognition of a change.62 Animals show much slower extinction after a variable (probabilistic) reward-delivery contingency than after a regular (reward delivered every time) contingency. If an animal is provided with a reward only half the time after the cue, it will still learn to respond, but it will be slower to stop when the reward is no longer delivered. On the other hand, if it is always provided with a reward, then it will quickly stop responding when the reward is no longer delivered. However, this is not simply a question of counting, because, as shown in Figure 4.2, if there is a pattern, then even with only a 50/50 chance of getting a reward on any given trial, one can easily learn the pattern, and animals stop responding as soon as the pattern is disrupted.


Figure 4.2 Sequences of reward and punishment. Imagine you are a subject in an experiment faced with one of these sequences of reward delivery. (R means you get the reward and N means you don’t.) Imagine seeing the sequence from left to right. (Trace your finger across to provide yourself a sequence.) When would you stop responding? Try each line separately. You will probably find yourself trying to predict whether you get a reward or not. When the observations no longer match your predictions, that’s when you recognize a change.
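The claim that extinction is slower after probabilistic reward can be illustrated with a toy calculation of my own (not a model from the literature): ask how likely a run of unrewarded trials would be if the old schedule were still in force. When that likelihood becomes tiny, the change in the world is detectable.

```python
# Toy illustration of why extinction is slower after probabilistic reward.
# Under the old schedule, how likely is a run of k unrewarded trials?
# When that likelihood becomes tiny, the change in the world is detectable.

def chance_of_omission_run(p_reward, k):
    """Probability of k consecutive unrewarded trials under the old schedule."""
    return (1.0 - p_reward) ** k

for p_reward in (1.0, 0.5):
    for k in (1, 3, 6):
        chance = chance_of_omission_run(p_reward, k)
        print(f"p(reward) = {p_reward:.1f}: {k} misses in a row "
              f"has probability {chance:.3f}")
# With p(reward) = 1.0, even a single miss has probability 0.000: the change is obvious.
# With p(reward) = 0.5, six misses in a row still happen about 1.6% of the time
# by chance, so it takes many more trials to be confident that the world has changed.
```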

As we saw in Chapter 3, “value” is a complex thing. We have seen in this chapter how the calculation of value and the learning to predict value depend on multiple, interacting systems. There is a system measuring the pleasure of a thing (euphoria), a system learning the usefulness of repeating actions (reinforcement), and a system recognizing change (disappointment). In parallel, there are the dysphoria, aversion, and relief systems. As we will see in the next chapters, these value-learning components form the bases of multiple decision-making systems that interact to drive our actions. The fact that there are multiple systems interacting makes the detection and actual definition of value difficult.

Books and papers for further reading

• Kent C. Berridge and Terry E. Robinson (2003). Parsing reward. Trends in Neurosciences, 26, 507–513.

• P. Read Montague (2006). Why Choose This Book? New York: Penguin.

• Wolfram Schultz (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27.

• A. David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson (2007). Reconciling reinforcement learning models with behavioral extinction and renewal: Implications for addiction, relapse, and problem gambling. Psychological Review, 114, 784–805.

• Richard S. Sutton and Andy G. Barto (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.