14
The Tradeoff Between Exploration and Exploitation

To take the correct actions within an environment, one has to know the structure of the environment and the consequences of those actions. Exploration is the process through which a novel environment becomes familiar. Exploitation is the process of using that knowledge to get the most out of a familiar environment. There is a tradeoff between these two processes: exploration can be dangerous, but not knowing if there are better options available can be costly.

An issue that often comes up in theoretical discussions of decision-making is that there is a tradeoff between learning more about the world and using what you’ve learned to make the best decisions. Imagine that you’re in a casino with a lot of different slot machines available to play. You don’t know what the payout probabilities are for any of them; to find out, you have to play them. As you play a machine, you learn about its parameters (how likely it is to pay out, and how much, given its costs). But you don’t know if one of the other machines would be a better choice. One can imagine a continuum running from pure exploring (playing to see what the payoffs of the various machines are) to pure exploiting (using the knowledge you already have to choose the best option that you know of), and ask where on it one should fall.A This is a task called the n-armed bandit task1 (as an extension of the original slot machine, known as the “one-armed bandit” because it stole all your money).

In a stable world, it is best to do some exploring early and to diminish your exploration with time as you learn about the world.2 A good system often used in the robotics and machine-learningB literature is to give yourself a chance of trying a random machine each round. This probability of choosing at random is called ε (epsilon) and the strategy is called an “epsilon strategy.” The question, then, is what should epsilon be on each round? Simple strategies with a constant, low epsilon are called “epsilon-greedy.” For example, you might try a random action one round in a hundred and use your knowledge to pick the best choice on the other ninety-nine rounds out of the hundred.C In a stable world, you should explore early and then settle down to the best choice later. In the machine-learning literature, a common strategy for a stable world is to use an exponentially decreasing chance of trying a random action. This explores early and exploits after you have learned the structure of the world you are living in, but never completely stops exploring.
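For readers who like to see such a strategy written out, here is a minimal sketch in Python of an epsilon strategy with an exponentially decreasing epsilon. The function names (`decaying_epsilon`, `play_bandit`), the decay rate, the floor of 0.01, and the win-or-nothing payouts are all assumptions chosen for illustration; they are not taken from any particular study.

```python
import math
import random

def decaying_epsilon(round_index, start=1.0, decay=0.01, floor=0.01):
    """Chance of trying a random machine on this round.

    Shrinks exponentially from `start` toward `floor`, so exploration
    fades with time but never completely stops. (Illustrative values.)
    """
    return floor + (start - floor) * math.exp(-decay * round_index)

def play_bandit(payout_probs, rounds=5000, seed=0):
    """Play an n-armed bandit whose machines pay 1 or 0 (an assumption)."""
    rng = random.Random(seed)
    n = len(payout_probs)
    pulls = [0] * n          # how often each machine was played
    estimates = [0.0] * n    # running average payout of each machine
    total = 0
    for t in range(rounds):
        if rng.random() < decaying_epsilon(t):
            arm = rng.randrange(n)                            # explore
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploit
        reward = 1 if rng.random() < payout_probs[arm] else 0
        pulls[arm] += 1
        # Incremental update of the average payout seen from this machine.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total += reward
    return pulls, estimates, total
```

Calling `play_bandit([0.2, 0.5, 0.8])` simulates three machines; setting `decay=0` and `start=0.01` would turn the same code into the constant, one-in-a-hundred epsilon-greedy strategy.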

Of course, the real world is not generally stable, and it would not generally be safe to have a pure exploration phase followed by a pure exploitation phase. More recent machine-learning algorithms have been based on explicitly tracking the uncertainty in each option.6 When the uncertainty becomes too large relative to how much you expect to win from the option you are currently playing, it’s time to try exploring again. This requires keeping track of both the uncertainty and the expected value of an option. As we will see below, there is now behavioral evidence that people keep track of uncertainty, that uncertainty is represented within the brain, and that genetic differences affect how sensitive individuals are to uncertainty (how willing they are to switch from exploitation to exploration modes).7
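The idea of tracking both expected value and uncertainty can be sketched in a few lines of Python. This is only a toy, not the algorithm from the cited work: the `Option` class, the rule of adding uncertainty to expected value, and the `shrink` and `drift` numbers are all assumptions made for illustration. Uncertainty shrinks when an option is played and creeps upward while it is ignored, as it would in a world that can change.

```python
from dataclasses import dataclass

@dataclass
class Option:
    expected_value: float = 0.0   # how much we expect to win on average
    uncertainty: float = 1.0      # how unsure we are about that estimate

def choose(options, weight=1.0):
    """Pick the option whose expected value plus uncertainty is largest."""
    return max(
        range(len(options)),
        key=lambda i: options[i].expected_value + weight * options[i].uncertainty,
    )

def update(options, chosen, reward, learning_rate=0.2, shrink=0.5, drift=0.05):
    """Learn from one round (all rates are illustrative assumptions)."""
    for i, option in enumerate(options):
        if i == chosen:
            # Move the estimate toward what was just received, and
            # become more sure about the option that was sampled.
            option.expected_value += learning_rate * (reward - option.expected_value)
            option.uncertainty *= shrink
        else:
            # Options left alone become less well known over time.
            option.uncertainty += drift
```

With two options that start at expected values of 0.6 and 0.5, this agent plays the first for a couple of rounds; once the neglected option's uncertainty has grown large enough relative to the small difference in expected value, `choose` sends it back to explore.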

The expected value of an option is how much you expect to win on average if you keep playing the option; uncertainty is how unsure you are about that expected value. This uncertainty is often called the “risk” in financial markets.8 Imagine that you know that one time in twenty when you put a dollar into the game you get twenty-one dollars out. That game then has a positive payout of $21/$20 or 1.05, meaning you make 5% playing that game.D On the other hand, another machine might pay $105 one time in 100, which also has a positive payout of $105/$100 or 1.05 or 5%, but is much more variable. If you played either game for a very long time, the average return per play would be the same. But if you play each game only a few times, you could lose more (or gain more) in the second game. The second one is riskier.
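The arithmetic behind these two machines can be checked directly. The sketch below (the function name `payout_stats` is my own) computes the average payout per dollar played and the standard deviation of a single play, which is one common way of putting a number on risk; using the standard deviation here is an assumption, since the text does not commit to a particular measure.

```python
from fractions import Fraction
import math

def payout_stats(prize, win_probability):
    """Mean and standard deviation of one $1 play that pays `prize` or nothing."""
    mean = prize * win_probability
    variance = win_probability * (1 - win_probability) * prize ** 2
    return mean, math.sqrt(variance)

# $21 one time in twenty, and $105 one time in a hundred.
first_mean, first_risk = payout_stats(21, Fraction(1, 20))
second_mean, second_risk = payout_stats(105, Fraction(1, 100))
```

Both means come out to exactly 21/20, or 1.05 per dollar played, while the standard deviation is about 4.6 for the first machine and about 10.4 for the second: the same expected value, but more than twice the variability.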

Uncertainty

It is useful to think of three different kinds of uncertainty: expected uncertainty, unexpected uncertainty, and ambiguity.10

In the first type (expected uncertainty), one recognizes that there is a known unpredictability in the world. We see this when a meteorologist says that there is a “60% chance of rain.” If the meteorologist is correct, then 6 times out of 10 that the meteorologist says “60%,” there should be rain and 4 times out of 10, it should be dry.

The second type (unexpected uncertainty) entails a recognition that the probabilities have changed. This is what is tested in reversal learning tasks, in which reward-delivery probabilities are changed suddenly without an explicit cue telling the subject that the probabilities have changed. Rats, monkeys, and humans can all reliably track such changes.11 Since the reversal is uncued, recognition of the change in probabilities becomes an information-theory problem.12 Humans and other animals can switch as quickly as information theory says anything could. When faced with a sudden change in reward-delivery probability (such as in extinction, when rewards are suddenly no longer delivered, see Chapter 4), animals show frustration, increasing their exploration of alternative options.13

The difference between expected and unexpected uncertainty was most famously (infamously) laid out by Secretary of Defense Donald Rumsfeld in his press conference defending the failures of the American invasion of Iraq, in which he claimed that there were “known unknowns” (expected uncertainty) and “unknown unknowns” (unexpected uncertainty) and that he should not be blamed for misjudging the “unknown unknowns.”14 Of course, the complaint was not that there was unexpected uncertainty, but that Rumsfeld and his team had not planned for unexpected uncertainty (which any military planner should do15) and that Rumsfeld and his team should have expected that there would be unexpected uncertainty.

The third type of uncertainty is ambiguity, sometimes called estimation uncertainty,16 which is uncertainty due to a known lack of exploration. This is what is often referred to in the animal-learning literature as the transformation from novel to familiar and is the main purpose of exploration.17 Here, the uncertainty is due to having only a limited number of samples. The more samples you take, the better an estimate you can get. Exploration entails taking more samples, thus getting a better estimate of the expected uncertainty.

Humans find ambiguity very aversive. The classic example of ambiguity is to imagine two jars containing marbles and having to decide which jar to pick from.18 Each jar contains a mix of red and blue marbles. You will win $5 if you pick a red marble and $1 if you pick a blue marble. Imagine that you see the first jar being filled, so that you know that half of the marbles are red and half are blue. (You will still have to pick with your eyes closed and the jar above your head so which marble you pick will be random, but you know the distribution of marbles in the first jar.) Imagine, however, that the second jar was filled while you were not there and all you know is that there is a mix of red and blue marbles. Which jar will you pick from? In the first one, you know that you have a 50:50 chance of getting red or blue. In the second one, you might have a 90:10 chance or a 10:90 chance. Adding up all the possible chances, it turns out that you also have a 50:50 chance of getting red or blue. (The likelihood of seeing a 90:10 mix of red and blue marbles is the same as the likelihood of seeing a 10:90 mix. Since 90:10 is the opposite of 10:90, they average out to being equivalent to a 50:50 mix in both the first-order average [expected value] and the second-order variability [risk].) However, people reliably prefer the first jar, even when the fact that the second jar adds up to the same chance is pointed out to them. The first jar contains risk (expected uncertainty; you could get red or blue). The second jar, however, also contains ambiguity (What’s the mix?) as well as risk (What’s the variability?). This is known as ambiguity aversion and is technically an example of the Ellsberg paradox.E
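The claim that the second jar "adds up" to the same 50:50 chance can be verified by averaging over every mix the jar might hold. The sketch below assumes a jar of 100 marbles in which every mix from 1 red to 99 red is equally likely; that symmetric prior is my assumption for illustration (any prior that treats red and blue alike gives the same answer), and the names are hypothetical.

```python
from fractions import Fraction

def chance_of_red(mix_prior):
    """Average the chance of red over all possible mixes.

    `mix_prior` maps a fraction of red marbles to the probability
    that the jar actually holds that mix.
    """
    return sum(p_mix * red_fraction for red_fraction, p_mix in mix_prior.items())

def expected_winnings(p_red, red_prize=5, blue_prize=1):
    """Dollars won on average for a single blind pick."""
    return red_prize * p_red + blue_prize * (1 - p_red)

# First jar: the mix is known to be half red.
known_jar = {Fraction(1, 2): Fraction(1)}
# Second jar: any mix from 1 to 99 red marbles out of 100, equally likely.
unknown_jar = {Fraction(r, 100): Fraction(1, 99) for r in range(1, 100)}
```

Here `chance_of_red` returns exactly 1/2 for both jars, so `expected_winnings` is $3 either way. For a single pick the two jars are the same gamble; the preference for the first jar has to come from somewhere other than the numbers.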

There are a number of explanations that have been suggested for the Ellsberg paradox, including that humansF are averse to ambiguity as a consequence of being more sensitive to losses than to gains, or because of an inability to do the full calculation of probabilities needed for the ambiguity case. Alternatively, the Ellsberg paradox could arise because humans have inherent heuristics aimed at avoiding deceit.21 Since the subject doesn’t know the mix of the second jar, it might have been chosen to be particularly bad, just so the experimenter wouldn’t have to pay much out to the subject. (Rules on human-subjects studies generally do not allow an experimenter to lie to a subject, so when the experimenter tells the subject that there is a 50:50 mix, it has to be true, but when the experimenter tells the subject that there is just a “mix,” it could have been chosen to be particularly poor.G But, of course, it is not clear that the subjects of these experiments believe that.)

Animals will reliably expend effort to reduce uncertainty. (This means that knowledge really is rewarding for its own sake.) People often say that figuring something out gives them a strong positive feeling, both of accomplishment and of knowing how things work.23 (Certainly, I can say personally that there is no feeling like the moment when a complex problem suddenly comes clear with a solution, or the moment when I realize how a complex set of experiments that seem to be incompatible with each other can be unified in a single theory.) Not only humans, but also other animals will expend energy simply to reduce ambiguity. Ethan Bromberg-Martin and Okihide Hikosaka tested this in monkeys and found that information that reduces uncertainty, even if it has no effect on the final amount of reward the animal gets, has gotten, or expects to get, produces firing in dopamine neurons24 (which, if you remember, we saw represent things being “better than expected,” Chapter 4).

Gathering information

Computational theories have proposed that a good way to encourage a computational agent or robot to explore is to provide “exploration bonuses,” where the system provides a “better than expected” signal when the agent enters a new part of the world.25 Computationally, if this is matched by a “worse than expected” signal when the agent returns to its familiar zone, then the total value function will still be learned correctly. But it is not clear if the counteracting “worse than expected” signal appears on return. People and animals returning to a safe and familiar zone tend to show relief more than an “oh, well, that’s done” disappointment signal.26 Although it is possible to prove stability only if an agent is allowed an infinite time to live in a stable world (obviously not a useful description of a real animal in a real world), simulations have shown that for many simple simulated worlds, not having a “worse than expected” signal on return does not disrupt the value function too much. These simulations have found that an agent with only a positive “better than expected” novelty signal does as well as or better than agents with no novelty signals.27 Such an agent would be overly optimistic about future possibilities. This idea that the agent should receive a bonus for exploration is reflected in the human belief that things could always be better elsewhere. As Tali Sharot discusses in her book The Optimism Bias, the data are pretty strong that humans are optimistic in just such a manner: “the grass is always greener on the other side.” Humans tend to overestimate the expectation of positive rewards in novel choices, which leads to exploration.28
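A positive-only exploration bonus of this kind can be sketched as follows. The class name `NoveltyBonusLearner`, the rule that the bonus falls off as one over the number of visits, and the learning rate are all assumptions for illustration; the cited models differ in their details. The point is only that a place with no reward at all still produces a “better than expected” signal the first few times it is visited, and that this leaves a small optimistic residue in the learned value.

```python
from collections import defaultdict

class NoveltyBonusLearner:
    """Toy value learner with a positive-only novelty bonus (a sketch)."""

    def __init__(self, bonus=1.0, learning_rate=0.1):
        self.bonus = bonus
        self.learning_rate = learning_rate
        self.visits = defaultdict(int)     # times each place has been seen
        self.value = defaultdict(float)    # learned value of each place

    def novelty(self, state):
        # Largest for a never-visited place; fades as it becomes familiar.
        return self.bonus / (1 + self.visits[state])

    def observe(self, state, reward):
        """Return the 'better (or worse) than expected' signal and learn from it."""
        surprise = reward + self.novelty(state) - self.value[state]
        self.value[state] += self.learning_rate * surprise
        self.visits[state] += 1
        return surprise
```

With the default settings, a first visit to an unrewarded place gives a signal of 1.0 and a second visit gives 0.4. No matching negative signal is ever delivered, so the place ends up valued slightly above its true worth of zero.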

Several studies have found these sorts of “novelty bonuses” in dopamine signals in monkeys and in fMRI activity from dopamine-projecting areas in humans.29 Animals do find novelty rewarding, and will naturally go to explore a novel item placed in their environment.30 This is so natural that it doesn’t require any training at all. Harry Harlow, in his early monkey experiments, found that fully fed monkeys would work harder to see novel objects than hungry monkeys would work for food. Hungry rats will explore a new environment before eating. This suggests that exploration (the search for knowledge or information) is an intrinsic goal of the sort discussed in Chapter 13.

Learning from others

In the real world (outside of the laboratory), such exploration can, of course, also be dangerous. An advantage humans have is that they can imagine futures (episodic future thinking, Chapter 9) and can explore possibilities by imagining those potential future choices, even without necessarily trying them. One way this can occur is with fictive learning, in which one learns from watching others. One can observe another person’s successes and failures and learn about what happens to him or her without directly experiencing it oneself.31

Of course, this is one of the most powerful aspects of human language, because language can be used to inform and to teach. A number of authors have observed that human culture is fundamentally Lamarckian; parents really do pass the lessons they learn in their lives on to their children.32 (Although those lessons don’t always stick …) Nevertheless, this Lamarckian nature of culture is one of the reasons that humans have spread so quickly to all of the corners of the globe, from the hottest jungles to the coldest tundras, even sending men to the moon and robots to the other planets in the solar system.

We now know that when a monkey observes another monkey doing a task that it is familiar with, the set of cortical neurons that normally fire when the observer is doing a task also fire when the observer is just watching the other monkey do the task.33 These neurons are called “mirror neurons” because they mirror the behavior that the other monkey is doing. A similar set of human cortical circuits is active (as determined by fMRI) whether one is doing the action oneself or watching it.34 Although the conclusions of some of these results are controversial (Are the mirror neurons actually critical to imitation, or are they only recognizing related cues? Are they simply reflecting social interactions?), certainly, we all know that we learn a lot by observing others.35

Similarly, we also know that we learn vicariously about reward and failure by watching others. A similar mirror system seems to exist for emotions, where we recognize and interpret the emotions of others.36;H This recognition is related to the emotions of empathy, envy, and regret.39 These emotions are dependent on an interaction between hippocampal-based abilities to imagine oneself in another situation and orbitofrontal-based abilities to connect emotion to those situations.40 In empathy, we recognize another’s success or loss and feel a similar emotion. In envy, we recognize another’s gain and wish we had achieved it. There is an interesting counter-side to envy, gloating or schadenfreude, meaning taking joy in another’s adversity (from the German words schaden, harm, and freude, joy), which also depends on the same emotion-mirroring brain structures that envy does. And, in regret, we recognize a path we could have chosen but did not.

In large part, this is the key to the power of narrative and literature. We identify great stories as those that place realistic characters in situations, watching how they react. Great literature allows us to vicariously experience choices and situations that we might never experience. If one looks at the great works of literature, whether classic or popular,I one finds that they consistently describe human decision-making from remarkably realistic and complex perspectives. This is true whether one is talking about realistic dramas, magical worlds, or science fiction; even science fiction and magical realism depend on characters behaving realistically in strange situations. Although readers are generally willing to accept strange situations in other galaxies, or characters with magical powers, readers are generally unwilling to work through unrealistic characters.J Reading or watching these narratives, we place ourselves in those worlds and imagine what we would do in those situations, and we empathize with the characters and feel envy and regret about the decisions they make.

Exploration and exploitation over the life cycle

Computational work from the machine-learning literature has shown that in an unchanging world, the optimal solution to the exploration/exploitation tradeoff is to explore early and exploit late.42 Of course, “early” and “late” have to be defined by the agent’s expected lifetime. This would predict that one would see juveniles exploring and adults exploiting. While this is, in general, true, there are other important factors that affect how the exploration/exploitation balance shifts over the lifetime.

First, as we’ve discussed elsewhere, we don’t live in an unchanging world. Thus, one cannot explore early and be done. Instead, one needs to titrate one’s exploration based on how well one is doing overall.43 When the rewards one is receiving start to decrease, one should increase one’s exploration. When faced with a decrease in delivered rewards, animals get stressed and frustrated and show an increase in a number of behaviors: both the behavior that used to drive rewards and other exploratory behaviors, starting with behaviors similar to the one that used to drive reward delivery. We’ve all seen ourselves do this, starting with trying the behavior again (This used to work!), moving to minor changes (Why doesn’t this work?), and then on to new behaviors (Maybe this will work instead.)

Second, exploration is dangerous. This means that evolution will prefer creatures that develop basic abilities before going exploring. In humans, young children explore within very strict confines set by parents, and those confines expand as they age. As every parent knows, kids want to know that there are limits. They want to test those limits, but they also want those limits to exist. This suggests that the primary exploratory time in humans should occur after the person develops sufficient talents to allow handling the dangers of exploring the world.

The teen, adolescent, and early college years are notorious for high levels of drug use, exploratory sex, and risk-taking in general.44 These are the exploratory years, where humans are determining what the shape of the world is outside the confines of their family structure. In general, most humans age out of these vices and settle down as they shift from an exploration strategy to an exploitation strategy.

In the next chapter, we will see how the prefrontal cortex plays a role in measuring risk and controlling dangerous exploration. One of the most interesting results discovered over the past few decades is that the prefrontal cortex in humans doesn’t fully develop until the mid-twenties.45 This means that adolescents do not have the same prefrontal assessment of risk, or the same behavioral inhibition against doing stupid things, that adults have. As we’ve seen throughout the rest of this book, behaviors are generated by the physical brain, and changes in behavior are often caused by physical changes in the brain.K

Books and papers for further reading

• Tali Sharot (2011). The Optimism Bias: A Tour of the Irrationally Positive Brain. New York: Pantheon.

• Abram Amsel (1992). Frustration Theory. Cambridge, UK: Cambridge University Press.

• B. J. Casey, Sarah Getz, and Adriana Galvan (2008). The adolescent brain. Developmental Review, 28, 62–77.