12 UNDERSTANDING NATURAL LANGUAGE (AND JEOPARDY! QUESTIONS)
Watson cannot be intimidated. It never gets cocky or discouraged. It plays its game coldly, implacably, always offering a perfectly timed buzz when it’s confident about an answer.
—Ken Jennings, human Jeopardy! champion1
PUBLICITY STUNT OR BOON TO AI RESEARCH?
In 2006, Sebastian Thrun gave a presentation at an artificial intelligence conference about Stanley, the self-driving car he and his colleagues developed for the second DARPA Grand Challenge. The audience was electrified. Among those watching was James Fan, a graduate student at the University of Texas at Austin who was studying question answering, a quiet field of computer science devoted to developing computer programs that can answer written questions. As James watched Sebastian’s presentation, he began to speculate.
“Wouldn’t it be great,” he later asked a group of his colleagues, “if there were a Grand Challenge in question answering, hosted by Alex Trebek?”2 Alex was the host of the popular American game show Jeopardy!, in which contestants must have an encyclopedic knowledge of trivia, ranging from ancient history to biology to movies. During the show, Trebek poses clues to contestants in the form of an answer, and the contestants must answer these clues while phrasing their responses in the form of questions.3
But the colleagues laughed off James’s idea. Trebek was too big a celebrity. Government pay-schedules and research grants wouldn’t be able to accommodate his speaking fees. It might be great publicity for the field of question answering, they thought, but it wouldn’t be a great use of taxpayer money.
IBM WATSON
Nearly five years later, on two cold days in January 2011, Ken Jennings and Brad Rutter, two of the most successful human players in the history of Jeopardy, faced off in a Jeopardy match against Watson, a computer program built by a team of researchers from IBM.4 The game was hosted at one of IBM’s research buildings, and Watson was running on racks of computers in a datacenter next door, completely cut off from the internet. The datacenter was cold and loud, as fans blew air across thousands of CPUs.5
The temporary studio was much warmer than both the datacenter and the winter air outside. IBM had snagged Alex Trebek to host the game; he offered the contestants clues as they selected topic categories on the game board. When the contestants knew an answer, they would buzz in. Watson also buzzed in when it knew an answer, electromechanically, its solenoid thumb hitting the buzzer with perfect timing.6
“Tickets aren’t needed for this event, a black hole’s boundary from which matter can’t escape,” Trebek offered.
Watson answered correctly, its screen glowing as its gentle voice—a mechanical voice of the “smooth, genial male variety” (in the words of one reporter)—rose and fell7: “What is event horizon?”
Long before the game ended, Jennings and Rutter realized they had no chance. The game had been humiliating for them. By the end of the two-day challenge, Jennings had earned $24,000, and Rutter had earned $21,600. Watson finished with a total of $77,147, a commanding lead over its human opponents.8 Jennings wrote out a statement of surrender below his answer to the final question of the game: “I, for one, welcome our new computer overlords.”
CHALLENGES IN BEATING JEOPARDY
Watson was miles ahead of the next-best computer program that could answer trivia questions. To see why Watson was such a breakthrough, let’s look at just a few of the clues Watson was designed to answer. Here’s one about the 2008 Olympics:
Milorad Čavić almost upset this man’s perfect 2008 Olympics, losing to him by one hundredth of a second.
Here’s another clue:
Wanted for general evil-ness; last seen at the Tower of Barad-Dur; it’s a giant eye, folks, kinda hard to miss.
And here’s yet another clue, in the category “The main vegetable”:
Coleslaw.
Take a moment to consider how a computer might answer these questions: what information it must know, how it might store that information, and how it might process the question to look up that information. And remember that IBM’s researchers couldn’t simply program Watson to read the question, understand it, and recall the answer from what it had read. Its programmers needed to provide Watson with an explicit sequence of operations it could follow to answer each clue.
IBM’s Watson had no human understanding of what each word—let alone each collection of words—meant. And yet still it managed to defeat two human champions. In chapters 12 and 13, we’ll look more deeply at how Watson managed to do this. In this chapter, we’ll start with the first piece of this puzzle: how Watson figured out what the clue was even asking.
LONG LISTS OF FACTS
On the surface, some Jeopardy questions might look easy for a computer to answer: Jeopardy is a quiz show, and quiz shows are about facts. And Watson had four terabytes of disk to store its databases of facts.9 This should get us most of the way to building Watson, right?
For example, take the following Jeopardy clue, which appeared under the category “Who Wrote It?”:10
A “savage journey” titled “Fear & Loathing in Las Vegas.”
Here’s another example, under the category “Writers by Middle Names”:
Allan, who was “nevermore” as of Oct. 7, 1849.
To answer these questions, Watson needs to know that Hunter S. Thompson wrote Fear and Loathing in Las Vegas, and that Edgar Allan Poe passed away October 7, 1849—or, at least, that Poe was associated with the phrase “nevermore” or had the middle name “Allan.”11
Facts like these can be stored in databases, and Watson did store facts like this, whenever it could. Such facts are known as relations. Relations are connections between people, places, and things. One such relation is the author-of relation, which can give us an answer to the first clue above:
Table 12.1
Charles Dickens | author-of | A Christmas Carol
Hunter S. Thompson | author-of | Fear and Loathing in Las Vegas
J. K. Rowling | author-of | Harry Potter and the Sorcerer’s Stone
… | … | …
Another relation—helpful for the second clue above—is the alive-until relation:
Table 12.2
Edgar Allan Poe | alive-until | October 7, 1849
Abraham Lincoln | alive-until | April 15, 1865
Genghis Khan | alive-until | August 18, 1227
… | … | …
As you can imagine, the set of possible relations is endless, and Watson stored millions of them, to keep track of dates, movies, books, people, places, and so on.
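To make the idea concrete, here is a minimal sketch, in Python, of how relations like these might be stored and queried. The triple format, the sample data, and the lookup function are purely illustrative; Watson’s actual databases were far larger and more sophisticated.

```python
# A minimal, illustrative relation ("triple") store: each fact is a
# (subject, relation, object) tuple. This is only a sketch of the idea,
# not Watson's actual storage format.
RELATIONS = [
    ("Charles Dickens", "author-of", "A Christmas Carol"),
    ("Hunter S. Thompson", "author-of", "Fear and Loathing in Las Vegas"),
    ("J. K. Rowling", "author-of", "Harry Potter and the Sorcerer's Stone"),
    ("Edgar Allan Poe", "alive-until", "October 7, 1849"),
]

def lookup_subject(relation, obj):
    """Return every subject connected to `obj` by `relation`."""
    return [s for (s, r, o) in RELATIONS if r == relation and o == obj]

# Who wrote "Fear and Loathing in Las Vegas"?
print(lookup_subject("author-of", "Fear and Loathing in Las Vegas"))
# ['Hunter S. Thompson']
```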
But millions of relations alone would still leave Watson far short of being able to play Jeopardy. Take the clue I mentioned above, which was offered during Watson’s televised match:
Wanted for general evil-ness; last seen at the Tower of Barad-Dur; it’s a giant eye, folks, kinda hard to miss.
Although Watson provided the correct response, What is Sauron?, it’s unlikely that Watson had an is-a-giant-eye relation, let alone an is-a-giant-eye-who-resided-in relation.12 It’s unlikely that Watson had anything in its structured databases about Sauron, except that Sauron is a character in Lord of the Rings, and that Lord of the Rings was written by J. R. R. Tolkien. Just as a self-driving car couldn’t anticipate rare occurrences like a woman in an electric wheelchair chasing a duck in the middle of the street—an encounter that we know happened for one self-driving car—the researchers behind Watson couldn’t have anticipated all possible relations that might show up in a clue.
Another challenge Watson faced is that Jeopardy clues are phrased in a wide variety of ways. Take the clue above about Edgar Allan Poe, who was “nevermore” as of 1849. Watson needed some way to recognize that “nevermore” meant “dead.” Watson used dictionaries and thesauri, but a typical thesaurus doesn’t list “nevermore” as a synonym for “dead”; the word only carries that meaning in this context because it is the famous refrain from one of Edgar Allan Poe’s poems. And although relations gave Watson the ability to simply “look up” answers in its databases, only about a quarter of clues contained such a relation in the first place. Worse still, Watson could actually look up an answer this way for a mere 2 percent of clues.13
So how did Watson answer the remaining 98 percent of clues? It did this by systematically analyzing the clue, teasing out key information with a fine-toothed comb.
THE JEOPARDY CHALLENGE IS BORN
Shortly before Watson competed against Jennings and Rutter, a popular book by Stephen Baker called Final Jeopardy came out. The book was originally published electronically, with its final chapter withheld until after the competition aired on television; that last chapter was then delivered to readers electronically (and included in the subsequent print version). Among other things, the book outlined how the team at IBM made their decision to develop a program to play Jeopardy, a story that unfolded as follows.14
In the early 2000s, IBM was looking for a Grand Challenge, a public display of the company’s technical prowess. Finding such a challenge was important to IBM because the company had a lucrative consulting business, and this business depended on its customers’ faith that the company was at the cutting edge in fields like big data and large-scale computing. Deep Blue’s defeat of chess champion Garry Kasparov in 1997 had been one such success, so the idea of another challenge was on everyone’s mind.15
It’s difficult to trace exactly where the original idea for a Jeopardy Challenge started: different employees at the company have different accounts. One version of the story holds that a senior manager at IBM got the idea when he was at a steakhouse one autumn day in 2004. He noticed other customers getting up from their untouched meals to move to a different section of the restaurant. They were crowding around televisions at the bar, three people deep, to watch Ken Jennings during his famous winning streak. After winning over 50 consecutive games, would Ken just keep on going? If the public was so fascinated by this game, the IBM manager wondered, would they be similarly interested in a game between a human and a computer?16
However the idea for an IBM Jeopardy Challenge actually started (at least one other employee at the company thought he had the idea, and James Fan, whom we met at the beginning of this chapter, also independently had the idea), once it had coalesced, it ran into plenty of internal resistance. Some saw a Jeopardy Challenge as a publicity stunt that could waste money and researchers’ time. Even worse, it might risk the company’s credibility. Despite this resistance, the head of IBM’s 3,000-person research division pitched the project to some of his researchers, one of whom was David Ferrucci.17
Ferrucci was already familiar with the problems they might face, because one of the research teams he managed had been working for a handful of years on a question-answering system. Theirs was among the better question-answering systems in the world, and it consistently performed well in competitions. But Ferrucci and his team also knew how far these systems were from being able to play Jeopardy. Still, he pitched the problem to his team. Only one of them was optimistic about the idea: James Fan, who had recently joined the team after finishing his PhD.18 But the team concluded that the field wasn’t ready yet: it would be too difficult. Ferrucci told the head of research that it would be best not to pursue the project.19
Before long the head of research returned to ask again about Jeopardy; Ferrucci and his team retreated to a conference room once again to brainstorm. As they discussed the project, their conclusions remained roughly the same: a system able to answer Jeopardy questions would need to be much faster than their current system; it would need to answer a much broader array of questions; and, most difficult of all, it would need to answer those questions more accurately. There were too many open research problems to address. It didn’t seem possible. But in the end, inspired by the possibility of success and some hunches about how they might proceed, they relented, and Watson was born.20
DEEPQA
The question-answering system Ferrucci’s team already had when they first began working on Watson was good by the standards of the day. IBM had already devoted a lot of resources to it—a four-person team had developed the system over the course of six years. But their existing system didn’t work out of the box for Jeopardy, so Ferrucci’s team spent about a month converting it to play the game.
Ferrucci’s team also needed a way to evaluate their system. Fortunately they discovered a goldmine of Jeopardy clues and answers on the internet. Avid Jeopardy fans had created a website containing all Jeopardy questions and answers from televised episodes, and they had annotated the questions with detailed information.21
Using this site, the IBM team collected performance statistics of past Jeopardy winners: How often did the winners buzz in? When they did buzz in, how often did they give the correct answer? Ferrucci’s team created a scatterplot of these two numbers, a cloud of data points illustrating how accurately and how often past Jeopardy winners answered questions. They called this plot the “Winners Cloud,” and they used it as a measuring stick to benchmark Watson.22 If they could move Watson into the cloud, they could match the human winners’ performance. If they could move it past the cloud, they could beat these humans.
After the team had spent a month converting their previous system to play Jeopardy, they evaluated it using these metrics. But their converted system performed abysmally: if it attempted the 62 percent of questions it was most confident about—the same fraction Ken Jennings answered on average—it would get only 13 percent of them right. To be competitive with Jennings, Watson would need to answer more than 92 percent of these questions correctly.23 It was clear to them that they would need to do things very differently.
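For readers who want to see the arithmetic behind numbers like these, here is a small sketch of the underlying metric: rank the clues by the system’s confidence, keep only the fraction it would attempt, and measure how many of those it gets right. The function and its input format are illustrative assumptions, not IBM’s code or data.

```python
# Illustrative sketch of "precision at percent attempted," the metric behind
# the Winners Cloud. `predictions` is a list of (confidence, is_correct)
# pairs, one per clue; this data layout is an assumption for the sketch.
def precision_at_attempted(predictions, fraction):
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    n_attempted = max(1, int(len(ranked) * fraction))
    attempted = ranked[:n_attempted]
    return sum(1 for _, correct in attempted if correct) / n_attempted

# e.g. precision_at_attempted(baseline_predictions, 0.62) came out around 0.13
# for the converted system, versus better than 0.92 for champions like Jennings.
```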
This failure of their existing system was in fact part of Ferrucci’s strategy: he wanted the team to see clearly that their current system, built on traditional methods, had failed. Only by failing could they justify starting from scratch and reinventing their approach.24
And so Ferrucci and his team experimented, implementing state-of-the-art methods from the academic literature. After many months of experimentation, the team finally arrived at an architecture that seemed to work; they called it DeepQA.25 The approach behind DeepQA was simple. Like many other question-answering systems, to arrive at an answer it followed just a few concrete steps, which you can see in figure 12.1: analyze the question, come up with candidate answers with search engines, research these answers, and score these answers based on the evidence it found for them. In the rest of this chapter we’ll focus on the first phase of this pipeline: Watson’s Question Analysis phase.
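One rough way to picture this pipeline is as four functions feeding one another, as in the sketch below. Each stub stands in for a large subsystem, and all of the names and return values are placeholders; only the overall flow is meant to mirror figure 12.1.

```python
# A schematic sketch of the DeepQA pipeline. Every function body here is a
# placeholder; in Watson each phase was a large collection of components.

def analyze_question(clue, category):
    # Question Analysis: find the focus, answer type, names, dates, relations.
    return {"clue": clue, "category": category}

def generate_candidates(analysis):
    # Candidate generation: query search engines and structured sources.
    return ["candidate A", "candidate B"]

def research_candidates(analysis, candidates):
    # Evidence retrieval: gather passages that support or refute each candidate.
    return {c: ["(supporting passage)"] for c in candidates}

def score_candidates(candidates, evidence):
    # Scoring: weigh the evidence and attach a confidence to each candidate.
    return [(c, 0.5) for c in candidates]

def answer_clue(clue, category):
    analysis = analyze_question(clue, category)
    candidates = generate_candidates(analysis)
    evidence = research_candidates(analysis, candidates)
    scored = score_candidates(candidates, evidence)
    return max(scored, key=lambda pair: pair[1])  # highest-confidence answer
```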
QUESTION ANALYSIS
The goal of Watson’s Question Analysis phase is to decompose a question into pieces of information that could be useful in finding and evaluating answers later in the pipeline. Like most parts of Watson, the Question Analysis phase depended heavily on the field of natural language processing, or NLP. NLP gave Watson the ability to do something meaningful with the words making up the clue: Watson used it to find the parts of speech of the words in the clue, to search for names and places in the clue, and to create sentence diagrams of the clue.26
The most important task for Watson during its Question Analysis phase was to find the phrase in the clue summarizing what specifically it is asking for. Take this clue, for example:
It’s the B form of this inflammation of the liver that’s spread by some kinds of personal contact.
The phrase summarizing what the clue is asking for is this inflammation of the liver. Watson’s researchers called this phrase the “focus.” The focus is the part of the clue that, if replaced by the answer, turns the clue into a statement of fact.27 If we replace the focus of the clue above by the answer, hepatitis, it becomes:
It’s the B form of hepatitis that’s spread by some kinds of personal contact.
Now it is a factual statement. Here’s another example:
In 2005 this title duo investigated “The Curse of the Were-Rabbit.”
In this clue, the focus is “this title duo.” Replacing the focus by its answer, we get:
In 2005 Wallace and Gromit investigated “The Curse of the Were-Rabbit.”
Again, that is a factual statement. By finding the focus, Watson could use that information down the road when it generated possible answers and scored them. Now let’s apply this to our clue about the 2008 Olympics. Here’s that clue again, with its focus in bold:
Milorad Čavić almost upset this man’s perfect 2008 Olympics, losing to him by one hundredth of a second.
Another type of information Watson extracted from the question is a word or phrase describing the answer type.28 Is the clue asking for a president? Is it asking for a city? Or maybe it’s asking for an inflammation like hepatitis or an ingredient like lettuce. Watson stored this answer type during its Question Analysis phase so it could generate and narrow down candidate answers in later stages; I’ll describe exactly how it used this information in the next chapter. If the question was asking for a disease, for example, then Watson could narrow down its candidate answers in a later stage by giving higher weight to candidates that actually were diseases and lower weight to candidates that were, say, symptoms of diseases. The answer type is usually part of the focus, so if Watson could find the focus, it had a good chance of finding the answer type. In our clue about the 2008 Olympics, the answer type was simply man. Watson would use this information later in its pipeline to narrow down its candidate answers to those that were instances of man.
Sometimes Watson had little more than a few nouns or verbs to go on in its clue. In one of the clues we saw above, the clue was a single word: Coleslaw.29 When Watson couldn’t find an answer type in cases like this one, it searched the clue’s category for an answer type. (Every question in Jeopardy is assigned to a category, and this category is visible to all players when they see the question.) The category for the clue Coleslaw was The main vegetable, so in this case Watson could set its answer type to vegetable, which would later help Watson to find the correct answer: cabbage.30
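Here is a toy sketch of that fallback, assuming the focus (if there is one) has already been extracted. The noun-picking rule is a deliberate oversimplification of Watson’s much richer analysis.

```python
# A toy sketch of answer-type detection with a category fallback. The rule
# for picking a noun out of the focus is a crude approximation, not Watson's.
def answer_type(focus, category):
    if focus:
        # "this inflammation of the liver" -> "inflammation"
        head = focus.split(" of ")[0].split()
        return head[1] if head[0] in ("this", "these") else head[-1]
    # No usable focus (e.g. the one-word clue "Coleslaw"): fall back to the
    # category, "The main vegetable" -> "vegetable".
    return category.lower().replace("the ", "").split()[-1]

print(answer_type("this inflammation of the liver", ""))  # inflammation
print(answer_type(None, "The main vegetable"))             # vegetable
```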
Watson also looked for proper nouns, dates, and relations in clues. By finding proper nouns, Watson could be much more focused as it searched for candidate answers later on. In the clue about the 2008 Olympics, it would have found the name Milorad Čavić and the phrase 2008 Olympics. It would also have recognized that 2008 is a date in the clue.
And so Watson proceeded to dissect the clue, teasing little bits of useful information out of it. For some of this information, Watson used simple pattern matching. For example, it’s easy to make Watson search for dates by having it search for sequences of four digits starting with 1 or 2. But for Watson to pull other information from the clue, like the clue’s focus and answer type, it needed a more sophisticated suite of tools.
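The date rule really can be that simple. Here is a one-line pattern of the kind just described, offered only as a sketch; Watson’s real library of patterns was far more extensive.

```python
import re

# Sketch of the simple date rule: a four-digit year beginning with 1 or 2.
YEAR_PATTERN = re.compile(r"\b[12]\d{3}\b")

clue = ("Milorad Čavić almost upset this man's perfect 2008 Olympics, "
        "losing to him by one hundredth of a second.")
print(YEAR_PATTERN.findall(clue))  # ['2008']
```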
HOW WATSON INTERPRETS A SENTENCE
One of the most important ways our modern automata interact with the world is via perception. We’ve seen how a self-driving car perceives the world around it: it uses laser scanners, cameras, and accelerometers to create a model of the world. Watson didn’t have laser scanners or accelerometers, nor did it have a camera to read the screen or microphones to listen to Alex Trebek. Instead, the clue was delivered to Watson electronically, in the form of a text file. When Watson looked at this text file, it saw nothing more than an ordered sequence of letters, so it used tricks from the field of natural language processing to make sense of them.
The first way Watson made sense of these characters was by interpreting the clue as a sequence of words instead of as a sequence of letters. Once Watson interpreted a clue as a sequence of words, it could then use some more interesting tricks to process the clue. The most important of these tricks was to map out the structure of the clue with a sentence diagram, just as you likely did in grade school. A computer creates a sentence diagram in a process called parsing, and the resulting sentence diagram is usually called a parse tree. You can see a parse tree for the clue about the 2008 Olympics in figure 12.2. In this clue, the subject is the proper noun Milorad Čavić, and the verb is the word upset; the remaining parts of the sentence modify the verb phrase. (This isn’t the exact way Watson parsed a sentence, but the basic idea is the same.) Once Watson had a diagram of a sentence, it could use this diagram to perform more interesting analysis of the question, which we’ll get to shortly. But first, let’s look briefly at how a program like Watson could create a parse tree.
A computer can create a parse tree by using a search algorithm, a lot like the way Boss planned a path through its urban environment. Instead of searching for the best path over a map as Boss did, Watson’s parser searched for the best way to create a tree out of the words in the sentence that agreed with the rules of grammar. Modern parsers use statistics about the relationships between words and parts of speech to find which parse trees are the most likely.
You probably remember from your school days that English sentences can be decomposed into a subject phrase and a verb phrase, and that each of these can be decomposed further. For example, a verb phrase or a noun phrase can be decomposed into two parts:
verb phrase = adverb + verb phrase
or
noun phrase = adjective + noun
We can continue applying rules like this until a sentence has been decomposed into small pieces, each of which is a single part of speech. Some sentence parsers use this fact. To parse a sentence, these parsers search for the best possible ways to split up the sentence using these rules, until they can’t split the sentence into any more pieces.
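To see rules like these in action, here is a toy grammar and parser built with the open-source NLTK library. It bears no resemblance to Watson’s production parser, but it decomposes a short sentence in the same spirit: by searching for a way to split it into phrases the rules allow.

```python
import nltk

# A toy context-free grammar in the spirit of the rules above. Each line says
# how a phrase may be decomposed into smaller pieces or single words.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP
Det -> 'a' | 'the'
Adj -> 'giant'
N -> 'eye' | 'tower'
V -> 'saw'
""")

parser = nltk.ChartParser(grammar)           # searches for valid decompositions
sentence = "a giant eye saw the tower".split()
for tree in parser.parse(sentence):
    tree.pretty_print()                      # draws the parse tree as text
```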
Sometimes sentences have ambiguous parse trees. Here are some sentences rumored to have appeared as newspaper headlines:31
Juvenile Court to Try Shooting Defendant
Hospitals Are Sued by 7 Foot Doctors
You might think that these examples seem contrived. These are just the rare exceptions, right? Actually, these sorts of ambiguities can happen all the time. They’re always lurking just below the surface of our language, but we don’t notice them most of the time because our minds resolve their ambiguity quickly. See if you can find the ambiguity in one of the clues we saw earlier in this chapter:
It’s the B form of this inflammation of the liver that’s spread by some kinds of personal contact.
In this clue, the ambiguity is around whether it’s the inflammation that’s spread by some kinds of personal contact, or whether it’s the liver that’s spread by some kinds of personal contact. While it’s painfully obvious to us humans that livers can’t spread by personal contact, this isn’t obvious to Watson’s sentence parser. There’s nothing ungrammatical about that parse, even if it’s semantically weird.
Here’s another example, which Watson faced when it played against Ken and Brad:
This 1959 Daniel Keyes novella about Charlie Gordon and a smarter-than-average lab mouse won a Hugo award.
In this case, the ambiguity is around whether the novella is about Charlie Gordon and a smarter-than-average lab mouse (the correct parse), or whether both a novella about Charlie Gordon and a smarter-than-average lab mouse won a Hugo award. (A Hugo award is an award for science fiction and fantasy books.) There’s nothing syntactically or even semantically wrong with the second parse, although if you knew about the Hugo award, you would realize that it’s not typically awarded to smart mice. The answer to this clue, by the way—which Watson got correct—is Flowers for Algernon.
There’s no way for a computer to know for certain which parse tree for a sentence like these is correct unless it has more context about the situation; but as I mentioned before, modern parsers use statistics about words, parts of speech, and the ways they combine to form sentences. Often those probabilities are enough for the computer to find the correct parse.
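The same toy approach shows where ambiguity comes from. In the grammar below, a prepositional phrase may attach either to a noun phrase or to a verb phrase, so a single sentence comes back with two perfectly grammatical trees; a statistical parser would score both and keep the more probable one. Again, this is only a sketch built with NLTK, not Watson’s parser.

```python
import nltk

# A toy grammar whose prepositional phrases (PP) can attach to either a noun
# phrase or a verb phrase, producing two grammatical parses of one sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'man' | 'dog' | 'telescope'
V -> 'saw'
P -> 'with'
""")

sentence = "the man saw the dog with the telescope".split()
trees = list(nltk.ChartParser(grammar).parse(sentence))
print(len(trees))  # 2: "saw ... with the telescope" vs. "the dog with the telescope"
for tree in trees:
    tree.pretty_print()
```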
Even though Watson could create these sentence diagrams, it still had no idea what they meant. To Watson, these diagrams were nothing more than data structures floating around its computer memory, some of which pointed to other ones. Fortunately for Watson, it didn’t need to understand these sentence diagrams. They were merely useful tools the programmer could use to interpret the question. But how could the programmer interpret the question without even looking at it?
Remember back to the Monopoly board in the self-driving car. The Monopoly board encoded human knowledge about situations the car might find itself in, such as the etiquette around precedence at traffic stops. Just as Boss’s creators handcrafted the rules for it to traverse crowded intersections when those researchers weren’t around, Watson’s developers handcrafted their own rules so that Watson could traverse its sentence diagrams to pull meaningful information from clues when its researchers weren’t around.
Watson used these rules to inspect parse trees all through its DeepQA pipeline, starting with its Question Analysis phase. One place it used the parse tree was to find the focus of a clue. Remember, the focus is the phrase in the clue that captures what exactly is being asked for—like this man or this inflammation. To find the focus, Watson used simple rules such as search for a noun phrase described by “this” or “these.”32 Watson also looked for other information in its parse tree, including whether there were clues embedded within other clues or whether there were pairs of clues joined by a conjunction like “or.” Watson also searched the parse tree for information about relations involving the clue’s focus.
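As a crude illustration of such a rule, the sketch below simply grabs “this” or “these” plus the word that follows, straight from the clue’s raw text. Watson applied rules like this to its parse trees rather than to flat text, which is how it could pull out a whole noun phrase such as “this title duo” rather than a single word.

```python
import re

# A crude stand-in for the focus rule: "this"/"these" plus the following word.
# Watson applied this kind of rule to parse trees, not to raw text.
FOCUS_PATTERN = re.compile(r"\b(this|these)\s+[\w']+", re.IGNORECASE)

clue = ("It's the B form of this inflammation of the liver that's spread "
        "by some kinds of personal contact.")
match = FOCUS_PATTERN.search(clue)
print(match.group(0) if match else None)  # this inflammation
```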
You can see how Watson would have analyzed our clue about the Olympics in figure 12.3. Watson has systematically dissected the clue with many rules, using the parse tree as a lens through which to inspect it. In its Question Analysis phase, Watson is an obsessive-compulsive organizer, carefully taking stock of what it finds in the sentence and putting bits of information into carefully labeled boxes. But it still hasn’t come any closer to understanding what the clue is asking. Watson has blindly processed its clue so the next few phases in its DeepQA pipeline can do their work.
When Watson had finished this labeling process, its work was still far from over: it still faced the daunting task of finding the correct answer for the clue. For this, it used some of the typical data sources you might expect: dictionaries, geographic and movie databases, and even Wikipedia. But, as we’ll see in chapter 13, Watson used them in a very different way than a human would use them.