NATURAL LANGUAGE PROCESSING
The abilities to converse on a wide range of topics; to gain knowledge by reading newspapers, magazines, and books; and to answer questions and have discussions based on knowledge of the world and commonsense reasoning are what make us distinctly human. They differentiate us from the animals that share our planet.
Since around 2010, researchers have built computer systems that appear to understand the languages that people speak. IBM’s DeepQA system beat two of the world’s top Jeopardy! champions, Ken Jennings and Brad Rutter. Microsoft announced a computer that it claims reads better than humans. Most of us interact with Siri, Alexa, or Google Assistant daily. These systems appear to understand natural language. But do they?
The first time I encountered a parrot, I was eight years old. It said hello and made a rude comment about my appearance. At first, I thought the parrot understood the language it used. Eventually, I realized that it could only repeat phrases it had heard without any understanding of what it was saying.
The natural language processing systems from IBM and Microsoft, and also Siri, Alexa, and Google Assistant, are more like parrots than people. AGI-level natural language processing is not possible for today’s computers and might never be possible. Marvin Minsky, whom many consider the father of AI, once said, “Any ordinary person who can understand an ordinary conversation must have in his head most of the mental power that our greatest thinkers have.”1 Minsky had good reason for placing such a high value on a capability most of us take for granted.
NATURAL LANGUAGE IS COMPLEX
Children develop the ability to understand and converse in natural language through an accumulation of exposure to language and life experiences. They acquire both knowledge about the world and commonsense reasoning that they apply to that knowledge. As experience and exposure accrue over time, so does the ability to understand and use language.
Language understanding involves far more than retrieving the dictionary definition of words and applying grammatical rules. It requires the ability to decipher the unsaid implications of an utterance.2 Even children make extensive use of implied meaning based on world knowledge in understanding language. For example, consider this statement:
The police officer held up his hand and stopped the truck.
The ability to accurately interpret this sentence requires a tacit understanding of specific facts, including the following:3
•Trucks have drivers.
•People obey police officers.
•Trucks have brakes that will cause them to stop.
•Drivers can step on the brake to stop the truck.
•An action can be attributed to an actor (a subject) who caused but didn’t actually perform the action.
Even an eight-year-old would have the ability to pull these bits of world knowledge into their understanding of the sentence. In contrast, consider this similar statement:
Superman held up his hand and stopped the truck.
Our understanding of this sentence is vastly different.4 Here, we draw on our knowledge of a science fiction character, and we understand that this character applied a supernatural physical force to stop the truck. Our understanding of these two sentences goes far beyond the meanings of the individual words or the grammatical rules (which are the same in both sentences).
Similarly, if we hear someone say, “I like apples,” we know they are talking about eating them, even though the speaker never mentioned eating.5 If we hear someone say, “John lit a cigarette while pumping gas,” we apply our commonsense reasoning capabilities to our knowledge of the world and recognize that this is a bad idea. Moreover, we then expect the next sentence to tell us whether there was an explosion.
A great deal of knowledge about the world is required to understand natural language, and this knowledge feeds the reasoning processes that make that understanding possible.
It was not too long ago that talking to computers was science fiction. Then suddenly, we began talking to personal assistants, such as Apple’s Siri, Google Assistant, Microsoft Cortana, and Amazon Alexa, on our smartphones and tabletop devices. We engage with these personal assistants regularly throughout the day to check the weather, send text messages, control home thermostats, and perform many other functions. We also engage with customer service chatbots when we use the chat capabilities on a vendor’s website or call its support lines.
You can see the primary AI-based components of a personal assistant in figure 12.1.
Figure 12.1 AI-based components of a personal assistant.
A user speaks a request, and a speech recognition system turns the spoken request into a machine-readable text string. The natural language processing system processes the string and determines the appropriate response. A text-to-speech system then converts that response into spoken audio. Alternatively, users type natural language text directly into the personal assistant and receive typed output generated by the same natural language processing system.
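To make the division of labor concrete, here is a minimal sketch of that three-stage pipeline in Python. The function bodies are invented stand-ins for illustration, not any vendor’s actual components:

def speech_to_text(audio):
    # Stand-in for a speech recognition system: audio in, text out.
    return "what is today's weather"

def natural_language_processing(text):
    # Stand-in for the component that maps a text request to a response.
    if "weather" in text:
        return "Today will be sunny with a high of 75."
    return "Sorry, I didn't understand that."

def text_to_speech(response):
    # Stand-in for a speech synthesizer: text in, audio out.
    return response.encode("utf-8")  # pretend these bytes are audio

def handle_spoken_request(audio):
    return text_to_speech(natural_language_processing(speech_to_text(audio)))

A typed request simply skips the first and last stages and goes directly to the natural language processing component.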
In 1966, MIT professor Joseph Weizenbaum6 developed a chatbot named ELIZA that simulated a Rogerian psychotherapist and could carry on a conversation with a person. Here is an example of a conversation:
Men are all alike.
> IN WHAT WAY?
They’re always bugging us about something or other.
> CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
> YOUR BOYFRIEND MADE YOU COME HERE?
I’m depressed much of the time.
> I AM SORRY TO HEAR YOU ARE DEPRESSED.
It’s true. I am unhappy.
> DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY?
I need some help; that much seems certain.
> WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP?
I could learn to get along with my mother.
> TELL ME MORE ABOUT YOUR FAMILY.
My mother takes care of me.
> WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU?
My father.
> YOUR FATHER.
You are like my father in some ways.
> WHAT RESEMBLANCE DO YOU SEE?
You are not very aggressive, but I think you don’t want me to notice that.
> WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE?
You don’t argue with me.
> WHY DO YOU THINK I DON’T ARGUE WITH YOU?
You are afraid of me.
> DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU?
My father is afraid of everybody.
> WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER?
Bullies.
> DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE?
ELIZA was successful in convincing many users that it understood their problems. However, ELIZA was a simple computer program with a clever set of rules. The program operated on keywords, phrases, and patterns. For example, ELIZA mapped keywords related to family (e.g., mommy, father) to a response such as TELL ME MORE ABOUT YOUR FAMILY. A pattern like everybody . . . me, as in “Everybody is always laughing at me,” would elicit the response WHO LAUGHED AT YOU RECENTLY? Patterns with I believe __ would be answered with DOES IT PLEASE YOU TO BELIEVE ___? ELIZA did not know anything about families, just how to map keyword patterns to responses.
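The flavor of those rules is easy to capture in code. Here is a toy Python sketch of ELIZA-style keyword matching; it is an illustration of the technique, not Weizenbaum’s actual program:

import re

# Each rule pairs a keyword pattern with a canned response template.
RULES = [
    (re.compile(r"\bI believe (.+)", re.IGNORECASE),
     "DOES IT PLEASE YOU TO BELIEVE {0}?"),
    (re.compile(r"\beverybody\b.*\bme\b", re.IGNORECASE),
     "WHO, IN PARTICULAR, ARE YOU THINKING OF?"),
    (re.compile(r"\b(mother|father|mommy|daddy|family)\b", re.IGNORECASE),
     "TELL ME MORE ABOUT YOUR FAMILY."),
]

def respond(utterance):
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "PLEASE GO ON."  # default when no keyword matches

print(respond("I believe my boyfriend misunderstands me"))
# DOES IT PLEASE YOU TO BELIEVE my boyfriend misunderstands me?

The real ELIZA also swapped pronouns (my became your, for example), but the principle is the same: there is no model of families or feelings anywhere, only string patterns mapped to responses.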
Weizenbaum tried to tell his users that ELIZA did not understand what they were saying any more than a parrot would. Yet some users still asked for private time with the system because they felt it understood them. Ten years later, he wrote a book about his experiences with ELIZA and his dismay at the high percentage of users who, in these conversations, would reveal their darkest secrets.
PERSONAL ASSISTANTS AND CUSTOMER SERVICE
Most of the chatbots that we interact with daily, from personal assistants like Siri and Alexa to customer service applications, are descendants of ELIZA in the sense that they provide canned responses to anticipated questions and commands.7
Most of the personal assistant and customer service chatbot vendors enable third-party developers8 to create and publish extensions that become accessible via the personal assistant chat interfaces. The process of creating the natural language processing component of personal assistants is similar for most of the vendors, and, as you will see, it is very similar to ELIZA as well.
Each interaction a user has with a chatbot expresses an intent. The intent might be a command, such as Send a text to Steve saying that we are all meeting at my office tomorrow at 6:00 p.m. It might be a question, such as What is today’s weather? Or it might be a request that triggers a dialogue, such as I’d like to plan a trip to France to go skiing.
The chatbot systems map these intents to internal labels like SendText, GetWeather, and PlanTrip. Many intents will require additional information. For example, a PlanTrip intent would likely need the city traveling from and to, the travel date, and the mode of travel (e.g., car, airplane). These pieces of information also have internal labels like fromCity, toCity, travelDate, and travelMode.
Amazon Alexa terms these fields of additional information slots. The goal of natural language processing is to translate all the different ways a user can ask to send a text, get the weather, or plan a trip into intent and slot labels.
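The output of this translation can be pictured as a simple data structure. The representation below is purely illustrative; each vendor has its own internal format:

# A hypothetical parse of "I want to fly from Boston to Denver on Friday."
parsed_request = {
    "intent": "PlanTrip",
    "slots": {
        "fromCity": "Boston",
        "toCity": "Denver",
        "travelDate": "Friday",
        "travelMode": "airplane",
    },
}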
A developer provides the natural language processing logic through a set of ELIZA-like keyword patterns. For example, for the PlanTrip intent, the developer might provide these keyword patterns:
I am going on a trip on {travelDate}.
I want to visit {toCity}.
I want to travel from {fromCity} to {toCity} on {travelDate}.
I’m {travelMode} from {fromCity} to {toCity}.
The developer would have to provide many more utterances (probably more than one hundred) to cover most of the different ways people might ask for a PlanTrip intent. Amazon provides built-in recognition of several slot types, including cities, dates, and many more. For slot types not provided by Amazon, the developer must enumerate all slot values and keyword patterns for each possible slot value.
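The sketch below shows how such templates could be matched literally. Production systems generalize from the sample utterances using statistical models rather than matching them verbatim, but the input and output are essentially the same:

import re

def compile_template(template):
    # Turn "I want to visit {toCity}." into a regex with named groups.
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for i, part in enumerate(parts):
        # re.split alternates literal text (even indexes) and slot names.
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
    return re.compile(regex + "$", re.IGNORECASE)

TEMPLATES = [
    ("PlanTrip", "I want to travel from {fromCity} to {toCity} on {travelDate}."),
    ("PlanTrip", "I want to visit {toCity}."),
]

def parse(utterance):
    for intent, template in TEMPLATES:
        match = compile_template(template).match(utterance)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return None  # no template matched

print(parse("I want to travel from Boston to Denver on Friday."))
# {'intent': 'PlanTrip', 'slots': {'fromCity': 'Boston',
#  'toCity': 'Denver', 'travelDate': 'Friday'}}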
Alexa developers bundle intents into skills. A skill can support one or several intents. When Alexa recognizes an intent that belongs to a skill, it passes the intent label and slot values to the skill. Alexa developers must create code to handle each intent. The code can be as simple as executing a query against a database and returning a value or document. If the intent is to start a game, the skill code will start a new game of the specified type. If the intent is to play music, the skill will launch a music stream for the specific artist or kind of music.
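In outline, then, a skill is little more than a dispatch table from intent labels to handler functions. The handlers below are hypothetical stand-ins, not code from Amazon’s actual software development kit:

def handle_get_weather(slots):
    # A real handler would query a weather service here.
    return "Today will be sunny with a high of 75."

def handle_plan_trip(slots):
    return f"Searching for travel from {slots['fromCity']} to {slots['toCity']}."

HANDLERS = {
    "GetWeather": handle_get_weather,
    "PlanTrip": handle_plan_trip,
}

def handle_intent(intent, slots):
    handler = HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I can't help with that."
    return handler(slots)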
As of February 2019, third-party developers had created over eighty thousand skills for Alexa.9 In addition to handling the user requests, skills serve another function. Each skill handles one or more intents, which means if there are more than eighty thousand skills, there are well over eighty thousand intents, each with keywords that trigger the intent. It is not hard to imagine that a given user utterance might trigger multiple intents in different skills due to overlapping keywords. Developers also associate keywords with skills. When a user mentions a skill keyword or phrase along with an otherwise ambiguous intent reference, Alexa can determine the intent by picking the one that is part of the mentioned skill.
The Apple, IBM, Facebook, Google, and Microsoft approaches to providing third-party natural language chatbot capabilities are similar, with the critical element being developer-defined sample utterances that cover the range of possible intent phrasings.
The personal assistant vendors also provide other generic capabilities. For example, the vendors program canned ELIZA-like responses to many offbeat questions like “What is the meaning of life?” The vendors also maintain large training tables of user queries and most likely10 label them to train deep learning networks. These systems can then learn to recognize which skill a user is requesting and learn additional canned responses to patterned requests.11
The important thing to recognize here is that all these personal assistant systems rely on simple patterns with canned responses, just like ELIZA.
SOCIAL CHATBOTS
In addition to personal assistant and customer service chatbots that we use in our daily lives, many researchers have been working on social chatbots. The goal is to build a system that can sustain coherent conversations with people on a wide range of topics.
For thirty years, the Loebner Prize spurred the development of conversational chatbots. Started in 1991, the annual competition awards a small prize to the system judged the most human-like. In 2016, Amazon started the Alexa Prize competition, which is somewhat similar to the Loebner Prize. However, it is by invitation only, is limited to university teams, and is taken much more seriously by the academic community.12 The goal is to build a chatbot that can converse on popular topics, such as entertainment, fashion, sports, technology, and politics. Amazon offers $250,000 research grants to participating teams and a $500,000 prize to the best chatbot each year.
A group of University of Washington graduate students won the 2017 Alexa Prize.13 Because there were no large training tables of conversations available to train machine learning systems, the winning entry was built entirely with the Alexa Skills Kit developer tools. The 2018 winning entry, Gunrock, was also built around the Alexa Skills Kit.14
Since then, researchers have focused on applying deep learning techniques to massive training tables of conversations. Milabot,15 Google’s Meena,16 and Facebook’s Blender17 are examples of these research efforts. These systems learn intents and natural language mappings to those intents so that developers do not need to hand code them. However, the systems remain in the research domain for now because they have significant weaknesses, including responses that are overly vague and generic and responses that do not make sense because the systems have no commonsense knowledge or reasoning capabilities.
WATSON DEEPQA
As previously mentioned, in January 2011, IBM shocked the world by beating two former Jeopardy! champions with its Watson DeepQA system. The Watson victory was a prominent feature in press reports and the subject of a popular book.18
The Jeopardy! rules do not allow contestants to access the internet, so part of the challenge was to collect and organize a massive set of documents and store them in the DeepQA system. Wikipedia was the primary document source, because DeepQA could answer about 95 percent of Jeopardy! questions using Wikipedia articles. For example, consider this Jeopardy! clue:19
Topic: Tennis
Answer: The first US men’s national singles championship, played right here in 1881, evolved into this New York City tournament.
Response: What is the US Open?
This clue contains three facts about the US Open (it was the first US men’s singles championship, it was first played in 1881, and it is now played in New York City) that are all found in the Wikipedia article titled “US Open (tennis).”
The IBM team also found that DeepQA could answer many of the other 5 percent with dictionary definitions, so they incorporated a modified version of Wiktionary (a crowdsourced dictionary). Like Wikipedia, Wiktionary was already title-oriented. The team created title-oriented documents in categories that included Bible quotes, song lyrics, literary quotes, and book and movie plots. The developers stored this set of documents on DeepQA’s hard drive, which the system accessed whenever Jeopardy! presented a clue. The DeepQA system used various strategies for matching questions to this set of documents. Nearly all of the matching algorithms were word-oriented; that is, the matching algorithms correlated words in the questions to words in the documents.
To understand the challenge for the creators of DeepQA, imagine that you have a Jeopardy! clue in a language you do not understand. Even though you might know the answer if the clue were in your native language, you cannot even understand the clue. How would you proceed? You could create an internet search using words from the clue and look at the returned foreign-language documents, but in most cases, you would not know when you were reading a document with the correct answer. Similarly, DeepQA was incapable of using the meaning of the text to find the document with the correct answer. Instead, the first step for DeepQA was to identify Wikipedia and other documents whose titles were candidate answers. It did not search the document set for a single right answer. Instead, it searched for every document that contained words that matched entities and relations in the clue. Entities are people, places, and things, and relations connect entities. For example, the relation BornIn20 connects the entities Barack Obama and Hawaii.
The most serious difficulty the DeepQA team had to overcome was that entities and relations in both the clue and documents were just strings of words. For example, many different phrases could be a reference to former US president Clinton, including Bill Clinton, William Jefferson Clinton, the 42nd president of the US, and President Clinton.
If the clue contained the words the 42nd President of the US and a document contained the words Bill Clinton, and if the matching algorithm used only the individual words in both the clue and the document, then the document would not be (or contain) a candidate answer.
To accomplish the matching, DeepQA used a technique called entity linking,21 which matches the entity22 words in both the clue and the documents to strings in Wikipedia articles that have hyperlinks. The target of the hyperlink then serves as the common reference. For example, the strings above might all occur in different Wikipedia articles, but they all hyperlink back to the primary article on the former president. The IBM researchers used supervised learning to train classifiers to extract relations, using training tables created from Wikipedia sentences that had been hand-labeled with relations.23
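A drastically simplified sketch of entity linking appears below. The lookup table is a tiny, hand-made stand-in for the millions of hyperlink anchor texts that can be mined from Wikipedia:

# Strings that appear as hyperlink text in Wikipedia, mapped to the
# article each link points to (illustrative entries only).
ANCHOR_TEXT_TO_ARTICLE = {
    "bill clinton": "Bill Clinton",
    "william jefferson clinton": "Bill Clinton",
    "the 42nd president of the us": "Bill Clinton",
    "president clinton": "Bill Clinton",
}

def link_entity(mention):
    return ANCHOR_TEXT_TO_ARTICLE.get(mention.lower())

# Different surface strings now resolve to one canonical entity:
print(link_entity("the 42nd President of the US"))  # Bill Clinton
print(link_entity("President Clinton"))             # Bill Clinton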
In addition to extracting those entities and relations, DeepQA extracted several other types of Jeopardy!-specific information, such as the type of response required (e.g., who/what/where, definitions, multiple-choice, fill-in-the-blanks, and others). The lexical answer type (LAT) was one of the most important pieces of this information. You can think of the LAT as a type of concept. For example, a clue might indicate that the response is a type of dog. In this case, dog would be the LAT.
Here, also, we can run into a word-matching problem. A clue might reference the concept associated with the word dog by many different words and phrases, including dog, canine, and man’s best friend. Fortunately, there are open source24 dictionaries of concepts (ontologies) that provide mappings from keyword patterns to each concept. DeepQA used an open source ontology named YAGO to map clue and document words, including the LAT words, onto YAGO concepts. There is also an open source mapping from Wikipedia to YAGO named DBPedia, and DeepQA used it to map possible answers to YAGO concepts.25
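The effect of such an ontology lookup can be sketched in a few lines. The entries are illustrative; the real YAGO and DBPedia mappings contain millions of entries:

# Many words and phrases map to a single concept, so a LAT of "canine"
# can be compared with a candidate answer typed as "dog."
WORD_TO_CONCEPT = {
    "dog": "Dog",
    "canine": "Dog",
    "man's best friend": "Dog",
}

def same_concept(word_a, word_b):
    concept_a = WORD_TO_CONCEPT.get(word_a)
    concept_b = WORD_TO_CONCEPT.get(word_b)
    return concept_a is not None and concept_a == concept_b

print(same_concept("canine", "dog"))  # True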
The matching process typically returns an extensive list of candidate answers, although people would not consider most of them valid candidates. Suppose the clue asks for the 42nd president. Suppose further that the system picked all candidate answers from Wikipedia that matched on the number 42. Then, the candidate answer set might include Kepler-42 (a red dwarf star), NGC-42 (a spiral galaxy), the atomic number of molybdenum, Device42 Inc (my company), Level 42 (a pop/rock band), an episode of the TV show Doctor Who, a film about Jackie Robinson, Tokyo 42 (a video game), the third primary pseudo-perfect number (you can look that one up), and the answer to the question What is the meaning of everything? from the book The Hitchhiker’s Guide to the Galaxy. A human would not consider any of these to be possible answers.
DeepQA used over fifty separate scoring algorithms to create a set of scores for each candidate answer. Different scorers analyzed factors like how many words in the document matched words in the clue, how well the word orders matched, and whether the dates for the candidate answer were consistent with those in the question (e.g., if the question asked about the 1900s, someone who lived only in the 1800s was not a plausible answer). DeepQA matched temporal references using conventionally coded rules against DBPedia, which has dates (e.g., birth/death dates of people) and durations of events. DeepQA also stored temporal information in a database that contained the number of times it found a date range in the source texts for an entity. Another scorer used this database to rate time period references in candidate answers. Similarly, location scorers assessed the geographic compatibility of the question and candidate answer. DeepQA matched longitude and latitude specifications against location data stored in DBPedia and other sources. Other automated scorers checked conventionally coded rules, such as “an entity cannot be both a country and a person.” These are only a few examples of more than two hundred rules in DeepQA.
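The shape of the scoring stage is easy to sketch: each scorer contributes one number, and the numbers are collected into a feature vector for each candidate answer. The two scorers below are toy stand-ins for DeepQA’s fifty-plus:

def word_overlap_score(clue_words, doc_words):
    # Fraction of clue words that also appear in the candidate's document.
    clue_set = set(clue_words)
    return len(clue_set & set(doc_words)) / max(len(clue_set), 1)

def date_consistency_score(clue_years, life_span):
    # 1.0 if any year mentioned in the clue falls within the candidate's
    # life span; 0.0 otherwise (the 1800s-only person is ruled out).
    start, end = life_span
    return 1.0 if any(start <= year <= end for year in clue_years) else 0.0

def feature_vector(clue, candidate):
    return [
        word_overlap_score(clue["words"], candidate["words"]),
        date_consistency_score(clue["years"], candidate["life_span"]),
        # ...DeepQA derived 550 such features from its scorers
    ]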
Finally, DeepQA ranked the candidate answers using a supervised learning algorithm whose input was a set of 550 computed input variables derived from all the different scores output by all the different scorers.26 IBM researchers trained a classifier to compare each pair of candidate answers and identify the better answer of the pair. This learned function produced a ranking of the answers. The IBM team trained the supervised learning algorithm on 25,000 Jeopardy! clues, with 5.7 million clue–answer pairs (including both correct and incorrect answers). The answer produced by DeepQA is the candidate response with the highest rank (converted into a question).
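The ranking step can be sketched as a pairwise learner: train a model to prefer the better answer of each pair, then sort candidates by the learned scoring function. The perceptron-style updates below illustrate pairwise ranking in general, not IBM’s actual algorithm:

def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def train_pairwise(pairs, n_features, epochs=10, rate=0.1):
    # pairs: (features of correct answer, features of incorrect answer)
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):  # misranked
                for i in range(n_features):
                    weights[i] += rate * (better[i] - worse[i])
    return weights

def rank(candidates, weights):
    # The highest-ranked candidate becomes the response.
    return sorted(candidates, key=lambda c: score(weights, c), reverse=True)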
The DeepQA system was a fantastic feat of engineering. The result was a system that appears to understand complex questions in English. However, DeepQA did not understand English any more than our hypothetical non-English-speaking Jeopardy! contestant. Under the hood, there was a massive set of mostly conventionally coded rules that performed word-oriented matching of clues to documents. The rules were so cleverly constructed that the system worked without understanding English!27
A large body of question answering research followed the DeepQA success. Much of that research was focused on building systems that could perform reading comprehension tests. These tests are similar to the tests most of us took in high school in which we read a block of text and then answered one or more questions about the text.
One example is SQuAD (the Stanford Question Answering Dataset).28 SQuAD is large enough for machine learning, with 107,000 crowdsourced questions based on passages from 536 Wikipedia articles. An example of a SQuAD passage and questions is shown below.
Passage: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel, and hail. . . . Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers.”
Question: What causes precipitation to fall?
Answer: Gravity.
Question: Where do water droplets collide with ice crystals to form precipitation?
Answer: Within a cloud.
The development of large training tables like SQuAD enabled the application of neural networks to these tasks. Many reading comprehension systems use the encoder–decoder architecture that researchers developed for machine translation. The encoder learns word embeddings within both the question and the text. The decoder learns to find the portion of text that is most like the query.29 Microsoft and Alibaba both developed systems that could perform the SQuAD task at the same level as humans and used these accomplishments to claim that their systems read as well as people.30
At first blush, it might appear that AI systems that can perform at human levels are reading and understanding these passages. However, deeper analysis shows that these systems merely use word matching to locate the sentences in the passage that best match the words in the question and then align the words in the passage with those in the question. They learn to match based on words, syntax, and word embeddings.31 Like the DeepQA system, they use these surface-level strategies to match questions to passages. In contrast, people perform reading comprehension tasks by understanding the meaning of questions and passages and using a matching strategy based on meanings.
If you look closely at the SQuAD example above, you can see that a surface-level strategy is enough to identify the correct sentences: the words precipitation and fall in the first question match the sentence containing “falls under gravity,” and the words water, droplets, collide, ice, and crystals in the second question match the sentence about droplets coalescing within a cloud.
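That surface-level strategy takes only a few lines to implement. Here is a minimal sketch that picks the passage sentence sharing the most words with the question, which is enough to answer the second precipitation question:

import re

def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def best_sentence(question, passage):
    # Return the sentence with the largest word overlap with the question.
    sentences = re.split(r"(?<=\.)\s+", passage)
    return max(sentences, key=lambda s: len(word_set(question) & word_set(s)))

passage = ("In meteorology, precipitation is any product of the condensation "
           "of atmospheric water vapor that falls under gravity. Precipitation "
           "forms as smaller droplets coalesce via collision with other rain "
           "drops or ice crystals within a cloud.")
question = "Where do water droplets collide with ice crystals to form precipitation?"
print(best_sentence(question, passage))  # the sentence ending "within a cloud"

No meaning is involved: the function would select the same sentence for a question it has no hope of answering, so long as the words overlap.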
Google DeepMind researchers created a similar training table by extracting news articles from CNN and the Daily Mail.32 Both publishers provide bullet-point summaries of their articles. The researchers simply deleted an entity reference from a summary to create a question.
Passage: The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the Top Gear host, his lawyer said Friday. Clarkson, who hosted one of the most watched television shows in the world, was dropped by the BBC Wednesday after an internal investigation by the British broadcaster found he had subjected producer Oisin Tymon “to an unprovoked physical and verbal attack.”
Question: Producer X will not press charges against Jeremy Clarkson, his lawyer says.
Answer: Oisin Tymon.
Researchers studying this set of data concluded that systems could answer most questions by finding the single most relevant sentence in the passage. That sentence, once found, greatly restricts the number of possible entities in the answer, often to just one.33 Therefore, the task does not require understanding the passage. It merely requires finding the sentence whose words most closely match the words in the question.
Another set of Stanford University researchers34 did some experiments to determine whether the systems scoring highest on SQuAD were engaging in human-like understanding. They added a syntactically correct but irrelevant and factually incorrect sentence to each passage. For example, here is an original SQuAD passage:
Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl, at age thirty-nine. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age thirty-eight and is currently Denver’s executive vice president of football operations and general manager.
Question: What is the name of the quarterback who was thirty-eight in Super Bowl XXXIII?
Answer: John Elway.
The Stanford researchers added this sentence to the passage:
Quarterback Jeff Dean had jersey number thirty-seven in Champ Bowl XXXIV.
Jeff Dean was not a quarterback. He was the head of Google AI research. They found that this added sentence caused performance to decrease by over 50 percent and led the system to give answers like Jeff Dean to the question. They concluded that these systems are merely learning very superficial rules (e.g., taking the last entity mentioned) and not exhibiting deep understanding or reasoning at all.35
NO AGI HERE
Natural language processing is difficult for computers because it requires a great deal of commonsense knowledge and extensive reasoning that is based on that knowledge. Researchers have developed some impressive natural language processing systems, including personal assistants, Watson DeepQA, machine translation systems, and systems that appear to read and answer questions at a human level.
Personal assistants can only parrot back responses to programmed word patterns. DeepQA used a wide variety of clever, conventionally programmed techniques to respond to clues but did not exhibit human-level understanding of the clues. The systems that achieve high scores on reading comprehension tests use simple word-oriented matching techniques.
None of these systems is AGI or anything close to it. Like every other system we have discussed, they can only process natural language for a very narrowly defined task.