4 Making It and Faking It

We have seen how intelligent behavior could depend on knowledge, at least in that conversation about the movie 2001. But as part of something like a Turing Test, that one statement about the release date of the movie does not tell us much about the person who made it. For all we know, we might be talking to a mindless zombie who has nothing but canned responses. Mention a pastrami sandwich, and the zombie now says “It is hard to believe the pastrami sandwich first came out in 1968.” (It wouldn’t take long for this one-trick pony to fail the Turing Test!)

But we can also imagine a more sophisticated zombie, one that is able to say a much wider range of things about pastrami sandwiches by reciting sentences it finds on the Internet. These days, we are quite used to augmenting what we know using online texts and a search engine. To find out the population of Denver, we look for a sentence of the form “The population of Denver is … .” Of course, some of those sentences will say things like “The population of Denver is 40 percent Hispanic,” but we can work around that. If we want to find out if Denver has more people than Winnipeg, maybe no single online text has the answer we want, but we can get the desired answer by looking for a text for each city. (Winnipeg has slightly more.)
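
To make this kind of pattern-based lookup concrete, here is a minimal sketch in Python. The three-sentence corpus standing in for the web is our own invention (the census figures are the published ones), and the lookahead in the pattern is one crude way to "work around" sentences like the 40-percent-Hispanic one:

```python
import re

# A tiny stand-in for a searchable collection of online texts.
corpus = [
    "The population of Denver is 715,522 as of the 2020 census.",
    "The population of Denver is 40 percent Hispanic.",
    "The population of Winnipeg is 749,607 as of the 2021 census.",
]

def population_of(city):
    """Look for a sentence of the form 'The population of CITY is N.'
    The negative lookahead skips sentences where the number turns out
    to be a percentage rather than a population count."""
    pattern = re.compile(
        rf"The population of {city} is ([\d,]+)(?!\d| percent)")
    for sentence in corpus:
        match = pattern.search(sentence)
        if match:
            return int(match.group(1).replace(",", ""))
    return None  # no usable sentence found anywhere

# No single text compares the two cities; two separate lookups do the job.
print(population_of("Winnipeg") > population_of("Denver"))  # True
```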

This raises some interesting questions from an AI point of view: What exactly is the difference between a system knowing something and a system that can find sentences about it in a large database of stored texts? Do we imagine that “big data” (in text form) can play the role in intelligent systems that we assumed was held by knowledge? Could it be that intelligent behavior is really no more than being able to carry on a certain illusion convincingly, such as being able to participate in a conversation without having to understand what is being discussed, bolstered perhaps by the ability to quote from canned text? If our goal is to understand intelligent behavior, we had better understand the difference between making it and faking it.

This is what we will attempt to do in this chapter.

Of course we are usually very much aware when we ourselves are faking it. We know what it feels like to know that in the animal kingdom, it is the ducks that quack, not the chickens, and it sure feels like more than just being able to recite some memorized sentences that happen to have the words “duck” and “quack” close to one another. (Consider: “The veterinarian they sent to look at my duck was a quack.”) But let us put that feeling aside. As we said in chapter 2, we are concerned with observable behavior, not with what introspection appears to tell us.

So the real question for this chapter is whether or not there is some form of observable behavior that would be difficult to exhibit without appropriate knowledge. The problem is that someone may be able to pass as knowledgeable without knowing very much. We have all heard stories about imposters playing the role of doctors, lawyers, and airline pilots without knowing medicine, law, or aircraft operation. There is the famous case of Frank Abagnale, depicted in the 2002 movie Catch Me If You Can, who did all three! Many of us have no doubt wondered whether some of the “experts” on TV who appear to be talking knowledgeably about the global economy are really just more of the same.

The trouble with conversations

What comes out of the Turing Test is an emphasis on externally observable behavior. Our concern in this chapter is that an informal conversation may be behavior that is simply too easy to fake.

The trouble with the Imitation Game is that it places all the burden on deception. In the end, what it asks is not whether a computer program can carry on a conversation as well as a person, but whether a computer program can fool interrogators into thinking they are carrying on a conversation with a person. As every trickster knows, these two questions are not the same. A magician does not need to have ESP to wow an audience; he just needs to be able to fool the audience into thinking he has ESP, using all the tricks of the trade. (Or to be more precise, he has to be able to do the trick without giving the audience a better story to tell about how he does it.) And an informal conversation as suggested by Turing gives a trickster a lot of room to maneuver.

Consider someone impersonating a doctor on rounds in a hospital. It might be sufficient to glance knowingly at a patient’s chart and say very little, in terms of conversation, but with doctorly confidence and authority. In the movie Catch Me If You Can, the Frank Abagnale character deflects attention away from himself by asking an intern “Do you concur?”, a phrase he picked up watching Dr. Kildare on TV.

Consider next the case of ELIZA. This was a computer program written by Joseph Weizenbaum in the 1960s that was able, using very simple means, to fool some people into believing they were conversing with a psychiatrist. The deception works at least in part because again we are extremely forgiving in terms of what we will accept as legitimate conversation.

Here is a snippet of a conversation with Oliver Miller as the patient:

Patient: I only really feel happy when I’m writing.
ELIZA: How long have you been writing?
Patient: Always, but I’ve been writing seriously since I was eighteen.
ELIZA: What incident are you thinking of?
Patient: Um.
ELIZA: What does that suggest to you?
Patient: I don’t think it’s enough, to only be happy when you write.
ELIZA: Does that trouble you?
Patient: … Um, because you can’t write all the time. And then there’s the real world, and you still have to deal with real people.
ELIZA: What are your feelings now?
Patient: I’ve been thinking a lot about that poem by Yeats: “The Man and the Echo.” Do you know it?
ELIZA: Do you say you’ve been thinking a lot of that poem by Yeats: “The Man and the Echo” for some special reason?
Patient: Yes.

A Rogerian psychiatrist may say very little except to encourage a patient to keep on talking, but it may be enough, for a while. (Incidentally, the ELIZA program is small enough that a version is included just for fun in the Emacs text editor, and so ends up preinstalled on every Mac OS X system.)
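
The machinery behind such a program is remarkably modest: keyword patterns, canned responses that echo fragments of the input back, and a stock of content-free prompts when nothing matches. Here is a minimal sketch in that spirit (the rules are illustrative inventions, not Weizenbaum’s actual script; the first one reproduces the opening exchange above):

```python
import random
import re

# Each rule pairs a keyword pattern with canned responses; {0} echoes
# back a captured fragment of what the patient just said.
RULES = [
    (r"when I'm (\w+)", ["How long have you been {0}?"]),
    (r"\bI am (\w+)", ["How long have you been {0}?"]),
    (r"\bmy (mother|father|family)", ["Tell me more about your {0}."]),
]

# Content-free prompts used when no keyword matches.
DEFAULTS = [
    "What does that suggest to you?",
    "Does that trouble you?",
    "What are your feelings now?",
]

def respond(utterance):
    for pattern, responses in RULES:
        match = re.search(pattern, utterance, re.IGNORECASE)
        if match:
            return random.choice(responses).format(*match.groups())
    return random.choice(DEFAULTS)

print(respond("I only really feel happy when I'm writing."))
# -> How long have you been writing?
print(respond("Um."))  # no keyword: falls back to a canned prompt
```

Note that nothing here understands anything about writing or happiness; the program is shuffling strings, and the patient supplies all the meaning.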

Finally, consider the Loebner competition, a restricted version of the Turing Test held as a contest annually since 1990. In this case, we see much more balanced conversations taking place than with ELIZA. What is striking about transcripts of these conversations is the fluidity of the responses from the test subjects: elaborate wordplay, puns, jokes, quotations, asides, emotional outbursts, points of order. Everything, it would seem, except clear and direct answers to questions. And how is an interrogator supposed to deal with the prevarication and determine whether or not there is any real comprehension behind the verbal acrobatics? More conversation. “I’d like to get back to what you said earlier.” Unsurprisingly, short conversations are usually quite inconclusive, and even with long ones, two interrogators looking at the same transcript may disagree on what the conversation tells us.

A case in point was the computer program called EUGENE GOOSTMAN, which was classified as human by a third of the interrogators in a 2014 contest. The program pretended to be a 13-year-old Ukrainian boy conversing in English, thereby accounting for both a poor command of the language and a certain playful evasiveness.

Here is a sample conversation with Scott Aaronson from MIT:

Scott: Which is bigger, a shoebox or Mount Everest?
EUGENE: I can’t make a choice right now. I should think it out later. And I forgot to ask you where you are from …
Scott: How many legs does a camel have?
EUGENE: Something between 2 and 4. Maybe, three?:-))) By the way, I still don’t know your specialty—or, possibly, I’ve missed it?
Scott: How many legs does a millipede have?
EUGENE: Just two, but Chernobyl mutants may have them up to five. I know you are supposed to trick me.
Scott: No, I need to know that you’re not a chatbot. Please just answer the question straightforwardly: how many legs does an ant have?
EUGENE: Something between 2 and 4. Maybe, three?:-))) Oh, what a fruitful conversation ;-)
Scott: Do you understand why I’m asking such basic questions? Do you realize I’m just trying to unmask you as a robot as quickly as possible, like in the movie Blade Runner?
EUGENE: … wait
Scott: Do you think your ability to fool unsophisticated judges indicates a flaw with the Turing Test itself, or merely with the way people have interpreted the test?
EUGENE: The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
Scott: Do you think Alan Turing, brilliant though he was, had trouble imagining that the judges of his “imitation game” wouldn’t think to ask commonsense questions like the ones above—or that, if they did, they’d actually accept evasion or irrelevant banter as answers?
EUGENE: No, not really. I don’t think alan turing brilliant although this guy was had trouble imagining that the judges of his imitation game would not consider to Oooh. Anything else?
Scott: In your opinion, does your existence demonstrate any idea or principle that wasn’t demonstrated just as convincingly by ELIZA, Joseph Weizenbaum’s chatbot from the 1960s?
EUGENE: ELIZA was a break-thru. All the bots after it were nothing but weak parodies, claiming to have “revolutionary improvements.”
Scott: Hey, that’s the first sensible thing you’ve said!

In the end, the Turing Test has not really inspired AI researchers to develop better conversationalists; it has led only to better ways of fooling interrogators. We might have been hoping for first-rate intelligence to come out of it, but what we got was more like first-rate stage magic.

Answering questions

Given the lack of control in an informal conversation, it makes sense to shift to a more controlled setting, an artificial one, where intelligent behavior will be harder to fake.

Imagine a psychological experiment where a test subject is shown a series of yes/no questions, and can answer only by pushing a green button for yes, or a red button for no. Without wanting to test for particular specialized knowledge (about law, medicine, aircraft operation, high school physics, whatever), we still want to design the test questions in such a way that subjects will be able to answer them using what they know. To probe for common sense, we want the questions to place the subject in a new, unfamiliar situation.

Consider, for example, a question like this:

Could a crocodile run a steeplechase?

Assume for the sake of argument that the subject in question knows what crocodiles and steeplechases are. (For those who do not know, a steeplechase is a horse race, similar to the usual ones, but where the horses must jump over a number of hedges on the racetrack. So it is like hurdles for horses.) Given this, our subject should be able to push the correct button—the red one—to answer the question. (We are assuming “ideal” test subjects here, competent and highly motivated.)

What is interesting about this question is that while much has been said and written about crocodiles and about steeplechases, nobody talks or writes about them together! So this represents a new situation, and the question cannot be answered by looking for a stock answer. Access to online text is no help. To put it differently, even if we were to assume that everything that had ever been written or spoken by anyone anywhere was available to be searched online, it would not contain an answer to the question. (This is not quite right. I have used this example before and so my text about crocodiles and steeplechases can be found online. But this is a quibble.)

Here is another example:

Should a team of baseball players be allowed to glue small wings onto their caps?

Again no one (other than me) has ever said or written anything about this topic. Again there is nothing that can be looked up, but again an answer should occur to a subject who knows baseball. (In this case, the answer will seem so obvious that the subject might worry that there is some sort of trick to the question. There is none.)

What is apparent in questions like these is what is called a long tail phenomenon, which will be discussed in detail in chapter 7. The idea roughly is that while most queries to online search engines center around a small number of very common topics (entertainment, sports, politics, cat videos, and so on), a significant portion is well away from the center (that is, out on a long tail) and involves topics that show up only very rarely. While the common queries can readily be answered using online texts, the very rare ones, far down on the tail, like the question about crocodiles or about gluing things onto baseball caps, will not appear anywhere in online texts.

And yet people are still quite clearly capable of answering them.

So have we found a form of intelligent behavior that requires using knowledge? No, not quite. Although subjects cannot look up the answers anywhere, they may still be answering the questions by other means.

The intent of the crocodile question was clear. It would be answered by thinking it through: a crocodile has short legs; the hedges in a steeplechase would be too tall for the crocodile to jump over; and hence no, a crocodile cannot run a steeplechase.

But there can be other ways of answering that do not require this level of understanding. One is to use the so-called closed-world assumption. This assumption says (among other things) the following:

If no evidence can be found for the existence of something, one may assume that it does not exist.

This is how we deal with questions like “Are there any female world leaders who are over seven feet tall?” It is not that we are ever told that there are none; rather, we surmise that there are none, believing that if there were one, we would have heard about her by now. For the crocodile question above, a subject might say “Since I have never heard of a crocodile being able to run a steeplechase [for example, since I cannot find any relevant texts about it], I conclude that it cannot.” End of story.

Note that this is somewhat of a cheap trick: it happens to get the answer right in this case, but for dubious reasons. It would produce the wrong answer for a question about gazelles, for example. Nonetheless, if all we care about is answering the crocodile question correctly, then this cheap trick does the trick.
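
In code, the trick is almost embarrassingly simple. A sketch, where co-occurrence in a small text collection stands in for “evidence” (the corpus is our own illustration):

```python
def evidence_for(corpus, *terms):
    """Is there any text in which all the terms appear together?"""
    return any(all(term in text.lower() for term in terms)
               for text in corpus)

def can_run_steeplechase(corpus, animal):
    # Closed-world assumption: no evidence found means no.
    return evidence_for(corpus, animal, "steeplechase")

corpus = [
    "horses and riders train for the steeplechase at the track",
    "crocodiles have short, powerful legs",
    "gazelles leap effortlessly over tall hedges",
]

print(can_run_steeplechase(corpus, "crocodile"))  # False: right answer
print(can_run_steeplechase(corpus, "gazelle"))    # False: wrong answer!
```

The gazelle line is the point: absence of evidence is doing all the work here, not any understanding of legs and hedges.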

Can we modify our psychology test somewhat so that cheap tricks like this will not be sufficient to produce the required behavior?

This unfortunately has no easy answer. The best we can do, perhaps, is to carefully construct a suite of questions and then study how subjects might be able to answer them. Some promising approaches have been suggested by others, but let us turn to a specific proposal by Ernie Davis, Leora Morgenstern, and me.

Winograd schemas

As before, we are considering an imagined psychology experiment where a subject must answer questions. Again, there will be just two possible answers that a subject can choose using two buttons. The questions will always be of the same form, best illustrated with an example:


Joan made sure to thank Susan for all the help she had given.

Who had given the help?

  • Joan
  • Susan

We call questions like these Winograd schema questions, characterized as follows:

  1. Two parties are mentioned (both are males, females, objects, or groups). In the example above, it is two females, Joan and Susan.
  2. A pronoun is used to refer to one of them (“he,” “she,” “it,” or “they,” according to the parties). In the example above with females, the pronoun is “she.”
  3. The question is always the same: what is the referent of the pronoun? The question above is: who is the “she” who had given the help?
  4. Behind the scenes, there are two special words for the schema. There is a slot in the schema that can be filled by either word. The correct answer depends on which special word is chosen. In the above, the special word used is “given,” and the other special word (which does not appear) is “received.”

So each Winograd schema actually generates two very similar questions:


Joan made sure to thank Susan for all the help she had given.

Who had given the help?

  • Joan
  • Susan

And


Joan made sure to thank Susan for all the help she had received.

Who had received the help?

  • Joan
  • Susan

It is this one-word difference between the two questions that helps guard against using the cheapest of tricks on them.
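
Behind the scenes, then, a schema can be pictured as a small template: a sentence with a slot for the special word, the two parties, and the correct answer for each choice of word. A sketch of one possible representation (the field names are ours, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str   # template with a {word} slot and an ambiguous pronoun
    question: str   # the referent question, with the same slot
    parties: tuple  # the two candidate referents
    words: dict     # each special word mapped to its correct referent

    def generate(self):
        """Yield the two questions this schema generates."""
        for word, answer in self.words.items():
            yield (self.sentence.format(word=word),
                   self.question.format(word=word),
                   answer)

joan = WinogradSchema(
    sentence="Joan made sure to thank Susan for all the help she had {word}.",
    question="Who had {word} the help?",
    parties=("Joan", "Susan"),
    words={"given": "Susan", "received": "Joan"},
)

for sentence, question, answer in joan.generate():
    print(sentence, question, answer)
```

Filling the slot at test time is then a purely mechanical matter, which is what will make the automated test described below possible.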

To get a better sense for what is involved in the test, here are some additional examples. The first is one that is suitable even for young children:


The trophy would not fit in the brown suitcase because it was too small. What was too small?

  • the trophy
  • the brown suitcase

In this case, the special word used is “small” and the other word is “big.” Here is the original example, which is due to Terry Winograd, for whom the schema is named:


The town councilors refused to give the angry demonstrators a permit because they feared violence. Who feared violence?

  • the town councilors
  • the angry demonstrators

Here the special word is “feared” and the alternative word is “advocated.”

With a bit of care, it is possible to come up with Winograd schema questions that exercise different kinds of expertise. Here is an example about certain materials:


The large ball crashed right through the table because it was made of Styrofoam. What was made of Styrofoam?

  • the large ball
  • the table

The special word is “Styrofoam” and the alternative is “steel.” This one tests for problem-solving skill:


The sack of potatoes had been placed below the bag of flour, so it had to be moved first. What had to be moved first?

  • the sack of potatoes
  • the bag of flour

The special word is “below” and the alternative is “above.” This example tests for an ability to visualize:


Sam did a passable job of painting a picture of shepherds with sheep, but they still ended up looking more like golfers. What looked like golfers?

  • the shepherds
  • the sheep

The special word used is “golfers” and the other is “dogs.”

Sentences that appear in Winograd schemas are typically constructed very carefully for the purpose, although it is possible to find naturally occurring examples that are related. Consider this exchange from the 1980 comedy movie Airplane!:

Elaine: You got a letter from headquarters this morning.
Ted: What is it?
Elaine: It’s a big building where generals meet, but that’s not important.

Note the two nouns “letter” and “headquarters,” the pronoun “it,” and the joke of somehow managing to get the referent wrong.

Of course not just any question that is superficially of the right form will do the job here. It is possible to construct questions that are “too easy,” like this one:


The racecar easily passed the school bus because it was going so fast. What was going so fast?

  • the racecar
  • the school bus

Here the special word is “fast” and the alternative is “slow.”

The problem is that this question can be answered using the following trick: ignore the first sentence completely, and check which two words co-occur more frequently in online texts (according to Google, say): “racecar” with “fast” or “school bus” with “fast.” A simpler version of the same phenomenon can be seen in this example:


The women stopped taking the pills because they were carcinogenic. Which individuals were carcinogenic?

  • the women
  • the pills

Here the special word is “carcinogenic” and the alternative is “pregnant.”
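
The trick itself takes only a few lines; a sketch, with invented co-occurrence counts standing in for search-engine hit counts:

```python
# Invented counts standing in for how often two phrases appear near
# each other in online text (e.g., search-engine hit counts).
COUNTS = {
    ("racecar", "fast"): 120_000,
    ("school bus", "fast"): 2_500,
    ("the pills", "carcinogenic"): 9_000,
    ("the women", "carcinogenic"): 150,
}

def guess_referent(candidates, special_word):
    """Ignore the sentence entirely; pick whichever candidate
    co-occurs more often with the special word."""
    return max(candidates,
               key=lambda c: COUNTS.get((c, special_word), 0))

print(guess_referent(["racecar", "school bus"], "fast"))           # racecar
print(guess_referent(["the women", "the pills"], "carcinogenic"))  # the pills
```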

Questions can also be “too hard,” like this one:


Frank was jealous when Bill said that he was the winner of the competition. Who was the winner?

  • Frank
  • Bill

Here the special word is “jealous” and the alternative is “happy.”

The problem is that this question is too ambiguous when the “happy” variant is used. Frank could plausibly be happy because he is the winner or because Bill is. (It is possible to correct for these extreme cases, but let us not worry about that here.)

A Winograd schema question need not be appropriate for every possible test subject. The trophy/suitcase example above might be suitable for children, for example, but the town councilor/demonstrator one likely would not be. The Styrofoam/steel question is not suitable for subjects who have no idea what Styrofoam is. In general, a Winograd schema question would need to be carefully vetted before it can be used in a test. At the very least, we would want to ensure that a test subject knows the meaning of all the words that will appear in the question.

Given these considerations, it is now possible to formulate an alternative to the Turing Test. A suite of pretested Winograd schemas is first hidden in a library. A Winograd Schema Test involves asking a number of these questions, choosing at random one of the two special words, with a strong penalty for wrong answers (to preclude guessing). The test can be administered and graded in an automated way. No expert judges are needed.
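
To see how the administration and grading could be automated, here is a sketch of a scorer, reusing the hypothetical WinogradSchema representation sketched earlier; the particular penalty value is our own choice:

```python
import random

def administer_test(schemas, subject, penalty=2.0):
    """Ask each schema once, with one of its two special words chosen
    at random. Wrong answers cost more than right answers earn, so a
    button-guessing subject scores below zero on average."""
    score = 0.0
    for schema in schemas:
        word = random.choice(list(schema.words))
        sentence = schema.sentence.format(word=word)
        question = schema.question.format(word=word)
        answer = subject(sentence, question, schema.parties)
        score += 1.0 if answer == schema.words[word] else -penalty
    return score

# A guesser averages (1.0 - 2.0) / 2 = -0.5 points per question.
guesser = lambda sentence, question, parties: random.choice(parties)
print(administer_test([joan], guesser))
```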

To summarize: With respect to the Turing Test, we agree with Turing that when it comes to intelligence (or thinking or understanding), the substantive question is whether a certain observable behavior can be achieved by a computer program. But a free-form conversation as advocated by Turing may not be the best vehicle for a formal test, as it allows a cagey subject to hide behind a smokescreen of playfulness, verbal tricks, and canned responses. Our position is that an alternative test based on Winograd schema questions is less subject to abuse, though clearly much less demanding intellectually than engaging in a cooperative conversation (about sonnets, for example, as we saw in the discussion of the Turing Test in chapter 1).

The lesson

Returning to the main point of this chapter, the claim here is simple: with no special training, normally abled English-speaking adults will have no trouble answering Winograd schema questions like those above, getting close to all of them correct.

Within this artificial setting, this is the sort of intelligent behavior we want to concentrate on. This is the data that our account of intelligent behavior needs to explain!

As scientists, we can think of this as a natural phenomenon to study, like gravity or photosynthesis. We ask: how do we explain how people are able to do this? Clearly, people do not merely recall something they heard or read. Like the question about crocodiles, the answers do not appear in texts anywhere. Can the behavior be faked using some sort of cheap trick? Perhaps, but the one-word difference between the two versions of each question now makes this much less likely. (See the section “Winograd schemas again” in chapter 7 for more on this.)

Consider once again the Styrofoam/steel question above. We might contemplate using special words other than “Styrofoam” and “steel” in the question. For “granite,” the answer would be “the large ball”; for “balsa wood,” it would be “the table”; and so on. But suppose we were to use a completely unknown word in the question:


The large ball crashed right through the table because it was made of kappanium. What was made of kappanium?

  • the large ball
  • the table

In this case, there is no “correct” answer: subjects should not really favor one answer much over the other. But let us now further suppose that we had told the subjects in advance some facts about this unknown kappanium substance:

  1. It is a trademarked product of Dow Chemical.
  2. It is usually white, but there are green and blue varieties.
  3. It is 98 percent air, making it lightweight and buoyant.
  4. It was first discovered by a Swedish inventor, Carl Georg Munters.

We can ask: as subjects learn these facts one at a time, at what point would they stop guessing the answer? It should be clear that only one of these facts really matters, the third one. But more generally, people get the right answer for Styrofoam precisely because they already know something like the third fact above. (The four facts above were lifted from the Wikipedia page for Styrofoam.) This background knowledge is critical; without it, the behavior is quite different.

So this takes us back to the 2001 story of the previous chapter. The lesson here is the same: in order to understand how people are able to produce certain forms of intelligent behavior, in this case, pushing the right buttons on Winograd schema questions, we need to concentrate on the background knowledge they are able to bring to bear on the task.

The return of GOFAI

As noted in chapter 1, much of current AI research has moved away from the early vision of John McCarthy with its emphasis on knowledge. The GOFAI advocated by McCarthy is thought by some to have run its course. What is needed now, it is argued, is a fresh approach that takes more seriously new insights from neuroscience, statistics, economics, developmental psychology, and elsewhere.

While those disciplines unquestionably have insights to offer, and while a fresh computational approach to them may be quite productive, what McCarthy had in mind was something considerably more radical. Rather than a computational version of neuroscience or statistics or whatever, he proposed a discipline with an entirely new subject matter, one that would study the application of knowledge itself—thinking, in other words—as a computational process.

It is very significant that the critics of GOFAI do not try to account for the same intelligent behavior by other means. It is not as if anyone claims that computational economics (say) would be better suited to explain how people are able to do things like answering Winograd schema questions.

Instead, the move away from GOFAI is more like a shift in subject matter. Instead of focusing on the ability of people to do things like answering those questions, the researchers turn their attention to other forms of behavior, and, in particular, to behavior that may be less dependent on background knowledge. (They may go so far as to argue—using some sort of evolutionary rationale, perhaps—that we should not even try to study behavior that depends on knowledge until we better understand behavior that does not.)

So for example, researchers in that part of machine learning we have been calling AML might focus on our ability to recognize hand-written digits. Our ability to do this—to tell the difference between a written 8 and a 3, for instance—seems to depend much less on background knowledge and much more on our having been exposed to a large number of sample digits over a wide range of styles. Researchers in AML concentrate on showing how the necessary patterns and features can be learned from these samples in an automated way.
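
In this style, a passable digit recognizer needs nothing but labeled samples and an off-the-shelf learner; a sketch using scikit-learn’s bundled 8-by-8 digit images (the choice of classifier and settings is incidental):

```python
# No facts about the shapes of digits are written down anywhere here;
# the classifier extracts the regularities from labeled samples.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 1,797 labeled 8x8 images of written digits
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = SVC(gamma=0.001)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy near 0.99 on held-out digits
```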

There is nothing wrong with any of this, of course. It would be a Big Puzzle mistake to assume that all human-level intelligent behavior must be the result of applying background knowledge. What we see with the Winograd schema questions, however, perhaps as clearly as anywhere, is that some of it is.

And the question remains: how are people able to produce that behavior? Could it not be the result of something like AML, but perhaps in a more elaborate form over a different kind of data? To answer this question, we need to look more deeply into what is involved with learning and acquiring knowledge.