Dial 1-800-USA-RAIL, and a welcoming voice comes on the line immediately saying: “Hi. I’m Julie, Amtrak’s automated agent.” Sounding efficient and eager to please, the recorded voice then goes on to instruct callers on what to say to proceed: “Okay. Let’s get started. What city are you departing from?” If the caller says Boston, Amtrak Julie might hear it right, but just to double-check, she’ll reply, “I think you asked for Austin, Texas. Is that correct?” If told it’s not, she’ll say: “My mistake.” Once she finally understands that the caller wants to leave from Boston, she’ll say, “Got it,” and move on to the next step.
Amtrak dispatched Julie onto its phone lines in 2002, and since then she has answered every call that comes into the train line’s 800 number. In 2007, Julie answered 18.3 million calls, an average of about 50,000 calls a day. She forwarded 13.2 million calls to live agents at the customer’s request. But she completely handled 5.1 million calls herself that year. In fact, Amtrak Julie completes more calls in a day than one human Amtrak customer service agent handles in a year. Each of the approximately 1,000 live agents at Amtrak’s call centers completes an average of about 13,000 calls per year, while Julie averages about 13,972 calls a day. Julie can field calls about train arrivals, departures, and fares and can even complete reservations. Amtrak doesn’t say exactly how much the Julie system has saved the company since its rollout, but the initial $4.1 million it spent to get the system up and running paid for itself within a year and a half.
At the Amtrak station in downtown Boston one fall evening, a young professional woman on her way home to her husband in Connecticut had spoken to Amtrak Julie a few hours earlier to complete her reservation. She told fellow passengers that the automated system’s pleasant efficiency made her feel that Julie was really “on it.” However, Julie’s renown goes beyond Amtrak’s stations. In fact, it isn’t hyperbole to say that Amtrak Julie has found a place for herself in the American psyche. She is blogged about. She is featured in YouTube videos—most of which are variations on the idea of dialing Julie and recording her responses when callers say things to trip the system up, confuse it, or come on to her. She was an answer on the TV quiz game show Who Wants to Be a Millionaire? And National Public Radio produced a Valentine’s Day parody featuring Julie on a date with Tom, the computerized voice of United Airlines. In the end, the relationship did not work out because Julie and Tom could not agree on whether to travel together by plane or by train.
Then there were the Saturday Night Live skits. Antonio Banderas hosted the show in 2006, and played a guy who met Amtrak Julie, portrayed by cast member Rachel Dratch, at a party. Eventually he asked her to go back to his place with him that night, to which she replied, “Your approximate wait time is zero minutes.” And when actor Jon Heder, best known as the star of the film Napoleon Dynamite, hosted the show in 2005, he was a shy guy named Gary who had been fixed up on a blind date with Julie. After they greeted each other in a café and sat down, Gary asked Julie what she did. In a cheerful but monotone voice, she told him she worked in customer service. The waiter then came by to see if the couple wanted anything to drink. Here’s how the conversation went from there:
Gary: Um…what do you think, Julie? A latte or a cappuccino, or something?
Julie: Did you say, latte? Or, cappuccino?
Gary: Uh…well, I said both. Do you want a latte or cappuccino?
Julie: My mistake. Cappuccino, would be great.
Gary ordered the drinks. Then Julie interjected…
Julie: Gary, before we go any further, let me get some information.
Gary: Sure.
Julie: Please say your age.
Gary: Oh, yeah! I get that a lot. I know I look young, but I’m actually twenty-nine.
Julie: I think you said nineteen. Did I get that right?
Gary: No, twenty-nine.
Julie: I think you said nine. Did I get that right?
Gary: No. Wow. Twenty-nine.
Julie: Okay. Got it. Sorry.
The people at Amtrak got it too, and they liked it. Matt Hardison, Amtrak’s chief of sales distribution and customer service, is, in effect, Julie’s boss. He was involved in the rollout of the system and has overseen it ever since. “We intentionally wanted to do something that was a little more lively and engaged than the typical voice response system, to make it a little less mechanical.” The company is more than pleased with the results, although he says they are well aware of Julie’s flaws, which he mostly attributes to the limitations in the state-of-the-art technology behind Julie. “The Saturday Night Live skits are really interesting,” he says, because they echo what Amtrak’s own market research discovered about people’s likes and dislikes in the Julie system. For instance, callers’ top dislike is when the system misunderstands them. “If we pay attention to those skits, then we can see where people are getting frustrated. It’s subtle, but it’s very consistent with what we’re finding and what we’re trying to do to improve Julie.”
The real woman behind Amtrak Julie was also impressed with how right Saturday Night Live got it. Julie Stinneford lends not only her voice to the system but also her first name. A voice actress in her mid-forties, she lives in Boston with her husband and two sons. Stinneford says Dratch’s impersonation captured her verbal style. “She even dragged out the initial M’s in words the way I do as Julie, when I say things like ‘my mistake.’”
In person, Stinneford tends toward informal. Her knowing smile and straight, shoulder-length, strawberry-blond hair give her a trustworthy, mom-next-door quality. She is warm, accessible, self-assured, and, yes, perky—but not in an annoying way. In short, the flesh-and-blood Julie lives up to the impression created by her automated telephone persona’s voice.
Stinneford expresses amazement that the role has made her into what she calls “a pseudo-celebrity.” When people find out what she does for a living, they often ask her to “say something like Amtrak Julie would say it”—especially if they are frequent passengers on the East Coast, where the train service is used heavily. She has learned to give in and dutifully respond with a tried-and-true sample of Amtrak-Julie-ese. On cue, she pauses and says with a straight face: “I’m sorry. Did you say, Schenectady?” That one almost always gets a laugh. “I’ve been surprised,” Stinneford says, “about how attached people have gotten to Amtrak Julie. I find it funny. Because they’re not really talking to me. They are talking to a computer.”
The feat of building those computers—able to carry on conversations advanced enough for companies to use them to speak to their customers—is the result of a long line of technological innovations. First, scientists had to figure out how to make the computers talk. Automated directory assistance in the 1980s was an early commercial example of that. But primitive technology meant less-than-human-sounding voices that often gave not-so-accurate responses. By the late 1980s and early 1990s, recorded human voices had generally replaced computer-generated ones. That was around the time customer service call centers had burgeoned too, and the volume of calls they were receiving increased, as did the complexity of some of those calls. Companies began to assign different tasks to separate call centers. Billing, technical help, and orders, for example, were each handled by agents in separate areas within a customer service operation. Routing callers to the right department became paramount. That role was taken over at most companies by the first interactive voice response systems—or IVRs, as they are called.
Early IVR computers could talk, but they still couldn’t listen effectively, so callers had to use their touch-tone phone keypads to respond after such familiar spoken directions as, “For billing, press 3.” These systems engendered exasperation among customers who had to struggle through byzantine, time-consuming menus, called phone trees, to get their customer service business done—or to find they could not get it done without a live agent. The technology and design of those touch-tone-based systems improved somewhat as the years went by, though their legacy of vexation among customers lingered.
By the beginning of this century, through refinements of an artificial intelligence technology called speech recognition, computers began to be able to do more listening—and even to understand many of the variations in human tone and accents. An increasingly natural-seeming phone conversation between humans and computers became possible. Companies started to offer customers the option to speak their responses instead of punching buttons on the phone keypad. As those speech recognition systems continue to evolve, they are likely to remain part of the customer service landscape. In 2005, businesses spent $1.2 billion on speech recognition telephone systems. Each year since, the speech recognition industry has been growing at a rate of more than 22 percent. And in 2009, companies are projected to spend $2.7 billion on speech recognition telephone technology, according to industry analyst Datamonitor. But the technology isn’t perfect yet. And in the face of these newest systems’ limitations on what they can hear and how much they can understand, customers are sometimes left feeling just as frustrated as they were with the touch-tone systems.
Steve Springer wrestles with these sorts of issues every day in his job as senior director of user interface design at Nuance Communications, a Boston-based leader in the speech technology field. Nuance corporate literature says the company has created more than 3,000 speech technology systems for organizations all over the world, and that more than 7 billion phone conversations are automated through their products each year. Springer comes from a background in computer science. He is smart, likable, resourceful, level-headed, a good communicator, and committed to his work—just the qualities he and his team strive to infuse into the computerized telephone agents they create. Springer was part of the team that conceived and produced Amtrak Julie, widely regarded by people in the speech technology business as one of the first and best standard-bearers for how speech recognition computers should interact with customers. Amtrak Julie has become a guiding light of sorts for the thousands of automated voices that companies use as first responders to customers’ inquiries all over the world.
Most people outside the industry don’t make a distinction between speech recognition and touch-tone-based phone systems or between well-designed systems and poorly designed ones. Springer knows that. He also realizes that much of the general public simply believes all automated phone systems are public enemy number one when it comes to customer service—rivaled only, perhaps, by outsourced agents in foreign call centers. The fact that Paul English’s GetHuman Web site struck a chord among so many customers and garnered so much attention when it began was the strongest indication yet of that sentiment.
So instead of blocking out the drumbeat against their profession, Springer was among those who worked with English to address GetHuman’s concerns. In fall 2006, when Nuance Communications and Microsoft joined with Paul English, they announced the set of industry standards that included designing all systems so that the option to speak to a live agent is readily available to callers at every point in the process. Springer believed GetHuman was “saying something very important. And for the most part, I think any designer in my group is probably 80 percent in agreement with what the GetHuman people want.”
But the standards never took off. The point of divergence, says Springer, was practicality. He maintains that companies simply can’t shoulder the increasing costs and complexity of having a human being answer every call coming into a company—especially at multinational companies with millions of customers and hundreds of thousands of calls coming in each week. Making live agents answer mundane and repetitive inquiries, such as bank and credit card balance information, is not the best use of their time or talents either and is not efficient for customers. Speech technology, he says, frees agents up to handle more complex calls, reduces hold times for customers, and makes some company information accessible to customers twenty-four hours a day. But Springer is also keenly aware of the need for many companies to pay more attention to how automated service adds to the alienation and exasperation their customers experience when trying to contact them.
“People are offended when they feel a company doesn’t want to speak to them,” says Springer. “Unfortunately, what I see is a polarization. You have all the GetHuman fans saying, ‘Those damn companies, we’ve got to stop them. We have to hurl grenades at them until they change.’” On the other side, some of Springer’s clients were incensed that Nuance and Microsoft were working with English. “They called us up and asked, ‘What are you doing?’ They didn’t use these words, but essentially they were saying: ‘You’re negotiating with the terrorists. These guys are publishing a cheat sheet that’s meant to undermine all of our cost-saving work for the past five years.’” Springer and many of his colleagues felt caught in the middle, believing that both perspectives have merit. Ultimately he sees Nuance’s role as that of a broker of conversations between companies and their customers. They try to “help people figure out how they need to talk to each other.”
Despite all his best intentions, outside the office Springer rarely escapes the almost universal antipathy toward his work. In social settings, he has had to find a way to tell people what he does without completely alienating them. “At various times I’ve said I’m sort of a conversational linguist—but no one knows what that is. Then I’ve said that I teach computers to understand English—and they think, ‘Yeah, you’re a freak.’” He also tried saying, “You know when you call the airlines and those automated systems tell you a flight’s status?” But before he could even mention that he designs those systems, people would cut him off and say dismissively: “Oh, yeah, I hate those things!” Finally, Springer came up with a description that seems to work. “Now what I say is: ‘You know those really awful automated systems that companies play when you call them? Well, I consult with companies to try to help them make those better.’ And I get a little bit of sympathy there.”
Finding the exact words to convey information clearly and understanding the effects of those words on people’s perception and behavior has become second nature to Springer. It is the art in his work and is evident when he describes how much thought goes into writing the script for voice actors like Julie Stinneford. “When we’re designing these systems,” says Springer, “we’re worried about a high level of precision.” For instance, a big pet peeve among top speech technology designers like Springer is the phrase, “Please listen carefully, as some of our menu options have changed.” It was created to respond to the fact that many callers were pressing the wrong buttons to get around the wordy systems and then were being routed to the wrong places. So instead of making the systems more customer friendly, someone came up with that sentence, which Springer believes essentially blames the customer for the designer’s laziness. He thinks it should be banished from all speech systems, calling it “extremely controlling.” And he asks, “Why take up eight seconds to say something so condescending?”
Instead, a more elegant and respectful way to convey the same information might be to ask: “Which would you like: Reservations, schedules, or fares?” Springer painstakingly points out that the word which should be used at the beginning of that sentence, not the word what, because saying, “What would you like?” signals to callers that they are going to have to supply an answer in their own words. But using which signals that there is a list of options coming up, and the caller can just sit back, listen, and then choose the right option. Designers have to be that deliberate in thinking about every element of their systems. “There are lots of subtle things to consider,” Springer says.
To Robby Kilgore, a designer who works with Springer as a creative director at Nuance, creating a high-quality speech recognition system is a multidisciplinary mix of science, technology, and art. In addition, he says, “The people who are really good at it can walk a mile in the user’s shoes.” And those people, says Kilgore, are not usually the same people who are great at programming, because the programmers “have had their heads buried in computers for all of their lives, and turn out to be terrible conversationalists.” So a typical team would include people with backgrounds in linguistics, psychology, and social science, as well as voice coaches, actors, directors, audio engineers, and computer scientists.
Indeed, it was a multidisciplinary path that led Kilgore to the work of creating speech recognition systems. Formerly a keyboard player, he recorded and performed with such artists as the Rolling Stones, James Taylor, Carly Simon, Paul McCartney, Tom Waits, Laurie Anderson, and Steve Winwood. Eventually he ended up at Microsoft doing sound design for Windows 95. That led to work as a creative director in the social user interface group at Microsoft. Using sound, animation, and breakthroughs in artificial intelligence, they designed on-screen help agents for early versions of Microsoft Office, including the paperclip on-screen help icon, which some loved and others hated.
Now Kilgore works in Nuance’s New York City office, downtown near Wall Street—not a neighborhood where many people share his rock musician past. And he has managed to blend in a bit, appearing unassuming and soft-spoken—almost corporate at first. But his artistic passion emerges as he explains how he uses his finely honed audio and performance instincts to help companies convey their brands through the computer voices that answer their phones. He often borrows terms from the world of magic in describing the work. For instance, when pointing out that speech is what separates humans from all other species, he starts by saying, “So here’s the thing—humans talk. That’s their trick.” And he calls speech recognition “an unbelievably complicated trick,” because it involves teaching computers to talk and listen to callers as humans would do. Its ultimate goal, he says, is creating “the illusion of conversation.”
Kilgore is careful to add, “It’s not so much that I want you to think it’s a real person. I’m not trying to fool you into that. But I am trying to avoid the deal breakers.” For instance, Kilgore says that people who create the systems often “forget to map the social thing onto their software. They forget that a conversation is social. And they make the computer do something really asocial. Like if you call up a company and hear, ‘Please listen carefully for your options have changed…’ yadda, yadda, yadda. Imagine a real person doing that. They’re just talking at you and not listening to you or understanding you. If someone did that at a cocktail party—if they talked your face off for forty-five seconds without letting you respond, and then asked you a bunch of questions, and then didn’t really hear the answers—most people would get away as quickly as possible.”
Kilgore tries to model the systems on accepted norms of real human interaction. “If a system keeps repeating, ‘I didn’t understand that,’ and it loses track of how many times you’ve had that error message—if it doesn’t make you think that it knows that it’s on step three of a five-step process—then it just sounds dumb.” On the other hand, he says, “There is something reassuring” if the computer signals that it knows it is responding a second time and says: “I’m sorry. I still didn’t understand that.”
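The behavior Kilgore describes can be sketched as a tiny bit of state: the system remembers how many times it has failed and varies its apology instead of repeating the same line. This is a minimal, hypothetical illustration of the idea, not any vendor’s actual dialogue logic; the prompt wordings are invented for the example.

```python
# Hypothetical sketch of error-aware reprompting: the system tracks how
# many times it has failed to understand the caller and escalates its
# apology, rather than repeating one canned error message.
REPROMPTS = [
    "I didn't understand that.",
    "I'm sorry. I still didn't understand that.",
    "My mistake. Let's try once more.",
]

def reprompt(error_count):
    """Return an apology appropriate to the number of failures so far."""
    # Clamp so a fourth or fifth failure reuses the final, most contrite line.
    return REPROMPTS[min(error_count, len(REPROMPTS)) - 1]
```

The point is only that the second apology acknowledges it is the second, which, as Kilgore says, is what keeps the system from sounding dumb.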
The merits of making human-to-computer relationships adhere to the etiquette of human-to-human interactions are not just a hunch. When Kilgore was at Microsoft, one of the consultants for their group was a Stanford University professor of communications and social science named Clifford Nass, who conducted numerous experiments in which he found that people tend to relate to computers and other media in the same way they do to humans. He reported his often humorous results in a 1996 book that has become a bible of sorts to many computer interface designers, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. He found people like computers that flatter them better than those that don’t. He discovered that people respond differently to computers they perceive as female and male. And he observed people being polite to computers in the same ways they would be to humans. Nass, who is also a professional magician, concluded that people are more at ease with a computer that appears to have some human-like qualities.
Nass’s influence has informed the work of many designers of computer-human customer service interactions as they work to find the right balance between making a computer’s limitations clear to callers and making the experience easy and intuitive for users. Steve Springer says, “We’re of the mind-set that a computer never pretends to be a person. And shouldn’t.” Making a computerized telephone customer service agent speak and interact like a human is more of a metaphor, says Springer. He compares it to the way a computer desktop doesn’t look exactly like a desk, but it suggests one, and serves a similar purpose.
Amtrak Julie may be just a metaphor, but she comes closer to evoking a real human than many among her breed, called IVR personas. Most of them don’t have names. And some are not underpinned with Julie’s sophisticated speech recognition technology, which allows her to mimic human conversation more accurately than many others. Still, the interactive voice response persona trend has developed since Julie broke the ground in 2002. An actress from the Broadway musical Rent plays Simone, the automated agent who answers phones for Virgin Mobile USA. She begins the call by saying: “Hey. What’s up? This is Simone. Virgin Mobile customers, you rule!” Her tone is more relaxed than Julie’s, and her style more targeted to Virgin’s hip, young demographic. She even flatters callers, as Clifford Nass’s experiments suggested computers should do.
At one time, Yahoo’s IVR was named Jenni McDermott. She came complete with a profile that said she was a twenty-four-year-old Leo with an art history degree from the University of California at Berkeley, who worked as a coffee barista in a San Francisco–area café. She was in a band and liked to scuba dive and walk along the beach with her jazz musician boyfriend, Rob, and her dog, Brindle. Another IVR persona named Mia has been used to answer the phones in some Domino’s Pizza franchises. These days, IVR personas span the world as well. There are French ones (Florence and Bernard), Turkish ones (Kerem and Zeynep), and Mandarin Chinese ones (Lisheng and Linlin).
In light of this global spread of nonhuman help on customer service phone lines—and the backlash the systems receive—Julie Stinneford feels the need to advocate for the value of her computer-based alter ego and the peers it has spawned. In fact, she even points out reasons she thinks automated agents might be superior to humans in certain situations. “As a computer,” she says, “I’ll never give you an attitude. I will apologize ’til the cows come home. ‘I’m sorry. It’s my fault. I misunderstood.’ I’ll give you sixteen different opportunities to try again. And it’s never the customer’s fault.” She says some of her most frustrating experiences as a customer have been when she gets through to live agents and they “either don’t know what they are talking about or I can’t understand them.”
She goes on in her own words: “A lot of the people who are doing telephone customer service, in my opinion, are not customer oriented. They’re in a call center, and they have to get through x number of calls in x number of minutes. And in their minds, they’re not paid to be nice. They’re paid to get the job done. It’s only when I have reached a management-level person that I have actually gotten someone who has been friendly to me.”
She doesn’t get an opportunity to speak publicly and off-script much about her work. So perhaps it is some pent-up frustration of her own that spurs her to keep comparing what she sees as the virtues of her virtual self to the shortcomings of getting human agents on the phone. “I’m paid to be nice. Doggone it. You can yell at me, and I’ll say, ‘I’m sorry, I just can’t get it right.’ Again, it is never the customer’s fault. Even if they are speaking gobbledygook into the phone. It’s always my fault.” She says she has to keep that attitude in mind when she goes through the many variations on apologies she has to record for a system. “I can’t be too schmaltzy about it. I have to have the right tone, where I’m not sounding condescending and being snotty. It has to be that I’m genuinely mystified as to why it is I haven’t gotten it right. But clearly I know I haven’t gotten it right. It’s the kind of customer service that people expect.”
Callers’ expectations are among the first things many designers consider when putting together an interactive voice response system. “Often people are at the end of their rope when they call,” says Steve Springer. “They’ve tried the other ways of resolving their problems, and now they feel like the system isn’t set up to help them with their situation and they need to talk to somebody.” So they press zero or say “agent” until they bypass the automated system.
Figuring out how the computers should field those calls is a challenge that system designers take seriously. Robby Kilgore says, “We have multinational companies that have many, many, many millions of constituents and lots of departments. It’s all fine for you to press zero. But I’ve got zeros in billing in Boise, I’ve got zeros in tech support in Bangladesh, and I’ve got zeros in new orders in Tampa.” Without more context from a caller, getting them to the right agent can be awkward. So Kilgore tries to find “some way—without being incredibly long-winded—to say, ‘Well, there are forty-seven different places I can send you.’” In that instance, Kilgore says he would suggest something like, “I can help you do x, but I need to know y.” That is the kind of “very social trade-off” he believes most callers can accept.
It’s not for lack of trying by Springer, Kilgore, and their colleagues, but still their best efforts haven’t sufficiently broken through much of the general public’s resistance toward IVRs. In 2004, Forrester Research found that IVRs met customers’ needs only 18 percent of the time. Kilgore becomes philosophical when faced with that kind of rejection. “I think the interesting thing is: What do they hate? What’s happening when they’re hating it?” And beyond the obvious frustrations that the most antisocial IVRs engender, Kilgore thinks some of the aversion is a reaction to a larger sense of helplessness people feel at the course the world around them is taking. “They hate that they’ve got to go through a sort of an automated triage to speak with somebody at a company because companies are that big now, and their services are that complex. It’s the death of the mom-and-pop shop. And there isn’t a local place to call. So there are just things they hate about the state of the world. And there’s very little we can do about that.”
For the callers who don’t “zero out,” as it is called, a speech recognition IVR system that meets their needs has to do a few basic things: get and give information, understand callers’ responses, and direct them to the right live agent if necessary. It may seem simple, but computers that answer the phones need much more initial basic training than their human counterparts do. For instance, in order for the computer to understand the callers, their designers have to think of every possible way a caller could respond to a simple yes-no question. So it might be “yes,” or “yeah,” or “uh-huh,” or “sure.” Then they teach the computer to recognize those responses, as well as the variations caused by individual accents and intonations.
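In the simplest possible terms, that design work amounts to building a lookup from every anticipated variant of an answer to one canonical meaning. Here is a minimal sketch of the idea, assuming the recognizer has already turned speech into text; the variant lists are invented for illustration, and real grammars also model accents and intonation at the acoustic level.

```python
# A toy "grammar" for a yes-no question: every anticipated way a caller
# might answer is mapped to one canonical meaning.
YES_NO_GRAMMAR = {
    "yes": "yes", "yeah": "yes", "uh-huh": "yes", "sure": "yes",
    "yes ma'am": "yes",  # the kind of southernism a team might have to add
    "no": "no", "nope": "no", "nah": "no", "no ma'am": "no",
}

def interpret(utterance):
    """Map a recognized utterance to 'yes', 'no', or None if out of grammar."""
    return YES_NO_GRAMMAR.get(utterance.strip().lower())
```

Anything outside the table comes back as not understood, which is exactly when the reprompting behavior described earlier takes over.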
Springer remembers a famous example of how complicated that can be from when he worked on a system for BellSouth. His team had created and tested the voice response system using a female voice they were confident would work well. But when they put the system into use down South, it turned out the Boston designers had made one glaring omission: the computer hadn’t been programmed to understand the responses “yes ma’am” and “no ma’am.” Springer said they then scrambled to add those southernisms to the computer’s repertoire.
That collection of possible responses a computer is trained to understand is called a grammar in the speech recognition business. Some of the most basic systems have a grammar of only 250 to 300 words. More complex systems can understand up to 2,500 responses. Advanced systems, such as those that handle looking up names and numbers in a directory, have virtually open-ended grammars. But now the trend is toward an even more sophisticated system, natural language speech recognition, which can understand when callers speak in a conversational way. Callers can say, “I need to talk to the billing department.” Or “billing please.” Or “gimme billing.” Then the system can pick out key words or sets of words, discern the meaning, and route any of those callers to the billing department. Natural language systems also allow for what are called “disfluencies,” instances when a speaker hesitates or doesn’t speak in a linear way, saying things like, “uhhh, yeah…I guess I—can you get me to the place to talk about my bill?” Presumably a natural language system could understand those disfluencies and still route that caller to billing.
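The keyword-spotting part of that routing can be sketched in a few lines. This is a deliberately naive illustration, not how a production natural language system works (those use statistical models, not word lists); the department names and keywords here are invented examples.

```python
# A toy keyword spotter for call routing: scan the caller's words for any
# department keyword, skipping over filler and disfluencies.
KEYWORDS = {
    "billing": "billing department",
    "bill": "billing department",
    "order": "new orders",
    "support": "tech support",
}

def route(utterance):
    """Return the destination for the first keyword found, else None."""
    for word in utterance.lower().replace(",", " ").split():
        word = word.strip(".?!-")  # shed trailing punctuation
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None
```

Even the hesitant “uhhh, yeah…can you get me to the place to talk about my bill?” contains the word “bill,” which is all a spotter like this needs to send the caller to billing.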
In addition to understanding callers’ responses, speech recognition systems also have to be programmed to ask questions and give information. In the speech business, the things the system says are called the prompts. That is where the human voice of the computer comes in. For Julie Stinneford, the hardest prompts to record are the short ones. “You’d be surprised,” she says, “at how many different ways there are to say and or or.” With the longer sentences and paragraphs, she says she has the persona to carry her through. “But it’s the little prompts that they have to slip in between, to make it flow right, that get you crazy.” For Amtrak, she also recorded 1,047 city and state names for the stations, with some towns having two or three different locations, such as New York–Grand Central and New York–Penn Station.
And then there are numbers. Stinneford has to record every possible way to say a number so that designers can edit them together to make natural-sounding phone numbers and addresses or bank balances. The process is called catenation. One prompt might be, “The hours are from, nine a-m, to, four p-m.” Stinneford says that designers “want to put it together so it doesn’t sound like Rosie the robot. So when I say each number, I really think about what is coming right before it and right after it. In some of the applications I have done, they have actually said, ‘You’re enunciating too much. Can you slur that one together?’”
Stinneford says she records every number and every letter in at least three different ways: leading, medial, and final. “Leading,” she says, “would be as though you’re starting something and you expect something to follow it. So if I’m going to give you an 800 number, I would say One…800. But the one is a separate entity.” She then uses the example of an address, like 167, to show how she would say the medial number. “If you say, ‘one, six,’ the six is kind of climbing a little bit. But you know it’s not done.” Then the tone of the seven goes down again to show there is nothing coming after it. “As you are recording each and every number, you’re trying to think, ‘Okay, this is coming at the end, so it has to sound final.’”
Telephone numbers are different from addresses. Stinneford treats each seven-digit phone number as three sets of numbers. The first three numbers are leading (6-8-5), the middle two are medial (0-7), and the last two are final (8-3). “I’ve had systems where I have recorded every three-number combination and every two-number combination,” Stinneford says. “Luckily I didn’t have to think of them myself; they just appear in front of me on a screen. You’d be surprised at the detail that goes into these systems. And I’m sure I don’t know half of it.”
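The leading/medial/final scheme Stinneford describes can be sketched in a few lines of code. This is purely an illustration, not any real vendor’s system; the file-naming convention and the idea of storing one recording per digit per position are assumptions for the sketch.

```python
# Illustrative sketch of prompt catenation (not any real vendor's system):
# each digit is assumed to exist in three recordings -- leading, medial,
# and final -- stored in files named like "6_leading.wav". A phone number
# such as 685-0783, split into the three groups Stinneford describes,
# becomes an ordered playlist of those recordings.

def catenate(groups):
    """Build the prompt playlist for digit groups like ["685", "07", "83"]."""
    # First group gets the leading intonation, the last gets final,
    # and anything in between gets medial.
    roles = ["leading"] + ["medial"] * (len(groups) - 2) + ["final"]
    playlist = []
    for group, role in zip(groups, roles):
        for digit in group:
            playlist.append(f"{digit}_{role}.wav")
    return playlist

print(catenate(["685", "07", "83"]))
# ['6_leading.wav', '8_leading.wav', '5_leading.wav', '0_medial.wav', ...]
```

A designer would then splice those audio files together in order, which is why each recording’s intonation has to anticipate what comes before and after it.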
Another important challenge for designers like Steve Springer is to make sure the voices and prompts of the systems have the right human qualities to create trust. “But if you go too far with that,” he says, “then you mislead people.” He speaks of a concept called the “uncanny valley” that many technology designers know. It comes from the work of Japanese researcher Masahiro Mori, who looked into people’s emotional responses to computers in the 1970s. Springer explains that Mori “charted how as machines become more anthropomorphic, and you ask people what their comfort level with them is, it goes up and up.” But then there is a point, according to this theory, where if the machine gets too close to being like a real person but it is still not quite there, the comfort level plummets. The line on the graph, after climbing steadily, suddenly drops off into what looks like a valley. Mori called it the uncanny valley. “It freaks people out,” says Springer. “It’s like zombies. They look like people, but they’re not, and it creeps people out.”
Julie Stinneford is well aware of her role in helping to humanize the machines. “People have to be confident that I will do what they need me to do, that the computer will.” When she is recording the prompts, she says she thinks about how to use her voice to “empower customers to feel like they want to keep going, without feeling like they’ve ceded control to a machine.” She tries to convey a trustworthiness in her voice. “People want to have self-control and self-direction—to be able to have a say over their surroundings and what happens to them. If they feel they are losing that, if they are on the telephone and feel that somebody does not care and is taking away their power, it’s infuriating. Because you are literally at the mercy of these systems. You know what you need, and it may not be giving it to you. I know for me, I can’t stand to feel powerless. I think most people are like that. They want to feel, not powerful necessarily, but at least empowered—that they can do whatever it is they have to do, that they are able to be effective without somebody tripping them.”
Stinneford’s unique view into the phenomenon of speech recognition systems causes her to feel some affinity with live customer service agents working in call centers and even some affinity with the companies she represents. But mostly she feels an affinity with the customers. At no time was that vantage point—somewhere between customer, company, and machine—more tangled than when she called Citibank to inquire about her credit card one day. In a Twilight Zone–like moment, she dialed the toll-free number and a cheery female voice answered saying, “Thank you for calling Citibank.” As it asked her for her account number, she realized the voice was her own. “That was very strange,” she says, “especially because I had forgotten I had done it. So I called, not knowing it was going to be me. It asked me for my telephone number, and I felt like saying, ‘Come on. You know my telephone number.’ It was very, very odd speaking to me as the voice of Citibank about something I had to do as a customer of Citibank.”
IVR telephone systems are part of the broader segment of the customer service industry called “self-service,” which includes any kind of customer help besides live human assistance. Companies provide self-service through three main channels: on the phone, on the Web, or at a location (such as a bank’s ATMs, a gas pump, a grocery checkout, or a ticketing kiosk in an airport). Self-service saves companies money, gives customers information instantly, and liberates agents from answering repetitive questions. But self-service also can fuel the perception that a company is uncaring or arrogant—not wanting its customers to talk to live human representatives.
In their own defense, some in customer service point to earlier self-service models and the icy reception they initially received, such as the first ATMs in the early 1980s. Many people hated ATMs; they found them hard to use and complained that the machines would do away with human tellers. Many banks also gave their ATMs human names to make them appear friendlier. But now there are many more ATMs (and much less need for giving them human names). Surprisingly, there are also many more tellers than ever before as the job has become more complex and customer oriented. And banks have opened more branches, including those in retail outlets like grocery stores. Customer service industry people also point back to an even earlier time, when customer dialing replaced operators on telephones. Phone customers complained that self-dialing was a sign that society was losing the human touch. Now, the reasoning goes, it is just a matter of time before people accept and appreciate IVRs and other relatively new self-service innovations in the same way they have come to embrace and value the benefits of ATMs and self-dialed phones.
Companies also claim that customers are demanding self-service more and more. It is hard to imagine a company today not having a Web site or a bank not giving customers the option to check account balances through an automated phone line or on the Internet. Amtrak’s Matt Hardison says that 75 percent of the train travelers in New York City who buy their tickets at the station use the self-service kiosks even when live agents are standing ready to help at ticket counters a few feet away. And the self-service trend is not going away. By 2010, the market research firm Gartner predicts, 58 percent of customer service interactions will be self-service, up from 35 percent in 2005.
While the use of IVRs has played a big role in reducing the number of calls handled by live agents, the growth of the Internet has reduced the number of customers who ring companies up in the first place. Hardison says at one point, Amtrak received anywhere from 28 to 30 million calls per year. By also providing customers with the opportunity to use the Web or station-based kiosks, Amtrak has been able to reduce the number of calls it receives to the current level of about 18 million. “We are trying to give our customers lots of options,” says Hardison. “I think people are evolving to where they would rather save time and do what’s fast than have the human interaction.”
For that, the Internet opens up a whole new world of customer service. Web sites across most industries give fast, simple information on services, products, and companies through a list of frequently asked questions, or FAQs. Another convention of most company Web sites is a search engine where customers can enter keywords to find the information they need. Customers’ complaints about Web sites usually relate to the usability of a site. Frustrations run along the same lines as IVR headaches, and most of the causes are similarly rooted in faulty design. Customers are turned off by Web sites that make them repeat information at different steps along the way, force them to do too much of the work, present outdated information, contain search engines that bring back too many or too few results, or require customers to answer a barrage of questions before they can get the information or service they need.
Still, the Internet has expanded the playing field of company-customer interactions, and some businesses exist solely on the Web now. Among the most successful of those, customer service is often a rallying cry. Consider three Web-based companies: the Internet retailer Amazon.com, the Internet bank ING Direct, and Craigslist, the Internet classified ad site. Jeff Bezos, Amazon’s founder, has said from the site’s beginning in 1995 that his goal was to make it “Earth’s most customer-centric company.” More than a decade later he told the Harvard Business Review, “Having that kind of bigger mission is very inspiring. Years from now, when people look back on Amazon, I want them to say that we uplifted customer-centricity across the entire business world.” In order to keep in touch with that goal, Bezos works in the company’s call center once every two years and requires every employee to do the same.
Arkadi Kuhlmann, the founder of ING Direct, the largest online bank in the United States, moved his office from the executive suite to a corner of the company’s call center in fall 2007 to stay in closer touch with the customers. And on the Craigslist Web site, Craig Newmark proclaims his official title as: “Founder, Chairman, Customer Service Representative.” He told Business 2.0 magazine, “American corporate culture seems to devalue customer service in a big way. I say, go the other way. Do it right. Trust your customers. Give them power to do things right. Service costs will drop, and customers will become more devoted to your products and services. This ain’t rocket science.”
A frequent customer of any of those companies can go years without ever having to call or e-mail them. That is by design. And it is an ultimate goal of most self-service in all industries. It is also an objective that can come across to customers as disdain for their needs. But these Internet businesses avow that for them, striving for less human interaction with customers is born of more noble intentions. They try to anticipate their customers’ needs and meet them, thereby making it unnecessary for their customers to spend time contacting them. That not only makes happier customers, they say, but it cuts costs, which along with no physical storefronts to maintain helps Amazon offer low prices, ING keep savings account interest rates high, and Craigslist provide no-fee access to its listings.
“Our customers don’t contact us unless something’s wrong,” says Bezos. So Amazon continually strives to reduce the number of contacts its customers make with the company for each unit sold. Those numbers have gone down every year since Amazon’s start. “If your focus is on customers, you keep improving,” says Bezos. “A lot of our energy and drive as a company, as a culture, comes from trying to build these customer-focused strategies.” When challenges arise and decisions have to be made about which way to take the company, Bezos says, “We try to convert it into a straightforward problem by saying, ‘Well, what’s better for the consumer?’”
That echoes the philosophies of both ING’s Arkadi Kuhlmann and Craig Newmark and is partly a by-product of the accelerated pace of the Internet business environment. Amazon, ING, and Craigslist have all thrived. And their founders say they have done so by making the needs of their customers their business priority at all levels of the company. Bezos points out that “there’s so much rapid change on the Internet, in technology, that our customer-obsessed approach is very effective.”
But not everything they have done has gone over well. In the beginning, Amazon’s phone number seemed impossible to find on its Web site. That was all part of its effort to keep costs low and pass the savings on to consumers. But the apparent nonexistence of Amazon’s toll-free number created a minor customer backlash in blogs and on consumer Web sites. One site that found and gave out the number was created by a customer who could not figure out how to reach the company by phone after she had been overcharged $300 for an order. Amazon responded directly to her, apologized, refunded the overcharge, and has since listed the number on its site. But it is still not immediately available. Clicking on the link to the phone number leads customers to a few other contact options first, including a last-ditch offer to get them to give their number instead and let the company call them back. If that is not acceptable, another click will reveal the company’s toll-free number. Amazon gets an F on the GetHuman Web site grading system, which rates companies on how easy they make it to reach a live customer service agent.
Companies that are not solely based on the Internet are also finding ways to connect with their customers on their Web sites through peer forums, as they are often called, which encourage their customers to help each other. These are an adjunct to help desks and FAQs, especially at the Web sites of technology companies like Dell, Apple, and Microsoft. Users answer other users’ questions. Some customers complain that peer forums are a cynical way for companies to outsource the job of helping their customers to the customers themselves, and thus avoid paying anyone to do the job professionally.
But as with much else in self-service, the customer peer forum concept merges a few different business goals, one of which is finding a better way to help customers. Dell’s IdeaStorm.com and the San Francisco–based GetSatisfaction.com are both attempts to marry such customer service goals to marketing goals. A Get Satisfaction blog entry describes the ideal behind these types of unions. “To create great service, companies are letting go of control, letting go of fear of embarrassment, letting go of perfection. Relaxing these things gives customers the opportunity to help companies in amazing ways, as their passions feed back into the products and services they use. It allows companies to be real instead of defensive when things go wrong. It transforms customer service from a cost to be controlled into the most meaningful, viral, cost-effective marketing imaginable.”
As more and more companies funnel customer service to IVRs and Web sites, a few companies are finding a way to provide a self-service option that fuses the two. They are adding another dimension to self-service, literally.
At the top right corner of the U.S. Web site for the giant Swedish furniture retailer IKEA is a little illustrated icon of a youngish, red-haired woman. Beside her appears the message: “Need help? Ask Anna!” Click on it, and up comes a separate, narrow, vertical panel with a larger representation of the same woman, smiling, blinking her blue eyes, and giving her head an inquisitive tilt. She is wearing a yellow polo shirt with the IKEA logo in blue and has on a telephone headset. She looks like a friendly, efficient customer service agent. Underneath the animation is the greeting: “Welcome to IKEA. I’m Anna, IKEA USA’s Automated Online Assistant. You can ask me about IKEA and our products and our services. How can I help you today?”
Typing in “I am interested in chairs” prompts a Web page to come up displaying all the chairs IKEA sells. Then, reminiscent of Amtrak’s Julie, a dialogue box under the IKEA Anna character says, “You said: I am interested in chairs.” In another box under that is written Anna’s reply: “Please have a look at the chairs.” Much of IKEA’s furniture has to be assembled by the customer, but typing in the question for Anna, “Do I have to put them together myself?” draws an odd response. “Well, if you say so,” she replies, and then the animated Anna appears to laugh at her own possibly Swedish brand of humor. Rephrasing the question by asking, “Do they come already assembled?” causes Anna to reply: “I’m very sorry, but I can’t really talk about things like that at this time.” Typing in, “This is frustrating,” triggers Anna to say: “I can’t discuss those sort of feelings or emotions. Please talk to me about IKEA.” And finally, typing in a harsher, more personal critique of Anna, like, “You’re kind of dumb,” will draw this response: “I am not designed to understand or feel insults, although my knowledge will surely improve after this conversation. What would you like to know about IKEA?”
Anna is one of the forerunners of a new breed of human-like customer service agents starting to appear on computer screens. Still in their infancy, they have not even been given a definitive universal name yet. They are called anything from cyberagents, virtual guides, and interactive automated agents to avatars, chatbots, or digital people. They can come in many incarnations and varying levels of sophistication. The most high-tech combine the latest speech recognition technology and personas of IVR with the 3-D animation technology of a DreamWorks-produced movie. No such online agent with all those qualities has been put into action quite yet. IKEA USA’s Anna moves a bit and shows facial expressions, but she doesn’t speak except through dialogue boxes with written text. On IKEA Web sites tailored to other countries around the world, other Annas have some more advanced features. The online Anna at IKEA’s U.K. site looks different from her American counterpart. She has blonde hair and does speak, though in a synthetic, computerized voice. But she can’t yet listen to customers speaking. Site visitors type questions into a dialogue box, and British Anna answers them in spoken language.
Other talking online customer service agents include DirecTV’s Diane, Bell Canada’s Emily, Royal Bank of Canada’s May, and Sergeant STAR (Strong, Trained, And Ready), an interactive recruiter on the U.S. Army Web site. They all speak but don’t answer open-ended questions like Anna is supposed to be able to do. Just as with IVR personas, the parodies of online agents have begun too. Radar magazine conducted a mock online interview to try to set up IKEA Anna and Sergeant STAR on a blind date. It took on a question-and-answer format, based on the actual responses of the two online agents to the same questions posed to both. One of the first questions Radar magazine asked each one was about their thoughts on the war in Iraq:
Sergeant STAR: In times of war, deployment is likely.
IKEA Anna: This is a subject I prefer not to discuss. I’d much rather talk about IKEA.
Radar: Anna, at a time like this, can you really in good conscience retreat into your yuppie bubble of chairs and tables?
IKEA Anna: Please have a look at the tables.
Radar: You’ve really got a one-track mind.
IKEA Anna: Don’t worry about it so much!
Radar: Sgt. STAR, does the Army have any jobs for self-involved, table-obsessed ladies of Scandinavian descent?
Sergeant STAR: There are more than 70,000 women soldiers currently serving with pride and distinction.
Radar: Hear that, Anna? Even you could sign up.
IKEA Anna: I’m only here to answer questions about IKEA so I don’t have any outside interests.
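Anna’s pattern of canned answers and polite deflections suggests the general shape of these early chatbots: match the typed question against keyword rules, and fall back to a stock response when nothing matches. The sketch below is a hypothetical illustration of that idea only; IKEA’s actual engine is not public, and the rules and wording here are borrowed from the exchanges quoted above.

```python
# Hypothetical sketch of keyword-rule matching of the kind early chatbots
# like IKEA's Anna appear to use. The rules below are invented for
# illustration, echoing the responses quoted in the text.

RULES = [
    (["chair", "chairs"], "Please have a look at the chairs."),
    (["table", "tables"], "Please have a look at the tables."),
    (["frustrating", "frustrated"],
     "I can't discuss those sort of feelings or emotions. "
     "Please talk to me about IKEA."),
]
FALLBACK = "I'm very sorry, but I can't really talk about things like that at this time."

def reply(question):
    """Return the first matching canned response, or the fallback."""
    words = [w.strip("?.,!") for w in question.lower().split()]
    for keywords, response in RULES:
        if any(word in keywords for word in words):
            return response
    return FALLBACK

print(reply("I am interested in chairs"))      # matches the "chairs" rule
print(reply("Do they come already assembled?"))  # no rule matches -> fallback
```

The weakness is visible immediately: any question whose keywords fall outside the rule list, however reasonable, gets the same deflection, which is exactly the behavior the Radar parody exploits.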
More advanced future versions of IKEA Anna, Sergeant STAR, and their ilk will likely pop up on company Web sites soon as personal computers and telephones begin to merge, making the technology that allows users to speak into their computers more widespread. And cell phone companies are considering using these human-like guides on the screens of their phones to help users through complicated procedures, sparing live customer service agents that job. Right now, less feature-filled versions of these chatbots are used by Comcast and many other companies in the form of instant message–like computer dialogue boxes on their Web sites, where customers can enter questions and get canned responses, but not necessarily from an animated character with a human name. IBM and Dell have put 3-D avatars of customer service agents on the Web game Second Life at a virtual information desk in their stores on the site. The millions of Second Life residents worldwide can take their questions about IBM or Dell products to that help center in the virtual world. Gartner market research has predicted that by 2010, 15 percent of Fortune 1,000 companies will use some sort of chatbot for online customer service.
These and many other breakthroughs in self-service technologies have sprung from artificial intelligence research in technology labs at universities like MIT and Stanford and at commercial research centers within companies like IBM and AT&T.
Rosalind Picard is an associate professor of computer science at MIT. An article about her in the university publication Spectrum told the story of another MIT professor who “was on his knees, crying, one day because he and all his best workers from the MIT Lab for Computer Science couldn’t crack a computer problem. Picard said, ‘If the computer can do this to PhDs in computer science from MIT, what’s it doing to the rest of the world?’ adding that we have designed computers for technical people and haven’t thought about the customer as a human being.”
Picard continued, “When you deal with people from different countries, you show respect for them by translating the conversation into their language. You adapt to them; you don’t require them to adapt to you. But computers are very disrespectful. They expect us to adapt to them, and if not, we are made to feel dumb. It is not people who are stupid, it’s the computer that is stupid, and it is the software that refuses to adapt.”
MIT scientists have spearheaded much of the progress that led to computers that appear to speak and to listen. And MIT scientists are among those currently trying to teach computers to think and reason. But Picard and others are trying to go even further. They are working on teaching computers how to feel.
It could be the start of a really bad joke: What do potential international terrorists and callers to most customer service lines have in common? But the answer is no joke. It is the technology called speech analytics. The same technology that the Central Intelligence Agency and the National Security Agency use to listen in on calls domestically and abroad searching for signs of terrorist activity is now being used by call centers all over the world to listen in on conversations with customers. The message at the beginning of most customer service calls, warning that the call may be recorded for quality assurance purposes, doesn’t really tell the whole story. With speech analytics, that original intent has been enlarged, and even usurped, as recordings of customer service calls are increasingly being seen as gold mines of all kinds of information for all areas of companies.
Originally, recorded calls were used by call centers only to evaluate customer service agents and make sure that customers were being treated well. But that was a human-monitored system and was never very efficient. Only a small percentage of calls were examined by managers and used for training agents and listening to customer concerns. That sampling was the best a human worker could do, especially at companies that receive hundreds of thousands of calls in a twenty-four-hour period. So the majority of calls were left unheard. But just as computers can perform mathematical equations much faster and more accurately than humans, speech analytics computers can also mine the information contained in huge numbers of phone calls at a speed and scale that would be unmanageable for humans alone. Now companies are supplementing the human monitors and catching on to how valuable it can be to use computers to track the themes and trends of customers’ concerns in those phone calls. Not only can those findings help to improve customer service, but they also can fuel innovations in marketing, sales, strategic planning, and product development. Speech analytics is now seen as an important tool for unlocking the latent potential in call centers to become invaluable business-intelligence-gathering hubs within companies.
Speech analytics applications employ the same technology used to create speech recognition IVR programs. But instead of recognizing a customer’s exact meaning and then triggering the computer to respond, these systems search conversations for patterns and then group them into themes. Companies don’t know all the reasons for their customers’ calls. Speech analytics, also called audio mining, can help them find out more about the content of those calls. It can search through recorded phone calls and identify trends in conversations by finding and recognizing key words and phrases. So a search could flag every call in which a customer uttered the phrase, “cancel my account,” or in which a competitor’s name was used. Then those calls could be mined further to find out if those customers actually defected and to try to analyze why, or why not. Speech analytics systems can be programmed to find all kinds of variations on what is said in calls and group that information in endless ways too. More sophisticated programs can even spot trends in calls not identified by programmers—for instance, if the word “hate” is being used by a high percentage of callers when the programmer only told it to look for “angry.” Speech analytics technology is relatively new. It started to migrate from use in government intelligence to business use in about 2004. And in 2007, industry watchers said it was one of the fastest-growing call center technologies.
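The word-spotting step, at its simplest, amounts to searching call transcripts for target phrases and tallying the themes that turn up. The sketch below is a bare-bones illustration of that idea; real systems work on audio with fuzzy phonetic matching, and the sample transcripts here are invented.

```python
# Minimal sketch of the word-spotting side of speech analytics: flag
# transcribed calls containing target phrases, then tally how often each
# theme occurs. The transcripts are invented for illustration; production
# systems match against audio, not clean text.

from collections import Counter

TARGETS = ["cancel my account", "hate", "angry"]

def spot(transcripts):
    """Return (flagged call ids, per-phrase hit counts)."""
    hits = Counter()
    flagged = []
    for call_id, text in transcripts.items():
        matched = [t for t in TARGETS if t in text.lower()]
        if matched:
            flagged.append(call_id)
            hits.update(matched)
    return flagged, hits

calls = {
    "call-1": "I want to cancel my account, I hate this new fee.",
    "call-2": "Just checking my balance, thanks.",
    "call-3": "Your agent was great, not angry at all!",
}
flagged, hits = spot(calls)
print(flagged)  # ['call-1', 'call-3']
```

Note that call-3 shows the limits of naive matching: the word “angry” is flagged even though the caller is praising the agent, which is exactly why vendors layer context analysis on top of simple keyword spotting.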
Some of the analytics tools currently available also go a bit further than word spotting. What are called emotion-detection programs can identify a caller’s range of feelings during a customer service phone call. By tracking the pacing, volume, and tone of a caller’s conversation, the technology can flag instances in which a caller was particularly upset or particularly happy. Saying, “I can’t believe this is happening,” can be both a positive and negative statement, depending on the way it is said and the context in which it is used. Simple word or phrase spotting might not be able to interpret that. Presumably emotion detection technology could, and many are thrilled with that possibility. Others are skeptical. On hearing about the idea of emotion detection technology, one reporter for Internet Telephony magazine, Tom Keating, wondered in his blog whether this technology would allow the computer to understand context well enough to know, for example, the difference between a caller saying: “Angelina Jolie is da bomb,” and “I am going to bomb Angelina Jolie’s house.”
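The pacing-volume-tone idea can be caricatured in a few lines. The sketch below is a toy only: the feature values, thresholds, and weights are all invented, and real emotion-detection products use far more sophisticated acoustic models.

```python
# Toy illustration of emotion flagging: score a call from crude prosodic
# features (invented numbers, not real audio analysis) and flag it for
# review when the combination suggests an upset caller. Every threshold
# and weight here is arbitrary.

def upset_score(volume_db, pitch_variance, words_per_minute):
    """Louder, more variable pitch, and faster speech all nudge the score up."""
    score = 0.0
    if volume_db > 70:
        score += 1.0
    if pitch_variance > 40:
        score += 1.0
    if words_per_minute > 180:
        score += 0.5
    return score

def flag_for_review(calls, threshold=1.5):
    """Return ids of calls whose features cross the review threshold."""
    return [cid for cid, feats in calls.items() if upset_score(*feats) >= threshold]

calls = {
    "call-1": (75, 55, 200),  # loud, erratic pitch, rapid speech
    "call-2": (62, 20, 140),  # calm
}
print(flag_for_review(calls))  # ['call-1']
```

A system like this never decides what the caller meant; it only tells a human reviewer which recordings are worth listening to, which is why the “da bomb” ambiguity Keating raises remains a hard problem.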
The use of speech analytics and emotion detection technology by call centers has also raised privacy and security concerns. For instance, companies have to safeguard against outside hackers, or those within the company who might use the technology to mine for customer account numbers and social security numbers contained in recorded calls. Also, consumer groups have raised concerns about customer privacy if transcripts of conversations are shared with other parts of a company beyond the call center. The idea that a company can keep a record of the emotions of customers makes some nervous that any time they get angry, their customer record will reflect that and influence the treatment they get from agents in future interactions.
But companies counter that such systems can also demonstrate to executives how one customer’s anger is not an isolated incident. Before this technology, they argue, each phone call was a separate entity. Because there was no practical, reliable way to aggregate the complaints of customers and see how widespread one complaint may be, companies often didn’t know they had a larger problem. Speech analytics and emotion detection can provide a way to produce tangible widespread evidence of trends that anger customers, thus empowering those who have to fight within a company for changes to address the problems.
Emotion detection systems, in particular, grew out of voice verification technology, which is also coming into wider use in call centers. Passwords, your mother’s maiden name, the name of your first childhood dog, or your favorite flavor of ice cream are some of the more common things companies ask for to make sure you are who you say you are. But your voice is even more distinctively yours. And unlike a PIN, it doesn’t slip your mind. Unlike photo ID cards, you always have it with you. And unlike fingerprinting, it isn’t messy.
When Bell Canada customers want to access their billing information, they need only speak the phrase, “At Bell, my voice is my password.” And if the voice on the phone matches the voice in Bell’s records, that customer is in, with no more questions asked. This Bell voice identification service uses voice biometrics technology to verify a caller’s identity. New customers are asked to say, “At Bell, my voice is my password,” four or five times when they sign up for the service. That gives the company a voiceprint to keep in its records. Every time the customer calls and says the phrase, the computer compares the voice on the phone against the voiceprint for verification. The company put real thought into the phrase it asks its customers to say. When it tested the phrase, “Bell is my telecommunications company,” focus groups said it sounded like mind control, so the company went with the more neutral wording.
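The enroll-then-match step can be sketched abstractly: several enrollment samples are averaged into a stored voiceprint, and a caller is accepted when a new sample is similar enough to it. This is a hedged illustration only; the three-number vectors below stand in for whatever acoustic features a real biometrics engine would extract, and the similarity threshold is invented.

```python
# Hedged sketch of voiceprint matching: average several enrollment feature
# vectors into a stored voiceprint, then accept a caller whose new sample
# is close enough (by cosine similarity). The vectors are placeholders for
# real acoustic features; the threshold is arbitrary.

import math

def average(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(voiceprint, sample, threshold=0.95):
    """Accept the caller if the new sample matches the stored voiceprint."""
    return cosine(voiceprint, sample) >= threshold

# Enrollment: the customer says the passphrase several times.
enrollment = [[0.9, 0.1, 0.4], [1.0, 0.2, 0.5], [0.95, 0.15, 0.45]]
voiceprint = average(enrollment)

print(verify(voiceprint, [0.92, 0.14, 0.46]))  # similar voice -> True
print(verify(voiceprint, [0.1, 0.9, 0.2]))     # different voice -> False
```

Averaging several enrollment samples smooths out the natural variation in how a person says the same phrase from one attempt to the next, which is why Bell asks for the passphrase four or five times.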
Creators of voice biometrics systems say they are more user friendly and more precise than many other security verification systems. But because the best are only about 80 to 90 percent accurate, they must be backed up by other identification measures. Still, the savings of using a computer to verify a customer’s identity are hard for companies to ignore.
A study by the British contact center analyst firm ContactBabel said that American call centers received 43 billion calls in 2007 and that contact center agents asked questions to verify identities in 41 percent of those calls. And although it takes only about twenty to thirty seconds for a call center agent to ask the questions, the study calculated that in 2007, American contact center agents spent 11,000 years’ worth of time, and the contact center industry spent $11.7 billion, checking callers’ identities.
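The arithmetic behind the ContactBabel estimate can be reproduced roughly; assuming verification sits at the lower end of the twenty-to-thirty-second range, the numbers line up with the study’s 11,000-year figure.

```python
# Reproducing the rough arithmetic behind the ContactBabel estimate:
# 43 billion calls, identity questions on 41 percent of them, at about
# twenty seconds of questioning per verified call.

calls = 43e9
verified = calls * 0.41              # calls that included identity checks
seconds = verified * 20              # ~20 seconds of questioning each
years = seconds / (365 * 24 * 3600)  # convert seconds to years
print(round(years))                  # roughly 11,000 years of agent time
```

At thirty seconds per call the total would approach 17,000 years, so the study’s figure implicitly assumes verification usually runs short.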
As with all other self-service technologies used in customer service, companies have to hope that advances in teaching computers to talk, listen, and maybe even think and feel will continue. And companies have to trust that sooner or later, their human customers will not only adapt to the many high-tech systems deployed to help them, but will even grow fond of the company-programmed machines that are increasingly infiltrating their lives. Robby Kilgore at Nuance Communications is optimistic that such a peace can be achieved between customers and automated corporate phone lines and Web sites across the world. His pathway to that might sound self-serving, but it also expresses his practical mission in trying to perfect the help systems. “Given that we’re kind of stuck with them, I think we need thoughtful human beings to design them.”
Besides, he points out, even if all automated customer service were eliminated, technology still wouldn’t be banished from the remaining human-to-human transactions. “The human being you get on the other end of the line,” he says, “is still interacting with a computer system on your behalf.”