CHAPTER 2

Testing, Testing

Does It Work?

Hoa Young, an Asian American woman who works for the St. Paul, Minnesota, office of Planning and Economic Development, is no stranger to ethnic insensitivity. Although she has lived in the United States for nearly forty years, people sometimes treat her as a foreigner. One woman came up to her at a bus stop and asked her in a loud voice whether she spoke English—and then gave her literature about how she could be saved by embracing Jesus. Young was thus in favor of a mandatory diversity-training program for all employees of the city of St. Paul. The employees watched six actors from a company called Theatre at Work enact sensitive workplace scenarios that involved issues of race, gender, religion, and xenophobia. An employment-law attorney moderated the sessions, which included a question-and-answer period.

The diversity program was clearly a well-intentioned attempt to increase tolerance, reduce conflict, and make sure people followed the law. Mayor Randy Kelly endorsed it, noting that “it is sure to raise awareness and sensitivity to issues we as public servants encounter on a daily basis.” Shoua Lee, another Asian American city employee, seemed to agree. “I have been in so many of those situations,” she said. “Then I kick myself for not saying anything.” But not everyone was so sanguine about the program. “I reject the message they are sending about diversity,” argued one middle-aged white man. “We should have a common culture and a common language. And it should be American culture, which is the greatest in the world.”1

How could we tell whether the skits performed by Theatre at Work were effective? It probably never occurred to the creators of the program, or the city that adopted it, that the program should be vetted scientifically before it was implemented so widely—just as it didn’t occur to hundreds of police and fire departments to vet Critical Incident Stress Debriefing (CISD), which, as we saw in chapter 1, turned out not to work. When similar fiascos occur in the medical establishment, such as the rushing of drugs to market before they are adequately tested, there is justifiable outrage and a call for more stringent testing procedures. Why aren’t there similar outcries about behavioral interventions?

THE FAILURE OF COMMON SENSE

Policy makers, self-help authors, and nonpsychologists of all stripes often rely on common sense to tell them how to solve problems—more so, certainly, than medical researchers do when considering whether a new chemical compound will relieve pain from migraines. Common sense tells us, for example, that showing people skits depicting workplace scenarios will raise their consciousness about diversity issues, and that getting people to talk through their feelings about a recent traumatic event is helpful. But common sense can lead us astray by not taking into account how the human mind really works—specifically, by failing to consider how small changes in people’s narratives can have a lasting impact on their behavior.

We will encounter many examples of the failure of common sense throughout this book. Know some teens headed for trouble? Scare the heck out of them by taking them to prisons and funeral homes. Want kids to avoid drugs? Bring police officers into their classrooms to explain the dangers of drugs and give lessons on how to resist peer pressure. These approaches make perfect sense but, it turns out, they are perfectly wrong, doing more harm than good. It is no exaggeration to say that commonsense interventions have prolonged stress, raised the crime rate, increased drug use, made people unhappy, and even hastened their deaths. The problem is that these interventions failed to take into account a basic premise of the story-editing approach, namely, that in order to solve a problem, we have to view it through the eyes of the people involved and get them to redirect their narratives about it. But how do we know that story editing works any better than commonsense approaches? In the remainder of this chapter we will see how.

WHAT ARE WE TRYING TO CHANGE?

The first question to ask with any intervention is, what is it that we want to change? And how can we measure that change? Diversity education programs such as Theatre at Work, for example, typically have several objectives, namely, (a) to instill awareness of the law and company policies, (b) to increase sensitivity to members of different groups, (c) to reduce prejudice, (d) to reduce discriminatory behavior, (e) to increase diversity in hiring and promotions, (f) to increase productivity in the workplace, and (g) to reduce lawsuits (or liability in the event of a lawsuit).

Measuring some of these outcomes is relatively straightforward: if we want to know whether employees are aware of the law and a company’s policies, we could give them a simple knowledge test, such as asking whether it is legal to ask a job candidate whether he or she is married. Measuring other outcomes, such as how prejudiced people are, is more challenging. Social psychologists have conducted a lot of research on this question, developing a number of ways of measuring people’s attitudes, including both questionnaire measures and indirect tests that are typically administered on a computer. Another approach is to measure not what people say but what they do. After a diversity-training program is implemented, for example, are fewer complaints filed about discriminatory behavior? Is there an increase in productivity in the workplace?2

Questions about how to measure people’s beliefs, attitudes, and behaviors are not trivial, but we can usually agree on what we want an intervention to change and how to measure that change. A more difficult question is whether the intervention is causing the desired change. It would be simple if there were a machine that displayed people’s thoughts on a computer screen, so that we could track those thoughts and see how they were influenced by our interventions. But fortunately (for those of us who would prefer to keep our thoughts to ourselves) no such mind-reading machine exists. We can put people in a magnetic resonance imaging (MRI) scanner and measure the flow of blood to different areas of their brains, or place electrodes on their scalps and measure the electrical activity of their neurons with electroencephalography (EEG). Although these techniques yield useful information, they are not “mind-reading” devices that reveal people’s specific thoughts, such as whether they are attributing a bad grade to low intelligence (as Bob did in chapter 1) or to the need to try harder (as Sarah did in chapter 1).

DON’T ASK, CAN’T TELL

Why not just ask people whether they have benefited from a program? This was the approach adopted by the Theatre at Work company (which performed the diversity-training skits in St. Paul). On its website, the company reports a study in which a researcher interviewed eight people who had attended a Theatre at Work performance and asked them to assess the impact the performance had on them. The results were encouraging: the attendees said that the workshop “had a noticeable impact on their interactions with coworkers as well as family members.” Similarly, researchers have asked people who took part in Critical Incident Stress Debriefing (CISD) sessions how helpful the intervention was, with encouraging results. In one study, 98 percent of police officers who witnessed traumatic events and underwent psychological debriefing reported that they were satisfied with the procedure.3

Unfortunately, such testimonials can be misleading. Sometimes people are less than truthful; perhaps they knew that the debriefing didn’t help much but they did not want to rain on the researchers’ parade. Although such disingenuousness can be an issue, there is a more fundamental problem with asking people about the effectiveness of interventions: they often don’t know the answer. True enough, we are the only species (as far as we know) endowed with consciousness, that navel-gazing, contemplative, sometimes angst-ridden ability to introspect about ourselves and our place in the world. But it turns out that consciousness is a small part of the human mental repertoire. We are strangers to ourselves, the owners of highly sophisticated unconscious minds that hum along parallel to our conscious minds, interpreting the world and constructing narratives about our place in it. It is these unconscious narratives that social psychologists target.4

This doesn’t mean that we are clueless about ourselves. We usually don’t have any trouble knowing how we feel, such as how happy, sad, angry, or elated we are at any given point in time. We are not very good, however, at knowing why we feel the way we do. Our minds don’t provide us with pie charts that we can examine and say, “Why am I feeling sad right now? Let’s see, 37 percent of the reason is that my spouse has been ignoring me, 28 percent is because I learned that my aunt is seriously ill, 13 percent is because of the state of the economy, and the rest is because my serotonin levels are a little low right now.” Instead, we develop theories about the causes of our feelings, just as we develop theories about the causes of other people’s feelings. These theories are often correct—most of us are good observers of ourselves, and we may have noticed that when our spouses ignore us we do feel sad and resentful. But few of us are perfect at knowing exactly why we feel the way we do because life is not a controlled experiment in which only one thing varies at a time, making it easy to see what effect it has on us. Instead, there are usually lots of plausible influences on our feelings and it can be difficult to tease apart which ones really are responsible for our current moods or attitudes.5

This is especially true when we encounter a novel situation—such as an intervention designed to influence our feelings and behavior—and have nothing to compare it to. When people have taken part in a diversity-training program, for example, how are they supposed to know to what extent, if at all, their current feelings and attitudes were influenced by the program? To answer this question they would have to know how they would be feeling if they had not participated in the program. This was the problem faced by many of the participants in the CISD studies. Let’s say that after undergoing the debriefing procedure they were still upset by the traumatic event they had witnessed. “Well,” they might think to themselves, “I would probably be feeling even worse if it weren’t for the CISD intervention. The facilitator seemed nice and everyone seems to believe in it, so I guess it helped me.” Unlike George Bailey in the movie It’s a Wonderful Life, we don’t have angels who can show us what our lives would be like under different circumstances.

Actually, it’s even worse than that. Once people have gone through a program designed to help them in some way, there is a tendency for them to misremember how well off they were before the program began, thereby overestimating the effects of the intervention. One experiment, for example, found that a study-skills program had no effect on college students—after the program their study skills were no better than those of students who hadn’t taken part in the program. But the participants believed that the program had been effective, because they mistakenly recalled that their skills had been much worse before the program began.6

In short, when it comes to evaluating the effectiveness of an intervention, I advocate a “don’t ask, can’t tell” policy—researchers should not assess the impact of a program by asking people how much they benefited from it. Human beings simply are not very accurate at assessing the causes of their own feelings, attitudes, and behavior. To be clear, I am not arguing that researchers should throw away their questionnaires and stop interviewing the recipients of interventions or those who implement them. The don’t ask, can’t tell policy should apply only to questions about how people were influenced by an intervention, because they can’t be expected to answer this question accurately. Asking people about their feelings, attitudes, opinions, and knowledge can be quite valuable—if we know what to compare their answers to.

VIVA THE EXPERIMENTAL METHOD

What should that point of comparison be? One common approach is called a pre-post design, in which researchers compare people’s beliefs, attitudes, or behavior before an intervention to their beliefs, attitudes, or behavior after the intervention to see if any changes have occurred. Suppose, for example, that before participating in a diversity-training program, only 30 percent of employees knew that in their state it is illegal to ask a job candidate whether he or she is married. Right after the program, 90 percent answer this question correctly. This would be pretty good evidence that the employees learned something useful from the program. Usually, however, the goal is to bring about long-term change, not just a fleeting blip in people’s knowledge. Here, things get dicier for the pre-post design, because the more time that passes, the less sure we can be that it was the intervention rather than something else in a person’s life that caused the change. Instead of administering our questionnaire right after the program, what if we waited a few weeks to see if it had lasting effects? The results are encouraging—85 percent of the respondents answer correctly that in their state it is illegal to ask about a job candidate’s marital status. But who knows what happened in the weeks following the intervention? Perhaps people forgot what they heard in our program, but picked up the information from a new company manual that just came out, or talked with coworkers, or read an article in the newspaper.

In other words, pre-post designs are imperfect because they do not control for things that might have influenced people other than the intervention. As seen in our example, this is especially problematic when we are measuring long-term change, because the longer we wait to measure our variables of interest, the more other things occur in people’s lives that could contaminate our results.
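
For readers who like to see the logic worked out, here is a rough simulation of the problem in Python. The numbers are purely illustrative (a hypothetical two hundred employees, a made-up program effect, and a made-up boost from the new company manual), not data from any actual diversity-training program:

```python
import numpy as np

# Purely illustrative simulation of the pre-post problem: a real program effect
# plus an unrelated "new company manual" that also raises knowledge during the
# follow-up period. All numbers are hypothetical.
rng = np.random.default_rng(0)
n = 200                               # hypothetical number of employees
true_program_effect = 0.20            # program raises the rate of correct answers by 20 points
manual_effect = 0.25                  # the manual raises it by another 25 points

pre = rng.binomial(1, 0.30, n)        # 30% answer the legal question correctly beforehand
post = rng.binomial(1, 0.30 + true_program_effect + manual_effect, n)

estimate = post.mean() - pre.mean()
print(f"Pre-post estimate of the program's effect: {estimate:.2f}")
print(f"True program effect:                       {true_program_effect:.2f}")
# The pre-post estimate lumps the manual's influence in with the program's,
# because nothing in the design separates the two.
```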

A better approach is to compare the people who received the intervention with a control group. Say we compared the employees who took part in our diversity-training program with a control group of employees who did not, and the ones who took part were more likely to know that it is illegal to ask job candidates about their marital status. For the sake of the argument, say that they also did better on other important measures—they showed more tolerance toward members of other races and were less likely to have complaints filed against them. Sounds like very good news for our intervention, doesn’t it?

But ah—there is still a critical ingredient missing from this evaluation of our program: random assignment to condition. If we didn’t randomly assign people to a control group, our results would be hard to interpret, because we could not rule out the possibility that the control participants differ in key respects from the people who received the intervention. This is the classic “correlation does not equal causation” problem: just because one variable (whether people took part in our program) is correlated with another (their tolerance toward their coworkers several weeks later) does not mean that the first variable caused the second one to occur.

Let’s say that participation in our program was voluntary and we compared those who chose to take part to a control group of participants who chose not to. This would be a flawed experiment because tolerant, knowledgeable people might be more inclined to volunteer to undergo diversity training and thus would show more tolerance down the road—not because the program worked but because of who they were at the outset. One study did in fact find that people who showed intercultural sensitivity and competence to begin with were more likely to volunteer for a diversity-training program. So scratch the idea of letting people choose which group to participate in.7

Instead, we could get a little more sophisticated and administer our intervention to people in one unit of a company and compare their beliefs and attitudes to people in a different unit who did not receive the intervention. This would be a better experiment because it avoids the “self-selection” problem, whereby people decide for themselves whether to receive the intervention. It is still possible, however, that employees in the two units differ in important ways. Maybe the people in one unit are younger, work in a part of the company that is more racially integrated, have more interracial friends or more tolerant bosses, or differ from the other group in any number of other ways that could influence the results of the experiment.

Random assignment is the Great Equalizer. If we divide people into two groups on the basis of a coin flip (and if the groups are reasonably large), we can be confident that the groups do not differ in their backgrounds, tolerance for members of other groups, personalities, political leanings, hobbies, or in any other way that might influence their behavior. It would be extremely unlikely that one group would have significantly more cat lovers, vegetarians, hockey fans, or marathon runners than the other, just as it would be extremely unlikely for a fair coin to turn up heads forty times on fifty flips. Thus, if the group that received the diversity training showed more tolerance than the group that did not, we could be confident that the training, and not some unrelated difference in people’s backgrounds, was responsible for the change.
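
To put numbers on that coin-flip intuition, here is a quick sketch in Python, again with invented figures rather than anything from the studies discussed in this chapter. It computes the exact probability of getting forty or more heads in fifty flips of a fair coin, and then assigns two hundred hypothetical employees, 30 percent of whom happen to be cat lovers, to two groups by a coin flip to show how evenly an irrelevant trait gets distributed:

```python
import random
from math import comb

# Exact probability that a fair coin comes up heads at least 40 times in 50 flips.
p = sum(comb(50, k) for k in range(40, 51)) / 2**50
print(f"P(at least 40 heads in 50 flips) = {p:.1e}")   # on the order of 1 in 100,000

# Hypothetical example: 200 employees, 30 percent of whom happen to be cat lovers,
# divided into two groups by a coin flip.
random.seed(1)
cat_lovers = [random.random() < 0.30 for _ in range(200)]
heads = [random.random() < 0.50 for _ in range(200)]    # the coin flip
group_a = [lover for lover, h in zip(cat_lovers, heads) if h]
group_b = [lover for lover, h in zip(cat_lovers, heads) if not h]
print(f"Cat lovers: {sum(group_a)} of {len(group_a)} in group A, "
      f"{sum(group_b)} of {len(group_b)} in group B")
# The two proportions typically land close together, and the same balancing
# happens for every other trait, whether or not anyone thought to measure it.
```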

STATISTICAL MAGIC?

Some researchers claim that there is a good alternative to the experimental method: if we can’t randomly assign people to conditions, then we can statistically adjust for variables that might bias our results. Say we did a study in which we found that the more people exercise, the longer they live. Because we did not randomly assign people to exercise a lot or a little, we can’t be sure whether it was the exercise that prolonged people’s lives or whether it was some third variable (e.g., maybe the exercisers smoked less than the nonexercisers). No problem, some researchers say. Just measure how much people smoke, use a statistical procedure called multiple regression that adjusts for the influence of third variables such as smoking, and presto, we have our answer. Essentially, this procedure looks at whether exercisers live longer at a given rate of smoking; for example, whether nonsmoking exercisers have longer lives than nonsmoking couch potatoes. If they do, we can be pretty sure that exercise has beneficial effects. The problem is, we can never measure all the “third variables” that might cloud our results. Suppose, for example, that exercisers are more likely to use sunscreen and wear their seat belts than couch potatoes are, but the researchers didn’t ask people about those habits. We cannot statistically control for things we don’t measure.
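
Here is a small simulation of how this adjustment strategy goes wrong when a third variable goes unmeasured. Everything in it is hypothetical: in this made-up world, exercise has no true effect on lifespan at all; careful people simply exercise more and also wear their seat belts, and seat-belt use, which the imaginary researchers never asked about, is what adds the years:

```python
import numpy as np

# Hypothetical illustration of the "third variable" problem. Exercise has NO
# true effect on lifespan here; careful people both exercise more and wear
# seat belts, and seat-belt use (never measured) is what adds years.
rng = np.random.default_rng(42)
n = 10_000
carefulness = rng.normal(size=n)
exercise = carefulness + rng.normal(size=n)      # careful people exercise more
smoking = rng.normal(size=n)                     # measured, and adjusted for below
seatbelt = carefulness + rng.normal(size=n)      # unmeasured confounder

# Note that exercise contributes nothing to lifespan in this made-up world.
lifespan = 75 - 2 * smoking + 2 * seatbelt + rng.normal(size=n)

# "Adjust for smoking" with ordinary least squares, as the text describes.
X = np.column_stack([np.ones(n), exercise, smoking])
coefs, *_ = np.linalg.lstsq(X, lifespan, rcond=None)
print(f"Estimated effect of exercise, adjusting for smoking: {coefs[1]:.2f}")
# Prints a clearly positive number even though the true effect is zero,
# because seat-belt use was never measured and so was never adjusted for.
```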

There are several well-known studies that fell into this trap and yielded misleading results, sometimes with life-and-death consequences. One was a survey that asked thousands of women questions about their health and whether they were taking hormone replacement therapy at menopause. The researchers found that the women who used hormone replacement therapy had fewer heart attacks than those who did not, after statistically controlling for a host of potential confounding variables. Many physicians relied on this study to recommend hormone therapy for their patients. But a later clinical trial, in which women were randomly assigned to receive hormone therapy or not, yielded the exact opposite results: hormone therapy increased the risk of heart attacks. It now appears that women in the first study who chose to undergo hormone replacement therapy were healthier at the outset in ways that the researchers did not measure, which led to the misleading results. Only by randomly assigning people to conditions can researchers be confident that they have controlled for all possible confounding variables and have identified a true causal effect.8

What if a statistical analysis finds no association between two variables? Doesn’t that prove that one is not causing the other? Not necessarily. A recent study surveyed human resources managers at more than seven hundred organizations in the United States to find out the kinds of strategies the companies were using to increase diversity. The researchers also examined the percentages of women and minority managers the companies actually hired, as gleaned from reports the companies had filed with the Equal Employment Opportunity Commission. It turned out that whether companies had diversity-training programs for their employees was unrelated to the number of women and minorities they had hired as managers. This appears to be pretty damning evidence against the effectiveness of these programs, and, in fact, that is just what the authors concluded: the diversity-training programs were ineffective at “increasing the share of white women, black women, and black men in management.”9

But again, not so fast! In this study, unmeasured third variables might have masked an actual causal effect of diversity programs. It is possible, for example, that the companies that implemented diversity programs differed in key respects from the ones that did not. Perhaps companies with unfair hiring practices were under pressure to do better, and rather than changing their actual hiring policies, they implemented diversity-training programs to take the heat off and make it look like they were doing something constructive. If so, any causal effect of these programs would be hard to detect, because it was the companies that were unwilling to change their hiring practices that were most likely to implement the programs.

In short, it takes a true experiment to settle the question of what causes what. If we were able to randomly assign companies to either implement or not implement diversity programs, for example, we could control for all possible confounding variables. The companies resistant to change, and those committed to it, would be divided evenly between our two groups. As simple as this sounds, it is not a lesson that is widely heeded—multiple regression is the tool of the trade for many economists, sociologists, and psychologists.

In these researchers’ defense, sometimes it is impossible, for practical or ethical reasons, to perform an experiment using random assignment to condition. To find out whether capital punishment influences the murder rate, for example, we could not randomly assign some cities or states to implement the death penalty and others to ban it. But I think that researchers are often too quick to throw in the experimental towel, deciding that random assignment to condition is impractical. It often seems difficult to test a new intervention that is designed to help people in some way, for example, when in fact it is possible to randomly assign people to a condition in which they receive the intervention or to a control condition in which they do not. If people in the intervention condition show more desirable responses to a statistically significant degree, we can be much more confident that the program is working. This is how I knew that my academic intervention with college students worked (see chapter 1): the students who received it got better grades, and were more likely to stay in college, than the students randomly assigned to the control group. If no differences are found between the intervention condition and the control condition—and we did our experiment well, with a large sample size—then we need to worry. This is in fact how we know that CISD does not work to reduce post-traumatic stress: a number of studies randomly assigned people to either undergo the procedure or not undergo it (the control group) and found no difference between the two groups (in some cases they found that the control group was better off).10
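
Here, in outline, is what such a test looks like as a computation. The numbers are invented, not data from my study or from the CISD trials: each hypothetical employee is assigned to the program by a coin flip, and a permutation test asks how often a difference as large as the one observed would turn up by chance if the assignment had nothing to do with the outcome:

```python
import numpy as np

# A minimal sketch of a randomized evaluation with hypothetical numbers.
rng = np.random.default_rng(7)
n = 400
program = rng.integers(0, 2, n).astype(bool)     # random assignment by coin flip

# Simulated outcomes: the program raises the rate of a desirable outcome
# (say, no complaints filed) from 70 percent to 80 percent.
outcome = rng.random(n) < np.where(program, 0.80, 0.70)

observed = outcome[program].mean() - outcome[~program].mean()

# Permutation test: reshuffle the group labels many times to see how often a
# difference this large arises when assignment is unrelated to the outcome.
null_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(program)
    null_diffs.append(outcome[shuffled].mean() - outcome[~shuffled].mean())
p_value = np.mean(np.abs(null_diffs) >= abs(observed))

print(f"Observed difference: {observed:.3f}, p = {p_value:.4f}")
```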

START SMALL

I certainly don’t mean to imply that it is easy to conduct experiments. Designing the appropriate control condition, for example, can be a tricky business. Even more vexing, some interventions may not be so easily packaged into a few sessions and delivered only to people in a treatment condition. Instead of evaluating a diversity-training exercise such as Theatre at Work, for example, suppose we wanted to test the effectiveness of a new way of teaching children to read. It might take six months or a year to tell whether the program is working, and meanwhile, the children randomly assigned to our control group don’t get the benefit of the new program. Though this is indeed a difficult trade-off, it is no different from the one faced by medical researchers when they test the effectiveness of a new treatment. People are randomly assigned to control conditions until the researchers and the medical community at large are convinced that the treatment works. Why should we have different standards for social, psychological, and educational interventions?

One solution is to conduct experimental tests of small-scale interventions, rather than beginning with massive efforts to bring about large-scale change. Before instituting a mandatory diversity-training program throughout the federal government, we might want to test it experimentally in one or two offices. Before requiring that all police officers and firefighters undergo CISD, we might test its effectiveness in a few small, well-controlled studies. Some of the reasons for starting small are obvious: it limits risk (unintended negative effects of an intervention) and saves money. Another reason is that it is difficult to predict how a scaled-up version of an intervention will work in the real world. Social scientists have many powerful theories about how things work, many of them based on small studies done in research laboratories on college campuses. Although these theories are excellent starting points, we can’t be sure how they will play out in real-world settings, so it is best to start small and see what happens.11

This lesson applies to the story-editing approach as much as it does to any other intervention. It can be difficult to predict whether a particular intervention will succeed in getting people to adopt healthy, self-sustaining narratives; thus it is advisable to begin with small trials. James Pennebaker’s first test of his writing technique, for example, was conducted with forty-six college students who were randomly assigned to different writing conditions, and my initial test of the academic improvement intervention involved forty college students randomly assigned to conditions. One problem with such small studies, of course, is that they are, well, small, involving a limited number of people rather than large, representative samples. But if promising results are found in small-scale experiments such as these, researchers can follow them up with replications and extensions to other groups of people, and eventually to large-scale implementations. Both the Pennebaker writing technique and my academic improvement intervention, for example, have been replicated numerous times with people of different backgrounds and nationalities.12
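
A rough power calculation shows why small studies are only a starting point. The effect size and sample sizes below are illustrative, not the actual figures from Pennebaker’s study or mine; the point is simply that with a modest true effect, a forty-person study detects it only a minority of the time, whereas a much larger study detects it almost every time:

```python
import numpy as np

# Hypothetical power simulation: how often does a study detect a modest true
# effect (half a standard deviation) with 20 people per condition versus 200?
rng = np.random.default_rng(3)
effect = 0.5   # true difference between conditions, in standard-deviation units

def power(n_per_group, trials=2_000):
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect, 1.0, n_per_group)
        # Two-sample z statistic; |z| > 1.96 counts as statistically significant.
        se = np.sqrt(control.var(ddof=1) / n_per_group +
                     treated.var(ddof=1) / n_per_group)
        hits += abs(treated.mean() - control.mean()) / se > 1.96
    return hits / trials

print(f"Power with 20 per group:  {power(20):.2f}")    # roughly one time in three
print(f"Power with 200 per group: {power(200):.2f}")   # close to 1.0
```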

PEERING INTO THE BLACK BOX

There are additional difficulties with conducting experimental tests of the story-editing approach. The premise is that people’s behavior emanates from their interpretations of the world and that these interpretations can often be redirected with relatively simple interventions. But these interpretations are often hard to verbalize, which means that we can’t rely on people to tell us whether our interventions worked in the way we think they have. The don’t ask, can’t tell principle applies as much to studies of story editing as to any other, making it difficult to directly test our hypothesis that we have changed people’s interpretations in the specified way.

In my intervention with first-year college students, for example, the participants who received the message that many students struggle at first but do better as time goes by improved their grades and were less likely to drop out of college, relative to the participants randomly assigned to a control group. But exactly how did this happen? We hypothesized that our message interrupted a self-defeating cycle of thinking in which the students blamed themselves for their academic problems, and prompted them to switch to a self-enhancing cycle of thinking in which they decided they could do better if they tried. But the astute reader will have noticed that I didn’t discuss any evidence showing that people changed their thinking in this manner. And that’s because there wasn’t any. Well, not much—we did find that people who got the intervention raised their expectations about their future academic performance more than people in the control group did, which is consistent with the predicted change in their thinking style. But that was the only shred of evidence consistent with our notion that people’s interpretations had changed in the way we expected them to.

This example reveals the limits of the research psychologist’s tools and measures. The mind is largely a black box that is inaccessible to its owner, and researchers don’t have mind-reading machines that reveal what’s in the black box. There is thus a lot of guesswork about how our interventions are working. We try to deduce what is happening inside the black box by seeing how different inputs change the outputs. Note that this approach is the same as in any other science in which researchers cannot directly observe the processes they hypothesize to be occurring. The idea that diseases can be caused and spread by microorganisms, for example, was developed before scientists had the ability to directly observe the microorganisms. Dr. Ignaz Semmelweis, a Hungarian obstetrician working in Vienna in 1847, noticed that up to 30 percent of women who delivered babies at Vienna General Hospital died of puerperal fever, whereas very few women who delivered at home did. He further observed that the women who died in the hospital were likely to have been examined by doctors who had just conducted autopsies. He formed the hypothesis, quite radical at the time, that puerperal fever was caused by a contagious organism that the doctors were unknowingly transmitting from the cadavers to the women. To test this hypothesis he had the doctors wash their hands in a chlorinated lime solution before examining the women. Deaths by puerperal fever dropped to about 2 percent, thereby confirming his hypothesis. Did Semmelweis’s bold experiment prove the germ theory of disease? It did not, because he had no way of directly measuring the presence of microorganisms. His results were certainly consistent with the theory, however, and—even better—he found a simple, inexpensive way of solving a deadly problem.

Research on story editing is at very much the same point. Simple interventions have been found to solve big problems. We don’t always know exactly why, because we cannot directly measure the thoughts inside the black box that we believe we have changed. Social psychologists are clever methodologists, however, and have prodded and probed the black box in ways that have produced some provocative results. In the rest of the book we will see what they have found, beginning with research on how to become happier.