A question that bedevils many upstream interventions is: What counts as success? With downstream work, success can be wonderfully tangible, and that’s partly because it involves restoration. Downstream efforts restore the previous state. My ankle hurts—can you make it stop? My laptop broke—can you fix it? My marriage is struggling—can you help us get back to the way we were? In these situations, there’s not much conceptual handwringing about what constitutes success. If your laptop starts working again, that’s victory.
But with upstream efforts, success is not always self-evident. Often, we can’t apprehend success directly, and we are forced to rely on approximations—quicker, simpler measures that we hope will correlate with long-term success. But because there is a separation between (a) the way we’re measuring success and (b) the actual results we want to see in the world, we run the risk of a “ghost victory”: a superficial success that cloaks failure.
In this chapter, we’ll scrutinize three kinds of ghost victories. To foreshadow the three varieties, let’s imagine a long-struggling baseball team that is determined to remake itself as a winner. Because that journey may take years, the manager decides to emphasize power hitting—especially more home runs—as a more proximate measure of success. In the first kind of ghost victory, your measures show that you’re succeeding, but you’ve mistakenly attributed that success to your own work. (The team applauds itself for hitting more home runs—but it turns out every team in the league hit more, too, because pitching talent declined.) The second is that you’ve succeeded on your short-term measures, but they didn’t align with your long-term mission. (The team doubled its home runs but barely won any more games.) And the third is that your short-term measures became the mission in a way that really undermined the work. (The pressure to hit home runs led several players to start taking steroids, and they got caught.)
That first type of ghost victory reflects the old expression “A rising tide lifts all boats.” If you’re in the boat-lifting business, you will be tempted to ignore the tide and proclaim success. That happened in the 1990s as crime fell precipitously across the US. In any particular city, the police chief looked like a miracle worker. A dozen different policing philosophies all looked right because crime was dropping everywhere. “Put it this way: Every police chief in the country who was in office in the ’90s has a lucrative consulting company right now,” said Jens Ludwig from the University of Chicago Crime Lab (whom we met in chapter 7). “And almost no police chief who worked in the late ’80s, during the crack cocaine era, has a lucrative consulting company.”
This is not to imply, by the way, that the people winning those ghost victories were being deceptive. In their eyes, and in the eyes of the people they were helping, the success was real. In almost every American city, crime really was falling. But their individual stories of causation were likely wrong.
Ghost victories, in all their forms, can fool almost anyone—even (or perhaps especially) the people achieving the “successes.” It’s only when you examine them very closely that you can spot the cracks—the signs of separation between apparent and real success. For Katie Choe, the chief engineer for the City of Boston’s Public Works Department, those first anxiety-inducing clues came in the form of two maps that she’d commissioned in 2014.
Part of Choe’s job was to determine how to spend the city’s funds for sidewalk repair, and the first map revealed the current condition of the city’s sidewalks. In a herculean feat of cartography, a team had walked all 1,600 miles of sidewalks during a Boston winter, rating the condition of every segment. Thirty percent of the city’s sidewalks—labeled in red—were rated in poor condition.
The second city map was a heat map showing where certain 311 calls had originated—specifically, those calls requesting sidewalk repairs. Choe’s group had been using the 311 calls to direct the sidewalk-maintenance crews. If a Bostonian called to report a cracked sidewalk, the city would add the complaint to a queue and send construction crews to complete the repairs as resources allowed.
Looking at the maps side by side convinced Choe that something had gone badly wrong. The city’s sidewalks were in terrible shape in the lowest-income areas of Boston, but those sidewalks weren’t getting fixed, because the 311 calls—which determined how repair dollars were spent—came disproportionately from the rich areas.
In other words, in Boston, the squeaky wheels got the grease—and the squeakiest wheels were rich people.
Choe’s team had been unwittingly discriminating against low-income Bostonians. But the inequity had been neatly concealed by the way they’d been measuring themselves. The sidewalks team had evaluated their work in three ways. First, they looked at spending. The city government divided Boston into three zones for ease in administration, and each zone was allocated a similar sidewalk-repair budget, roughly $1.5 million apiece. The second measure was the square footage of sidewalks repaired, which was a measure of the productivity of the repair teams. The third and final measure was the number of 311 cases closed.
Three simple measures. Perfectly reasonable. Together, they reflect the values of equity, productivity, and constituent service. It’s easy to see how you could cruise along for years, navigating by these measures and never questioning them. It was only because of the two maps—and the soul-searching they sparked—that Choe realized how distorted the measures were.
For one thing, dividing the city into three parts, and investing in each equally, did not in any way ensure equity, because the money within each area was ultimately spent based on who called 311 to complain. The rich parts of all three areas got served disproportionately. About 45% of the city’s repairs were performed on sidewalks rated in good condition!
You might ask, well, why didn’t the low-income people call? They had equal access to 311. And the simplest answer is that almost everything in their experience had suggested that the city was not interested in investing in them. All you had to do was look around their neighborhoods. Frank Pina, who lived in the low-income Grove Hall area, showed a Boston Globe reporter the spider-webbing cracks on the sidewalk in front of his home. The cracks had been there for years. Asked why he didn’t call for repairs, he said, “Nothing would get done.”
The rich people believed they would get served, so they called, and they were served. The poor people believed they’d be neglected, so they didn’t call, and they were neglected. Boston had created two self-fulfilling prophecies.
Compounding the problem was the way jobs were prioritized. Imagine you’re part of a construction crew facing more requests for repairs than you could ever complete. And you know you’ll be evaluated partly on how many of those requests you complete. Which jobs would you prioritize? The easy ones, of course. The quick fixes. That incentive led to ridiculous outcomes: For instance, 15% of the city’s repairs in 2017 were completed on sidewalks in poor condition—and were still rated in poor condition after the repairs were complete. (I.e., a crew might have fixed one hole but ignored another one a short distance away.) Kind of like a surgeon who sees a patient with three gunshot wounds, patches one of them, and congratulates herself for speedy service.
To Choe’s credit—and she is quick to recognize the mayor and other city leaders for supporting her work—she took decisive action on these issues. Her first question was: What are we trying to accomplish, ultimately, with these repairs? Two goals seemed paramount: walkability and equity. Sidewalks are supposed to allow for easy walking from place to place—repairing a rough patch in a cul-de-sac is far less important than making a similar repair in a high-foot-traffic area. And the places where walkability was most needed were the places that had been historically neglected.
Before Choe’s intervention, somewhere between $3.5 million and $4 million of the city’s $4.5 million budget for sidewalk maintenance and small repairs went to serve 311 calls. That number is now about $1 million. The priorities have been flipped: The first people helped are not the ones who ask the loudest but the ones who need it worst. The bulk of the repair budget now goes to strategic, proactive efforts to overhaul damaged sidewalks in the areas where it will make the most difference. “We are serving people who really need it—people who have felt under-invested in and felt like the city may actually have abandoned them at some point,” said Choe.
It would be a mistake to assume that this was an easy victory, or that it will be a permanent one. Despite the comparatively low stakes—$4 million to $5 million in a city budget is chicken feed—Choe needed air cover from the mayor. Which tells you something about the political sensitivities involved. And if Boston’s squeaky wheels think that it’s taking longer for the cracks on their sidewalks to get fixed, they will start calling politicians. What will happen then?
Choe is also struggling with what measures of success should replace those used in the past. The team’s aspiration is clear enough: to use sidewalk-repair dollars as leverage to create more practical mobility in the most vulnerable neighborhoods in Boston. But how do you measure that, exactly? Ideally, you’d have tallies of how many people were walking to schools and parks and businesses, before and after the work, and you could celebrate the increases. But how big would those increases have to be to satisfy you? And where would you get those pedestrian counts? Would you try to access surveillance cameras to gather the data, or would privacy issues outweigh your measurement concerns? Would you hire someone to stand at intersections with a counting device, clicking as every human being walks by? (Wacky as it sounds, they’re trying that, but it’s expensive.)
Part of what made the old metrics in Boston so appealing was how simple they were to access and understand. In his book Thinking, Fast and Slow, the psychologist Daniel Kahneman wrote that our brains, when confronted with complexity, will often perform an invisible substitution, trading a hard question for an easy one. “Many years ago I visited the chief investment officer of a large financial firm, who told me that he had just invested some tens of millions of dollars in the stock of Ford Motor Company,” wrote Kahneman. “When I asked how he had made that decision, he replied that he had recently attended an automobile show and had been impressed. ‘Boy do they know how to make a car!’ was his explanation.… The question that the executive faced (Should I invest in Ford stock?) was difficult, but the answer to an easier and related question (Do I like Ford cars?) came readily to mind and determined his choice. This is the essence of intuitive heuristics: When faced with a difficult question, we often answer an easier one instead, usually without noticing the substitution.”
In Boston, the easy questions to answer were: How much are we spending per area? Are we addressing citizen complaints? And how many square feet of sidewalk are we repairing? Those weren’t the right questions, but they were the easy ones.
This substitution—of easy questions for hard ones—is something that happens with both downstream and upstream efforts. But what’s distinctive about upstream efforts is their longer timelines, and those timelines force a second kind of substitution. One tech company was reconsidering how to measure its email marketing campaigns, as reported in a research paper by the economists Susan Athey and Michael Luca. Originally, the firm had been measuring the sales generated by its promotional emails, but that was a noisy measure, since it might take weeks before customers placed an order. And it was complicated to link the purchase back to the original email that the customer had received. So the company switched to a new measurement: “open rates,” or the percentage of people who opened the company’s emails. The open rate could be observed quickly—numbers tallied within hours—and it was useful, in the sense that you could measure the effects of simple tweaks to the email message. Very soon, the open rates increased, thanks to the marketers’ creativity.
But within months, the company knew it had a problem: The sales generated per email had declined precipitously. Why? Athey and Luca explained that “the successful emails (using the opening rate metric) had catchy subject lines and somewhat misleading promises.” (I.e., just think of every email ever sent by a politician: Want to have a beer, DAN?) The short-term measure the leaders chose did not align with their true mission, which was to boost sales.
Choosing the wrong short-term measures can doom upstream work. The truth is, though, that short-term measures are indispensable. They are critical navigational aids. In the Chicago Public Schools case, for instance, the district’s leaders ultimately cared about reducing the dropout rate. That was the mission. But they couldn’t afford to wait four years to see whether their theories were paying off. They needed more proximate metrics that could guide their work and allow them a chance to adapt. Freshman On-Track (FOT) was the first, but even that was too long-term. (You can’t afford to wait until the end of freshman year to see whether students are off track, because if they are, the damage has already been done.) So the school leaders started watching attendance and grades—measures you could examine and influence on a weekly basis. The theory of change was: If we can boost attendance and grades, we can improve a student’s On-Track standing, and that will boost her chances of graduating. The short-term measures were well chosen: The plan worked brilliantly, as we saw.
Getting short-term measures right is frustratingly complex. And it’s critical. In fact, the only thing worse than contending with short-term measures is not having them at all.
We’ve seen two kinds of ghost victories so far—one is caused by an effort that’s buoyed by a macro trend, like the local police chief heroes in the ’90s who were primarily surfing a nationwide reduction in crime. The second kind of ghost victory happens when measures are misaligned with the mission. That’s what Katie Choe realized about Boston’s sidewalk repairs: The city had chosen the wrong short-term measures.
There is also a third kind of ghost victory that’s essentially a special case of the second. It occurs when measures become the mission. This is the most destructive form of ghost victory, because it’s possible to ace your measures while undermining your mission.
I’ve “won” this kind of ghost victory. When I was a boy, my father offered to pay me $1 for every book of the Bible I read. With 66 books in the Bible, I stood to gain a windfall of $66, which could immediately be reinvested into Atari 2600 cartridges. My father intended for me to start with Genesis and read from beginning to end. Instead, I started with Second John, Third John, and Philemon—the three shortest books in the Bible. I can remember the look of disappointment and disbelief on his face as I tried to claim my first $3 installment.
I’d aced the measures and made a mockery of the mission.
In England in the early 2000s, the Department of Health had grown concerned about long wait times in hospital emergency rooms, according to a paper by Gwyn Bevan and Christopher Hood. So the department instituted a new policy that penalized hospitals with wait times longer than four hours. As a result of the policy, wait times began to shrink. An investigation revealed, however, that some of the success was illusory. In some hospitals, patients had been left in ambulances parked outside the hospital—up until the point when the staffers believed they could be seen within the prescribed four-hour window. Then they wheeled the patients inside.
We’ve all heard stories like this before. People “gaming” measures is a familiar phenomenon. But gaming is actually a revealing word, because often these stories are told with an air of playfulness. (I told my own story about books of the Bible that way, mostly to disguise my own embarrassment.) But for many upstream interventions, gaming is not a little problem, not just a quirky, mischievous aspect of human behavior; it’s a destructive force that can and will doom your mission if you allow it. We need to escalate the rhetoric: People aren’t “gaming metrics,” they’re defiling the mission.
Consider the spectacular drop in crime in New York City. Murders peaked at 2,262 in 1990, and they have fallen in almost every year since, down to 295 in 2018, an 87% drop. Major crimes as a whole declined by more than 80%. Many observers trace the long-term decline to changes made in 1994, when new leadership in the New York City Police Department (NYPD) established a new system called CompStat. (Even as we discuss the CompStat strategy, don’t forget the “rising tides” point—crime was falling in other cities that were using very different approaches.)
To simplify somewhat, CompStat had three key components. First, the police began to track crimes obsessively, gathering data and using maps to pinpoint the locations where crime was happening. Second, police chiefs were asked to allocate their resources based on the patterns in the data; in other words, if there was a rash of robberies in a certain neighborhood, they should shift officers there. Third, leaders at the precinct level were held accountable for reducing crime in their areas. It’s that last point that created some terrible unintended consequences. Recall Joe McCannon’s point from chapter 5 about using data for “inspection”: When people’s well-being depends on hitting certain numbers, they get very interested in tilting the odds in their favor.
In 2018, Reply All, a podcast from Gimlet Media, ran a two-episode series on CompStat and its legacy. It’s a stunning piece of work—essential listening for anyone who is grappling with the tensions between measures and mission. The podcast host, PJ Vogt, explained how the local chiefs reacted to CompStat’s new focus on accountability:
“If your crime numbers are going in the wrong direction, you are going to be in trouble. But some of these chiefs started to figure out, wait a minute, the person who’s in charge of actually keeping track of the crime in my neighborhood is me. And so if they couldn’t make crime go down, they just would stop reporting crime.
“And they found all these different ways to do it. You could refuse to take crime reports from victims, you could write down different things than what had actually happened. You could literally just throw paperwork away. And so [the chief] would survive that CompStat meeting, he’d get his promotion, and then when the next guy showed up, the number that he had to beat was the number that a cheater had set. And so he had to cheat a little bit more.…
“The chiefs felt like they were keeping the crime rate down for the commissioner. The commissioner felt like he was keeping the crime rate down for the mayor. And the mayor, the mayor had to keep the crime rate down because otherwise real estate prices would crash, tourists would go away. It was like the crime rate itself became the boss.”
The tendency to lessen the severity of crimes in order to dodge criticism became known as “downgrading.” Reply All included a chilling example of downgrading. Here’s the conversation between the host (PJ) and Ritchie Baez, a 14-year veteran of the NYPD (and a caution to readers: there’s a description of rape in the passage ahead):
Think about this: An NYPD official is held accountable for rape statistics. There are two ways to make those numbers look better. The first way is to actually prevent rape—to project the police’s presence into dangerous areas and thereby deter the violent acts. (That’s what would have happened if Ritchie and his partner had arrived at the scene just a few minutes earlier.) The second way to reduce the rape count is to reclassify actual rapes as lesser crimes—in this case, Ritchie’s boss tries to reframe the incident with the prostitute as a “theft of service.” The first way constitutes a victory; the second way is an abomination. But, tragically, both would look the same in the data.
Here’s what makes this whole subject even trickier: Crime really did go down—way down—in New York City. But that success became a kind of trap. As it became harder and harder to sustain the real decline in crime, it became more and more tempting to fiddle with the numbers instead.
We cannot be naïve about this phenomenon of gaming. When people are rewarded for achieving a certain number, or punished for missing it, they will cheat. They will skew. They will skim. They will downgrade. In the mindless pursuit of “hitting the numbers,” people will do anything that’s legal without the slightest remorse, even if it grossly violates the spirit of the mission, and they will start to look more favorably on what’s illegal.
Not all of us will stoop to this behavior all the time. But most of us will some of the time.
Imagine a high school principal who’s getting leaned on, hard, to move the dropout rate. What’s the right way to reduce the dropout rate? Keep kids engaged, monitor their performance carefully, and support them relentlessly. But that’s hard, and this principal is lazy. So how else could the principal make the dropout rate budge? He could telegraph to his teachers that Fs are banned from their gradebooks. Never mind what students learn—if they make even a trivial effort to be present, then they should pass, and they should advance, and they should graduate. That’s a ghost victory. More cleverly, the principal could play the downgrading game. Any time a student dropped out, he could consider her situation with the counselor, squint really hard, and come to the determination that she had “TRANSFERRED” (to another school), not “DROPPED OUT.” Dropping out counts against you; transferring doesn’t. And who’s gonna find out? Who’s to say that the student didn’t intend, in her heart of hearts, to enroll in a different school in the next semester?
Could the entire success story at Chicago Public Schools be a ghost victory, because of factors like these? The answer is no. But we only know that because CPS had the courage to expose itself to scrutiny. Researchers at the University of Chicago Consortium on School Research, led by Elaine Allensworth, scoured the district’s data and found that there was, in fact, reason to believe that gaming had happened—that some dropouts had been falsely relabeled transfers. But the researchers also found that the incidence of gaming was insubstantial relative to the size of the gains in graduation.
The researchers also addressed the first type of ghost victory—those caused by surfing a macro trend. Graduation rates are rising nationally—a rising tide is lifting all boats—but the researchers found that CPS’s efforts had “outpaced the increases in most other districts.”
To address the other risk—that students were graduating just because they got passing grades despite poor performance—the researchers looked at several other indicators. Attendance had improved significantly, suggesting that something real and behavioral had changed. The number of students taking AP (Advanced Placement) courses and the number scoring well had both increased. But most convincing was the students’ performance on the ACT college admissions test, which the state required all students to take. “If schools were simply passing students through to graduation, we would expect the tested achievement levels of students would decline,” the researchers wrote. But that didn’t happen. ACT scores improved by almost 2 points from 2003 to 2014, a gain that reflects “the equivalent of almost two years of learning.”
CPS’s success is no ghost victory. Their measures matched the mission. And the way the district’s leaders accomplished that is instructive. They used what Andy Grove, the former CEO of Intel, called “paired measures.” Grove pointed out that if you use a quantity-based measure, quality will often suffer. So if you pay your janitorial crew by the number of square feet cleaned, and you assess your data entry team based on documents processed, you’ve given them an incentive to clean poorly and ignore errors, respectively. Grove made sure to balance quantity measures with quality measures. The quality of cleaning had to be spot-checked by a manager; the number of data-entry errors had to be assessed and logged. Note that the researchers who assessed CPS used this pairing: They balanced a quantity metric (number of students graduating) with quality ones (ACT scores, AP class enrollments). In New York City in 2017, the NYPD finally added some complementary measures to CompStat: questions for local citizens that measure how safe they feel and how much they trust the police.
Any upstream effort that makes use of short-term measures—which, presumably, is most of them—should devote time to “pre-gaming,” meaning the careful consideration of how the measures might be misused. Anticipating these abuses before the fact can be productive and even fun, in sharp contrast to reacting to them after the fact. Here are four questions to include in your pre-gaming:
There’s a fifth question, too, that should be asked, and it’s so complicated that we’ll spend the next full chapter exploring it:
As we know, good intentions are not enough to ensure that upstream work succeeds. When we try to prevent future problems, there’s always a risk that we’ll fail. But beyond that, there’s a risk that our efforts to do good might actually cause harm instead. Ahead: the struggle to anticipate the ripple effects of our work.