3

 

The Politics of Tests

 

What is the most important lesson the young bird learns in this story?

(A) Stay close to your nest.

(B) Be careful where you land.

(C) Swimming is easier than flying.

(D) The marsh is safer than the pond.

—Florida Comprehensive Assessment Test (FCAT) 2.0, third grade reading test sample question

On the lawn of a single-story apartment complex in south Orlando, on a sunny afternoon in October 2013, a gaggle of kids were shooting squirt guns and slurping homemade Hawaiian Punch popsicles, killing time before football and cheerleading practice. One of them was Laela Gray, a petite, blond-haired, blue-eyed nine-year-old who resembled Stephanie from the nineties sitcom Full House. She and her twin brother, Caleb, were the middle children of five, and their mother, Ali Gray-Crist, thirty, said, “Out of all five she is the one that loves school the most. She loves to read. She was always good about doing her homework and got really good grades.”

In third grade, however, there was a break in the straight As. Laela started bringing home Cs in reading. Her mother says this was largely due to the teacher’s introduction of short passages with comprehension questions in preparation for the annual Florida Comprehensive Assessment Test (FCAT) state tests. Laela was no good at answering these kinds of questions. “A lot of the passages were really difficult for her because the questions were tricky. It says ‘pick the best one,’ but there’s two best answers.”

In Florida the third- through tenth-grade tests aren’t just used to grade the performance of schools and of teachers; they’re also used in a third way they were never designed for: to determine whether individual students should be promoted from one grade to the next.

When Laela got her FCAT scores back in the spring of 2013 there was bad news. “I failed,” she tells me shyly, sitting on her couch and staring straight ahead at her brother’s video game. She had failed the reading portion of the test by 1 point: 181, when the cutoff score was 182. Caleb would be going on to the fourth grade without her.

How did our nation arrive at a system that gives one point on a standardized test more weight than four years of a little girl’s performance in school?

Step by confusing, largely well-intentioned step.

A Nation at Risk . . . Thanks to Faulty Math

The origins of the latest era of high-stakes testing are usually traced to the publication in 1983 of a now-infamous report titled A Nation at Risk: The Imperative for Educational Reform. “The educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a Nation and a people” is probably its most quoted line.

A Nation at Risk marked the establishment of an enduring bipartisan consensus that schools were failing and that tests could tell us how and why. The authors were largely educational authorities—commissioners of education, university presidents. Their findings were reported as fact. But they weren’t pursuing a dispassionate inquiry. A Nation at Risk was produced by a group already alarmed by what they saw as the decline of America’s educational standards.

As Yvonne Larsen, the vice chair of the commission that authored the report, described it, “I was called by the President’s office. They told us that we were going to have a commission of about 18 people and meet for a year and a half to two years to address the challenge that we faced in trying to upgrade America’s education to the rigorous education that we had in the past. . . . We felt the rigor in our schools had diminished. We were concerned. There was a strong feeling that if we continued how we were going, we wouldn’t continue to improve.”

The case for decline—again, something the authors agreed upon before starting their study—is made with scary statistics like, “The average achievement of high school students on most standardized tests is now lower than 26 years ago when Sputnik was launched” and “The College Board’s Scholastic Aptitude Tests (SAT) demonstrate a virtually unbroken decline from 1963 to 1980. Average verbal scores fell over 50 points and average mathematics scores dropped nearly 40 points.”

The reference to Sputnik was no accident. A Nation at Risk made rescuing our failing schools a matter of Cold War–era national security as well as global economic competitiveness. They could have said “the average achievement of high school students on standardized tests is lower now than 26 years ago, when Jimmy Hoffa was caught” or “when Jack Kerouac wrote On the Road.” But that would have been off-message.

The report was scary. How accurate was it?

“The Bad News Is Right”

The apparent “risk” in A Nation at Risk was a basic manipulation of statistics, one that is actually acknowledged briefly in the report itself: “It is important, of course, to recognize that the average citizen today is better educated and more knowledgeable than the average citizen of a generation ago—more literate, and exposed to more mathematics, literature, and science.”

If citizens are more educated and knowledgeable, how can you fear a “rising tide of mediocrity”? The average decline in performance on standardized tests between the 1960s and the 1980s was real, but it was easily predictable, and in fact entirely explained by the fact that larger numbers and more diverse groups of students were taking the tests in the 1980s than in the 1960s.

A follow-up analysis of historical test score data commissioned by the Department of Energy in 1990 showed the opposite of what A Nation at Risk claimed. When broken down by population subgroup, “To our surprise, on nearly every measure, we found steady or slightly improving trends,” wrote the authors of the Sandia Report, who weren’t educators or partisans with an ideological stake but rather engineers taking an objective look at the data for economic planning purposes.

“Steady or slightly improving trends” means that student test scores between the 1960s and the 1980s didn’t get worse—they stayed the same or got better. The aggregate downward trend on the SATs was produced by a statistical effect known as “Simpson’s paradox.” Although each subgroup—boys, girls, whites, blacks, Hispanics, low-income students, high-income students—was improving or holding steady from year to year, the mix of students was changing. In the 1960s schools were in the process of racial integration. Men dominated in higher education. Only top students took standardized tests or applied to college. By the eighties, however, society was more open, and college was more important. More kids of all backgrounds were taking the tests—more women, more racial minorities, more working-class students. The lower average scores of less advantaged groups pulled down the overall average.
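The arithmetic behind that effect can be sketched in a few lines of Python. The numbers below are invented for illustration and are not taken from the Sandia Report or from SAT data; the two subgroup names and the `weighted_mean` helper are likewise hypothetical. The sketch only shows how every subgroup's average can rise while the overall average falls once the mix of test takers changes.

```python
# Hypothetical illustration of Simpson's paradox.
# All figures are invented; none come from the Sandia Report or real SAT data.

def weighted_mean(groups):
    """Overall average score for a list of (test_takers, mean_score) pairs."""
    total_takers = sum(n for n, _ in groups)
    return sum(n * score for n, score in groups) / total_takers

# (number of test takers, average score) for two made-up subgroups.
# In the earlier period few less-advantaged students take the test at all.
earlier = {"advantaged": (900, 500), "less_advantaged": (100, 400)}
# In the later period both subgroups score higher, but far more
# less-advantaged students are now sitting the test.
later = {"advantaged": (1000, 510), "less_advantaged": (1000, 420)}

overall_earlier = weighted_mean(list(earlier.values()))
overall_later = weighted_mean(list(later.values()))

for name in earlier:
    print(name, earlier[name][1], "->", later[name][1])
print("overall", overall_earlier, "->", overall_later)
```

In this made-up case each subgroup gains 10 or 20 points, yet the overall average drops from 490 to 465, because the lower-scoring group has grown from a tenth of test takers to half of them.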

A Nation at Risk was covered on radio and TV. There was a series of public hearings attended by hundreds around the country. President Reagan mentioned it in several speeches. The Sandia Report, which brought good news, had no such public profile; instead, it was the target of suppression efforts by Department of Education officials, including Diane Ravitch, who wrote an op-ed headlined, “U.S. Schools: The Bad News Is Right,” and Deputy Secretary of Education David Kearns, who allegedly told the authors, “You bury this or I’ll bury you.”

The narrative of educational system failure based on either low or declining test scores was just too compelling for its own good and too useful for those in power. It sounds truer every decade that it’s repeated. Over the years of political jockeying, tests have taken their place as, first, the measure of our school system’s performance and, then, the solution to the problems they inevitably diagnose.

1980s: The Achievement Gap

Throughout the 1980s, partly in response to A Nation at Risk, more schools gave more tests more often: the Iowa Basic Skills Test, the California Achievement Test, the Stanford Achievement Test, and more. States and district bureaucracies grew to administer and collect the results of these tests. And states began attaching the tests to consequences for districts, schools, and individual students, such as grade promotion and high school graduation.

Immediately a disproportionate impact was felt on historically disadvantaged groups, including poor, African American, English-learner, recent-immigrant, rural, and Native American students. True to their historical origins, standardized exit tests ended up denying certain groups their high school diplomas at higher rates than if the diplomas were based on class grades alone.

FairTest, a coalition of education reformers and civil rights activists, formed in 1985 to respond to the new tests, released a paper in 1987 titled, “Fallout from the Testing Explosion: How America’s 100 Million Standardized Tests Undermine Excellence and Equity.” The paper argued, “Standardized tests often produce results that are inaccurate, inconsistent, and biased against minority, female, and low-income students. Such tests shift control and authority into the hands of the unregulated testing industry and can undermine school achievement by narrowing the curriculum, frustrating teachers, and driving students out of school.”

A litany of complaints arose about testing from the point of view of equity, pedagogy, and influence on teachers, complaints that were just emerging then yet still sound current thirty years later.

Clear National Standards: 1980s and 1990s

The eighties and early nineties were not a time when it was fashionable to address racism or child poverty directly; rather, it was the era of “ending welfare as we know it.” Arguments about cultural bias in testing were dismissed as cultural relativism. Educational tests were positioned again and again as a means of diagnosing the perpetual crisis in education and, at the same time, a cure.

“The first President Bush had proposals for national testing,” Bob Schaeffer of FairTest recounts. “That was defeated by a strange-bedfellows coalition of progressive educators, civil rights activists, and [archconservative 1960s antifeminist] Phyllis Schlafly. Then Bill Clinton pushed for some form of national testing. The political alignment changed, with more mainstream Democrats supporting those kinds of proposals, but for partisan reasons many more Republicans opposing it.”

Bill Clinton claimed to represent a “Third Way” in politics—smarter government, not bigger government. More tests, more data, fit into that technocratic vision. On the occasion of Clinton’s election in 1992, Marc Tucker, an influential reformer and president of the National Center on Education and the Economy, wrote a letter to the new First Lady, Hillary Clinton, laying out an ambitious four-part plan reimagining the education system as “a seamless web of opportunities to develop one’s skills that literally extends from cradle to grave and is the same system for everyone—young and old, poor and rich, worker and full-time student.”

Entered into the Congressional Record in 1998, what became known as the “Dear Hillary letter” mostly focused on Tucker’s longstanding interest in workplace preparedness and a national apprenticeship system modeled on Germany’s. But Part 4 laid out an ambitious program of elementary and secondary education reform that is fully recognizable two decades later. Tucker called for an “aggressive” move toward school choice, for freeing school administrators from union rules, and making school professionals “fully accountable.” Most of all, he wanted uniformly high, unyielding performance standards, stating, “Clear national standards of performance in general education . . . set to the level of the best achieving nations in the world for students of 16, and public schools are expected to bring all but the most severely handicapped up to that standard. . . . A national system of education in which curriculum, pedagogy, examinations, and teacher education and licensure systems are all linked to the national standards” (emphasis mine).

Marion Brady, an education reformer who began his career in 1952, says the impact of “Dear Hillary” grew in tandem with another notable sally that appeared around the same time. This one came from the right: a 1995 Washington Post editorial by Milton Friedman, mandarin of the conservative movement and winner of the 1976 Nobel Prize in Economics. In the piece, titled, “Public Schools: Make Them Private,” Friedman called for a voucher system enabling students to use taxpayer money to attend independent schools. Like Tucker, he focused on innovation and choice, with the underlying justification of reducing class stratification.

Vouchers would eventually prove politically radioactive in most places. But marketization—in the form of charter schools, homeschooling, and a growing role for various for-profit technology and service providers—remains the goal of a significant faction of the education reform movement. Testing assumes an important role as the basis of decision making within that marketplace.

How They Do It in Other Countries

Though reformers like Tucker and the authors of A Nation at Risk invoked the specter of our foreign rivals, it was actually in the nineties that the United States began to diverge from most other countries in its use of tests. Three-fourths of students in developed countries attend a school that administers a standardized test of some sort, but no country in the world administers as many standardized tests as the United States or uses them for the same purposes—particularly, to grade and punish teachers and schools. “The US is extraordinary,” said Dylan Wiliam, the British curriculum expert. “In basically every country apart from the US, there’s a tradition of examining kids at the end of learning. In the US [tests are] used for teachers.”

As Wiliam notes, the most common type of high-stakes standardized test, found throughout Europe, Asia, Africa, and South America, is the high school exit/college entrance exam. In the UK it’s the General Certificate of Education Advanced Levels, also known as the A-levels. In Central and Eastern Europe it’s known as the Maturität, or Matura. Even Finland, which is held up in US ed-reform conversations as the most laid-back yet highest-performing school system in the world, gives a national matriculation exam (ylioppilastutkinto) for entrance into college. In countries like Germany and France tracking exams are also given at the beginning of high school to determine students’ paths either to university or technical schools and apprenticeships.

Wiliam said evidence shows limiting testing to a single occasion can spur deeper learning and higher achievement. Because students won’t be tested at the end of each year, teachers adapt their teaching styles for long-term memory and deep mastery of their subjects. Also, a system that administers just one or only a few big tests can afford to spend the big bucks to design and grade tests with lots of free-form long answers, including complex, multistep math problems and essays that require research and referral to documents. One equivalent in the United States would be the AP exams, which are written and graded by subject-matter experts and require students to demonstrate critical thinking as well as memorized knowledge through long written responses. They also charge a fee, currently $89 per student.

Countries with high-performing students, including Singapore, Finland, Denmark, and Australia, are marked by diverse national assessment systems that consider exams alongside real student work, like research projects, science experiments, and presentations.

Even when used sparingly, high-stakes, single-occasion tests around the world bring with them all the familiar ills of severe anxiety, cheating, and gaming. This is especially true in the many nations where seats in public university systems are rationed according to the results of these tests. In China, where high-stakes tests were more or less invented with the Imperial civil service exam around the year 600, students take tests for competitive admission into middle school, high school, and, finally, university. “The gaokao (college entrance exam) robs Chinese students of their curiosity, creativity, and childhood,” as Jiang Xueqin, deputy principal of one of Beijing’s most prestigious high schools, wrote in 2011.

South Korea is another paradigmatic case. It is among the top five highest-performing countries in the world on the PISA exam and boasts the most highly educated population in the world, a feat reached in a sprint from very low levels post–World War II. The Korean education system is dominated by the tests that determine competitive entrance into middle school, high school, and college. From the earliest elementary school years Korean students commonly spend forty or more hours a week studying in private “cram schools”—over and above regular school hours. Despite repeated crackdowns on the practice, parents spend more than $19 billion a year on private tutoring, more than half what the government spends on public education.

Yet for all their achievement, Korean students are also, according to surveys that are part of the PISA exam, the unhappiest in the world. Between one in eight and one in four students reportedly considered suicide in 2012, and the nation had the second highest rate of youth suicide in the Organisation for Economic Co-operation and Development (OECD). They commonly refer to the senior year of high school as the “year of hell.”

Goals 2000

Where US education most differs from other industrialized nations is in the high degree of local school control. The Elementary and Secondary Education Act (ESEA), first passed as part of the War on Poverty in 1965, is the federal government’s most significant intervention in K–12 education policy. It awarded what is now billions of dollars in aid to “Title I” schools with high-poverty populations. This sum, though large, has never been more than 3 to 7 percent of total public education spending. So education policymaking on the federal level has historically been a matter of Carrots, Sticks, and the Bully Pulpit, as the title of a 2012 edited volume by American Enterprise Institute’s (AEI) Frederick Hess and Andrew P. Kelly put it.

In the 1990s, in a highly polarized political environment, the full “Dear Hillary” agenda faltered. Both the right and the left fought apprenticeships and vocational “tracking” as un-American. Further, the fight over healthcare reform exhausted Bill Clinton’s domestic policy momentum.

The education reform package that did make it through, known as Goals 2000, consisted of school choice and tests. Clinton’s 1994 reauthorization of the ESEA introduced for the first time a set of universal, voluntary content and performance standards to be aligned with the first federally mandated tests: one each in the grade spans three through five, six through nine, and ten through twelve. An accountability system would identify schools whose students could not achieve the necessary test scores and target them for sanctions. During the period from 1994 to 2000 most states responded to the law by adopting content and performance standards, collecting longitudinal data, and administering more and more tests.

“Clinton’s connections in Arkansas sold to the business community a plan to increase school revenues through taxation in exchange for more testing,” explains Schaeffer of FairTest. Testing was like a form of 10-K reporting for schools: the regular collecting and publishing of data would tell the “shareholders” exactly what they were getting for their money. “More money in exchange for more test-based accountability: that’s exactly how it was done.”

Tests serve multiple purposes in this approach to reform. They are the means of enforcing standards and the basis of informed choice for families. Test scores provide the data for a data-driven education policy that sends resources to winning schools and shuts down losers. And some argue that tests are employed as a political weapon to undermine the power of teachers’ unions and push forward market-based reform.

Most states published their test results and began to attach a broad range of consequences, rattling behind them like tin cans on a bumper. For students it was diagnosis of learning disabilities, tracking, grade promotion, and high school graduation. Schools could be designated for takeover, reorganization, or closing; low-performing schools could receive extra help or more resources, and high-performing schools could receive rewards. Districts could be subject to similar interventions from the state level.

Bonanza!

Thanks to the new laws, the testing industry, which was still controlled by a few tradition-bound companies with nineteenth-century roots, experienced an unexpected bonanza that dwarfed its previous expansion in the 1920s. Test sales grew from $7 million in 1955 to $263 million in 1997, an increase of more than 3,000 percent in constant dollars. As late as October 2011 just three companies—Harcourt, CTB/McGraw-Hill, and Riverside Publishing—still wrote 96 percent of the statewide tests, while Pearson, one of the largest publishers in the world, was a leading scorer.

No Excuses

A central justification for this bipartisan shift toward testing was set forth in the 2000 book No Excuses: Lessons from 21 High-Performing, High-Poverty Schools, published by the conservative Heritage Foundation. The author was Samuel Casey Carter, a thirty-four-year-old fellow of the foundation who had attended Catholic schools, joined a Benedictine order, and briefly studied to be a priest. He told the stories of schools from small towns like Portland, Arkansas, to cities like Detroit, Michigan, that were succeeding with dedicated leadership, an emphasis on basic skills, parental outreach, and, above all, standards and tests.

Carter listed seven common traits of these schools. Number 4, he wrote, was “Rigorous and regular testing leads to continuous student achievement.” He explained, “High expectations without a means of measurement are hollow.” The racist origins of standardized testing, however, were not part of this framing. “Diagnosis is not discrimination,” Carter wrote. Carter became one of the strongest voices to inculcate this powerful and seductive idea: improve test scores and you will erase poverty.

The book, like A Nation at Risk before it, had some math problems. Carter chose a set of schools that were, by definition, outliers. Then he simply asserted, without evidence, that if every high-poverty school had talented enough leaders and worked hard enough, they too could become high performers.

You might compare the overwhelming influence of poverty on test scores to the influence of height on someone’s chance of playing in the NBA. The average height of an NBA player in 2013 was six feet seven inches tall. But there were eight guys who were under six feet tall. Does the existence of those eight players prove that if you are under six feet, you have “no excuse” for not making the NBA?

When I asked Carter how he came by the conviction that poverty could be “no excuse” for student performance, his answer was a mishmash of anecdotal evidence and conservative faith. “My mother was very proud to tell us that she was an Irish ditch-digger’s daughter,” he said. “My great-aunt Kathleen Craig only had a high school education but she was a total polymath. . . . It was never my personal experience that demographics had anything to do with destiny.”

I asked: What would help change the destiny of poor kids? “I was really convinced that the system to produce the greatest good for the greatest number of people was a free market,” Carter said. For public schools that means competition and executive-style leadership. “No Excuses is an application of basic well-understood management principles,” Carter said. “If you can’t measure it, you can’t manage it. And you’ve got to measure like you mean it and monitor what you want to improve.”

Today “no excuses” is ed-reform dogma, upheld by charter school chains such as KIPP and Uncommon Schools, big-city superintendents, right-wing think tanks like Heritage, and social advocacy groups such as the Education Trust and Michelle Rhee’s StudentsFirst. The common refrain is that erasing the achievement gap is the “civil rights movement of our time.” Test scores are the linchpin of this movement.

Wendy Kopp, the founder of Teach for America (TFA), which sends graduates of elite colleges into high-poverty schools with minimal training for two-year commitments, repeated this philosophy to me in a 2012 interview:

A few years ago there was a Gallup poll asking, Why do you believe we have low educational outcomes in low-income communities? The top three answers were student motivation, parental involvement, and home-life issues.

We asked our [Teach for America] corps members and alumni, and their top three answers were teacher quality, principal quality, and expectations at the school level.

Once you’ve taught successfully in low-income communities you realize, wait, this is not about kids not having the motivation to do well or parents not caring. This is about us—teachers and school leaders working hard. We can do more to provide kids with the opportunities they deserve.

This is a laudable moral attitude for TFA corps members to maintain. It’s an example of a psychologically healthy internal locus of control: the belief that one’s own efforts determine the outcome of a situation. But it’s possible to be too rigid in writing such an attitude into the law. Dr. Michael McGill is superintendent of Scarsdale Public Schools, one of the best-regarded districts in the nation. “I’ve been doing this work for about two hundred years,” he said, tongue in cheek. “I began as a school superintendent in 1973. And so for me the story goes back before NCLB to the early 1980s and A Nation at Risk.”

Within the “no excuses” movement he sees an unholy cross-pollination, “a misinterpretation of a premise or several premises that were liberal premises in the sixties and seventies. Back then, when I was in grad school, there were studies coming out pointing up the disparities in performance between rich and poor, white and black students,” he said. The research made these disparities more explicit and harder to ignore. This challenged a prevailing complacence among some educators.

“Up until that time the problem is seen as not a school problem. You know: ‘these kids come without breakfast, whadyagonna do?’ At the time liberal educators say we can’t take that point of view—we have to assume every child is capable of learning. If we as educators start out from the assumption that they can’t, then they won’t. We have to give our best effort and act as if.” This “acting as if” is exactly the attitude expressed by TFA corps members.

“The big mistake that NCLB makes,” said McGill, “is it takes that theoretical or philosophical stance toward student achievement as literally true. Every child is capable of learning at a high level. And if you don’t do that as a teacher, that’s your problem and we’re going to hold you accountable.” It’s a fine thing for teachers and school leaders to believe wholeheartedly that every child has an equal opportunity to learn. It’s another thing to penalize teachers and schools for children failing to perform. Imagine that a gym teacher could be fired unless he could get every student in his class to run a seven-minute mile.

The Texas Miracle

The most important no-excuses reform of the 1990s happened in Texas. In 1993 Sandy Kress, a Dallas lawyer and school board member—and, incidentally, a Democrat—oversaw the publication of a policy report on accountability. Like Samuel Casey Carter, Kress argued that because some schools do a better job than others educating low-income children, “schools and the people running them can and should be held responsible for student results.”

So that year the state legislature passed rules creating a new assessment system. Schools would have to break out the test results of minority, low-income, and disabled students and would be ranked based on test scores and graduation rates.

As Bill Ratliff, who was a state senator at the time and later lieutenant governor, sees it, the reforms were actually meant to give more autonomy to local schools, not less.

We had a public education code that was a thousand pages or more long. Any time a particular school wanted to do something with respect to teaching methods or curriculum or lesson plans, they had to look in the code to see whether or not they were authorized, and if they were not specifically authorized they took the position that they couldn’t make any changes at the local level. In the early nineties we tried to change that paradigm. I sat down and took the public education code and stripped out the vast majority of those mandates. . . . We said, all right, the state of Texas will stop talking about methodology and look at outcomes.

Outcomes measured, of course, by tests.

In Texas the new focus on outcomes produced immediate results. Dozens of high schools suddenly reported zero dropout rates. Reading and math scores went up while achievement gaps shrank. When George W. Bush ran for president, with no foreign policy experience and little executive experience, he leaned heavily on the “Texas Miracle” in education as a policy win from his time as governor. Houston’s superintendent of schools, Rod Paige, who oversaw a plunge in dropouts in that city, came to Washington with him as secretary of education. Sandy Kress became a Bush presidential policy adviser and, simultaneously, a lobbyist for Pearson, the largest company involved in creating tests. Kress would go on to be the key architect of No Child Left Behind.

The Blueprint

The legislative blueprint for No Child Left Behind was released just three days after George W. Bush’s inauguration and was initially well received on both sides of the Hill. Different versions passed the Senate and House in early 2001 and went to conference committee over the summer of 2001.

The law more than doubled the number of federally required standardized tests. It declared that all states must test at least 95 percent of children annually in grades three through eight in reading and mathematics and report test scores by race, ethnicity, low-income status, disability status, and limited English proficiency. On top of their own tests, states would be required to participate in NAEP, the Nation’s Report Card, previously a voluntary benchmark test.

Schools that failed to report “adequate yearly progress” toward proficiency for each and every subgroup of students would be publicly designated “in need of improvement.” If they continued to miss targets, they would be subject to various policy incentives year after year—both carrots, such as funding for outside tutoring services, and sticks, such as “exit vouchers” allowing students to leave public for private schools, restructuring and replacement of staff, or closing the school. By 2014 all students, in all subgroups, were supposed to achieve 100 percent proficiency on state tests. No exceptions. No excuses.

Common Sense Almost Prevails

Late-breaking information about NCLB’s procrustean design came close to killing the bill. State leaders raised concerns that under NCLB, sooner or later, every school in the country would be declared a failure. Indeed, by 2011, before a majority of states had received waivers from the consequences of the law, about half of the nation’s public schools were “failing” under NCLB.

Other observers raised questions about attaching such high stakes to tests, given the unreliability of the tests themselves. The law didn’t put money toward developing new, better tests; as ever, it would be up to the highly concentrated assessment industry to produce and market them.

In May 2001 the New York Times ran a series of front-page exposés about screw-ups in that industry. In the three years prior to NCLB, their investigation showed, test companies had the highest level of recorded errors of any period in history, affecting millions of students in twenty states, not surprising for an industry expanding so quickly. Worse, companies lied and stonewalled to cover up errors—not a confidence booster for the future.

Not So Miraculous

As for the Texas Miracle itself, it, like A Nation at Risk before it, would be discredited, again to deaf ears. In 1999 Julian Vasquez Heilig was a freshly minted master’s graduate from the University of Michigan interviewing for a position with the Houston Independent School District. “I sat down with the assistant superintendent, and she told me they’d closed the achievement gap and had 0 percent dropout rates. This was 1999. Rod Paige was superintendent. I started working in the research and accountability department and was able to see behind the scenes what was happening. I was really troubled.”

Eventually, for his PhD dissertation at the University of Texas, Heilig tracked forty-five thousand Houston students and found results very different from what his and other school districts were reporting. Heilig and other researchers found that 40 to 45 percent of African American and Latino students—those who failed any core course—were being held back in the ninth grade in Houston. They could take sophomore classes, but they were officially reclassified as freshmen, meaning the lowest performers would sit out the tenth grade accountability test. Larger numbers of students were classified as English-language learners and/or special ed, exempting them from the tests as well. When it came to eleventh grade, the tactics got more insidious. The state of Texas eventually changed its reporting rules in response to evidence that these students were being “pushed out,” advised to leave and take the GED, then counted as transfer students rather than dropouts. (Heilig continues to document similar practices in Texas schools today.)

The majority of high schools in the city, Heilig found, had falsified their dropout rates. The effect was not small. At the same time that the US Department of Education was listing Houston’s high school dropout rate above 30 percent, Houston itself was reporting to the state a rate below 2 percent. Heilig’s dissertation would not appear until 2006. But he was not the only researcher on the case.

Walt Haney, now retired, was a senior research associate in Boston College’s Center for the Study of Testing, Evaluation and Educational Policy. He started his career in education as a conscientious objector to the Vietnam War, teaching in impoverished Laos. He did his graduate work at Harvard. For over twenty years he’s served as an expert witness in a series of lawsuits by civil rights groups against states over the misuse of standardized tests.

Haney was among the first to publicly debunk the Texas Miracle, publishing his initial findings in 2000. He’s a dogged researcher, stacking his office with boxes full of paper as he combs through state records and correlates the results of tests with other demographic data, such as immigration, incarceration, GEDs, and college enrollment rates. “The claims were that achievement had improved, test score gaps had decreased, and the dropout rate had decreased,” he told me. “I concluded that the evidence on all three points was not sound, that in fact the dropout stats were bogus. When I analyzed the twenty-year enrollment data by grade, race, and number of grads, the actual dropout rate was four times worse than the reported rate.”

Education officials in Texas applied a wide range of techniques to “juke the stats.” They aggressively flunked ninth graders before they could get to tenth grade. They overtly or covertly encouraged kids to drop out of school, then claimed that they had repatriated to Mexico or had left to study for the GED. They used the wrong student ID numbers so students would get lost in the state system. Sometimes, after holding students back, they would assign them to catch-up “supersemesters,” where they earned multiple credits for completing easy work on a computer over a couple of days. (Lorenzo Garcia, superintendent of the El Paso Independent School District, went to prison for pulling these kinds of dirty tricks in the late 2000s, as discussed in Chapter 1.)

Heilig found that even as students’ scores on the state tests rose, the scores on a non-high-stakes college readiness test were flat. Far more students took the GED instead of graduating from high school. And the numbers of students taking the tenth-grade tests who were classified as special education and hence not counted in schools’ accountability ratings nearly doubled between 1994 and 1998. Since NCLB passed, Haney has documented at least three states where these kinds of behaviors have spread in reaction to high-stakes policies.

Soft Bigotries

September 11, 2001, brought a moment of renewed national unity. Congressional leaders, including Democrat Ted Kennedy and Republican John Boehner, rallied behind Bush’s education legislation as a policy that all Americans could feel good about. No Child Left Behind, conceived on shaky evidence and trailing unintended consequences, became law in early 2002.

Bush highlighted the law in a long section of his second-term acceptance speech at the 2004 Republican National Convention in New York City. While record numbers of protestors were being arrested in the nearby streets of Times Square, he intoned a favorite phrase from speechwriter Michael Gerson, the ace wordsmith behind phrases like “axis of evil” and “we don’t want the smoking gun to be a mushroom cloud.” This was one Bush had often trotted out on the campaign trail—“the soft bigotry of low expectations.”

BUSH: I believe every child can learn and every school must teach, so we passed the most important federal education reform in history. Because we acted, children are making sustained progress in reading and math, America’s schools are getting better, and nothing will hold us back.

To build a more hopeful America, we must help our children reach as far as their vision and character can take them.

Tonight, I remind every parent and every teacher, I say to every child: no matter what your circumstance, no matter where you live, your school will be the path to the promise of America.

[APPLAUSE]

We are transforming our schools by raising standards and focusing on results. We are insisting on accountability, empowering parents and teachers, and making sure that local people are in charge of their schools.

[APPLAUSE]

By testing every child, we are identifying those who need help, and we’re providing a record level of funding to get them that help.

BUSH: In northeast Georgia, Gainesville Elementary School is mostly Hispanic and 90 percent poor. And this year 90 percent of its students passed state tests in reading and math.

[APPLAUSE]

The principal—the principal expresses the philosophy of his school this way: “We don’t focus on what we can’t do at this school; we focus on what we can do. And we do whatever it takes to get kids across the finish line.”

See, this principal is challenging the soft bigotry of low expectations.

[APPLAUSE]

And that is the spirit of our education reform and the commitment of our country: No dejaremos a ningún niño atrás. We will leave no child behind.

The Wandering Cut Scores

In the 2000s test-based accountability became the hammer, and everything in education looked like a nail. “Data-driven decision making” became a watchword, with tests as the data.

Standardized tests had many useful properties for politicians. They produced authoritative-sounding numbers, as Walter Lippmann had observed in the 1920s. Yet unlike other statistics that politicians have to contend with, such as unemployment or air-quality figures, test scores are more easily manipulated to produce any desired outcome. That’s handy in case leaders want to trumpet a victory before an election or drive a hard bargain with teachers’ unions.

A testing company’s brief is to design a test that validly and reliably assesses a given set of standards, providing predictable results, and to score it correctly. It’s up to educators in each state to decide who passes or fails the test. Just as Alfred Binet set the norm for his intelligence tests at 75 percent, states have the prerogative to set the cut score—higher if they want to declare a crisis in education, close down schools, fire teachers, and shift resources to charter schools, and lower if they want to say that everything is going great.

To take just one example, in New York State in the 2000s the tested proficiency level of students rose steadily. Mayor Bloomberg ran for reelection in part on those improved test scores. But in 2009 the cutoff scores, which had drifted lower and lower over the years, were readjusted. The city’s pass rate in reading for grades three through eight then immediately fell from 68.8 to 42.4 percent, and in math it fell from 81.8 to 54 percent.
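To see why the choice of cut score matters so much, here is a minimal Python sketch. The scores are invented for illustration and are not New York data, and `pass_rate` is a hypothetical helper. The point is only that one unchanged set of results can yield very different pass rates depending on where the line is drawn.

```python
def pass_rate(scores, cut_score):
    """Return the share of test takers scoring at or above the cut score."""
    passed = sum(1 for s in scores if s >= cut_score)
    return passed / len(scores)


# Hypothetical scaled scores for ten students (made up for illustration).
scores = [610, 625, 640, 648, 655, 662, 670, 684, 701, 720]

# The same ten results, judged against three different cut scores.
for cut in (630, 650, 680):
    print(f"cut score {cut}: {pass_rate(scores, cut):.0%} pass")
```

With these made-up numbers, moving the cut from 630 to 680 drops the pass rate from 80 percent to 30 percent without a single student's answers changing.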

Obama, Race to the Top, and Common Core

When President Obama was elected in 2008 some hoped he might slow the relentless march of testing in favor of dealing with educational resources. Instead, he beat the drum even faster and louder.

In June of Obama’s first year a task force of progressive advocates and scholars released the “Broader, Bolder Approach to Education” manifesto, calling for school funding to be equalized and for supplementing the education system with high-quality preschool, afterschool, health care, and mental health services for children and other supports for parents. Elaine Weiss—whose Economic Policy Institute convened the original Broader, Bolder task force—told me that “a smart accountability system would look at all the inputs for schools. . . . In a country in which we have so much child poverty and inequity, we need to ensure that all children have the basics—that kids are not getting to school hungry, going to sleep in shelters, and going to school with so many barriers in the way.” At the same time, a rival group issued a manifesto calling for redoubling the test-driven agenda: tougher standards and pushing school leaders harder to raise achievement. Arne Duncan, whom Obama brought along from Chicago’s public school system to lead the Department of Education, was the only big-city school leader to sign both manifestoes.

The country was in the throes of a financial and economic crisis after the collapse of the mortgage market and banking industry. When Obama came into office, there was a brief political window to introduce new federal spending as a means of arresting economic decline. But rather than create and fund new social programs aimed at schools or children, in July 2009 Duncan announced Race to the Top. This was a $4 billion package of competitive education grants, paid for with the stimulus package passed earlier in the year. States could apply by introducing reform plans that met a set of core principles.

Duncan outlined the four principles of Race to the Top in a series of speeches. As specified in a White House press release, these were:

• adopting internationally benchmarked standards and assessments that prepare students for success in college and the workplace;

• recruiting, developing, rewarding, and retaining effective teachers and principals;

• building data systems that measure student success and inform teachers and principals how they can improve their practices; and

• turning around the nation’s lowest-performing schools.

The first principle, college-ready standards and assessments, referred to what became the Common Core State Standards initiative. This was officially a state, not a federal, initiative led by the National Governors Association and a nonprofit called Achieve.org. It was bigger than any previous attempt to establish US goals for what children should actually be learning in school. The watchword of the Common Core was “fewer, higher, deeper”—promoting deeper learning by focusing on what was truly important, which meant reading, writing, and math. Although the federal government, for political reasons, didn’t want to mandate the Common Core from above, the prospect of Race to the Top money incentivized forty-eight states initially to adopt the standards. And Obama’s Department of Education awarded a total of $350 million to two multistate coalitions, PARCC (representing 22 million students) and Smarter Balanced (19 million students), to develop a set of new tests aligned with the Common Core to be unveiled in the 2014–2015 school year. This was the first time federal money had gone to actually create educational assessments.

With two sets of policies, NCLB/Race to the Top and the Common Core, traveling at varying speeds on parallel tracks, an uncomfortable gap would appear and begin to widen: The Department of Education was rewarding states for attaching more and more stakes to existing tests even as it advertised that new, much better tests were coming to replace them.

“World-class standards are the foundation on which you will build your reforms,” Duncan said in a speech at the 2009 Governors Education Symposium. But the real foundation was not standards. It was tests. As had become clear from decades of state-level experience by then, tests are greedy. They are the yardstick of success by which everything else is measured, so they become the focus:

“Adopting internationally benchmarked standards and assessments”—new and harder tests.

“Building data systems that measure student success” using test scores.

“Turning around the nation’s lowest-performing schools,” performance defined by test scores.

The only place where the connection with tests was not immediately clear was in the second item on the list: recruiting, developing, rewarding, and retaining effective teachers and principals.

In his speech to the governors Duncan called “our method of evaluating teachers . . . basically broken.” The numbers bore that out. Across the country 99 percent of teachers were rated “effective,” by subjective and varying means. Duncan and others in the administration were convinced that the way to rigorously and objectively rate teachers was to incorporate “achievement data,” which is to say, test scores. “How can you possibly talk about teacher quality without factoring in student achievement?” Duncan asked.

This became the most prescriptive piece of Race to the Top. In at least three states, laws restricted evaluating teachers or awarding tenure based on test scores. Race to the Top did not dictate that states adopt the Common Core or establish a specific proportion of charter schools, but it did demand that any state wanting the money get rid of this prohibition.

A major reason Race to the Top–related reforms went wrong was that states were in a rush for the money. These were dire economic times. There were historic deficits at the state level—a cumulative shortfall of $113.2 billion for fiscal year 2009, and $142.6 billion in fiscal year 2010.

Compared to other measures of student achievement, tests were cheap. States didn’t have the luxury of time to carefully design, pilot, and vet value-added measurement systems or other new reforms. They created the plans and put them into place, and the federal government funded them in eighteen states and DC.

The Billionaire Boys’ Club

The political science term of art is that policy in the United States is created by an “iron triangle” of Congress, the bureaucracy, and interest groups. This is nothing new. Still, it is worth noting the very special relationship between the Department of Education under President Obama and the nation’s wealthiest people, particularly through the nonprofit, philanthropic Bill and Melinda Gates Foundation.

Bill Gates, the founder of Microsoft and the country’s richest man, began his philanthropic adventures with international health in the nineties. But his wife, Melinda French Gates, had always had education nearest to her heart. In 1999 her Gates Learning Foundation merged with the William H. Gates Foundation, which Bill Gates had set up for his father upon his retirement, to form the Bill and Melinda Gates Foundation. In 2006 Warren Buffett, the second-richest man in the country, pledged most of his fortune to the Gates Foundation as well; today its endowment is valued at $40 billion.

The Gates Foundation is the largest private foundation in the world and, proportionately, the largest private funder that the US education system has ever seen. Its influence has often been compared to that of the Carnegie and Rockefeller foundations in the nineteenth and early twentieth centuries. But whereas those Gilded Age millionaires set up their own institutions, from libraries to universities, Gates has chosen to site its programs within existing public schools and fund a wide variety of organizations to carry out its research and other initiatives. Its influence is felt everywhere, from politics to research to media to afterschool programs. (Disclosure, and for example: the Gates Foundation funded me to write an ebook in 2011, it funded the nonprofit education news service where I had a blog in 2013, and it funds education coverage, among other areas, at NPR, where I now work.) “Our role in philanthropy in general is as a catalyst,” Vicki Phillips, the director of the US education program, told me in an interview. “We never look at our accomplishments as being us. Our proverbial ‘we’ is all of our partners across the country—states, districts, and groups like the Council of Chief State School Officers.”

That said, from the start the personal interests and convictions of its founding family have driven the foundation’s agenda, the more so when Bill Gates stepped down from Microsoft in 2008 to chair it full time. The emphasis in their grant making on data and metrics has permeated the entire philanthropic world. “We are big fans of the potential of data and tracking,” as Phillips puts it. In the international programs they track metrics like how many wells dug and how many doses of malarial medication administered. In the domestic education program the metrics of choice have been test scores.

Arne Duncan recruited both his chief of staff, Margot Rogers, and one of his assistant deputy secretaries, Jim Shelton, from the Gates Foundation. The administration waived ethics rules, allowing Shelton, Rogers, and others in the Department to consult more freely with their former colleagues at the foundation. Gates funded the initial development of the Common Core State Standards through the nonprofit Achieve, supported charter schools, and in 2009 pledged $335 million to raise student achievement—measured through test scores—and promote teacher evaluation systems tied to student performance—also measured through test scores.

Even the competitive structure of Race to the Top itself, a grant process in which applications could win “points” based on a system of priorities set by the Department, is reminiscent of today’s philanthropic grant making. At the time civil rights groups led by the National Association for the Advancement of Colored People (NAACP) protested that this conditional method of funding left states full of poorer students at a disadvantage because of their inability to dedicate resources to the grant application process. They objected to what looked like a never-ending pileup of unfunded and underfunded mandates descending on poverty-stricken schools.

Diane Ravitch, a conservative education official under the first George Bush, has reinvented herself in the last decade as perhaps the single most prominent and controversial critic of no-excuses-style school reform. Her 2010 bestseller The Death and Life of the Great American School System has a chapter titled “The Billionaire Boys’ Club,” characterizing the influence of three foundations that are top donors to public schools: Gates, the Eli and Edythe Broad Foundation, and the Walton Family Foundation. The full roster of the “Billionaire Boys’ Club” promoting a similar ed-reform agenda would include former mayor of New York City Michael Bloomberg, the Koch brothers, the education task force of the American Legislative Exchange Council (ALEC), the members of a group of wealthy financiers known as the Democrats for Education Reform, and Laurene Powell Jobs, the widow of Apple founder Steve Jobs.

This influx of private wealthy interests has upset the apple cart of school reform, challenging old alliances. Although teachers’ unions remain a powerful force in both local and national politics, for example, many mainline Democrats and progressives, notably within the Obama administration, have “defected” to the ed-reform agenda that includes charters, standards, and metrics.

The Common Core, meanwhile, has split the right side of the political aisle, with the Republican National Committee opposing it on states’ rights grounds while ALEC, the powerful corporate-dominated lobbying group, supports it. Major past ALEC contributor Exxon Mobil has even run national TV commercials in favor of the Common Core.

Many of the strongest ed-tech advocates, too, are on the conservative side of the political spectrum, like Jeb Bush, former Florida governor, 2016 Republican presidential hopeful, and leader of the Foundation for Excellence in Education.

The Common Core, together with ed-tech, creates a massive business opportunity. Gates spoke about this in a keynote address to the 2013 South by Southwest Education conference, an offshoot of the music and technology industry conferences focused entirely on the growing convergence between education, entrepreneurship, and technology. “Because of the Common Core, developers no longer have to cater to dozens or even hundreds of varying standards,” he told the crowd. “Instead, they can focus on creating the best applications that align with the core.” This idea about the power of standards is borrowed from the web. Online, interoperability rules allow developers to create pages and applications that are viewable and usable by anyone with a browser or mobile operating system.

Clayton Christensen, the Harvard Business School professor and management guru who coined the term “disruptive innovation,” has been increasingly focusing on education for the past decade. In the same speech Gates borrowed Christensen’s lens to talk about the massive opportunity that technology presents for schools, not only to radically improve the quality of learning and make it more accessible, but also to grow businesses and make lots of money. Because of broadband, computers, and the Common Core, formerly separate and fragmented markets in textbooks, materials, and tests can now be served by integrated products. “When you add textbooks, supplements, and assessments together, you’re talking about a $9 billion market that’s wide open for innovation,” Gates said.

Rupert Murdoch, the CEO of News Corp, went even further in a 2010 press release announcing the purchase of what became his education technology brand Amplify, which produces both tablets and software for the K–12 market. “When it comes to K through 12 education,” he wrote, “we see a $500 billion sector in the U.S. alone that is waiting desperately to be transformed by big breakthroughs that extend the reach of great teaching.” Casting an acquisitive eye on every public dollar spent on education, he seemed to envision a future when handheld devices and multimedia content largely replace school as we know it.

Gates on Gates

When I interviewed Gates in March 2013 he defended the tests. “We can make massive strides doing measurement even with imperfect measurement systems,” he said. “No Child Left Behind let us know that we weren’t doing very well. It was fairly minor in terms of particular ways of solving it, but it did have a wonderful thing that it showed on an absolute and relative basis—for inner-city schools, low-income groups—what a poor job we’re doing.”

When I asked about the danger of teaching to the test or overemphasizing the basics, he pushed back: “Where they go and create a test in, say, art or music that can distort what you’re teaching, just because they want to have a teacher measure, we don’t think that’s a good thing. But in general the idea that you are tested on your ability to multiply and divide, there’s not a problem. It’s kind of like, should you teach phonics or just let the kid ‘creatively’ not ever learn phonics?” he said, warming to his subject. “The whole-language debate. That was wrong.”

When Gates was a high school student himself, he spent hour upon hour outside of class learning to use computers and even took a semester off to do an independent study. His own children attend an exclusive private school, Lakeside, that does not administer state standardized tests, and he described using online resources to help feed his son’s curiosity about any possible subject.

I asked whether there was room for that kind of exploratory, immersive learning in a testing-driven environment. He dismissed this concern: “You’re not going to hurt a highly curious, self-motivated student. Unfortunately that is a small percentage of the kids. Now, you might try to figure out how you get more people into that mode or keep them in that mode. But the fact that we insist people understand math, I don’t see the downside of that.”

I wanted to ask Gates more about the critics who contend that his education philanthropy is really in the service of Microsoft’s business interests and that wealthy donors like himself exercise outsized and undemocratic influence over education policy in ways that serve their own interests more than those of the nation’s children, but our time was up after twenty minutes.

Laela Meets Rick Roach

When Laela got her test scores back in the summer of 2013 the school sent her to a “summer reading camp,” which was billed as a second chance to raise her scores and let her move on to the fourth grade. But it had the opposite effect, says her mother. “She took a pretest and scored a 76 on the first day. At the end of it she took a post-test and scored a 67—her score actually went down. Then on the last day of summer camp she took the real test and scored a 48, 2 points shy of what she needed.”

Laela was “sad” about repeating the third grade, she tells me. “She was completely devastated, confused, embarrassed,” elaborates her mother. “Any time anybody would talk about it she would put her head down and start crying or go sit somewhere and sulk. The week before school started we went to meet the teacher, the same she had the previous year. Her teacher was fabulous—I loved her—but when all the other kids were going into their classrooms all excited to see where they would sit, she just walked in with her head down and tears rolling down her face.”

Laela’s mother started doing research, sitting at the computer in the living room amidst piles of folded laundry. “I sent letters all over the place, and Rick Roach was the one that was right there with us. He really felt for us and wanted to do something to help. He told us he’d been fighting all this for three years now, that these standardized tests do not really measure a kid’s ability to read.”

Roach, a lifelong teacher, counselor, and coach with brown hair and a graying mustache, served four terms on the school board of Orange County, Florida, which includes Orlando, starting in November 1998, three years before the start of NCLB. He has seven children and two grandchildren, and he’s spoken out nationally as a voice for sanity when it comes to poorly designed and inflexibly applied standardized tests. In the summer of 2014 he left the school board to prepare for a State Senate run in order to get closer to changing “the bad decisions made about testing.”

One day Roach was attending a speech by his school board chair. “He announced four or five goals that he wanted the schools to adopt and one knocked me off my chair—in two years he wanted 50% of our 10th graders reading at grade level,” he said. “I thought, 61% are failing now? That can’t be true.” Roach’s suspicions lay with the test design, not with the performance of Florida children. So he decided to take the test himself. He scored just 10 out of 60 on the math section of the tenth-grade state test, all of them guesses, he says, and 62 percent on the reading section. “It seems to me something is seriously wrong,” Roach wrote at the time.

I have a bachelor of science degree, two masters degrees, and 15 credit hours toward a doctorate. I help oversee an organization with 22,000 employees and a $3 billion operations and capital budget, and am able to make sense of complex data related to those responsibilities. . . .

It might be argued that I’ve been out of school too long, that if I’d actually been in the 10th grade prior to taking the test, the material would have been fresh. But doesn’t that miss the point? A test that can determine a student’s future life chances should surely relate in some practical way to the requirements of life. I can’t see how that could possibly be true of the test I took.

Roach started digging into the extensive literature of test criticism—Jim Popham, Diane Ravitch, Alfie Kohn—and the nuts and bolts of psychometrics. “I kept getting madder and madder,” he said. He found the commonly used practice of predictive score rates—the test writers’ best guess as to how many questions a child will get right, which helps calibrate the difficulty of the test—to be tantamount to rigging the tests in advance.

“There are three types of questions on the FCAT: easy, average, and challenging,” he said. “The easy ones are designed for 70% to get them right, the average questions are designed for 40–70% to get it right, and the challenging questions are meant for no more than 40%,” he explained. “Now on the test I took, 85% of the questions were average or challenging. The failure rate was in it before they put the first pencil mark in a bubble.” I didn’t independently confirm those numbers. It’s true that on a test designed along those lines, a 50 percent pass rate should surprise no one—it’s more or less exactly what the test makers predicted. But that pass rate becomes a problem when you have a 100 percent proficiency target.
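Roach's arithmetic can be sketched roughly as follows. The three difficulty bands come from his description; the split between average and challenging questions, and the single "typical" success rate assigned to each band, are assumptions made here for illustration, since he gave only the combined 85 percent figure. The result is an expected share of questions answered correctly, which is not the same thing as a pass rate (that still depends on the cut score).

```python
# Share of questions in each band. Roach said 85% were average or
# challenging; the 50/35 split between those two bands is an assumption.
question_mix = {"easy": 0.15, "average": 0.50, "challenging": 0.35}

# Assumed typical share of students answering a question in each band
# correctly, chosen inside the ranges Roach described.
typical_success = {"easy": 0.85, "average": 0.55, "challenging": 0.20}


def expected_share_correct(mix, success):
    """Weighted average of per-band success rates."""
    return sum(mix[band] * success[band] for band in mix)


print(f"{expected_share_correct(question_mix, typical_success):.0%}")
```

Under these assumed numbers the typical student answers a little under half the questions correctly, which is in the neighborhood the passage describes.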

Laela Meets the Wizard

Roach decided that the best way to undermine the tests in his state long term while also helping individuals like Laela in the short term was to pull back the curtain. For that he needed a wizard. “I found a former test writer in Kissimmee, bought him breakfast, and got him teaching me how to beat the test,” he said.

Bob Alexander has an egg-bald pate and a pair of rimless glasses suiting his nickname, the Wizard. In 1991 he was one of the last people personally hired by Stanley Kaplan, the original founder of the Kaplan SAT prep company. But he grew disillusioned with the marketing tactics of “Big Prep.” By the mid-nineties he had moved to Florida and was working as a successful private SAT and ACT tutor, giving workshops for teachers as well, and running a nonprofit to offer his services on a sliding scale to students with financial need. “I have a reputation as the guy you call if you need someone to meet NCAA requirements,” he said, referring to promising student-athletes who need to get their scores up in order to qualify for college sports scholarships. “Nine of my guys have made it to the NBA—of course, I can’t tell you their names.”

In 1998, when Florida began phasing in its state standardized test, the FCAT, he met with a local high school principal who asked him to create an FCAT test-prep course. “I said, I’m not interested—the SAT and ACT are keeping me plenty busy. He said, ‘Well, we’re not going to care about SATs and ACTs anymore. We’ll only care about FCATs. So I won’t hire you next year to help with our SAT, but I certainly would hire you for the FCAT and pay you a lot more money.’”

“At the time I was living in Celebration, Florida,” he said, referring to a high-end planned community built by the Walt Disney Company. “He told me, if you get into the FCAT business, you will own the biggest house in Celebration. Well, I left Celebration a few years later and built a 3,300-square-foot lake home. And did I have the FCAT to thank for it? Absolutely.”

Even though—or especially because—his livelihood springs from standardized tests, Alexander delights in debunking their mysteries for people of all ages. “I tell kids I’m going to teach you to think like a test writer. The more you think like your opponent, the easier it will be to win. And that’s the very skill that I taught Laela.”

Ali took Laela for just two ninety-minute tutoring sessions with Alexander. Before the first session she gave her daughter a set of eighteen practice questions from Florida’s Department of Education website. The girl got just two of the eighteen right. Immediately after the session Ali gave her the questions again and this time she got just two of the eighteen wrong. “It was amazing—night and day. I could not believe it,” she said. “She did not learn in an hour and a half how to read better. He taught her how to take a test, basically.”

With Roach’s backing, Ali’s tireless lobbying, and Alexander’s coaching, Laela got a second chance about a month into the new school year to take the third-grade exam for promotion. For good measure, they wanted her to take the fourth-grade benchmark as well. She scored 70 on the third-grade test, which had a cutoff of 33. She also scored 63 on the fourth-grade test, which had a cutoff of 35. “The day they found out the scores, the principal called me,” says Ali. “She was very monotone. You would think she would be celebrating. They told me she’d be starting Monday, September 30, into the fourth grade—‘unless you want to keep her in the third grade, because we weren’t expecting this and we didn’t order any textbooks for her.’ I’m thinking in my head, did this lady really say that?! I said, no, she will be going on to the fourth grade.”

Alexander, who’s worked as a test writer as well as a coach, sees tests as both overprescribed and misused. “What it boils down to is, we realize that what’s happened here in Florida and in many other states where they have a home-brewed test, it was never designed for individual assessment, yet the politicians have turned it into an individual assessment and attached high stakes to it.” The FCAT, like many state tests, has both norm-referenced and criterion-referenced sections. “This test was designed to measure a population of students, not whether Laela should go to the fourth grade.”

Walt Haney, the professional expert witness, concurs that norm-referenced test design was incorrectly used for the Florida FCAT, as it is in other states. “I’ve basically concluded that public education in Florida is about the worst in the country” in part because of the misuse of tests, he said, echoing Roach. Florida is transitioning to Common Core–aligned tests and away from the FCAT, but they’ll still be part of a high-stakes system. “Professional standards regarding educational psychological tests clearly state that test results should not be used in isolation to make decisions because they’re so highly fallible,” Haney said. “To use them to make decisions mechanically about individual schools, teachers, or students is a clear violation of these professional standards. It’s not just wrong, it’s idiotic.” Tests, like any human creation, are imperfect. It’s when they are used as a sole decisive point of evidence that they become truly harmful.

Things Fall Apart

A dozen years of carelessly applied high-stakes testing has done little to improve students’ thinking or learning as far as tests themselves can tell: small gains on the NAEP, the Nation’s Report Card, and no relative gains on PISA, the international achievement test given every three years in sixty-five countries. Between 2002 and 2006, according to the independent nonprofit organization Center on Education Policy, state test scores themselves went up in most states, while achievement gaps narrowed slightly. A second independent study found evidence of improvement in math but not in reading.

By 2010–2011 half of all public schools in the nation stood to miss their 100 percent proficiency targets under No Child Left Behind, which were due to kick in during the 2013–2014 school year. The Department of Education wanted to forestall the political firestorm that would come when so many schools were declared “failing.”

Up through the midterm elections in 2010 the Obama administration held out hope of making the reauthorization of the Elementary and Secondary Education Act a bipartisan cause. The ascent of the Tea Party, whose members, like Ronald Reagan before them, opposed the very idea of a Department of Education, drove the final nail into the coffin of that plan. Instead, the Education Department took advantage of its regulatory power to grant states waivers, that is, flexibility on NCLB’s rules.

Forty-two states now have waivers specifying their own individual accountability standards. Most have abandoned the inflexible 100 percent proficiency rule in favor of incremental targets. For example, Ohio has pledged to raise the proficiency level in reading for the general population of students from 81.9 percent in 2010–2011 to 86.4 percent in 2014–2015. Though they no longer have ironclad proficiency targets, states are still giving NCLB tests; almost two thousand schools a year are still being closed, and teachers are evaluated based on the scores.

Because of the waivers, however, “NCLB as a common state accountability metric has fallen apart,” said Diane Stark Rentner, author of the Center on Education Policy report on NCLB scores. “The data is old now. I can’t tell you whether scores are up or down.” As long as ESEA isn’t reauthorized and with the waivers in place, “accountability” loses any coherence it might once have had.

Lonely Defenders

I had trouble finding anyone who would defend No Child Left Behind as a piece of legislation in 2014.

Frederick Hess, a longtime scholar and advocate of school reform with the conservative American Enterprise Institute, told me, “Back before NCLB, it’s important to remember the vast majority of states couldn’t even tell you how well kids were doing at reading and math. Each and every state could point to its assessment results and tell you it’s performing above the ‘national mean.’ Prior to NCLB—as mixed as I am on NCLB—school systems found it very easy to excuse poor performance.” Of course, they found it nearly as easy to do the same after NCLB, with the help of a little mathematical manipulation.

I asked Hess to refer me to the staunchest defenders NCLB has left. “That would be the Education Trust,” he said, referring to a conservative-funded nonprofit advocacy group in DC. But they were cagey about being interviewed, agreeing at first to speak only on background.

When I got Daria Hall, the director of K–12 policy development for the Education Trust, on the phone, she said that of course poverty is a big part of the problem with education. “We know the achievement gap starts before students enter school. . . . We let far too many kids grow up facing the worst that poverty has to offer.” The problem, she said, is that “we give them less inside school too: weak, watered-down, boring curricula. Less money, less access to the strongest teachers. The reality was that some students were getting taught at a high level, but on the other side of town they were filling out worksheets, watching movies, drawing pictures. So we have long advocated for rigorous standards for what students should know and be able to do. A body of work reaching back for decades indicates that all students, if given the appropriate support, can achieve at high levels.”

This is the best fundamental argument I have heard for giving the same test to all children in all schools: to equalize expectations for all students regardless of background.

I heard it from my best friend too, in a conversation that galvanized my decision to write this book. Linda is earthy and hilarious, the youngest daughter of a large, prosperous family of midwesterners. She rides her bike all over Brooklyn, plays the violin, and used to sing in a church choir. Her entire career has been spent in the New York City public schools under No Child Left Behind. She graduated from Yale in 2002 and went directly into the New York City Teaching Fellows program, which trained her and placed her in a classroom within a few months. Then she was fast-tracked to principal, starting out at a high-poverty public middle school in the Bronx at the age of twenty-eight.

She keeps a critical perspective on the system, even as she works tirelessly within it for the benefit of her students. So I was surprised to hear her defend the state tests. “Before we had them, entire populations of students would be written off” by their teachers, the very people who were supposed to be helping them, she said. “Before the tests, that was the norm,” she added, echoing Michael McGill of Scarsdale Public Schools. “It was like, oh, the kids are doing all right, they’re coming to school, they have tough lives, I make sure they get a good breakfast. At least the teachers are now paying attention.” Test scores, she said, also serve as ammunition to convince reluctant or disconnected parents that their kids need extra help. They trigger consequences: parent conferences, summer school, tutoring, increased scrutiny for teachers whose students fail year after year.

I pointed out that over the dozen years that tests have been in place, achievement gaps haven’t shrunk; that she has found it extremely difficult to remove demonstrably ineffective teachers; that the tests don’t show the kind of progress her kids are actually making, such as when a sixth grader moves from a third- to a fourth-grade reading level; that tests take precious time and resources and impose stress and anxiety on her kids, her staff, and herself; that the requirements change year after year; that their diagnoses aren’t always accurate; that they put her entire school at risk of closure; that they address no part of the pathologies that stand in the way of her students’ success, from homelessness to horrifying trauma and abuse.

“I think it’s a process, and we’re at the first stages,” she replied.

Okay, so what are the next stages?

Resource Accountability

“All students, if given the appropriate support, can achieve at high levels,” said Daria Hall.

Let’s break that down for a second.

What does “high levels” mean? If it represents some kind of personal best, it’s subjective enough to be meaningless. If it means some arbitrary, middle-of-the-road proficiency on a standardized test, it’s patently false for a second-percentile special needs child like Jackson Ellis, and almost as useless for a high-achieving original thinker like Laela Gray or, for that matter, Bill Gates.

And what about “given the appropriate support”?

That’s never been tried. Imagine for a second what kind of school system we’d have if the “broader, bolder” agenda had been put in place. What if the mandate was to provide the most disadvantaged kids with the best-funded schools, the most advanced curricula, the highest-qualified teachers, and all of the wraparound services they need?

Resource accountability is the name given to the idea of holding states responsible for equalizing educational inputs rather than or in addition to outcomes. The idea has never gained traction at a federal level. At one education conference a weary think tank policy expert at a Gates Foundation–sponsored cocktail party warned me against even bringing it up. It is too hard to enforce or too politically inconvenient or just too expensive.

The bulk of public school funding comes from local property taxes, meaning that the schools of the rich continue to be better funded than the schools of the poor. According to the 2014 edition of a “National Report Card” on school-funding fairness, only fourteen states in 2011 even attempted progressive funding, giving more money to high-poverty districts than to more affluent districts to address the greater needs of high-poverty districts’ students. In every other state either the funding is flat across districts or rich districts get more.

The Backlash

No Child Left Behind had critics from the beginning: teachers’ unions, scholars, activist groups like FairTest, and parents. But 2011 marked the beginning of a full-blown backlash. Waivers weakened NCLB. The Common Core stirred up controversy. And discontent with test-dominated education grew.

The rumbling started where NCLB was born—in Texas.

In 2011 Texas House Bill 3 passed, levying the heaviest, most draconian testing requirements in the nation. High school students would have to take a total of fifteen state tests in four years. Not only would these tests be a requirement for high school graduation, but they would also make up 15 percent of the final grade in each tested subject. Plus, students would be required to hit a certain score on the English III and Algebra II exams in order to be eligible for admission to a four-year state university. “I mean, these were ridiculously high stakes,” said Theresa Treviño, a mother of two high school students, who founded Texans Advocating for Meaningful Student Assessment (TAMSA) with other parents in response to the new rules. “It was a drastic change.”

In a series of public remarks in January 2012, Robert Scott, the state’s education commissioner, condemned the new law as a “perversion of its original intent,” stating,

The assessment and accountability regime has become not only a cottage industry but a military-industrial complex. And the reason that you’re seeing this move toward the “Common Core” is there’s a big business sentiment out there that if you’re going to spend $600–$700 billion a year in public education, why shouldn’t there be one big [defense-style] Boeing, or Lockheed-Grumman contract where one company can get it all and provide all these services to schools across the country. . . .

What we’ve done in the past decade, is we’ve doubled down on the test every couple of years, and used it for more and more things, to make it the end-all, be-all. . . . You’ve reached a point now of having this one thing that the entire system is dependent upon. It is the heart of the vampire, so to speak.

All you have to do is kill that, and you’ve killed a whole lot of things.

In the spring of 2012 the Texas School Board introduced an antitesting resolution inspired by Scott. The first line of the resolution reads, “The overreliance on standardized, high stakes testing as the only assessment of learning that really matters in the state and federal accountability systems is strangling our public schools.” By the end of the year districts representing 91 percent of the state’s school children had adopted the resolution. By 2013, in response to a public outcry organized by TAMSA and other groups, with some hearings lasting until 3 a.m., Governor Rick Perry signed two bills cutting the number of “end of course tests” for high school students, from fifteen down to five, and exempting students who scored high enough on tests in third or fifth grade from future state tests.

Resistance to testing ignited elsewhere around the country. In January 2013 teachers at Garfield High School in Seattle voted unanimously not to administer the state Measures of Academic Progress (MAP) test for high school graduation. The boycott spread to nine high schools and ended in victory when the district made the test optional. The wave of opt-out protests spread across the country in the 2013 and 2014 testing seasons, making news in Illinois, Colorado, New York, and other states.

Accountability Flux

The twelve-year anniversary of No Child Left Behind, which was supposed to be the deadline for every single public school in the country to achieve 100 percent proficiency, passed in January 2014. The Department of Education didn’t even bother issuing a press release.

By the spring of 2014 over a dozen states had already vowed to scale back on tests. In a single week in February Missouri’s state board of education cut back on tests, Virginia’s Senate voted to delay test-based rating of schools, Alaska’s state board of education voted to rescind the high school graduation exam, and New York State announced it would delay implementation of the Common Core and take steps to limit testing. And both California and Washington State walked into showdowns with the federal government over testing in the spring of 2014.

Washington State risked about $40 million in federal funds when it refused to approve the linkage of teacher evaluations to test scores. California decided to suspend high-stakes testing during the transition to the Common Core. The plan was that students would take a sampling of the Smarter Balanced consortium tests in spring 2014 for the purpose of field-testing the tests; the results would not be published and would not be used to evaluate teachers, and California would avoid double-testing students, which would have meant giving both the old and new tests at the same time. Arne Duncan threatened the state with the loss of its NCLB waiver and $3.5 billion in cash.

Michael Kirst, president of California’s State Board of Education, says everyone knows that the old tests aren’t good enough, and the state needs time to give the new tests a chance. “I think every state that’s been doing value-added measurements off of cheap closed-end multiple-choice tests is going to have trouble maintaining the validity of their testing systems for use in teacher evaluation,” Kirst said. “Teachers have never felt those assessments were true measures of what they were trying to teach.”

In February 2014 the president of the National Education Association (NEA), the nation’s largest teachers’ union, issued an open letter calling for a “course correction” on Common Core. The NEA had previously backed the standards, but President Dennis Van Roekel called the implementation “botched.” “Old tests are being given, but new and different standards are being taught,” he wrote. “This is not ‘accountability’—it’s malpractice.” By June 2014 three states, Indiana, South Carolina, and Oklahoma, had dropped the Common Core. The number of states that planned to use the consortium-produced tests from PARCC and Smarter Balanced had dropped even further, from forty-five to twenty-seven, with the rest either purchasing Common Core tests from vendors like Pearson or still undecided.

Even Vicki Phillips at the Gates Foundation, which had played such a strong role in driving the shift toward tests and the rapid adoption of the Common Core, said in an open statement in June 2014, “Assessment results should not be taken into account in high-stakes decisions on teacher evaluation or student promotion for the next two years,” to ease the transition to the Common Core. An interesting opinion, especially as it appeared to call for changes to state law and/or federal waivers.

Then, in the spring of 2014, the College Board announced a major overhaul to perhaps the most feared and iconic standardized test of all: the SAT. The news made the cover of the New York Times Magazine. The changes to the test mirrored many of the concerns raised in this book. For years critics have pointed out that the SAT is a weak predictor of college performance and that scores are highly correlated with family income. David Coleman, the head of the College Board, was also a major architect of the Common Core State Standards. He announced his intention to better align the SAT with the Core and to make the test harder to game through costly tutoring. But rather than placate critics, the SAT overhaul kicked off more attacks on its relevance and fairness as well as new calls to eliminate the test altogether.

The Backlash Heard ’Round the World

The current wave of resistance to high-stakes testing is global. In China there is a growing interest in alternative forms of education, from Western-style liberal arts colleges to a small movement of Montessori, Waldorf, and other progressive education styles. Singapore, widely praised and emulated here in the United States for its math teaching especially, is debating its reliance on standardized testing. And a few provinces in Canada are phasing out traditional standardized tests or considering doing so.

Over the last decade Mexico has come closer than any other country to adopting the US model, particularly in using test scores to judge teachers. There, test-based accountability has been deployed as a political weapon against the national teachers union, one of the largest labor unions in the world and derided as a bastion of patronage and corruption.

In the mid-2000s the government introduced national tests, the EXCALE and ENLACE, with cash bonuses to teachers for good results. Half of teacher evaluations were based on test scores. In September 2013 the right-wing government of Enrique Peña Nieto voted to tie the hiring and firing of teachers to mandatory standardized testing for both teachers and students as a means to take personnel decisions out of the hands of the union. These reforms have repeatedly drawn tens of thousands of teachers into the streets of Mexico City in protest. The ENLACE test was suspended in February 2014 amid widespread allegations of coaching, cheating, and erasing parties, where teachers get together to change answers.

Israel is another site of testing controversy. It introduced a national standardized test known as the Meitzav in 2002–2003. It’s given every two years in elementary and middle schools, rotating among four core subjects: math, English, science, and Hebrew language. But when the country started publishing the results of these tests, the education minister, Shai Piron, told the press that it bred a familiar list of unintended consequences. They found schools shifting teaching hours to spend time on test prep and spending scarce money on prep materials. Students and teachers felt “undue pressure.” The integrity of the tests was questioned. Schools that served the poor and Israel’s large immigrant populations felt the strongest urge to close the achievement gap on tests. The published results made the schools feel like sports teams pitted against one another. Piron said, “An atmosphere of bad culture and league tables arose, which harmed schools, especially those which integrate students from lower socio-economic status, and do God’s work to close gaps. . . . The message is we’ve gone crazy, confused. This thing turned into something that drives us from learning to measuring.” Piron made the controversial decision to cancel the country’s tests for the 2013–2014 school year while a new means for reporting student achievement and school quality could be developed.

Teetering on the Edge

Something’s in the wind. The edifice of high-stakes standardized testing may be more fragile than it appears. These policies are only a little over a decade old, and they’re already in a state of massive confusion. The weakening of NCLB and rising objections to the Common Core offer a real opening for change.

There are three possible responses now to the testing madness.

We can not take the tests. Just opt out. Opting out can be a surprisingly effective protest, and sometimes it’s the best thing for kids developmentally. On a national basis cutting back on standardized testing, lowering the stakes, and getting rid of value-added teacher measures can all be positive changes.

We can build better tests that measure more important things, more accurately, and better accountability systems to go with them. Better tests merge learning with assessment. They move beyond the known pitfalls of standardization. They aim at measuring thinking rather than memorized information, motivate and reward better teaching and deeper learning, and address the full range of factors that are key to success. They form the foundation for true, two-way accountability, providing the information needed so students, families, educators, and lawmakers can work together to improve schools. I was surprised to learn just how close this future of better tests and real accountability might be.

In the meantime, and as individuals, we can try to pursue positive strategies as parents and educators that help our children beat the tests. The best of these strategies, it turns out, can help kids do better in the rest of school and even in life, thus rendering the flawed tests we have less than a total waste of time.

The rest of this book will take on these three options one by one.