1

Ten Arguments Against Testing

On a snowy morning in January, on the repurposed second floor of a cavernous parochial school, the 135 sixth graders of Leaf, a brand-new charter middle school in Brooklyn, are getting ready for the first of six full school days of testing. (I’ve chosen a pseudonym for this school and its students, administrators, and teachers to ensure that they can speak freely.) The students aren’t taking the real New York State standardized tests, which come in April, nor are they taking the extra benchmark exams that are up to each school to choose for their own diagnostic purposes. Leaf does those in reading, social studies, science, and math in August, December, and June—seven testing weeks a year out of thirty-six, a pretty typical schedule.

Instead, today is the English Language Arts (ELA) Mock Exam: three days spent taking a practice reading test, to be followed by the Math Mock Exam next week. Regular students will take the test for ninety minutes each morning; the one-fifth of students who have a learning disability, are English language learners, or have an ADD (Attention Deficit Disorder) diagnosis get double time, up to three hours to complete the test. In the afternoon the kids are “burnt,” said Ms. Berry, the principal, both from the questions and from sitting in total and strictly enforced silence for three hours. So they’ll spend the rest of each day doing meditation, playing games in the gym, drawing pictures, and watching movies.

It’s a big investment of time and resources, and the results count for nothing. But Ms. Berry knows from experience that a top-to-bottom dress rehearsal is necessary in order to calm nerves and deal with any logistical issues that might come up before the big days. As we walk the halls the troubleshooting begins: one kid finished the first section in pen, which is verboten. A teacher has a cup of coffee; if a monitor dropped by from the city, that would be considered an infraction of the strict rules for proctors. In another classroom an inexperienced teacher needs a pep talk; she’s unable to control her students’ multiple requests to move their desks around the room “to concentrate.”

The word of the day is anxiety. Parents are anxious about how their students’ scores—one for “below standard,” two for “basic,” three for “proficient,” and four for “exceeds proficient”—will look on their applications for competitive public high schools. They call to complain that Leaf is doing either too much test prep or too little. Some New York City public schools send home workbooks for months on end; others hold Saturday tutoring sessions every week of the year. Despite the six days of mock testing, Leaf is actually on the lighter end of the spectrum.

Teachers are anxious because 40 percent of their evaluations come from student scores on a combination of state and other standardized assessments. Ms. Berry is anxious because under New York City’s charter school rules, if they don’t demonstrate enough test score growth within each subgroup of minorities, English language learners, and learning disabled students, they’ll be closed in five years.

Students pick up on their parents’ and teachers’ anxiety. Some stay home with stomachaches. Others stare into space or misbehave. “My mom worries about me a lot. So does my grandmother,” says Lucas, a liquid-eyed sixth grader carrying a fantasy novel with a dragon on the front. When I ask what he thinks of the test, he says, “It’s like a life-and-death situation. It decides whether you’ll get to another grade. If not, people will be disappointed with you.”

Leaf isn’t the only school in the country that’s consumed by anxiety over standardized testing. It’s close to the norm. And as students, families, and school leaders scramble to comply with these requirements, sometimes they lose sight of the big picture: there’s lots of evidence that these tests are doing harm, and very little in their favor.

Here is the case against high-stakes state standardized tests in math and reading as currently administered annually at Leaf and nearly every other public school in the nation:

1. We’re testing the wrong things.

2. Tests waste time and money.

3. They are making students hate school and turning parents into preppers.

4. They are making teachers hate teaching.

5. They penalize diversity.

6. They cause teaching to the test.

7. The high stakes tempt cheating.

8. They are gamed by states until they become meaningless.

9. They are full of errors.

10. The next generation of tests will make things even worse.

1. We’re testing the wrong things.

States are required to test just two subjects: math and language. Reading is emphasized over writing because the tests are mainly multiple choice.

Hugh Burkhardt is a British mathematician and international expert in both curricular design and assessment of mathematics. He has been a consultant on the development of the new Common Core tests. In his spare time he dabbles in elementary particle physics. When I ask him about the problems with tests as they are currently used in the United States, Burkhardt puts it this way, in a plummy accent: “Measurement error consists of two parts: systematic and statistical error. The systematic error in education is not measuring what you want to measure. . . . Psychometricians [test makers], who usually focus only on statistical error, grossly overestimate the precision of tests. . . . They just assess some bits that are easy to assess accurately.”

In other words, to use a metaphor: if your telescope is out of focus, your problem is a statistical error. In Burkhardt’s opinion the lenses we’re using are sharp enough, but we are focusing on just a few stars at the expense of the universe of knowledge.

Are we measuring what we really want to measure in education? A flood of recent research has supported the idea that creative problem solving, oral and written communication skills, and critical thinking, plus social and emotional factors, including grit, motivation, and the ability to collaborate, are just as important in determining success as traditional academics. All of these are largely outside the scope of most standardized tests, including the new Common Core–aligned tests.

Scores on state tests do not correlate with students’ ability to think. In December 2013 MIT neuroscientists working with education researchers at Harvard and Brown Universities released a study of nearly 1,400 eighth graders in the Boston public school system. The researchers administered tests of the students’ fluid intelligence, or their ability to apply reasoning in novel situations, comprising skills like working memory capacity, speed of information processing, and the ability to solve abstract problems. By contrast, standardized tests mostly test crystallized intelligence, or the application of memorized routines to familiar problems. The researchers found that even the schools that did a good job raising students’ math scores on standardized tests showed almost no influence over the same students’ fluid intelligence.

Daniel Koretz, the Henry Lee Shattuck Professor of Education at the Harvard Graduate School of Education and an expert in educational testing, writes in Measuring Up: What Educational Testing Really Tells Us:

These tests can measure only a subset of the goals of education. Some goals, such as the motivation to learn, the inclination to apply school learning to real situations, the ability to work in groups, and some kinds of complex problem solving, are not very amenable to large-scale standardized testing. Others can be tested, but are not considered a high enough priority to invest the time and resources required . . . even in assessing the goals that we decide to measure and that can be measured well, tests are generally very small samples of behavior that we use to make estimates of students’ mastery of very large domains of knowledge and skill.

So some important things we don’t test because the tests aren’t up to it. Some we could test but don’t bother. And for the things we do test, the tests are actually too small a sample of behavior to make wide-ranging judgments.

2. Tests waste time and money.

Not only do standardized tests address only a fraction of what students need to learn, but we’re also spending ages doing it. At schools like Leaf, time given to standardized tests is more than the weeks spent taking the tests; it also includes practice tests, field tests, prep days, Saturday school, and workbooks for homework. It includes afternoon periods full of movies for kids “burnt” from the tests. And standardized tests are not just state-mandated accountability tests. There are independent national assessments like the Iowa Tests of Basic Skills and the “Nation’s Report Card,” international tests like the Programme for International Student Assessment (PISA), diagnostic tests such as Dynamic Indicators of Basic Early Literacy Skills (DIBELS), supplementary subject tests in social studies and science, and local benchmark tests so districts can predict how their students will do on the state tests. In the later grades, of course, come the SAT and ACT and their accompanying practice and prequel tests, now starting as soon as seventh grade. Reports from across the country suggest that students spend about three days taking state tests in each of grades three through ten but up to 25 percent of the school year engaged in testing and test prep.

By the time a student graduates high school that could translate to 585 school days—three and a quarter extra school years that they could have spent learning instead of being tested on what they already knew or, worse, didn’t know.
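The arithmetic behind that figure can be checked in a few lines, assuming a standard 180-day school year and thirteen years from kindergarten through twelfth grade (both are assumptions; neither number is stated in the reports themselves):

```python
# Back-of-the-envelope check on the "585 school days" claim.
# Assumptions (not from the source): a 180-day school year, 13 years K-12.
SCHOOL_YEAR_DAYS = 180
YEARS_IN_SCHOOL = 13          # kindergarten through twelfth grade
TESTING_SHARE = 0.25          # "up to 25 percent" of the year on testing and prep

days_on_testing = SCHOOL_YEAR_DAYS * YEARS_IN_SCHOOL * TESTING_SHARE
extra_school_years = days_on_testing / SCHOOL_YEAR_DAYS

print(days_on_testing)     # 585.0 school days
print(extra_school_years)  # 3.25 -> "three and a quarter extra school years"
```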

At the outer limits, in the Pittsburgh Public Schools in the 2013–2014 school year, students in kindergarten through twelfth grade took a total of more than 270 tests required by the state or district. The most tested grade was fourth, with 33 required tests, just shy of one a week on average. These included the state Pennsylvania System of School Assessment tests in math, reading, and science, for teacher evaluations; a three-part reading test; the DIBELS Next three-part reading tests; plus twenty more benchmark reading tests and four benchmark math tests created by district staff.

Ongoing and frequent assessment is part of good educational practice. Good teachers give lots of formative feedback—steady little nudges that let students know how they’re progressing. But they draw on a full palette of assessment to do that: calling on the class during a lecture, pop quizzes, sending students up to the board to solve homework problems, daily journal entries, lab reports, peer evaluations and group critiques, research papers, presentations, and final exams. Standardized tests, however, restrict the palette to black and white. They aren’t in teachers’ control, so they aren’t integrated into teaching and learning in the same way that formative feedback is.

Often the more a kid is struggling in school, the more time she spends taking standardized tests. Response to intervention (RTI) is a heavily assessment-driven approach to schooling that’s being used to some extent in 60 to 70 percent of schools. Assessment is “at the front end” of RTI, said Louis Danielson, who was in the Department of Education’s Office of Special Education Programs from 1976 until 2008.

“Assessment plays a key role in decision-making,” he said of RTI. “You’re screening to identify at-risk kids.” Under RTI, at the beginning of first grade every student takes a reading test. Those who score at the low end are assessed every other week to determine whether they’re making sufficient progress. If they aren’t, after six to eight weeks they’ll be eligible for more targeted interventions, like tutoring or small-group work. The testing continues, up to once or twice a week.

Richard Halverson, a professor of educational leadership at the University of Wisconsin-Madison who studies how technologies change schools, calls RTI “a national effort to make special ed into all of school—so all kids get assessed, all get learning plans and the kids who struggle get assessed even more. It’s the enshrinement of pervasive assessment as the model of education.” Pervasive assessment is a nightmare version of school for most students. It’s like burning thirsty plants in a garden under a magnifying glass, in the hope that they will grow faster under scrutiny.

That’s the time factor. What about money? Are we spending too much on these tests, most of which goes to a handful of private companies? A 2012 report by the Brookings Institution found $669 million in direct annual spending on assessments in forty-five states, or $27 per student. But that’s just the beginning.

The cost rises to an estimated $1,100 per student when you add in the logistical and administrative overhead (e.g., the extra cost of paying teachers to prep for, administer, and grade the tests) plus the instructional time lost. Leaf, for example, employs a full-time testing coordinator, though it has fewer than two hundred students.

According to a 2006 analysis by Bloomberg Markets, over 60 percent of the test companies’ revenue comes from prep materials, not the tests themselves. The profit margins on No Child Left Behind tests are as low as 3 percent, but practice tests and workbooks are more cheaply produced and carry profit margins as high as 21 percent.

Many informed observers say we’d do better to have more expensive tests and fewer of them. “The reliance on multiple choice tests is a very American obsession,” said Dylan Wiliam, an expert on the use of assessments that improve classroom practice. “We think nothing of spending $300–$400 on examining kids at the end of high school in England.” It’s a case of penny wise and pound foolish, critics like Wiliam say: you waste billions of dollars and untold hours by distorting the entire enterprise of school, preparing students to take crummy multiple-choice tests that cost only twenty-five bucks to grade.

3. They are making students hate school and turning parents into preppers.

“The tests are boring!” complains Jorge, a sixth-grade student at Leaf. “You don’t really want to sit in a chair for three hours. There’s no breaks. You can’t stand up and stretch, go to the bathroom, get a tissue, get a drink of water. It makes us really stressed, so we don’t do as well.”

A little bit of stress can be healthy and motivational. Too much or the wrong kind can be damaging and toxic. When you put teachers’ and principals’ jobs on the line and turn up the heat on parents, students catch the anxiety like a bug.

Claire Walpole, a Chicago parent, blogged about her experiences assisting her daughter’s class with computer-based testing. Her daughter broke down on the way home on the second day: “‘I just can’t do this,’ she sobbed. The ill-fitting headsets, the hard-to-hear instructions, the uncooperative mouse, the screen going to command modes, not being able to get clarification when she asked for it. . . . It took just two days of standardized testing for her to doubt herself. . . . ‘I’m just not smart, Mom. Not like everyone else. I’m just no good at kindergarten, just no good at all.’”

Especially in the elementary grades, teachers and parents across the country report students throwing up, staying home with stomachaches, locking themselves in the bathroom, crying, having nightmares, and otherwise acting out on test days. Educational consultant Sara Truebridge, recalling her years as a first- and second-grade teacher giving mandated state tests, said, “I never gave a test where I didn’t have one child totally melt down. Just crying. These are second graders. They can’t do it, they’re nervous, they’re tired, they’re showing tics, they’re not sleeping. And these may be the most gifted kids in the room.”

Research dating back to the 1950s has shown that 25 to 40 percent of students suffer anxiety significant enough to depress test performance and that these anxious students perform 12 percent worse on average. The current thinking is that anxiety distracts people from the task at hand, as their minds are focused on negative thoughts about shortcomings and their imminent failure, and that this negative self-talk also interferes with working memory. All of these effects undermine the ability of standardized tests to discern students’ true competence. And as the tests draw more and more focus, they destroy students’ enjoyment of school.

The anxiety doesn’t end when students go home. The pressure of high-stakes tests is driving parents to act against their own values. “Parenthood, like war, is a state in which it’s impossible to be moral,” wrote Lisa Miller in New York Magazine in 2013 in an article in which she describes sending a fourth grader to school with head lice so she could take the state-mandated English exam to get into competitive middle schools.

From striving immigrants to the very wealthy, it’s becoming more commonplace for families across the country to spend thousands of dollars annually to help their kids prepare for the standardized tests that will get them into public gifted kindergartens, private schools, competitive middle schools and high schools, and, of course, college. Since the 1970s, among affluent families the total amount spent on out-of-school enrichment has grown from $3,500 a year to $8,900 a year, both in 2012 dollars.

For working-class parents whose kids are more likely to be labeled failing, school-mandated tutoring, afterschool programs, and Saturday and summer school sessions crowd out limited time and resources for extracurriculars or other enrichment. “She goes in the morning to the extra tutoring before school, she stays after school, she’s pulled out during class,” says Rosendo Soto, a firefighter in Texas, whose middle child is struggling with the tests. “We’re on spring break now, I’m working intensively with her every day on math and writing expository stories and personal narratives. It’s all geared toward these tests, tests, tests. She’s nervous, fearful, and I have to remind her every day it’s just school. I feel like we’re sending her to be tortured.”

The money parents spend on preparing kids for tests dwarfs what schools are spending to give them. The total test preparation, tutoring, and counseling market in the United States was estimated at $13.1 billion by 2015, and the global private tutoring market was estimated to surpass a whopping $78.2 billion. That counts companies like Kaplan, Princeton Review, and Grockit that hold classes, sell books, and offer online services, and the national and international chains like Kumon, Sylvan Learning, and Huntington Learning Center that accept kids as young as eighteen months old for pre-academic and after-school drilling and prepping. That estimate also includes money spent on private tutors, who can charge anywhere from $45 to $1,000 an hour, and independently operated “Saturday schools” or “cram schools” that are expanding from their traditional Chinese-, Korean-, and Russian-speaking immigrant roots to attract more and more mainstream American families.

What these dollar figures don’t convey is the time, anxiety, and opportunity cost that come along with them. Instead of giving them time to pursue a creative passion, play a sport, play outside, or just be together as a family, millions of stressed-out parents are frog-marching their kids through hours of the most boring kind of studying on top of the time they spend in school. No matter how much you want to convey to your children the spirit of fair play, the joy of learning for its own sake, the belief that they are more than a score on a piece of paper, sending them to test prep is an action that speaks far louder than words.

4. They are making teachers hate teaching.

I want my child taught by proud, well-paid, highly engaged professionals. But high-stakes standardized tests deprofessionalize teaching because they give outside authorities the final say on how teachers should do their jobs. The testing company determines the quality of teachers’ performance. In judging students’ progress, the law gives test scores more weight than the observations of people who spend time with the kids every day.

Possibly the most politically charged application of standardized testing is the rapid growth in the use of these tests in teacher evaluation. Teachers used to be evaluated solely by their supervisors, and the vast majority historically got satisfactory ratings regardless of how well the school or their students were doing. Race to the Top, a 2009 Department of Education initiative under President Obama, instead rewarded states for evaluating teachers based on student test scores in the hope that this would be more objective.

According to the National Council on Teacher Quality, from 2009 to 2012 thirty-six states and the District of Columbia changed the rules for teacher evaluation. Thirty states now require these evaluations to include “objective measures of student achievement,” which in practice nearly always means test scores. Eighteen states and the District of Columbia actually base tenure decisions on the test scores of a teacher’s students.

How do you judge a teacher based on their students’ test scores? Not very well. Obviously you can’t take a teacher whose students are the children of Hispanic migrant workers and simply compare their test scores to those of the teacher teaching the rich kids up the hill to figure out who is a better teacher. Value-added measurements were thus concocted. These take students’ scores one year and their scores the next year (or, sometimes, their scores on the same test repeated in the fall and the spring) and compare them to a model that predicts how much they should have grown over that time period. The teachers’ “value add” is how much the students actually gain compared to what was predicted.
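At its core, a value-added measurement is just the gap between actual and model-predicted growth, averaged over a class. This is an illustrative sketch only; real state models are far more elaborate, and every name and number below is hypothetical:

```python
# Hypothetical sketch of a value-added measurement: a teacher's "value add"
# is how much her students' actual scores beat (or miss) what a model
# predicted for them. All names and numbers here are made up.
def value_added(actual_scores, predicted_scores):
    """Average gap between actual and model-predicted scores for a class."""
    gaps = [actual - predicted
            for actual, predicted in zip(actual_scores, predicted_scores)]
    return sum(gaps) / len(gaps)

# A model predicts each student's score this year from last year's results;
# the class's average distance from those predictions is the teacher's rating.
predicted = [72, 80, 65]
actual = [75, 78, 70]
print(value_added(actual, predicted))  # 2.0 -> class averaged 2 points above prediction
```

A class that lands below its predictions yields a negative number, which is how a teacher of very high-scoring students can still be rated at the bottom.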

There are a lot of holes in this approach. There is no value-added data at all on kindergartners through third graders, in the years before official testing begins, although some states have added yet more tests to rectify this problem. Should physical education, art, science, and social studies teachers be evaluated based on their students’ math and reading skills? What about students who transfer into a class midyear? What about team teachers? What about specialists? What about students who are often absent? What if tests and/or cutoff scores change and test results drop district-wide as a result?

In a 2011 paper, “Getting Teacher Evaluation Right,” the Stanford researcher Linda Darling-Hammond and three other education researchers concluded that value-added measurements should only be used alongside other means of evaluation and in a low-stakes way. Their research showed that ratings for individual teachers were highly unstable, varying from year to year and from one test to another.

A vivid example of the instability of value-added formulas is the story of Carolyn Abbott, the “worst” eighth-grade math teacher in New York City. Abbott taught math to both seventh and eighth graders at the Anderson School, a public school in Manhattan that pulls students from all over the city for its gifted and talented program. Her seventh-grade students performed in the 98th percentile on the 2009 state test. Based on their high scores, the value-added model predicted that these students would perform at or above the 97th percentile the following year.

But in 2010 Abbott taught this exact same class of students, now in the eighth grade. By this time these students were far ahead of the material covered on the state test, which they had learned in fifth grade at Anderson. They were preparing instead for a much tougher, high school–level Regents Exam in algebra and were busy applying to high schools. “The eighth-graders don’t care; they rush through the exam, and they don’t check their work,” Abbott told the Washington Post. “The test has no effect on them. I can’t make an argument that it counts for kids. The seventh-graders, they care a bit more.”

So her eighth graders, who had been 98th-percentile performers as seventh graders the year before, slacked their way to “only” the 89th percentile in 2010. The value-added formula blamed Abbott for the relatively large drop in scores, thus anointing her the worst eighth-grade math teacher in New York. New York City’s Department of Education, over the objection of the teachers’ union, released its Teacher Data Report to major media outlets, and Abbott’s name and rank were published far and wide. Abbott had the support of her administration and her students’ parents, but the experience was so “humiliating” that she left teaching for a PhD program in mathematics at the University of Wisconsin-Madison. “It’s too hard to be a teacher in New York City,” she told one blogger. “Everything is stacked against you. You can’t just measure what teachers do and slap a number on it.”

“Teachers are demoralized and feel very powerless” because of test-driven accountability, said Randi Weingarten, president of the American Federation of Teachers, one of the two national unions. “Large numbers of teachers are retiring. The attrition rate in big cities is around 50 percent,” up to a high of 70 percent after five years in Washington, DC.

The 2012 annual MetLife Survey of the American Teacher showed that the percentage of teachers who are “very satisfied” with their jobs had sunk to 39 percent, its lowest point since 1987. Half of teachers said they felt very stressed. Only a fifth to a quarter of teachers in other surveys express faith that tests are accurate reflections of their students’ learning. “These are pretty shoddy tests,” said Weingarten. “When everything becomes about data and testing, it wholly controverts the purposes of education.”

Teachers are taking to YouTube, blogs, Tumblr, and Twitter to describe just how demoralizing standardized tests are to them personally. A veteran fourth-grade teacher in Florida resigned in May 2013 via YouTube. “I have experienced the depressing gradual downfall and misdirection of education that has slowly eaten away at my love of teaching,” she said in her video. Curtains blow gently in the breeze behind her; her face is haggard. “Raising students’ test scores on standardized tests is now the only goal. . . . Everything I loved about teaching is extinct.” The video has over 600,000 views.

5. They penalize diversity.

No Child Left Behind (NCLB), the major testing law, was intended to “close the achievement gap.” It sought to hold schools accountable, not just for results averaged over all students, but also for the performance of each historically lower-performing group of students: the poor, African Americans, Hispanics, English language learners, and those with a learning disability.

The unintended consequence of that laudable intention is that the more of these subgroups a school has, the more chances it has to fail to make adequate yearly progress (AYP) targets. In other words, schools that serve the poor and ethnic minorities are more likely to miss NCLB targets and be punished or closed. The number of so-called turnaround schools spiked from around one thousand a year in the mid-2000s to a peak of six thousand in 2010–2011. The number of schools shut down has been more volatile, but it has risen from around one thousand a year in the early 2000s to between fifteen hundred and two thousand a year in the late 2000s. School reorganizations, granted, sometimes bring improvement, but in all cases they disrupt communities, and this is why they have sparked protests from Detroit to Newark to Chicago to Houston to Baltimore.

Leaders of diverse schools have two rational responses to this situation. The hard way is to redouble efforts to ensure the success of at-risk subgroups of students. The easy way is to cheat on the tests, or to somehow get rid of those subgroups.

The case of Lorenzo Garcia, the superintendent of the El Paso Independent School District, shows the lengths that some school leaders are willing to go in response to high-stakes testing policies—far beyond cheating, to actually interfering with the educations of hundreds of students in order to manipulate the statistics. Garcia collected $56,000 in bonuses for the outstanding improvement in scores posted by his overwhelmingly low-income, immigrant, Hispanic student population on the Texas tenth-grade test.

Over his six-year tenure from 2004 to 2010, as a federal court found, Garcia achieved this improvement in scores by systematically targeting lower-achieving students and stopping them from taking the tests. He and his coconspirators used a wide variety of methods. Students would be transferred to charter schools. Older students arriving from Mexico, many of whom were fleeing the drug wars in nearby Ciudad Juárez, were incorrectly placed in ninth grade. Credits were deleted from transcripts or grades changed to move students forward or back a grade in order to keep them out of the tenth-grade test. Because of the manipulation, enrollment at some high schools dropped 40 or 50 percent between ninth and tenth grades.

Those intentionally held back were sometimes allowed to catch up before graduation through “turbo-mesters,” “earning” a semester’s worth of credits in a few hours on the computer. Sometimes truant officers would visit students at home and warn them not to come to school on test days. And sometimes students were openly encouraged to drop out. El Paso citizens called their lost students “los desaparecidos,” or the disappeared.

Linda Hernandez-Romero’s daughter was one of those held back in the ninth grade. She dropped out of high school and had three children by the age of twenty-one. Hernandez-Romero told reporters, “She always tells me: ‘Mom, I got kicked out of school because I wasn’t smart. I guess I’m not, Mom, look at me.’ There’s not a way of expressing how bad it feels, because it’s so bad. Seeing one of your children fail and knowing that it was not all her doing is worse.” Rick Perry’s Texas Education Agency found Garcia innocent of these allegations, but a federal prosecution resulted in $236,500 in fines and a forty-two-month prison sentence for Garcia.

Garcia’s case is exceptional because it resulted in jail time. But this kind of systematic discrimination in response to high-stakes testing has been documented in at least three states for over a decade, as discussed in Chapter 3.

Not only do they motivate blatant discrimination, but high-stakes standardized tests also interfere with educators’ ability to meet individual learning needs. Overall, 13 percent of schoolchildren are now labeled LD, for learning disabled. Under a high-stakes system both parents and schools have good reasons to push for an official diagnosis for any student who has trouble sitting perfectly still for ninety minutes every day for three weeks at a stretch. The diagnosis means extra time to take the tests, modifications, extra help, and resources. For schools, if more kids with mild learning differences end up slotted into the LD category, statistics dictate that scores will rise in both the general and LD groups.

But the long-term consequences of aggressively sorting, stigmatizing, and medicating kids are unknown. In particular, the number of kids on medication for attention disorders like ADD and ADHD has risen from 600,000 in 1990 to 3.5 million in 2013. Leading doctors who study this disorder have called the trends a “national disaster”—not a medical epidemic but rather one of overzealous treatment driven by a profit-seeking pharmaceutical industry.

The good test-takers are getting shortchanged too. Traditional standardized tests provide the most accurate information on students toward the middle of the intellectual bell curve. If a child either “hits the ceiling” with a perfect score or bottoms out on the test, her score will tell teachers very little about which areas she needs to work on. Not surprisingly, there is evidence that in the most test-driven school settings students who score well above or well below proficient get less individualized attention because teachers instead work intensively with the students who are just below proficient, or “on the bubble.” Promoting a single standard of proficiency for every child may be efficient for policymakers, but it flies in the face of current educational theory, which celebrates the individual learning path of each child.

Allison Keil is the codirector of the highly popular Community Roots Charter School in New York City. Each class in her school is team taught and includes gifted, mainstream, and special needs students working together. She calls the tests distracting, demoralizing, and confusing for many of her students and their families. “A child with an IEP [individualized education plan, given to students with learning disabilities] has specific goals. She may be working incredibly hard all year, meeting the promotional criteria that she and her teacher have set together, and then she gets a 1 (below proficient) on the test and feels like a failure. It’s a huge disservice to the progress she’s made.”

Rebecca Ellis voices the same frustration. She is a single mother of a nine-year-old autistic boy named Jackson in Mandeville, Louisiana, north of New Orleans. I met them through a mutual friend at the raucous sidelines of a Mardi Gras parade; her younger, typically developing son was up on a ladder, trying to catch beads from passing floats, while Jackson ignored the racket, playing with a small plastic puzzle.

“I know today, in 2014, that Jackson is never going to pass one of these standardized assessments,” she tells me. “He took the Iowa test last year and scored in the second percentile.” It frustrates her that there is no official recognition of the real progress he is making, such as in interacting with other children, because there is no room for nuance in the standards. Rather than help him achieve his social development goals, the school’s resources are diverted toward drilling him on math and reading concepts that are far out of his ken.

Standardization is the enemy of diversity. In our high-tech era, what humans have to offer is not robotic sameness but rather variation, adaptability, and flexibility. Rating students as 1, 2, 3, or 4 in a few limited skills does nothing to promote, support, or recognize that human value or individual potential.

6. They cause teaching to the test.

In an ideal world better test scores should show that teaching and learning are getting better. But, as Daniel Koretz explains, standardized tests have never delivered on that simple promise. “If a test is well designed, good instruction will produce increases in scores,” said Koretz. “But if the test is narrow enough, and you’re incentivizing teachers, many will stop doing the more general instruction in favor of the fairly modest amount of material that we can test well. NCLB focuses on easily tested portions of reading and math skills. Huge literatures say that’s a fundamental mistake.”

In his book Koretz identifies seven rational teacher responses to high-stakes tests. From most desirable to least desirable, they are:

1. Working more effectively (e.g., finding better methods of teaching)

2. Teaching more (e.g., spending more time overall)

3. Working harder (e.g., giving more homework or harder assignments)

4. Reallocation (e.g., shifting resources, including time, to emphasize the subjects and types of questions on the test)

5. Alignment (e.g., matching the curriculum more closely to the material covered on the test)

6. Coaching students

7. Cheating

How do we know which strategies teachers are applying? We can guess by looking at the types of tests we’re using. Reliability is a basic concept in the profession of test making (known as “psychometrics”). A reliable test is one on which this year’s test takers show pretty much the same distribution of scores as last year’s test takers. Think back to high school: if you took the SAT more than once, say, in the fall and spring, you would have noticed that the two tests were virtually identical even if no single question was repeated. It wouldn’t be fair to students if the fall 2014 test were very different from the spring 2015 test, because that could lead to unpredictable variations in scores. To be reliable, then, tests must be somewhat predictable, changing only slowly and gradually from year to year. And to be relatively cheap to administer, standardized tests currently have to be mostly multiple choice and gradable by computer. Predictable, multiple-choice tests are inherently more susceptible to coaching and cheating. And high stakes applied to cheap tests drive even good teachers toward bad strategies.

A first-grade teacher described on a blog exactly how testing had hurt her and her students:

Standardized tests actually make students stupid. Yes, stupid. Not only are the kids not thinking, they are losing the ability to think. In my zeal to get administrative scrutiny off me and my students, I mistakenly thought that if I give [administrators] the test results they want, then I could do what I know was best for my students. To that end I trained my students to do well in these tests. I taught them to look for loopholes; to eliminate and guess; to find key words; to look for clues; in short, to exchange the process of thinking for the process of manipulation.

Research suggests this teacher’s experience is a common one. The Center on Education Policy reported in 2007 that 44 percent of districts cut time from activities such as social studies, science, art and music, physical education, lunch, and recess after NCLB. “We’re seeing schools emphasize literacy skills and math to the detriment of civics, social studies, the arts, and anything creative,” Wayne Au at the University of Washington Bothell, author of a separate study on the topic, told me. Au found that even in the tested subjects teachers lectured more and raced to cover more ground for the sake of exposing students to all the material potentially covered on the test. This meant fragmented, out-of-context presentation of information—more time spent with teachers talking and students sitting and listening.

7. The high stakes tempt cheating.

The simplest way to improve a school’s test scores is a #2 pencil with an eraser. You take the test papers, erase the students’ incorrect answers and bubble in the correct ones. This is Daniel Koretz’s seventh and least desirable response to testing.

It’s very likely that something like this took place in 2007–2008 after Washington, DC, public school superintendent Michelle Rhee offered cash bonuses to principals with the greatest improvement in scores. Statistical evidence pointed to widespread fixing of test answers. But Rhee, who had built a national reputation on the numbers, refused to investigate and claimed not to have seen a memo detailing the cheating that was written by a whistleblower and later obtained by the press. (She refused to be interviewed for this book.)

In no way is Washington, DC, an isolated case. According to a Government Accountability Office (GAO) report issued in May 2013, officials in thirty-three states confirmed at least one instance of cheating in the 2011 and 2012 school years, and in thirty-two of those cases, states canceled, invalidated, or nullified test scores as a result of cheating. Again, this was over just two school years.

A 2012 investigation by the Atlanta Journal-Constitution showed that 196 school districts across the country exhibited test score patterns consistent with widespread cheating. In 2011–2013, thirty-five educators were indicted following a state investigation for allegedly tampering with test scores in Atlanta, where school leaders held “erasing parties” to change student scores at forty-four schools; Louisiana investigated thirty-three schools in the charter-dominated Recovery School District of New Orleans for suspiciously high levels of erasures, improper administration of the tests, and other infractions; and two elementary schools on Long Island were investigated for teacher coaching of third, fourth, and fifth graders.

University of Chicago economist Steven Levitt, of Freakonomics fame, analyzed statistical evidence of cheating in Chicago public schools. He found that “cheating by school personnel increased following the introduction of high-stakes testing, particularly in the lowest-performing classrooms.” The groups most likely to cheat were classrooms that did badly the previous year and classrooms in schools with lower achievement, higher poverty rates, and more African American students, all characteristics associated with lower test scores.

“I’m not going to let the state slap them in the face and say they’re failures,” Damian Lewis, a teacher who participated in the Atlanta teaching scandal, told the New Yorker, explaining part of his justification for fixing answers. But at the same time, he said, “I couldn’t believe what we’d been reduced to.”

After he and other teachers began changing student answers on state tests at Parks Middle School, the predominantly poor, African American school falsely “met” its NCLB proficiency goals for the first time in 2006. They held a pizza party for the whole school. “Everyone was jumping up and down,” Neekisia Jackson, a student, told the New Yorker. “It was like our World Series, our Olympics.” She went on, “We had heard what everyone was saying: Y’all aren’t good enough. Now we could finally go to school with our heads held high.” The school became nationally honored for both its focus on data and its fabricated achievement.

When facing high stakes, students catch the cheating bug too, though not nearly as often as the people educating them. In the fall of 2012 twenty Long Island high school students were arrested for taking part in an SAT cheating ring; five of the students charged others up to $3,600 to sit for the exam. In the spring of 2013 students at more than 240 California schools broke the rules by posting pictures on social media while taking standardized tests, including pictures of test questions and answers. And in the spring of 2012 Nayeem Ahsan, a student at Stuyvesant High School, one of the best public schools in the nation, was caught texting hundreds of his classmates the answers on the state Regents Exams.

Widespread cheating should undermine our faith in tests as an objective measure of student progress. Instead, it undermines the process of education itself.

8. They are gamed by states until they become meaningless.

More widespread and even more detrimental than the cheating that goes on at schools are the games that districts, states, and politicians play with the law’s definitions of “proficiency” and adequate yearly progress. No Child Left Behind states that each school, district, and state must make “adequate yearly progress” in increasing the proportion of students in each subgroup that state tests deem proficient.

But the law did not define proficiency.

You might think that the psychometricians and learning specialists who create the tests also decide what “proficiency” means for a given test in a given grade. You’d be wrong.

Jeff Livingston is a senior vice president at CTB/McGraw Hill, one of the big four companies responsible for creating and marketing annual tests to states. An African American, he defends testing passionately as an instrument of equity. But he also paints a picture of states essentially ordering up tests to get the scores they want. “Respecting the local nature of education decisions, NCLB allowed every state to create its own assessment regime, cutoff scores, and measures of AYP,” or adequate yearly progress, he said.

The assessment regime is the set of tests being given in each grade and state. The cutoff score, also known as the cut score, is the score that designates proficiency. “And so what happened then,” Livingston explained, “is that you essentially had fifty state infrastructures in the process of putting together their own tests. You could have a state where 80 percent of kids are at or above grade level on the state tests but 20 to 30 percent are if you look at any nationally normed situation. And so it was in many ways a game to figure out who could create the test that met the minimum standards of adequacy without making the state education infrastructure look too bad, and I don’t know that it ends up being especially helpful for students.”

I ask him: Didn’t the testing companies balk at participating in this kind of psychometric malpractice? Livingston chuckles. “Our job is to respond to what our customers ask us to do, and our customers are the representatives of their communities,” he said. “I can’t argue with a state board of education. We gave them precisely what they wanted in precisely the way that they wanted.”

Doug Kubach, the CEO of Pearson School, the testing division of the largest education publishing company in the world, echoes this point: it’s out of our hands; the buck does not stop here. “We’re implementing the program and not designing or making decisions about it,” he tells me. “At the end of the day it is the state and the people working for the state that make the cut score decision.”

Unfortunately, when political leaders set educational standards they tend to act with political motivation. The Northwest Evaluation Association (NWEA) can put some flesh on that characterization. NWEA is a thirty-seven-year-old nonprofit testing company dedicated to low-stakes diagnostic testing meant to drive personalized instruction. Its tests are used in about half of the school districts in the country as well as 119 countries around the world. Over the last decade independent researchers have published a series of reports comparing NWEA test scores with state NCLB guidelines, and they have come to a single conclusion: there is no accountability in accountability measures. That’s because there is no consistency in state standards.

In a 2009 report, “The Accountability Illusion,” researchers took actual NWEA results in a sample of eighteen elementary schools and compared them to AYP targets for schools and population subgroups in twenty-six states. They concluded, “The way NCLB rates schools appears to be idiosyncratic—even random—and opaque. . . . In Massachusetts, for example, a state with high proficiency cut scores and relatively challenging annual targets and AYP rules, only 1 of 18 elementary schools made AYP; in Wisconsin 17 schools made AYP. Same kids, same academic performance, same schools—different states, different cut scores, different rules. And very different results.”

The Common Core was initially conceived partly as an opportunity to replace the hodgepodge of state-created tests with those produced by two federally funded multistate consortia, Partnership for Assessment of Readiness for College and Careers (PARCC) and Smarter Balanced. But a dozen states, for cost or other reasons, are already balking at the tests the consortia produced and instead commissioning their own from other publishers. And even those states that use the tests from the big two consortia can still choose their own AYP targets, now governed by a patchwork of state waivers. So the basic problem—no consistent definition of proficiency—will persist.

9. They are full of errors.

Kubach, Pearson’s CEO, rarely talks to the press. Our interview is rescheduled four times. When I get him on the phone he goes into great detail explaining the twenty- to twenty-five-step process by which test items are written and vetted by a series of committees. Then I ask him about the “pineapple question.” He knows exactly what I’m talking about.

“Yeah, so the pineapple question . . . um,” he pauses.

In 2012 the New York Daily News reported that students taking the New York state eighth-grade reading exam were asked to read a bizarre story about a talking pineapple that challenges a group of animals to a race and then doesn’t budge. At the end the animals eat the pineapple. The students were then asked two multiple-choice questions:

Why did the animals eat the talking fruit?

Which animal was wisest?

This idiotic faux fable stumped teachers, students, and school officials alike. The pineapple story became a local scandal, forcing the state Education Department to officially announce that the question would not count against students. The most annoying part was that it wasn’t even new. The story had appeared on Pearson tests in several states since 2006, drawing complaints year after year.

Kubach explained that this item, for some reason, went through a different review process from that used for the Common Core tests. He also said Pearson and New York State responded to the problems caused by the pineapple question. They changed the passage selection guidelines to reduce the use of “fables and fantasy stories”—no more ambiguous literature!

But the mistakes on tests are far more widespread than one bad pineapple. If your child starts taking math and reading tests in third grade, by the time she gets to seventh grade odds are she will have taken at least one test on which her score was bogus.

Each testing company employs a staff of psychometricians with advanced degrees, issues guidelines, and reviews most test items. But both the writing of actual test items, such as the pineapple question, and the grading of student writing are often farmed out to independent contractors making as little as $15 an hour. These workers, some of whom I’ve spoken with, aren’t required to have relevant degrees or any experience in education. Add this to the expanded and accelerated production schedule of these tests, with tens of thousands of questions in circulation each year, and flaws in standardized tests, ranging from poorly written questions like the one above to outright mistakes, are disquietingly common.

In a yearlong investigation published in the Atlanta Journal-Constitution in September 2013, Heather Vogell studied more than 92,000 test questions given over two years to students in forty-two states and Washington, DC. The investigation revealed that almost one in ten tests nationwide contained significant blocks of flawed questions—10 percent or more of the questions on these tests had ambiguous or wrong answers. In other words, the percentage of flawed questions is high enough in one out of ten tests to place the fairness of the results in doubt. The National Board on Educational Testing and Public Policy reported that fifty high-profile testing mistakes had occurred in twenty states from 1999 through 2002.

If anything, essay questions on standardized tests are even more questionable than multiple choice. They are supposed to be the place to demonstrate deeper learning and communication skills, yet they are typically graded by temporary workers who spend about two minutes per essay. In 2014 the head of the College Board announced that essays would become optional on the SAT. The reason: essay scores are predictive neither of student grades nor of success in college. A series of experiments by Les Perelman at MIT had shown that nonsensical essays could get high scores from graders if they used the right vocabulary and length.

10. The next generation of tests will make things even worse.

The Common Core State Standards, touted as “fewer, higher and deeper” and emphasizing ideas like critical thinking and logical reasoning in English Language Arts and math, were introduced in 2010 by Achieve, Inc., a nonprofit with considerable backing from the Gates Foundation. They have a growing chorus of detractors: Oklahoma, Indiana, and South Carolina dropped the standards in the spring of 2014, leaving them in place in forty-two states, and they have been the target of right-wing protests from Glenn Beck and others. Educators’ groups, teachers unions, parent groups, and others who oppose the Core tend to conflate it with the drift toward high-stakes testing.

But what about the tests themselves?

The federal government funded two state consortia to create the tests to the tune of $330 million. When the consortia, PARCC and Smarter Balanced, were announced in 2010, Education Secretary Arne Duncan said, “I am convinced that this new generation of state assessments will be an absolute game-changer in public education . . . many teachers will have the state assessments they have longed for—tests of critical thinking skills and complex student learning that are not just fill-in-the-bubble tests of basic skills but support good teaching in the classroom.”

The consortium assessments were set to roll out in the 2014–2015 school year. Joe Willhoft is the executive director of the Smarter Balanced assessment consortium. He says the Common Core tests will be more useful than older tests because they are given by computer, so teachers can see and apply the results more immediately.

Still, the new tests will have most of the same problems as the old tests. They are still cheap. The Smarter Balanced assessment package, for example, is estimated at $27.30 per student. This is cheaper than what two-thirds of states in the consortium are currently paying. They’re cheap because they are still largely multiple choice and still cover limited subjects in limited ways. And because they are multiple choice and limited, they’ll still be error-prone, coachable, and likely to distort the curriculum.

The Gordon Commission, an independent panel of experts, concluded in a 2013 review of the Common Core–aligned assessments: “The progress made by the PARCC and Smarter Balanced consortia in assessment development, while significant, will be far from what is ultimately needed for either accountability or classroom instructional improvement purposes.” Linda Darling-Hammond, a Stanford researcher and a member of the Gordon Commission, clarifies, “They are for most states a step in the right direction, but they are limited and still in the US testing paradigm, which is different than you see in most countries: a sit-down test with lots of selected-response, multiple-choice questions, and a few open-ended questions . . . they are not as robust as the standards themselves call for and as some other countries do.”

And the worst part is that these tests are still, by current law, intended to be high stakes. The high stakes become a real problem when you realize one more consequence of Common Core–aligned assessments: the so-called assessment cliff.

These tests are harder by any measure than the ones they’re replacing. Two states got a head start by giving Common Core–aligned assessments produced by Pearson. New York saw a 24 percentage point drop in ELA proficiency and a 33.8 point drop in math in the first year. In Kentucky the drop in both subjects was around 25 points.

Willhoft says the score drop-off is just a reality check that schools and districts need to face, stating, “Thirty to forty percent of our public school graduates must take remedial courses when they get to college.” This number is in dispute: the National Center for Education Statistics, the government clearinghouse, lists the remediation rate for all first-year college students at 20 percent. But even if the real number is half what Willhoft quotes, it’s too much.

Still, the predictive validity of the Common Core tests is not proven because they haven’t yet been given to large numbers of students or correlated with the long-term success of those students. Just because they are harder doesn’t prove that they align well with what students need to know or be able to do in college.

More important, there is no evidence that the effects of high-stakes tests—more teaching to the test, more cheating, more closing of schools and firing of teachers—will indeed prepare more students to succeed in college. In fact, we can be pretty sure it won’t because that’s what we’ve been trying with little success since No Child Left Behind was passed twelve years ago.

The Common Core poses another dilemma: these tests are in some ways even more standardized than the ones that came before them. Instead of fifty different curricular standards and fifty different tests in fifty states, there is just one set of standards and will potentially be just three or four Common Core tests in use across the country.

On the one hand, using fewer tests makes comparisons between states more valid.

If the same tests are given to millions of students, states won’t be able to play so many games with the definition of “proficiency.” Even if each state sets its own cut scores, as McGraw Hill’s Jeff Livingston and Pearson’s Doug Kubach say they do, it will be easy to compare scores across state lines. This may be one of the reasons why a dozen states backed out of the test consortia in the spring and summer of 2014. As of June 2014 only 42 percent of the nation’s students were set to take these tests the following spring; other states would either purchase Common Core tests from vendors like Pearson or hadn’t yet decided.

At the same time, the greater alignment between curriculum and test, the smaller number of tests overall, and school districts’ need to adopt brand-new curricula and tests all at once create a major business opportunity. Education Secretary Arne Duncan has spoken about the Common Core creating a unified “marketplace.” Companies like Pearson, Apple, Microsoft, and Google can sell the same tests, materials, curricula, and devices to schools nationwide.

The Common Core thus paves the way for education that is ever more test driven, that begins and ends with tests, where teaching to the test is the only option left because the textbook and the test were written and vetted by the same committees and published at the same time by the same company.

Where did these things come from? How did they become the law of the land? And how can we do better?