2

 

The History of Tests

 

The Personal Equation

In 1795, at an astronomical observatory in Greenwich, England, an astronomer named Nevil Maskelyne had an assistant, David Kinnebrook, who consistently reported times of stellar transit about one second later than Maskelyne’s. Kinnebrook was fired for what must have looked like laziness. Friedrich Bessel, an astronomer and mathematician conducting research at Konigsberg, Prussia, read an account of the incident published in 1816. He was inspired to compare his own observations and those of two other astronomers. In 1823 Bessel reported an average, consistent variation from person to person of between 1.0 and 1.2 seconds. He developed two innovations to improve accuracy: first, a “personal equation” to correct for individual variation, and secondly, a better clock, one that ticked every half a second instead of every second. These two corrections improved the accuracy of astronomy quite a lot. They also introduced the idea of using statistics to measure and correct for the differences among people.

In fact, Maskelyne and Bessel were not the first to notice this strange pattern of human variation in timekeeping. Johann Carl Friedrich Gauss (1777–1855), known as the “Prince of Mathematicians,” had created his own version of the personal equation to model the distribution of human errors in astronomical observation. This was the “Gaussian distribution,” or the famous bell curve.

Notably, history is silent as to whether Maskelyne’s recorded times or his assistant Kinnebrook’s, one second later, were truer to the actual movements of the star. The man was fired, in essence, for not living up to an arbitrary numerical standard set by his superior—the first victim of high-stakes assessment.

Barely a decade would pass before both the bell curve and the concept of a personal equation underwent a subtle and important shift—from improving people’s perception of performance with respect to an objective scientific phenomenon (e.g., the timing of an eclipse) to attempting to observe and measure objectively the people themselves.

A Belgian social scientist, Lambert Adolphe Jacques Quetelet, was the first to apply the bell curve beyond astronomy to what he called “the facts of life.” He published a work in 1835 titled A Treatise on Man, and the Development of His Faculties, which maps out a kind of clockwork human universe, noting bell-curve patterns in crime, poverty, disease, marriage, suicide, birth and mortality rates, morality and immorality, and weight and height. (Quetelet developed the Body Mass Index, the measure of obesity that we still use today.) As a criminologist, he was among the first to calculate the statistical relationships between crime rates and poverty, age, alcohol consumption, and even local climates (people misbehave more in warm weather than cold weather).

His theory, which he called “social physics,” was simple: survey the population, chart the variations on a curve, and at the top of the camel’s hump, in the middle of the distribution, you will find the “average man.” What begins to be assumed in Quetelet’s writing is that being average, if not above average, is desirable. The normal becomes the normative.

Anthropometry

Quetelet laid out the method for calculating the distribution of human intelligence in the early nineteenth century, but science still lacked any way to directly measure or define intelligence.

Folk archetypes allow, at a minimum, for the existence of at least three distinct kinds of smarts: the wise, the clever, and the talented or creative. The tricks of a Loki or an Anansi are very different from the sublime flute playing of Krishna, the scholarship of Thoth, or the strategy and skill of Athena.

“I neither know, nor think that I know” is Socrates’s definition of his own wisdom in Plato’s Apology. “Who is wise? He who learns from every person,” says the Pirke Avot, a Jewish sacred book. In 1575 the Spanish philosopher Juan Huarte de San Juan described three very different and even contradictory qualities associated with intellect: “docility in learning from a master, understanding and independence of judgment, inspiration without extravagance.” All of these dimensions of intelligence—creative talent, interpersonal cleverness, practical skill, prodigious memory, openness to experience, originality—ring true today, but they’re still just as hard to quantify or compare.

Science focuses on what can be measured. In the nineteenth century anthropometry was the name given to the emerging practice of measuring people’s physical characteristics such as weight, height, bone lengths, blood pressure, and so on. Among its notable applications was the Bertillon method of identification, a system of five body measurements, plus photographs, developed in France in 1879 and used to identify criminals for a few years before fingerprinting came along. The Bertillon method was the basis for the mug shot, still in use today.

Darwin’s Cousin

Francis Galton was a cousin of Charles Darwin and a child prodigy who studied at Cambridge. His program of measurement followed the lead of Wilhelm Wundt, the “father of experimental psychology,” who measured people’s reactions to sensory stimuli such as hot and cold, hard and soft, and high and low tones. Wundt, in turn, had studied with Hermann von Helmholtz (1821–1894), a physicist and philosopher whose wide-ranging interests included the study of visual perception.

In 1884 Galton opened an anthropometric laboratory at the Kensington Museum in London. Between 1884 and 1890 he measured ten thousand people along lines set forth by Wundt, recording their “Keenness of Sight; Colour-Sense; Judgement of Eye; Hearing; Highest Audible Note; Breathing Power; Strength of Pull and Squeeze; Swiftness of Blow; Span of Arms; Height, standing and sitting; and Weight” as well as their eye and hair color—the better to identify “mongrel” characteristics. Among his other accomplishments, Galton coined the term “eugenics.”

His 1889 book Natural Inheritance is a bewildering blend of racism and path-breaking social science. Both are family traits in the field of intelligence testing. Galton presents the foundations of regression analysis and the correlation coefficient, both essential tools across the sciences for describing the relationship between disparate variables. Regression analysis, at its simplest, involves drawing a “line of best fit” between two variables, visualized as X and Y axes on a graph, to understand the relationship between them. The coefficient of correlation between two variables ranges from -1 (if one occurs, the other never occurs) to 0 (no relationship) to +1 (if one occurs, the other one always occurs). When it’s close to zero, the relationship is close to random chance. Significance is a function both of the size of your experimental sample and of the range of values across it. The “threshold of significance” also varies from field to field—correlations tend to be lower in the social sciences. (Lightning very clearly causes thunder, but for anger and shouting, the most that can be said is that they generally occur together.) Galton uses these tools to argue that intelligence, basic virtue, and merit are trademarks of the white race and the male gender.
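
For the statistically curious, here is a rough sketch of those two tools in Python. The numbers and variable names are invented for illustration; this is not Galton’s data, just the same arithmetic he pioneered.

```python
import numpy as np

# Hypothetical paired measurements for two variables
x = np.array([62.0, 64.5, 66.0, 67.5, 69.0, 70.5, 72.0])  # say, parents' heights in inches
y = np.array([63.5, 65.0, 66.5, 67.0, 68.5, 69.0, 70.5])  # say, children's heights in inches

# Correlation coefficient: runs from -1 through 0 (no relationship) to +1
r = np.corrcoef(x, y)[0, 1]

# Line of best fit (simple regression): y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

print(f"correlation r = {r:.2f}")
print(f"line of best fit: y = {slope:.2f}x + {intercept:.2f}")
```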

Among his other contributions, Galton may be credited as the first to suggest that measurements could be useful for evaluating schools’ performance as well as monitoring the development of individuals. He envisioned a full anthropometric laboratory in every school and a system of tracking students after they left school to further correlate their life performance and successes with the measurements taken of them in youth. (The idea of long-term, multidimensional tracking is much more feasible in the era of big data and has champions today, as I’ll get into in Chapter 4.) He proposed a system of “public examinations” and advocated that the highest scorers should be presented with large cash prizes if they elected to marry and breed.

In a 1905 lecture to the Royal Congress of London on Public Health, titled “Anthropometry in Schools,” Galton defined anthropometry as “the art of measuring the physical and mental faculties of human beings.” He asserted, “By recording the measurement of a small sample of his dimensions and qualities, . . . these will sufficiently define his bodily proportions, his massiveness, strength, agility, keenness of sense, energy, health, intellectual capacity, and mental character, and will substitute concise and exact numerical values for verbose and disputable estimates.”

The lust for concise and exact numerical values and the desire to boil down the complexities of humans’ “mental faculties” to an easy-to-manage set of measurements have never left us, no matter how unsatisfying the results.

At the World’s Fair

In 1893 Joseph Jastrow, a psychologist at the University of Wisconsin, opened an outpost of Galton’s laboratory at the Chicago World’s Fair, with a focus on the measurement of “the more elementary mental powers.” Jastrow acknowledged some of the basic problems with all mental tests, which are still bugging us today: “Mental tests of this kind are burdened with difficulties from which physical measurement are comparatively free. Our mental powers are subject to many variations and fluctuations. The novelty of the test often distracts from the best exercise of the faculty tested, so that a very brief period of practice might produce a more constant and significant result. Fatigue and one’s physical condition are also important causes of variation.”

One point Jastrow didn’t seem to worry about is whether his test items, as administered to fairgoers fresh from riding the world’s first Ferris wheel and eating the first Cracker Jack, actually had “predictive validity”—that is, whether they told you anything worthwhile about the person being tested.

He provides detailed records on twenty-nine different tasks, including some memory tasks that might not be out of place on an intelligence test today, alongside some downright goofy stunts, like:

Equality of movements. The subject places the point of a lead pencil at the left end of a sheet of paper 15 inches long, and makes five movements by raising the pencil and bringing it down again, the attempt being to make the distances between the dots so recorded equal. The test is made with the eyes closed, the estimate of the distance moved depending upon the motor sensibility. The only limit of movement is that suggested by the length of the paper. The average distance between the dots and the average percentage of deviation are computed and recorded.

Around the same time, 1889, James McKeen Cattell at the University of Pennsylvania established a permanent lab for similar “mental tests,” such as hand strength, rate of movement, pain threshold, reaction times, and perception of time elapsed. (Cattell was another student of Wundt at the University of Leipzig, where he became the first American to write a PhD dissertation in psychology, with the title Psychometric Investigation, underscoring the close relationship between the rise of intelligence testing, the science of psychology, and the history of American academia.)

In 1891 Cattell moved the lab to Columbia University, where he tested all entering freshmen, energetically promoting his program. Unfortunately, Clark Wissler, his graduate student, applied Galton’s correlation coefficient to the results and found close to zero correlation between these “mental tests” and the students’ actual class grades. In other words, the tests had no predictive validity whatsoever. Cattell’s program was quickly scrapped, marking one of the only times that scientific evidence limited the spread of intelligence tests. Still, some vestiges of Cattell’s notions persisted, as Ivy League freshmen were routinely photographed nude for “research” purposes through the 1960s and 1970s.

The Binet Scale

The quest for a usable test of intelligence, one that would both produce results matching the famous bell curve and actually have some power to predict real-world performance in school or life, continued into the twentieth century. In the 1880s the French government established free, mandatory, and secular education. With a much broader influx of the population into schools, they needed a way to predict which children would need early intervention in special classes. Asked to correlate test results with school performance, two French psychologists, Alfred Binet and Theodore Simon, abandoned most anthropometric-style measures in favor of trials of practical knowledge, abstract thinking, vocabulary, and problem solving, with a few memory tasks, mental arithmetic, and moral dilemmas thrown in.

After trying out his questions, first on his own two daughters and later on small groups of Paris schoolchildren, Binet developed the concept of mental age, adjusting test results to the norm of tasks that at least 75 percent of children could accomplish at a given age. Simon and Binet introduced the Binet Scale in 1905.

Because they set the age of full adult mental development at sixteen, contrary to current neurological findings that significant brain development continues through the age of twenty-five, the Simon-Binet test has a low ceiling. The shape of the scale makes it much easier to score high at age four than to do so at age twelve. In fact, I was one of those “geniuses” in diapers—my parents had me tested at age two.

Anthropometrics was giving way to psychometrics, a Sisyphean series of attempts by the discipline of psychology to establish itself as a science through the application of numbers and instrumentation to the study of the human mind. (The current generation of that effort would be neuroscience. Brain scans and the prefix “neuro-” are these days applied to every subject under the sun, creating such exotic new fields as “neuroeconomics,” “neuromarketing,” and “neurolaw.”)

But it’s worth emphasizing that psychometrics has a scholarly lineage entirely different from that of learning and teaching. Colleges of education teach theories of instruction based in part on developmental and behavioral psychology, but psychometrics as a discipline has more in common with statistics. That means tests may be able to show whether you know something, but even the best of traditional tests can’t show how you learned it or what to do if you don’t know it. “Among the last people I’d ask for instructional advice would be my psychometrician friends,” said Joseph Ryan, a fellow of the American Educational Research Association and a professor emeritus at Arizona State University. “Their training isn’t in learning, cognitive or educational psychology.”

Although the era of anthropometric measurements like “Strength of Pull and Squeeze” looks silly now, Binet’s decision to include content-based questions on intelligence tests presented its own problems. Rather than aiming directly at abstract reasoning skills or the functioning of the brain, knowledge questions tie tests to a very specific context of time, place, culture, and schooling that makes it difficult to compare across time or even across populations within one society. A five-year-old French child at the turn of the century probably knew her way around a butter churn, whereas American children today may be more familiar with microwaves.

Dinosaur Bones

Charles Spearman was a British psychologist and, like Cattell, a student of Wilhelm Wundt. He was obsessed with the idea of the fixed, unitary, and hereditary nature of intelligence, writing, “All branches of intellectual activity have in common one fundamental function (or group of functions).” He named this fundamental function the “g” factor, for general intelligence.

It blew my mind to discover that the first point of origin of intelligence testing could be found in eighteenth-century astronomy, one of the earliest moments in experimental science. It blew Harvard scholar Stephen Jay Gould’s mind even more when he realized that something called factor analysis, the key mathematical tool he used every day as an evolutionary biologist, connected to . . . well, I’ll let him tell you: “There can be only a few such moments—the eurekas, the scales dropping from the eyes—in a scholar’s life. My precious abstraction, the technique powering my own research at the time, had not been developed to analyze fossils, or to pursue the idealized pleasure of mathematics. Spearman had invented factor analysis to push a certain interpretation of mental tests—one that has plagued our century with its biodeterminist implications.” Gould, an eminent paleontologist and one of the twentieth century’s most beloved popular science writers, published an award-winning book about his eureka moment, The Mismeasure of Man, in 1981. It’s a blistering, page-turning critique of psychometrics and how racist ideology perverted science.

In an attempt to secure the scientific bona fides of their efforts, psychometricians felt pressure to prove that their tests were measuring some real thing—the g factor, for general intelligence. Throughout history the creators and champions of intelligence testing have insisted on what Gould called the “reification” of intelligence—arguing that it is not some artifact of the tests (like your time in a mile race) but instead that it is fixed, that it is unitary, and even that it is hereditary (as if the time on a particular mile you ran at age eight determined your lifetime athletic potential, health, and life expectancy and indicated your children’s as well). The obsession with the inherited nature of intelligence, of course, is what defines so many prominent psychometricians—Galton, Spearman, Cattell, Lewis Terman, Cyril Burt, Arthur Jensen—as eugenicists and racists.

To Gould, as to many latter-day observers, the ideology is abhorrent, and the tools used to establish it are faulty. “The reification of IQ as a biological entity has depended on the conviction that Spearman’s g measures a single, scalable, fundamental ‘thing’ residing in the human brain,” Gould wrote. Factor analysis, Gould argues, is the tool not only used but invented to make the case for the thing-ness of intelligence. And the scientists who use it this way are misapplying it, running roughshod over the data.

But what is factor analysis?

Factor analysis is a statistical technique to simplify the correlations among separate measurements by drawing lines of best fit and then plotting them as vectors across various axes.

Let’s translate with an example from Gould’s field, paleontology: You have a bunch of stegosaurus skeletons. You measure eleven separate bones belonging to each skeleton. You can see that when the ribs get wider, the vertebrae get thicker—in other words, each separate measurement is correlated. By doing some calculations, you can simplify these correlations. The size of various bones in the skeleton correlates with the likely age of each stegosaurus at its death. You have analyzed your measurements and found that one underlying factor explains them all: age.
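
Here is what that calculation might look like in code, a minimal sketch in Python with invented numbers. I use the simplest possible stand-in for factor analysis, an eigendecomposition of the correlation matrix, to ask how much of the pattern a single factor can explain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 50 stegosaurus skeletons, 11 bone measurements each,
# all driven mostly by one latent variable (age at death) plus noise.
n_skeletons, n_bones = 50, 11
age = rng.uniform(1.0, 30.0, size=n_skeletons)           # the hidden factor
loadings = rng.uniform(0.5, 1.5, size=n_bones)           # each bone grows with age at its own rate
noise = rng.normal(0.0, 1.0, size=(n_skeletons, n_bones))
bones = np.outer(age, loadings) + noise                  # the measurements we actually "dig up"

# Factor analysis, in its barest form: take the correlation matrix of the
# measurements and ask how much of it one underlying factor accounts for.
corr = np.corrcoef(bones, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)                   # sorted smallest to largest
share = eigenvalues[-1] / eigenvalues.sum()

print(f"share of the correlations explained by the first factor: {share:.0%}")
# With data built this way, one factor explains most of the pattern,
# which is what tempts you to call that factor a real "thing" like age.
```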

You can’t observe the growth and development of the stegosaurus directly because all you have is fossils. There may be other contributing factors to the size of a particular creature, such as gender, nutrition, and genetic variation. It’s possible that the factor analysis is incorrect, and what you’re dealing with is in fact a bunch of different subspecies, one of which is sixteen inches tall at maturity whereas another one is fifteen feet.

But it’s much more likely that this factor is meaningful, that there is such a “thing” as age. First of all, growth and development are observable in living organisms. Second, and more important, the correlations Gould found when doing exactly this type of calculation are very high—on the order of 0.91 to 1.0 between the width of a rib bone, say, and the corresponding length of a spine.

In his best-known paper, published in 1904 and titled “General Intelligence: Objectively Defined and Measured,” Spearman took a set of tests given to a small group of students and applied factor analysis to argue that a single factor, “g,” explained the correlation between a single student’s performance in, say, Greek and English grammar. However, correlations between scores for the same person on different kinds of intelligence tests can be pretty low. Scores on different mental tests of the type given in Jastrow’s laboratory to World’s Fair attendees or by Cattell to college freshmen, for example, weren’t meaningfully correlated at all either with each other or with other types of performance such as that in school. Numerous critics have accused Spearman of cherry-picking his data to get the results he wanted.

If you can’t argue that intelligence is unitary, it becomes harder to say that it is fixed. If it’s not fixed, it’s harder to say that it is entirely genetic. At most, you can speak of a genetic predisposition to learn, made up of several different cognitive qualities and personality factors, like working memory and openness to experience.

Gould and many others have argued that the early psychometricians’ faith in the g factor was motivated less by objective knowledge of intelligence and more by ideology about the nature of inequality. In the next chapter we’ll get into the role that tests are playing in furthering inequality in America today.

Terman and the Termites

It would not be long before intelligence tests saw their first large-scale use by the US government. In 1917, soon after America’s entry into World War I, Robert M. Yerkes, the president of the American Psychological Association, proposed that the field of psychology contribute to the war effort by examining military recruits; the process got underway in August of that year.

Lewis Terman, the chair of psychology at Stanford University, was at the time busy adapting the Binet intelligence test, renamed the Stanford-Binet. Terman was yet another eugenicist, an advocate of forced sterilization for the “feebleminded,” who argued that the low Binet scores of “negroes,” “Spanish-Indians,” and Mexicans were racial characteristics. He created the intelligence quotient by dividing a subject’s mental age by her actual age and multiplying by 100. An eight-year-old with a mental age of twelve has an IQ of 150—a genius.
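
The quotient itself is just one line of arithmetic. A minimal sketch, with a function name of my own invention:

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Terman's ratio IQ: mental age divided by actual age, times 100."""
    return mental_age / chronological_age * 100

# The example from the text: an eight-year-old with a mental age of twelve
print(ratio_iq(mental_age=12, chronological_age=8))  # 150.0
```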

“There is nothing about an individual as important as his IQ, except possibly his morals,” Terman stated in 1922. He had a special interest in gifted children and conducted a longitudinal study of fifteen hundred children whom his test labeled “geniuses.” The “Termites,” as they were known, went on to mostly lead lives of accomplishment, although William Shockley and Luis Alvarez, winners respectively of the 1956 and 1968 Nobel Prizes in Physics, both missed the cutoff on the test. Terman was also an early proponent of the idea that a properly scientific and calibrated test could be smarter than a teacher, providing a “more reliable and more enlightening estimate of the child’s intelligence than most teachers can offer after a year of daily contact in the schoolroom.”

Terman and Yerkes’s team threw together a test for the Army’s recruits in just two weeks. There were two versions, known as “Army Alpha,” for those who could speak and read English, and “Army Beta,” for those who could not (this second test was full of mazes and hidden-picture puzzles). In 1917 and 1918 an astonishing 1.7 million men were examined and graded from A to E. Those scoring below C were deemed unfit for the officer corps and, below D, unfit for service.

The “Intelligence” rubric of the Alpha and Beta tests included a wide range of practical qualities, as follows: “Accuracy, ease in learning; ability to grasp quickly the point of view of commanding officer, to issue clear and intelligent orders, to estimate a new situation, and to arrive at a sensible decision in a crisis.” The test also purported to cover leadership and character with paper and pencil questions such as opposites, mazes, and “France is in (circle one): Europe Asia Africa Australia.” This last item was an example of a recent invention: the multiple-choice question.

The multiple-choice question was an important technique for simplifying and mass-producing tests. Frederick Kelly completed his doctoral thesis in 1914 at Kansas State Teacher’s College. He recognized that different teachers tend to give different judgments of the same student work, and he saw this variation as a big problem in education. He proposed eliminating it through the use of standard tests with predetermined answers. His Kansas Silent Reading Test was a timed reading test that could be given to groups of students all at the same time, without requiring them to write a single sentence, and graded as easily as scanning one’s eyes down a page.

As digital humanities scholar Cathy Davidson writes in her book Now You See It, “To make the tests both objective as measures and efficient administratively, Kelly insisted that questions had to be devised that admitted no ambiguity whatsoever. There had to be wholly right or wholly wrong answers, with no variable interpretations. The format will be familiar to any reader. . . . Here are the roots of today’s standards-based education reform, solidly preparing youth for the machine age.”

Throughout the twentieth century, innovations in efficiency made intelligence tests ever more beloved by bureaucracy. Of course, the most convenient, efficient, and cheap test design is rarely the best possible option for capturing the complex process of learning—which was not the business of psychometricians.

Birds Building Nests

According to Stephen Murdoch’s history of intelligence testing, I.Q.: A Smart History of a Failed Idea, the Army didn’t think much of Army Alpha. Fewer than eight thousand recruits—half of 1 percent—were actually rejected based on the tests, and officers often found that low-scoring men turned out to be fine soldiers. The test administration was also racist, with blacks often being given the Beta version of the test for illiterates regardless of whether they actually could read and write. But Yerkes, Terman, and others nonetheless were able to parlay their war experience into a huge opportunity to advertise the idea of mass testing to schools.

Army Alpha was the first intelligence test designed to be administered to groups of people rather than one on one, as at the Anthropometric Laboratories in London and Chicago. Arthur S. Otis, who worked with Terman, became another father of the standardized mass test. His PhD dissertation proposed procedures that made it easier to score large groups of multiple-choice tests at the same time. This involved using tracing paper, the precursor of the Scantron sheet.

Otis’s Group Intelligence Scale, modeled after Army Alpha, became the first commercially available standardized mental test, published by World Book, the encyclopedia makers, in 1918. Terman and Yerkes’s National Intelligence Test was first used around 1920 to test school children and sold four hundred thousand copies within a year of publication. In 1923, while working at World Book, Otis published the Stanford Achievement Test (SAT), positioned as an objective way to test what children had learned in school.

The Otis tests were designed for students who could read and write and had at least three years of schooling. The test items included pairs of opposites, simple pencil exercises like “drawing a tail on the kitty,” and “commonsense” multiple-choice questions such as the following:

Why do birds build nests? Correct answer: to make a place to lay their eggs; wrong answers: because they like to work, or to keep other birds away.

If you hurt someone without meaning to, what should you do? Correct answer: beg his pardon. Wrong answers: say you didn’t; run away.

Why is it a good thing to brush our teeth? Correct answer: to keep our teeth clean and white. Wrong answers: so we can have a toothbrush; because toothpaste has a pleasant taste.

When you read items like these, it becomes clear that the quality early-twentieth-century tests labeled “intelligence” really had a lot to do with social conformity or, as Juan Huarte de San Juan put it, docility. At a minimum, doing well on a test like this assumes that a child wishes to do well in order to please authority figures. Getting the right answer requires knowing and obeying the rules of one’s society. Originality, critical thinking (sometimes birds do build nests to keep other birds away), or even honesty (“run away”) is likely to count against you.

Raising the Stakes

Right away standardized intelligence tests became high stakes for individuals. Binet’s, Otis’s, and Terman’s tests were meant to determine how educational resources were used—who gets placed into a slow class and who into a gifted class, who becomes an engineer and who a car mechanic, who goes to college and who gets invited to leave school, just as Army Alpha and Beta were pressed into service as a way of weeding out the mentally inferior from commanding positions. Meanwhile, the SATs, first introduced in 1926, and the American College Testing program (ACT), introduced in 1959, gradually assumed more and more importance in college admissions. But the use of test results to reward and punish schools and teachers, rather than just students, didn’t really take off until the end of the twentieth century.

Most deplorably, the Stanford-Binet test played a role in forced sterilization. According to Stephen Murdoch’s review of these cases, at least sixty thousand Americans—and possibly more—were sterilized for eugenics purposes in the twentieth century. The 1927 Supreme Court case Buck v. Bell, which legalized eugenics-inspired sterilization, hinged on the IQ results of one Carrie Buck. Buck was seventeen when she turned up pregnant, allegedly after her foster parents’ nephew raped her, and the foster parents sought to institutionalize her. They handed her over to Dr. Albert Priddy, the superintendent at Virginia’s State Colony for Epileptics and Feeble-Minded, who had a longtime interest in asserting his right to sterilize people he saw as unfit to reproduce. He examined Buck and found her to have a mental age of nine, and her infant’s intelligence was denigrated as well. “Three generations of imbeciles are enough,” Supreme Court Justice Oliver Wendell Holmes wrote in the famous decision approving her surgery. But Buck’s school records indicated that she had actually been a fine student, and her daughter went on to make the honor roll.

Nonsense!

In the twenties Otis and Terman’s tests began to be marketed nationwide. By the mid-twenties there were over seventy-five different such tests on the market, administered to about 4 million school children annually.

The market for their work had grown because between 1880 and 1920 the United States experienced massive waves of immigration. Basic intelligence tests were among the first tools used to restrict immigration into the United States. A doctor looked over each new arrival at Ellis Island, screening for infectious diseases like tuberculosis or trachoma. An “X” was chalked on the coat of any arrival suspected of mental defect or insanity—these arrivals were examined in a separate room, perhaps by being asked to solve a puzzle or describe the action in a photograph. If they failed, they could be put in quarantine or turned back.

Due to immigration, urbanization, compulsory attendance laws, and the anti-child labor movement, overall school enrollment grew rapidly, from 12.7 million in 1890 to 19.7 million in 1915. The swelling and increasingly diverse tide entering classrooms raised concern over how best to maintain educational quality and allocate taxpayer resources.

More tests seemed like the answer. But not everyone agreed that they were a good idea.

An article titled “Credulity,” published in the Ohio Educational Journal in January 1922, satirized the broad claims already being made for these tests: “I cavorted about like a glad dog and shouted ‘Eureka! Eureka!!’ For I just knew that no door could withstand the magic of a standardized test. Never! With it, I felt I could jimmy open the Pyramids, and I told people so.”

In a series of withering essays published that year in the New Republic, Walter Lippmann, one of America’s most accomplished social critics, inveighed against the new phenomenon of mass intelligence testing and its noxious bedfellow, eugenics. Lippmann took on the claim made by white supremacist Lothrop Stoddard, author of The Rising Tide of Color, that Army Alpha intelligence testing revealed the average mental age of American adults was just fourteen, a remark that made headlines coast to coast. “Nonsense,” writes Lippmann. “Mr. Stoddard’s remark is precisely as silly as if he had written that the average mile was three quarters of a mile long.” (Stoddard was probably referring to the Stanford-Binet, an IQ scale that, remember, topped out at age sixteen.)

“There are . . . two uncertain elements” in intelligence testing, Lippmann also wrote. “The first is whether the tests really test intelligence.” This is systematic error of the kind Hugh Burkhardt talks about (see Chapter 1). Lippmann pointed out that no one really knows what intelligence is. “You know in a general way that intelligence is the capacity to deal successfully with the problems that confront human beings, but if you try to say what those problems are, or what you mean by ‘dealing’ with them or by ‘success,’ you will soon lose yourself in a fog of controversy.”

The second problem, Lippmann writes, is one of sample size. The scores for the tests are based on small, homogeneous samples, and then the results are overgeneralized to larger, more diverse populations. “Something to keep in mind is that all the talk about ‘a mental age of fourteen’ goes back to the performance of eighty-two California school children in 1913–14,” Lippmann wrote, referring to Terman’s research. “Their success and failures on the days they happened to be tested have become embalmed and consecrated as the measure of human intelligence.”

Finally, he wrote, numbers give off that deceptive aura of accuracy that leads to test results being trusted more than other forms of human judgment. “Because the results are expressed in numbers,” he said, “it is easy to make the mistake of thinking that the intelligence test is a measure like a foot rule or a pair of scales. It is, of course, a quite different sort of measure.”

In fact, as Lippmann pointed out, it’s the creators of the tests who have the power to define success and failure. By choosing 75 percent as the “normal” point, Binet decided to label at least one-fourth of all children backward. If he had wanted to, he could have adjusted the rules so that 5 percent of children would be considered slow, or 50 percent would be.

In 1926 the Twenty-Sixth Yearbook of the National Society for the Study of Education presented official educator objections to testing: “This Committee condemns emphatically the evaluation of the product of educational effort solely by means of subject-matter types of examinations now prevalent in state and local school systems. We have reference specifically to the rigid control over the school curriculum exercised by those administrative examinations, which over-emphasize the memory of facts and principles and tend to neglect the more dynamic outcomes of instruction.” By 1930 psychometrics had become sufficiently established that one observer decried an “orgy of testing” in the public schools. Critics of standardized testing are repeating themselves to this day.

Hope for the Hopeless

Even as mental tests shifted from the anthropometrics of marking down sensory impressions toward more and more questions focusing on learned content, psychometrics stuck to the premise that the tests were measuring some fixed quantity in the human mind.

In his handbook Otis assured test givers: “The degree of brightness of an individual is expected to remain approximately constant. Were this not so, prognostication would be impossible. Intelligence testing would be of little value.” Henry Goddard, who helped popularize the Binet scale in the United States by testing those of below-normal intelligence, put it more bluntly. “No amount of education or good environment can change a feebleminded individual into a normal one, any more than it can change a red-haired stock into a black-haired stock.” The genetic metaphor, of course, was not accidental.

At the end of the 1930s, however, the science of psychometrics took a brief detour into proving the opposite, demonstrating a more hopeful, fluid, dynamic conception of human intelligence and potential. It happened by accident and was forgotten almost as quickly.

In the midst of the Great Depression two baby girls, who had been dismissed as “hopeless” cases, unadoptable because of their severe mental disability, were transferred out of an overcrowded orphanage to a ward for mentally disabled adult women. They were aged just thirteen and sixteen months, but their IQs had been estimated at 46 and 44, in the “moderately to severely retarded” range. Harold Skeels and Harold Dye, two young psychologists charged with documenting and monitoring their cases, observed, “The youngsters were pitiful little creatures. They were tearful, had runny noses, and coarse, stringy, and colorless hair; they were emaciated, undersized, and lacked muscle tone or responsiveness. Sad and inactive, the two spent their days rocking and whining.”

But a strange thing happened at their new institutional home. The babies became the pets of the women residents and attendants, who bought them toys and books. They were constantly held, talked to, and lavished with affection. Because of their unusual living situation, their intelligence tests were repeated, which wasn’t normally done in the orphanage. After just six months their IQ scores had improved to 77 and 87, and a few months after that their scores had climbed into the mid-90s, near average levels.

Skeels and Dye convinced the state authorities to repeat the accidental experiment. This time, thirteen children between the ages of one and two whose intelligence scores were too low for them to be considered suitable for adoption were placed in the care of “foster mothers,” women at the adult institution aged eighteen to fifty whose mental ages were rated between five and twelve years. The foster mothers were instructed in proper infant care, and the babies also participated in an organized preschool program. A control group remained at the orphanage, only two of whom initially tested as well below average. According to an article discussing the case, the toddlers at the adult women’s home had toys bought for them by the attendants and clothes made for them by the residents. Their “mothers” cheerfully competed over which ones could be made to walk and talk first.

The children remained on the ward for a mean of nineteen months. All but two of the thirteen gained more than 15 IQ points during that time. Once they tested at average intelligence they were moved to regular foster homes. A year after the experiment ended, of the thirteen original children, none was still classified as “feebleminded.” At the first follow-up two and a half years later, in 1943, the mean IQ of the experimental group was exactly average, 101.4. Meanwhile the control group left at the orphanage had shown “marked deterioration” and now had an average IQ of 66.1, down from 86 at the beginning of the study.

When Skeels and Dye published their findings from this sad yet hopeful study of the power of love and human potential, they were met with general apathy and derision. Psychometric experts such as Terman asserted that IQ was fixed even in young children and that tests were reliable. If the test results varied over time, this was proof not of some miraculous transformation in the subjects but of slipshod science. The study was dubbed the “wandering IQ” study and placed on a shelf.

The Census of Abilities

Throughout the history of psychometrics a debate has carried on between those who celebrate the diversity of human intelligences, including how they grow and change over time, and the prevailing interest in pinning intelligence down as fixed, inherited, and unitary. Some early intelligence testers were interested in what would later be called “multiple intelligence” theory. Take Louis Leon Thurstone, creator of the American Council on Education’s psychological examination for high school graduates and college freshmen, introduced in 1924. Thurstone claimed in a 1938 paper that his experiments actually indicated the existence of seven distinct factors of intelligence: numerical, reasoning, spatial, perceptual, memory, verbal fluency, and verbal comprehension.

When resources were unrestricted, the resulting test designs could be remarkable for the range of qualities tested. In 1941, President Roosevelt issued an order for the creation of the Office of Coordinator of Information, later the Office of Strategic Services. Although the United States had not yet officially entered World War II, the OSS was charged with espionage, intelligence gathering, propaganda, subversion, aiding resistance movements, and even “psychological warfare” in service of the Allied cause. But undercover agents, men and women with diverse backgrounds and skills, started to suffer nervous breakdowns under the constant threat of being found out. In order to identify the best candidates for this demanding job, the OSS created a major psychological assessment program, screening over five thousand people in a year and a half.

The OSS tests went far beyond Army Alpha. They sought to test general intelligence, especially practical smarts under stress, alongside qualities like motivation, energy, initiative, social skills, and emotional stability as well as propaganda skills (i.e., persuasiveness in writing and speaking), observational ability, and the ability to keep a secret—in short, “the whole man,” a concept borrowed from the contemporary theory of Gestalt psychology. Furthermore, their results were correlated with later performance on the job, at least for a certain percentage of recruits. Doing this kind of follow-up is an important way of establishing a test’s predictive validity, which illuminates whether it’s worth giving at all—just as Wissler’s follow-up work discredited Cattell’s assessments of Columbia freshmen.

The OSS test included paper and pencil questions. But there was also a full life story interview plus performance and situational exercises in teams of four. For a full three and a half days the recruits were asked to prepare a cover story and keep it up at all times, even under different styles of interrogation.

One unique trial was called the Construction Test. In it a candidate was given ten minutes to construct a five-foot square box with an assortment of materials. He was assigned two “helpers.” In fact, both were stooges. One acted incompetent and the other behaved like a jerk. This task was never completed within the time limit and was a tremendous test of not just leadership and social skills but also emotional stability. Several of the candidates actually physically attacked the helpers during the test.

When money and time are no object, as was the case in these examples, it’s possible to get very creative in testing and arrive at very different factors of intelligence beyond IQ or g. But when money and time are objects, as they usually are, the default is multiple-choice bubble tests.

As Nicholas Lemann describes in his wonderful history of the SAT, The Big Test, Henry Chauncey, founder of the Educational Testing Service, originally dreamed of conducting a grandiose-sounding Census of Abilities that would “categorize, sort, and route the entire population.” In the first decades of developing the SATs he experimented with psychological measures, such as the Myers-Briggs personality scale and the Thematic Apperception Test, to incorporate a fuller picture of human capability. But these posed logistical, cost, and sometimes ethical problems.

Despite Thurstone’s interest in multiple intelligences, the ACE test, which he produced with his wife, Thelma, ended up with just two scores: an L (linguistic) and a Q (quantitative) score. By the same token, by the end of the 1950s the College Board had definitively settled on the IQ-style items that still dominate the SAT today, with, again, a system of two scores: math and reading.

And that’s where we have left it today, with most high-stakes tests still testing just these two areas.

#2 Pencils

Mass testing grew throughout the twentieth century alongside other forms of industrial mass production, with American innovations spreading throughout the world. In 1937 IBM filed patents for the 805 test-scoring machine, which read the electrically conductive marks of a graphite pencil with a pair of wire brushes scanning the page. The story is that the inventor, Reynold B. Johnson, then a high school science teacher in Michigan, remembered pranking his older sister’s dates by drawing pencil marks on the outside of a car engine: the current leaked along the conductive graphite instead of firing the spark plugs, and the engine refused to start. Johnson came on board at IBM and went on to develop the first commercial computer disk drive and contribute to the videocassette tape.

Everett Franklin Lindquist, a member of the College of Education at the University of Iowa, then started the Iowa Academic Meet for high school students, known as the Brain Derby. This led him to develop the Iowa Tests of Basic Skills (ITBS), debuting in 1935 for grades six through eight, followed in 1940 by the Iowa Tests of Educational Development for high school students.

The Iowa Tests covered reading comprehension, spelling, and mathematics. By 1940 school districts throughout the nation were purchasing and administering the Iowa Tests along with others like the National Assessment of Educational Progress (NAEP, or the Nation’s Report Card), the California Achievement Tests, the Comprehensive Tests of Basic Skills, the Metropolitan Achievement Tests, and the Stanford Achievement Tests (SATs). Schools, districts, and states adopted these tests voluntarily to help them classify students and allocate resources as well as a means of quality control. Lindquist would go on to cofound the American College Testing program, or ACT, in 1959 and also had a key role in the development of the General Educational Development (GED) and National Merit Scholarship tests. All of these, plus the Advanced Placement (AP) exams and the SATs, constituted the primary menu of standardized tests right through the 1980s.

Then, in 1962, Lindquist received a patent for an optical scanner that could grade up to one hundred thousand answer sheets per hour: the cotton gin of standardized testing. Because of this, Lindquist has been called “arguably the person most responsible for fostering the development and use of standardized tests in the United States.”

Testing Learning vs. Testing Smarts

The distinction between intelligence testing (how smart are you?) and achievement testing (what do you know?) is often murky to the general public and even to test makers themselves. Throughout its history, for example, the SAT’s “A” has stood for “aptitude,” “assessment,” and “achievement,” and today SAT, like KFC, stands only for itself.

Still, the achievement tests that began to be introduced midcentury and are emphasized today differ in their design from the intelligence tests of the 1910s and 1920s. The scores on the SATs, ACTs, and Iowa Tests are meant to correlate with grades in school. Accordingly, these tests contain more questions covering specific areas of knowledge.

Another important distinction between intelligence and achievement tests is that intelligence tests are designed to be “norm-referenced.” That is, they are meant to compare individuals to the norms found across a population. That means, for one thing, that the results tend to conform to a bell curve. A predetermined percentage of students will score at the bottom or top end of the scale. This is what it means to “grade on a curve.” Norm-referenced tests are normed on a particular sample population at a particular place and time—Lippmann’s “eighty-two California school children in 1913–14.” If the population taking the test differs in important ways from the sample population, the results may be extremely misleading.

Achievement tests, however, are supposed to be “criterion referenced.” They are not graded on a curve. They’re meant to measure the acquisition of knowledge according to fixed criteria. Think of a driving test. If absolutely everyone passes the driving test on the first go, it’s probably too easy. But 87 percent of American adults are licensed to drive, some of whom took the test multiple times, and few people have a problem with that.
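
The difference is easy to see in code. Here is a minimal sketch, with an invented norming sample and an invented cut score: a norm-referenced score reports where you stand relative to the people the test was normed on, while a criterion-referenced score reports only whether you cleared a fixed bar.

```python
import numpy as np

# Invented norming sample (think of Lippmann's "eighty-two California school children")
norming_sample = np.array([12, 15, 18, 20, 21, 22, 24, 25, 27, 30])

def norm_referenced(raw_score: float) -> float:
    """Percentile rank: the share of the norming sample this score beats."""
    return float(np.mean(norming_sample < raw_score)) * 100

def criterion_referenced(raw_score: float, cut_score: float = 20) -> bool:
    """Pass or fail against a fixed criterion, like a driving test."""
    return raw_score >= cut_score

print(norm_referenced(23))       # 60.0: beats 6 of the 10 people in the sample
print(criterion_referenced(23))  # True: clears the (invented) bar of 20
```

Swap in a different norming sample and the first number changes even though the test taker hasn’t; the second number changes only if the criterion does.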

Statewide accountability tests should, in theory, be criterion referenced. The goal is for everyone to meet a certain baseline of knowledge. But because achievement tests evolved from intelligence tests and many of the same psychometric procedures are used to develop and test the tests, in practice the distinction between the two is often muddied.

In reality there is no objective definition of what it means to be a proficient third grader; the standard can be set only by referencing previous third graders’ performance. The size and makeup of the sample population matters a lot. Then again, if you create a test and project in advance, based on field tests, as psychometricians often do, that 80 percent of students will get one block of questions right and 60 percent will get another block right, are you unfairly predetermining the outcomes?

When you claim to test achievement and test children at the end of school rather than on entry, you are in part evaluating the work done by the schools themselves, with consequences that would reach their fullest expression nearly sixty years later.

Skeels and Dye Revisited

In the 1960s a University of Chicago researcher dug up Skeels and Dye’s paper and was excited enough to find the researchers and persuade them to follow up with the original experimental groups. The results were published in 1966. Of the thirteen girls who had been adopted, first informally by developmentally disabled women in the institution and then by families in the outside world, all of them were self-supporting. Eleven of them were married. They had a mean of 11.68 years of education. They earned an average wage of $4,224, which was in the range of average annual earnings for men in Iowa, their home state—not bad for a group of women from an institutional background in the 1960s.

Of the twelve girls in the control group, only four of them had jobs, all of them working in the institutions where they lived. Only three had been married. On average they had less than four years of schooling. The cost savings to the state for rescuing the girls who went on to live healthy, productive lives was approximately $200 million in today’s dollars.

Skeels’s lifetime of research testing the effects of nurturing on children’s intellectual development influenced Head Start, the federal preschool program, as well as the field of special education. But to this day his name and his discoveries about the malleability of IQ are far less well known than those of Alfred Binet, Francis Galton, or even Lewis Terman.

Maybe that’s because his research pointed to something that the creators and promoters of tests didn’t want to acknowledge: at best, intelligence tests can only be an indicator of someone’s situation at a given point in time and can serve only as a benchmark for the real work of educators and caregivers in cultivating a child.

Skeels concluded his 1966 follow-up study with these urgent words: “It seems obvious that under present-day conditions there are still countless infants with sound biological constitutions and potentialities for development well within the normal range who will become retarded and noncontributing members of society unless appropriate intervention occurs . . . sufficient knowledge is available to design programs of intervention to counteract the devastating effects of poverty, sociocultural, and maternal deprivation.”

The same problem is just as apparent today, and the need for comprehensive “programs of intervention,” if anything, even more so.

Racism, Difference, and Tests

In the Jim Crow era a rogue variant of high-stakes standardized test emerged: the so-called literacy test, employed to systematically restrict the voting rights of African Americans. Oral, written, or multiple choice, these were all buried in bureaucratic verbiage familiar to any subject of Terman or Otis.

Louisiana, my home state, crafted a special breed of pencil-and-paper test featuring nonsense worthy of an evil Lewis Carroll.

Sample questions on the 1964 Louisiana Voter Literacy Test:

“1. Draw a line around the number or letter of this sentence.”

“20. Spell backwards, forwards.”

The Voting Rights Act of 1965 specifically prohibited tests like these, and any jurisdiction that had employed them in the past became subject to federal oversight.

As I went through the lively history of standardized testing I had to face again and again the awkward question: Why so many racists in psychometrics? My intention is not to throw a rhetorical Molotov cocktail at today’s state-mandated testing programs. I’m not saying anyone involved in testing today is, de facto, racist. But it’s hard to ignore the shadow of history. The majority of the foundational intellectual figures in this relatively small scientific field were directly associated with eugenics, discrimination, or both. Why? And what, if anything, does it mean for today’s test-obsessed school system?

What is the commonality between Galton’s anthropometric laboratory, the Army Alpha and Beta tests of World War I, the intelligence tests given at Ellis Island, and the logic-defying Jim Crow literacy test? How do these all relate to the achievement gaps of today?

One common theme is that the people in power create the tests, and disempowered people have to pass them. Another is that tests represent the urge to sort and stratify, to catalog, correct for, normalize, and, ultimately, erase difference.

(A) Race, (B) Class, (C) Tests

Are standardized tests biased against minorities, or do they reveal an unpleasant ground truth about groups’ fixed abilities, as would be argued by generations of “scientific” racists, from Galton to Charles Murray, coauthor of the 1994 book The Bell Curve? The achievement gap, or the tendency of poor people and minority groups to score lower on standardized tests, is really at the heart of the testing controversy. Researchers have looked at the problem from all angles.

There are cultural arguments, such as that African Americans develop an “oppositional” attitude toward academic achievement because of racism. Research has shown that tests, written largely by privileged, educated white people and normed on similar populations, are likely to assume prior knowledge common to their own group, which depresses the scores of the nondominant groups. My mother-in-law, a therapist and university instructor, told me about a possibly apocryphal example of class-based assumptions embedded in an intelligence test for young children in the 1970s. One question has three pictures: a man in overalls, a man in a tuxedo, and a man in a pinstriped suit. The question was, “Which one is a picture of Daddy going to work?” Whether or not this item actually appeared on tests, it’s a great illustration of what bias looks like on a test.

Here’s a more anodyne example from my own childhood. In third grade I was enrolled in public school in Cambridge, Massachusetts, while my parents were on sabbatical from Louisiana State University. While there I remember vividly taking some kind of multiple-choice test that included the question: Which of the following fruits do we eat fresh, and which are dried? One of the options was “fig.” Back in subtropical Baton Rouge, Louisiana, we had a fig tree growing in our backyard. So I got that one wrong.

The cultural bias arguments are provocative but gild the lily a bit. The “achievement gap” is a tautology masquerading as a problem: all it really means is that students with disadvantages, on average, are at a disadvantage.

Of the 50 million students in American public schools, 22 million receive free or reduced-price lunches. Their families earn less than 185 percent of the federal poverty level—less than $50,000 for a family of four. That means that nearly half of public school students struggle economically to some extent. Economic struggle in turn affects academic achievement in multiple ways. Poor kids may not get the same quality sleep because they share a bed or sleep on a couch. They may come to school without breakfast. Their vision goes uncorrected. They are likely to have less educated parents who own fewer books and talk to them less from the time they are infants—a gap that’s been estimated at 30 million words by the time they start kindergarten. They are more likely to suffer from “toxic stress”—a parent in jail, abuse, trauma, or risk of homelessness—that interferes with their ability to concentrate day to day and can distort their brain development over time.

In 2013 Dr. Michael Freemark, a professor of pediatrics at Duke University, examined state test results for students in Raleigh-Durham and Chapel Hill, North Carolina, and found that “85 percent of variability in school performance is explained by the economic well-being of a child’s family, as measured by eligibility for subsidized lunches.” No other single factor comes close in its correlation with test scores.

The great American project is how to have a broadly successful, competitive, vital, dynamic society with the full participation of people of varying backgrounds and abilities. The provisional answer we’ve come up with is meritocracy. This means providing opportunities to all to participate and then rewarding talent combined with effort.

But meritocracy relies on accurate sorting so that the cream rises to the top. Intelligence and aptitude testing offer a fantasy of objectively identifying fixed quantities of merit and awarding opportunities based on those quantities: a spot in a gifted kindergarten, a spot in Harvard’s freshman class.

But what if the tests are not objective? And what if merit is not fixed? That possibility, which Skeels and Dye’s “wandering IQ” study puts in front of us, is a huge moral challenge. Their study, laughed out of the hall, suggests that disparities in human flourishing are partly circumstantial and can be mitigated with a mixture of hard effort and love. A diversity of achievement and talents would naturally persist even if our society did its utmost to promote the advancement of each person. Nonetheless, the idea that tests represent only a snapshot, a moment in time, puts a huge responsibility in the hands of everyone tasked with bringing up children.

Tests and Civil Rights

In the 1960s psychometric tests pivoted from being a frank instrument of racism to being marketed as a tool used to diagnose racist wrongs. Richard Halverson at Wisconsin-Madison, quoted in Chapter 1, traces the latest wave of obsession with high-stakes testing to school desegregation policies, beginning with Brown v. Board of Education in 1954. “That’s the historical root of the interest in assessment,” he said. “With the civil rights movement we made huge public investments in our schools and said, schools, remediate the results of Jim Crow and civil rights abuses. By the ’70s and ’80s it was pretty clear those effects had not been remediated by our schools. So the question became, what are we putting all this money into schools for?” Some conservatives used that question as a bludgeon to try to reduce educational spending. Some progressives asked it in order to spur more spending and intervention. In either case, there was an increasing interest in methods of evaluating program outcomes, and this often meant tests.

At roughly the same time, in the 1970s, the quest for educational access as a civil right produced a wave of antitesting activism, media coverage, and federal court cases. A central figure in this movement was none other than Ralph Nader. In 1980 he published Allan Nairn’s magnum opus, a five-hundred-page investigative report on the practices of the Educational Testing Service (ETS), creator of the SATs and GREs. The report, titled “The Reign of ETS: The Corporation that Makes Up Minds,” questioned the validity of the tests as well as the powerful position of ETS, a private nonprofit that takes in millions in revenue while deciding who will pass through the gates of America’s colleges and on to economic opportunity. The accompanying campaign led to the passage of “truth in testing” laws in New York and California, which required companies to publish tests after they were administered, along with the answers and any accompanying background reports, so that individual items could be publicly examined for racial and class bias.

In January 1980 Nader appeared on the Tonight Show. Illustrating how little the terms of debate have changed, according to a contemporary account by Walt Haney, “After condemning the ‘reign’ of ETS, Nader gave an impassioned plea for wider consideration of traits like perseverance, wisdom, idealism and creativity. . . . The Tonight Show audience broke into spontaneous applause.” Once again, as with Walter Lippmann in the 1920s, a major cultural critic had mounted a popular critique of intelligence testing on intellectual and moral grounds. But the reign of standardized tests was not to be ended so quickly.

The Rainbow Connection

The racists at the center of this history focused on consistent scoring differences between racial and ethnic groups. The current obsession with narrowing the “achievement gap” is based on those same stubborn trends. The implication is that poor students and minorities must constantly labor to catch up on systematic and persistent weaknesses. Small wonder that, as research shows, merely being reminded of one’s race can be enough to depress math scores, a phenomenon known as “stereotype threat.”

Poverty and racism are very real and distinct disadvantages. But certainly it would be more useful if tests could somehow filter out the background noise that arises from the accident of birth and instead highlight the individual strengths that can help determine success in life as well as people’s capacity to grow and develop these strengths. That would especially improve the quality of gatekeeper tests such as the SATs.

Dr. Robert Sternberg knows from smart. He’s a psychologist and psychometrician with degrees from Stanford and Yale. He knows from successful. He’s been the president of both the American Psychological Association and the University of Wyoming. And he knows from complex. At the age of sixty-four he has two grown children plus toddler triplets with his current wife, a former student. And he resigned the University of Wyoming presidency after four months, having lost the confidence of the board of trustees.

For the past thirty-odd years Dr. Sternberg has been developing and perfecting a new, more complex theory of intelligence, called successful intelligence, along with tests designed to measure it in high-stakes contexts. “There’s more to people than SAT and ACT scores,” he said. “You need to be creative, because the world changes so fast that if you can’t adapt to novelty and respond to new social and emotional challenges, you just fall behind.” Sternberg’s characterization of intelligence, which he emphasizes is just a working definition, focuses on the broadest possible range of qualities that allow people to achieve what they set out to in life—hence, successful intelligence. In his scheme traditional academic abilities, including math and language, make up just one of three legs of a stool; he calls this leg “analytic intelligence.” The other two legs are practical and creative intelligence.

The idea that intelligence includes emotional, creative, and practical ability is common sense across cultures and is enjoying a bit of a renaissance these days. But without reliable, easy-to-use tests to measure the full range of human strengths, it’s proving quite difficult to incorporate these into our reckoning of what makes people successful, much less the systems by which we judge the success and failure of students and schools. As documented throughout this book, our schools then end up concentrating on analytic intelligence, especially memorized information, the easiest kind to test. It’s like the old vaudeville routine in which a comedian comes out onstage hunting for his spectacles.

“Did you lose them out here?” asks the emcee.

“No, I lost them backstage.”

“So why are you looking out here?”

“Because the light is better out here!”

Our comedian is committing a systematic error.

Being the rare learning psychologist who is also a psychometrician, Sternberg has developed a solution to this conundrum. He has designed a series of tests to shine a light on his broader definition of intelligence, with the evocative names of “Rainbow,” “Kaleidoscope,” and “Panorama.”

The College Board funded the development of the Rainbow test. Sternberg tested almost seven hundred undergraduates from diverse backgrounds. He used the SATs to measure analytic intelligence and devised a multiple-choice test to get at practical and creative intelligence. The Rainbow test also included free-response items reminiscent of personality testing or the old OSS tests in the 1940s. The creative portion, among other tasks, asked students to come up with captions for New Yorker cartoons and write short stories after being given only a title. The practical portion asked them to recommend a solution to common interpersonal dilemmas people face in the workplace or at school.

The results, published in 2006, were impressive. Testing all three kinds of intelligence predicted the GPAs of college freshmen far better than the SAT alone. And even more interesting, testing for all three kinds of intelligence greatly reduced the gaps among racial and ethnic groups. Groups who scored lower on analytic intelligence had correspondingly higher scores on the creative and practical tests.

The results of Sternberg’s Rainbow study were published as a lead article in the journal Intelligence and featured in the media. But despite its success, the College Board stopped funding the project. “There are differing interpretations of why,” Sternberg said. “I can only say what they said, which is that they had reservations about whether this could be scaled up.” In particular, the multiple-choice items Sternberg had designed ended up correlating only with analytic skills, regardless of what they were intended to measure. Only the open-ended tasks, which had to be graded by humans, produced separate scores for creative and practical intelligence. Sternberg said he learned that “if you want to test something other than analytic intelligence, don’t go with multiple choice.”

However, Sternberg disagreed that his tests couldn’t scale. He was impatient to put his ideas into practice, so he took a position as a dean at Tufts University, outside Boston. This time he’d be working with the admissions office. “I spent thirty years as a professor at Yale and eventually concluded that as a professor I couldn’t change squat. So I said, the hell with it, at least as an administrator I can do something.”

Sternberg’s next phase, dubbed the Kaleidoscope Project, gave Tufts applicants the option of writing an essay or submitting a project that demonstrated successful intelligence. To the creative, analytical, and practical legs of his model he added a fourth component: wisdom, which he defines as the ability to apply knowledge and skills toward a common good by balancing one’s own interests with those of others and with the greater good, over both the long and the short term. The entire Tufts application was scored using rubrics, with admissions officials trained to evaluate students’ successful intelligence against explicit criteria.

Over the course of five years and thousands of applicants, systematically measuring all of these factors increased the university’s ability to predict incoming students’ academic success, leadership, and extracurricular involvement. Tufts did so while nearly eliminating ethnic-group differences in admissions consideration—a significant achievement considering the Supreme Court cases that have greeted attempts to do the same at universities across the country. Tufts applicants reacted positively to the idea that the university was emphasizing a full range of strengths. As a result, both the diversity of applicants and their SAT scores went up. At Sternberg’s next stop, Oklahoma State University, he achieved similar results with a very different student body. “The purpose of college education is to produce active citizens, ethical leaders, who will make the world a better place,” Sternberg said. “You’re not going to do that only by looking at their SAT scores. If you believe in that, your admissions process should reflect what you are trying to optimize for.”

Like Skeels and Dye’s “wandering IQ” study, Sternberg, whose research continues at Cornell University, asks us to consider a dynamic definition of intelligence, one that incorporates students’ relationships with others and, above all, their ability to learn. “Learning from experience is the most important skill,” he said. “We should be looking at kids’ growth potential.”

In the next chapter we’ll see the consequences when, in the twenty-first century, the federal government decrees that every student in the nation must achieve a single standard on a test.