7

THE TRAP OF ROUTINE ASSESSMENT

Routines for Computers, Unstructured Problem Solving, and Complex Communication for People

AUTOGRADERS EXCEL AT assessing routine tasks. These are exactly the kinds of tasks that we no longer need humans to do.1

As personal computers were widely adopted in the workplace throughout the 1980s and 1990s, executives, policymakers, and researchers wondered what influence these new machines would have on human labor. An early hypothesis was that computers would complement high-wage workers and replace low-wage workers—computers would replace cashiers but not professionals like doctors. Computerization has certainly replaced human work in the labor market, but the story that has unfolded over the last forty years is much more complex than this simple prediction. Economists interested in education have investigated computerization not just to understand how labor and skill demands would change, but to better understand how these changes could and should impact education. Richard Murnane and Frank Levy have explored these issues for many years, and one story they use to help people understand automation is about airline check-in counters.2

Some of us remember checking in for an airplane flight with an actual human being. The conversations with these people were extremely structured. If you fly frequently, you can probably remember exactly how these conversations are supposed to go:

  • Do you have a ticket?
  • What is your final destination?
  • Do you have identification?
  • Do you have bags to check?
  • Have your bags been in your possession since you packed them?
  • Did anyone else help pack your bags?
  • Here are your tickets and luggage receipts. Have a nice flight.

The people behind the counters were solidly middle-class workers. They joined unions, wore uniforms, and had decent salaries and benefits. And many of these jobs are gone. Today, few people check in first with a human being to get on an airplane.

The information needed to process these conversations was highly structured—a ticket is a number, a passenger name, a seat ID, and an origin and a destination; the baggage questions are yes / no; identification cards are designed to be scanned and matched to people. As a result, these conversations could be encoded into computers built into kiosks, and these kiosks could inexpensively and ceaselessly carry on these conversations with passengers day and night. As mobile technologies and networks spread, these kiosks are now complemented by smartphone apps that carry on these same conversations with you from the palm of your hand.

Still, if you visit a bank of airline check-in kiosks, you will see uniformed airline employees behind the counters and amid the kiosks. As an educator, I am extremely interested in these people—who they are, what they do, and what they are good at. In a sea of automation, what are the tasks being done by the last few people whose jobs could not be automated?

Airline counter staff tackle two general kinds of problems that kiosks and mobile apps don’t handle well. First, there are problems that come up periodically that were not anticipated by the designers of the kiosk systems: flights get diverted, payments get mixed up, and other out-of-the-ordinary events occur. Murnane and Levy call these “unstructured problems,” problems whose outcomes and solution paths are not immediately clear. Humans have what economists call “a comparative advantage” over computers in these kinds of challenges, meaning that as tasks get more complex and ill structured, the cost of developing and maintaining software to provide a service becomes higher than the cost of hiring people.3

Second, people have all kinds of difficulty communicating with the kiosks. They cannot find the right kiosk, or they do not speak a language programmed into the kiosk, or they forgot their glasses, or they are so hopping mad at the airline that they just bang on the kiosk. Human beings are better than computers at “complex communication,” problems whose solutions require understanding a task through social interaction, or when the task itself involves educating, persuading, or engaging people in other complex ways.4

Complex communication and unstructured problem solving are domains where, for the foreseeable future, human beings will outperform computers. David Autor, along with Levy and Murnane, took the list of hundreds of job codes maintained by the United States Department of Labor and labeled each job as routine manual work, routine cognitive work, work requiring complex communication, or work requiring ill-structured problem solving. They found that routine work was disappearing as a proportion of the labor market, and jobs requiring complex communication and expert thinking were expanding. Subsequent studies found that certain kinds of routine jobs were persisting in the service sector, but these were typically jobs that could be performed by any interchangeable person without any particular skills, and they typically paid the state- or federally mandated minimum wage.5

This labor market research by Levy and Murnane is the original source for nearly every list of twenty-first-century skills promoted by education reformers over the last two decades. Perhaps the most popular formulation has been the “four Cs” of creativity, critical thinking, communication, and collaboration. The first two of the four Cs (creativity and critical thinking) are derivatives of unstructured problem solving, and the latter two (communication and collaboration) are derivatives of complex communication.6

Computers Changing Work Up and Down the Labor Force

These kinds of shifts can also be seen within job categories, and the effects are similar across very different kinds of jobs and wage levels. One of my favorite illustrations of this phenomenon comes from a cabin that I own in rural Vermont. My mother bought the cabin thirty-five years ago, and she had a hot-water heater installed by the local plumber. Twenty-five years later, my mother passed, my brother and I inherited the house, and the hot-water heater needed replacing.

The same plumber was still working after all those years, and he installed a beautiful new system. After he finished, he brought me into the utility closet to explain it. The heater had an LED panel that under normal operations displays the word “GOOD.” The plumber said that if that panel ever said something other than GOOD, I should hit the reset button and wait a while. If that didn’t work, I should toggle the power system on and off and reboot my hot-water heater. I asked him what to do if that didn’t work, and he explained that only one plumber in the Upper River Valley of Vermont had flown to Scandinavia to get trained on how to reprogram this kind of heater, and I would have to call him.

Over the years, I have thought about that tradesman in the 1960s and 70s completing his plumbing apprenticeship, learning how to cut and solder pipes, install new appliances, and get frozen houses in rural Vermont up and running again. What would it be like to explain to the young men in that apprenticeship program that fifty years later, plumbers would be computer programmers? Not only that, but the computers inside hot-water heaters can solve the easy problems themselves through rebooting and power cycling. Plumbers with computer programming skills are needed only to solve the uncommon, non-trivial problems that the computer cannot resolve on its own.

There are similar stories on the other end of the economic spectrum. In legal work, a common task is discovery, the request or retrieval of documents from a deposition that may contain evidence of malfeasance. In days gone by, firms might have hired a small army of lawyers to read every document looking for evidence. New services automate part of the discovery process by having computers scan and examine documents to look for keywords or other incriminating patterns. A much smaller subset of documents can then be turned over to a much smaller subset of lawyers for examination.7

In every profession in every sector of our economy, enterprising computer programmers are identifying tasks that can be turned into routines and then writing software, developing apps, creating robots, and building kiosks that can replace those elements of human work. Self-driving cars represent one much-sought-after milestone in the development of automation technologies. In The New Division of Labor, published in 2004, Levy and Murnane described a left-hand turn into oncoming traffic as the kind of decision that was so complex, involving so many variables (cars, conditions, pedestrians, animals, and so forth), that computers could never learn to reliably and safely decide when to turn left. And of course, roboticists, artificial intelligence specialists, machine vision experts, and others are working furiously to program self-driving cars that can turn left against traffic, among many other complex decisions.8

As an educator, one of my foremost concerns is whether our educational systems can help all students develop the kinds of skills that computers and robots cannot replicate. Our efforts as educators should have a special focus on domains where humans have a comparative advantage. If computers are changing the demands of the labor market and civic sphere and requiring that students develop proficiency with complex skills, what role can large-scale learning technologies play in addressing those challenges?

As we observed in the previous chapters, the value proposition offered by instructor-guided and algorithm-guided learning at scale is that learners can engage in educational experiences facilitated by computers, and they can advance through those learning materials on the basis of automated assessments. Most MOOCs and adaptive tutoring systems include problems, quiz questions, or other activities that purport to assess learner competencies. On the basis of those assessments, they offer feedback or credentials or, in the case of personalized tutors, additional problems and learning resources.

So what kinds of domains of human performance can we evaluate with computational systems?

(Mis)Understanding Testing: The Reification Fallacy

Before addressing this question, it’s worth taking a moment to consider the “reification fallacy,” when we uncritically believe that something’s name accurately represents what that thing actually is. Psychometricians are statisticians who study testing and all the ways tests and testing data are understood and misunderstood. One of the most common fallacies in evaluating testing systems is believing that the name of a test accurately defines what it tests. In common parlance, we might call something an “algebra test,” but just calling it an algebra test doesn’t necessarily mean that it’s an accurate measure of a person’s ability to do algebra. An immigrant student just starting to learn English might understand algebra very well but fail an algebra test that depends on English language fluency. The thing called an “algebra test” isn’t just an algebra test, but also an English (or maybe a “mathematical English” or “academic English”) test and an evaluation of the knowledge not just of subject matter content but of test-taking strategies.9

Also, an algebra test won’t evaluate every possible dimension of algebra. All tests are an effort to sample a learner’s knowledge within a domain. An algebra test might include many items on rational expressions and very few on reasoning with inequalities, so the test might do a good job of evaluating learners in one part of algebra and a lousy job evaluating in another. The reification fallacy reminds us that something called an “algebra test” is never a universal test of algebra. A well-designed assessment might sample widely from representative domains of algebra while providing enough supports and scaffolds so that the assessment is not also testing English language fluency, or test-taking skills, or other domains. But all assessments are imperfectly designed.10

An implication of the reification fallacy is that all tests are better at measuring some parts of what they claim to be testing and worse at others, and tests inevitably also evaluate domains that are not named in the description of the test. When educators and instructional designers use computers to automatically grade assessments, the strengths and limitations of computational tools dramatically shape what parts of a given domain we examine with a test.

Computers, Testing, and Routines

Much as computers out in the world are good at highly routine tasks, computers used in educational systems are good at evaluating human performance when that performance can be defined in highly structured ways or turned into routines with well-defined correct and incorrect answers.

In Chapter 2, I referred to one of the first computer programming languages that was developed exclusively for educational purposes: the TUTOR language, written in the 1960s for the PLATO computer systems. Over time, PLATO supported a remarkably wide variety of learning experiences, games, and social interactions, but some of its earliest functions were creating computer-assisted instructional lessons in which screens of content delivery alternated with assessments. These assessments used a pattern-matching system. Instructional designers could enter certain words, numbers, or features into an answer bank, PLATO would pose a question to students, students would type in an answer (which appeared immediately on the screen in front of them—a major advance at the time!), and the PLATO systems would evaluate whether there was a match between the learner’s answer and the acceptable answers in the bank.11

This kind of pattern matching is still the primary way that computers evaluate answers. There is no automated evaluation system that employs reasoning like human reasoning, that evaluates meaning, or that makes any kind of subjective assessment. Rather, computers compare the responses that students submit to “answer banks” that encode the properties of correct and incorrect answers. Computers determine whether new answers are syntactically or structurally similar to the established bank of answers. Computers evaluate answers based on syntax and structure because computers understand nothing of substance and meaning.

Over time, we have developed increasingly complex tools for pattern matching. In the early version of the TUTOR programming language, if an instructional designer wanted to accept “5,” “five,” “fiv,” and “3 + 2” as answers to a problem, those alternatives (or rules defining the complete set of alternatives) would all need to be manually programmed into the answer bank. We have increasingly sophisticated answer “parsers” that can evaluate increasingly complex inputs, but at a fundamental level, computational assessment tools match what students submit with what assessment designers have defined as right.12

Most automated grading systems can only evaluate structured input. Grading systems can automatically evaluate multiple-choice-type items (if the correct answer is programmed into the system). They can evaluate quantitative questions for which there is a single right answer (“5”) or a set of right answers (“5” and “-5”). In chemistry, they can check for balancing of chemical equations, where the inputs aren’t strictly numerical but can be converted into numerical systems. When a system can be evaluated through computational logic, like the circuit and electronics systems featured in MIT’s first MOOC, 6.002x, instructors can define success criteria even for somewhat open-ended challenges (“using these simulated parts, build a complete circuit that turns this simulated light on”), and computers can identify which student-built systems meet the success criteria.
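
To make the pattern-matching idea concrete, here is a minimal sketch in Python of an answer-bank grader of the kind described above. The function names, the answer banks, and the numeric tolerance are illustrative assumptions, not the code of TUTOR or any modern platform.

```python
# A minimal sketch of answer-bank pattern matching, in the spirit of the
# TUTOR-style graders described above. The functions and answer banks are
# illustrative, not any real platform's implementation.

def normalize(response):
    """Lowercase and strip whitespace so trivial formatting differences don't matter."""
    return response.strip().lower()

def grade_text_answer(response, answer_bank):
    """Mark correct if the normalized response matches any entry in the bank."""
    return normalize(response) in answer_bank

def grade_numeric_answer(response, correct_values, tol=1e-6):
    """Mark correct if the response parses to a number close to any accepted value."""
    try:
        value = float(response)
    except ValueError:
        return False
    return any(abs(value - target) <= tol for target in correct_values)

# "5", "five", and "3 + 2" must all be enumerated (or generated by rules) in advance.
ARITHMETIC_BANK = {"5", "five", "3 + 2", "2 + 3"}

print(grade_text_answer("Five", ARITHMETIC_BANK))   # True
print(grade_text_answer("3+2", ARITHMETIC_BANK))    # False: no rule here strips spaces
print(grade_numeric_answer("-5", {5.0, -5.0}))      # True: both roots accepted
```

Note how brittle the text bank is: a student who types “3+2” without spaces is marked wrong unless designers add that variant or a rule that strips spaces, which is exactly the gap that more sophisticated parsers try to close.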

When considering these capabilities, it is important to remember that even in the most quantitative of subjects and disciplines, not every human performance that we value can be reduced to highly structured input. Consider mathematics. The Common Core State Standards, a set of math and literacy standards widely adopted in the United States, defines the process of mathematical modeling as including five steps: (1) finding a problem in a set of features; (2) arranging that problem into an appropriate model—which could be an equation, table, schematic, graph, or some other representation; (3) resolving computational problems within the model; (4) putting numerical answers in their original context; and (5) using language to explain the reasoning underlying the models and computations.13 As a field, our autograding tools are good at evaluating the computational component, and of the five parts of mathematical modeling, that component is the one thing we no longer need human beings to do.

When I write an academic paper with statistical reasoning, I typically will use a calculator or computer to make every calculation in that paper. The value that I bring as a human into my academic collaboration with computers is that I, not the computers, know what the interesting problems are. My value comes in asking interesting questions, identifying interesting problems, framing those problems for both humans and computers to understand, and then presenting structured equations and inputs that allow computers to compute solutions. Once the computer has (usually instantly) computed a solution, then I take the helm again and explain how the computer’s computational answer fits into the context that I initially described, and I craft a written argument to explain the reasoning behind my solution process and what consequences the solution has for my academic field or for human society.

In my collaboration with statistical computing software, my added value comes from doing all of the things that software cannot do—analyzing unstructured data or interesting problems, framing those problems in (subjectively judged) useful ways, and using natural language to explain my process. Computers cannot do those things, and therefore they generally cannot be programmed to evaluate human performance on those kinds of tasks (though I’ll come to some advances below). And since we cannot develop autograders for these things, we tend not to evaluate them at scale in math class. Rather, what we evaluate at scale in math class are the computational things that computers are already good at and that we don’t need humans to do any more.14

Here again, the idea of the reification fallacy is useful. All throughout their careers, students take things called “math tests,” but if these math tests are computer graded (or if they are limited to problem types that could be graded by computers), then we know for certain that these tests are only evaluating a portion of the full domain known as “math.” A student who scores well on a computer-graded “math test” may be good at computation, but the test has not evaluated his or her ability to find interesting problems, frame those problems, explain his or her reasoning, or any of the other things that we pay professional mathematicians to do.

I do not mean to imply that we should not teach computation in schools. Reasoning about mathematics and writing about mathematics require an understanding of computation, and young people should learn all kinds of computation. Students should still memorize the kinds of mathematical facts—multiplication tables, addends that combine to ten—that are incredibly useful in all kinds of situations where it is advantageous to have faster-than-Google access to those facts. But those computational facts should be building blocks for students’ learning how to go beyond quantitative computation into reasoning mathematically.

The trap of routine assessment is that computers can only assess what computers themselves can do, so that’s what we teach students. But in our economies and labor markets, we increasingly do not need people to do what computers are already good at. We need students to develop complex communication skills and take on unstructured problems—problem finding and framing rather than problem computing in mathematics—and explain their reasoning. But school systems cannot cheaply test these important domains of mathematics, so school systems do not assess these dimensions at scale, and so teachers, publishers, and others diminish the importance of these dimensions of mathematics in curriculum and teaching. To be sure, there are some fabulous math teachers who do teach a more complete mathematics, but when they do so, they are working against the grain.

Machine Learning and Assessment

If we see improvements in assessment technologies in the decades ahead, where computers develop new capacities to evaluate human reasoning, it will most likely be connected to advances in “machine learning.” Machine learning is a field combining algorithms and statistics in which computers are programmed to make determinations without following specific rule sets but by making inferences based on patterns. For assessment, the most relevant branch of machine learning is supervised machine learning, which involves training computer programs to label data or make decisions such that their results are similar to how proficient human beings might label data or make decisions. For instance, native speakers listening to language learners trying to pronounce a series of words would label the words as correctly or incorrectly pronounced. Computer programmers then take these “training” data and try to use them to teach computers how to recognize correctly or incorrectly pronounced words.15

The machines doing this learning will not “listen” to the words in any meaningful sense, or at least not in the same ways that humans do. Rather, computer programmers instruct the machines to take the audio files of sounds, break them down into very small audio wave segments, and compute certain features of these microsounds, such as pitch and volume. The machine-learning algorithm then calculates which of these quantitative sound properties are more strongly correlated with correct pronunciations and which are more strongly correlated with incorrect pronunciations. When the algorithm receives a new sound that hasn’t been labeled as correct or incorrect, it compares the quantitative properties of the new sound file with those of the existing sound files and produces a probability that the new file is more like a correctly pronounced sound file or more like an incorrectly pronounced one. Humans can then review these computer-generated assessments and reevaluate them (“No, computer, this sound that you labeled as ‘0’ for incorrect pronunciation should actually be labeled a ‘1’ for correct pronunciation.”). Scoring algorithms can be updated based on these tuning improvements.16
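
As a rough illustration of this pipeline, and emphatically not the implementation of any real language-learning app, the sketch below assumes each recording has already been reduced to a few acoustic summary numbers, trains a simple classifier on human labels, and applies a designer-chosen confidence threshold. All feature values, labels, and the threshold are invented.

```python
# A toy sketch of supervised learning for pronunciation scoring.
# Assumes each recording has already been reduced to a small feature vector
# (e.g., mean pitch, pitch variance, mean energy, duration). The numbers and
# the threshold below are illustrative, not from any real system.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: rows are recordings, columns are acoustic summary features.
X_train = np.array([
    [110.0, 12.0, 0.61, 0.42],
    [118.0, 15.0, 0.58, 0.45],
    [180.0, 40.0, 0.20, 0.90],
    [175.0, 38.0, 0.25, 0.85],
])
# Human labels: 1 = judged correctly pronounced, 0 = judged incorrect.
y_train = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X_train, y_train)

# A new, unlabeled recording, reduced to the same four features.
new_clip = np.array([[120.0, 14.0, 0.55, 0.47]])
p_correct = model.predict_proba(new_clip)[0, 1]

# The designer, not the model, chooses how confident the system must be
# before telling the learner "correct."
THRESHOLD = 0.8
print("correct" if p_correct >= THRESHOLD else "try again", round(p_correct, 2))
```

The sketch also previews a point that matters later: the final “correct” or “try again” judgment depends on a probability threshold chosen by a human designer, not by the model.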

Machine-learning-based approaches to improving autograders are most useful when two conditions are true: when a set of rules cannot describe all possible right answers, and when human evaluators can distinguish between right and wrong (or better and worse) answers. If a set of rules can be programmed to describe all possible right answers (as with a circuit simulator or an arithmetic problem), then machine learning is unnecessary; a set of programmed evaluation rules will do. If humans cannot reliably distinguish between correct and incorrect, then humans cannot generate a set of training data that can be used to program computers. Machine-learning approaches to computer-assisted pronunciation training are promising because both of these conditions are true. It is impossible to develop a strict set of rules for correct pronunciation in the same way that computer programmers could develop a strict set of pattern-matching rules for defining the correct answer to an arithmetic problem. At the same time, most native speakers can trivially recognize the difference between a correctly and an incorrectly pronounced word. Now, there is “fuzziness” in these human assessments—will a native English speaker from Minnesota recognize my Boston argot of “cah” as an acceptable pronunciation for automobile?—but training data do not need to be labeled with perfect agreement among human beings in order for machine-learning algorithms to develop predictions that work with acceptable levels of reliability.

Through these machine-learning-based pronunciation technologies, you can speak “por favor” into your favorite language-learning app, and the app can make an automated assessment as to whether you are pronouncing the phrase more or less correctly. It is not the case that your phone has “listened” to you saying “por favor” in the same way that your Spanish teacher in middle school did. Instead, your language-learning app took the sound file for your “por favor,” broke it down into many tiny sound segments, assessed certain quantitative features of those segments, used those assessments to create a quantitative evaluation of your sound file, and then matched those quantitative assessments against a library of quantitative models of “por,” “favor,” and “por favor” that had been labeled by humans as correctly or incorrectly pronounced. From that comparison, the pronunciation autograder then made an estimation as to whether the quantitative model of your sound file was more similar to the correctly pronounced sound files or the incorrectly pronounced sound files. This assessment is probabilistic, so a programmer determined the tolerance for false positives and false negatives and decided on a threshold of probabilistic confidence that the language-learning app had to reach in order to assign your sound file a rating of “correct.”

All of this is essentially a magnificent kluge of a system to do the same kind of pattern matching that programmers trained the TUTOR language to do with the PLATO system fifty years ago. It takes something as idiosyncratic as the sound of a word and breaks that sound down into a series of quantitative features that can be compared to an “answer bank” of similar quantitative sound models. We have only the tiniest understanding of all the marvelous complexity of what happens inside the sea of neurons and chemicals in the human brain that lets a child instantly recognize a mispronounced word, and our computers take this nuanced, fuzzy assessment that humans can make and transform it into a series of routine computational tasks that can result in a probabilistic assessment.

Training these systems requires vast stores of data. When Google trains its image search engines to classify new images, it can use as training data the enormous corpus of existing images on the internet that have been captioned by human beings. There is no naturally occurring source of data in which humans label recorded speech as correctly or incorrectly pronounced, so companies developing language-learning apps must create these data. As you might imagine, it’s relatively inexpensive to have expert humans rate pronunciations for the most common thousand words in the most common pronunciations of a single language, but it becomes much more expensive to develop training sets that include more words and phrases, more variation in acceptable dialect and accent, and more languages. If a pronunciation detector is trained using data from native Spanish language speakers from Spain learning English, the classification algorithms used by the detector will work with decreasing fidelity with Spanish language learners from Latin America, native Portuguese speakers, other Romance language speakers, and native Mandarin speakers. The kinds of errors that native Mandarin speakers make learning to pronounce English are sufficiently different from the kinds of errors that Spanish speakers make that new data are needed to train classifiers that can effectively evaluate the quality of pronunciation from those different speakers. Vast stores of human-labeled data are required for robust pronunciation assessment, and the heterogeneity of the labeled data inputs—such as having novice learners from many different native language backgrounds and experts from many different contemporary dialects of the target language—plays a major role in how effectively the assessments can correctly autograde learners from different backgrounds. The costs of this data collection, labeling, and assessment system training are substantial. Language-learning systems will make steady progress in the decades ahead, but the challenges of recognizing pronunciation demonstrate how far away we are from an adaptive tutor that can listen to natural speech and provide feedback like a native speaker.17

Machine Learning and Automated Essay Scoring

Perhaps the most prominent place where machine learning has been applied to advancing autograding is in automated essay scoring. If automated essay scoring worked reliably, then throughout the humanities, social studies, sciences, and professions, we could ask students to demonstrate their understanding of complex topics through writing. Ideally, as essay questions were added to various high-stakes exams and other facets of our testing infrastructure, teachers would respond by assigning more high-quality writing to help students do well on these gate-keeping experiences. Better tests could allow for better instruction. As with most education technologies, the state of the art does not come close to these majestic hopes. Automated essay scoring provides limited, marginal benefits to assessment systems, benefits that are neither completely trivial nor truly revolutionary. It is possible that in the years ahead, these systems will continue to improve, but gains will probably be incremental.18

The mechanics of automated essay scoring strike most educators as weird. In evaluating essays, human scorers examine both syntax—the arrangement of words and punctuation—and semantics, the meaning that emerges from syntax. Just as computers do not understand the sound of words, computers do not understand the meaning of sentences, so computers only parse syntax. An automated essay-scoring tool starts with training data—a large corpus of essays that have been scored by human graders according to a rubric. Then, the computer takes each essay and performs a variety of routines to quantify various characteristics of the text. One such technique is called—and this is a technical term—the bag of words, where the software removes all punctuation, spacing, and word order; removes all stop words like a, an, the, and and; stems all words so that jumping, jumped, and jump are all the same word; and then produces a list of all remaining words in the document with their frequency counts. Autograders can perform a variety of other similar calculations, such as total word count or n-grams, the frequency with which word pairs or trios occur together. The autograder then develops a model of correlations between human scores and the quantified syntactic features of the essay.

With enough training data on a specific essay topic written by a specific target audience, usually hundreds of essays from a particular standardized test essay question, each newly submitted essay can be run through the same syntactical algorithms (tossed into a bag of words, counted, weighed, and so on). The autograder can then predict, based on feature similarity, how a human rater would grade the new essay. Through this process, scores generated by autograders can achieve a level of reliability similar to that of human graders. Here, reliability refers to the fact that if hundreds of new essays were given to two humans and a computer to be graded, the computer’s grade would disagree with the score from any given human rater about as often as any given human disagrees with another human.
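
A toy version of this training-and-prediction loop might look like the following; the essays, scores, and model choice are invented for illustration rather than drawn from any vendor’s corpus, and a production system would use hundreds of human-scored essays per prompt and many more syntactic features.

```python
# A simplified sketch of bag-of-words essay scoring. The essays and scores
# are illustrative; production systems use far more training data and
# additional syntactic features (length, n-grams, and so on).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

train_essays = [
    "The evidence clearly supports the author's central claim about trade.",
    "Trade is good because trade is good and people like trade.",
    "The author argues persuasively, citing historical evidence and counterexamples.",
    "I think the essay was fine and I agree with it mostly.",
]
human_scores = [5.0, 2.0, 6.0, 3.0]  # hypothetical rubric scores from human raters

# Bag of words: drop common stop words, ignore order, count what remains.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_essays)

model = Ridge().fit(X_train, human_scores)

# Score a new essay by running it through the identical feature pipeline.
new_essay = ["The author supports the claim with historical evidence."]
predicted_score = model.predict(vectorizer.transform(new_essay))[0]
print(round(predicted_score, 1))
```

Nothing in this pipeline reads for meaning; the model simply learns which word counts co-occur with high human scores for this one prompt.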

The case in favor of this approach to grading is that it allows for less expensive grading of natural language writing at scale. The assessment conducted by human graders in these large-scale writing assessments is not particularly good; raters are typically given only a couple of minutes per essay to provide a holistic assessment with no qualitative feedback for students, and computers can achieve something close to this level of assessment quality and reliability. By using these technologies to increase the number of essays that are included in standardized tests, the hope is that educational systems will be more likely to teach writing in their curricula, and even if the assessment is imperfect, it is better than standardized tests without writing.19

The case against automated essay grading is that it ignores the essential role of the audience in writing, that it replicates grading that is of low quality to begin with, and that it is difficult to scale. People don’t write to have computers dump our craft into a bag of words; we write to reach other people or ourselves. Writing to satisfy the syntactic criteria of a software program drains the meaning out of the activity of writing. The semantic meaning of the grade itself is also somewhat different. A human grade signifies that a person has evaluated the semantic meaning of a piece of writing against a set of criteria and made a claim about quality. A computer grade is a prediction about how the syntactic qualities of a document relate to the syntactic qualities of other documents. Advocates argue that these different means bring about the same grading ends, but critics of autograding argue that the process of these educational activities matters.

Finally, the whole system of standardized test writing is not a crown jewel of our global educational system. The prompts tend to be banal, the time constraints of writing unrealistic, and the quality of human assessment rushed and poor. Developing essay banks and training datasets large enough to reliably autograde new essays is expensive and time consuming, requiring a big investment from students, human raters, assessment designers, and so forth. Importantly, training data from one essay prompt do not allow an autograder to reliably evaluate responses to a different prompt, so a new model must be trained for each particular prompt. Incrementally making this system better by creating ever-so-slightly-less-lousy autograders may not be a productive path to improving teaching and learning.20

One clever way to critique essay autograders is to program essay autowriters. Given that computers evaluate patterns in language syntax, if humans can decode those patterns, then they can use computational tools to generate new essays that are semantically meaningless but syntactically adhere to the patterns favored by grading algorithms. Les Perelman, the emeritus director of MIT’s Writing across the Curriculum program (now called Writing, Rhetoric, and Communication) and an inveterate critic of automated essay grading, worked with MIT students to develop the Babel Generator, which can produce semantically meaningless essays that nonetheless score highly on automated essay rubrics. Perelman’s team used the Educational Testing Service’s ScoreItNow! tool to get automated feedback on their autogenerated essays. In an essay about mandatory national school curricula that started with the sentence, “Educatee on an assassination will always be a part of mankind” and concluded, “Therefore, program might engender most of the disenfranchisements,” they scored a six, which ScoreItNow! describes as a “cogent, well-articulated analysis of the issue [that] conveys meaning skillfully.”21

Autowriters probably aren’t a threat to proctored standardized exams—though presumably, a determined student could memorize an autogenerated nonsense essay and submit it. But autowriters could be problematic for the grading of un-proctored essays, and they certainly present a striking kind of critique.22

When I read about autowriters and autograders, I like to imagine students downloading and running computer programs that automatically write essays while instructors use computers to automatically grade those essays. While the computers instantaneously pass these essays and grades back and forth, students and instructors can retire to a grassy quad, sitting together in the warm sun, holding forth on grand ideas and helping each other learn.

As with many ideas in education technology, this dream is actually quite old. When Sidney Pressey presented the first mechanical teaching machines in the 1930s, writers at the student-run Ohio State University Monthly wrote that if someone could build a second machine that automatically depressed the correct keys on Pressey’s mechanical multiple-choice machine, then the future of education would be “perfect in the eyes of the student.”23

Autograders in Computer Programming

Perhaps the most complex human performances that we can automatically grade reasonably well are computer programs. The field of computer programming has evolved in part through the development of tools that give feedback to computer programmers on their software. Many computer programmers write in what are called integrated development environments, or IDEs, that perform certain kinds of automated work for computer programmers. For instance, a computing script might include a number of variables that need to be modified and called on throughout a program, and an IDE might track those variables for programmers and let them select a variable from a list rather than needing to remember the exact right string of characters. Many IDEs have auto-complete functions that let programmers type the first few characters in a string or function and then select the right string or function from a list. Integrated development environments also have features that give programmers feedback on errors that are triggered in running a program; when a program fails to run properly because logic breaks down somewhere, IDEs automatically parse the code to identify possible sources of error. In a sense, every time a computer programmer runs a program, it is a kind of formative assessment task, and good IDEs give programmers feedback about what is working and what is not.

Given that automatically providing feedback on code is central to the development of computer programming as a discipline and profession, it is perhaps no surprise that online education systems have a powerful suite of tools for evaluating computer programs. As assignments, students write computer programs, and then instructors create other computer programs that evaluate the quality of students’ programs along a number of dimensions: Does the submission meet engineering requirements? How quickly does it run? How many or how few lines of code are required? Does the code meet design specifications? Even if the code that students are submitting is relatively complex, automated grading tools can evaluate many important dimensions of the quality of that human performance. Computer programming probably represents the pinnacle of computational assessment of complex human performance.
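
A bare-bones sketch of such an autograder appears below. The submitted function name, the test cases, and the time limit are assumptions made for illustration; real graders also sandbox untrusted code, check style, and evaluate many more dimensions.

```python
# A bare-bones sketch of a unit-test-style code autograder. The submitted
# function, test cases, and time limit are assumptions for illustration.
import time

def grade_submission(student_fn, test_cases, time_limit_s=1.0):
    """Run a student's function against (input, expected_output) pairs."""
    passed = 0
    start = time.perf_counter()
    for args, expected in test_cases:
        try:
            if student_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on a test case counts as a failure
    elapsed = time.perf_counter() - start
    return {
        "passed": passed,
        "total": len(test_cases),
        "within_time_limit": elapsed <= time_limit_s,
    }

# A hypothetical student submission.
def sort_numbers(values):
    return sorted(values)

TESTS = [
    (([3, 1, 2],), [1, 2, 3]),
    (([],), []),
    (([5, 5, 1],), [1, 5, 5]),
]
print(grade_submission(sort_numbers, TESTS))
```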

For all that, however, the reification fallacy looms just as large over computer science as over any other domain. Writing computer programs that pass engineering tests is only a fraction of what good computer programmers do. As Hal Abelson, a longtime computer science professor at MIT and collaborator of Seymour Papert, once argued (with collaborators Gerald and Julie Sussman), “We want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.” Abelson’s point is that it is not enough to write a computer program that passes an engineering test; the code should be written in such a way that another reader can understand how the programmer has gone about solving a problem. This information is conveyed in the order of operations, the naming of variables, how the code is structured, and how comments within the code explain how the program proceeds. In Abelson’s framing, everything that can be computationally evaluated by autograders is the “incidental” part of programming. Autograders that evaluate for “style” can evaluate whether a given code snippet adheres to certain well-established conventions, but only another human programmer can determine if a given code submission is written in such a way as to be parsable to other human beings as a medium for expressing ideas about methods.24

Along the same lines, computer programming is about understanding the needs of human systems, balancing engineering demands with broader societal concerns, collaborating among teams to weigh priorities, and a thousand other concerns that are both common to other engineering disciplines and unique to software engineering. As marvelous as our autograders are for evaluating computer programming, they still can evaluate only a fraction of the knowledge, skills, and dispositions required to be a good software engineer. To say that someone has “aced” a computer-graded exam in a computer programming class doesn’t mean that he or she has all the skills to be a good software engineer, only that he or she has demonstrated proficiency in the skills that we can currently assess using autograders.

Escaping the Trap of Routine Assessment, One Innovation at a Time

The trap of routine assessment has two interlocking components: as automation technologies advance, the labor market and civic sphere will put a premium on the non-routine skills that computers cannot replicate. At the same time, computers mostly assess the routine skills that humans no longer need to perform. As educators, there is probably little that we can do to stem the tide of automation technologies reshaping the workplace, but it may be possible to continue to develop assessment technologies that slowly expand the range of complex human performances that can be automatically assessed.

As one example, MIT math educators have developed a new assessment tool for their calculus MOOCs that evaluates how students draw curves on a graph. In learning calculus, students are often tasked with evaluating functions and then drawing those functions as curves on a Cartesian plane, or drawing their integral or derivative. The goal is not necessarily to perfectly plot these curves but rather to make sure that they cross the x axis and the y axis at roughly the right spot, that they go up or down when they are supposed to go up or down, and that they approach an asymptote at roughly the right point.25

Drawing these curves is essential to learning the basics of calculus, so the MOOC team in the Mathematics Department at MIT developed a system to analyze student submissions and evaluate their quality with enough confidence to assign them a grade. Before this innovation, one of the only options for an automated assessment to test students on their conceptual understanding of derivatives and integrals would have been a multiple-choice item displaying four graphs and asking students to recognize the right one. The new tool allows instructors to evaluate students’ ability to draw, rather than just recognize, curves. It opens up a wider space for math instructors to computationally assess increasingly complex human performance.
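
The sketch below illustrates the general idea, though it is not MIT’s actual grader: sample the student’s drawn curve at a set of x values and check a few qualitative criteria, such as whether it crosses the x axis near the expected root and rises where it should. The tolerances and criteria are illustrative assumptions.

```python
# A rough sketch of the idea behind a curve-sketching grader: check a drawn
# curve for qualitative features rather than exact values. Not MIT's
# implementation; the tolerances and criteria are illustrative assumptions.
import numpy as np

def grade_curve(xs, ys, expected_root, increasing_after, tol=0.3):
    """Grade a sampled student curve (xs, ys) on a few qualitative criteria."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)

    # Criterion 1: the curve crosses the x axis near the expected root.
    sign_changes = np.where(np.diff(np.sign(ys)) != 0)[0]
    crosses_near_root = any(abs(xs[i] - expected_root) <= tol for i in sign_changes)

    # Criterion 2: the curve is increasing where it is supposed to be,
    # tolerating a little hand-drawn wobble.
    after = xs >= increasing_after
    increasing = bool(np.all(np.diff(ys[after]) > -0.05))

    return {"crosses_x_axis_near_root": crosses_near_root,
            "increasing_where_expected": increasing}

# A student's hand-drawn approximation of y = x^2 - 1, sampled by the sketching tool.
xs = np.linspace(-2, 2, 21)
ys = xs**2 - 1 + 0.03 * np.sin(7 * xs)  # imperfect, wobbly freehand curve
print(grade_curve(xs, ys, expected_root=1.0, increasing_after=0.0))
```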

Every assessment involves sampling from a domain. Because no test can cover all the knowledge, skills, and dispositions required for success in a domain, test designers try to choose items that form a representative sample of skills. In this one example, the MIT calculus team expanded the proportion of the full domain of calculus skills and knowledge that could be sampled by test designers relying on autograders. If we go back to our framing of mathematical modeling as consisting of five steps, then this particular advancement stands out because it allows students to demonstrate their proficiency with a dimension of problem representation, not just another form of calculation. It is through these kinds of steady, incremental advances that assessment technologies will allow for a greater range of human performances to be evaluated by machines. And as the field improves its capacity for assessment, these kinds of performances are more likely to appear in curricula, to be taught, and to be learned by more people. The pathway beyond the trap of routine assessment involves developing thousands of other applications like the calculus curve grader, each tackling the evaluation of some new element of complex human performance.

Stealth Assessment

One of the most intriguing lines of research into automated assessment is called “stealth assessment,” or assessing students as they go about performing learning tasks rather than during special activities called “assessment.” Typical assessment exercises can feel disconnected from the act of learning; learning stops on Friday so that students can take a quiz. What if classroom assessment could look more like formative assessment in apprenticeships? Consider, for instance, an apprentice woodworker turning a chair leg on a lathe while a nearby journeyman offers tips and feedback as the apprentice goes about the task. In this context, assessment is naturally part of the process of building a chair in the woodworking shop.

What might such assessment look like in physics or math? Several researchers, notably Valerie Shute at Florida State University, have created online games where gameplay requires developing an understanding of mathematical or scientific phenomena, and as players engage in the game, they create log data that can be analyzed for patterns of play that correlate with other measures of scientific understanding. In the game Newton’s Playground, players engage in tasks that require an understanding of Newtonian motion. In the research study, students do a pre-test about Newtonian motion, play the games, and then do a post-test. These tests effectively serve to “label” the gameplay data, since patterns found in the log data can be correlated with scores on the post-test. The goal is to have the patterns of effective play be identified with sufficient reliability that in the future, it would become unnecessary to give the test. A student who demonstrated sufficient understanding of Newtonian physics through a video game could be evaluated by gameplay rather than by a distinct evaluation event.26
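
The analytic step can be sketched simply, with the caveat that the students, log features, and scores below are invented for illustration: aggregate features from gameplay logs and check how strongly they track the post-test that serves as the label.

```python
# A highly simplified sketch of the analytic step in stealth assessment:
# check whether features extracted from gameplay logs predict post-test scores.
# The students, features, and scores below are invented for illustration.
import numpy as np

# Per-student features aggregated from game log data (hypothetical):
# column 0 = levels solved with an efficient solution, column 1 = avg. seconds per level.
log_features = np.array([
    [9.0,  45.0],
    [3.0, 120.0],
    [7.0,  60.0],
    [2.0, 150.0],
    [8.0,  50.0],
])
post_test_scores = np.array([88.0, 52.0, 75.0, 47.0, 82.0])

# Correlation between each log feature and the post-test "label."
for i, name in enumerate(["efficient_solutions", "seconds_per_level"]):
    r = np.corrcoef(log_features[:, i], post_test_scores)[0, 1]
    print(f"{name}: r = {r:.2f}")
```

If such correlations held up across large and varied samples, the post-test itself could eventually be dropped.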

Though the notion is promising, implementation proves to be quite difficult. Developing playful learning experiences where students demonstrate important mathematical or scientific reasoning skills is quite hard, and substantial investment can be required in developing the game and the associated assessment engine for a handful of content standards. Researchers have explored the promise of using stealth assessment to evaluate competencies that aren’t traditionally assessed by tests, such as creativity, patience, or resilience in solving problems. These kinds of assessments might provide a comparative advantage in the future, but as of yet, these systems remain research pilots rather than widely deployed assessment systems.27

Perhaps the biggest hurdle is that gameplay data in these virtual assessments often don’t reliably correlate with competencies or understanding. When students are tested in a game environment, their behavior is shaped by the novelty of the environment. A student who appears inefficient, hesitant, or confused might understand the content quite well but not the new environment, and their patterns of play might look enough like those of students who don’t understand the content to make the two difficult to distinguish. It may be possible with more research and development to overcome these kinds of hurdles so that rather than stopping learning to make time for assessment, online learning environments can simply track student problem solving as it happens and make inferences about student learning as the learning is happening.

The Future of Assessment in Learning at Scale

Is the trap of routine assessment a set of temporary hurdles that will be overcome by more advanced technologies or a more permanent and structural feature of automated computer assessment? Techno-optimists will point to the extraordinary gains made by artificial intelligence systems that can identify photos online, play chess or Go at superhuman levels, or schedule an appointment with a hair salon over the phone. These are examples of impressive technological innovation, but each of them has features that are not easily replicated in education.28 Automated photo classifiers depend upon training data from the billions of photos that have been put online with a caption written by humans. There simply are no equivalent datasets in education where humans naturalistically engage in a labeling activity that distinguishes effective and ineffective teaching practice. The most advanced chess engines use a kind of reinforcement learning whereby the software plays millions of games of chess against itself, and the highly structured nature of the game—pieces with specific movement rules, an 8 × 8 board, a well-defined winning state—is well suited to automated assessment. Reinforcement learning systems cannot write millions of essays and grade them against other essays, because there is no defined quality state in writing as there is in chess. The advance of computer-voice assistants—such as the kind that can book an appointment at a hair salon over the phone—is impressive, but each individual task that an assistant is trained to do involves extensive data, training, and adjustment. Computer-voice tutors might be developed in limited areas with well-defined right and wrong answers and carefully studied learning progression, but trying to develop such systems for the highly granular goals we have in educational systems—adding 1-digit numbers, adding 2-digit numbers, subtracting 1-digit numbers, and on and on ad infinitum—makes the bespoke nature of those innovations incompatible with systemwide change.29

The history of essay autograders in the first part of the twenty-first century is instructive. As natural language processing software has improved, autograders have been adopted more widely by the Graduate Record Exam (for graduate school admissions), state standardized tests, and some limited classroom applications. When MOOCs were at their peak of public discourse in 2013, assessment designers explored implementing essay autograders in courses, and some limited experiments took place. Despite the technological expertise of developers and strong incentives to expand assessment tools, not much progress was made, and today, few MOOCs use automated essay scoring. This activity coincided roughly with the development of the PARCC (Partnership for Assessment of Readiness for College and Careers) and Smarter Balanced testing consortia, and the Hewlett Foundation funded an automated essay grader competition with eight entrants from research labs and private firms. Since those efforts in the early 2010s, no major changes or advancements in essay-scoring technologies have been implemented in the PARCC or Smarter Balanced assessments. Advocates claiming that massive improvements in automated assessment technologies are just around the corner would need to explain why the last decade has shown such modest progress despite substantial investment by very smart, very devoted teams in academia and in industry.30

The problems of assessment are hard. The examples that are featured in this chapter—assessing computer programs, graphical calculus functions, pronunciation, and standardized essays—all represent areas where assessment designers are pushing the boundaries of what can be assessed in human performance. The modest, incremental progress of these innovations and the limited domains where they appear valuable also show how resistant those boundaries are to substantial advancement.

With known technologies, large-scale learning environments will remain bound by the trap of routine assessment in our lifetime. Much of what we can assess at large scale consists of routine tasks, as opposed to the complex communication and unstructured problem-solving tasks that will define meaningful and valuable human work in the future. Computers can mostly assess what computers are good at doing, and these are things we do not need humans to do in the labor market. Innovations in assessment technology will push gently on these boundaries, and the incremental advances that emerge will make marginal, useful expansions of what can be automatically assessed, but they will not fundamentally reshape automated assessment. There are millions of dollars invested, countless smart and talented people working on this problem, and strong incentives in educational marketplaces for finding new solutions. But progress remains slow, and for the foreseeable future, if we want to assess people on the kinds of performances that are most worthwhile for people to learn, we’ll have to depend heavily on bespoke assessments evaluated individually by teachers and other educators. The human-scale limits on the assessment of complex performance will remain one of the most important strict limits on how widely large-scale learning environments can be adopted throughout educational systems.