2

ALGORITHM-GUIDED LEARNING AT SCALE

Adaptive Tutors and Computer-Assisted Instruction

ON EVERY CONTENT PAGE of an edX MOOC there are two buttons: “Next” and “Previous.” The MOOC’s instructional-design staff organizes course content into a linear sequence of pages, and every learner—novice or expert, confused or confident—who clicks the “Next” button will be moved to the following page in that sequence. It’s one-size-fits-all learning at a massive scale. This chapter examines an alternative to instructor pacing: large-scale learning environments where the sequencing of content is determined by algorithmic assignment rather than by instructor designation. In algorithm-guided learning, the next action in a sequence is determined by a student’s performance on a previous action rather than by a preset pathway defined by instructors. The tools in this genre go by many names; I’ll call them adaptive tutors or computer-assisted instruction (CAI).

Earlier, I introduced Khan Academy, a free online resource with instructional materials in many subjects. Khan Academy is best known for its online explainer videos, but in K–12 schools, the majority of student time using Khan Academy is spent on math practice problems. Teachers assign (or students choose) a domain for study, such as “evaluating expressions with one variable,” and then students are presented with a series of problems. These problems are instantly recognizable to anyone who has ever completed a worksheet. A mathematical expression is in the center of the screen, and below is an answer box for numerical answers (some questions have multiple-choice answers or allow for marking points on a Cartesian plane). There are links to video explainers and hints for each problem, and then there is a “Check Answer” button. When users get an answer right, there are pop-ups with stars, bings, and firework animations. When users get an answer wrong, they can try again, get hints, or move on. When students get problems right, the system assigns harder problems from the same domain, and when students get problems wrong, the system assigns easier problems. When students get enough problems correct, the system encourages students to move on to a new domain. Students who log into a user profile can also be assigned practice problems from older material to promote retention through retrieval practice, and teachers who register their classes can track student performance.
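For readers who want to see the mechanics spelled out, here is a minimal sketch of the kind of practice loop just described, written in Python. The mastery threshold and the one-step difficulty moves are my own illustrative assumptions, not Khan Academy’s actual logic.

```python
# A minimal, hypothetical sketch of the practice loop described above. The
# mastery threshold and one-step difficulty moves are illustrative assumptions,
# not Khan Academy's actual code.

def next_difficulty(history, current, streak_to_advance=5, lo=1, hi=10):
    """Given a list of True/False answers, return the next problem difficulty,
    or None to signal that the student should move on to a new domain."""
    recent = history[-streak_to_advance:]
    if len(recent) == streak_to_advance and all(recent):
        return None                      # enough right in a row: move to a new domain
    if not history:
        return (lo + hi) // 2            # no history yet: start in the middle
    if history[-1]:
        return min(current + 1, hi)      # last answer right: assign a harder problem
    return max(current - 1, lo)          # last answer wrong: assign an easier problem

# Three right answers at difficulty 4 -> next problem served at difficulty 5.
print(next_difficulty([True, True, True], current=4))
```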

Imagine you are a K–12 school principal, and parents, faculty, and school board members are reading stories in Time, Wired, and Forbes about how Khan Academy is poised to change the world. Clayton Christensen has declared these new technologies to be a “disruptive innovation,” a disjunction between an inferior past and a better future. Adopting adaptive tutors across a grade, a department, or an entire school would constitute a major initiative, requiring investing in hardware, scheduling computer lab or laptop cart times, selecting software, training teachers, communicating with parents and family, and tinkering with many other elements in the complex ecology of a school. As a principal, if you chose to invest your energy in this initiative, the opportunity cost would be all of the other initiatives that you chose not to take up.1 You’d also want to be confident that introducing adaptive tutors would actually improve learning outcomes, especially for the students who are struggling the most. How do you decide whether implementing adaptive tutors is the best possible bet for your students and community?

School leaders facing these kinds of decisions about technology-based innovations should ask four sets of questions. The first order of business is to understand the basics of how the technology works as a learning tool: What is the pedagogical model? What are the fundamental principles of the underlying technology? Some educational technologists pushing their product may try to convince you that their technology is new, complex, and hard to understand. Demystifying the technology is the first step to understanding how it may or may not work for your students in your school.

The second task is to investigate how similar technologies have been integrated into schools elsewhere. Some of these questions should be about nuts and bolts: How much do teachers and students actually use the technology after it has been purchased? What kinds of changes in schedule, curriculum, physical plant, and other school elements must be adopted to make the new technology-mediated practices work?

After understanding how the technology works and how it might be integrated into schools, a third step is to investigate what the accumulated research evidence says about which kinds of schools and students are most likely to benefit from a new approach. Do these tools benefit learners on average? How do impacts differ between high- and low-achieving students, or between more-affluent and less-affluent students? In MOOCs, answering these questions was challenging because the research on large-scale, self-paced online learning has been relatively sparse. By contrast, there are dozens of high-quality studies that examine how adaptive tutors affect learner outcomes in the K–12 context that can inform a principal’s judgment.

These first three questions are about average effects: What do we know about how the technology has been implemented across many kinds of schools? The final task for the principal (or school board member, or superintendent, or department head) is to consider all this history and evidence in the light of one particular, idiosyncratic school: yours. What are the unique features of your school, your faculty, your students, your community that might abet or thwart efforts to make a technology adoption of adaptive tutors a success?

If the evidence base for adaptive tutors suggested that they substantially benefited all students in all subjects in all contexts, this would be an easy question to answer (full speed ahead!). Unfortunately, as I will describe below, the track record of algorithm-guided technologies is not nearly so clear. Some evidence from some contexts suggests that they can be moderately helpful in some subjects, but many implementations of these technologies have shown null or even negative effects. This unevenness comes from a variety of sources. Schools are complex places, and factors like technology availability or teachers’ willingness to adopt new practices play a role in determining the efficacy of the initiative. Furthermore, adaptive-tutoring technologies are well developed only in a few subject areas, including math and early reading, so a push toward adaptive learning can only benefit this limited subset of the curriculum. And students from different backgrounds have different experiences and outcomes with new technologies. All this complexity and nuance means that for a principal to answer the most important question—Will adaptive tutors help my students in my school?—the best place to start is by investigating the origins of CAI and understanding the pedagogy, technical underpinnings, and values of this approach.

Computer-Based Instruction, Sixty Years in the Making

In most K–12 schools, the day is organized around class periods of fixed length, and each class period is assigned to a single topic. On a given day, a student will spend forty-seven minutes learning to factor polynomials, whether the student needs 17 minutes or 107 minutes to learn the topic. For those seeking to maximize individual student learning, the inefficiencies here are stark: some students spend the majority of class bored and not learning anything new, and others leave class without having mastered the requisite material that will be necessary for learning in subsequent lessons.

One solution is to tutor every child so that all children receive the amount and type of instruction best suited for their intellectual development. In a now famous 1984 research paper, “The 2 Sigma Problem,” Benjamin Bloom published results from two doctoral dissertations comparing students who were randomly assigned to learn in one of three conditions: a traditional classroom setting, a one-on-one tutoring condition, or a third condition called “mastery learning,” in which students received additional instruction and practice on concepts that they struggled with. Bloom argued that the students in the one-on-one tutoring condition performed two standard deviations (2 sigma) better on a unit post-test than students in the classroom condition. If a typical student in the classroom condition would be at the fiftieth percentile, then a typical student in the tutoring treatment would be at the ninety-eighth percentile. Thus, Bloom and his colleagues used the full machinery of modern social science to argue what medieval lords knew: that tutoring, while expensive, worked quite well. Bloom’s article became a call to action to design educational approaches that could achieve the kinds of gains that could be achieved with one-on-one tutoring. One of Bloom’s suggestions was to explore “whether particular computer courses enable sizable proportions of students to attain the 2-sigma achievement effect.”2

By the time Bloom published this suggestion in 1984, computer scientists and researchers had been working on this challenge for over two decades. Since the very first days of computer technologies, computer scientists have sought to use computers as individualized tutors for students. In 1968, R. C. Atkinson and H. A. Wilson wrote in Science that “ten years ago the use of computers as an instructional device was only an idea being considered by a handful of scientists and educators. Today that idea has become a reality.”3

Among the first computer-based teaching systems was the Programmed Logic for Automatic Teaching Operations, or PLATO, developed in 1960 at the University of Illinois Urbana-Champaign. In 1967, with the development of the TUTOR programming language, PLATO formalized several innovations essential to the future of CAI, including automated assessment and branching.4

Problems in TUTOR included a minimum of a question and a correct answer, but the language also allowed for complex answer banks and different feedback for right and wrong answers. In the 1969 guide to the language, one of the first examples presents a student with a picture of the Mona Lisa and then asks the student to name the artist. The correct answer, “Leonardo da Vinci,” produces the response, “Your answer tells me that you are a true Renaissance man.” The incomplete answer “Leonardo” produces the prompt, “The complete name is Leonardo da Vinci.” A blank entry provokes a hint: “HINT—MONA LISA—HINT.” Any of the correct answers drives students to a subsequent unit in the lesson sequence, on the artist Rubens, but the incorrect answer “Michelangelo” takes students to MREVIEW, a review unit.

Here we see some of the crucial features of CAI. Lesson sequences were organized around a series of instructional units that both presented content and tested student recall or understanding. Students were given different feedback based on their responses, and the system was designed such that when students provided incorrect answers, they could be given different learning experiences to remediate different kinds of problems. These systems required that curriculum authors manually sequence each problem and learning experience, telling the computer how to address wrong answers, what feedback to give, and which problems were easier or harder.
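To make the branching concrete, here is a schematic rendering in Python of the Mona Lisa example above. TUTOR’s actual syntax looked nothing like this, and the feedback strings not quoted in the 1969 guide are my own placeholders.

```python
# A schematic rendering in Python (not actual TUTOR code) of the Mona Lisa
# branching example: match the student's answer, return tailored feedback,
# and route to the next unit or the review unit.

def mona_lisa_item(answer):
    answer = answer.strip().lower()
    if answer == "leonardo da vinci":
        return "Your answer tells me that you are a true Renaissance man.", "RUBENS"
    if answer == "leonardo":
        return "The complete name is Leonardo da Vinci.", None   # stay on this item
    if answer == "":
        return "HINT—MONA LISA—HINT", None
    if answer == "michelangelo":
        return "Let's review.", "MREVIEW"   # placeholder feedback; the guide specifies only the branch
    return "Try again.", None               # placeholder fallback for other wrong answers

feedback, next_unit = mona_lisa_item("Leonardo da Vinci")
print(feedback, "->", next_unit)
```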

In the five decades following the initial development of the TUTOR programming language, there were two critical advances that brought adaptive-tutoring systems to the forefront of education reform debate and dialogue in the twenty-first century. The first innovation was statistical. Starting in the 1970s, psychometricians (statisticians who study educational testing) developed an approach called item response theory to create a mathematical model of the relative difficulty of a question, problem, or test item. These quantitative representations of learning experiences paved the way for computers to automatically generate testing and learning sequences that could adapt to the performance of individual students rather than having to manually program branches as in the TUTOR example above. Nearly all contemporary adaptive-tutoring systems use variations on this forty-year-old statistical toolkit.

The second innovation was rhetorical. In the late 2000s, education reformers developed an interlocking pair of narratives about “personalized learning” and “disruptive innovation” that explained why and how Bloom’s vision of computerized tutors for every child could be brought into reality. The most ambitious blueprints to put personalized learning at the center of students’ school experience called for a dramatic reorganization of schooling institutions. Students would spend less time in face-to-face, whole-class instruction and more time working individually with adaptive learning software. Teachers might spend more time coaching individual students or working with small groups, and supervision of students working on software could be done by paraprofessionals. These models called for new schedules, new teaching roles, and new learning spaces. Very few schools adopted these personalized learning blueprints in any substantial way, and the predictions of transformation based on the theory of disruptive innovation proved incorrect, but these new rhetorical devices help explain why adaptive tutors experienced a surge of interest in the 2000s.5

Technical Foundations for Algorithm-Guided Learning at Scale: Item Response Theory

Few edtech evangelists of the twenty-first century can match Jose Ferreira for bravado and exaggeration in describing new educational technologies. Ferreira wanted educators and investors to believe that algorithm-guided learning technologies were unfathomably complicated. (Contrary to Ferreira’s assertions, the important elements of nearly every education technology are, with a little bit of study and research, comprehensible.)

Ferreira founded Knewton, a company that tried to offer “adaptive learning as a service.” Most publishers and start-ups offering adaptive learning have implemented their own algorithm-based systems in their own platforms. Knewton offered to do this technical development for publishers. Publishers could generate textbooks and assessment banks, and then Knewton would handle turning those assessments into CAI systems.

Ferreira described his technology in magical terms: “We think of it like a robot tutor in the sky that can semi-read your mind and figure out what your strengths and weaknesses are, down to the percentile.” The robot tutor in the sky was powered by data; Ferreira claimed that Knewton collected 5 to 10 million data points per student per day. “We literally have more data about our students than any company has about anybody else about anything,” Ferreira said. “And it’s not even close.”6

These claims, however, were nonsense. If you have ever sat in a computer lab watching students—some engaged, some bored—click or type their way through an adaptive tutor, you will have seen quite clearly that students are not generating millions of useful data points as they answer a few dozen problems.7 But Ferreira’s was a particular kind of nonsense: an attempt to convince educators, investors, and other education stakeholders that Knewton’s technologies were a disjunctive break with the past, a new order emerging. The truth was something much more mundane. While Ferreira made claims of unprecedented scale and complexity in Knewton promotional material, Knewton engineers were publishing blog posts with titles like “Understanding Student Performance with Item Response Theory.” In one post, engineers declared, “At Knewton, we’ve found IRT models to be extremely helpful when trying to understand our students’ abilities by examining their test performance.” Lift up the hood of the magical robot tutor, and underneath was a forty-year-old technology powering the whole operation.8

In the 1970s, researchers at Educational Testing Service developed a statistical toolkit called item response theory (IRT) that would eventually allow computer algorithms to generate customized sequences of problems and learning experiences for individual students. Item response theory was originally designed not for adaptive tutors but to solve a basic problem in test design. When testing large numbers of students, consumers of testing data (admissions offices, employers, policymakers) would like to be able to compare two students tested on the same material. Testing companies, however, would prefer not to test all students on the exact same material, since using identical test items and formats with different students in different places and times opens the door to cheating and malfeasance. Rather than giving students the exact same test, therefore, test makers would prefer to give students different tests that assess the same topics at the same level of difficulty, allowing results to be compared fairly. Doing so requires a model of the difficulty of each question on the tests. To understand how these models work, we’ll need to do a bit of math (graphing logistic curves, to be precise).9

In IRT, every test question or problem—in psychometric parlance, an item—is modeled as an S-shaped curve called a logistic function. These S-curves start near zero on the left side of a Cartesian plane, ascend slowly up and to the right at first, then ascend more quickly, and then ascend more slowly as they approach an asymptote, making them look like a stretched-out “S.” Logistic curves always go up and always follow this S-shaped pattern because of the way they are mathematically defined (f(x) = L / (1 + e^(−k(x − x₀))), for those inclined to remember their algebra).

In these models, the x axis represents the ability of a student in a particular domain (for example, recognizing Chinese characters or multiplying single-digit numbers), and the y axis represents the probability that a student at a given level of ability will get an item correct. On an S-curve, values of y—the probability of getting an answer correct—are low at low levels of student ability (on the left side of the x axis) and high at high levels of student ability (on the right side of the x axis). Psychometricians summarize the difficulty of an item as the level of student ability at which the S-curve predicts a 50 percent chance of answering the item correctly. When item developers write a new item, they have to guess at the initial parameters of its S-curve (how steeply it ascends and how far to the right it sits), but these models can be dynamically updated as items are piloted and used in the field. If many students who are highly skilled in a domain (as measured by other items) answer an item incorrectly, then its difficulty can be revised upward.
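For those who like to see the model written down, here is a minimal sketch in Python of a standard two-parameter logistic item model of the kind described above; the parameter names and values are illustrative.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic item model: the probability that a student of a
    given ability answers an item of a given difficulty correctly. When
    ability equals difficulty, the predicted probability is exactly 0.5, which
    is how an item's difficulty is read off its S-curve."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A student well above the item's difficulty is very likely to answer correctly,
# a student well below is unlikely to, and a matched student sits at 50 percent.
print(round(p_correct(2.0, 0.0), 2))   # ~0.88
print(round(p_correct(-2.0, 0.0), 2))  # ~0.12
print(round(p_correct(0.0, 0.0), 2))   # 0.5
```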

Those are the basics of item response theory, and even if your memory of logistic curves is a little fuzzy, your takeaway should be that IRT does nothing more than create a mathematical model (an S-curve) of the difficulty of an item. Test makers use these models to make equivalently difficult versions of the same test. For developers of computer-assisted instructional systems, IRT and its variants make it possible for a computer program to assign an appropriate item in a sequence based on the student’s answers to the previous item (or a recent sequence of answers). Since computers have a model of the difficulty of each item in an item bank, when learners get an item right, the system can assign a slightly harder item, and when learners get an item wrong, the system can assign an easier item. Instead of humans manually creating branching instructional activities, as with the early TUTOR programming language examples, computers can algorithmically generate instructional sequences and continuously improve them based on student responses.10
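And here is a deliberately simplified sketch of how such a model can drive sequencing. Real adaptive tutors typically estimate ability with maximum-likelihood or Bayesian methods; the fixed step size and the tiny item bank below are my own simplifications.

```python
# A deliberately simplified sketch of IRT-guided sequencing: serve the item whose
# modeled difficulty is closest to the current ability estimate, then nudge the
# estimate up or down based on the response.

def pick_item(item_difficulties, ability):
    """Choose the item the student has roughly a 50 percent chance of solving."""
    return min(item_difficulties, key=lambda d: abs(d - ability))

def update_ability(ability, correct, step=0.3):
    """Raise the ability estimate after a right answer, lower it after a wrong one."""
    return ability + step if correct else ability - step

bank = [-2.0, -1.0, 0.0, 1.0, 2.0]       # modeled difficulties of a small item bank
ability = 0.0
for correct in [True, True, False]:      # hypothetical responses
    item = pick_item(bank, ability)
    ability = update_ability(ability, correct)
    print(f"served item {item:+.1f}, new ability estimate {ability:+.1f}")
```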

These algorithm-guided large-scale learning technologies are decades old. Knewton, then, was not a magical new technology; rather, it offered one business innovation (adaptive learning as a back-end service rather than a product feature) on top of a very well-established set of learning technologies. My hope is that seeing the basic functioning and long history of a complex technology like adaptive tutors will help education stakeholders understand that new technologies are both comprehensible and historically rooted. If we can situate a new technology in its history, we can make predictions about how that new technology will function when integrated into the complex ecology of schools.

In education technology, extreme claims are usually the sign of a charlatan rather than an impending breakthrough. Knewton failed. Ferreira left Knewton in 2016, and soon after, the company hired a former publishing executive as its CEO and pivoted to publishing its own textbooks with adaptive practice problems. In 2019, the company was sold to publisher Wiley in a fire sale. Like Udacity, which became a provider of technical certificate programs, and Coursera, which became an online program manager, Knewton began with dramatic claims about transforming teaching and learning, raised vast venture funds, and within a few years pursued well-trodden pathways to financial sustainability that fit easily into existing educational systems.11

Even if Knewton wasn’t effective as a provider of learning experiences, it was for a time extremely effective in deploying narratives about change to raise incredible sums of venture capital funding (a process that technology commentator Maciej Cegłowski calls “investor storytime”). These narratives about how technology can transform archaic, traditional educational systems were central to the surge of venture and philanthropic investments in adaptive tutors in the first two decades of the twenty-first century.12

Rhetoric of Transformation: Personalized Learning and Disruptive Innovation

In 2010, references to “personalized learning” began appearing at education conferences and in trade magazines. Computers had been in schools in various forms for decades by this point, but suddenly, the narrative of personalized learning was everywhere. For CAI enthusiasts, personalization meant that each child would be able to spend part or all of her day proceeding through technology-mediated learning experiences at her own pace. Students would sit at computer terminals using software that algorithmically optimized student pathways through a set of standardized curriculum material. Teachers would be available for coaching and small group instruction, particularly when software flagged individual students or groups as requiring additional supports.13

One challenge that the CAI enthusiasts had in advancing their vision was that it was fairly easy to characterize extreme versions of the model as dystopian: children wearing headphones sitting in cubicles staring at screens all day long. So CAI advocates sometimes argued that CAI approaches could save classroom time for more project-based learning. In the early years of Khan Academy, Salman Khan suggested that mathematics education could be transformed in a series of steps: (1) schools would adopt Khan Academy’s free math resources; (2) each student could then pursue a personalized mathematics learning trajectory that allowed him or her to proceed at his or her own pace; and (3) with all the classroom time saved through personalized CAI, collaborative activities in math could focus on rich, real-world, project-based learning exercises. The idea was that if students learned the facts and basics of mathematics faster with technology, then they would have more time to do interesting projects and team-based work. As Khan said in 2019, “If Khan Academy can start taking on some of the foundational practice and instruction, it should hopefully liberate the teachers and class time to do more higher-order tasks.”14

This argument has deep roots among CAI advocates; historian Audrey Watters found a variation of this argument made in 1959 by Simon Ramo, who like Khan was a businessman (vice president of the firm that developed the intercontinental ballistic missile) turned education technology advocate. As Ramo wrote in “A New Technique in Education,” his 1959 CAI manifesto, “The whole objective of everything that I will describe is to raise the teacher to a higher level in his contribution to the teaching process and to remove from his duties the kind of effort which does not use the teacher’s skill to the fullest.”15 Thus, the rhetoric of personalized learning made two grand claims: that adaptive tutors would be more efficient at teaching students mastery of key concepts and that this efficiency would enable teachers to carry out rich project-based instruction.

If personalized learning provided the vision for what schools could look like, the theory of disruptive innovation provided a blueprint for how technology innovations would inevitably lead to school change. In 2008, Clayton Christensen, Curtis Johnson, and Michael Horn published a book called Disrupting Class that argued that online education and CAI represented a new kind of disruptive innovation in education. The theory of disruptive innovation argues that, periodically, innovations come along that may be low quality in some dimensions but offer low cost and novel features. The Sony Walkman’s sound quality was much worse than that of contemporary hi-fi systems, but it was relatively cheap and you could walk around with it; it appealed particularly to “non-consumers,” people who were unlikely to buy expensive hi-fi systems, like those of us who were teens and tweens in the 1980s. Christensen and colleagues argued that online education was one such disruptive innovation that would eventually revolutionize education in the same way that the Walkman or iTunes revolutionized music and media.16

In Disrupting Class, the authors made three predictions about how online learning would reshape K–12 education. They predicted that by 2019, half of all secondary courses would be mostly or entirely online, that the cost of providing these courses would be about a third the cost of traditional classes, and that the quality of these online courses would be higher. Disruption, the theorists argued, often catches established stakeholders unaware, because early in their existence, disruptive innovations are obviously deficient in certain dimensions—like sound quality in a Walkman or the quality of the learning experience in online schools. But disruptive innovations are supposed to improve rapidly in both established dimensions and novel ones, so Christensen and colleagues argued that as online learning was rapidly adopted, it would quickly prove superior to existing educational models.

Christensen and colleagues argued that these disruptive processes could be predicted with precision. They claimed that adoption of disruptive innovations historically followed an S-shaped logistic curve, and a feature of logistic curves is that when plotted against log-transformed x and y axes, the S-curve becomes a straight line. If early adoption patterns followed this log-transformed linear model precisely—and the Disrupting Class authors argued that data on online course adoption showed that it did—then a new technology could be definitively identified as disruptive, and the timing of its adoption could be predicted with some precision. Their models showed that the adaptive online learning curve would pass the midpoint of adoption by 2019, at which point 50 percent of all secondary school courses in the United States would be conducted through customized, personalized online software. In this model of disruption, schools could choose to be early adopters or late adopters, but progress and change were inevitable.
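The linearization claim itself is ordinary algebra: if adoption follows a logistic curve, then the log of the odds of adoption rises as a straight line over time. A small sketch, under that assumption and with illustrative parameter values rather than the authors’ data:

```python
# A small check of the linearization claim, assuming adoption follows a logistic
# curve p(t) = 1 / (1 + e^(-k(t - t0))): the log of the odds of adoption,
# log(p / (1 - p)), is then a straight line in time with slope k.
import math

k, t0 = 0.5, 2019   # illustrative values only

def adoption(t):
    return 1.0 / (1.0 + math.exp(-k * (t - t0)))

for year in [2009, 2014, 2019, 2024]:
    p = adoption(year)
    log_odds = math.log(p / (1 - p))
    print(year, round(p, 3), round(log_odds, 1))  # log-odds: -5.0, -2.5, 0.0, 2.5
```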

Personalized Learning: An Unrequited Disruption

Theories of personalized learning and disruptive innovation offer models of how schools might be dramatically transformed by algorithm-guided learning technologies—though one is challenged to find any school in which this grand vision of transformation actually occurred. It was something of a Rube Goldberg plan: to get better and deeper project-based learning, schools should buy computers, buy CAI software, train teachers on their use, reallocate time to individualized computer-based instruction, and then, when all of that was working, use additional time for projects. Schools are complex places, and they are not typically successful at implementing multipart schemes to improve learning. It is not surprising, then, that CAI did not remake education in the way predicted by true CAI enthusiasts.

Furthermore, the theory of disruptive innovation, the guiding force behind the bold predictions in Disrupting Class, has come under substantial critique. In 2014, Harvard historian Jill Lepore presented one of the most damning appraisals of the theory. In a piece published in the New Yorker, Lepore argued that the theory was based on idiosyncratically selected case studies of individual industries and circumstances, weak foundations on which to build new theories. Disruption theory evangelists disregard case studies showing contrary examples, and theorists observe disruption in hindsight but struggle to accurately use the theory to predict future changes (an unfortunate quality for a business management theory). Lepore showed that many of the companies presented as laggardly dinosaurs in Christensen’s original tome, The Innovator’s Dilemma, are happily dominating their industries decades later. Both Lepore and Audrey Watters have observed that disruption theory appears to draw as much on millenarian narratives of struggle and redemption as it does from empirical evidence: the death of old worlds trapped in old ways, reformed and reborn by the revelation of new technologies.17

In the field of online learning, there is no evidence that the core predictions from Disrupting Class have come to pass by 2020. The data on online course enrollment by secondary school students are incomplete, but no data suggest that secondary schools are even close to provisioning 50 percent of their courses through adaptive online offerings. In 2018, there were about 57 million children in US pre-K–12 schools (40 million in pre-K–8, and 17 million in high school), and only 430,000 students enrolled in fully online or blended schools—about 0.75 percent. No evidence suggests that traditional US high schools have made substantial adoptions of online or blended offerings that would allow 50 percent of all courses to be taken in online or blended forms.

Nor is there evidence that online education has become one-third less expensive to provision than traditional education. In my home state of Massachusetts, I sat for six years on the state’s Digital Learning Advisory Council to provide policy guidance on the state’s two K–12 virtual schools. In 2010, the state required school districts to pay a tuition of $6,700 for each student who chose to attend one of these virtual schools, which was about two-thirds of the state’s formula for per-pupil expenditures ($10,774). In 2018, the two virtual schools requested a funding increase, and the state set the new funding at $8,265, which was the state formula less estimated costs for operating buildings. Rather than reducing costs as virtual schools in Massachusetts expanded enrollment, virtual school leaders argued that virtual schooling should cost about the same as traditional education. Limited research exists in other states, but it does not appear that good virtual schooling can be provisioned for one-third the cost of traditional schooling. And as was noted at the end of Chapter 1 on instructor-paced learning at scale, learning outcomes for fully online schools are generally dismal. As we shall see in the rest of this chapter, the evidence on adaptive tutors implemented within K–12 schools is complex and somewhat more promising, but mainly in certain areas of mathematics instruction.18

Adaptive Tutors in K–12 Schools: Mathematics and Early Reading

Grand visions of disruption and transformation did not come to pass, but adaptive tutors have found two more modest niches in K–12 schools: providing supplemental practice for mathematics and for early reading. A confluence of factors is responsible for the limited role of adaptive tutors in schools, including the costs of computing infrastructure and mixed evidence of efficacy. Probably the most important limit, however, is technological. For adaptive tutors to assign a sequence of problems and learning resources to students, the system has to measure the performance of students regularly and automatically. The core technology of adaptive tutors is an IRT-powered assignment algorithm paired with an autograder. As we observed in Chapter 1 on MOOCs and will explore further in Chapter 7 on assessment technologies, the state of the art in autograders is quite limited. Autograders work reasonably well in mathematics, where quantitative answers can be computationally evaluated. In a few domains of early reading, they can be useful as well—testing students’ ability to match sounds with letters (phonics), identifying basic vocabulary, or doing simple translations when learning a foreign language. Reading instructors sometimes discuss a transition, which happens in about the third grade, from learning to read—learning how to decode the sounds and meaning of text—to reading to learn—using reading to advance content knowledge. Generally speaking, the autograders of adaptive tutors have some applications in learning to read, but very limited applications in reading to learn. When educators need to evaluate whether students can reason based on the evidence provided in a text, autograders typically cannot effectively evaluate the quality of student reasoning.

These two subject domains of mathematics and early reading also overlap with a substantial portion of the standardized testing infrastructure in the United States. As high-stakes testing has spread throughout the United States since the 1990s, educational publishers have invested more in developing resources for tested subjects than non-tested subjects. The reasoning goes something like this: because schools are more likely to purchase products and services related to tested subjects like reading and math, and because autograding technologies work best in reading and math, publishers have generally focused on creating adaptive tutors in reading and math. (As we shall see in Chapter 7 on assessments, this alignment is not coincidental but instead part of a powerful feedback loop; standardized test developers have access to the same autograding technology as educational publishers, so our testing infrastructure evaluates domains like reading and math where autograders work best; then schools emphasize those subjects, publishers create products for those subjects, policymakers evaluate schools on those subjects, and the system becomes mutually reinforcing.)

Do Adaptive Tutors Improve Student Learning?

Adaptive learning tools for reading and mathematics have been researched extensively over the last thirty years, and the results are mixed at best. Two groups of researchers have conducted most of this research. The first group are the computer scientists, learning scientists, and CAI researchers who developed these systems. A second group are economists of education, who are typically interested in the return on investment for different educational interventions. The interest of CAI researchers is obvious: they want to know if their innovations improve student learning. To the credit of the CAI community, many CAI products have been regularly scrutinized through studies in which the CAI software companies help with implementation of the software and teacher training, but independent third-party organizations conduct the research evaluation. An easy way to tell if an edtech developer is serious about improving learning and not just hoping to extract dollars from the education system is to see whether they participate in research studies that have a real chance of showing that their products do not work. Economists of education are often interested in innovations that have the potential to substantially change educational practice at large scales, and they are interested in labor issues; computers that can do some of the work of teachers tick both boxes.

Over the past thirty years, there have been hundreds of studies about adaptive tutors in K–12 schools, allowing researchers to conduct meta-analyses (research studies that investigate trends across multiple studies). Through the early 2010s, the general consensus of economists and other education policy experts was that CAI should not be considered a reliable approach for improving student learning in math or reading. This conclusion was based on evidence from numerous large-scale randomized controlled field trials conducted in the 1990s and early 2000s; such trials are the best research methods we have to determine whether or not a pedagogical approach improves learning in typical school settings (as opposed to in research labs or special cases). Some of these studies showed a positive effect, some a null effect (no impact), and some a negative effect. The meta-analyses of these field trials suggest that on average, adaptive reading tutors do not lead to better reading test scores than traditional instruction. Meta-analytic findings about math CAI approaches have been more mixed; some meta-analyses found average null results and others found modestly positive effects for math CAI. In one meta-analysis, researchers argued that adaptive math tutors overall had a small positive effect on students but that they benefited students from the general population more than low-achieving math students. They warned that “computerized learning might contribute to the achievement gap between students with different achievement levels and aptitudes.” This study provides some evidence of the edtech Matthew effect that will be discussed in Chapter 6.19

Even within studies that show an average effect of zero, there can still be considerable variation in how adaptive tutors effect change in individual schools or classrooms. An average effect of zero can happen when nobody’s learning changes, or it can happen when some students experience large positive effects and some experience large negative effects, which cancel each other out. In his doctoral research, Eric Taylor, now on the faculty at the Harvard Graduate School of Education, articulated a version of this argument. He observed in a meta-analysis that the average learning gains of classrooms using CAI and classrooms not using CAI were about the same. But among teachers using CAI, the variance of learning gains from teacher to teacher was lower than the variance among teachers not using CAI. Put another way, the difference in learning outcomes from classroom to classroom for teachers not using CAI is rather large: some classes do very well, some very poorly. When teachers use CAI, the difference between the classes that do well and the classes that do poorly is smaller. Why should this be?20

Nearly all CAI implementations follow some kind of blended model, where human educators teach class for part of the time (usually a few days a week), and students work individually with computers during the other part of the time. In contrast, the traditional model of math instruction involves teacher-led whole-group instruction followed by individual practice problems without feedback. Taylor argued that for the weakest teachers in the system, replacing one or two days a week of their instruction with individual time on computers improved outcomes for students—that time on computers was a boon for students who had the weakest teachers. By contrast, for the strongest teachers in the system, replacing part of their instruction led to worse outcomes; for their students, the time on computers took away from valuable time with a proficient instructor. This one study shouldn’t be considered dispositive, but it provides an intriguing hypothesis for the effects of CAI on instruction and some real puzzles for implementation, which we’ll come to soon.
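A toy illustration, with entirely made-up numbers, shows how Taylor’s pattern can arise: the average classroom gain is identical with and without CAI, but the spread between the weakest and strongest classrooms is narrower with CAI.

```python
# A toy illustration, with made-up numbers, of Taylor's pattern: the average
# classroom learning gain is the same with and without CAI, but the spread
# between the weakest and strongest classrooms is narrower with CAI.
from statistics import mean, pstdev

gains_without_cai = [0.05, 0.15, 0.30, 0.45, 0.55]  # hypothetical per-classroom gains, in SD units
gains_with_cai    = [0.20, 0.25, 0.30, 0.35, 0.40]  # same average, compressed toward the middle

print(round(mean(gains_without_cai), 2), round(pstdev(gains_without_cai), 2))  # 0.3 0.18
print(round(mean(gains_with_cai), 2), round(pstdev(gains_with_cai), 2))        # 0.3 0.07
```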

Two Recent Studies of Adaptive Math Tutors with Positive Results

Two of the largest experimental field trials of adaptive tutors—one of Cognitive Tutor and one of ASSISTments—have occurred since the meta-analyses of the early 2010s, and these two studies showed much better outcomes for student learning than would have been predicted based on the history of CAI in schools. Both studies were conducted by reputable third-party researchers funded by the federal government, and they showed substantial positive effects for CAI in math classrooms.

In 2014, the RAND Corporation released a study investigating the use of Carnegie Learning’s Cognitive Tutor: Algebra in seventy-three high schools and seventy-four middle schools in seven US states. Cognitive Tutor emerged from three decades of research at Carnegie Mellon University, and it is among the most widely adopted CAI systems and among the most closely researched. In the RAND study, a large number of schools agreed to adopt Cognitive Tutor: Algebra, and then half of those schools were randomly assigned to get the CAI software and professional development support; the other half continued with business as usual. Carnegie Learning encourages teachers to spend three days a week doing regular whole-class instruction in which the pace of the class roughly matches the pace of a typical algebra class. Then, two days a week, students use the Cognitive Tutor: Algebra program for individualized practice; in these sessions, students are supposed to work through the material at their own pace. Thus, in a five-day week, students receive both in-person, group instruction and supplemental personalized computer practice provided by intelligent tutors.21

John Pane led the RAND team evaluating test score data from the experiment. He and his colleagues found no effect of CAI in the first year of implementation in a new school, which they characterized as an “innovator’s dip.” They argued that it takes schools about a year to figure out how to productively integrate new tools into their math teaching routines. In the second year, they saw positive, statistically significant improvements in learning outcomes among ninth graders using the program (they saw more modest, positive, not statistically significant effects among eighth graders).

Describing learning gains in education research is a tricky business, and the shorthand references that researchers and policymakers use can often be confusing. The most common measure of an intervention’s effect on learning is called the effect size, which is the average change in assessed outcomes in standard deviation units. Using a standard deviation unit allows comparisons across different interventions with different tests, different scales, and so forth. In the RAND study of Cognitive Tutor: Algebra, in the control condition without any CAI technology, the average student gain between pre- and post-tests after a year of learning was about 0.2 standard deviations. In the experimental condition, researchers found a 0.2 effect size in the second year of the study, meaning that on average, those students experienced a 0.4 standard deviation gain from pre-test to post-test. We could think of the 0.2 standard deviation growth in the control group as the baseline amount of learning that typically occurs in an algebra classroom, so getting an additional effect size of 0.2 standard deviations of test score gains from CAI meant that students in the treatment group were seeing twice the learning gains of a typical student. (Another way to frame the magnitude of the effect is that students at the fiftieth percentile in the control group would be, on average, at the fifty-eighth percentile if assigned to the treatment group.)

As may be apparent from the previous paragraph, effect sizes and standard deviations are difficult to parse, so researchers have tried using months or years of learning as a measure—taking a standard measure of average learning gains and translating that to one year, or nine months, of learning; in the case of the RAND study, 0.2 standard deviations represents a “year of learning.” Since an additional effect size of 0.2 standard deviations, then, represents an “additional year of learning,” the Carnegie Learning website claimed that Cognitive Tutor: Algebra doubled students’ learning. One important clarification is that no one is claiming that students learned two years of material in one year. Rather, students showed performance gains in Algebra I post-tests as if they had studied for eighteen months in a traditional control setting instead of nine months, assuming a consistent rate of learning per month. Students assigned to Cognitive Tutor learned the Algebra I curriculum twice as well, as measured by standardized tests, as typical students, but they did not learn Algebra I and an additional year of math.22
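For readers who want to check the arithmetic, here is a back-of-the-envelope calculation, assuming normally distributed test scores; the effect sizes are the ones reported above, not new data.

```python
# A back-of-the-envelope check of the percentile translation above, assuming
# normally distributed test scores: a student at the control-group average who
# gains an additional 0.2 standard deviations lands near the 58th percentile.
from statistics import NormalDist

baseline_gain = 0.2    # typical control-group growth over the year, in SD units
added_effect  = 0.2    # additional effect of the treatment, in SD units

total_gain = baseline_gain + added_effect            # 0.4 SD, i.e., "twice the learning"
percentile = NormalDist().cdf(added_effect) * 100    # position relative to control peers
print(total_gain, round(percentile))                 # 0.4 and ~58
```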

These average effect sizes mask the great variation in effectiveness across schools. In some middle schools, students assigned to use Cognitive Tutor saw gains even greater than the 0.2 standard deviation average, and some saw gains that were much smaller. After the RAND study, Carnegie Learning researchers looked more deeply into the data to try to explain this variation, and they found that learning outcomes were better in schools where teachers most fully allowed students to proceed on practice problems at their own pace, even if that meant that sometimes, different students were working on problems from very different places in the curriculum.

In experimental studies, one concept that researchers study is “fidelity”: do teachers actually use the pedagogical innovation in the intended ways? One of the core intentions of Cognitive Tutor is that while using the software, students should advance only as they demonstrate mastery so that students don’t miss foundational ideas early on. This means that students should be working on different lessons at different times. Since Cognitive Tutor logs student activity, researchers can tell whether students in the same class are mostly working in lockstep or whether teachers are actually letting students work on a topic until they achieve mastery. In a 2016 follow-up to the RAND study, Steve Ritter from Carnegie Learning presented evidence that how teachers used Carnegie Learning mattered a great deal for student learning outcomes. Ritter’s research team looked at how much adaptive mastery learning teachers actually allowed in their classes, and they found that some teachers assigned work in Carnegie Learning in such a way that it wasn’t really personalized—these teachers required students to work on problem sets related to the topics being taught at that moment to the whole class. By contrast, other teachers allowed students to work at their own pace, even if this meant that some students were still doing practice problems on topics that might have been covered in class weeks earlier. Ritter’s team found that overall learning gains were higher in the classes where students were allowed more opportunities to move at their own pace; in other words, the teachers who used Carnegie Learning as intended had more learning gains in their classrooms than did the teachers who kept their students moving in lockstep. That suggests that more professional development and coaching for teachers implementing Cognitive Tutor might be able to improve outcomes further if all teachers could be convinced to let students work on practice problems at their own pace.23

In 2016, the contract research group SRI International evaluated a major field trial of a similar CAI system called ASSISTments. ASSISTments was created by Neil and Cristina Heffernan, both former middle school math teachers. Neil did his dissertation at Carnegie Mellon with Ken Koedinger, who was instrumental in the development of Cognitive Tutor. The Heffernans took ASSISTments in a slightly different direction than Cognitive Tutor. Cognitive Tutor: Algebra was designed to replace part of routine class activity; students would spend three days a week on in-person group instruction and two days a week on computers using Cognitive Tutor: Algebra. ASSISTments, by contrast, is mostly a homework helper: students do teacher-assigned homework problems at night, get immediate feedback about whether they are right or wrong, and have the option to do some additional “skill-builders” that incorporate some adaptive elements of CAI.24

The program was rolled out in middle schools across Maine, where a statewide laptop initiative ensured universal access for middle school students. Teachers received professional development from the ASSISTments team for using the freely accessible ASSISTments system. The research team estimated that students would use ASSISTments three or four nights a week for about ten minutes a night, although data later showed that students used ASSISTments somewhat less than that. Most student work was probably on non-adaptive teacher-assigned homework problems rather than the skill practice, so the intervention probably wasn’t really testing adaptive learning environments—for the most part, kids were doing the same textbook problems they would have been doing, except the problems were online. The main levers of learning were probably twofold: students got immediate feedback on problems in the evening, and teachers got a simple report each morning that showed which problems students struggled with, allowing them to tailor their morning homework review in class to the most challenging problems and issues.

As in the RAND/Cognitive Tutor study, the SRI team found that students in the treatment condition assigned to use ASSISTments learned more on average and did about 0.2 standard deviations better on pre- and post-test gains, which was about 75 percent more than the control group. They also found that most of the gains were among low-achieving math learners, so the intervention played a role in closing achievement gaps.

In comparing the ASSISTments study with the Cognitive Tutor: Algebra study, one difference that leaps out is how much simpler ASSISTments is. Cognitive Tutor: Algebra is a full CAI adaptive learning solution, while ASSISTments is more of an online homework helper. Cognitive Tutor: Algebra requires major changes in classroom practice that reduce teacher contact time with students, increase in-classroom computer usage, and create the opportunity for individual pacing. By contrast, as used in the Maine study, ASSISTments just lets kids see the answers to their problems and lets their teachers get more information about how students are doing. Cognitive Tutor rearranges math teaching; ASSISTments gains some efficiencies in homework and review. In the two experimental studies, the effects of both interventions were about the same. This suggests that all the complex machinery of the full CAI system may be unnecessary, and a lightweight online homework helper could perhaps be just as good as a complex adaptive tutor.

For policymakers or school leaders trying to decide what role computers should play in teaching mathematics, these two recent studies can help advance our understanding of the value of CAI. New studies do not replace previous studies; rather, they help stakeholders in math and computer-assisted education regularly update our understanding of the state of the art. One view of this research is that two large, well-conducted, randomized field experiments should revise our consensus to have a more positive outlook on intelligent tutors in math. This pair of experiments with Cognitive Tutor: Algebra and ASSISTments suggests that researchers, developers, and educators have developed an understanding of computer-assisted instruction that allows teachers using this software to consistently get moderate learning gains from incorporating these tools. If this were the case, we should expect future studies and implementations to show similar gains from CAI systems, perhaps even with modest improvements over time as developers continuously improve these systems. In this view, even though the older consensus was that CAI systems in math did not significantly improve learning, these new studies suggest that the field is maturing.

A more cautious view would be that over the last three decades, there have always been occasional studies that show positive results, but these are regularly “balanced out” by other studies that show negative or null results. For instance, in early 2019, Teachers College at Columbia University released results from a study of Teach to One, another computer adaptive learning system developed originally in the New York City Public Schools. While not a randomized field trial, this study showed that schools adopting Teach to One did not improve on state test scores. No single research study perfectly captures the “true effect” of an education approach, as our measures are always affected by errors in measurement, sampling variation, and other exigencies of research. Perhaps in the Carnegie Learning and ASSISTments studies, these errors nudged the results in the positive direction; it could be that the next two big CAI assessments will have negative effects, and the two after that will be null, and as a field, we will realize that the skeptical assessment that economists had reached by the mid-2010s probably still holds.25

My own view is that these recent positive results represent a maturing of the field, and in the future, we should expect these kinds of consistent, replicable gains for adaptive tutors in math. That said, I am constantly trying to look at new studies and new evidence, and I try to revise my thinking as more studies come out. I hope that the case study above of adaptive tutors provides a model for how people interested in education technology can steadily revise their thinking on a topic—looking for meta-analyses or studies that provide a consensus view of a field, and then gradually updating their thinking as new studies emerge.

So What Should a Department Head Do? Synthesizing Research in Computer-Assisted Instruction

Let’s put ourselves back in the shoes of a K–12 principal considering whether it’s worth pursuing adaptive tutors as a way to improve student performance in a school. We understand a bit more about how these tools work; they are not magical robot tutors in the sky, but rather are software programs with a long history of designers and researchers tinkering toward incremental improvement. They have not found wide purchase in the K–12 curriculum, but they have been used in early elementary reading and throughout the math curriculum. On average, studies of adaptive tutors in early reading have not shown positive impacts on learning. A school looking to be on the cutting edge of innovation might be willing to try some newly developed approach, but elementary schools looking for reforms with a strong track record of research should probably turn to other approaches to support early reading.

In a sense, then, the decision to explore adaptive tutors in K–12 schools probably belongs primarily to the math department head, as math is the only domain where adaptive tutors have consistently shown some evidence of efficacy, especially in a few recent studies. In an ideal world, a math department head looking to improve teaching and learning in her district would take all of these studies and perspectives into account before identifying whether CAI would be a good fit for her math teachers, and if so, what specific products might work well. Randomized controlled trials are good tools for figuring out if interventions work on average. But no school district is average; every context is unique. If schools have already made big investments in technology, as the middle schools in Maine did with their laptop program, then the costs to schools of implementing CAI are much lower than the costs of buying new machines just for math. If math teachers in a district are generally quite strong, then Eric Taylor’s research suggests that adaptive tutors might not be the best tool for getting further improvements, or maybe that a complementary system like ASSISTments would be more promising than a supplementary system like Carnegie Learning. By contrast, in a system where math teachers are generally not as strong—maybe a district with frequent teacher turnover and many new teachers—computer-assisted instruction may be a more compelling path forward.

Some schools may have teachers of varying quality but with a high willingness to try new approaches. A school district with math teachers willing to dive into a new program may have better results with Carnegie Learning than a district with teachers who aren’t as willing to change their teaching practices. Steve Ritter’s research suggests that Carnegie Learning works best with teachers who are most willing to let the tutors personalize student practice on the days devoted to computer-based instruction.

One of the wonderful and challenging things about schools is that they can always be improved; teaching and learning are so immensely complex that there is always room for tinkering and improvement. There is no evidence that computer-based instruction regularly outperforms traditional instruction or that CAI leads to dramatic transformation of math learning. The best way to understand CAI is as one possible tool among many for improving mathematics education. There are other options too, of course: investment in human tutors to provide more support for the students struggling the most; professional development for teachers in rich mathematical discourse or a deeper understanding of fundamental math content; new software that facilitates new kinds of visualization in mathematics, like Geometer’s Sketchpad or the Desmos graphing calculator. For some schools, CAI might be the right tool to improve math instruction, and in other schools, one of these other approaches might be a better fit, based on the strengths, weaknesses, and interests of the teachers in a given school or district.

This argument about the utility of intelligent tutors should feel familiar, as it is structurally similar to the case made in the previous chapter in regard to MOOCs. Both technologies are useful but not transformative. They have particular niches in which they appear to work well (math education for CAI, professional education for MOOCs) and other niches in which evidence suggests that they are much less useful (reading education for CAI, remedial or entry-level higher education for MOOCs). Both technologies raise serious concerns about issues of inequality, though some studies of adaptive tutors suggest ways that they might benefit struggling students. Instead of transforming educational systems, they are best understood as technologies that can offer limited but meaningful value in particular parts of our existing education systems. With ongoing research and tinkering, I suspect that technologies and implementation models for adaptive tutors will continue to incrementally improve, and perhaps over time, the weight of evidence will shift to support more widespread adoption and implementation. Technology evangelists who claim that a new generation of adaptive tutors can reshape the arc of human development should be treated with suspicion.