1. Teaching Computers to Fish

By 1960, IBM realized it had a problem. At a conference four years earlier, in the summer of 1956, a group of leading academics convened to consider how to build machines that, as they put it, “simulated every aspect of human intelligence.” The collection of brash young scientists brainstormed for two months among the stately Georgian spires and lush gardens of Dartmouth College, fueled by the bold prediction of the organizers that “a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”1 They may not have agreed on much, but they unanimously adopted the moniker “artificial intelligence” for their endeavor, as suggested by their host, the mathematician John McCarthy. It was a heady time.

When the attendees returned to their respective institutions, few seemed to notice that the optimistic goals of the conference had gone largely unmet. But that didn’t stop them from expressing their enthusiasm for the newly minted field. Their predictions were soon featured in general-interest publications such as Scientific American and the New York Times.2

Among the conference organizers was Nathaniel Rochester, a star researcher at IBM’s Watson Research Lab, who was tapped to lead the company’s nascent AI efforts. But as word spread about his team’s work on computer programs that played chess and proved mathematical theorems, complaints started to mount from an unexpected source.

The singular focus of IBM’s storied sales force was to sell the latest data-processing equipment to industry and government. Renowned for aggressive tactics and armed with an answer to every objection, the sales force began reporting back to headquarters that decision makers were concerned about just how far this new push into AI might go. It was one thing to replace lowly clerks who typed up memos and sent out bills, but quite another to suggest that the same computers IBM was urging them to buy might someday threaten their own jobs as managers and supervisors.

The company rose to this challenge with an internal report recommending that it cease all research in AI and shutter Rochester’s new department.3 Perhaps concerned for their own jobs, members of IBM management not only implemented these recommendations but also armed the sales force with a simple riposte: “Computers can only do what they are programmed to do.”4

This straightforward phrase may be one of the most widely circulated and potent cultural memes of the last half century. It deftly neutered concerns about the mysterious, brightly colored Pandora’s boxes IBM was installing on raised floors in special air-conditioned “computer rooms” throughout the world. Nothing to fear here: these electronic brains are just obedient mechanical servants blindly following your every instruction!

Programmers schooled in sequential step-wise processing, in which you break a problem down into ever more manageable chunks (called “structured programming”), would be quick to agree, perhaps even today. Computers at the time were monolithic devices that loaded some data from a finite memory, fetched an instruction, operated on that data, then stored the result. Connecting two computers together (networking) was unheard of, much less having access to volumes of information generated and stored elsewhere. Most programs could be described as a sequence of “Do this, then do that” instructions. Rinse and repeat.

Despite the lofty goals of the field, AI programs of the time reinforced this paradigm. Following the orientation of the founders, many early AI efforts focused on stringing logical axioms together to reach conclusions, a form of mathematical proof. As a result, they tended to focus on domains that were amenable to logical analysis and planning, such as playing board games, proving theorems, and solving puzzles. The other advantage of these “toy” problems was that they didn’t require access to large amounts of messy data about the real world, which was in scarce supply, to say the least.

In the context of the time, these efforts could be seen as an obvious next step in expanding the utility of computers. The machines were initially conceived as general-purpose calculators for tasks like building ballistics tables for the military during World War II; IBM had successfully beaten these electronic swords into plowshares by applying them not only to numbers but also to the processing of letters, words, and documents. AI researchers were simply further expanding the class of processed data to include symbols of any kind, whether preexisting or newly invented for specific purposes like playing chess. Ultimately, this style of AI came to be called the symbolic systems approach.

But the early AI researchers quickly ran into a problem: the computers didn’t seem to be powerful enough to do very many interesting tasks. Formalists who studied the arcane field of theory of computation understood that building faster computers could not address this problem. No matter how speedy the computer, it could never tame what was called the “combinatorial explosion.” Solving real-world problems through step-wise analysis had this nasty habit of running out of steam the same way pressure in a city’s water supply drops when vast new tracts of land are filled with housing developments.

Imagine finding the quickest driving route from San Francisco to New York by measuring each and every way you could possibly go; your trip would never get started. And even today, that’s not how contemporary mapping applications give you driving instructions, which is why you may notice that they don’t always take the most efficient route.

Much of the next several decades of AI research could be characterized as attempts to address the issue that logically sound approaches to programming tended to quickly peter out as the problems got more complex. Great effort went into the study of heuristics, which could loosely be described as “rules of thumb” to pare down the problems to manageable size. Basically, you did as much searching for an answer as you could afford to, given the available computing power, but when push came to shove you would turn to rules that steered you away from wasting time on candidate solutions that were unlikely to work. This process was called pruning the search space.
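To make the idea of pruning concrete, here is a small sketch in Python. The cities and mileages are invented for illustration, and the pruning rule shown, abandoning any partial route already at least as long as the best complete route found so far, is only the simplest member of a much larger family of such shortcuts:

```python
# A toy illustration of pruning the search space. The road map and
# distances below are invented; real heuristics are usually fallible
# rules of thumb, while this one (branch and bound) is simply the
# easiest safe pruning rule to demonstrate.
ROADS = {
    "San Francisco": {"Sacramento": 90, "Los Angeles": 380},
    "Sacramento": {"Salt Lake City": 650, "Los Angeles": 385},
    "Los Angeles": {"Phoenix": 370},
    "Salt Lake City": {"Denver": 520},
    "Phoenix": {"Denver": 600},
    "Denver": {"Chicago": 1000},
    "Chicago": {"New York": 790},
    "New York": {},
}

def best_route(start, goal, best_so_far=float("inf"), path=(), miles=0):
    """Depth-first search that skips (prunes) any partial route that is
    already at least as long as the best complete route found so far."""
    path = path + (start,)
    if start == goal:
        return miles, path
    best = (best_so_far, None)
    for city, distance in ROADS[start].items():
        if city in path:                    # never revisit a city
            continue
        if miles + distance >= best[0]:     # the pruning step
            continue
        candidate = best_route(city, goal, best[0], path, miles + distance)
        if candidate[0] < best[0]:
            best = candidate
    return best

print(best_route("San Francisco", "New York"))
```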

Monumental debates broke out over where, exactly, the intelligence was in these programs. Researchers in “heuristic programming” soon came to realize that the answer lay not in the rote search for a solution or the process of stringing logical propositions together, but rather in the rules they used for pruning.

Most of these rules came from experts in the problem domain, such as chess masters or doctors. Programmers who specialized in interviewing experts to incorporate their skills into AI programs became known as “knowledge engineers,” and the resulting programs were called “expert systems.” While these programs were certainly a step in the right direction, very few of them turned out to be robust enough to solve practical real-world problems.

So the question naturally arose: What is the nature of expertise? Where does it come from, and could a computer program become an expert automatically? The obvious answer was that you needed lots of practice and exposure to relevant examples. An expert race car driver isn’t born with the ability to push a vehicle to its operating limits, and a virtuoso isn’t born holding a violin. But how could you get a computer program to learn from experience?

A small fringe group of AI researchers, right from the earliest days, thought that mimicking human brain functions might be a better way. They recognized that “Do this, then do that” was not the only way to program a computer, and it appeared that the brain took a different, more flexible approach. The problem was that precious little was known about the brain, other than that it contains lots of intricately interconnected cells called neurons, which appear to be exchanging chemical and electrical signals among themselves.

So the researchers simulated that structure in a computer, at least in a very rudimentary form. They made lots of copies of a program, similar in structure to a neuron, that accepted a bunch of inputs and produced an output, in a repeating cycle. They then networked these copies into layers by connecting the outputs of lower layers to the inputs of higher layers. Each connection carried a numeric weight, so a weight of zero might mean not connected and a weight of one hundred might mean strongly connected. The essence of these programs was the way they automatically adjusted their weights in response to example data presented to the inputs of the lowest layer of the network. The researcher simply presented as many examples as possible, then turned the crank to propagate adjustments to these weights throughout the system until it settled down.5

Following the tendency for AI researchers to anthropomorphize, they called these programs “neural networks.” But whether these programs actually functioned the way brains do was beside the point: it was simply a different approach to programming.
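For readers who want to see the idea in modern terms, here is a minimal sketch of such a network in Python: a single hidden layer of simulated neurons, numerically weighted connections, and a training loop that nudges every weight slightly after each example. The toy task (deciding whether a majority of three inputs are on), the layer sizes, and the learning rate are arbitrary choices, and the update rule is the modern one (backpropagation) rather than anything available in the field’s early days:

```python
import math
import random

# A toy layered network: one hidden layer of simulated "neurons,"
# each connection carrying a numeric weight, trained only by example.
# The task, layer sizes, and learning rate are arbitrary illustrations.
random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Examples: three binary inputs -> 1.0 when a majority of them are 1.
EXAMPLES = [([a, b, c], float(a + b + c >= 2))
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]

N_IN, N_HID = 3, 4
w_hid = [[random.uniform(-1, 1) for _ in range(N_IN + 1)]   # +1 for a bias input
         for _ in range(N_HID)]
w_out = [random.uniform(-1, 1) for _ in range(N_HID + 1)]

def forward(x):
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0]))) for ws in w_hid]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

# Present the examples over and over, nudging every weight slightly in
# the direction that reduces the error (backpropagation, a later invention).
RATE = 0.5
for _ in range(5000):
    for x, target in EXAMPLES:
        hidden, out = forward(x)
        err_out = (out - target) * out * (1 - out)
        for j, h in enumerate(hidden + [1.0]):
            err_into_hidden = err_out * w_out[j]          # use the old weight
            w_out[j] -= RATE * err_out * h
            if j < N_HID:                                 # skip the bias "unit"
                err_h = err_into_hidden * h * (1 - h)
                for i, v in enumerate(x + [1.0]):
                    w_hid[j][i] -= RATE * err_h * v

# After training, the outputs land close to the 0/1 targets.
for x, target in EXAMPLES:
    print(x, target, round(forward(x)[1], 2))
```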

The most important difference between the symbolic systems and neural networking approaches to AI is that the former requires the programmer to predefine the symbols and logical rules that constitute the domain of discourse for the problem, while the latter simply requires the programmer to present sufficient examples. Rather than tell the computer how to solve the problem, you show it examples of what you want it to do. This sounds terrific, but in practice, it didn’t work very well—at least initially.

One of the earliest neural networking efforts was by Frank Rosenblatt at Cornell in 1957, who called his programmatic neurons “perceptrons.”6 He was able to show that, with enough training, a network of his perceptrons could learn to recognize (classify) simple patterns in the input. The problem was, as with symbolic systems programs, the results were mainly small demonstrations on toy problems. So it was hard to assess the ultimate potential of this approach, not to mention that Rosenblatt’s claims for his work rankled some of his friendly academic competitors, particularly at MIT.

Not to let this challenge go unanswered, two prominent MIT researchers published a widely read paper proving that, if limited in specific ways, a network of perceptrons was incapable of distinguishing certain inputs unless at least one perceptron at the lowest level was connected to every perceptron at the next level, a seemingly critical flaw.7 The reality, however, was a little different. In practice, slightly more complex networks easily overcome this problem. But science and engineering don’t always proceed rationally, and the mere suggestion that you could formally prove that perceptrons had limitations called the entire approach into question. In short order, most funding (and therefore progress) dried up.

At this point, readers close to the field are likely rolling their eyes that I’m retelling this shopworn history-in-a-bottle tale, which ends with the underdog winning the day: the 1990s and 2000s witnessed a resurgence of the old techniques, with increasingly persuasive results. Rebranded as machine learning and big data, and enhanced with advanced architectures, techniques, and use of statistics, these programs began to recognize objects in real photographs, words in spoken phrases, and just about any other form of information that exhibits patterns.8

But there’s a deeper story here than researcher-gets-idea, idea-gets-quashed, idea-wins-the-day. There’s an important reason why machine learning was so weak in the late twentieth century compared to symbolic systems, while the opposite is true today. Information technology in general, and computers in particular, changed. Not just by a little, not just by a lot, but so dramatically that they are essentially different beasts today than they were fifty years ago.

The scale of this change is so enormous that it’s difficult to conjure up meaningful analogies. The term exponential growth is thrown around so often (and so imprecisely) that most people don’t really understand what it means. It’s easy enough to define: a quantity that grows in proportion to a fixed number raised to an ever-increasing power. But it’s hard for the human mind to grasp what that means. Sequences like 100, 1,000, 10,000 (powers of 10) or 32, 64, 128 (powers of 2) are numeric examples. These numbers get mind-bogglingly large very quickly: in just eighty steps of the first sequence, the figure exceeds the estimated number of atoms in the entire universe.
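A quick back-of-the-envelope check makes the point, using the commonly cited estimate of roughly 10^80 atoms in the observable universe:

```python
# Eighty steps of "multiply by ten," starting from 100, reach 10**82,
# which comfortably exceeds the commonly cited estimate of roughly
# 10**80 atoms in the observable universe.
ATOMS_IN_UNIVERSE = 10 ** 80      # rough, commonly cited estimate
value = 100 * 10 ** 80            # the first sequence after eighty steps
print(value > ATOMS_IN_UNIVERSE)  # True
```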

For at least the last half century, important measures of computing, such as processing speed, transistor density, and memory, have been doubling approximately every eighteen to twenty-four months, which is an exponential pace (power of 2). At the start of the computer revolution, no one could have predicted that the power of these machines would grow exponentially for such a sustained period. Gordon Moore, cofounder of Intel, noticed this trend as early as 1965, but remarkably, this pattern has continued unabated through today with only minor bumps along the way.9 It could all end tomorrow, as indeed concerned industry watchers have warned for decades. But so far, progress marches on without respite.

You’ve probably experienced this remarkable achievement yourself without realizing it. Your first smartphone may have had a spacious eight gigabytes of memory, a small miracle for its time. Two years later, if you bothered to upgrade, you likely sprang for sixteen gigabytes of memory. Then thirty-two. Then sixty-four. The world didn’t end, but consider that your phone contains eight times as much memory as it did three upgrades ago, for pretty much the same cost. If your car got eight times the gas mileage it did six years ago, on the order of, say, two hundred miles per gallon, you may have taken more notice.

Now project this forward. If you upgrade your phone every two years for the next ten years, it’s not unreasonable to expect it to come with two terabytes (two thousand gigabytes). The equivalent improvement in gas mileage for your car would be over six thousand miles per gallon. You could drive from New York City to Los Angeles and back on a single gallon, and still have enough left to make it down to Atlanta for the winter before refueling.
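The arithmetic behind that projection is simple doubling, as a few lines of Python make concrete; the 64-gigabyte and 200-miles-per-gallon starting points are the illustrative figures used above, not measurements:

```python
# Five two-year upgrade cycles over the next decade, each one doubling
# both the phone's storage and the hypothetical car's gas mileage.
storage_gb, mpg = 64, 200      # starting points used in the text above
for _ in range(5):             # five doublings in ten years
    storage_gb *= 2
    mpg *= 2
print(storage_gb)              # 2048 GB, i.e., about two terabytes
print(mpg)                     # 6400, i.e., "over six thousand miles per gallon"
```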

Imagine how mileage like this would change things. Gas would effectively be free. Drilling for oil would come to a near standstill. Airlines and shipping companies would constantly scramble to adopt the latest hyperefficient motor technology. The cost of package delivery, freight, plane tickets, and consumer goods would drop significantly. This blistering rate of change is precisely what’s happening in the computer industry, and the secondary effects are transforming businesses and labor markets everywhere.

So your phone might have two thousand gigabytes of storage. What does that mean? To put that in perspective, your brain contains about one hundred “giga-neurons.” This is not to suggest that twenty bytes of computer memory is as powerful as a neuron, but you get the picture. It’s quite possible, if not likely, that within a decade or two your smartphone will, at least in principle, have as much processing power as your brain. It’s hard to even imagine today what we will do with all this power, and it’s quite possibly just around the corner.

To my children, this story is just the ramblings of an old-timer talking about the good ol’ days. But to me, this is personal. Over the 1980 winter holiday break at Stanford, I helped some researchers from SRI International build a program that could answer questions posed in English to a database. Though the system’s linguistic capability was rudimentary compared to today’s, the team leader, Gary Hendrix, was able to use this demo to raise venture capital funding for a new company that he cleverly named Symantec.

Sequestered in my basement for two solid weeks, I cobbled together a flexible database architecture to support the project. Gary had loaned me a state-of-the-art personal computer of the time, the Apple II. This remarkable machine stored information on floppy disks and supported a maximum of forty-eight thousand bytes of memory. To put this in perspective, that Apple II could store about one second of CD-quality music. By contrast, the phone I’m carrying around today, which has sixty-four gigabytes of memory, can hold about twelve days of CD-quality music. My phone literally has over 1 million times as much memory as that Apple II, for a fraction of the cost.

What does a factor of 1 million mean? Consider the difference between the speed at which a snail crawls and the speed of the International Space Station while in orbit. That’s a factor of merely half a million. The computer on which I am typing these words has far more computing power than was available to the entire Stanford AI Lab in 1980.

While it’s possible to compare the processing power and memory of today’s and yesterday’s computers, the advances in networking can’t even be meaningfully quantified. In 1980, for all practical purposes, the concept barely existed. The Internet Protocol, the basis for what we now call IP addresses, wasn’t even standardized until 1982.10 Today, literally billions of devices are able to share data nearly instantly, as you demonstrate every time you make a phone call or send a text message. The enormous and growing mountain of data of nearly every kind, stored on devices accessible to you through the Internet, is astonishing.

So how did this affect the relative success of the various approaches to AI? At some point, large enough differences in quantity become qualitative. And the evolution of computers is clearly in this category, even though progress may seem gradual on a day-to-day basis, or from Christmas gift to Christmas gift. As you might expect, machines so vastly different in power may require different programming techniques. You don’t race a snail the same way you would race a spaceship.

The original symbolic systems approach was tailored to the computers available at the time. Since there was precious little computer-readable data available at all, and no way to store any significant volume of it, researchers made do by handcrafting knowledge they painstakingly distilled from interviews with experts. The focus was on building efficient algorithms to search for a solution because the limited processing power would not permit anything more ambitious.

The alternative neural networking approach (more commonly called machine learning today), which attempted to learn from examples, simply required too much memory and data for early computers to demonstrate meaningful results. There were no sufficiently large sources of examples to feed to the programs, and even if you could, the number of “neurons” you could simulate was far too small to learn anything but the simplest of patterns.

But as time went by, the situation reversed. Today’s computers can not only represent literally billions of neurons but, thanks to the Internet, they can easily access enormous troves of examples to learn from. In contrast, there’s little need to interview experts and shoehorn their pearls of wisdom into memory modules and processors that are vanishingly small and slow compared to those available today.

Important subtleties of this technological revolution are easy to overlook. To date, there seem to be no limitations on just how expert machine learning programs can become. Current programs appear to grow smarter in proportion to the number of examples they have access to, and the volume of example data grows every day. Freed from dependence on humans to codify and spoon-feed the needed insight, or to instruct them as to how to solve the problem, today’s machine learning systems rapidly exceed the capabilities of their creators, solving problems that no human could reasonably be expected to tackle. The old proverb, suitably updated, applies as well to machines as to people: Give a computer some data, and you feed it for a millisecond; teach a computer to search, and you feed it for a millennium.11

In most cases, it’s impossible for the creators of machine learning programs to peer into their intricate, evolving structure to understand or explain what they know or how they solve a problem, any more than I can look into your brain to understand what you are thinking about. These programs are no better able to articulate what they do and how they do it than human experts—they just know the answer. They are best understood as developing their own intuitions and acting on instinct: a far cry from the old canard that they “can only do what they are programmed to do.”

I’m happy to report that IBM long ago came around to accepting the potential of AI and to recognizing its value to its corporate mission. In 2011, the company demonstrated its in-house expertise with a spectacular victory over Ken Jennings and Brad Rutter, two of the most successful Jeopardy! champions in the show’s history. IBM is now parlaying this victory into a broad research agenda and has, characteristically, coined its own term for the effort: cognitive computing. Indeed, it is reorganizing the entire company around this initiative.

It’s worth noting that IBM’s program, named Watson, had access to 200 million pages of content consuming four terabytes of storage.12 As of this writing, three years later, you can purchase four terabytes of disk storage from Amazon for about $150. Check back in two years, and the price will likely be around $75. Or wait ten years, and it should set you back about $5. Either way, be assured that Watson’s progeny are coming to a smartphone near you.