Around the time that I dropped out of my PhD program in 2003 to take a job in industry, there were two trends that were about to change the nature of AI. One was the rapid adoption of the Internet, and the flood of data generated directly and indirectly by users as more and more of their daily lives moved online. The second was the cloud, a new massively scalable distributed computing platform emerging to handle the growth of the Internet. With these two trends, the subdiscipline of AI called machine learning was about to kick into high gear, and the second phase of AI—what I have called systems of learning—was born.
Unlike my early experiences with AI that emphasized logic, symbolic reasoning, and encoding and manipulating expert knowledge, machine learning uses data to build models and then uses those models to do intelligent things. The first machine learning program I wrote in preparation for a bigger machine learning project I had been tasked with at work was a spam classifier for e-mail.
Classifiers use machine-learned models to determine what class something belongs to. In the case of e-mail, you might want to take everything that comes into your in-box and determine whether it’s spam or not. To do that you need a bunch of e-mail messages—your training data—with each message labeled spam or not spam. The more data, and the more representative that data is of typical in-boxes, the better. You then take this labeled training data set and train a model using the machine learning algorithm of your choice.
For my classifier, I used Naive Bayes. What that is and how it works aren’t super important. The important thing is the idea: if you have enough labeled data, you can train a model that is able to make inferences about things it has never encountered before—in the case of my classifier, determining whether brand-new e-mails that show up in an in-box are spam or not. When you write a machine-learned spam classifier, you don’t have to know beforehand all the words and phrases that spammers use in their messages to trick you into doing something you shouldn’t. You just need to have examples of spam so that the system itself can learn the patterns that make a message spam or not. Rather than having to encode rules and relationships and logic and explicit knowledge, the algorithm learns what it needs to learn from the data. The more training data you have, the better your learned model will be. And the more frequently you can gather labeled training data and retrain, the easier it will be for your model to adapt to new patterns that might emerge.
It sounds complicated, but the pattern is simple. You have some data you want to be able to reason over. You make a bunch of examples of what you want the machine learning algorithm to learn by labeling your data, whether that’s this e-mail is spam or this image contains a cat. You then use these examples, the labeled training data, to train a model. If you do your job right, when you pass new data through the model, it has learned what to do with that data, whether that’s inferring a previously unread message is spam or tagging a cat in a previously unseen image.
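To make the pattern concrete, here is a minimal sketch in Python using the scikit-learn library's Naive Bayes text classifier. The handful of messages and labels is made up purely for illustration; a real spam classifier would be trained on many thousands of labeled e-mails.

```python
# A minimal sketch of the label -> train -> predict pattern, using
# scikit-learn's Naive Bayes text classifier on a tiny, made-up data set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labeled training data: each message is tagged spam (1) or not spam (0).
messages = [
    "WIN a FREE prize now, click here",
    "Cheap meds, limited time offer",
    "Lunch tomorrow at noon?",
    "Here are the notes from today's meeting",
]
labels = [1, 1, 0, 0]

# Turn raw text into word-count features, then train the model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# Inference on messages the model has never seen before.
print(model.predict(["Claim your free prize today"]))      # likely [1], spam
print(model.predict(["Can we move the meeting to 3pm?"]))  # likely [0], not spam
```

With only four training examples the predictions are anything but reliable, but the shape of the workflow is exactly the one described above: labeled examples in, trained model out, inferences on new data afterward.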
Perhaps the most exciting thing about machine learning, and one of the reasons that this second phase of AI and its systems of learning have made such rapid progress compared with the first phase of AI and its systems of reasoning, is that machine learning progress is driven by how much data you have and how much computing power you have available to train models using all of this data. With the Internet and the cloud, both data and compute power have been growing at astronomical rates over the past two decades, which has helped us to make stunning progress in machine learning. One way to think about this is that instead of having thousands or tens of thousands of experts building systems of reasoning during the first phase of AI, we now have tens of thousands of experts building systems of learning that are in turn being trained by the data being produced by a few billion people as they use the Internet, their smartphones, and the increasing number of smart devices that permeate our lives.
The machine learning approach I just described is an example of supervised learning. With supervised learning, a human being labels all the data required to train a model. You can think of this labeling process as a way to teach a model how to recognize patterns in data. To borrow an analogy from Sean Gerrish’s book How Smart Machines Think, it’s a bit like teaching a young child about the world through flash cards. The cards are the labeled training data, and the training process is repeatedly showing the cards to the child until they grasp the pattern. Any child is a far more advanced learning machine than a machine learning algorithm, so you don’t want to carry this analogy too far. But at a high level, this is the essence of supervised machine learning.
Your model is only as good as your data. If you are teaching it to recognize a bucket, you must figure out what readily definable features of your data are going to help your model learn. For e-mail that might be easy because e-mails have lots of easily discernible structure. It’s a lot tougher to say what the features might be for recognizing buckets. A human, when asked to describe useful features of buckets, might say things like “they sometimes have handles” or “they’re kind of cylindrical and have holes in the top” or “they’re usually made from plastic or metal.” But for a computer, those aren’t readily definable features. They are difficult to describe directly to a machine that knows nothing at all about many things we take for granted like handles, holes, and materials. In fact, the difficulty of describing such complex features to machines is one of the reasons machine learning was invented in the first place!
Once you’ve figured out features, you might have to label a hundred thousand different buckets of all different shapes, sizes, and colors, from different perspectives and under different lighting conditions. If you don’t provide enough variety and quantity of training examples, the model won’t be able to generalize, to recognize buckets that it’s never seen before, or to say that a picture of a dog is not a bucket. If you have biased data, you will likely train a biased model. If, for instance, you mostly used pictures of red buckets in training, your model might not recognize blue buckets as buckets at all, which is problematic in a world with both red and blue buckets.
Some of the biggest challenges in modern supervised machine learning arise from the need to make your training data representative enough of the domain you want the AI to learn. A tremendous amount of human effort goes into feature engineering and labeling data. You must make sure that the features you choose are predictive, that the labels are accurate, and that the training data is representative. Human developers, data scientists, machine teachers, and data workers also must manage bias in the data they are feeding the AI. Machine-learned models learn what we teach them. It is far too easy for harmful human biases, naturally present in human-generated data, to make their way into models where those biases can be amplified at AI scale. For example, if all the labeled data identifies doctors as men, your model is going to learn that all doctors are men and blithely propagate this bias and inaccuracy.
All this machine teaching and data manipulation is time-consuming and extremely expensive. Even for large technology companies with hundreds of millions of customers with whom they interact daily, data is the biggest thing that limits what you can accomplish with AI. You might have a perfectly good idea for a problem that you think could be solved with machine learning, but find yourself in a spot where you don’t have the right data: too little, too expensive, the wrong type, too biased, etc. There are several techniques that can help when confronted with these sorts of data problems, chief among them deep learning, machine teaching, transfer learning, reinforcement learning, and unsupervised learning.
Supervised machine learning can accomplish an unbelievable number of things using a relatively simple pattern: engineer your features, label your data, train your model, deploy your model, and use it to make your product “better,” resulting in more user engagement and more direct and indirect data to use to make your model better; wash-rinse-repeat. This cycle has driven much of the growth of the consumer Internet over the past fifteen or so years.
The first step in this cycle, feature engineering, isn’t something that folks talk about a lot, but it is a necessary step and one of the most important parts of doing machine learning. In every machine learning system, you have data and something that you want to learn from these data. In order to learn from the data, you must tell the machine learning algorithm what features of your data you believe will help it to learn. For some types of data and machine learning problems, like e-mail and spam detection, figuring out the features can be easy. With e-mail you already have some structure: whom it’s from; whom it’s to; a subject; a date and time sent; a body with words and sentences and paragraphs; attachments; links to web pages; etc. All of those, or combinations of them, might be useful features to use in your model.
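For illustration, here is a small, hypothetical sketch of what turning an e-mail's structure into features might look like. The field names and the particular features chosen are my own invention; the point is simply that structured fields become numbers a learning algorithm can work with.

```python
# An illustrative sketch of hand-engineered e-mail features. The fields and
# heuristics are hypothetical; real systems use far richer feature sets.
import re

def email_features(email: dict) -> dict:
    body = email.get("body", "")
    subject = email.get("subject", "")
    return {
        "num_links": len(re.findall(r"https?://", body)),          # links in the body
        "num_attachments": len(email.get("attachments", [])),
        "subject_all_caps": int(len(subject) > 0 and subject.isupper()),
        "subject_mentions_free": int("free" in subject.lower()),
        "sender_in_contacts": int(email.get("from", "") in email.get("contacts", set())),
        "body_length": len(body),
    }

features = email_features({
    "from": "deals@example.com",
    "subject": "FREE OFFER JUST FOR YOU",
    "body": "Click http://example.com/win now!",
    "attachments": [],
    "contacts": {"friend@example.com"},
})
print(features)
```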
For images you’re using to train a bucket detector, the only structure you may have is a list of numbers telling you what color each pixel is. Are those pixels the right features to help your model recognize buckets? If not, what features do you use? Your guess is as good as mine.
Deep neural networks, DNNs for short, can be a very effective way to build models when you have lots of labeled training data but you’re not too sure what the right features are for your data. DNNs are based loosely on the biological neural networks in our brains. DNNs are much smaller than a human brain. As of this writing, the very largest DNNs have fewer than ten billion synapses, or parameters, although that number has been increasing in recent years by as much as a factor of ten every year, given massive increases in available compute power. Still, our largest DNNs are many orders of magnitude smaller than a human brain, which has approximately a hundred trillion synapses. It also bears mentioning that even though our largest artificial DNNs are much smaller and much dumber than a human brain, the computing infrastructure and power required to run them might consume a hundred thousand times the space and power of a human brain.22
Structurally, deep neural networks are also quite different from a human brain. AI scientists and engineers design DNNs to solve categories of problems. DNNs for recognizing images tend to be structurally different from those that translate text from one language to another. The structure of these DNNs is far more regular and much less complex than that of our biological neural networks, and the networks are typically good only at the narrow range of tasks for which they were designed. We are still at the very early stages of research into techniques like transfer learning that allow DNN models designed and trained for one task to be combined with other models to solve different tasks.
With DNNs, you determine some very high-level structure of your data and problem, and then select a DNN category. If you’re trying to do some sort of inference on images or data with two-dimensional structure, you will probably use a convolutional neural network (CNN). If you’re trying to build a model that can predict the next word in a sequence or the next action to take after a sequence of prior actions, you might use some sort of long short-term memory network (LSTM). If you’re trying to build models that operate on recorded speech or handwriting, you might use a more general recurrent neural network (RNN). Once you’ve picked a DNN architecture for your problem, you’re then mostly off the hook for specifying features.
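To give a feel for what that looks like in practice, here is a minimal sketch, assuming the PyTorch library, of the kind of small convolutional network one might pick for the bucket problem. The layer sizes and the 64x64 input resolution are arbitrary choices for illustration; notice that the only inputs are raw pixels, with no hand-engineered features.

```python
# A minimal sketch (assuming PyTorch) of a CNN for bucket / no-bucket images.
import torch
import torch.nn as nn

class BucketCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn low-level edges and textures
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # learn higher-level shapes
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)      # two outputs: bucket vs. no bucket

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# One forward pass on a batch of four random 64x64 RGB "images".
logits = BucketCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```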
The idea of neural networks has been around for a long time, and technically predates the field of artificial intelligence. The first neural network models were created in 1943 by Warren McCulloch and Walter Pitts, and there was significant progress in neural network research over the next couple of decades. Things stalled out by the late 1960s, primarily because the ambition of research had exceeded the ability of computers to keep up. Progress slowed because there simply wasn’t enough compute power to do the next set of interesting things with this flavor of AI.
In 2006, Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh published a seminal paper, “A Fast Learning Algorithm for Deep Belief Nets,” describing techniques for effectively training deep neural networks. In 2009, Andrew Ng pioneered the use of graphics processing units (GPUs) for training deep neural networks, achieving improvement by a factor of 100 in training performance. Since then, DNNs and deep learning have, in many ways, dominated the field of machine learning, allowing AI experts to make stunning breakthroughs, most notably on perception tasks like labeling the objects in images and recognizing the words in spoken language. Part of this is because DNNs solve a very hard problem in feature engineering. Part of this is because of an explosion of this perception data. And part of this is because the algorithms used to train deep neural networks are amenable to running on GPUs, processors originally designed to accelerate computer graphics, which has allowed us to train DNNs in reasonable amounts of time where previously their compute requirements made training impractical.
DNNs are a hugely useful technique currently being used in all the major flavors of machine learning: supervised, reinforcement, and unsupervised. They make feature engineering easier, but still require lots of labeled training data to perform well. Lots and lots of labeled training data. The good thing is that DNNs can be used in conjunction with other techniques to help reduce the burden imposed by their thirst for data. When using DNNs in supervised learning applications, combining them with machine teaching or transfer learning may provide a way to bootstrap the training process from smaller training data sizes. Reinforcement learning and unsupervised learning use DNNs too, but in those applications the data that fuel the DNNs are generated in massively scalable simulation loops.
One of the weaknesses of DNNs is that we don’t yet have a precise science governing their design. A deep neural network is composed of layers of artificial neurons, each of which is connected to neurons in other layers. How many neurons are in each layer, how many layers, and how the neurons interconnect is known as the architecture of the DNN. Each of the connections between neurons has a weight, and this weight determines how much signal flows from neuron A to neuron B when A is activated. As with biological neurons, activation is typically a nonlinear function of the neuron’s inputs. The rectified linear unit, or relu, is the most popular of these nonlinear activation functions in 2019, but as with anything in AI, this could change very quickly with dramatic, unpredictable effect. And you don’t need to worry about what relu is or even what a nonlinear function does. The important thing to understand is that a neuron is either off or on. It takes a bunch of signals from neurons in other layers, and if those signals surpass some threshold, the neuron turns on and passes a signal to all the neurons to which it is connected in proportion to the weights on those connections.
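Here is a toy numerical sketch of the arithmetic inside a single artificial neuron: weighted signals from the previous layer are summed, a bias is added, and the result passes through the relu activation. The numbers are arbitrary.

```python
# A toy illustration of one artificial neuron with a relu activation.
import numpy as np

def relu(z):
    # relu passes positive values through unchanged and clips negatives to zero
    return np.maximum(0.0, z)

inputs = np.array([0.5, -1.2, 3.0])   # signals arriving from neurons in the previous layer
weights = np.array([0.8, 0.1, -0.4])  # learned connection weights
bias = 0.2                            # learned offset that shifts the activation threshold

activation = relu(np.dot(weights, inputs) + bias)
print(activation)  # 0.0 when the weighted sum is below zero; otherwise the sum itself
```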
The job of the learning algorithm is to flow training data from the input layer to the output layer, learning the weights on all the connections in a way that maximizes the likelihood that the output layer outputs the right thing. For example, if we are training a DNN to recognize buckets, we would likely use a convolutional neural network architecture. We would take labeled training data consisting of images with and without buckets and flow those data through the DNN, learning a set of weights that maximizes the likelihood that the output layer emits a bucket signal when the training data contains a bucket, or a no bucket signal when there is no bucket.
One other trick that DNNs use in training is a technique called dropout that prevents the DNN from overfitting the model to the training data.
Imagine the following exchange:
“Every CEO is old and gray.”
“But my boss is a CEO, and she is neither old nor gray.”
“Well, that’s the exception that proves the rule!”
“I still think you’re overgeneralizing!”
“Really? I don’t think so. Just because your boss is a CEO doesn’t mean that we can’t say that all CEOs are either old and gray or else look like your boss . . .”
Overfitting is a problem that any machine learning system can have, not just DNNs. It results from the model learning the training data so well that it has trouble generalizing to any data that wasn’t in the training data set. Since generalizing to previously unseen data is the point of building machine learning systems, overfitting is a bad thing, and having techniques at your disposal for dealing with it is very important.
To see the point with just a little bit more technical depth, one could plot a group of people—our training samples—according to age and grayness, marking whether or not each of them is a CEO.
We can use machine learning to learn a rule, or function, that predicts whether a person is likely (or unlikely) to be a CEO. The blue line represents one such function. It is a simple one—linear regression, to be specific—that despite its simplicity seems to do a pretty good job of determining whether someone is likely to be old and gray enough to be a CEO.
A more complicated function might be depicted by a green line. On the surface, having more complexity seems to give us more accuracy. The green function better separates the group of real CEOs from the non-CEOs. In fact, imagine that one little red dot near the bottom might represent someone just like my boss, who is a CEO, but neither gray nor old. It might even be my boss!
But the problem here is what happens when we use these machine-learned functions to analyze data we’ve never seen before. For example, my friend is about as young as my boss. But he is not a CEO, nor would one normally think of someone so young and nongray as CEO material. In this case, what we see is that the green function is not really a good one—it is guilty of overfitting.
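For readers who would rather see overfitting in code than on a chart, here is a hedged sketch on synthetic age-and-grayness data generated on the fly. A heavily constrained model and an unconstrained one are trained on the same examples; the unconstrained one can memorize every quirk of its training set, which tends to hurt it on people it has never seen.

```python
# A sketch of overfitting on synthetic "age and grayness" data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
age = rng.uniform(25, 75, n)
grayness = rng.uniform(0, 1, n)
# Synthetic rule: older, grayer people are more likely to be labeled CEO,
# with noise added so there are exceptions (like the young, nongray boss).
is_ceo = ((age / 75 + grayness) / 2 + rng.normal(0, 0.15, n)) > 0.8

X = np.column_stack([age, grayness])
X_train, X_test, y_train, y_test = train_test_split(X, is_ceo, random_state=0)

simple = LogisticRegression(max_iter=1000).fit(X_train, y_train)
flexible = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit

for name, model in [("simple", simple), ("flexible", flexible)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
# Typically the unconstrained tree scores near 1.0 on its training data but
# worse than the simple model on held-out data: it has overfit.
```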
Neural networks today are capable of learning extremely complex functions—in fact, the most complex functions ever built by humankind (with machine assistance, of course). If left unchecked, given enough data and the right DNN architecture, the functions/rules they learn would essentially account precisely for every single “exceptional” case in their training sets. Simply put: Neural nets today can learn too much from their training data. This capability needs to be tamed. So a big part of the research on neural nets has centered around how to control this overfitting problem.
Dropout turns out to be one of the most effective approaches to preventing overfitting in DNNs. A simple way to think about dropout is that it’s like selective forgetting. If the network remembers everything that it has seen, then it may never learn to generalize beyond its memory, which is a bad thing. The goal of any machine learning system is to be able to generalize beyond the data on which it has been trained, to be able to respond with good answers when it is asked a question about data that it has never seen before. It’s the same as wanting your child to be able to eventually do arithmetic more complicated than what they’ve memorized from flash cards.
Dropout forces a DNN to forget some of what it learns in each stage of training in order to force it to learn how to generalize. With dropout, during each stage of the training process, some neurons within each layer are chosen randomly to be removed—or “dropped out”—and consequently don’t propagate signals to other neurons in other layers. The probability of dropping a neuron out is a configurable parameter of the training process. If you forget too little, you may fail to learn to generalize. If you forget too much, your model may never learn anything. So choosing the correct value for this parameter is important.
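A bare-bones sketch of that mechanic might look like the following, where each neuron's output in a layer is zeroed out with probability p on a given training pass. Real frameworks implement this more efficiently, but the core idea is the same.

```python
# A bare-bones sketch of dropout applied to one layer's activations.
import numpy as np

rng = np.random.default_rng(42)

def dropout(layer_outputs, p=0.5):
    keep_mask = rng.random(layer_outputs.shape) >= p   # randomly choose which neurons survive
    return layer_outputs * keep_mask / (1.0 - p)       # rescale so the expected signal is unchanged

activations = np.array([0.7, 1.3, 0.0, 2.1, 0.4, 0.9])
print(dropout(activations, p=0.5))  # roughly half the values become 0 on any given pass
```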
When the training stage is complete, the dropped-out nodes are reinserted. This process is repeated in every stage until training is complete. Why dropout works as well as it does is not entirely understood, which is a common refrain with deep neural networks. The most plausible theory that folks currently have is that randomly dropping out nodes is like changing a little bit of the DNN architecture during each training stage, which makes it harder for model weights to co-adapt in ways that lead to overfitting.
As a side benefit, doing dropout also reduces the amount of time and power required during training since we’re only training a fraction of the nodes at each stage. Whether the dropout is a fifty-fifty thing (as in a coin flip) or some other probability, and whether every stage is handled the same way, and lots and lots of other tweaks and variations, is a subject unto itself. But the bottom line is that what people generally want out of a machine learning system is a fairly “smooth” function. Dropout has the effect of smoothing things out, so that weird spikes—like having someone as young and dashing as my boss in your training set—won’t lead us to conclude that all such people are CEO material.
Many of the choices that we make about the configuration of a DNN to make it work well are determined through experimentation. Researchers and AI developers have been exploring ways to automate this experimentation to quickly and automatically determine what configurations work best for a particular problem or class of problems. These efforts are colloquially known as AutoML, and the most sophisticated ones use machine learning to determine DNN parameters, like how much dropout and what activation function to use, and even DNN architecture. Doing this experimentation manually is time-consuming, expensive, and error-prone, in many cases resulting in a suboptimal model. Sometimes that means training a DNN with poorer than ideal performance, which may result in people giving up on deep learning prematurely. And sometimes it means that an individual or team doesn’t try deep learning at all, because the whole process seems too daunting.
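The following is not Microsoft's AutoML or anyone else's product, but a heavily simplified stand-in for the kind of search such systems automate: randomly sample a handful of configurations, train each one, and keep the best. Here the search covers layer sizes, the activation function, and regularization strength, using scikit-learn's small neural network classifier on synthetic data; production systems are far more sophisticated, often using machine learning to guide the search itself.

```python
# A heavily simplified sketch of automated configuration search.
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
random.seed(0)

best_score, best_config = 0.0, None
for _ in range(10):  # try ten random configurations
    config = {
        "hidden_layer_sizes": random.choice([(16,), (64,), (64, 32)]),
        "activation": random.choice(["relu", "tanh", "logistic"]),
        "alpha": random.choice([1e-4, 1e-3, 1e-2]),
    }
    model = MLPClassifier(max_iter=500, random_state=0, **config)
    score = cross_val_score(model, X, y, cv=3).mean()   # cross-validated accuracy
    if score > best_score:
        best_score, best_config = score, config

print(best_config, round(best_score, 3))
```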
Microsoft’s AutoML system has been applied in dozens of machine learning systems in active use across Microsoft’s portfolio of products and services. In our own uses of AutoML, it is often able to build better models than the best hand-tuned ones that data scientists and AI experts were able to achieve on their own. In some cases, AutoML makes machine learning accessible for the first time for teams without AI expertise. Given how successful AutoML has been internally, we’ve made it available for other developers as a service in Azure Machine Learning, and for nondevelopers in tools like Power BI, where it is truly making the power of machine learning accessible to whole new audiences. Due to its promise and demonstrated utility, AutoML is a very active area of research and development in industry and academia.
Just as flash cards are limited tools for teaching children, our current best practices for data labeling and data engineering are very limited ways to teach a supervised machine learning system to make inferences about data. Patrice Simard, the originator of machine teaching, writes, “While machine learning focuses on creating new algorithms and improving the accuracy of ‘learners,’ the machine teaching discipline focuses on the efficacy of the ‘teachers.’”23
If I wanted to build an AI system to recognize all the pages on the web that contain food recipes, using typical supervised machine learning techniques, I would need to find perhaps hundreds of thousands of examples of recipes among the tens of billions of pages on the web—because I would want all my training examples to be accurate; i.e., actual recipes and not unrelated things like instructions for grooming your cat or lists of facts about an actor. A human who knows how to distinguish recipes from these other things would have to look at each page and label it. Because I potentially need so many labeled training examples, I would need lots of humans, and it would take a very long time. And even after I spent the time and money to gather all these labeled examples, I might get a model that doesn’t work well and need to gather different data for subsequent rounds of training.
Machine teaching seeks to make it easier for humans, who can quite easily discriminate between recipes and nonrecipes, to convey their knowledge of a domain like this to a machine learning system. For instance, one of the things that a human might easily grasp about a recipe is that each one typically contains a list of ingredients. Things without lists of ingredients are unlikely to be recipes, and things with them have a much higher probability of being recipes. So, instead of just simply labeling things as recipe or not a recipe and hoping that the machine learning system figures out ingredients lists, why not teach the system to recognize ingredients lists as part of teaching it to recognize recipes?
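As a purely hypothetical sketch of that idea, here is what a teacher-supplied concept might look like in code: a crude detector for ingredients lists that becomes one feature, alongside the labels, that the learner can lean on. The heuristics are deliberately simple and of my own invention.

```python
# A hypothetical sketch of a teacher-supplied concept ("this looks like an
# ingredients list") turned into a feature for a recipe classifier.
import re

UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
         "gram", "grams", "oz"}

def looks_like_ingredients_list(lines):
    """Count short lines that start with a quantity and mention a unit of measure."""
    hits = 0
    for line in lines:
        words = line.lower().split()
        if words and re.match(r"^\d", words[0]) and any(w.strip(".,") in UNITS for w in words):
            hits += 1
    return hits >= 3  # a handful of such lines suggests an ingredients list

def page_features(page_text):
    lines = page_text.splitlines()
    return {
        "has_ingredients_list": int(looks_like_ingredients_list(lines)),
        "mentions_preheat": int("preheat" in page_text.lower()),
        "num_lines": len(lines),
    }

sample = "Pancakes\n2 cups flour\n1 tablespoon sugar\n2 teaspoons baking powder\nPreheat the griddle..."
print(page_features(sample))
```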
Machine teaching is an incredibly promising new direction in machine learning. It has the potential to dramatically reduce the amount of data and data engineering required to train a model, and consequently to lower the expense of training. This could make machine learning significantly more accessible, making it just as possible for a small business as for a giant tech company to build bespoke models for problems that they are trying to solve, and for both large and small organizations to solve problems where data scales and/or budgets would never be big enough to solve the problem at hand. It also makes bias an easier problem to grapple with. It’s far easier to teach a small group of folks to recognize and deal with bias than it is to get very large groups of data labelers or the whole Internet to confront and deal with conscious and unconscious biases. Author Ted Chiang, best known for his short story that inspired the film Arrival, explores this idea of machine teaching combined with systems of simulation in his novella The Lifecycle of Software Objects. Picking up on the great Alan Turing’s suggestion that an alternative to programming intelligence would be to teach an agent to speak English and then have it learn like a child over time, Chiang explores each stage in the agents’ development, tracing their progress from one step to another.24 His approach is not far off—machine learning, whether supervised or unsupervised, ultimately requires a lot of human nurturing, teaching, care, and feeding before it’s able to do anything useful.
Even in our ancient stories, humans imagined machines that could mimic life. When my daughter learned about Greek mythology, she declared that Hephaestus must have been the god of engineers because he built robots in his workshop. Over the centuries, humans have gone from telling stories about automata to building them. It seems that every time we have made the jump from imagination to reality, we go through a cycle of surprise and incredulity, followed by amazement and wild optimism, then fear and anxiety, and, ultimately, to understanding and acceptance of the limitations of the leap that was made.
With artificial intelligence, the cycle starts with folks making claims that we are either very far away from, or very close to, the point where a machine could do some tasks as well as a human. When the machine subsequently bests humans at one of these tasks, we enter a period of hype where we question our understanding of what is hard and easy, imagine the many other things that machines might be able to do if they can do this, and then start making attention-grabbing predictions about the future in either utopic or dystopic directions. Ever since the Dartmouth workshop in 1956 made many bold predictions about what cognitive feats machines would soon be able to do, we have so frequently been awful at predicting the future of AI, and so often disappointed when bold claims fail to materialize, that the busts that typically follow the periods of hype have a name: the AI winter.
We’ve had several periods of AI hype followed by AI winters. We are almost certainly in a period of hype right now, which means that, to borrow a phrase from George R. R. Martin, winter is coming. AI is not the only human pursuit that exhibits this pattern, and this boom and bust cycle might be unavoidable. But it is unfortunate because the hyperbole of the boom, and the pessimism of the bust, are distractions from steady progress on a tool that is becoming more and more useful by the day.
I fear that the next AI winter will have disproportionate impact on the parts of AI that have the most promise for broad, beneficial impact on society, mostly because we are underinvesting in those things right now. The big institutions funding AI right now—big tech companies and the Chinese government—have a deep understanding of the value that AI can create, and will almost certainly continue to fund it at high levels even when the public hype bubble bursts. AI winters of the past, however, have resulted in dried-up funding for public research, fewer folks pursuing advanced degrees in AI, and small companies going out of business. If we believe that democratization of AI—making it a tool that almost anyone can use to achieve their goals—is important, and that encouraging the use of AI for the public good is necessary, then an AI winter would be bad, and we should try to avoid the next one.
The causes for boom and bust cycles are as complicated as humans themselves. I do, however, believe that with artificial intelligence, there are a handful of things contributing to the ups and downs of the field. The first is the false equivalence that folks may assume between the mechanism of artificial, and that of human, intelligence. A brain and a piece of software attempting to mimic some aspect of human behavior are two very different things. A human brain contains about 100 billion neurons, 100–500 trillion connections between them, and runs on about 25 watts of power. The largest neural networks, as of 2018, are ten thousand times smaller than a brain and require many orders of magnitude more energy. Moreover, these networks, which are much smaller and much less energy efficient than brains, perform computations that are mere fragments of full human intelligence.
Perhaps even more vexing than the temptation to conflate the brain with the digital hardware and software that powers AI is that the very notion of human intelligence is so ill-defined that analogizing it with artificial intelligence can be tricky. Ken Richardson writes in Genes, Brains, and Human Potential:
Intelligence is viewed as the most important ingredient of human potential. But there is no generally accepted theoretical model of what it is (in the way that we have such models of other organic functions). Instead psychologists have adopted physical metaphors: mental speed, energy, power, strength, and so on, together with simple genetic models of how it is distributed in society.
Richardson, and a growing number of scholars, believe that standard measures of intelligence, e.g., IQ tests, are poor indicators of human potential. Indeed, many of the things that we have traditionally believed to be indicators of high human intelligence, e.g., mastering strategy games like chess or Go, reading passages of text and being able to answer questions about them, translating between human languages, etc., are things that machines are readily able to do. Contrariwise, there are tasks that nearly any human toddler can do that are formidable, unresolved challenges for machines. In one famous behavioral experiment, with a toddler watching, a person with an armful of stuff walks toward a cabinet with closed doors, and feigns frustration at not being able to put the items away because they don’t have a free hand to open the door. The toddler watches, gets up, and opens the door for the adult. This simple act of commonsense reasoning that is within the capabilities of a human child too young to even speak would be challenging for existing machine learning systems to do.
Unfortunately, making a false equivalence between human and artificial intelligence is an easy thing for us to do. I catch myself doing it all the time. Whenever I see a surprising new AI achievement, I tend to immediately start asking, If it can do that, then what else might be possible? And depending on whether I’m in an optimistic or pessimistic mood, I can go down the path of Damn, that’s really exciting, or Oh no, that’s pretty scary. Both paths are usually fruitless and tend to lead to bad predictions, which at best is a waste of time, and at worst can be a distraction from better, more practical paths that deserve to be explored.
When I’m gripped by excitement with AI, whether positive or negative, I try very hard to calm down and ask myself what it is that I’m excited about. If it’s about the implications of what I’m seeing, then the very next question I ask is whether I really understand what’s going on beneath the surface of the new thing. I’ve been excited so often about so many things that led nowhere that I try my best to assume nothing.
Renowned futurist and science fiction author Arthur C. Clarke once said that any sufficiently advanced technology is indistinguishable from magic. Even though I love Clarke’s work, and I understand that humans have a long history of ascribing magical or mystical qualities to complex things they don’t understand, I’m not a big fan of this way of thinking. AI is advanced, and indeed complex. I spend most of my time every day working with some of the most accomplished AI experts in the world, and have been building machine learning systems for over fifteen years. And I can say without any shame that I’m nowhere close to understanding all there is to understand about AI. (Many of my colleagues will happily attest to that fact!)
But AI is a human work of science and engineering. Despite its complexity, it is within our ability to understand it. We should not ascribe magical or mystical properties to AI technology just because the complexity is daunting. AI is like any other technical discipline, with a bunch of folks focused on the theory of how to build intelligent machines, a bunch of folks working on experiments to refine theory, and a bunch of folks trying to build things of practical use based on our theoretical understanding of what is and isn’t possible. In the science of the natural world, you might observe an apple falling from a tree. You may wonder what caused the apple to fall. Your first impulse as a scientist would be to observe apples falling as carefully as you are able, to do some experiments, and to discern patterns in your observations. You then formulate a theory of why the apple falls. Like all good scientists you want your theory to be as general as possible, i.e., to work for things other than apples being pulled toward the surface of the Earth. You then do some more experiments to try to prove or disprove your theory. When you’re confident that your theory is a good model of some aspect of the natural world, you share it with others who poke at it, run their own experiments, and refine it in ways large and small. As we get more and more confident that the theory is a good model, we start using it to build things, from brand-new theories to technology that is powered by or dependent upon the theory.
This is exactly how the development of AI works. Researchers and practitioners in the field have ideas about how to imbue software with intelligence. Like Newton’s gravity, we don’t at first know what mechanism is behind the aspect of intelligence that we’re trying to model. We postulate some theories about what’s going on that we can then model with math and computation. Some of our theories prove to be good models of certain types of intelligent behavior, good enough at least for us to have been building useful AI software for decades, with increasing utility in recent years. Nothing magical at all is going on.
Approaching AI with scientific rigor is the only real way to have agency over such a complex system. Ascribing magical properties to AI, appealing to its metaphysical qualities, or otherwise surrendering to its complexity not only puts you a bit further away from conversations that can influence the development of AI, but also sells you short. None of this stuff is impossible to understand.
That doesn’t mean that you need to make yourself an AI expert. It just means that when you see someone making claims about AI, you should have faith in your ability to challenge those assertions, and to dig deeper to understand.
All the way at the bottom of the many layers of abstraction upon which modern AI is built are a set of very complex principles and technologies. There are tens of thousands of researchers and engineers around the world attempting to push the frontiers of AI forward, and the body of work that they are creating is so vast; requires such a deep degree of expertise across so many disciplines of computer science, mathematics, and neuroscience; and is changing so rapidly that it’s a challenge even for the folks in the trenches to stay fully up-to-date.
Case in point: as I am writing this in December 2018, one of the most prestigious AI conferences, the Conference on Neural Information Processing Systems, known more commonly as NeurIPS, has just been held in Montreal. One of the best paper awards at the conference went to a talented group of researchers from the University of Toronto for “Neural Ordinary Differential Equations,” whose abstract reads:
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
This was an unexpected result, a brilliant connection of the area of AI called deep learning with the much more mature discipline of numerical methods, and it may have a big impact on the way that we think about training certain types of deep neural networks.
This level of detail is necessary if you are one of the scientists or researchers trying to push the frontier of AI fundamentals forward. It’s also the case, at this level of detail, that bullshit is no match for scientific scrutiny, well-designed experiments, and mathematical rigor. In the labs and offices, we are trying to build an intellectual fortress to support very precise ideas that will either become the next breakthrough like “Neural Ordinary Differential Equations” or send us back to the drawing board. Unverifiable opinions and untestable speculation, regardless of the reputation of the speculator and the degree of their conviction, are irrelevant.
I interact with a lot of folks at many different levels of AI knowledge and expertise, and my own AI knowledge and expertise has gone up and down over time as the field has advanced. When trying to wrap your head around all the information coming at you about AI, it’s essential to know your level of expertise, and the level of expertise of the person who is espousing an opinion or asserting a point of view.
The most confident I’ve ever been about machine learning systems was when I was directly involved in building them on a day-to-day basis. On those systems I knew exactly what they could and couldn’t do, was surrounded by other experts, and spent a considerable amount of time tracking the state of the art in the areas of AI related to my project. I was deep, but narrow. If I was telling you something about how my system or a system like it worked, it would have been a good bet to pay attention.
Even when I was that deep and narrow, there were many folks more technically expert than I was. These folks were actively working to advance the state of the art in some aspect of artificial intelligence. These folks typically had a PhD in computer science or mathematics, or an equivalent degree of intense, focused experience. The field is complex and moving fast, so becoming a technically expert specialist is partially about getting to a high degree of accomplishment in all the scientific, mathematical, and engineering methods and tools one needs to practice the craft at the very highest level. And it’s partially about staying on top of the incredible volume of work that other specialists in your area are producing.
Specifically, at the very frontier of a discipline, abstractions tend to form, then break, and then be re-formed at a rapid pace. You must understand not just the torrent of new abstractions that your colleagues are producing, but also the tools by which these new abstractions are formed. Obviously you must have a high degree of facility with these tools so that you can produce your own innovations, but also so you can help to scrutinize the work that others are producing.
The “Neural Ordinary Differential Equations” paper is a very good example of work produced by technically expert specialists. The new idea that they produced, the new abstraction, so to speak, was modeling the interior of deep neural networks as a system of ordinary differential equations. If you don’t understand what that means, don’t worry. It’s something that had not appeared in the literature for deep neural networks before, and required a group of researchers who were not just immersed in the current state of the art in their own area of AI, but also had the necessary sophistication in mathematics and in the adjacent field of numerical methods to make a new connection and apply a set of classic techniques in a new and interesting way.
Being a technically expert specialist in one area of AI does not necessarily mean that you are able to keep up with everything going on across the entire field. That’s an impossibly difficult task for any individual, given the volume of work happening in AI now. Jumping sideways into another area of AI may require a technically expert specialist to spend significant amounts of time ramping up on the current research in that new area, and potentially to invest in learning a new set of tools being used by the specialists in that area.
Even though I’ve built and overseen many big machine learning projects that have operated at massive scale, I’ve had periods over the past six years where I have gotten embarrassingly out-of-date as my work became less about day-to-day AI execution, with the field heedlessly moving forward at a blistering pace. For instance, I failed to stay current on deep learning as I was growing LinkedIn’s engineering team from 250 engineers to 3,100 engineers from 2011 to 2016. When I jumped back in where AI again became a day-to-day part of my work, it was shocking how much I had to catch up on.
When I’m trying to make sense of someone’s AI pontifications, I typically look at a person’s breadth of AI focus, the fraction of their time they’re spending on AI, and the specificity of the point of view that they’re expressing. If the person is narrowly focused, spending most of their time on AI, and is expressing specific points of view related to their work, then it’s far more likely that what they’re saying is going to be trustworthy and insightful rather than some super-broad generalization from a person who is very far away from hands-on and spends only a tiny chunk of their time doing AI.
This might seem obvious, particularly given that it is also true for things other than AI. If you have a heart condition, you would probably trust your cardiologist for advice over your general practitioner, and your general practitioner for sure over someone who watches a lot of medical dramas. It’s worth saying, though, because the frequently assumed similarities between natural and artificial intelligence invite some very broad, very confident assertions that are so out there that there’s not even a way to put them into a proper scientific framework where a reasonable conversation can be had.
This is not to say that if you’re not in the trenches, there’s no way for you to connect to the AI conversation. Even folks in the trenches need the ability to connect to the broader conversation about AI. The field is advancing so quickly on multiple fronts, and not everyone in AI is in the same trench. It’s nearly impossible to stay on top of everything. Fortunately, you don’t need to operate at the lowest levels of detail in order to use modern AI. It’s a near certainty that anyone reading this book interacts daily with technology that incorporates varying degrees of AI. An increasing number of software developers at all levels can use higher-level abstractions to incorporate AI into the technology that they’re building. And as we’ve already observed, some of the most promising developments in AI may put the capability to build and create with AI into the hands of folks with no prior programming experience.
Perhaps the best practical advice I have for anyone trying to understand what impacts AI is having on their lives is to try to understand who is profiting from it. Who is getting paid? Since the boom that thrust the consumer Internet into all our lives, the deal that we’ve had with most Internet companies is that we get to use their services for free in exchange for their gathering data about us that is subsequently used to create more engaging products and services, to grow those services more quickly, and to target ads to us to which we are more likely to respond. These companies make more money as they gain more users, as users use their products more, and as they are able to show you more and more compelling ads. I’ve worked for and helped build two companies that make most of their revenue this way.
This is not necessarily a bad bargain. We just all need to be aware that there are machine learning systems at the core of most of the ad-supported applications. As we’ve seen, machine learning systems are very good at optimizing exactly what you teach them to optimize. If the machine learning systems only optimize for more users, more engagement, and making more money, that is exactly what you get, along with whatever side effects that they bring.
The next time you are navigating the web or using your favorite application, ask yourself why you are doing the things that you are doing. Why are you clicking on that link, taking your time to read a piece of content, liking that friend’s post, or sharing that article? Why were you shown what you were shown, and who profits from your interacting with it? Are you getting fair value in return? These are perfectly reasonable questions for you to ask, ones that you wouldn’t hesitate to ask in other parts of your life where you are transacting business. When the party on the other end of a transaction is a machine learning algorithm, it’s probably even more important that you demand to understand what’s going on than you would when a salesperson is pitching you on a purchase, or a telemarketer is trying to talk you into something.
This isn’t to suggest that anything sinister is going on. Without this business model many of the great things that enrich all of our lives, like search, for instance, would not exist. I use “free” Internet services all the time, fully understanding that the price I pay for using them is my data. Most of the time that’s fine. And when it isn’t, I try to pay for the service, tune my settings to provide some acceptable degree of privacy, or find an alternative to the service. Mostly I try to be constantly aware of what I am consuming or acting upon, and ask myself whether I’m getting fair value in return and whether I’m in control or being manipulated. When I believe that the value exchange is off, or that I’m being deliberately or inadvertently manipulated, I change my behavior.
Judy Estrin and Sam Gill have introduced the notion of digital pollution25 as a mechanism for understanding and dealing with side effects of algorithmic and machine learning systems that try to extract business value from human attention. They write:
Digital pollution is more complicated than industrial pollution. Industrial pollution is the by-product of a value-producing process, not the product itself. On the Internet, value and harm are often one and the same. . . . The complex task of identifying where we might sacrifice some individual value to prevent collective harm will be crucial to curbing digital pollution. Science and data inform our decisions, but our collective priorities should ultimately determine what we do and how we do it.
There may be other constructs for reasoning about and addressing the side effects of attention-oriented, ML-powered business models, but this is a good start. We have clearly arrived at a moment where we can no longer pretend that there are no side effects and accept inaction as an appropriate response. Some people fear the specter of an AI that develops superhuman intelligence, and that poses an existential threat to humanity like Terminator’s Skynet. I’m far less worried about this than I am by the unintended side effects of completely explicable machine learning algorithms innocently pointed at perfectly reasonable business objectives, like making a product as engaging as possible.
One late summer morning, Dario Amodei addressed a small gathering of AI scientists at our Redmond labs in Building 99. Dario leads AI safety research at OpenAI, then a nonprofit AI research company that described itself as “discovering and enacting the path to safe AI.” It was originally funded by individuals like Elon Musk and Reid Hoffman as well as by tech companies. In the summer of 2019, I helped to lead a multiyear, $1 billion investment in OpenAI to build a platform to create new AI technologies and deliver on the promise of artificial general intelligence. The title of Dario’s presentation caught my eye: “AI Safety through Integrating Humans in the Training Loop.”
“Reinforcement learning systems have made great advances in optimizing fixed, well-defined reward functions (as in games like Dota or Go), but are less well-equipped to pursue complex goals that embody human values or judgments,” he wrote. This inability to translate human intentions into behavior (or the very slow loop of doing so through mathematical specification of reward functions) can lead to unintended consequences and practical safety problems. His research focuses on learning from human preferences, or human-in-the-loop training.
Amodei points out that failures in AI can generally be pegged to one of three problems: a problem with the algorithm; a problem somewhere in the software stack; or a problem with the objective function, meaning the AI was trained for the wrong reward. He demonstrated by showing a video game his team had programmed in which a speedboat was meant to learn to race laps around a water course. The failure was immediately obvious. The boat veered this way and that, crashing into things and even catching on fire, but it was racking up points—rewards—because that’s what the developer had mistakenly prioritized in his or her reinforcement learning model. The AI boat had learned how to get a turbo boost and create fires, which accrued lots of points. It was going around the course racking up points in any way it could. He later showed a robot arm that was trained to move a puck from one end of a table to a target on the other end. As the puck approached the target, the robot arm would nudge the table slightly to achieve the goal. Not optimal, but it worked.
The rewards didn’t correspond with what was really wanted. Therefore, a human was really needed in the loop.
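To illustrate the point without pretending to reproduce Dario's demo, here is a toy sketch of two candidate reward functions scoring the same boat behaviors. The states and numbers are invented; the lesson is that an agent paid only in points has no reason to finish laps.

```python
# A toy illustration of a mis-specified objective: two reward functions
# scoring the same made-up boat states.
def reward_points_only(state):
    # pays for points however they are earned
    return state["points"]

def reward_progress(state):
    # pays for actual progress around the course and penalizes crashes
    return 10.0 * state["laps_completed"] - 1.0 * state["crashes"]

spinning_in_circles = {"points": 900, "laps_completed": 0, "crashes": 12}
racing_properly = {"points": 300, "laps_completed": 3, "crashes": 0}

for name, state in [("spinning", spinning_in_circles), ("racing", racing_properly)]:
    print(name, reward_points_only(state), reward_progress(state))
# Under the points-only reward, the degenerate behavior looks "better"
# (900 vs. 300); under the progress reward, it does not (-12 vs. 30).
```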
The type of AI that OpenAI is pursuing is called artificial general intelligence (AGI), sometimes known as strong AI. It is the type of AI that the founders of artificial intelligence sought to create at the birth of AI at the Dartmouth workshop in 1956, AI that would be indiscernible from human intelligence for arbitrary cognitive tasks. Even though the dream of AGI has existed since at least 1956, probably earlier, perhaps even back to antiquity, progress has been extremely slow and frustrating. We’ve had more instances of thinking we were close, only to have those hopes dashed, than we have had major breakthroughs. But with reinforcement learning and unsupervised learning, a good chunk of the world’s knowledge represented in one form or another on the Internet, and computing power increasing at a blistering clip, folks are more hopeful about the feasibility of AGI in the near future than perhaps they have been since the fifties and sixties.
Getting to AGI is no sure bet, though, and opinions about when or if we’re likely to achieve AGI are widely varied even among experts. The most optimistic credible estimates are five to ten years, and the most pessimistic are never. Anyone who tells you with absolute certainty when AGI will be here is most likely wrong.
Why is making this prediction so hard? One reason is that AGI’s goal of emulating human intelligence is one of the most complex problems that humans have ever attempted. Unfortunately, our definitions of intelligence itself are imperfect and not terribly helpful when trying to define what AGI should or could be. Perhaps more important, how the brain works, how it manifests human intelligence, is still poorly understood. We do understand some of the microscopic structures of the brain. We know what a neuron is and what axons, dendrites, and synapses are. We know what a neuroreceptor is. We understand how these microscopic structures function in isolation and some of how they interact in proximity with one another. We understand some of the neurochemistry of the brain. We know how neurons use these structures and chemistry to transmit signals from themselves to other connected neurons. We’ve understood some of this for a while and, in fact, that understanding informed the development of artificial neural networks back in the 1940s.
The thing we don’t understand as well is how all these microscopic structures of the brain come together in a way that results in human intelligence. With approximately a hundred billion neurons in a human brain and an average neuron connecting to a thousand other neurons, it may be a long time before we’ve fully characterized brain function and worked our way to an understanding of human intelligence from a neurobiological perspective.
Even if you were to believe that artificial neural networks are precise digital equivalents of biological ones—which you shouldn’t—and even if you had the ability to train a hundred-trillion-synapse (parameter) DNN—which we don’t, yet—AGI is almost certainly not as simple as training such a large DNN using the world’s knowledge. This large DNN still would not possess an equivalent neural structure to a human’s, a structure evolved over very long time horizons that we don’t fully understand.
This doesn’t mean that the pursuit of AGI is pointless. Quite the contrary: the problem itself is so fascinating that we continue to make attempts at solving it even after six decades of slow progress. The two big organizations attempting to achieve AGI, OpenAI and DeepMind, are making incredibly useful discoveries as they pursue their longer-term goal. Many of these discoveries are being shared with the rest of the AI community in the form of open source software and research papers, which is helping to accelerate other AI efforts unrelated to AGI. Some of these discoveries will have commercial utility and could show up in products and services that benefit us. The quest for AGI itself might also prove to be an incredibly useful way for scientists to better understand the mechanisms of human intelligence.
As a platform company, Microsoft is interested in providing the best possible tools and infrastructure to AI researchers and practitioners, no matter how ambitious their goals. As I am writing in 2019, we are not actively attempting to build our own artificial general intelligence systems. But we are proud to partner with folks who are working on AGI, and I am intensely curious about the process of discovery they are on. Their ambition helps us to invest in more and better AI platform capabilities that we can then make available to everyone.
Many have expressed concern about AGI, although the peak of that worry, at least for the time being, seems to have occurred in 2018. Elon Musk and Stephen Hawking, for instance, have believed that AGI is an existential threat to humanity, and Musk has been publicly calling for strict regulation for several years now. What exactly that strict regulation would look like is unclear. But the fear and anxiety that the very thought of AGI provokes in some is crystal clear.
Despite these concerns, I do believe that we need to think about any potential regulation of AGI very carefully. No one knows exactly what steps would be required to achieve AGI, much less a malevolent or destructively indifferent one. If, for instance, we wanted to regulate AGI like we do weapons, then given the current state of things and our present knowledge of how AGI might work, we could very quickly create a regulatory regime that would not only make it more difficult to reap the many positive benefits of AGI for fear of the bad, but whose blast radius could also make it harder to do ordinary AI and even normal software development.
Why? Sticking with the weapons analogy, we regulate the development and use of arms, like nuclear weapons, by making it illegal to possess them, and by tightly controlling the key ingredients for making them. That’s not the easiest thing to do, but arms are physical things, and it’s reasonably well understood how to construct them. That means that we can write careful regulation, set up monitoring processes, enter into treaties, etc., where detection and enforcement are well-understood and narrowly scoped to preventing the specific bad things we want to prevent.
AGI, on the other hand, isn’t a physical thing, and the ingredients for making one are human ideas turned into code. Moreover, we don’t even know which ideas are going to be the ones that lead to AGI, much less the full scope of what bad stuff an AGI could do that we would want to prevent. Once you begin to think about how to regulate the wrong ideas and our ability to write those ideas down as code, a whole bunch of things that today look like free expression could become illegal. That’s not to say there should be no regulation. We should simply be very, very thoughtful about what it would look like.
So how concerned about AGI should we be? And how do we place those concerns into the broader context of immediate things that we should be worried about with mainstream AI, about the things that are happening and that we have very high confidence will be happening soon?
I do think that it’s prudent and reasonable for us to guard against the creation of an AGI that would harm humans or, more strongly, would do anything other than serve the best interests of our species. Codifying what exactly that means would be tough. Isaac Asimov, the futurist and science fiction author, is famous for his three laws of robotics: a robot may not injure a human being or, through inaction, allow a human being to come to harm; a robot must obey the orders given to it by human beings except where such orders would conflict with the first law; and a robot must protect its own existence as long as such protection does not conflict with the first or second law.
Even though it might be an oversimplification, I believe that we need the equivalent of Asimov’s Laws for AGI. I also believe that OpenAI may currently have the best, simplest articulation of what good AGI governance could look like. If you look at the OpenAI Charter,26 the research lab’s board commits to specific actions to ensure that the group’s work on AGI has broadly distributed benefits, a specific focus on long-term safety, investment in technical excellence, and a cooperative orientation. At the time of this writing, they have taken several actions that are consistent with this charter. They have a new financial structure called a public benefit corporation that allows them to fund their charter at high levels, but that caps investor and employee returns at a fixed level, and transfers returns above and beyond those levels to a nonprofit organization with public governance to ensure that most of the value of a realized AGI would go to the public. Their safety review committee has also recently elected not to publicly distribute a new language model given its potentially dangerous uses, including the creation of fake news.