CHAPTER 1

Seven Definitions of Learning

What does “learning” mean? My first and most general definition is the following: to learn is to form an internal model of the external world.

You may not be aware of it, but your brain has acquired thousands of internal models of the outside world. Metaphorically speaking, they are like miniature mock-ups more or less faithful to the reality they represent. We all have in our brains, for example, a mental map of our neighborhood and our home—all we have to do is close our eyes and envision them with our thoughts. Obviously, none of us were born with this mental map—we had to acquire it through learning.

The richness of these mental models, which are, for the most part, unconscious, exceeds our imagination. For example, you possess a vast mental model of the English language, which allows you to understand the words you are reading right now and guess that plastovski is not an English word, whereas swoon and wistful are, and dragostan could be. Your brain also includes several models of your body: it constantly uses them to map the position of your limbs and to direct them while maintaining your balance. Other mental models encode your knowledge of objects and your interactions with them: knowing how to hold a pen, write, or ride a bike. Others even represent the minds of others: you possess a vast mental catalog of people who are close to you, their appearances, their voices, their tastes, and their quirks.

These mental models can generate hyper-realistic simulations of the universe around us. Did you ever notice that your brain sometimes projects the most authentic virtual reality shows, in which you can walk, move, dance, visit new places, have brilliant conversations, or feel strong emotions? These are your dreams! It is fascinating to realize that all the thoughts that come to us in our dreams, however complex, are simply the product of our free-running internal models of the world.

But we also dream up reality when awake: our brain constantly projects hypotheses and interpretative frameworks on the outside world. This is because, unbeknownst to us, every image that appears on our retina is ambiguous—whenever we see a plate, for instance, the image is compatible with an infinite number of ellipses. If we see the plate as round, even though the raw sense data picture it as an oval, it is because our brain supplies additional data: it has learned that the round shape is the most likely interpretation. Behind the scenes, our sensory areas ceaselessly compute with probabilities, and only the most likely model makes it into our consciousness. It is the brain’s projections that ultimately give meaning to the flow of data that reaches us from our senses. In the absence of an internal model, raw sensory inputs would remain meaningless.

Learning allows our brain to grasp a fragment of reality that it had previously missed and to use it to build a new model of the world. It can be a part of external reality, as when we learn history, botany, or the map of a city, but our brain also learns to map the reality internal to our bodies, as when we learn to coordinate our actions and concentrate our thoughts in order to play the violin. In both cases, our brain internalizes a new aspect of reality: it adjusts its circuits to appropriate a domain that it had not mastered before.

Such adjustments, of course, have to be pretty clever. The power of learning lies in its ability to adjust to the external world and to correct for errors—but how does the brain of the learner “know” how to update its internal model when, say, it gets lost in its neighborhood, falls from its bike, loses a game of chess, or misspells the word ecstasy? We will now review seven key ideas that lie at the heart of present-day machine-learning algorithms and that may apply equally well to our brains—seven different definitions of what “learning” means.

LEARNING IS ADJUSTING THE PARAMETERS OF A MENTAL MODEL

Adjusting a mental model is sometimes very simple. How, for example, do we reach out to an object that we see? In the seventeenth century, René Descartes (1596–1650) had already guessed that our nervous system must contain processing loops that transform visual inputs into muscular commands (see the figure on the next page). You can experience this for yourself: try grabbing an object while wearing somebody else’s glasses, preferably someone who is very nearsighted. Even better, if you can, get a hold of prisms that shift your vision a dozen degrees to the right and try to catch the object.1 You will see that your first attempt is completely off: because of the prisms, your hand reaches to the right of the object that you are aiming for. Gradually, you adjust your movements to the left. Through successive trial and error, your gestures become more and more precise, as your brain learns to correct the offset of your eyes. Now take off the glasses and grab the object: you’ll be surprised to see that your hand goes to the wrong location, now way too far to the left!

So, what happened? During this brief learning period, your brain adjusted its internal model of vision. A parameter of this model, one that corresponds to the offset between the visual scene and the orientation of your body, was set to a new value. During this recalibration process, which works by trial and error, what your brain did can be likened to what a hunter does in order to adjust his rifle’s viewfinder: he takes a test shot, then uses it to adjust his scope, thus progressively shooting more and more accurately. This type of learning can be very fast: a few trials are enough to correct the gap between vision and action. However, the new parameter setting is not compatible with the old one—hence the systematic error we all make when we remove the prisms and return to normal vision.

What is learning? To learn is to adjust the parameters of an internal model. Learning to aim with one’s finger, for example, consists of setting the offset between vision and action: each aiming error provides useful information that allows one to reduce the gap. In artificial neural networks, although the number of settings is much larger, the logic is the same. Recognizing a character requires the fine-tuning of millions of connections. Again, each error—here, the incorrect activation of the output “8”—can be back-propagated and used to adjust the values of the connections, thus improving performance on the next test.

Undeniably, this type of learning is a bit unusual, because it requires the adjustment of only a single parameter (the viewing angle). Most of our learning is much more elaborate and requires adjusting tens of millions, hundreds of millions, or even billions of parameters (every synapse in the relevant brain circuit). The principle, however, is always the same: it boils down to searching, among myriad possible settings of the internal model, for those that best correspond to the state of the external world.

An infant is born in Tokyo. Over the next two or three years, its internal model of language will have to adjust to the characteristics of the Japanese language. This baby’s brain is like a machine with millions of settings at each level. Some of these settings, at the auditory level, determine which inventory of consonants and vowels is used in Japanese and the rules that allow them to be combined. A baby born into a Japanese family must discover which phonemes make up Japanese words and where to place the boundaries between those sounds. One of the parameters, for example, concerns the distinction between the sounds /R/ and /L/: this is a crucial contrast in English, but not in Japanese, which makes no distinction between Bill Clinton’s election and his erection. . . . Each baby must thus fix a set of parameters that collectively specify which categories of speech sounds are relevant for his or her native language.

A similar learning procedure is duplicated at each level, from sound patterns to vocabulary, grammar, and meaning. The brain is organized as a hierarchy of models of reality, each nested inside the next like Russian dolls—and learning means using the incoming data to set the parameters at every level of this hierarchy. Let’s consider a high-level example: the acquisition of grammatical rules. Another key difference between Japanese and English that the baby must learn concerns the order of words. In a canonical sentence with a subject, a verb, and a direct object, the English language first states the subject, then the verb, and finally its object: “John + eats + an apple.” In Japanese, on the other hand, the most common order is subject, then object, then verb: “John + an apple + eats.” What is remarkable is that the order is also reversed for prepositions (which logically become postpositions), possessives, and many other parts of speech. The sentence “My uncle wants to work in Boston” thus becomes mumbo jumbo worthy of Yoda from Star Wars: “Uncle my, Boston in, work wants”—which makes perfect sense to a Japanese speaker.

Fascinatingly, these reversals are not independent of one another. Linguists think that they arise from the setting of a single parameter called the “head position”: the defining word of a phrase, its head, is always placed first in English (in Paris, my uncle, wants to live), but last in Japanese (Paris in, uncle my, live wants). This binary parameter distinguishes many languages, even some that are not historically linked (the Navajo language, for example, follows the same rules as Japanese). In order to learn English or Japanese, one of the things that a child must figure out is how to set the head position parameter in his internal language model.

LEARNING IS EXPLOITING A COMBINATORIAL EXPLOSION

Can language learning really be reduced to the setting of some parameters? If this seems hard to believe, it is because we are unable to fathom the extraordinary number of possibilities that open up as soon as we increase the number of adjustable parameters. This is called the “combinatorial explosion”—the exponential increase that occurs when you combine even a small number of possibilities. Suppose that the grammar of the world’s languages can be described by about fifty binary parameters, as some linguists postulate. This yields 2^50 combinations, which is over one million billion possible languages, or a 1 followed by fifteen zeros! The syntactic rules of the world’s three thousand languages easily fit into this gigantic space. However, in our brain, there aren’t just fifty adjustable parameters, but an astoundingly larger number: eighty-six billion neurons, each with about ten thousand synaptic contacts whose strength can vary. The space of mental representations that opens up is practically infinite.
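
For readers who like to check the arithmetic, two lines of Python are enough (a purely illustrative calculation, not part of any particular learning algorithm):

    # Fifty binary parameters yield 2**50 possible grammars.
    n_parameters = 50
    n_grammars = 2 ** n_parameters
    print(f"{n_grammars:,}")   # 1,125,899,906,842,624 -- about 1.1 million billion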

Human languages heavily exploit these combinations at all levels. Consider, for instance, the mental lexicon: the set of words that we know and whose model we carry around with us. Each of us has learned about fifty thousand words with the most diverse meanings. This seems like a huge lexicon, but we manage to acquire it in about a decade because we can decompose the learning problem. Indeed, considering that these fifty thousand words are on average two syllables, each consisting of about three phonemes, taken from the forty-four phonemes in English, the binary coding of all these words requires less than two million elementary binary choices (“bits,” whose value is 0 or 1). In other words, all our knowledge of the dictionary would fit in a small 250-kilobyte computer file (each byte comprising eight bits).
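
The same back-of-the-envelope calculation can be spelled out in a few lines of code. The figures below (fifty thousand words, two syllables of about three phonemes each, forty-four phonemes) are the rough averages quoted above, not precise linguistic measurements:

    import math

    words = 50_000
    phonemes_per_word = 2 * 3                      # about two syllables of three phonemes each
    bits_per_phoneme = math.ceil(math.log2(44))    # 6 bits suffice to code one of 44 phonemes
    total_bits = words * phonemes_per_word * bits_per_phoneme
    print(total_bits)                              # 1,800,000 bits: indeed less than two million
    print(total_bits / 8 / 1000, "kilobytes")      # 225.0 kilobytes: a small file indeed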

This mental lexicon could be compressed to an even smaller size if we took into account the many redundancies that govern words. Drawing six letters at random, like “xfdrga,” does not generate an English word. Real words are composed of a pyramid of syllables that are assembled according to strict rules. And this is true at all levels: sentences are regular collections of words, which are regular collections of syllables, which are regular collections of phonemes. The combinations are both vast (because one chooses among several tens or hundreds of elements) and bounded (because only certain combinations are allowed). To learn a language is to discover the parameters that govern these combinations at all levels.

In summary, the human brain breaks down the problem of learning by creating a hierarchical, multilevel model. This is particularly obvious in the case of language, from elementary sounds to the whole sentence or even discourse—but the same principle of hierarchical decomposition is reproduced in all sensory systems. Some brain areas capture low-level patterns: they see the world through a very small temporal and spatial window, thus analyzing the smallest patterns. For example, in the primary visual area, the first region of the cortex to receive visual inputs, each neuron analyzes only a very small portion of the retina. It sees the world through a pinhole and, as a result, discovers very low-level regularities, such as the presence of a moving oblique line. Millions of neurons do the same work at different points in the retina, and their outputs become the inputs of the next level, which thus detects “regularities of regularities,” and so on and so forth. At each level, the scale broadens: the brain seeks regularities on increasingly vast scales, in both time and space. From this hierarchy emerges the ability to detect increasingly complex objects or concepts: a line, a finger, a hand, an arm, a human body . . . no, wait, two, there are two people facing each other, a handshake. . . . It is the first Trump-Macron encounter!

LEARNING IS MINIMIZING ERRORS

The computer algorithms that we call “artificial neural networks” are directly inspired by the hierarchical organization of the cortex. Like the cortex, they contain a pyramid of successive layers, each of which attempts to discover deeper regularities than the previous one. Because these consecutive layers organize the incoming data in deeper and deeper ways, they are also called “deep networks.” Each layer, by itself, is capable of discovering only an extremely simple part of the external reality (mathematicians speak of a linearly separable problem, i.e., each neuron can separate its input data into only two categories, A and B, by drawing a straight line between them). Assemble many of these layers, however, and you get an extremely powerful learning device, capable of discovering complex structures and adjusting to very diverse problems. Today’s artificial neural networks, which take advantage of the advances in computer chips, are also deep, in the sense that they contain dozens of successive layers. These layers become increasingly insightful and capable of identifying abstract properties the further away they are from the sensory input.

Let’s take the example of the LeNet algorithm, created by the French pioneer of neural networks, Yann LeCun (see figure 2 in the color insert).2 As early as the 1990s, this neural network achieved remarkable performance in the recognition of handwritten characters. For years, Canada Post used it to automatically process handwritten postal codes. How does it work? The algorithm receives the image of a written character as an input, in the form of pixels, and it proposes, as an output, a tentative interpretation: one out of the ten possible digits or twenty-six letters. The artificial network contains a hierarchy of processing units that look a bit like neurons and form successive layers. The first layers are connected directly with the image: they apply simple filters that recognize lines and curve fragments. The layers higher up in the hierarchy, however, contain wider and more complex filters. Higher-level units can therefore learn to recognize larger and larger portions of the image: the curve of a 2, the loop of an O, or the parallel lines of a Z . . . until we reach, at the output level, artificial neurons that respond to a character regardless of its position, font, or case. None of these properties is imposed by a programmer: they all result entirely from the millions of connections that link the units. These connections, once adjusted by an automated algorithm, define the filter that each neuron applies to its inputs: their settings explain why one neuron responds to the number 2 and another to the number 3.
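
For readers curious about what such a hierarchy looks like in practice, here is a minimal sketch in Python, written with the PyTorch library. It follows the spirit of LeCun’s architecture (small convolutional filters, then wider ones, then a classifier with ten outputs), but the layer sizes are my own illustrative choices, not those of the original LeNet:

    import torch
    import torch.nn as nn

    class TinyLeNet(nn.Module):
        """A LeNet-style hierarchy: convolutional filters, then pooling, then a classifier."""
        def __init__(self, n_classes: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, kernel_size=5),   # low-level filters: lines and curve fragments
                nn.ReLU(),
                nn.MaxPool2d(2),                  # shrink the image, widen the receptive field
                nn.Conv2d(6, 16, kernel_size=5),  # higher-level filters: loops, junctions
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 4 * 4, 120),       # 16 feature maps of 4 x 4 remain after pooling
                nn.ReLU(),
                nn.Linear(120, n_classes),        # one output unit per digit
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    # A batch of one 28 x 28 grayscale image (say, a handwritten digit) yields ten scores.
    scores = TinyLeNet()(torch.zeros(1, 1, 28, 28))
    print(scores.shape)   # torch.Size([1, 10])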

How are these millions of connections adjusted? Just as in the case of prism glasses! On each trial, the network gives a tentative answer, is told whether it made an error, and adjusts its parameters to try to reduce this error on the next trial. Every wrong answer provides valuable information. With its sign (like a gesture too far to the right or too far to the left), the error tells the system what it should have done in order to succeed. By going back to the source of the error, the machine discovers how the parameters should have been set to avoid the mistake.

Let’s revisit the example of the hunter adjusting his rifle’s scope. The learning procedure is elementary. The hunter shoots and finds he’s aimed five centimeters too far to the right. He now has essential information, both on the amplitude (five centimeters) and on the sign of the error (too far to the right). This information allows him to correct his shot. If he is a bit clever, he can infer in which direction to make the correction: if the bullet has deflected to the right, he should shift the scope one hair to the left. Even if he’s not that astute, he can try a slightly different setting, more or less at random, and see whether turning the scope to the right makes the offset larger or smaller. In this manner, through trial and error, the hunter can progressively discover which adjustment reduces the size of the gap between his intended target and his actual shot.

In modifying his sight to maximize his accuracy, our brave hunter is applying a learning algorithm without even knowing it. He is implicitly calculating what mathematicians call the “derivative,” or gradient, of the system, and is using the “gradient descent algorithm”: he learns to move his rifle’s viewfinder in the most efficient direction, the one that reduces the probability of making a mistake.
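
The hunter’s strategy can even be written down in a few lines of code. The sketch below runs gradient descent on a single parameter, the offset of the sight; the particular numbers (a twelve-degree offset, a learning rate of 0.3) are arbitrary illustrations of mine:

    # Toy gradient descent: learn a single offset parameter from signed errors.
    true_offset = 12.0      # e.g., the prisms (or a skewed scope) shift the aim by 12 degrees
    correction = 0.0        # the brain's (or the hunter's) current compensation
    learning_rate = 0.3

    for trial in range(20):
        error = true_offset - correction      # signed miss: positive means too far one way
        correction += learning_rate * error   # nudge the parameter in the direction that shrinks the miss
        print(f"trial {trial:2d}: miss = {error:+.2f}")
    # After a handful of trials the miss is near zero. Remove the prisms (set true_offset = 0)
    # and the learned correction now produces an error in the opposite direction.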

Most artificial neural networks used in present-day artificial intelligence, despite their millions of inputs, outputs, and adjustable parameters, operate just like our proverbial hunter: they observe their errors and use them to adjust their internal state in the direction that they feel is best able to reduce the errors. In many cases, such learning is tightly guided. We tell the network exactly which response it should have activated at the output (“it is a 1, not a 7”), and we know precisely in which direction to adjust the parameters if they lead to an error (a mathematical calculation makes it possible to know exactly which connections to modify when the network activates the output “7” too often in response to an image of the number 1). In machine learning parlance, this situation is known as “supervised learning” (because someone, who can be likened to a supervisor, knows the correct answer that the system must give) and “error backpropagation” (because error signals are sent back into the network in order to modify its parameters). The procedure is simple: I try an answer, I am told what I should have answered, I measure my error, and I adjust my parameters to reduce it. At each step, I make only a small correction in the right direction. That’s why such computer-based learning can be incredibly slow: learning a complex activity, like playing Tetris, requires applying this recipe thousands, millions, even billions of times. In a space that includes a multitude of adjustable parameters, it can take a long time to discover the optimal setting for every nut and bolt.
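
To make the recipe concrete, here is a deliberately tiny supervised-learning loop in Python with NumPy: a two-layer network trained by error backpropagation to compute the XOR function, a problem that no single linear unit can solve. Everything about it (eight hidden units, the learning rate, the number of steps) is an illustrative choice of mine, not a description of any system mentioned in this chapter:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Four training examples of XOR, with the correct answers (the "supervisor").
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    n_hidden = 8
    W1 = rng.normal(size=(2, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(size=(n_hidden, 1)); b2 = np.zeros(1)
    lr = 1.0

    for step in range(20_000):
        # Forward pass: the network tries an answer.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: the signed error is propagated back to every connection.
        delta_out = (out - y) * out * (1 - out)
        delta_h = (delta_out @ W2.T) * h * (1 - h)
        # Each connection moves a little, in the direction that reduces the error.
        W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out.sum(axis=0)
        W1 -= lr * X.T @ delta_h;    b1 -= lr * delta_h.sum(axis=0)

    print(out.ravel().round(2))   # typically ends up close to [0, 1, 1, 0]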

The very first artificial neural networks, in the 1980s, were already operating on this principle of gradual error correction. Advances in computing have now made it possible to extend this idea to gigantic neural networks, which include hundreds of millions of adjustable connections. These deep neural networks are composed of a succession of stages, each of which adapts to the problem at hand. For example, figure 4 in the color insert shows the GoogLeNet system, which descended from the LeNet architecture first proposed by LeCun and won one of the most important international image-recognition competitions. Exposed to millions of images, this system learned to separate them into one thousand distinct categories, such as faces, landscapes, boats, cars, dogs, insects, flowers, road signs, and so forth. Each level of its hierarchy has become attuned to a useful aspect of reality: low-level units selectively respond to lines or textures, but the higher you go up in the hierarchy, the more neurons have learned to respond to complex features, such as geometric shapes (circles, curves, stars . . .), parts of objects (a pants pocket, a car door handle, a pair of eyes . . .), or even whole objects (buildings, faces, spiders . . .).3

By trying to minimize errors, the gradient descent algorithm discovered that these forms are the most useful for categorizing images. But if the same network had been exposed to book passages or sheet music, it would have adjusted in a different way and learned to recognize letters, notes, or whichever shapes recur in the new environment. Figure 3 in the color insert, for example, shows how a network of this type self-organizes to recognize thousands of handwritten digits.4 At the lowest level, the data are mixed: some images are superficially similar but should ultimately be distinguished (think of a 3 and an 8), and conversely, some images that look very different must ultimately be placed in the same bin (think of the many versions of the digit 8, with the top loop open or closed, etc.). At each stage, the artificial neural network progresses in abstraction until all instances of the same character are correctly grouped together. Through the error reduction procedure, it has discovered a hierarchy of features most relevant to the problem of recognizing handwritten digits. Indeed, it is quite remarkable that, simply by correcting one’s errors, it is possible to discover a whole set of clues appropriate to the problem at hand.

Today, the concept of learning by error backpropagation remains at the heart of many computer applications. This is the workhorse that lies behind your smartphone’s ability to recognize your voice, or your smart car’s emerging perception of pedestrians and road signs—and it is therefore very likely that our brain uses some version of it. However, error backpropagation comes in various flavors. The field of artificial intelligence has made tremendous advances in the past thirty years, and researchers have discovered many tricks that facilitate learning. We will now review them—as we shall see, they also tell us a lot about ourselves and the way we learn.

LEARNING IS EXPLORING THE SPACE OF POSSIBILITIES

One of the problems with the error correction procedure I just described is that it can get stuck on a set of parameters that is not the best. Imagine a golf ball rolling on the green, always along the line of the steepest slope: it may get stuck in a small depression in the ground, preventing it from reaching the lowest point of the whole landscape, the absolute optimum. Similarly, the gradient descent algorithm sometimes gets stuck at a point that it cannot exit. This is called a “local minimum”: a well in parameter space, a trap from which the learning algorithm cannot escape because it seems impossible to do better. At this moment, learning gets stuck, because all changes seem counterproductive: each of them increases the error rate. The system feels that it has learned all it can. It remains blind to the presence of much better settings, perhaps only a few steps away in parameter space. The gradient descent algorithm does not “see” them because it refuses to go up the hump in order to go back down the other side of the dip. Shortsighted, it ventures only a small distance from its starting point and may therefore miss out on better but distant configurations.

Does the problem seem too abstract to you? Think about a concrete situation: You go shopping at a food market, where you spend some time looking for the cheapest products. You walk down an aisle, pass the first seller (who seems overpriced), avoid the second (who is always very expensive), and finally stop at the third stand, which seems much cheaper than the previous ones. But who’s to say that one aisle over, or perhaps even in the next town, the prices would not be even more enticing? Focusing on the best local price does not guarantee finding the global minimum.

Frequently confronted with this difficulty, computer scientists employ a panoply of tricks. Most of them consist of introducing a bit of randomness in the search for the best parameters. The idea is simple: instead of looking in only one aisle of the market, take a step at random; and instead of letting the golf ball roll gently down the slope, give it a shake, thus reducing its chance of getting stuck in a trough. On occasion, stochastic search algorithms try a distant and partially random setting, so that if a better solution is within reach, they have a chance of finding it. In practice, one can introduce some degree of randomness in various ways: setting or updating the parameters at random, diversifying the order of the examples, adding some noise to the data, or using only a random fraction of the connections—all these ideas improve the robustness of learning.
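
Here is one way to picture the trick in code: plain gradient descent on a bumpy one-dimensional error landscape gets trapped in the nearest dip, whereas restarting the descent from a handful of random points usually ends up in a much deeper valley. The landscape itself is an arbitrary function chosen for the illustration:

    import math
    import random

    random.seed(1)

    def landscape(x):
        """A bumpy error landscape: many local dips, one deep global valley near x = -0.5."""
        return x * x + 10 * math.sin(3 * x)

    def descend(x, lr=0.01, steps=2000):
        """Roll downhill from x, following the local slope (a numerical gradient)."""
        for _ in range(steps):
            slope = (landscape(x + 1e-5) - landscape(x - 1e-5)) / 2e-5
            x -= lr * slope
        return x

    stuck = descend(4.0)                       # a single run from x = 4 settles in a nearby dip
    restarts = [descend(random.uniform(-10, 10)) for _ in range(20)]
    best = min(restarts, key=landscape)        # keep the best of twenty random starting points
    print(round(landscape(stuck), 2), round(landscape(best), 2))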

Some machine learning algorithms also get their inspiration from the Darwinian algorithm that governs the evolution of species: during parameter optimization, they introduce mutations and random crossovers of previously discovered solutions. As in biology, the rate of these mutations must be carefully controlled in order to explore new solutions without wasting too much time on haphazard attempts.

Another algorithm is inspired by blacksmith forges, where craftspeople have learned to optimize the properties of metal by “annealing” it. Applied when one wants to forge an exceptionally strong sword, the method of annealing consists of heating the metal several times, at lower and lower temperatures, to increase the chance that the atoms arrange themselves in a regular configuration. The process has now been transposed to computer science: the simulated annealing algorithm introduces random changes in the parameters, but with a virtual “temperature” that gradually decreases. The probability of a chance event is high at the beginning but steadily declines until the system is frozen into a stable, near-optimal setting.
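
In code, simulated annealing amounts to accepting the occasional change for the worse, with a probability that shrinks as the temperature falls. The sketch below applies it to the same kind of bumpy landscape as in the previous sketch; the cooling schedule and the size of the random steps are arbitrary illustrative choices:

    import math
    import random

    random.seed(2)

    def energy(x):
        """The quantity to be minimized -- think of it as the error of a one-parameter model."""
        return x * x + 10 * math.sin(3 * x)

    x = 8.0                       # start far from the deepest valley
    temperature = 10.0
    while temperature > 0.01:
        candidate = x + random.gauss(0, 1)      # propose a random change of the parameter
        delta = energy(candidate) - energy(x)
        # Always accept improvements; accept a worsening with probability exp(-delta / T).
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            x = candidate
        temperature *= 0.999                    # cool down gradually: fewer and fewer chance events
    print(round(x, 2), round(energy(x), 2))     # usually ends near the deepest valley, around x = -0.5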

Computer scientists have found all these tricks to be remarkably effective—so perhaps it should be no surprise that, in the course of evolution, some of them were internalized in our brains. Random exploration, stochastic curiosity, and noisy neuronal firing all play an essential role in learning for Homo sapiens. Whether we are playing rock, paper, scissors; improvising on a jazz theme; or exploring the possible solutions to a math problem, randomness is an essential ingredient of a solution. As we shall see, whenever children go into learning mode—that is, when they play—they explore dozens of possibilities with a good dose of randomness. And during the night, their brains continue juggling ideas until they hit upon one that best explains what they experienced during the day. In the third section of this book, I will come back to what we know about the semi-random algorithm that governs the extraordinary curiosity of children—and the rare adults who have managed to keep a child’s mind.

LEARNING IS OPTIMIZING A REWARD FUNCTION

Remember LeCun’s LeNet system, which recognizes the shapes of numbers? In order to learn, this type of artificial neural network needs to be provided with the correct answers. For each input image, it needs to know which of the ten possible numbers it corresponds to. The network can correct itself only by calculating the difference between its response and the correct answer. This procedure is known as “supervised learning”: a supervisor, outside the system, knows the solution and tries to teach it to the machine. This is effective, but it should be noted that this situation, where the right answer is known in advance, is rather rare. When children learn to walk, no one tells them exactly which muscles to contract—they are simply encouraged again and again until they no longer fall. Babies learn solely on the basis of an evaluation of the result: I fell, or, on the contrary, I finally managed to walk across the room.

Artificial intelligence faces the same problem of learning without a supervisor. When a machine learns to play a video game, for example, the only thing it is told is that it must try to attain the highest score. No one tells it in advance what specific actions need to be taken to achieve this. How can it quickly find out for itself the right way of going about it?

Scientists have responded to this challenge by inventing “reinforcement learning,” whereby we do not provide the system with any detail about what it must do (nobody knows!), but only with a “reward,” an evaluation in the form of a quantitative score.5 Even worse, the machine may receive its score after a delay, long after the decisive actions that led to it. Such delayed reinforcement learning is the principle by which the company DeepMind, a Google subsidiary, created a machine capable of playing chess, shogi, and Go. The problem is colossal for a simple reason: it is only at the very end that the system receives a single reward signal, indicating whether the game was won or lost. During the game itself, the system receives no feedback whatsoever—only the final victory or defeat counts. How, then, can the system figure out what to do at any given time? And, once the final score is known, how can the machine retrospectively evaluate its decisions?

The trick that computer scientists have found is to program the machine to do two things at the same time: to act and to self-evaluate. One half of the system, called the “critic,” learns to predict the final score. The goal of this network of artificial neurons is to evaluate, as accurately as possible, the state of the game, in order to predict the final reward: Am I winning or losing? Is my balance stable, or am I about to fall? Thanks to this critic, which emerges in one half of the machine, the system can evaluate its actions at every moment, not just at the end. The other half of the machine, the actor, can then use this evaluation to correct itself: Wait! I’d better avoid this or that action, because the critic thinks it will increase my chances of losing.

Trial after trial, the actor and the critic progress together: one learns to act wisely, focusing on the most effective actions, while the other learns to evaluate, ever more sharply, the consequences of these acts. In the end, unlike the famed guy who is falling from a skyscraper and exclaims, “So far, so good,” the actor-critic network becomes endowed with a remarkable prescience: the ability to predict, within the vast seas of not-yet-lost games, those that are likely to be won and those that will lead only to disaster.
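
A miniature version of this division of labor can be written in a few dozen lines. The sketch below is a toy of my own devising, not DeepMind’s system: the “game” is a corridor of seven cells in which only reaching the right end pays off, the actor is a table of action preferences, and the critic is a table of predicted scores. Both are updated from the same surprise signal, the difference between what the critic predicted and what actually happened:

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy game: a corridor of 7 cells. Stepping off the right end wins (+1);
    # stepping off the left end ends the game with no reward.
    N, START = 7, 3
    theta = np.zeros((N, 2))    # the actor: preferences for moving left (0) or right (1)
    V = np.zeros(N)             # the critic: predicted final score from each cell
    gamma, lr_actor, lr_critic = 0.99, 0.1, 0.1

    def policy(s):
        e = np.exp(theta[s] - theta[s].max())
        return e / e.sum()

    for episode in range(2000):
        s = START
        while True:
            probs = policy(s)
            a = rng.choice(2, p=probs)                  # act, with a dose of randomness
            s_next = s + (1 if a == 1 else -1)
            done = s_next < 0 or s_next >= N
            reward = 1.0 if s_next >= N else 0.0
            # The critic compares its prediction with what actually happened.
            target = reward if done else reward + gamma * V[s_next]
            td_error = target - V[s]
            V[s] += lr_critic * td_error                # the critic refines its evaluation
            # The actor strengthens or weakens the chosen action according to the critic's verdict.
            grad_log = -probs
            grad_log[a] += 1.0
            theta[s] += lr_actor * td_error * grad_log
            if done:
                break
            s = s_next

    # The learned probability of moving right in each cell typically drifts toward 1.
    print(np.round([policy(s)[1] for s in range(N)], 2))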

The actor-critic combination is one of the most effective strategies of contemporary artificial intelligence. When backed by a hierarchical neural network, it works wonders. As early as the beginning of the 1990s, it enabled a neural network to play backgammon at world-champion level. More recently, it enabled DeepMind to create a multifunctional neural network capable of learning to play all kinds of video games such as Super Mario and Tetris.6 One simply gives this system the pixels of the image as an input, the possible actions as an output, and the score of the game as a reward function. The machine learns everything else. When it plays Tetris, it discovers that the screen is made up of shapes, that the falling one is more important than the others, that various actions can change its orientation and its position, and so on and so forth—until the machine turns into an artificial player of formidable effectiveness. And when it plays Super Mario, the change in inputs and rewards teaches it to attend to completely different settings: what pixels form Mario’s body, how he moves, where the enemies are, the shapes of walls, doors, traps, bonuses . . . and how to act in front of each of them. By adjusting its parameters, i.e., the millions of connections that link the layers together, a single network can adapt to all kinds of games and learn to recognize the shapes of Tetris, Pac-Man, or Sonic the Hedgehog.

What is the point of teaching a machine to play video games? Two years later, DeepMind engineers used what they had learned from game playing to solve an economic problem of vital interest: How should Google optimize the management of its computer servers? The artificial neural network remained similar; the only things that changed were the inputs (date, time, weather, international events, search requests, number of people connected to each server, etc.), the outputs (turn on or off this or that server on various continents), and the reward function (consume less energy). The result was an instant drop in power consumption: Google cut the energy used to cool its data centers by up to 40 percent and saved tens of millions of dollars—even after myriad specialized engineers had already tried to optimize those very servers. Artificial intelligence has truly reached levels of success that can turn whole industries upside down.

DeepMind has achieved even more amazing feats. As everyone probably knows, its AlphaGo program managed to beat eighteen-time world champion Lee Sedol in the game of Go, considered until very recently the Everest of artificial intelligence.7 This game is played on a vast square grid (a goban) with nineteen lines on each side, for a total of 361 intersections where black and white stones can be placed. The number of combinations is so vast that it is strictly impossible to systematically explore all the future moves available to each player. And yet reinforcement learning allowed the AlphaGo software to recognize favorable and unfavorable combinations better than any human player. One of the many tricks was to make the system play against itself, just as a chess player trains by playing both white and black. The idea is simple: at the end of each game, the winning software strengthens its actions, while the loser weakens them—but both have also learned to evaluate their moves more efficiently.

We happily mock Baron Munchausen, who, in his fabled Adventures, foolishly attempts to fly away by pulling on his bootstraps. In artificial intelligence, however, Munchausen’s mad method gave birth to a rather sophisticated strategy, aptly called “bootstrapping”—little by little, starting from a meaningless architecture devoid of knowledge, a neural network can become a world champion, simply by playing against itself.

This idea of increasing the speed of learning by letting two networks collaborate—or, on the contrary, compete—continues to lead to major advances in artificial intelligence. One of the most recent ideas, called “adversarial learning,”8 consists of training two opponent systems: one that learns to become an expert (say, in Van Gogh’s paintings) and another whose sole goal is to make the first one fail (by learning to become a brilliant forger of false Van Goghs). The first system gets a bonus whenever it successfully identifies a genuine Van Gogh painting, while the second is rewarded whenever it manages to fool the other’s expert eye. This adversarial learning algorithm yields not just one but two artificial intelligences: a world authority in Van Gogh, fond of the smallest details that can authenticate a true painting by the master, and a genius forger, capable of producing paintings that can fool the best of experts. This sort of training can be likened to the preparation for a presidential debate: a candidate can sharpen her training by hiring someone to imitate her opponent’s best lines.
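
Here is what the adversarial idea looks like when boiled down to a miniature sketch in Python with PyTorch. The “genuine Van Goghs” are replaced by mere numbers drawn from a fixed distribution, the forger turns random noise into imitations, and the expert learns to tell the two apart; all sizes and learning rates are illustrative choices of mine:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # The "genuine works" are just numbers drawn around 4.0; the forger must learn to imitate them.
    def genuine_works(n):
        return torch.randn(n, 1) * 0.5 + 4.0

    forger = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # turns noise into a fake
    expert = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # scores real vs. fake
    opt_forger = torch.optim.Adam(forger.parameters(), lr=1e-3)
    opt_expert = torch.optim.Adam(expert.parameters(), lr=1e-3)
    loss = nn.BCEWithLogitsLoss()

    for step in range(3000):
        real = genuine_works(64)
        fake = forger(torch.randn(64, 8))

        # The expert is rewarded for telling genuine works (label 1) from forgeries (label 0).
        expert_loss = (loss(expert(real), torch.ones(64, 1))
                       + loss(expert(fake.detach()), torch.zeros(64, 1)))
        opt_expert.zero_grad(); expert_loss.backward(); opt_expert.step()

        # The forger is rewarded whenever its forgeries fool the expert.
        forger_loss = loss(expert(fake), torch.ones(64, 1))
        opt_forger.zero_grad(); forger_loss.backward(); opt_forger.step()

    with torch.no_grad():
        fakes = forger(torch.randn(1000, 8))
    print(round(fakes.mean().item(), 2))   # the forgeries' average typically drifts toward 4.0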

Could this approach apply to a single human brain? Our two hemispheres and numerous subcortical nuclei also host a whole collection of experts who fight, coordinate, and evaluate one another. Some of the areas in our brain learn to simulate what others are doing; they allow us to foresee and imagine the results of our actions, sometimes with a realism worthy of the best counterfeiters: our memory and imagination can make us see the seaside bay where we swam last summer, or the door handle that we grab in the dark. Some areas learn to criticize others: they constantly assess our abilities and predict the rewards or punishments we might get. These are the areas that push us to act or to remain silent. We will also see that metacognition—the ability to know oneself, to self-evaluate, to mentally simulate what would happen if we acted this way or that way—plays a fundamental role in human learning. The opinions we form of ourselves help us progress or, in some cases, lock us into a vicious circle of failure. Thus, it is not inappropriate to think of the brain as a collection of experts that collaborate and compete.

LEARNING IS RESTRICTING SEARCH SPACE

Contemporary artificial intelligence still faces a major problem. The more parameters the internal model has, the more difficult it is to find the best way to adjust it. And in current neural networks, the search space is immense. Computer scientists therefore have to deal with a massive combinatorial explosion: at each stage, millions of choices are available, and their combinations are so vast that it is impossible to explore them all. As a result, learning is sometimes exceedingly slow: it takes billions of attempts to move the system in the right direction within this immense landscape of possibilities. And the data, however large, become scarce relative to the gigantic size of that space. This issue is called the “curse of dimensionality”—learning can become very hard when you have millions of potential levers to pull.
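
A quick calculation conveys the scale of the curse. Suppose, purely for illustration, that each adjustable parameter could take only ten distinct values and that we tried to test every combination:

    # Exhaustive search explodes with the number of adjustable parameters.
    values_per_parameter = 10
    for n_parameters in (2, 10, 50, 100):
        print(n_parameters, "parameters:", f"{values_per_parameter ** n_parameters:.0e}", "combinations")
    # 2 parameters:   1e+02 combinations
    # 100 parameters: 1e+100 combinations -- more than the number of atoms in the observable universe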

The immense number of parameters that neural networks possess often leads to a second obstacle, which is called “overfitting” or “overlearning”: the system has so many degrees of freedom that it finds it easier to memorize all the details of each example than to identify a more general rule that can explain them.

As John von Neumann (1903–57), the father of computer science, famously said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” What he meant is that having too many free parameters can be a curse: it’s all too easy to “overfit” any data simply by memorizing every detail, but that does not mean that the resulting system captures anything significant. You can fit the pachyderm’s profile without understanding anything deep about elephants as a species. Having too many free parameters can be detrimental to abstraction. While the system easily learns, it is unable to generalize to new situations. Yet this ability to generalize is the key to learning. What would be the point of a machine that could recognize a picture that it has already seen, or win a game of Go that it has already played? Obviously, the real aim is to recognize any picture, or to win against any player, whether the circumstances are familiar or new.
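
A classic way to watch overfitting happen, in the spirit of von Neumann’s quip, is to fit a handful of noisy points with polynomials of increasing degree: the model with too many free parameters reproduces the training points almost perfectly yet does worse on new data. The data below are synthetic numbers generated just for the illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_line(n):
        """The 'true' rule is a straight line; the measurements come with a little noise."""
        x = rng.uniform(-1, 1, n)
        return x, 2 * x + 1 + rng.normal(0, 0.2, n)

    x_train, y_train = noisy_line(12)
    x_test, y_test = noisy_line(100)    # new data, never seen during fitting

    for degree in (1, 11):
        coeffs = np.polyfit(x_train, y_train, degree)       # adjust degree + 1 free parameters
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
    # The degree-11 polynomial threads through every training point (tiny train error)
    # but typically generalizes worse than the simple straight line on the new points.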

Again, computer scientists are investigating various solutions to these problems. One of the most effective interventions, which can both accelerate learning and improve generalization, is to simplify the model. When the number of parameters to be adjusted is minimized, the system can be forced to find a more general solution. This is the key insight that led LeCun to invent convolutional neural networks, an artificial learning device which has become ubiquitous in the field of image recognition.9 The idea is simple: in order to recognize the items in a picture, you pretty much have to do the same job everywhere. In a photo, for example, faces may appear anywhere. To recognize them, one should apply the same algorithm to every part of the picture (e.g., to look for an oval, a pair of eyes, etc.). There is no need to learn a different model at each point of the retina: what is learned in one place can be reused everywhere else.

Over the course of learning, LeCun’s convolutional neural networks apply whatever they learn from a given region to the entire image, at all levels and on ever wider scales. They therefore have a much smaller number of parameters to learn: by and large, the system has to tune only a small set of filters that it applies everywhere, rather than a plethora of different connections for each location in the image. This simple trick massively improves performance, especially generalization to new images. The reason is simple: the algorithm that runs on a new image benefits from the immense experience it gained from every point of every photo that it has ever seen. It also speeds up learning, since the machine explores only a subset of vision models. Prior to learning, it already knows something important about the world: that the same object can appear anywhere in the image.
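
The savings can be made concrete with a quick count. For an image of modest size, a layer that learns a separate connection from every pixel to every unit has vastly more parameters than a convolutional layer that reuses a few small filters everywhere; the sizes below are illustrative:

    # Comparing the number of adjustable parameters for a 256 x 256 grayscale image.
    pixels = 256 * 256

    # A fully connected layer: every one of 1,000 units looks at every pixel separately.
    dense_parameters = pixels * 1_000
    # A convolutional layer: 32 small filters of 5 x 5 weights, reused at every position.
    conv_parameters = 32 * 5 * 5

    print(f"{dense_parameters:,} vs {conv_parameters:,}")   # 65,536,000 vs 800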

This trick generalizes to many other domains. To recognize speech, for example, one must abstract away from the specifics of the speaker’s voice. This is achieved by forcing a neural network to use the same connections in different frequency bands, whether the voice is low or high. Reducing the number of parameters that must be adjusted leads to greater speeds and better generalization to new voices: the advantage is twofold, and this is how your smartphone is able to respond to your voice.

LEARNING IS PROJECTING A PRIORI HYPOTHESES

Yann LeCun’s strategy provides a good example of a much more general notion: the exploitation of innate knowledge. Convolutional neural networks learn better and faster than other types of neural networks because they do not learn everything. They incorporate, in their very architecture, a strong hypothesis: what I learn in one place can be generalized everywhere else.

The main problem with image recognition is invariance: I have to recognize an object, whatever its position and size, even if it moves to the right or left, farther or closer. It is a challenge, but it is also a very strong constraint: I can expect the very same clues to help me recognize a face anywhere in space. By replicating the same algorithm everywhere, convolutional networks effectively exploit this constraint: they integrate it into their very structure. Innately, prior to any learning, the system already “knows” this key property of the visual world. It does not learn invariance, but assumes it a priori and uses it to reduce the learning space—clever indeed!

The moral here is that nature and nurture should not be opposed. Pure learning, in the absence of any innate constraints, simply does not exist. Any learning algorithm contains, in one way or another, a set of assumptions about the domain to be learned. Rather than trying to learn everything from scratch, it is much more effective to rely on prior assumptions that clearly delineate the basic laws of the domain that must be explored, and integrate these laws into the very architecture of the system. The more innate assumptions there are, the faster learning is (provided, of course, that these assumptions are correct!). This is universally true. It would be wrong, for example, to think that the AlphaGo Zero software, which trained itself in Go by playing against itself, started from nothing: its initial representation included, among other things, knowledge of the topography and symmetries of the game, which divided the search space by a factor of eight.
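
The factor of eight comes from the symmetries of the square board: four rotations, each with or without a mirror flip. A few lines of NumPy make the count explicit, using a random toy position rather than a real game:

    import numpy as np

    rng = np.random.default_rng(0)
    board = rng.integers(0, 3, size=(19, 19))   # 0 = empty, 1 = black, 2 = white (toy position)

    # Every position has eight equivalent versions: 4 rotations x (flipped or not).
    symmetries = []
    for k in range(4):
        rotated = np.rot90(board, k)
        symmetries.append(rotated)
        symmetries.append(np.fliplr(rotated))

    unique = {s.tobytes() for s in symmetries}
    print(len(unique))   # 8: a learner that knows about symmetry explores one-eighth of the space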

Our brain too is molded with assumptions of all kinds. Shortly, we will see that, at birth, babies’ brains are already organized and knowledgeable. They know, implicitly, that the world is made of things that move only when pushed, without ever interpenetrating each other (solid objects)—and also that it contains much stranger entities that speak and move by themselves (people). No need to learn these laws: since they are true everywhere humans live, our genome hardwires them into the brain, thus constraining and speeding up learning. Babies do not have to learn everything about the world: their brains are full of innate constraints, and only the specific parameters that vary unpredictably (such as face shape, eye color, tone of voice, and individual tastes of the people around them) remain to be acquired.

Again, nature and nurture need not be opposed. If the baby’s brain knows the difference between people and inanimate objects, it is because, in a sense, it has learned this—not in the first few days of its life, but in the course of millions of years of evolution. Darwinian selection is, in effect, a learning algorithm—an incredibly powerful program that has been running for hundreds of millions of years, in parallel, across billions of learning machines (every creature that ever lived).10 We are the heirs of an unfathomable wisdom. Through Darwinian trial and error, our genome has internalized the knowledge of the generations that have preceded us. This innate knowledge is of a different type than the specific facts that we learn during our lifetime: it is much more abstract, because it biases our neural networks to respect the fundamental laws of nature.

In brief, during pregnancy, our genes lay down a brain architecture that guides and accelerates subsequent learning by imposing restrictions on the size of the explored space. In computer-science lingo, one may say that genes set up the “hyperparameters” of the brain: the high-level variables that specify the number of layers, the types of neurons, the general shape of their interconnections, whether they are duplicated at any point on the retina, and so on and so forth. Because many of these variables are stored in our genome, we no longer need to learn them: our species internalized them as it evolved.

Our brain is therefore not simply passively subjected to sensory inputs. From the get-go, it already possesses a set of abstract hypotheses, an accumulated wisdom that emerged through the sieve of Darwinian evolution and which it now projects onto the outside world. Not all scientists agree with this idea, but I consider it a central point: the naive empiricist philosophy underlying many of today’s artificial neural networks is wrong. It is simply not true that we are born with completely disorganized circuits devoid of any knowledge, which later receive the imprint of their environment. Learning, in man and machine, always starts from a set of a priori hypotheses, which are projected onto the incoming data, and from which the system selects those that are best suited to the current environment. As Jean-Pierre Changeux stated in his best-selling book Neuronal Man (1985), “To learn is to eliminate.”