CHAPTER 2

Why Our Brain Learns Better Than Current Machines

The recent surge of progress in artificial intelligence may suggest that we have finally discovered how to copy and even surpass human learning and intelligence. According to some self-proclaimed prophets, machines are about to overtake us. Nothing could be further from the truth. In fact, most cognitive scientists, while admiring recent advances in artificial neural networks, are well aware that these machines remain highly limited. In reality, most artificial neural networks implement only the operations that our brain performs unconsciously, in a few tenths of a second, when it perceives an image, recognizes it, categorizes it, and accesses its meaning.1 However, our brain goes much further: it is able to explore the image consciously, carefully, step by step, for several seconds. It formulates symbolic representations and explicit theories of the world that we can share with others through language.

Operations of this nature—slow, reasoned, symbolic—remain (for now) the exclusive privilege of our species. Current machine learning algorithms capture them poorly. Although there is constant progress in the fields of machine translation and logical reasoning, a common criticism of artificial neural networks is that they attempt to learn everything at the same level, as if every problem were a matter of automatic classification. To a man with a hammer, everything looks like a nail! But our brain is much more flexible. It quickly manages to prioritize information and, whenever possible, extract general, logical, and explicit principles.

WHAT IS ARTIFICIAL INTELLIGENCE MISSING?

It is interesting to try to clarify what artificial intelligence is still missing, because this is also a way to identify what is unique about our species’ learning abilities. Here is a short and probably still partial list of functions that even a baby possesses and that most current artificial systems are missing:

Learning abstract concepts. Most artificial neural networks capture only the very first stages of information processing—those that, in less than a fifth of a second, parse an image in the visual areas of our brain. Deep learning algorithms are far from being as deep as some people claim. According to Yoshua Bengio, one of the inventors of deep learning algorithms, they actually tend to learn superficial statistical regularities in data, rather than high-level abstract concepts.2 To recognize an object, for instance, they often rely on the presence of a few shallow features in the image, such as a specific color or shape. Change these details and their performance collapses: contemporary convolutional neural networks are unable to recognize what constitutes the essence of an object; they have difficulty understanding that a chair remains a chair whether it has four legs or just one, and whether it is made of glass, metal, or inflatable plastic. This inclination to attend to superficial features makes these networks susceptible to massive errors. There is a whole literature on how to fool a neural network: take a banana and modify a few pixels or put a particular sticker on it, and the neural network will think it’s a toaster!

True enough, when you flash an image to a person for a split second, they will sometimes make the same kinds of errors as a machine and may mistake a dog for a cat.3 However, as soon as humans are given a little more time, they correct their errors. Unlike a computer, we possess the ability to question our beliefs and refocus our attention on those aspects of an image that do not fit with our first impression. This second analysis, conscious and intelligent, calls upon our general powers of reasoning and abstraction. Artificial neural networks neglect an essential point: human learning is not just the setting of a pattern-recognition filter, but the forming of an abstract model of the world. By learning to read, for example, we have acquired an abstract concept of each letter of the alphabet, which allows us to recognize it in all its disguises, as well as to generate new versions of it.

The cognitive scientist Douglas Hofstadter once said that the real challenge for artificial intelligence was to recognize the letter A! This quip was undoubtedly an exaggeration, but a profound one nevertheless: even in this most trivial context, humans deploy an unmatched knack for abstraction. This feat lies at the origin of an amusing feature of daily life: the CAPTCHA, the little chain of letters that some websites ask you to recognize in order to prove you are a human being, not a machine. For years, CAPTCHAs withstood machines. But computer science is evolving fast: in 2017, an artificial system managed to recognize CAPTCHAs at an almost humanlike level.4 Unsurprisingly, this algorithm mimics the human brain in several respects. A genuine tour de force, it manages to extract the skeleton of each letter, the inner essence of the letter A, and uses all the resources of statistical reasoning to verify whether this abstract idea applies to the current image. Yet this computer algorithm, however sophisticated, applies only to CAPTCHAs. Our brains apply this ability for abstraction to all aspects of our daily lives.

Data-efficient learning. Everyone agrees that today’s neural networks learn far too slowly: they need thousands, millions, even billions of data points to develop an intuition of a domain. We even have experimental evidence of this sluggishness. For instance, it takes no less than nine hundred hours of play for the neural network designed by DeepMind to reach a reasonable level on an Atari console—while a human being reaches the same level in two hours!5 Another example is language learning. Psycholinguist Emmanuel Dupoux estimates that in most French families, children hear about five hundred to one thousand hours of speech per year, which is more than enough for them to acquire the language of Descartes, including such quirks as soixante-douze or s’il vous plaît. However, among the Tsimane, an indigenous population of the Bolivian Amazon, children hear only sixty hours of speech per year—and remarkably, this limited experience does not prevent them from becoming excellent speakers of the Tsimane language. In comparison, the best current computer systems from Apple, Baidu, and Google require anywhere between twenty and a thousand times more data in order to attain a modicum of language competence. In the field of learning, the effectiveness of the human brain remains unmatched: machines are data hungry, but humans are data efficient. Learning, in our species, makes the most of the least amount of data.

Social learning. Our species is the only one that voluntarily shares information: we learn a lot from our fellow humans through language. This ability remains beyond the reach of current neural networks. In these models, knowledge is encrypted, diluted in the values of hundreds of millions of synaptic weights. In this hidden, implicit form, it cannot be extracted and selectively shared with others. In our brains, by contrast, the highest-level information, which reaches our consciousness, can be explicitly stated to others. Conscious knowledge comes with verbal reportability: whenever we understand something in a sufficiently perspicuous manner, a mental formula resonates in our language of thought, and we can use the words of language to report it. The extraordinary efficiency with which we manage to share our knowledge with others, using a minimum number of words (“To get to the market, turn right on the small street behind the church.”), remains unequalled, in the animal kingdom as in the computer world.

One-trial learning. An extreme case of this efficiency is when we learn something new on a single trial. If I introduce a new verb, let’s say purget, hearing it even once will be enough for you to start using it. Of course, some artificial neural networks are also capable of storing a specific episode. But what machines cannot yet do well, and what the human brain does wonderfully, is integrate new information into an existing network of knowledge. You not only memorize the new verb purget, but you immediately know how to conjugate it and insert it into other sentences: Do you ever purget? I purgot it yesterday. Have you ever purgotten? Purgetting is a problem. When I say, “Let’s purget tomorrow,” you don’t just learn a word—you also insert it into a vast system of symbols and rules: it is a verb with irregular past tense (purgot, purgotten) and a typical conjugation in the present tense (I purget, you purget, she purgets, etc.). To learn is to succeed in inserting new knowledge into an existing network.

Systematicity and the language of thought. Grammar rules are just one example of a particular talent in our brain: the ability to discover the general laws that lie behind specific cases. Whether it is in mathematics, language, science, or music, the human brain manages to extract very abstract principles, systematic rules that it can reapply in many different contexts. Take arithmetic, for example: our ability to add two numbers is extremely general—once we have learned this procedure with small numbers, we can apply it systematically to arbitrarily large numbers. Better yet, we can draw inferences of extraordinary generality. Many children, around five or six years of age, discover that each number n has a successor n + 1, and that the sequence of whole numbers is therefore infinite—there is no greatest number. I still remember, with emotion, the moment when I became aware of this—it was, in reality, my first mathematical theorem. What extraordinary powers of abstraction! How does our brain, which contains a finite number of neurons, manage to conceptualize infinity?

Present-day artificial neural networks cannot represent an abstract law as simple as “every number has a successor.” Absolute truths are not their cup of tea. Systematicity,6 the ability to generalize on the basis of a symbolic rule rather than a superficial resemblance, still eludes most current algorithms. Ironically, so-called deep learning algorithms are almost entirely incapable of any profound insight.

Our brain, on the other hand, seems to have an effortless ability to conceive formulas in a kind of mental language. For instance, it can express the concept of an infinite set because it possesses an internal language endowed with such abstract functions as negation and quantification (infinite = not finite = beyond any number). The American philosopher Jerry Fodor (1935–2017) theorized this ability: he postulated that our thinking consists of symbols that combine according to the systematic rules of a “language of thought.”7 Such a language owes its power to its recursive nature: each newly created object (say, the concept of infinity) can immediately be reused in new combinations, without limits. How many infinities are there? Such is the seemingly absurd question that the mathematician Georg Cantor (1845–1918) asked himself, which led him to formulate the theory of transfinite numbers. The ability to “make infinite use of finite means,” according to Wilhelm von Humboldt (1767–1835), characterizes human thought.

Some computer-science models try to capture the acquisition of abstract mathematical rules in children—but to do so, they have to incorporate a very different form of learning, one that involves rules and grammars and manages to quickly select the shortest and most plausible of them.8 In this view, learning becomes similar to programming: it consists of selecting the simplest internal formula that fits the data, among all those available in the language of thought.

Current neural networks are largely unable to represent the range of abstract phrases, formulas, rules, and theories with which the Homo sapiens brain models the world. This is probably no coincidence: there is something profoundly human about this, something that is not found in the brains of other animal species, and that contemporary neuroscience has not yet managed to address—a genuinely singular aspect of our species. Among primates, our brain seems to be the only one to represent sets of symbols that combine according to a complex and arborescent syntax.9 My laboratory, for example, has shown that the human brain cannot hear a series of sounds such as beep beep beep boop without immediately theorizing about the underlying abstract structure (three identical sounds followed by a different one). Placed in the same situation, a monkey detects a series of four sounds, realizes that the last is different, but does not seem to integrate this piecewise knowledge into a single formula; we know this because when we examine their brain activity, we see distinct circuits activate for number and for sequence, but never observe the integrated pattern of activity that we find in the human language area called “Broca’s area.”10

Similarly, it takes tens of thousands of trials before a monkey understands how to reverse the order of a sequence (from ABCD to DCBA), while for a four-year-old human, five trials are enough.11 Even a baby of a few months of age already encodes the external world using abstract and systematic rules—an ability that completely eludes both conventional artificial neural networks and other primate species.

Composition. Once I have learned, say, to add two numbers, this skill becomes an integral part of my repertoire of talents: it becomes immediately available to address all my other goals. I can use it as a subroutine in dozens of different contexts, for example, to pay the restaurant bill or to check my tax forms. Above all, I can recombine it with other learned skills—I have no difficulty, for example, following an algorithm that asks me to take a number, add two, and decide whether it is now larger or smaller than five.12
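To make this concrete, here is a minimal sketch in Python (the task and the function names are my own invention, not drawn from the text): once a skill is stored as a reusable subroutine, it can be freely combined with other skills to solve a task that was never trained as such.

```python
# A minimal illustration of compositionality: each learned skill is a
# reusable function, and a new task is solved by recombining old skills.

def add_two(n: int) -> int:
    """A previously learned skill: add 2 to a number."""
    return n + 2

def larger_than_five(n: int) -> bool:
    """Another learned skill: compare a number with 5."""
    return n > 5

def new_task(n: int) -> bool:
    """A brand-new task, solved purely by composition:
    take a number, add two, then decide whether it exceeds five."""
    return larger_than_five(add_two(n))

print(new_task(2))  # False: 2 + 2 = 4, which is not larger than 5
print(new_task(7))  # True:  7 + 2 = 9, which is larger than 5
```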

It is surprising that current neural networks do not yet show this flexibility. The knowledge that they have learned remains confined in hidden, inaccessible connections, thus making it very difficult to reuse in other, more complex tasks. The ability to compose previously learned skills, that is, to recombine them in order to solve new problems, is beyond these models. Today’s artificial intelligence solves only extremely narrow problems: the AlphaGo software, which can defeat any human champion in Go, is a stubborn expert, unable to generalize its talents to any other game even slightly different (including the game of Go on a fifteen-by-fifteen board rather than the standard nineteen-by-nineteen goban). In the human brain, on the other hand, learning almost always means rendering knowledge explicit, so that it can be reused, recombined, and explained to others. Here again, we are dealing with a singular aspect of the human brain, linked to language, which has proven difficult to reproduce in a machine. As early as 1637, in his famous Discourse on Method, Descartes anticipated this issue:

If there were machines that resembled our bodies and imitated our actions as much as is morally possible, we would always have two very certain means for recognizing that they are not genuinely human. The first is that they would never be able to use speech, or other signs by composing them, as we do to express our thoughts to others. For one could easily conceive of a machine that is made in such a way that it utters words . . . but it could not arrange words in different ways to reply to the meaning of everything that is said in its presence, as even the most unintelligent human beings can do. And the second means is that, even if they did many things as well as or, possibly, better than anyone of us, they would infallibly fail in others. One would thus discover that they did not act on the basis of knowledge, but merely as a result of the disposition of their organs. For whereas reason is a universal instrument that can be used in all kinds of situations, these organs need a specific disposition for every particular action.

Reason, the mind’s universal instrument. . . . The mental abilities that Descartes lists point to a second learning system, hierarchically higher than the previous one, based on rules and symbols. In its early stages, our visual system vaguely resembles current artificial neural networks: it learns to filter incoming images and to recognize frequent configurations. This suffices to recognize a face, a word, or a configuration of the game Go. But then the processing style radically changes: learning begins to resemble reasoning, a logical inference that attempts to capture the rules of a domain. Creating machines that reach this second level of intelligence is a great challenge for contemporary artificial intelligence research. Let’s examine two elements that define what humans do when they learn at this second level, and that defy most current machine learning algorithms.

LEARNING IS INFERRING THE GRAMMAR OF A DOMAIN

Characteristic of the human species is a relentless search for abstract rules, high-level conclusions that are extracted from a specific situation and subsequently tested on new observations. Attempting to formulate such abstract laws can be an extraordinarily powerful learning strategy, since the most abstract laws are precisely those that apply to the greatest number of observations. Finding the appropriate law or logical rule that accounts for all available data is the ultimate means to massively accelerate learning—and the human brain is exceedingly good at this game.

Let us consider an example. Imagine that I show you a dozen opaque boxes filled with balls of different colors. I select a box at random, one from which I have never drawn anything out before. I plunge my hand into it, and I draw a green ball. Can you deduce anything about the contents of the box? What color will the next ball be?

The first answer that probably comes to mind is: I have no idea—you gave me practically no information; how could I know the color of the next ball? Yes, but . . . imagine that, in the past, I drew some balls from the other boxes and you noticed the following rule: in a given box, all balls are always the same color. The problem becomes trivial. When I show you a new box, you need only to draw a single green ball to deduce that all the other balls will be this color. With this general rule in mind, it becomes possible to learn in a single trial.

This example illustrates how higher-order knowledge, formulated at what is often called the “meta” level, can guide a whole set of lower-level observations. The abstract meta-rule that “in a given box, all the balls are the same color,” once learned, massively accelerates learning. Of course, it may also turn out to be false. You will then be massively surprised (or should I say “meta-surprised”) if the tenth box you explore contains balls of all colors. In this case, you would have to revise your mental model and question the assumption that all boxes are similar. Perhaps you would propose an even higher-level hypothesis, a meta-meta-hypothesis—for instance, you may suppose that boxes come in two kinds, single-colored and multicolored, in which case you would need at least two draws per box before concluding anything. In any case, formulating a hierarchy of abstract rules would save you valuable learning time.
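To see the arithmetic behind this acceleration, here is a minimal sketch in Python. The numbers (eight possible colors, ten previously observed boxes) and the two candidate meta-rules are assumptions chosen purely for illustration; the point is only that an observer who has inferred the meta-rule that all balls in a box share one color can predict the next ball with near certainty after a single draw.

```python
# Toy model of the colored-balls example.
# Two candidate meta-rules about boxes (illustrative assumptions):
#   "uniform": every ball in a given box has the same color
#   "mixed":   each ball's color is drawn independently among K colors
K = 8  # number of possible colors (arbitrary choice)

def prob_two_draws_match(meta_rule: str) -> float:
    """Probability that two balls drawn from the same box share a color."""
    return 1.0 if meta_rule == "uniform" else 1.0 / K

# Step 1: past experience. In each of 10 previous boxes, two draws matched.
# Bayes' rule turns these repeated coincidences into a meta-level belief.
prior = {"uniform": 0.5, "mixed": 0.5}
likelihood = {m: prob_two_draws_match(m) ** 10 for m in prior}
evidence = sum(prior[m] * likelihood[m] for m in prior)
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}

# Step 2: a new box. One green ball is drawn; predict the next ball's color.
p_next_green = sum(posterior[m] * prob_two_draws_match(m) for m in posterior)

print(posterior)     # the "uniform" meta-rule is now virtually certain
print(p_next_green)  # ~1.0: a single draw suffices to predict the whole box
```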

Learning, in this sense, therefore means managing an internal hierarchy of rules and trying to infer, as soon as possible, the most general ones that summarize a whole series of observations. The human brain seems to apply this hierarchical principle from childhood. Take a two- or three-year-old child walking in a garden and learning a new word from his or her parents, say, butterfly. Often, it is enough for the child to hear the word once or twice, and voilà: its meaning is memorized. Such a learning speed is amazing. It surpasses every known artificial intelligence system to date. Why is the problem difficult? Because every utterance of every word does not fully constrain its meaning. The word butterfly is typically uttered as the child is immersed in a complex scene, full of flowers, trees, toys, and people; all of these are potential candidates for the meaning of that word—not to mention the less obvious meanings: every moment we live is full of sounds, smells, movements, actions, but also abstract properties. For all we know, butterfly could mean color, sky, move, or symmetry. The existence of abstract words makes this problem most perplexing. How do children learn the meanings of the words think, believe, no, freedom, and death, if the referents cannot be perceived or experienced? How do they understand what “I” means, when each time they hear it, the speakers are talking about . . . themselves?!

The fast learning of abstract words is incompatible with naive views of word learning such as Pavlovian conditioning or Skinnerian association. Neural networks that simply try to correlate inputs with outputs, and images with words, typically require thousands of trials before they begin to understand that the word butterfly refers to that colored insect, there, in the corner of the image . . . and such a shallow correlation of words with pictures will never discover the meanings of words without a fixed reference, such as we, always, or smell.

Word acquisition poses a huge challenge to cognitive science. However, we know that part of the solution lies in the child’s ability to formulate nonlinguistic, abstract, logical representations. Even before they acquire their first words, children possess a kind of language of thought within which they can formulate and test abstract hypotheses. Their brains are not blank slates, and the innate knowledge that they project onto the external world can drastically restrict the abstract space within which they learn. Furthermore, children quickly learn the meaning of words because they select among hypotheses using as a guide a whole panoply of high-level rules. Such meta-rules massively accelerate learning, exactly as in the problem of the colored balls in the different boxes.

One of these rules that facilitates vocabulary acquisition is to always favor the simplest, smallest assumption compatible with the data. For instance, when a baby hears its mother say, “Look at the dog,” in theory, nothing precludes the word dog from referring to that particular dog (Snoopy)—or, conversely, any mammal, four-legged creature, animal, or living being. How do children discover the true meaning of a word—that dog means all dogs, but only dogs? Experiments suggest that they reason logically by testing all hypotheses but keeping only the simplest one that fits with what they heard. Thus, when children hear the word Snoopy, they always hear it in the context of that specific pet, and the smallest set compatible with those observables is confined to that particular dog. And the first time children hear the word dog, in a single specific context, they may temporarily believe that the word refers to only that particular animal—but as soon as they hear it twice, in two different contexts, they can infer that the word refers to a whole category. A mathematical model of this process predicts that three or four instances are enough to converge toward the appropriate meaning.13 This is the inference that children make, faster than any current artificial neural network.
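One common way to formalize this “smallest compatible hypothesis” idea is the so-called size principle: if each example is assumed to be drawn from the true category, then a hypothesis that covers fewer things assigns a higher probability to every example it does cover. The Python sketch below is only a caricature of the model cited above, with made-up category sizes, but it reproduces the qualitative pattern: a single example favors “Snoopy only,” while two or three examples involving different dogs swing the belief toward “all dogs.”

```python
# A caricature of the "size principle" for word learning.
# Candidate meanings of the word "dog", with made-up sizes
# (roughly, how many distinct things each hypothesis picks out):
hypotheses = {"Snoopy only": 1, "all dogs": 50, "all animals": 5000}
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def posterior_after(n_examples: int) -> dict:
    """Belief after hearing 'dog' applied to n different dogs.
    Each example is assumed to be sampled from the true category, so its
    probability under a hypothesis of size s is 1/s ... except that
    'Snoopy only' cannot generate two *different* dogs at all."""
    scores = {}
    for h, size in hypotheses.items():
        if h == "Snoopy only" and n_examples > 1:
            scores[h] = 0.0
        else:
            scores[h] = prior[h] * (1 / size) ** n_examples
    total = sum(scores.values())
    return {h: round(s / total, 4) for h, s in scores.items()}

for n in (1, 2, 3, 4):
    print(n, posterior_after(n))
# After two or three different dogs, "all dogs" dominates: tight hypotheses
# win quickly because broad ones waste probability on animals never observed.
```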

Other tricks allow children to learn language in record time, compared with present-day AI systems. One of these meta-rules expresses a truism: in general, the speaker pays attention to what he or she is talking about. Once babies understand this rule, they can considerably restrict the abstract space in which they search for meaning: they do not have to correlate every word with all the objects present in the visual scene, as a computer would, until they obtain enough data to prove that each time they hear about a butterfly, the little colored insect is present. All the child has to do to infer what his mother is talking about is follow her gaze or the direction of her finger: this is called “shared attention,” and it is a fundamental principle of language learning.

Here is an elegant experiment: Take a two- or three-year-old child, show him a new toy, and have an adult look at it while saying, “Oh, a wog!” A single trial suffices for the child to figure out that wog is the name of that object. Now replicate the situation, except that the adult doesn’t say a word, but the child hears “Oh, a wog!” uttered by a loudspeaker on the ceiling. The child learns nothing at all, because he can no longer decipher the speaker’s intention.14 Babies learn the meaning of a new word only if they manage to understand the intention of the person who uttered it. This ability also enables them to acquire a lexicon of abstract words: to do so, they must put themselves in the speaker’s place to understand which thought or word the speaker intended to refer to.

Children use many other meta-rules to learn words. For example, they capitalize on grammatical context: when they are told, “Look at the butterfly,” the presence of the determiner word the makes it very likely that the following word is a noun. This is a meta-rule that they had to learn—babies obviously are not born with an innate knowledge of all possible articles in every language. However, research shows that this type of learning is fast: by twelve months, children have already recorded the most frequent determiners and other function words and use them to guide subsequent learning.15

[Figure: Learning means trying to select the simplest model that fits the data. Suppose I show you the top card and tell you that the three objects surrounded by thick lines are “tufas.” With so little data, how do you find the other tufas? Your brain makes a model of how these forms were generated, a hierarchical tree of their properties, and then selects the smallest branch of the tree that is compatible with all the data.]

They are able to do this because these grammatical words are very frequent and, whenever they appear, almost invariably precede a noun or a noun phrase. The reasoning may seem circular, but it is not: babies start learning their first nouns, beginning with extremely familiar ones like bottle and chair, around six months of age . . . then they notice that these words are often preceded by a very frequent word, the article the . . . from which they deduce that all these words probably belong to the same category, noun . . . and that these words often refer to things . . . a meta-rule which enables them, when they hear a new utterance, such as “the butterfly,” to first search for a possible meaning among the objects around them, rather than treating the word as a verb or an adjective. Thus, each learning episode reinforces this rule, which itself facilitates subsequent learning, in a vast movement that accelerates every day. Developmental psychologists say that the child relies on syntactic bootstrapping: the child’s language-learning algorithm gradually takes off on its own, by capitalizing on a series of small but systematic inference steps.
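A drastically simplified sketch of this bootstrapping loop, in Python, might look as follows; the four-sentence mini-corpus and the single cue (the word the) are my own toy assumptions, not a model of real acquisition.

```python
# A toy version of syntactic bootstrapping: words that follow "the" are
# tagged as probable nouns, which narrows the search for their meaning
# to objects in the scene rather than actions or properties.
from collections import Counter

utterances = [
    "look at the bottle",
    "where is the chair",
    "the bottle is empty",
    "look at the butterfly",   # contains a brand-new word
]

follows_the = Counter()
for sentence in utterances:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        if prev == "the":
            follows_the[word] += 1

probable_nouns = set(follows_the)
print(probable_nouns)
# {'bottle', 'chair', 'butterfly'}: even the unknown word "butterfly" is
# immediately classified as a noun, so the learner looks for a thing.
```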

There is yet another meta-rule that children use to speed up word learning. It is called the “mutual exclusivity assumption” and can be stated succinctly: one name for each thing. The rule basically says that it is unlikely for two different words to refer to the same concept. A new word therefore most probably refers to a new object or idea. With this rule in mind, as soon as children hear an unfamiliar word, they can restrict their search for meaning to things whose names they do not yet know. And, as early as sixteen months of age, children use this trick quite astutely.16 Try the following experiment: take two bowls, a blue one and another of an unusual color—say, olive green—and tell the child, “Give me the tawdy bowl.” The child will give you the bowl that is not blue (a word he already knows)—he seems to assume that if you had wanted to speak about the blue bowl, you would have used the word blue; ergo, you must be referring to the other, unknown one. Weeks later, that single experience will suffice for him to remember that this odd color is “tawdy.”
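The mutual exclusivity heuristic is simple enough to be written down in a few lines. The Python sketch below merely replays the bowl experiment described above; the object names and the mapping are invented for illustration.

```python
# Mutual exclusivity: a new word most likely names the object
# whose name the child does not yet know.
known_names = {"blue bowl": "blue"}                 # vocabulary so far
objects_in_view = ["blue bowl", "olive-green bowl"]

def guess_referent(new_word: str) -> str:
    """Pick the object that has no known name yet."""
    unnamed = [obj for obj in objects_in_view if obj not in known_names]
    return unnamed[0] if unnamed else "no guess"

referent = guess_referent("tawdy")
known_names[referent] = "tawdy"                     # one-trial learning
print(referent, known_names)
# olive-green bowl {'blue bowl': 'blue', 'olive-green bowl': 'tawdy'}
```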

Here again, we see how the mastery of a meta-rule can massively accelerate learning. And it is likely that this meta-rule itself was learned. Indeed, some experiments indicate that children from bilingual families apply this rule much less than monolingual babies.17 Their bilingual experience makes them realize that their parents can use different words to say the same thing. Monolingual children, on the other hand, heavily rely on the exclusivity rule. They have figured out that whenever you use a new word, it is likely that you wanted them to learn a novel object or concept. If you say, “Give me the glax” in a room full of familiar objects, they will search everywhere for this mysterious object to which you are referring—and won’t imagine that you could be referring to one of the known ones.

All these meta-rules illustrate what is called the “blessing of abstraction”: the most abstract meta-rules can be the easiest things to learn, because every word that the child hears provides evidence for them. Thus, the grammatical rule “nouns tend to be preceded by the article the” may well be acquired early on and guide the subsequent acquisition of a vast repertoire of nouns. Thanks to the blessing of abstraction, around two to three years of age, children enter a blessed period rightfully called the “lexical explosion,” during which they effortlessly learn between ten and twenty new words a day, based solely on tenuous clues that still stump the best algorithms on the planet.

The ability to use meta-rules seems to require a good dose of intelligence. Does that make it unique to the human species? Not entirely. To some degree, other animals are also capable of abstract inference. Take the case of Rico, a shepherd dog who was trained to fetch a diverse range of objects.18 All you have to do is say: “Rico, go fetch the dinosaur” . . . and the animal goes into the game room and comes back a few seconds later with a stuffed dinosaur in his mouth. The ethologists who tested him showed that Rico knows about two hundred words. But the most extraordinary thing is that he too uses the mutual exclusivity principle to learn new words. If you tell him, “Rico, go fetch the sikirid” (a new word), he always returns with a new object, one whose name he does not yet know. He too uses meta-rules such as “one name for each thing.”

Mathematicians and computer scientists have begun to design algorithms that allow machines to learn such a hierarchy of rules, meta-rules, and meta-meta-rules, up to an arbitrary level. In these hierarchical learning algorithms, each learning episode constrains not only the low-level parameters, but also the knowledge of the highest level, the abstract hyperparameters which, in turn, will bias subsequent learning. While they do not yet imitate the extraordinary efficiency of language learning, these systems do achieve remarkable performance. For example, figure 4 in the color insert shows how a recent algorithm behaves as a kind of artificial scientist who finds the best model of the outside world.19 This system possesses a set of abstract primitives, as well as a grammar that allows it to generate an infinity of higher-level structures through the recombination of these elementary rules. It can, for instance, define a linear chain as a set of closely connected points which is characterized by the rule “each point has two neighbors, one to the left, one to the right”—and the system manages to discover, all by itself, that such a chain is the best way to represent the set of integers (a line that goes from zero to infinity) or politicians (from the ultraleft to the far right). A variant of the same grammar produces a binary tree where each node has one parent and two children. Such a tree structure is automatically selected when the system is asked to represent living beings—the machine, like an artificial Darwin, spontaneously rediscovers the tree of life!

Other combinations of rules generate planes, cylinders, and spheres, and the algorithm figures out how such structures approximate the geography of our planet. More sophisticated versions of the same algorithm manage to express even more abstract ideas. For example, American computer scientists Noah Goodman and Josh Tenenbaum designed a system capable of discovering the principle of causality20—the very idea that some events cause others. Its formulation is abstruse and mathematical: “In a directed, acyclic graph linking various variables, there exists a subset of variables on which all others depend.” Although this expression is almost incomprehensible, I cite it because it nicely illustrates the kind of abstract internal formulas that this mental grammar is capable of expressing and testing. The system puts thousands of such formulas to the test and keeps only those that fit with the incoming data. As a result, it quickly infers the principle of causality (if, indeed, some of the sensory experiences it receives are causes and others are consequences). This is yet another illustration of the blessing of abstraction: entertaining such a high-level hypothesis massively accelerates learning, because it dramatically narrows the space of plausible hypotheses within which to search. And thanks to it, generations of children are on the lookout for explanations, constantly asking “Why?” and searching for causes—thus fueling our species’s endless pursuit of scientific knowledge.

According to this view, learning consists of selecting, from a large set of expressions in the language of thought, the one that best fits the data. We will soon see that this is an excellent model of what children do. Like budding scientists, they formulate theories and compare them with the outside world. This implies that children’s mental representations are much more structured than those of present-day artificial neural networks. From birth, the child’s brain must already possess two key ingredients: all the machinery that makes it possible to generate a plethora of abstract formulas (a combinatorial language of thought) and the ability to choose from these formulas wisely, according to their plausibility given the data.

Such is the new vision of the brain:21 an immense generative model, massively structured and capable of producing myriad hypothetical rules and structures—but which gradually restricts itself to those that fit with reality.

LEARNING IS REASONING AS A SCIENTIST

How does the brain select the best-fitting hypothesis? On what criteria should it accept or reject a model of the outside world? It turns out that there is an ideal strategy for doing so. This strategy lies at the very core of one of the most recent and productive theories of learning: the hypothesis that the brain behaves like a budding scientist. According to this theory, learning is reasoning like a good statistician who chooses, among several alternative theories, that which has the greatest probability of being correct, because it best accounts for the available data.

How does scientific reasoning work? When scientists formulate a theory, they do not just write down mathematical formulas—they make predictions. The strength of a theory is judged by the richness of the original predictions that emerge from it. The subsequent confirmation or rebuttal of those predictions is what leads to a theory’s validation or downfall. Researchers apply a simple logic: they state several theories, unravel the web of ensuing predictions, and eliminate the theories whose predictions are invalidated by experiments and observations. Of course, a single experiment rarely suffices: it is often necessary to replicate the experiment several times, in different labs, in order to disentangle what is true from what is false. To paraphrase philosopher of science Karl Popper (1902–94), ignorance continuously recedes as a series of conjectures and refutations permits the progressive refinement of a theory.

The slow process of science resembles the way we learn. In each of our minds, ignorance is gradually erased as our brain successfully formulates increasingly accurate theories of the outside world through observations. But is this nothing more than a vague metaphor? No—it is, in fact, a rather precise statement about what the brain must be computing. And over the last thirty years, the hypothesis of “the child as a scientist” has led to a series of major discoveries about how children reason and learn.

Mathematicians and computer scientists have long theorized the best way of reasoning in the presence of uncertainties. This sophisticated theory is called “Bayesian,” after its discoverer, the Reverend Thomas Bayes (1702–61), an English Presbyterian pastor and mathematician who became a member of the Royal Society. But perhaps we should be calling it the Laplacian theory, since it was the great French mathematician Pierre-Simon, Marquis de Laplace (1749–1827), who gave it its first complete formalization. In spite of its ancient roots, it is only in the last twenty years or so that this view has gained prominence in cognitive science and machine learning. An increasing number of researchers have begun to realize that only the Bayesian approach, firmly grounded in probability theory, guarantees the extraction of a maximum of information from each data point. To learn is to be able to draw as many inferences as possible from each observation, even the most uncertain ones—and this is precisely what Bayes’s rule guarantees.

What did Bayes and Laplace discover? Simply put: the right way to make inferences, by reasoning with probabilities in order to trace every observation, however tenuous, back to its most plausible cause. Let’s return to the foundations of logic. Since ancient times, humanity has understood how to reason with values of truth, true or false. Aristotle introduced the rules of deduction that we call syllogisms, which we all apply more or less intuitively. For example, the rule called “modus tollens” (literally translated as “method of denying”) says that if P implies Q and it turns out that Q is false, then P must be false. It is this rule that Sherlock Holmes applied in the famous story “Silver Blaze”:

“Is there any other point to which you would wish to draw my attention?” asks Inspector Gregory of Scotland Yard.

Holmes: “To the curious incident of the dog in the night-time.”

Gregory: “The dog did nothing in the night-time.”

Holmes: “That was the curious incident.”

Sherlock reasoned that if the dog had spotted a stranger, then he would have barked. Because he did not, the criminal must have been a familiar person . . . reasoning that allows the famous detective to narrow down his search and eventually unmask the culprit.
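Written out formally, with P standing for “a stranger was present” and Q for “the dog barked,” Holmes’s inference is an instance of this rule:

$$\bigl((P \Rightarrow Q) \land \lnot Q\bigr) \;\Rightarrow\; \lnot P$$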

“What does this have to do with learning?” you may be asking yourself. Well, learning is also reasoning like a detective: it always boils down to going back to the hidden causes of phenomena, in order to deduce the most plausible model that governs them. But in the real world, observations are rarely true or false: they are uncertain and probabilistic. And that is exactly where the fundamental contributions of the Reverend Bayes and the Marquis de Laplace come into play: Bayesian theory tells us how to reason with probabilities, what kinds of syllogisms we must apply when the data are not perfect, true or false, but probabilistic.

Probability Theory: The Logic of Science is the title of a fascinating book on Bayesian theory by statistician E. T. Jaynes (1922–98).22 In it, he shows that what we call probability is nothing more than the expression of our uncertainty. The theory expresses, with mathematical precision, the laws according to which uncertainty must evolve when we make a new observation. It is the perfect extension of logic to the foggy domain of probabilities and uncertainties.

Let’s take an example, similar in spirit to the one on which the Reverend Bayes founded his theory in the eighteenth century. Suppose I see someone flip a coin. If the coin is fair, it is equally likely to land on heads as it is tails: fifty-fifty. From this premise, classical probability theory tells us how to compute the chances of observing certain outcomes (for example, the probability of getting five tails in a row). Bayesian theory allows us to travel in the opposite direction, from observations to causes. It tells us, in a mathematically precise manner, how to answer such questions as “After several coin flips, should I change my views on the coin?” The default assumption is that the coin is unbiased . . . but if I see it land on tails twenty times, I have to revise my assumptions: the coin is most certainly rigged. Obviously, my original hypothesis has become improbable, but by how much? The theory precisely explains how to update our beliefs after each observation. Each assumption is assigned a number that corresponds to a plausibility or confidence level. With each observation, this number changes by a value that depends on the degree of improbability of the observed outcome. Just as in science, the more improbable an experimental observation is, the more it violates the predictions of our initial theory, and the more confidently we can reject that theory and look for alternative interpretations.
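Here is a minimal numerical sketch of that updating process, in Python, with an assumed prior of 1 percent that the coin is rigged to always land on tails; the exact numbers are illustrative, but the mechanics are simply Bayes’ rule applied flip after flip.

```python
# Bayes' rule on the coin example: how twenty tails in a row
# overturn an initially skeptical belief.
prior_rigged = 0.01      # assumed prior: rigged coins are rare
p_tails_if_fair = 0.5
p_tails_if_rigged = 1.0  # assumed: a rigged coin always lands on tails

posterior = prior_rigged
for flip in range(1, 21):
    # Posterior odds = current odds x likelihood ratio of this observation.
    odds = (posterior / (1 - posterior)) * (p_tails_if_rigged / p_tails_if_fair)
    posterior = odds / (1 + odds)
    if flip in (1, 5, 10, 20):
        print(f"after {flip:2d} tails in a row: P(rigged) = {posterior:.4f}")
# after 1 tails: ~0.02    after 10 tails: ~0.91    after 20 tails: ~0.9999
```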

Bayesian theory is remarkably effective. During the Second World War, British mathematician Alan Turing (1912–54) used it to decrypt the Enigma code. At the time, German military messages were encrypted by the Enigma machine, a complex contraption of gears, rotors, and electrical cables, assembled to produce over a billion different configurations that changed after each letter. Every morning, the cryptographer would place the machine in the specific configuration that was planned for that day. He would then type a text, and Enigma would spit out a seemingly random sequence of letters, which only the owner of the encryption key could decode. To anyone else, the text seemed totally devoid of any order. But herein lies Turing’s genius: he discovered that if two machines had been initialized in the same way, it introduced a slight bias in the distribution of letters, so that the two messages were slightly more likely to resemble each other. This bias was so small that no single letter was enough to conclude anything for certain. By accumulating those improbabilities, however, letter after letter, Turing could progressively gather more and more evidence that the same configuration had indeed been used twice. On this basis, and with the help of what they whimsically called “the bomb” (a large, ticking electromechanical machine that prefigured our computers), he and his team regularly broke the Enigma code.
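Turing’s actual procedure was far more intricate, but its statistical core, adding up one tiny log-likelihood ratio per letter until the total becomes decisive, can be sketched as follows. The coincidence probabilities below are illustrative assumptions, not the historical values.

```python
import math, random

# Sequential accumulation of weak evidence, in the spirit of Turing's method.
# Two hypotheses about a pair of aligned encrypted messages:
#   H1: same machine setting  -> aligned letters coincide a bit more often
#   H0: different settings    -> letters coincide only at chance level
p_match_same = 0.066    # illustrative value, not the historical figure
p_match_diff = 1 / 26   # pure chance, about 0.038

def log_likelihood_ratio(match: bool) -> float:
    if match:
        return math.log(p_match_same / p_match_diff)
    return math.log((1 - p_match_same) / (1 - p_match_diff))

random.seed(0)
total, threshold = 0.0, math.log(1000)  # stop once H1 is 1000x more likely
for n_letters in range(1, 10_000):
    match = random.random() < p_match_same   # simulate a same-setting pair
    total += log_likelihood_ratio(match)     # each letter adds a tiny crumb
    if total > threshold:
        print(f"decision after {n_letters} letters: same setting")
        break
# No single letter proves anything, but thousands of small biases add up.
```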

Again, what’s the relevance to our brains? Well, the very same type of reasoning seems to occur inside our cortex.23 According to this theory, each region of the brain formulates one or more hypotheses and sends the corresponding predictions to other regions. In this way, each brain module constrains the assumptions of the next one, by exchanging messages that convey probabilistic predictions about the outside world. These signals are called “top-down” because they start in high-level cerebral areas, such as the frontal cortex, and make their way down to the lower sensory areas, such as the primary visual cortex. The theory proposes that these signals express the realm of hypotheses that our brain considers plausible and wishes to test.

In sensory areas, these top-down assumptions come into contact with “bottom-up” messages from the outside world, for instance, from the retina. At this moment, the model is confronted with reality. The theory says that the brain should calculate an error signal: the difference between what the model predicted and what has been observed. The Bayesian algorithm then indicates how to use this error signal to modify the internal model of the world. If there is no error, the model was right. Otherwise, the error signal moves up the chain of brain areas and adjusts the model parameters along the way. Relatively quickly, the algorithm converges toward a mental model that fits the outside world.
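Stripped to its bare bones, with a fixed learning rate standing in for the full Bayesian weighting of evidence, the loop described here can be sketched as follows; the quantities are, of course, placeholders.

```python
# A bare-bones prediction-error loop: the internal model sends a prediction
# "down", the world sends an observation "up", and only their difference
# (the error signal) is used to adjust the model.
true_value = 8.0     # some hidden property of the outside world
estimate = 0.0       # the brain's initial model of that property
learning_rate = 0.3  # stand-in for the Bayesian weighting of evidence

for step in range(1, 11):
    prediction = estimate             # top-down signal
    observation = true_value          # bottom-up signal (noise-free here)
    error = observation - prediction  # the only quantity sent back up
    estimate += learning_rate * error
    print(f"step {step:2d}: prediction = {prediction:.2f}, error = {error:.2f}")
# The error shrinks toward zero as the internal model converges on reality.
```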

According to this vision of the brain, our adult judgments combine two levels of insights: the innate knowledge of our species (what Bayesians call priors, the sets of plausible hypotheses inherited throughout evolution) and our personal experience (the posterior: the revision of those hypotheses, based on all the inferences we have been able to gather throughout our life). This division of labor puts the classic “nature versus nurture” debate to rest: our brain organization provides us with both a powerful start-up kit and an equally powerful learning machine. All knowledge must be based on these two components: first, a set of a priori assumptions, prior to any interaction with the environment, and second, the capacity to sort them out according to their a posteriori plausibility, once we have encountered some real data.

One can mathematically demonstrate that the Bayesian approach is the best way to learn. This is the only way to extract the very essence of a learning episode and get the most out of it. Even a few bits of information, such as the suspicious coincidences that Turing spotted in the Enigma code, may suffice to learn. Once the system processes them, like a good statistician patiently accumulating evidence, it will inevitably end up with enough data to refute certain theories and validate others.

Is this really how the brain works? Is it capable of generating, at birth, vast realms of hypotheses that it learns to pick from? Does it proceed by elimination, selecting hypotheses according to how well the observed data support them? Do babies, right from birth, act as clever statisticians? Are they able to extract as much information as possible from each learning experience? Let’s now take a closer look at the experimental data on babies’ brains.