In their magisterial survey of life on planet Earth, Maynard Smith and Szathmáry (1995) identified eight major transitions in the evolution of complexity in living things, for example, the emergence of chromosomes, the emergence of multicellular organisms, and the emergence of sexual reproduction. Astoundingly, each transition was characterized by the same two fundamental processes. First, there emerged some new form of cooperation with interdependence: “Entities that were capable of independent replication before the transition can replicate only as part of a larger whole after it” (p. 6). Second, this new form of cooperation was made possible by a concomitant new form of communication: a “change in the method of information transmission” (p. 6).

The most recent major transition, in this account, was the emergence of human cooperative societies (cultures) structured by linguistic communication. Our ultimate goal is to give an account of this emergence, with a specific focus on the new forms of thinking that it engendered. But we cannot go directly from competitive great ape societies to cooperative human cultures in one giant leap. The problem is that there are thousands of human cultures, and each of them has conventionalized, normativized, and institutionalized a particular set of cultural and communicative practices. But anything may be conventionalized, normativized, or institutionalized; these processes are totally blind to content. And so to get to cooperatively organized human cultures, there must have been in existence in all human populations already—as raw material for these group-level processes of cultural creation—many and varied kinds of cooperative social interactions of a type not possessed by other great apes. Assuming again that great apes are representative of humans’ last common ancestor with other primates, then, it would seem that we need an intermediate step in our natural history. We need some early humans who were not yet living in cultures and using conventional languages, but who were nevertheless much more cooperatively inclined than the last common ancestor.

And so we will posit in this chapter, as an initial step, some early humans who created new forms of social coordination, perhaps in the context of collaborative foraging. Early humans’ new form of collaborative activity was unique among primates because it was structured by joint goals and joint attention into a kind of second-personal joint intentionality of the moment, a “we” intentionality with a particular other, within which each participant had an individual role and an individual perspective. Early humans’ new form of cooperative communication—the natural gestures of pointing and pantomiming—enabled them to coordinate their roles and perspectives on external situations with a collaborative partner toward various kinds of joint objectives. The result was that these early humans “cooperativized” great ape individual intentionality into human joint intentionality involving new forms of cognitive representation (perspectival, symbolic), inference (socially recursive), and self-monitoring (regulating one’s actions from the perspective of a cooperative partner), which, when put to use in solving concrete problems of social coordination, constituted a radically new form of thinking.

So let us look, first, at the new form of collaboration that emerged with early humans, then at the new form of cooperative communication that early humans used to coordinate their collaborative activities, and then at the resulting new form of thinking that all of this collaborating and communicating required as substrate.

A New Form of Collaboration

Cooperation by itself does not create complex cognitive skills—witness the complex cooperation of the cognitively simple eusocial insects and the cooperative child care and food sharing of the not-so-cognitively-complex New World monkeys, marmosets and tamarins. The case of humans is unique, from a cognitive point of view, because the common ancestor to humans and other great apes had already evolved highly sophisticated skills of social cognition and social manipulation for purposes of competition (as well as highly sophisticated skills of physical cognition for purposes of manipulating causality in the context of tool use)—as documented in chapter 2.

Then, out of the elements of these sophisticated processes of individual intentionality built for competition—understanding how the particular goals and perceptions of others generate particular actions—humans evolved, in addition, even more sophisticated processes of joint intentionality, involving joint goals and joint attention, built for social coordination. And social coordination creates unique challenges for cognition and thinking. Whereas the social dilemmas of game theory (e.g., prisoner’s dilemma) occur when interactants’ goals and preferences mostly conflict, coordination dilemmas occur when individuals’ goals and preferences mostly align. The challenge in these cases is not to resolve some conflict, but rather to find a way, perhaps by thinking, to coordinate with a social partner to a common goal.

The Cooperative Turn

Chimpanzees and other great apes live in highly competitive societies in which individuals vie with others for valued resources all day every day, and, as argued above, this is what shapes their cognition most profoundly. But chimpanzees and other apes also engage regularly in a number of important activities that are cooperative in a very general sense. For example, chimpanzees travel together and forage in small groups, “allies” support one another in fights within the group, and males engage in group defense against outsiders and predators (Muller and Mitani, 2005). These group behaviors for traveling, fighting, and defending the group are also common in many other mammalian species.

To illustrate the difference with human cooperation, let us focus on foraging, clearly one of the fundamental activities of all primates. The typical scene for chimpanzees, for instance, is that a small traveling party comes upon a fruit tree. Each individual then scrambles up on its own, finds a good place to procure some fruit on its own, grabs one or several pieces on its own, and then separates from the others by a few meters to eat. In one recent experiment, when given a choice of acquiring food cooperatively or alone, chimpanzees preferred to acquire it alone (Bullinger et al., 2011a). In another recent experiment, when given a choice of eating together with a groupmate or eating alone, both chimpanzees and bonobos preferred to eat alone (Bullinger et al., 2013). If there is ever a conflict over a piece of food, the dominant individual (depending, ultimately, on fighting ability) gets it. In general, the acquisition of food via individual scrambling and contests of dominance characterizes virtually all of the foraging activities of the four great ape species.

The main exception to this general great ape pattern is chimpanzees’ group hunting of monkeys—systematically observed only in chimpanzees, and only in some groups (Boesch and Boesch, 1989; Watts and Mitani, 2002). What happens prototypically is that a small party of male chimpanzees spies a red colobus monkey somewhat separated from its group, which they then proceed to surround and capture. Normally, one individual begins the chase, and others scramble to the monkey’s possible escape routes, including the ground. One individual actually captures the monkey, and he ends up getting the most and best meat. But because he cannot dominate the carcass on his own, all participants (and many bystanders) usually get at least some meat, depending on their dominance and the vigor with which they beg and harass the captor (Gilby, 2006).

The social and cognitive processes involved in chimpanzee group hunting could potentially be complex, but they could also be fairly simple. The “rich” reading is a human-like reading, namely, that chimpanzees have the joint goal of capturing the monkey together and that they coordinate their individual roles in doing so (Boesch, 2005). But more likely, in our opinion, is a “leaner” interpretation (Tomasello et al., 2005). In this interpretation, each individual is attempting to capture the monkey on its own (since captors get the most meat), and they take into account the behavior, and perhaps intentions, of the other chimpanzees as these affect their chances of capture. Adding some complexity, individuals prefer that one of the other hunters capture the monkey (in which case they will get a small amount of meat through begging and harassing) to the possibility of the monkey escaping totally (in which case they get no meat). In this view, chimpanzees in a group hunt are engaged in a kind of co-action in which each individual is pursuing his own individual goal of capturing the monkey (what Tuomela, 2007, calls “group behavior in I-mode”). In general, it is not clear that chimpanzees’ group hunting of monkeys is so different cognitively from the group hunting of other social mammals, such as lions and wolves.

In stark contrast, human foraging is collaborative in much more fundamental ways. In modern forager societies, individuals produce the vast majority of their daily sustenance collaboratively with others, either immediately through collaborative efforts or via procurers who bring the food back to some central location for sharing (Hill and Hurtado, 1996; Hill, 2002; Alvard, 2012).¹ Human foragers also collaborate in many other domains of activity in ways that great apes do not. Tomasello (2011) systematically compares the social structures of great ape and human forager societies and concludes that in every domain, whereas apes behave mostly individualistically, humans behave mostly cooperatively. For example, humans but not apes engage in cooperative childcare in which all adults do all kinds of things to support developing children (so-called cooperative breeding; Hrdy, 2009). Humans but not apes engage in cooperative communication in which they provide one another with information that they judge to be useful for the recipient. Humans but not apes actively teach one another things helpfully, again for the benefit of the recipient. Humans but not apes make group decisions about group-relevant matters. And humans but not apes create and maintain all kinds of formal social structures such as social norms and institutions and even conventional languages (using agreed-upon means of expression). In all, cooperation is simply a defining feature of human societies in a way that it is not for the societies of the other great apes.²

Exactly when and how this cooperative turn took place in human evolution are not critical for current purposes. But for whatever it is worth, Tomasello et al. (2012) hypothesize that it began happening in an initial, preparatory step soon after the emergence of the genus Homo, around 2 million years ago. During this period there was a great expansion of terrestrial monkeys, like baboons, that might have outcompeted humans for their normal fruits and other vegetation. Humans then needed a new foraging niche. A beginning might have been scavenging meat, which would probably have required a kind of coalition of individuals to frighten off the animals that made the initial kill. But at some point there began more active collaborative hunting of large game and gathering of plant foods, typically in a mutualistic stag hunt–type situation in which both individuals could expect to benefit from the collaboration—if they could somehow manage to coordinate their efforts. This is the collaborative creature we are imagining here, and for the most clarity we may focus on its culmination in hominins of about 400,000 years ago: the common ancestor to Neanderthals and modern humans, the ever mysterious Homo heidelbergensis. Paleoanthropological evidence suggests that this was the first hominin to engage systematically in the collaborative hunting of large game, using weapons that almost certainly would not enable a single individual to be successful on its own, and sometimes bringing prey back to a home base (Stiner et al., 2009). This is also a time when brain size and population size were both expanding rapidly (Gowlett et al., 2012). We may hypothesize that these collaborative foragers lived as more or less loose bands comprising a kind of pool of potential collaborators.

But more important than when is how. In the hypothesis of Tomasello et al. (2012), obligate collaborative foraging became an evolutionarily stable strategy for early humans because of two interrelated processes: interdependence and social selection. The first and most basic point is that humans began a lifestyle in which individuals could not procure their daily sustenance alone but instead were interdependent with others in their foraging activities—which meant that individuals needed to develop the skills and motivations to forage collaboratively or else starve. There was thus direct and immediate selective pressure for skills and motivations for joint collaborative activity (joint intentionality). The second point is that as a natural outcome of this interdependence, individuals began to make evaluative judgments about others as potential collaborative partners: they began to be socially selective, since choosing a poor partner meant less food. Cheaters and laggards were thus selected against, and bullies lost their power to bully. Importantly, this now meant that early human individuals had to worry, in a way that other great apes do not, both about evaluating others and about how others were evaluating them as potential collaborative partners (i.e., a concern for self-image).

The situation these early humans faced is perhaps best modeled by the stag hunt scenario from game theory (Skyrms, 2004). Two individuals have easy access to low-payoff “hares” (e.g., low-calorie vegetation), and then there appears on the horizon a high-payoff but difficult-to-obtain “stag” (e.g., large game) that can be acquired only if individuals abandon their hares and collaborate. Their motivations thus align, because it is in both their interests to work together. The dilemma is thus purely cognitive: since collaboration is mandatory and I am risking my hare, I want to go for the stag only if you do, too. But you only want to go for the stag if I do, too. How do we coordinate this potential standoff? There are some cognitively simple ways out of the dilemma (see Bullinger et al., 2011b, for the leader-follower strategy that chimpanzees use), but they always involve one individual incurring disproportionate risk, and so they are unstable in certain circumstances. For example, if there were very few hares, so that each was highly valued, and hunting stags was only rarely successful, then the cost/benefit analysis would require that each individual attempt to make certain that their potential partner was also going for the stag before they relinquished their hare.
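The payoff logic of this scenario can be made concrete with a small sketch. The numbers below are illustrative assumptions, not values taken from the text or from Skyrms (2004): a hare is worth 1 unit, a share of a jointly captured stag is worth 4, and going for the stag alone yields nothing. The function names are likewise hypothetical.

```python
from itertools import product

ACTIONS = ("stag", "hare")


def make_payoffs(hare=1.0, stag_share=4.0):
    """Return payoff[(my_action, partner_action)] -> my payoff.

    All values are illustrative assumptions. Going for the stag
    alone means giving up the hare and getting nothing.
    """
    return {
        ("stag", "stag"): stag_share,
        ("stag", "hare"): 0.0,
        ("hare", "stag"): hare,
        ("hare", "hare"): hare,
    }


def pure_equilibria(payoff):
    """List action pairs in which neither hunter gains by switching alone."""
    stable = []
    for mine, yours in product(ACTIONS, repeat=2):
        i_stay = all(payoff[(mine, yours)] >= payoff[(alt, yours)] for alt in ACTIONS)
        you_stay = all(payoff[(yours, mine)] >= payoff[(alt, mine)] for alt in ACTIONS)
        if i_stay and you_stay:
            stable.append((mine, yours))
    return stable


def belief_threshold(payoff):
    """Smallest probability that my partner goes for the stag at which
    going for the stag is at least as good for me as keeping my hare."""
    gain = payoff[("stag", "stag")] - payoff[("hare", "stag")]
    loss = payoff[("hare", "hare")] - payoff[("stag", "hare")]
    return loss / (gain + loss)


if __name__ == "__main__":
    easy = make_payoffs(hare=1.0, stag_share=4.0)
    scarce = make_payoffs(hare=3.0, stag_share=4.0)  # hares highly valued
    print(pure_equilibria(easy))
    print(belief_threshold(easy), belief_threshold(scarce))
```

Under these assumed payoffs there are two stable outcomes, both hunting the stag and both keeping their hares, which is why the problem is one of coordination rather than conflict. Raising the value of the hare from 1 to 3 raises the confidence I need in my partner from 0.25 to 0.75, a rough analogue of the case in which hares are scarce and each individual must make more certain that the partner is also going for the stag.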

In the original analyses of Schelling (1960) and Lewis (1969), coordinating in this way required some kind of mutual knowledge or recursive mind reading: for me to go, I have to expect you to expect me to expect you.… For both Schelling and Lewis, this process, while remarkable, did not cause alarm. Later commentators problematized the analysis, pointing out that an infinite back-and-forth of us thinking about one another’s thinking could not actually be happening, or no decision could ever be made. Clark (1996) proposed, as a more realistic account, that humans simply recognize the “common ground” they have with others (e.g., we both know that we both want to go for the stag) and that this is sufficient for making joint decisions toward joint goals. Tomasello (2008) suggested that something like common ground is how people actually operate, but when perturbations occur they often explain them by reasoning that “he thinks that I think he thinks …” (typically only a few iterations deep), suggesting an underlying recursive structure. Our position is thus that human individuals are attuned to the common ground they share with others, and this does not always involve recursive mind reading, but still, if necessary, they may decompose their common ground a few recursive layers deep to ask such things as what he thinks I think about his thinking.

In any case, we may imagine that individuals who were attuned to their common ground with others, and who could engage in some level of recursive mind reading, had a huge advantage in strategically deciding when to stay with their hare and when to join with others to go for the more profitable stag. And those who could develop more sophisticated forms of cooperative communication would have had an even larger advantage. And so, the first step in our natural history of uniquely human thinking is the cognitive mechanisms of joint intentionality that evolved to coordinate humans’ earliest species-unique forms of small-scale collaboration and, later, cooperative communication.

Joint Goals and Individual Roles

We may characterize the formation of a joint goal (or joint intention) in more detail as follows (see Bratman, 1992). For you and me to form a joint goal (or joint intention) to pursue a stag together, (1) I must have the goal to capture the stag together with you; (2) you must have the goal to capture the stag together with me; and, critically, (3) we must have mutual knowledge, or common ground, that we both know each other’s goal.

It is important here that each of our goals is not just to capture the stag but, rather, to capture it together with the other. Each of us wanting separately to capture the stag (even if this was mutual knowledge; see Searle, 1995) would constitute two individuals hunting in parallel, not jointly. It is also important that we have mutual knowledge of one another’s goal, that is, that our respective goals are part of our common conceptual ground. Each of us may want to capture the stag together with the other, but if neither of us knows that this is the case, we very likely will not succeed in coordinating (for all of the reasons outlined by Lewis and Schelling, among others). Thus, joint intentionality is operative both in the action content of each of our goals or intentions—that we act together—and in our mutual knowledge, or common ground, that we both know that we both intend this.

Young children begin engaging with others in ways that suggest some form of joint goal from around fourteen to eighteen months of age, when they are still mostly prelinguistic. Thus, Warneken et al. (2006, 2007) had infants of this age engage in a joint activity with an adult, such as obtaining a toy by each operating one side of an apparatus. Then, the adult simply stopped playing her role for no reason. The children were not happy about this and did various things to attempt to reengage their partner. (They did not do this if her stopping was for a good reason; e.g., she had to attend to something else [Warneken et al., 2012].) Interestingly, when this same situation was arranged for human-raised chimpanzees, they simply ignored the recalcitrant partner and tried to find ways to achieve the goal on their own. Although infants’ reengagement attempts do not suggest necessarily that they have a fully adult-like joint goal in common ground with their partner, at the very least they reflect an expectation that, barring obstacles, my partner in this joint activity is committed enough to reengage after a stoppage—an expectation that, apparently, chimpanzees in similar activities do not have.

By the time they are three years of age, children provide much more convincing evidence for joint goals because they themselves display commitment to the joint activity in the face of distractions and temptations. For example, Hamann et al. (2012) had pairs of three-year-old children work together to bring rewards to the top of a step-like structure. The problem was that for one child the reward, surprisingly, became available midway through. Nevertheless, when this happened, the lucky child delayed consumption of her own reward and persevered until the other got hers (i.e., she helped more than children helped a partner in a similar situation in which they were acting individually, without collaboration). Such commitment to the partner suggests that the children constructed a joint goal at the beginning that “we” get the prizes together, and they made whatever adjustments were necessary to realize that joint goal. Again, great apes do not behave in this same way. In a similar experiment with chimpanzees, Greenberg et al. (2010) found no signs of a human-like commitment to follow through on the joint action until both partners received their reward. (And Hamann et al. [2011] found that at the end of the collaborative activity, three-year-old children, but not chimpanzees, were committed to dividing the spoils equally among participants as well.)

Importantly, when children of this same age have it in their common ground with a collaborative partner that each is counting on the other to come through (we are interdependent), they both feel obligated to the other (see Gilbert, 1989, 1990). Thus, Gräfenhain et al. (2009) had preschoolers explicitly agree to play a game with one adult, and then another adult attempted to lure them away to a more exciting game. Although two-year-old children mostly just bolted to the new game straightaway, from three years of age children paused before departing and “took leave,” either verbally or by handing the adult the tool they had been using together. The children seemed to recognize that joint goals involve joint commitments, the breaking of which requires some kind of acknowledgment or even apology. No study of this type has ever been done with chimpanzees, but there are no published reports of one chimpanzee taking leave from, making excuses to, or apologizing to another for breaking a joint commitment.

In addition to joint goals, collaborative activities also demand a division of labor and so individual roles. Bratman (1992) specifies that in joint cooperative activities individuals must “mesh” their subplans together toward the joint goal, and even help one another in their individual roles as necessary. In the Hamann et al. (2012) study cited above, young children stopped to help their partner as needed. This demonstrates that the partners are attending to one another and their respective subgoals, and perhaps even attending to the partner attending to them, and so forth. Indeed, other studies have found that young children, but not chimpanzees, learn important new things about the partner’s role as they are collaborating. For example, Carpenter et al. (2005) found that after young children played one role in a collaboration, they could quickly switch to the other, whereas chimpanzees could not do this (Tomasello and Carpenter, 2005). Most important, Fletcher et al. (2012) found that three-year-olds who had first participated in a collaboration playing role A then knew much better how to play role B than if they had not previously collaborated, whereas this was not true of chimpanzees.

Young children are thus beginning to understand that the roles in a collaborative activity are in most cases interchangeable among individuals, which suggests a “bird’s eye view” of the collaboration in which the various roles, including one’s own, are all conceptualized in the same representational format (see Hobson, 2004). This species-unique understanding may support an especially deep appreciation of self-other equivalence, as individuals imagine different subjects/agents engaging in similar or complementary activities simultaneously in the same collaborative activity. As suggested in our discussion of great ape thinking, the understanding of self-other equivalence is a key component enabling various kinds of combinatorial flexibility in thinking. (It also sets the stage for a full-blooded appreciation of agent neutrality encompassing not just self and other but all possible agents, which is a key feature of cultural norms and institutions, and “objectivity” more generally, as we shall see in chapter 4.)

Preschool children are not good models for the early humans we are picturing here because they are modern humans and they are bathed in culture and language from the beginning. But from soon after their first birthdays, and continuing up to their third birthdays, they come to engage with others in collaborative activities that have a species-unique structure and that do not, in any obvious way, depend on cultural conventions or language. These young children coordinate a joint goal, commit themselves to that joint goal until all get their reward, expect others to be similarly committed to the joint goal, divide the common spoils of a collaboration equally, take leave when breaking a commitment, understand their own and the partner’s role in the joint activity, and even help the partner in her role when necessary. When tested in highly similar circumstances, humans’ nearest primate relatives, great apes, do not show any of these capacities for collaborative activities underlain by joint intentionality. Importantly, young children also seem to have a species-unique motivation for collaboration, as shown in recent studies in which children and chimpanzees had to choose between pulling in a certain amount of food collaboratively with a conspecific or pulling in that same amount of food (or more or less) in a solo activity. Children very much preferred the collaborative option, whereas chimpanzees went wherever there was most food regardless of opportunities for collaboration (Rekers et al., 2011; Bullinger et al., 2011a).

BOX 1. Relational Thinking

Penn et al. (2008) have proposed that what makes human cognition different from that of other primates is thinking in terms of relations, especially higher-order relations. To support their claim, they review evidence from many different domains of cognition: judgments of relational similarity, judgments of same-difference, analogy, transitive inference, hierarchical relations, and so forth.

Their evaluation of the literature is decidedly one-sided, as they dismiss findings suggesting that nonhuman primates have some skills of this type. For example, nonhuman primates clearly understand some relations (consistently choosing the larger of two objects, for example, regardless of absolute size), and some individuals make same-difference judgments, again based on relational, not absolute, characteristics (Thompson et al., 1997). Some chimpanzees also do something like analogical inference in using a scale model (Kuhlmeier et al., 1999), and many primates make transitive inferences (see Tomasello and Call, 1997, for a review).

But at the same time, it is true that humans are particularly skilled at relational thinking (Gentner, 2003). One hypothesis that might explain the data is that there are actually two kinds of relational thinking. One concerns the concrete physical world of space and quantities, in which we may compare various characteristics or magnitudes such as bigger-smaller, brighter-darker, fewer-greater, higher-lower, and even same-different. Nonhuman primates have some skills with these kinds of physical relations and relational magnitudes. What they may not comprehend at all—though there are few direct tests—is a second type of relation. Specifically, they may not comprehend functional categories of things defined by their role in some larger activity. Humans are exceptional in creating categories such as pet, husband, pedestrian, referee, customer, guest, tenant, and so forth, what Markman and Stillwell (2001) call “role-based categories.” They are relational not in the sense of comparing two physical entities but, rather, in assessing the relation between an entity and some larger event or process in which it plays a role.

The obvious hypothesis here is that this second type of relational thinking comes from humans’ unique understanding of collaborative activities with joint goals and individual roles (perhaps later generalized to all kinds of social activities even if they are not collaborative per se). As humans constructed these kinds of activities, they were creating more or less abstract “slots” or roles that anyone could play. These abstract slots formed role-based categories, such as things that one uses to kill game (viz., weapons; Barsalou, 1983), as well as more abstract narrative categories such as protagonist, victim, avenger, and so on. A further speculation might be that these abstract slots at some point enabled humans to even put relational material in the slots; for example, a married couple can play a role in a cultural activity. This would be the basis for the kinds of higher-order relational thinking that Penn et al. (2008) emphasize as especially important in differentiating human thinking.

In any case, the proposal here is that, at the very least, constructing the kinds of dual-level cognitive models needed to support collaborative activities enhanced, if not enabled, human engagement in much broader and more flexible relational thinking involving roles in larger social realities, and possibly in higher-order relational thinking as well.

The main point for now is that early humans seem to have created a new cognitive model. Collaborating toward a joint goal created a new kind of social engagement, a joint intentionality in which “we” are hunting antelopes together (or whatever), with each partner playing her own interdependent role. This dual-level structure of simultaneous sharedness and individuality—a joint goal but with individual roles—is a uniquely human form of second-personal joint engagement requiring species-unique cognitive skills and motivational propensities. It also has a number of perhaps surprising ramifications for many different aspects of human cognition that go beyond our primary focus here (see box 1 for one example).

Joint Attention and Individual Perspectives

Organisms attend to situations that are relevant to their goals. And so, when two humans act together jointly, they naturally attend together jointly to situations that are relevant to their joint goals. Said another way, as humans coordinate their actions, they also, to do this effectively, coordinate their attention. Underlying this coordination is, once again, some notion of common ground, in which each individual—at least potentially—can attend to his partner’s attention, his partner’s attention to his attention, and so forth (Tomasello, 1995). Joint actions, joint goals, and joint attention are thus of a piece, and so they must have coevolved.

The current proposal is that the phylogenetic origins of the ability to participate with others in joint attention—the first and most concrete way in which young children create common conceptual ground and so shared realities with others—lie in collaborative activities. This is what Tomasello (2008) calls the “top-down” version of joint attention because it is directed by joint goals. (The alternative is bottom-up joint attention, such as when a loud noise attracts both of our attention, and we both know it must have attracted the other’s attention as well.) Ontogenetically, young children begin to structure their joint actions with others via joint visual attention at around nine to twelve months of age, in what are often called joint attentional activities. These are such activities as giving and taking objects, rolling a ball back and forth, building a block tower together, putting away toys together, and “reading” books together. Despite specific attempts to identify and solicit such joint attentional activities with human-raised chimpanzees, Tomasello and Carpenter (2005) were unable to find any (nor are there any other reliable reports of joint attention in nonhuman primates).

Just as each partner in a joint collaborative activity has her own individual role, each partner in joint attentional engagement has her own individual perspective—and knows that the other has her own individual perspective as well. The crucial point, which will be foundational for all that follows, is that the notion of perspective assumes a single target of joint attention on which we have differing perspectives (Moll and Tomasello, 2007, in press). If you are looking out one window of the house and I am looking out another in the opposite direction, we do not have different perspectives—we are just seeing completely different things. We can thus operate with the notion of individually distinct perspectives only if (1) we are both considering “the same” thing, and (2) we both know the other is attending to it differently. If I see something in one way, and then round the corner to see it in another, this does not give me two perspectives on the same thing, because I do not have multiple perspectives available to me simultaneously for comparison. But when two people are attending to the same thing simultaneously—and it is in their common ground that they are both doing so—then “space is created” (to use Davidson’s [2001] metaphor) for an understanding of different perspectives to arise.³

Young children begin showing an appreciation that others have perspectives that differ from their own from soon after their first birthdays, in conjunction with their earliest joint attentional activities. For example, in one experiment an adult and child played together jointly with three different objects for a short time each (Tomasello and Haberl, 2003). Then, while the adult was out of the room, the child and a research assistant played with a fourth object. After that the adult returned, looked at an array containing all four objects, and exclaimed excitedly “Wow! Cool! Look at that!” Under the assumption that people only get excited about new things, not old things, children as young as twelve months of age identified which of the objects was the new one causing the adult’s excitement—even though all four were equally old for the children themselves. The new one is the one we have not attended to together before.

This is what some researchers have called level 1 perspective taking, because it concerns only whether the other person does or does not see something, not how she sees it. In level 2 perspective taking, children understand that someone sees the same thing differently than they do. For example, Moll et al. (2013) found that three-year-old children understood which object an adult intended to indicate by calling it “blue,” even though the object did not appear blue to the child—only to the adult due to her looking through a color filter. Children could thus take the perspective of the other person when it differed from their own. However, these same children could not answer correctly when asked if they and the adult saw the object differently at the same time. Indeed, children struggle with several versions of such simultaneously conflicting perspectives on a jointly attended-to situation until they are four or five years of age (Moll and Tomasello, in press). Thus, children before four or five years have difficulty with dual naming tasks (“it” is simultaneously a horse and a pony), appearance-reality tasks (“it” is simultaneously a rock and a sponge; Moll and Tomasello, 2012), and false belief tasks (“it” is in the cabinet or in the box). Resolving the conflict between perspectives that confront one another on a jointly attended-to entity—especially when both purport to depict “reality”—takes some additional skills for dealing with an objective reality and how different perspectives relate to it, which again will await the next step in our evolutionary story (and see note 6).

And so we have come to a tipping point. Based on their capacity to coordinate actions and attention with others toward joint goals, early humans came to understand that different individuals can have different perspectives on one and the same situation or entity. In contrast, great apes (including the last common ancestor with humans) do not coordinate their actions and attention with others in this same way, and so they do not understand the notion of simultaneously different perspectives on the same situation or entity at all. We thus encounter once again the dual-level structure of simultaneous jointness and individuality. Just as collaborative activities have the dual-level structure of joint goal and individual roles, joint attentional activities have the dual-level structure of joint attention and individual perspectives. Joint attention thus begins the process by which human beings construct an intersubjective world with others—shared but with differing perspectives—which will also be fundamental to human cooperative communication. We may thus posit that joint attention in joint collaborative activity, as manifest even in very young children, was the most basic form of socially shared cognition in human evolution—characteristic already of early humans—and that this primal version of socially shared cognition spawned an equally primal version of perspectivally constructed cognitive representations.

Social Self-Monitoring

Early humans living as obligate collaborative foragers would have become more deeply social in still another way. Although skills of joint intentionality are necessary for human-like collaborative foraging, they are not sufficient. One also has to find a good partner. This may not always be overly difficult, as even chimpanzees, after some experience, learn which partners are good (i.e., lead to success) and which are not (Melis et al., 2006b). But in addition, in situations in which there is meaningful partner choice, one must be—or at least appear to be—a good collaborative partner oneself. To be an attractive partner for others, and so not be excluded from collaborative opportunities, one must not only have good collaborative skills, but also do one’s share of the work, help one’s partner when necessary, share the spoils at the end of the collaboration, and so forth.

And so early humans had to develop a concern for how other individuals in their group were evaluating them as potential collaborative partners, and then regulate their actions so as to affect these external social judgments in positive ways—what we may call social self-monitoring. Other great apes do not appear to engage in such social self-monitoring. Thus, when Engelmann et al. (2012) gave apes the opportunity either to share food with a groupmate or to steal food from one, their behavior was totally unaffected by the presence or absence of other group members observing the process. In contrast, in the same situations, young human children shared more with others and stole less from others when another child was watching.

Motivationally, a concern for social evaluation derives from the interdependence of collaborative partners: my survival depends on how you judge me. Cognitively, a concern for social evaluation involves still another form of recursive thinking: I am concerned about how you are thinking about my intentional states. Social self-monitoring is thus the first step in humans’ tendency to regulate their behavior not just by its instrumental success, as apes do in their goal-directed activities, but also by the anticipated social evaluations of important others. Because these concerns are about the evaluations of specific other individuals, we may think of them as second-personal phenomena. They thus represent an initial sense of social normativity—a concern for what others think I should and should not be doing and thinking—and so a first step toward the kind of normative self-governance, so as to fit in with group expectations, that will characterize modern humans in the next step of our story (see chapter 4).

Summary: Second-Personal Social Engagement

Great apes have a multitude of social-cognitive skills for understanding the intentional actions of others, but they do not engage with them in any form of joint intentionality. Thus, great apes understand that others have goals, and they sometimes even help others attain their goals (Warneken and Tomasello, 2009), but they do not collaborate with others by means of a joint goal. Similarly, great apes understand that others see things and so can follow the gaze direction of others to see what they see (Call and Tomasello, 2005), but they do not engage with others in joint attention. And great apes make individual decisions that they self-monitor, but they do not make joint decisions with others or monitor themselves via the social evaluations of others. What emerges for the first time with early humans, in the current account, is a “we” intentionality in which two individuals engage with the intentional states of one another both jointly and recursively.

This new form of joint engagement is second-personal: an engagement of “I” and “you.” Second-personal engagement has two minimal characteristics: (1) the individual is directly participating in, not observing from outside, the social interaction; and (2) the interaction is with a specific other individual with whom there is a dyadic relationship, not with something more general like a group (if there are multiple persons present, there are many dyadic relationships but no sense of group). There is less consensus about other possible features of second-personal engagement, but Darwall (2006) proposes, in addition, that (3) the essence of this kind of engagement is “mutual recognition” in which each partner gives the other, and expects from the other, a certain amount of respect as an equal individual—a fundamentally cooperative attitude among partners.

And so, the evolutionary proposal is that early humans—perhaps Homo heidelbergensis some 400,000 years ago—evolved skills and motivations for joint intentionality that transformed great apes’ parallel group activities (e.g., you and I are each chasing the monkey in parallel) into truly joint collaborative activities (e.g., we are chasing the monkey together, each with our own role). And they also transformed great apes’ parallel looking behavior (e.g., you and I are each looking at the banana) into true joint attention (e.g., we are looking at the banana together, each with our own perspective). But early humans were not doing this in the manner of contemporary human beings, that is, as manifest in relatively permanent cultural conventions and institutions. Rather, their earliest collaborative activities were ad hoc collaborations for particular goals on a particular occasion with a particular person, with their joint attention similarly structured in this second-personal way. There would thus be second-personal joint engagement with the partner, but when the collaboration was over the “we” intentionality would be over as well.

FIGURE 3.1 The dual-level structure of joint collaborative activity

The cognitive model schematized from repeated experiences of this kind was thus a dual-level structure of simultaneous sharing and individuality while in direct social interaction with other particular partners, underlain and supported by some kind of common ground and recursive mind reading (see Figure 3.1). The cognitive model of this second-personal, dual-level social engagement laid the foundation for almost everything that was uniquely human. It provided the joint intentionality infrastructure for uniquely human forms of cooperative communication involving intentions and inferences about perspectives—as we shall see in the section that follows—and, ultimately, it provided the foundation for the cultural conventions, norms, and institutions that brought the human species into the modern human world—as we shall see in the next chapter.

A New Form of Cooperative Communication

Early humans coordinated their actions and attention based on common ground. But coordinating in more complex ways—for example, in planning our specific roles in a collaboration under various contingencies, or in planning a series of joint actions—required a new type of cooperative communication. The gestures and vocalizations of ancient great apes could not have done this coordinating work. They could not have, first, because they were geared totally toward self-serving ends, and this simply does not mesh with mutualistic collaboration toward a joint goal. They could not have, second, because they were used exclusively in attempts to regulate the behavior of others directly, and this does not mesh with the need to coordinate actions and attention referentially on external situations and entities, as in collaborative foraging for food.

Tomasello (2008) argued and presented evidence that the first forms of uniquely human cooperative communication were the natural gestures of pointing and pantomiming used to inform others helpfully of situations relevant to them. Pointing and pantomiming are human universals that even people who share no conventional language can use to communicate effectively in contexts with at least some common ground. But to do this requires an extremely rich and deep set of interpersonal intentions and inferences in the context of this common ground. If I point for you in the direction of a tree, or pantomime for you a tree, without any common ground, you have nothing to guide your inferences about what I am intending to communicate to you or why. Pointing and pantomiming thus created for early humans new problems of social coordination—not just in coordinating actions with others but also in coordinating intentional states—and solving these new coordination problems required new ways of thinking.

A New Motive for Communicating

In joint collaborative activities in which the partners are interdependent, it is in the interest of each partner to help the other play her role. This is the basis for a new motive in human communication, not available to other apes (but see Crockford et al., 2011, for one possible exception), namely, the motive to help the other by informing her of situations relevant to her. The emergence of this motive was aided by the fact that in the context of a joint collaborative activity, directive communication and informative communication are not clearly distinct—because the partners’ individual motives are so closely intertwined. Thus, if we are gathering honey together and you are struggling with your role, I can point to a stick, which I intend as a directive to you to use it, or, alternatively, I can point to the stick intending only to inform you of its presence—because I know that if you see it you will most likely want to use it. When we are working together toward a joint goal, both of these work because our interests are so closely aligned.

The evolutionary proposal is thus that early humans’ first acts of cooperative communication were pointing gestures in joint collaborative activities, and these were underlain by a communicative motive not yet differentiated between requestive and informative. But at some point early humans began to understand their interdependence with others not just while the collaboration was ongoing but also more generally: if my best partner is hungry tonight, I should help her so that she will be in good shape for tomorrow’s foraging. And outside of collaborative activities, the difference between me requesting help from you, for my benefit, and me informing you of things helpfully, for your benefit, becomes crystal clear. And so there arose for early humans two distinct motives for their deictic communication, requestive and informative, which everyone both comprehended and produced.

When great apes work together in experiments, there is an almost total absence of intentional communication of any kind (e.g., Melis et al., 2009; Hirata, 2007; Povinelli and O’Neill, 2000). When apes communicate with one another in other contexts, it is always directive (Call and Tomasello, 2007; Bullinger et al., 2011c). In stark contrast, from as soon as they can collaborate meaningfully with others, at around fourteen to eighteen months of age, young children use the pointing gesture to coordinate their joint activity (e.g., Brownell and Carriger, 1990; Warneken et al., 2006, 2007)—with, again, a telling ambiguity about whether their motive is requestive or informative. But also, outside of collaborative activities, even twelve-month-old infants sometimes point simply to inform others of such things as the location of a sought-for object. For example, Liszkowski et al. (2006, 2008) placed twelve-month-olds in various situations in which they observed an adult misplace an object or lose track of it in some way, and then start searching. In these situations infants pointed to the sought-for object (more often than to distractor objects that were misplaced in the same way but were not needed by the adult), and in doing this they showed no signs of wanting the object for themselves (no whining, reaching, etc.). The infants simply wanted to help the adult by informing her of the location of the sought-for object.

The emergence of the informative communicative motive, alongside the general great ape directive motive, had three important consequences for the evolution of uniquely human thinking. First, the informative motive led communicators to make a commitment to informing others of things honestly and accurately, that is, truthfully. Initially during collaborative activities, but then more generally (as humans’ interdependence extended outside of collaborative activities), if individuals wanted to be seen as cooperative, they would commit themselves to always communicating with others honestly. Of course, you may still lie: you point to where you want me to search for my spear even though it is not really there, for some selfish motive. But lying only works if there is first a mutual assumption of cooperation and trust: you only lie because you know that I will trust your information as truthful and act accordingly. And so, while there is still some way to go to get to truth as an “objective” feature of linguistic utterances (see chapter 4), if we want to explain the origins of humans’ commitment to characterize the world accurately independent of any selfish purpose, then being committed to informing others of things honestly, for their not our benefit, is the starting point. The notion of truth thus entered the human psyche not with the advent of individual intentionality and its focus on accuracy in information acquisition but, rather, with the advent of joint intentionality and its focus on communicating cooperatively with others.⁴

The second important consequence of this new cooperative way of communicating was that it created a new kind of inference, namely, a relevance inference. The recipient of a cooperative communicative act asks herself: given that we know together that he is trying to help me, why does he think that I will find the situation he is pointing out to me relevant to my concerns? Consider great apes. If a human points and looks at some food on the ground, apes will follow the pointing/looking to the food and so take it—no inferences required. But if food is hidden in one of two buckets (and the ape knows it is in only one of them) and a human then points to a bucket, apes are clueless (see Tomasello, 2006, for a review). Apes follow the human’s pointing and looking to the bucket, but then they do not make the seemingly straightforward inference that the human is directing their attention there because he thinks it is somehow relevant to their current search for the food. They do not make this relevance inference because it does not occur to them that the human is trying to inform them helpfully—since ape communication is always directive—and this means that they are totally uninterested in why the human is pointing to one of the boring buckets. Importantly, it is not that apes cannot make inferences from human behavior at all. When a human first sets up with them a competitive situation and then reaches desperately toward one of the buckets, great apes know immediately that the food must be in that one (Hare and Tomasello, 2004). They make the competitive inference, “He wants in that bucket; therefore the food must be in there,” but they do not make the cooperative inference, “He wants me to know that the food is in that bucket.”

This pattern of behavior contrasts markedly with that of human infants. In the same situation, prelinguistic infants of only twelve months of age trust that the adult is pointing out to them something relevant to their current search—they comprehend the informative motive—and so they know immediately that the pointed-to bucket is the one containing the reward (Behne et al., 2005, 2012). The mutual assumption of cooperativeness in such situations is so natural for humans that they have developed a special set of signals—ostensive signals such as eye contact and addressing the other vocally—by means of which the communicator highlights for the recipient that he has some relevant information for her. Thus, as an evolutionary example, suppose that while we are collaboratively foraging I point to berries on a bush for you, with eye contact and an excited vocalization. You look and see the bush but, at first, no berries. So you ask yourself: why does he think that this bush is relevant for me—and this makes you look harder for something that is indeed relevant—and thus you discover the berries. As communicator, I know that you, as recipient, are going to be engaging in this process if and only if you see me as directing your attention cooperatively, and so I want to make sure that you know that I am doing this. Therefore, I not only want you to know that there are berries here but also want you to know that I want you to know this—so that you will follow through the inferential process to its conclusion (Grice, 1957; Moore, in press). By addressing you ostensively, and based on our mutual expectation of cooperation, I am in effect saying, “You are going to want to know this”—and you do want to know it because you trust that I have your interests in mind.

The third and final consequence of this newly cooperative way of communicating was that there now emerged, at least in nascent form, a distinction between communicative force—as overtly expressed in requestive and informative intonations—and situational or propositional content as indicated by the pointing gesture. (NB: This means that by this time early humans would have had to control their vocal expressions of emotions voluntarily in a way that apes do not.) Early humans could now point toward berries in a bush, with one of two different motives, expressed intonationally: either an insistent requestive intonation, in the hopes that the recipient would fetch some berries for her, or a neutral intonation to just inform the recipient of the berries’ location so that she might get some for herself. We thus now have a clear distinction between something like communicative force and communicative content: the communicative content is the presence of the berries, and the communicative force is either requestive or informative. All of this is implicit, of course, and so we still have some way to go to reach the conventionalized and so explicit distinction between communicative force and content that is so important in conventional linguistic communication (see chapter 4). But the breakthrough here is the relative independence of referential (situational, propositional) content from the communicator’s motives or intentions for referring attention to it.

And so, early humans’ joint collaborative activities created a new motivational infrastructure for their communication, a cooperative motivation to inform one another of things helpfully and honestly. This then motivated recipients to do significant inferential work to find out why the communicator thought that looking in a certain direction would be relevant for their concerns, which then motivated communicators to advertise when they had something relevant for a recipient. And the fact that there were now two different communicative motives possible—requestive and informative—meant that the situational (propositional) content of the communicative act was starting to be conceptualized as independent of the particular intentional states of the communicator.

New Ways of Thinking for Communicating

The motive to be helpful and cooperative in communication meant that, from the cognitive point of view, communicators had to be able to determine which situations were relevant to a recipient on a particular occasion. Conversely, recipients had to be able to identify the intended situation and its relevance for them, in essence by determining which situation in the direction of his pointing gesture the communicator thinks is relevant or interesting for them on this occasion—and why. The basic problem is that what the communicator wishes to point out to the recipient—his communicative intention⁵—is a whole fact-like situation, for example, that there are bananas in the tree, or that there are no predators in the tree. But the act of pointing—the protruding finger—is the same in all cases. The puzzle is how does one point out for a recipient different situations in the same perceptual scene?

The key to this puzzle is that the participants in a communicative interaction mutually assume the relevance of the communicative act for the recipient (Sperber and Wilson, 1996), and this relevance is always in relation to something that is in our common ground (Tomasello, 2008). Independent of communication, a situation is relevant to you for your own individual reasons. But for me to direct your attention to that situation successfully in communication, you must know that I know it is relevant for you; indeed, we must know together in our common ground that it is relevant for you. The simplest situation is thus when we are in a collaborative activity with the immediate common ground created by our joint goal. For example, if we have been searching unsuccessfully for bananas all day, you will naturally assume that my pointing gesture toward the banana tree is intended to indicate for you the fact that there are bananas in that tree. On the other hand, if we spied the bananas together some minutes ago but there was a predator in the tree and so we waited, and now the predator seems to have left, you will naturally assume that I am indicating for you the fact that there is now no predator in the tree. Common ground and a mutual assumption of relevance—not possible for apes because they simply do not engage in this kind of cooperative communication—enable a meeting of minds in the direction of the protruding finger.

Following the analysis in chapter 2, relevant situations are those that present individuals with opportunities and/or obstacles for reaching their goals and maintaining their values. Thus, if during our search for fruit I point toward a distant banana tree, it would never occur to you that I might be pointing out the presence of the leaves, even if leaves are all that you see at the moment, since the presence of leaves is in no way relevant to what we are doing. Instead, you will continue looking until you see, for example, some bananas behind the leaves, whose presence is highly relevant to what we are doing. Another dimension of this process is that only “new” situations are communicatively relevant, since currently shared situations need not be pointed out. And so, in the example from above, after the predator left the banana tree I pointed to the banana tree with the intention to indicate the situation of the predator’s absence, which you readily discerned. How could I intend and you infer predator absence when the presence of the bananas is also highly relevant? Because the presence of the bananas was already in our current common ground, and so me pointing out this situation to you would be superfluous. If I am going to be helpful, I must point out situations that are new for you, or else why bother. And so, in human cooperative communication, both communicators and recipients mutually assume in their common ground that communicators point out for recipients situations that are both relevant and new.

Perhaps surprisingly, even young infants are skillful at keeping track of the common ground they have with specific other individuals and using that to determine relevance in both the comprehension and production of pointing gestures. For example, Liebal et al. (2009) had a one-year-old infant and an adult clean up together by picking up toys and putting them in a basket. At one point the adult stopped and pointed to a target toy, which the infant then cleaned up into the basket. However, when the infant and adult were cleaning up in exactly this same way, and a second adult who had not shared this context entered the room and pointed toward the target toy in exactly the same way, infants did not put the toy away into the basket; they mostly just handed it to him, presumably because the second adult had not shared the cleaning up game with them as common ground. Infants’ interpretations thus did not depend on their own current egocentric activities and interests, which were the same in both cases, but rather on their shared experience with each of the pointing adults. (In another study, Liebal et al. [2010] found that infants of this same age also produced points differently depending on their common ground with the recipient.)

Infants in this same age range also use a mutual assumption of newness to determine what a pointing adult thinks is relevant for them. Thus, Moll et al. (2006) had eighteen-month-old infants play with an adult and a toy drum. If a new adult now entered the room and indicated the drum excitedly, the child assumed he was talking about the cool drum. But if the adult with whom the child had just been sharing enjoyment of the drum pointed to the drum excitedly in exactly the same manner, the child did not assume that she was excited about the drum: how could she be, since it is old news for us? Rather, children assumed that the adult’s excitement must be due to something new about the drum that they had not previously noticed, and so they attended to some new aspect, for example, on the adult’s side of the drum. In their production of pointing, infants also use this distinction between shared and new information. For example, one fourteen-month-old infant wanted his mother to put his high chair up to the dining room table: on one occasion he pointed to the chair (because he and his mother had already shared attention to the empty space at the table), whereas on another occasion he pointed to the empty space at the table (because he and his mother had already shared attention to the chair) (Tomasello et al., 2007a). In both cases the infant wants the exact same thing—his chair placed at the table—but to communicate effectively he assumes that the object he and his mother are focused on is already part of their common ground, and so he points out the aspect of the situation that she may not have noticed, the new part.⁶

Engaging in cooperative (ostensive-inferential) communication of this type requires some new types of thinking. In effect, all three components of the thinking process—representation, inference, and self-monitoring—must become socialized.

With respect to representation, the key novelty is that both participants in the communicative interaction must represent one another’s perspective on the situation and its elements. Thus, the communicator attempts to focus the recipient’s attention on one of the many possible situations—fact-like representations—immanent in the current perceptual scene (e.g., there are bananas in the tree versus there is no predator in the tree). The communicative act thus perspectivizes the scene for the recipient. It also perspectivizes the elements. For example, if we are building a fire, me pointing out to you the presence of a log construes that log as firewood. But if we are tidying up the cave, me pointing out to you the presence of that very same log construes it as trash. In the object choice task, the communicator is not pointing to the bucket qua physical object or qua vessel for carrying water, but rather qua location: I am informing you that the reward is located in there. Cooperative pointing thus creates different conceptualizations or construals of things. These presage the ability of linguistic creatures to place one and the same entity under alternative different “descriptions” or “aspectual shapes,” which is one of the hallmarks of human conceptual thinking; but it does this without the use of any conventional or symbolic vehicles with articulate semantic content.

With respect to inference, the key point is that the inferences used in cooperative communication are socially recursive. Thus, implicit in all of the foregoing is a kind of backing-and-forthing of individuals making inferences about the partner’s intentions toward my intentional states. In the object choice task, for example, the recipient infers that the communicator intends that she know that the food is in that bucket—a socially recursive inference that great apes apparently do not make. This inference requires in all cases an abductive leap, something like: his pointing in the direction of that otherwise boring bucket would make sense (i.e., would be consistent with common ground, relevance, and newness) if it is the case that he intends that I know where the reward is. The communicator, for his part, is attempting to help the recipient to make that abductive leap appropriately. To do this, at least in many situations, the communicator must engage in some kind of simulation, or thinking, in which he imagines how pointing in a particular direction will lead the recipient to make a particular abductive inference: if I point in this direction, what inferences will he make about my intentions toward his intentional states? And then, when making his abductive inference, the recipient can potentially take into account the communicator’s taking into account of what kind of inference she is likely to make about his communicative intentions. And so forth.

Finally, with respect to self-monitoring, the key is that being able to operate in this way communicatively requires individuals to self-monitor in a new way. As opposed to apes’ cognitive self-monitoring, this new way was social. Specifically, as an individual was communicating with another, he was simultaneously imagining himself in the role of the recipient attempting to comprehend him (Mead, 1934). And so was born a new kind of self-monitoring in which communicators simulated the perspective of the recipient as a kind of check on whether the communicative act was well formulated and so was likely to be understood. This is not totally unlike the concern for self-image characteristic of early humans (noted earlier in the discussion of collaboration) in which individuals simulate how they are being judged by others for their cooperativeness—it is just that in this case what is being evaluated is comprehensibility. Importantly, both of these kinds of self-monitoring are “normative” in a second-personal way: the agent is evaluating his or her own behavior from the perspective of how other social agents will evaluate it. Of this process, Levinson (1995, p. 411) says, “There is an extraordinary shift in our thinking when we start to act intending that our actions should be coordinated with—then we have to design our actions so that they are self-evidently perspicuous.”⁷ This social self-monitoring for intelligibility in cooperative communication lays the foundation for modern human norms of social rationality, where social rationality means making communicative sense to one’s partner.

These new processes of thinking involved in cooperative communication are well illustrated by two studies with young children. First, from the point of view of the communicator, is a study by Liszkowski et al. (2009) with twelve-month-old infants. In this study, an adult played a game with an infant in which the infant repeatedly needed a particular kind of object, always found in the same location on a plate. At some point, the infant needed one of those objects, but none was around. To get one, many infants alighted upon the strategy of pointing for the adult to the empty plate, that is, to the location where they both knew in common ground that those kinds of objects usually are found. To perform this communicative act, the infant had to simulate the adult’s process of comprehension: what abductive inference (about my intentions toward her intentional states) will she make if I point to the plate? That this is not just a simple association is suggested by the fact that chimpanzees, who are perfectly capable of associative learning, in this same setup made no attempts to direct the human’s attention to the empty plate (even though they did make pointing attempts in other contexts in this same study). The children simulated the adult making inferences about their intentions toward her intentional states.

To illustrate the process even more dramatically, and from the point of view of comprehension, we may consider the phenomenon of “markedness.” On some occasions, a communicator may mark (e.g., with intonational stress) something in her communicative act as out of the ordinary, so that the recipient will not make the normal inference but rather a different one. For example, Liebal et al. (2011) had an adult and a two-year-old child again tidying up toys into a large basket. In the normal course of events, when the adult pointed to a medium-sized box on the floor, the child took this to suggest that she should tidy up this box into the basket as well. But in some cases the adult pointed to the box with flashing eyes and a kind of insistent pointing directed at the child, obviously not the normal way of doing it. The adult clearly intended something different from the norm. In this case, many children looked at the adult puzzled but then proceeded to open the box and look at what was inside (and tidy it up). The most straightforward interpretation of this behavior is that the child understood that the adult was anticipating how she would construe a normal point, which he did not want, and so he was marking his pointing gesture so that she would be motivated to search for a different interpretation. This is the child thinking about the adult thinking about her thinking about his thinking.

And so, the kind of thinking that goes on in human cooperative communication is evolutionarily new in that it is perspectival and socially recursive. Individuals must think (simulate, imagine, make inferences) about their communicative partner thinking (simulating, imagining, making inferences) about their thinking—at the very least. Great apes show no signs of making such inferences, and their failure to comprehend even the simplest acts of cooperative pointing, for example, in the object choice task (while making nonrecursive inferences in the same task setting), provides positive evidence that they do not. Human thinking in cooperative communication also involves a new kind of social self-monitoring, in which the communicator imagines what perspective the recipient is taking, or will take, on his intentions toward her intentions—and so imagines how she will comprehend it. In all, what we have at this point in our evolutionary story of human communication is individuals attempting to coordinate their intentional states, and so their actions, by pointing out new and relevant situations to one another. This relies on their having a certain amount and type of common ground, and it requires, further, that the interactants make a series of interlocking and socially recursive inferences about one another’s perspectives and intentional states.

Symbolizing in Pantomime

Beyond the pointing gesture, the second form of “natural” communication that humans employ is spontaneously generated, nonconventional iconic gestures, or pantomime. These gestures are used to direct the imagination of others to nonpresent entities, actions, or situations. Iconic gestures go beyond simply directing attention to situations deictically, as in pointing, by actually symbolizing an entity, action, or situation in an external icon. Iconic gestures are “natural” because they employ normally effective intentional actions, just in a special way. The recipient can, on the basis of observing them, imagine the real actions or objects the communicator is pantomiming, and then, in the context of their common ground, make the appropriate inference to his communicative intention. Examples of informative uses of iconic gestures would be things like warning of a nearby snake by moving one’s hand in a slithering motion, telling of a deer at the waterhole by miming antlers on one’s own head (or the sound of his vocalization), or identifying the whereabouts of a friend by pantomiming him swimming. With the appropriate common ground, such gestures communicate very effectively about all kinds of nonpresent situations.

No nonhuman primates use iconic gestures or vocalizations. Great apes could easily gesture with their hands the way that humans do to mime eating or drinking, but they do not.⁸ Indeed, great apes do not even understand iconic signs. In a modified object choice experiment, a human held up a replica of the object under which food was hidden. Two-year-old children knew that this meant to search in the similar-appearing object, while chimpanzees and orangutans did not (Tomasello et al., 1997; Herrmann et al., 2006). In ongoing research we have been trying to elicit iconic gesturing from apes in situations in which it would benefit them to do so (e.g., showing a human how to extract food for them from an apparatus that only they know how to operate), but so far with no success. Presumably, great apes do not understand iconic gestures because they do not understand communication marked ostensively as “for you” (cooperatively). If an ape views someone hammering a nut, they know perfectly well what he is doing, but if they view him making a hammering motion in the absence of any stone or any nuts, they are simply perplexed. To comprehend iconic gestures, one must be able to see intentional actions performed outside of their normal instrumental contexts as communication—because they are marked as such by the communicator via various kinds of ostensive signals (e.g., eye contact). Extending an analogy from Leslie (1987) on pretense, the bizarre action must be “quarantined” from straightforward interpretation as an instrumental action by marking it as “for communication only.”

Another prerequisite for an individual to produce an iconic gesture is that he is able to produce with his body an action that “resembles” a real action (or object). Presumably the ability to do this derives from the ability to imitate, at which humans are especially skillful compared with other apes (Tennie et al., 2009). Somehow early humans came to understand that “imitating” an action not for real but with an ostensive communicative intention (simulating it in action) could lead a recipient to imagine all kinds of referential situations not in the current perceptual scene. One potentially important social context in this connection is teaching, which has the evolutionary advantage that the primal scene is one of an adult instructing its offspring. Csibra and Gergely (2009) explicate what they call “natural pedagogy” and note its close connection to cooperative communication. The most basic form of natural pedagogy is demonstrating: showing someone how to do something by either doing it directly or pantomiming it in some way. And like communication, the action is being done not for its own sake but for the benefit of the observer/learner. Communicating with iconic gestures thus requires both an understanding of ostensive communication and some ability to imitate actions.

Importantly, iconic gestures may depict the referential object or action quite faithfully, but then it can still be, as in pointing, a large inferential leap to the underlying communicative intention. Thus, to bridge the gap, just as in the case of pointing, common ground and mutual assumptions of cooperation and relevance are needed. If I mime for you a snake’s motion as we approach a cave, and you do not know that snakes are often found in caves, you might wonder why I am waving my hand in that way. In the contemporary world, we recently observed a young child going through airport security. The guard scanning her with his wand moved his hand in a circular motion to tell her to turn around so he could scan her backside. Staring at him, she slowly began moving her hand in a circular motion back at him—she did not understand that his hand was meant to represent her body. Apparently, airport security procedures were not part of their common ground.

Whereas there is only one basic pointing gesture,⁹ there are myriad possible iconic gestures—a “discrete infinity,” perhaps. With iconic gestures there is, or at least can be, a more or less one-to-one correspondence of gestures and the intended referent (though typically only one aspect of an intended referential situation is mimed). This means that iconic gestures, even though not conventional, have a kind of semantic content. With pointing, I can in principle indicate for you the shape, size, or material of a piece of paper, within the appropriate common ground, but the unique perspective in each case is not in any way contained “in” the protruding finger itself (see Wittgenstein’s [1955] incisive, if cryptic, discussion of this issue). But with iconic gestures, I would indicate for you the shape, size, or material of a piece of paper—or whether I want you to write on the paper or throw it away—by depicting each of these different aspects or actions with different icons. The momentous new feature of iconic gestures is thus that the different perspectives of things and situations only implicit in pointing are now expressed overtly in external symbolic vehicles with semantic content.

Relatedly, the vast majority of communicative conventions in a natural language are category terms. That is, common nouns and most verbs are conventionalized for reference to categories of entities such as dog and bite, which means that to make reference to a specific dog or instance of biting, we must do some kind of pragmatic grounding (such as with the or my dog, or the dog who lives next door in the case of nouns; or tense and aspect markers, as in is biting or bit, in the case of verbs). Iconic gestures are already category terms, because they implore the recipient to imagine something “like this.” (It is possible that one could iconically gesture an individual as well—for example, by mimicking her idiosyncratic mannerisms—and so the distinction between common and proper nouns is at least in principle possible in this modality.) The categorical dimension is bound up with perspective in the sense that calling someone either Bill or Mr. Smith is not perspectival because these are not category terms, but calling him a father or a man or a policeman is perspectival because it puts him “under a description,” that is, it “perspectivizes” him differently on different occasions for different communicative purposes.

Iconic gestures are thus an important step on the road to linguistic conventions in that they are symbolic, with semantic content, and are at least potentially categorical. An interesting fact that reinforces this point is that although young children produce some iconic gestures from early in development, these gestures actually go down in frequency over the second year of life as children begin learning language, whereas pointing increases in frequency during the same period. One hypothesis is that pointing increases because it does not compete with language but complements it by performing a different function. As symbolic vehicles with semantic content, iconic gestures compete with linguistic conventions, and they lose the competition—for many obvious reasons—which obviates the need to create spontaneous gestures on the spot, except in a few exceptional circumstances. If one imagines an evolutionary analog, the story would be one of conventional forms of communication mainly taking over from iconic gestures, whereas pointing would persist. In both evolution and ontogeny, then, the ability to act out nonactual situations could potentially emerge again in other functions, for example, pretense and other forms of fiction (see box 2).

BOX 2. Pantomime as Imagining in Space

Communicating with iconic gestures and pantomime might plausibly have had two large and important cognitive consequences. The first stems from its intimate involvement with—and therefore stimulation of—imagination and pretense. Iconic gestures enable reference to things ever farther removed in space and time than does pointing, and at the moment of communication these must be imagined. When I gesture to inform you that there is an antelope out of sight over the hill, or to warn you that there are snakes in this cave we are about to enter, or to relate what happened to us on our just-finished hunting trip, I have to act out whole scenarios in which some of the key players are not even present and the actions are either long past or only predicted.

The proposal, then, is that iconic gesturing both depends on preexisting skills of imagination and also takes these skills to new places. Whereas a chimpanzee might imagine what awaits her at the waterhole, we are now talking about depicting for another person in some kind of playacting such imagined scenes—tailored for the recipient’s knowledge and interests, given our common ground, so that she will be capable and motivated to comprehend. It is not unreasonable to suppose, then, that humans evolved ever more powerful forms of imagination to be able to act out scenes for others—in a kind of joint imagining. Indeed, we see this behavior in very young children on a daily basis as they pretend with a parent or peer that this stick is a horse or that they are Superman. The evolutionary origins of pretend play—which would seem to be a bit mysterious because its function is not so obvious—are thus, in the current account, to be found in pantomiming as a serious communicative activity. In modern humans, pantomiming for communication has been supplanted by conventional language. As children learn a conventional language, their tendency to communicate to others by creating pretend scenarios in gesture has no place to go, so to speak. They thus play with this ability and create pretend scenarios together with others, as a pretense activity with no other motivation. A number of scholars have argued as well that engaging in pretense is one source for the distinction between appearance and reality (e.g., Perner, 1991), as I act out X fictively in order to represent the real X, as well as for counterfactual thinking in general (Harris, 1991).

And so the first surprising effect of iconic gestures is that their emergence in human evolution led to skills of acting out pretend scenarios with and for others, which may be the basis for humans’ creation of all of the “imaginary” situations and institutions within which they reside. In addition, to anticipate our story a bit, it is also reasonable to suppose that the creation of what Searle (1995) calls cultural “status functions” such as being a president or a husband—and pieces of paper standing for (indeed, constituting) money—has its phylogenetic and ontogenetic roots in pretend play in which children together anoint a stick as a horse, which gives the stick special powers, in a manner very similar to anointing a person as a president (Rakoczy and Tomasello, 2007). If thinking is at base a form of imagining, then one can hardly overestimate the importance of imagining things for other people, as embodied in iconic gestures, for the evolution and development of uniquely human thinking (Donald, 1991).

The second cognitive effect of iconic gestures and pantomiming is even more speculative. Almost everyone who studies human cognition recognizes the crucially important role of spatial conceptualizations. There are undoubtedly multiple reasons for this, some of them having simply to do with the importance of space in primate cognition in general. It is well known, for example, that episodic memory has intimate connections to spatial cognition.

But more recently, some theorists have dug more deeply into this connection. Beginning with the pioneering work of Lakoff and Johnson (1979), it is well known that humans quite often talk about abstract situations or entities metaphorically or analogically in terms of concrete spatial relationships. As just a few examples, we talk about putting things into and taking them out of our lectures, we fall into love, we are on our way to success, or we are going nowhere in our career, or I am out of my mind, or she is coming to her senses, and on and on. We are not talking about just surface metaphors here, but very basic ways of conceptualizing complicated and abstract situations. Thus, in his follow-up work, Johnson (1987) identified a number of so-called image schemas that seem to permeate our thinking, such as containment (in and out of a lecture), part-whole (the foundation of our relationship), link (we are connected), obstacle (my lack of education gets in the way of my social life), and path (we are on our way to marriage).

Even in the grammars of languages, a number of scholars have noted the inordinate prominence of space, with some even creating such things as “space grammars.” Some of the early work on syntactic case relationships also emphasized that many case markers emerged first historically from words for spatial relationships (adpositions of various kinds). Talmy (2003) has posited a human imaging system that structures grammar via a very strong spatial component. Thus, one of his central schemas is the force dynamic schema in which actors cause effects in other entities (e.g., investors’ anxieties crashed the stock market), and another is various kinds of fictive motion along paths. He has also noted that many complex relationships are expressed spatially, with topological relationships predominating. Even more strongly, conventional signed languages use space to depict all kinds of grammatical relationships, from anaphoric reference to case role (e.g., Liddell, 2003)—which is important if humans’ earliest linguistic conventions were, as hypothesized here, in the gestural modality.

In terms of ontogeny, Mandler (2012) has argued and presented evidence that children’s earliest language is made possible by a set of mainly spatial image schemas such as animate motion, caused motion, the path of motion, obstacles to motion, containment, and so forth. These form the conceptual foundation for children’s early talk about agents doing things (Slobin’s [1985] manipulative activity scene) and objects going places (Slobin’s [1985] figure-ground scene, in which objects move along paths). These are the things children first talk about, and fundamental spatial relationships in motion along paths play a prominent role at all stages.

The speculation is thus that in addition to numerous other reasons for space being important in human cognition, a critically important reason is that at an early stage in their evolution humans conceptualized many things for others in their gestural communication in a fictive space with fictive actors and actions. Basically, the only way to depict many things in spontaneous, nonconventionalized gestures is by acting out in space the referent objects and events. And so if we believe that human thinking is intimately tied to communication—how we have come to conceptualize things for others—then the fact that we did this for some time in our history by pantomiming in space may go a long way toward explaining the inordinately important role of space in human cognition.

Iconic gestures thus represent a kind of middle stage in human communication and thinking that bridges from pointing for others informatively and perspectivally in the context of common conceptual ground to conventional linguistic communication. This bridging step involves external forms of symbolic representation that have semantic content and are at least potentially categorical. Nevertheless, iconic gestures almost always have a potentially problematic ambiguity of perspective. If I mime the throwing of a spear, who is supposed to be throwing it—me, you, or someone else? Of course, this typically is determined through our common ground context; it is normally clear if I am requesting you to do it, expressing my desire to do it, or reporting on our friend’s activity. But in some situations—for example, depicting the morning’s hunting events—it might not be clear who is throwing the spear. The only way to resolve this ambiguity is with further communicative acts, either deictic or iconic. And this leads us to the most complex way in which early humans communicated with natural gestures before conventional languages: combining their gestures in multiunit expressions.

Combining Gestures

Great apes do not create new communicative functions by combining their gestures, their vocalizations, or their gestures and vocalizations together (Liebal et al., 2004; Tomasello, 2008). But humans do, including young children from the earliest stages of their communicative development, and including even children exposed to no conventional language, vocal or signed, at all (Goldin-Meadow, 2003).

While there is no principled reason why someone could not string together various pointing gestures—and individuals may do this on occasion—this is not commonly observed. Beginning language learners combine their earliest linguistic conventions with pointing or other conventions, and beginning sign language learners produce iconic or conventional signs in combination with pointing (as do, again, children exposed to no conventional language at all; Goldin-Meadow, 2003). As an originating context in evolution, one can easily imagine situations in which an individual pantomimed something—such as eating—and then immediately afterward, in a second thought, pointed to some particular piece of food—for instance, the fruit over there (this process is thus analogous to the “successive single word utterances” in early child language, or the “broken” utterances in pidgin languages). But then, through a process of “mental combination” (Piaget, 1952), these successive thoughts or intentions came to be integrated into a single thought or intention and so expressed as a single utterance within a single intonation contour. With some minimal skills of categorization, individuals could form a schema comprising, for example, an iconic gesture for eating followed by indexical indication of anything edible either by oneself or by others. Productivity in thinking would thus be scaffolded and enhanced by this overt communicative schema.

It is important to emphasize that just as in child language, in early human communication there would have been functional continuity between different expressions of the same referential intention, no matter their internal complexity. For example, one could communicate that there are snakes in the cave either with a snake motion as we approach the cave or by combining a snake motion with a pointing gesture to the cave (e.g., if we are not approaching it)—both with the same communicative intention or function. Combining symbolic and deictic vehicles is not the creation of new communicative intentions, primarily, but rather the parsing of existing ones into their component parts. This means that in combinations a single gesture is typically indicating only one aspect of a situation. Thus, whereas the snake motion while approaching the cave is intended to communicate that there are snakes in the cave, in combination with pointing to the cave (or iconically depicting the cave) this snake motion now only indicates the snakes themselves, as the rest of the situation is indicated through other communicative devices. This focus on function and the parsing of situations into components with different subfunctions are responsible for the hierarchical organization of human communication.

With gesture combinations we now also have the possibility of beginning down the path to the subject-predicate organization characteristic of full propositions.¹⁰ Two ingredients are involved, both of which are already present, in nascent form, in pointing inside collaborative activities. The first ingredient is the specific cognitive distinction between events and participants. Even apes learning human-like forms of communication distinguish events and participants in their sign combinations (see Tomasello, 2008, for evidence). The second ingredient is the distinction between shared (or given) and new information. As noted above, even in pointing there is an implicit distinction between the shared common ground, which typically is not indicated specifically by the pointing gesture, and the new and noteworthy situation, which is indicated deictically. But this is all implicit. With gesture combinations, what often happens is that one or more signs are used to make contact with the common ground—typically to use it as a perspective or “topic”—and then to indicate with another sign the new and interesting information. In many situations, one can imagine that one points to a perceptually present referent—to make sure that it is shared—and then iconically signs about some aspect of it that one thinks is new and noteworthy for the recipient.

The overall picture is thus that early humans used their pointing and iconic gestures, both singly and in combination, to communicate much more richly and powerfully than did their primate cousins. This new form of communication took place initially inside of collaborative activities, which supplied the interactants with both the necessary common conceptual ground and the necessary opportunities for interchanging roles and perspectives with their partner. Early humans’ cooperative communication with natural gestures thus required both levels in our dual-level conception of joint collaborative activities: joint goals and attention, as the shared aspect, and individual roles and perspectives, as the individual aspect. And none of this required language. Communicators conceptualizing or perspectivizing things in different ways for different communicative partners (depending on judgments of common ground, relevance, and newness), and then recipients comprehending the intended perspectives through socially recursive inferences, is not the result of becoming a language user, but rather its prerequisite.

Second-Personal Thinking

We are trying to get to the full flowering of modern human objective-reflective-normative thinking in the context of culture and language. We are halfway there. With the reconstructed early humans we are picturing here, we have creatures who are not just strategizing how to obtain food or mates in bigger, better, and faster ways than others, as are great apes, but rather who are attempting to coordinate their actions and intentional states with others via evolutionarily new forms of collaborative activity and cooperative communication. They are not just organizing their actions via individual intentionality; they are also organizing them via joint intentionality. And this changed the way they imagined the world so as to manipulate it in acts of thinking.

Perspectival, Symbolic Representations

Great apes schematize cognitive models for the various types of situations that are recurrent and important in their lives. And so when early humans began engaging in obligate collaborative foraging, they schematized a cognitive model of the dual-level collaborative structure comprising a joint goal with individual roles and joint attention with individual perspectives. With cooperative communication, early human individuals began overtly indicating or symbolizing for a partner situations that were relevant to her, given her individual role and perspective in the joint activity. To do this, they created evolutionarily new forms of natural gestures—pointing and pantomiming—whose use resulted in cognitive representations that had three new and transformative characteristics.

PERSPECTIVAL. Conceptualizing things from different perspectives comes so naturally to humans that we consider it as almost inevitable; it is just the way cognition works. Typically in cognitive science, concepts are characterized in English words, such as car, vehicle, and anniversary present, that can be applied as needed, even to the same entity sitting in the driveway. But this way of doing things is not inevitable; indeed, it is not even possible for creatures that cannot “triangulate” with another individual simultaneously on the same entity. Great apes may sometimes apply different schematic representations to one and the same entity: on one occasion a particular tree is an escape route, whereas on another occasion it is a sleeping place. But each of these different conceptualizations is tied to the individual’s current goal state; she may know many things about the tree, but she does not entertain them as alternative possible construals simultaneously, and so they are not interrelated perspectives in the way we have defined them here (and this is true even if the ape is solving a problem by imagining nonactual entities or situations, because even here she is doing this aimed only at her current problem situation).

In contrast, when early humans began to communicate cooperatively with others, they were constantly taking the other’s perspective on a situation or entity to which they themselves were already attending (they were triangulating with the other). Indeed, each time they communicated, they had to make their communicative act relevant and new for the recipient in the context of her goals and values, their common ground, and her existing knowledge and expectations. As they were thus thinking how their communicative act might fit into the life of the recipient, communicators had to consider several alternative perspectives simultaneously, and only then choose a communicative act to instantiate one of them. For example, in order to warn of danger, they might pantomime for a recipient as they approached a cave either a snake, or a snake bite on the leg, or just a general danger sign (which the recipient would know, in the context of their common ground about caves, meant snakes).

The key point from the perspective of cognitive representation is that communicators were not tied to their own goals and perspectives, but rather they were considering alternative perspectives for another person, whose conative and epistemic states they could only imagine. For her part, the recipient, in order to make the abductive leap necessary to grasp the communicator’s communicative intentions, had to then simulate his perspective on her perspective (at least). This transacting in perspectives meant that early human individuals did not just experience the world directly for themselves, in the manner of all apes but, in addition, at least in some aspects, experienced the exact same world viewed simultaneously from different social perspectives. This triangulating process inserted for the first time a small but powerful wedge between what we might now call the subjective and the objective.

SYMBOLIC. Iconic gestures, or pantomime, also seem mundane to humans, who have the ability to imitate one another’s actions, and to even imitate or simulate their own past actions outside of their normal instrumental context. But they are anything but mundane, as they represent the first acts by any primate (arguably any animal species) that attempt to re-present for a recipient, in overt action, some event or entity, so that she will imagine it. Iconic gestures also require that the recipient comprehend communicative intentions (in our story, already in the comprehension of pointing) so that she can “quarantine” these gestures as not actual instrumental acts but rather acts of communication.

Producing communicative acts that resemble their intended referents (e.g., miming a monkey climbing) creates a symbolic relationship in which the act is meant to evoke in imagination the intended referent (e.g., a monkey or an act of climbing or a monkey climbing), which is hoped to lead the recipient to infer the communicator’s communicative intention (e.g., that they go hunting for monkeys now). Like pointing, iconic gestures perspectivize a situation, but unlike pointing, they do this articulately in the symbolic vehicle itself. For example, with iconic gestures one would have different icons for “monkey” and “food” even if, on different occasions, they were used for the exact same animal, whereas in pointing, the act (protruding finger) would be the same in both cases, with the common ground of the collaborative activity (whether we are admiring the local fauna or seeking sustenance) carrying the semantic weight. Another important feature of iconic gestures is that they are mostly categorical in nature, that is, used to conceptualize or perspectivize things, events, or situations “like this.” In choosing what to act out for others in pantomime, then, communicators construe the situation from a particular perspective categorically, as opposed to other possible categorical perspectives.

QUASI-PROPOSITIONAL. Combining gestures into a single communicative act parses the referent situation into something like event-participant structure, which limits the semantic scope of each gesture. Thus, pantomiming a monkey in combination with pointing to a spear now suggests even more articulately the desired hunting trip, but the gesture for monkey is now confined to symbolizing the monkey only, not the hunting trip as a whole. In combination with the already established tendency to background knowledge in common ground (topic) and to explicitly communicate about new information (focus), this parsing creates a nascent subject-predicate organization in the communicative act—resulting in something on the way to full propositions. (Interestingly, great apes raised by humans with a human-like communication system typically make the event-participant distinction, but not the topic-focus distinction [because they do not have any notion of a shared focus of attention or topic], and so they do not have subject-predicate organization in their multiunit communicative acts [see Tomasello, 2008].) The addition of a new cooperative communicative motive made for two distinctly marked motives (requestive and informative), which created a first nascent distinction between communicative force and content.

With the advent of early human collaboration and cooperative communication, then, the cognitive representation of experience in type-token format, as in great apes, was “cooperativized.” Individuals interacting in joint attention and common conceptual ground could conceptualize the same event, entity, or situation simultaneously from multiple perspectives. The symbolization of these perspectives in categorical iconic gestures and in gesture combinations with topic-focus organization, with some indication of a force-content distinction, made them at least incipiently propositional as well. This process might be seen to effectively decontextualize (cooperativize or make less egocentric) the individual’s experience of the world, as he decides under which symbolic description to represent a situation for a communicative partner. With this perspectival crack in the experiential egg, we are on our way to thinking that is in some sense “objective.”

Socially Recursive Inferences

Socially recursive inferences, once again, seem so natural for humans as to be barely noticeable: I wonder what she thinks I’m thinking. Great apes make inferences about experience—they simulate the causes and outcomes of physical and social situations—but they do not make inferences about what the other is thinking about their thinking. Such inferences begin with early humans attempting to coordinate their actions and attention with others in collaborative activities with joint goals and attention, but they come into full flower with early humans attempting to coordinate their intentional states and perspectives with others in cooperative communication.

In the context of a joint collaborative activity, early human communicators began thinking about (i.e., simulating) how best to formulate their communicative act for a recipient, with the goal of both honesty (engendered by a concern with being cooperative in general) and communicative efficacy. The concern with honesty—especially given that recipients were now becoming “epistemically vigilant” (Sperber et al., 2010)—puts us on the road to a commitment to the truth of our communicative acts. The concern with communicative efficacy required that both communicator and recipient anticipate the perspective of their partner, which required socially recursive inferences that embedded the intentional states of one partner within those of the other. In addition, the production of overt combinations of gestures for others, once they were schematized, created unprecedented new possibilities for productive inferences about nonactual or even counterfactual states of affairs. Early human inferences thus display two new and transformative properties.

SOCIALLY RECURSIVE. We may reasonably ask why early human communicators began making socially recursive inferences in the first place. The short answer is that they assumed together in common ground that the communicator had cooperative motives, and so they were collaborating toward the joint goal of recipient comprehension. In this context, they were each trying to help the other—as in all joint collaborative activities—and this meant simulating what the other was thinking about their thinking. And since pointing and pantomiming on their own are quite weak communicative vehicles, inferential leaps of at least some distance are always required to reconstruct the communicator’s communicative intention—so that at least some help is almost always needed.

And so developed a form of communication in which a communicator intended that a recipient know something, for her benefit. The recipient understood this and so, for example, understood that he intends for me to know that the banana is in that bucket. The communicator, for his part, knew that the recipient would make such an inference if he helped her to do so by alerting her to the fact that he had such an intention (the Gricean communicative intention that the recipient notice that the communicator wants her to know something). This may not be one multiply embedded communicative intention, as in the Gricean analysis, but rather, as argued by Moore (in press), two singly embedded intentions: I intend that you notice that this communicative act is for you, plus I intend that you know that the banana is in that bucket. Nevertheless, the single embedding in this second intention is already more than great apes can do, and so it represents a new form of recursive inference (the production version occurring when the communicator simulated the recipient’s intentional states in order to formulate communicative acts that would be readily comprehensible for her: not throwing the ball at her, but rather to her).

COMBINATORIAL. Communicating with others using overt gestures, and especially being able to combine gestures to communicate with others in more complex ways, enabled new processes of productive thinking. In their natural communication with one another, great apes do not combine gestures with one another (nor vocalizations) to communicate something new. Their thinking is thus confined to imagining novel situations using their past individual experiences reconfigured in new ways. But once early humans began imagining situations from the perspective of the other in order to communicate with combinations of gestures, and then schematized these combinations, they had the possibility to go beyond their own experience to think about something that others might experience, or even something impossible. For example, I might produce an iconic gesture for traveling followed by pointing to a location, which I generalize to any location. I might then imagine or communicate, via this schema, our child traveling to the sun—something I consider causally impossible. When humans began schematizing communicative constructions with abstract slots in this way, they created for themselves almost unlimited combinatorial freedom. Schema formation in communicative acts and the parsing of communicative intentions into discrete overt components represent a significant step in the direction of the kind of “inferential promiscuity” characteristic of modern human thinking in a conventional language.

Beyond the new possibilities for creating novel, even counterfactual thoughts via external communicative vehicles, a number of theorists have emphasized the necessary role of such external vehicles for individuals to reflect on their own thinking (e.g., Bermudez, 2003). When individuals formulate an overt communicative act and then perceive and comprehend it as they produce it, they are, in effect, reflecting on their own thinking (a process that may become internalized so that we may think about things that we could potentially communicate overtly). Because the gesture combinations at this point have only limited semantic content (e.g., no logical vocabulary and no propositional attitude vocabulary), early humans could reflect only in a highly limited way on their own thinking.

With the advent of early human collaboration and cooperative communication, then, the causal inferences of great apes were, like their cognitive representations, “cooperativized.” This meant that the communicator’s inferences were about what was the situation from the perspective of the recipient, and the recipient’s inferences were about the communicator’s simulations of her simulating his perspective. Overt combinations of symbols, especially if schematized, led to the possibility of thinking various new and even counterfactual thoughts, as well as to the first, rather modest, reflections on one’s own thinking. With all of these new inferential possibilities, then, we are now well on our way to thinking processes that are truly reflectively reasoned.

Second-Personal Self-Monitoring

Great apes self-monitor their goal-directed behavior, including its psychological underpinnings with respect to such things as memory and decision making. But great apes are not normative creatures. They experience “instrumental pressure,” for example, when they have a goal to eat food and they know that food is available at location X; this implies that they “must” go to location X. But this is just the way control systems with individual intentionality work: a mismatch between goal and perceived reality motivates action. In contrast, early humans began to self-monitor from the perspective of others and, indeed, self-regulated their behavioral decisions with others’ evaluations in mind. Now we may talk of something that is socially regulated, that is, socially normative, albeit only in second-personal (as opposed to agent-neutral) form. There were two manifestations.

COOPERATIVE SELF-MONITORING. First, because the collaborative activities of early humans were interdependent and operated with partner choice, each individual, even the most dominant, had to respect the power of other individuals, even the most subordinate, to exclude them from collaborative opportunities. Early humans thus developed not only an ability to make evaluative judgments about others’ cooperative proclivities but also an ability to simulate, and so to anticipate, the evaluative judgments that others were making about them. Young human children are concerned with the social evaluations of others from the preschool years on as they attempt to actively manage the impression they are making on them (Haun and Tomasello, 2011), but chimpanzees seem to be not so concerned (Engelmann et al., 2012).

Early humans’ concerns for how their collaborative partners viewed them—and their active attempts to manage this impression—provided a new motive for actions, namely, to coordinate with the evaluative expectations of potential partners. Individuals thus began to cede power over themselves to the second-personal evaluations of others because these evaluations determined their future collaborative opportunities. From the point of view of normativity, this meant that in making their behavioral decisions, humans not only experienced individual instrumental pressure but also experienced second-personal social pressure from their partners in social engagements. This constitutes, in the current account, one origin of what will later become social norms of morality.

COMMUNICATIVE SELF-MONITORING. Second, because early human communicators wanted to facilitate the recipient’s comprehension, they had to actively self-monitor their potential communicative acts in anticipation of how they might be comprehended and/or interpreted by the recipient. They thus engaged in a self-monitoring of the communicative process, from the perspective of the recipient, especially for intelligibility.

Mead (1934) pointed out the key role of overtness here. As they communicated with others in overt acts—either deictic or symbolic—early humans saw or heard themselves performing those acts, in which case they then comprehended them (as perspectivized for another) as the recipient. Communicators thus adjusted their communicative act so as to maximize the comprehension of the recipient, as part of their commitment to the collaborative act of cooperative communication. Making such adjustments required self-monitoring and evaluating communicative acts for comprehensibility from the perspective of specific communicative partners, each with her own individual knowledge and motives and common ground with the communicator. This constitutes, in the current account, one origin of what will later become social norms of rationality.

The early humans we are picturing here would thus have been able to engage in two kinds of cooperative and communicative self-monitoring that great apes cannot—because great apes do not engage in the kind of joint collaborative activities and cooperative communication that engenders such social self-monitoring. Early humans simulated the evaluative judgments that others made of them with regard to their cooperative proclivities—precursors to norms of morality—and also with regard to the intelligibility of their communicative acts—precursors to norms of rationality. Importantly, the evaluations we are talking about here come from particular individuals, and so we are still a way from the kind of agent-neutral, “objective” norms by which modern humans evaluate others and themselves. But we have begun the process of socially normativizing individual thinking.

Perspectivity: The View from Here and There

It is now widely accepted that what most clearly distinguishes nonhuman primates from other mammalian species cognitively is their complex skills of social cognition. Dunbar (1998), for example, documented that primate brain size correlates most strongly not with primates’ physical ecology but, rather, with their social group size (as a proxy for social complexity). But primates’ special skills of social cognition are aimed mainly at competition (i.e., their intelligence is Machiavellian), in which they keep track of all of the various dominance relationships in the group, as well as all of the various affiliative relationships in the group as these might affect the competition for food and mates.

The question thus arises whether the uniquely human skills of cognition and thinking that we have identified here might instead have arisen for competition. On the level of evolutionary function (ultimate causation), this is true almost by definition, since evolutionary success is defined as having more offspring than others. But on the level of proximate mechanism, we do not think it likely that perspectival cognitive representations, socially recursive inferences, and social self-monitoring could have arisen directly out of competitive contexts. It is true that, in theory, one could get a kind of arms race in mind reading to deal with competitive situations. In competition, individuals could come to realize that both I and my competitor are focused on the same resource at the same time (joint attention?) and then try to outcompete her for that resource by thinking about what she is thinking about my thinking. But what we for sure cannot get strictly from competition are the unique forms of cooperative communication in which humans engage. Unlike other primates, humans use their communicative acts to actually encourage others to discern their thinking. Thus, human communicators take the perspective of others in order to determine their goals and interests so that they can then inform them of something helpful to them. Those others want this helpful information, and so they do their best to help the communicator discern their goals and interests, and also to discern their knowledge and expectations so that he may formulate his communicative act in a comprehensible manner. Humans, but not other primates, thus collaborate in their communication to make it easier for the other to take their perspective and even to manipulate it if they so desire.

An especially enlightening example of a similar cooperative process concerns a unique physical characteristic of humans. Of the more than two hundred species of primates, only humans have highly visible eye direction (because of especially visible white sclera; Kobayashi and Kohshima, 2001). And only humans use this information. Thus, when tested in various conditions contrasting head and eye direction, twelve-month-old human infants tended to follow the eye direction over the head direction of others, whereas great apes tended to follow the head direction of others only (Tomasello et al., 2007b). For humans to have evolved conspicuous cues of gaze direction, there must have been some advantage to the individual to “advertise” her eye direction for others. This suggests predominantly cooperative situations in which the individual may rely on others using this information collaboratively or helpfully, not competitively or exploitively. The point is that human communicative acts serve to advertise the internal states of individuals in this same way, and so it also suggests cooperation of this same type (e.g., cooperative requests such as “I’d like some fruit” are “advertisements” of my internal state of desire, and informative utterances such as “There is some fruit over there” are public offers of helpful information). Communication of this type could never be adaptively stable in contexts that were not fundamentally cooperative, and so fully human-like skills of joint intentionality could never evolve solely in the context of competition.

There can be no doubt that the last common ancestor to humans and other primates engaged in individual thinking in pursuit of individual goals, mostly in order to compete with groupmates for valued resources. Along the way, they attended to situations relevant to those goals. Early human individuals—in response to a changing feeding ecology—then began to join together with other individuals dyadically in pursuit of joint goals, and they jointly attended to situations relevant to that joint goal. Each participant in the collaboration had her own individual role and her own individual perspective on the situation as part of the interactive unit. This dual-level structure—simultaneous jointness and individuality—is the defining structure of what we are calling joint intentionality, and it is foundational for all subsequent manifestations of human shared intentionality.

The problem was how to coordinate these collaborative activities as they became ever more complex, both to negotiate a joint goal and to coordinate the two different roles. The solution was cooperative communication. Early humans directed the attention of their collaborative partner to relevant situations by pointing, which required taking her perspective and simulating her thinking (i.e., in terms of the abductive leap she might be expected to make given different possible communicative acts). To comprehend, the recipient had to take the perspective of the communicator taking her perspective, which constituted a new form of socially recursive inferring. Early humans’ concern that their partner comprehend them led to social self-monitoring via the anticipated evaluations of the partner with respect to the comprehensibility of the communicative act.

The basic cognitive challenge in all of this was to coordinate one’s own perspective with the perspective of one’s collaborative partner. And so, as early humans engaged in the truck and barter of making a living collaboratively, they began to truck and barter in perspectives with their interactive partners communicatively—and in their own perspectives reflectively to some degree—and this gave human cognitive representation and inference a new kind of flexibility and power. Now, instead of just their own view on the world, early humans could also view the world at the same time from the perspective of the other, which might also include her perspective on my perspective. Early humans had not just a great ape view from here, but rather a view simultaneously from here and there.

We do not know precisely who these early humans were, but we may speculate that they were Homo heidelbergensis some 400,000 years ago, living as loosely structured bands or pools of recurrently collaborating partners. Of course Homo heidelbergensis did not engage in modern human forms of fully objective-reflective-normative thinking. Their thinking was not “objective” but rather was still tied to the two second-personal perspectives of “I” and “you.” Their thinking was only weakly reflective because they could express very few of their intentional states or cognitive operations externally in communicative vehicles (and so they could act as both producers and comprehenders of only some limited semantic content). And their thinking was socially normative only in the sense that they were concerned with how their partner evaluated their cooperative behavior and comprehended their communicative acts, not with the group’s normative standards. There is thus no question that we are still some way from modern human collective intentionality and its objective-reflective-normative thinking. But, we would argue, the “in-between” step of early human joint intentionality and its perspectival-recursive-socially monitored thinking was necessary for getting there. It was necessary because the transition to modern humans was all about creating cultural conventions, and if these were to be in a cooperative direction, as they almost invariably were, then some very strong cooperative tendencies had to be already present in the individuals doing the conventionalizing.

Together, then, early human collaborative activities and cooperative communication represent a kind of second-personal “cooperativization” of great ape lifeways and thinking. But these evolutionarily new forms of second-person social interaction involved joint engagement with specific other persons on specific occasions only, and they did not retain their special characteristics very far outside of the collaborative activities themselves. And so, despite the great leap forward represented by this new joint intentional way of living, communicating, and thinking, the next leap forward will have to take this “cooperativized” cognition and thinking and “collectivize” it by conventionalizing and institutionalizing—and so normativizing and objectifying—almost everything.