Keynes was a great economist. In every discipline, progress comes from people who make hypotheses, most of which turn out to be wrong, but all of which ultimately point to the right answer. Now Keynes, in The General Theory of Employment, Interest and Money, set forth a hypothesis which was a beautiful one, and it really altered the shape of economics. But it turned out that it was a wrong hypothesis.
—Milton Friedman, Opinion Journal, July 22, 2006
The approach to the study of literature that I am calling “macroanalysis” is in some general ways akin to economics or, more specifically, to macroeconomics. Before the 1930s, before Keynes's General Theory of Employment, Interest, and Money in 1936, there was no defined field of “macroeconomics.” There was, however, neoclassical economics, or “microeconomics,” which studies the economic behavior of individual consumers and individual businesses. As such, microeconomics can be seen as analogous to our study of individual texts via “close readings.” Macroeconomics, however, is about the study of the entire economy. It tends toward enumeration and quantification and is in this sense similar to bibliographic studies, biographical studies, literary history, philology, and the enumerative, quantitative analysis of text that is the foundation of computing in the humanities. Thinking about macroanalysis in this context, one can see the obvious crossover with WordHoard. Although there is sustained interest in the micro level, individual occurrences of some feature or word, these individual occurrences (of love, for example) are either temporarily or permanently de-emphasized in favor of a focus on the larger system: the overall frequencies of love as a noun versus love as a verb. Indeed, the very object of analysis shifts from looking at the individual occurrences of a feature in context to looking at the trends and patterns of that feature aggregated over an entire corpus. It is here that one makes the move from a study of words in the context of sentences or paragraphs to a study of aggregated word “data” or derivative “information” about word behavior at the scale of an entire corpus.
By way of an analogy, we might think about interpretive close readings as corresponding to microeconomics, whereas quantitative distant reading corresponds to macroeconomics. Consider, then, the study of literary genres or literary periods: are they macroanalytic? Say, for example, a scholar specializes in early-twentieth-century poetry. Presumably, this scholar could be called upon to provide sound generalizations, or “macroreadings,” of twentieth-century poetry based on a broad familiarity with the individual works of that period. This would be a type of “macro” or “distant” reading.* But this kind of macroreading falls short of being for literature what macroeconomics is for economics, and it is in this context that I prefer the term analysis over reading. The former term, especially when prefixed with macro, places the emphasis on the systematic examination of data, on the quantifiable methodology. It de-emphasizes the more interpretive act of “reading.” This is no longer reading that we are talking about—even if programmers have come to use the term read as a way of naming functions that load a text file into computer memory. Broad attempts to generalize about a period or about a genre by reading and synthesizing a series of texts are just another sort of microanalysis. This is simply close reading, a selective sampling of multiple “cases”; individual texts are digested, and then generalizations are drawn. It remains a largely qualitative approach.† Macroeconomics is a numbers-driven discipline grounded in quantitative analysis, not qualitative assessments. Macroeconomics employs quantitative benchmarks for assessing, scrutinizing, and even forecasting the macroeconomy.
Although there is an inherent need for understanding the economy at the micro level, in order to contextualize the macro results, macroeconomics does not directly involve itself in the specific cases, choosing instead to see the cases in the aggregate, looking to those elements of the specific cases that can be generalized, aggregated, and quantified.
Just as microeconomics offers important perspectives on the economy, so too does close reading offer fundamentally important insights about literature; I am not suggesting a wholesale shelving of close reading and highly interpretive “readings” of literature. Quite the opposite, I am suggesting a blended approach. In fact, even modern economics is a synthesis—a “neoclassical synthesis,” to be exact—of neoclassical economics and Keynesian macroeconomics. It is exactly this sort of unification, of the macro and micro scales, that promises a new, enhanced, and better understanding of the literary record. The two scales of analysis work in tandem and inform each other. Human interpretation of the “data,” whether it be mined at the macro or micro scale, remains essential. Although the methods of inquiry, of evidence gathering, are different, they are not antithetical, and they share the same ultimate goal of informing our understanding of the literary record, be it writ large or small. The most fundamental and important difference in the two approaches is that the macroanalytic approach reveals details about texts that are, practically speaking, unavailable to close readers of the texts.
John Burrows was an early innovator in this realm. Burrows's 1987 book-length computational study of Jane Austen's novels provided unprecedented insight into Austen's style by examining the kinds of highly frequent words that most close readers would simply pass over. Writing of Burrows's study of Austen's oeuvre, Julia Flanders points out how Burrows's work brings the most common words, such as the and of, into our field of view. Flanders writes, “[Burrows's] effort, in other words, is to prove the stylistic and semantic significance of these words, to restore them to our field of view. Their absence from our field of view, their non-existence as facts for us, is precisely because they are so much there, so ubiquitous that they seem to make no difference” (2005, 56–57). More recent is James Pennebaker's book The Secret Life of Pronouns, wherein he specifically challenges human instinct and close reading as reliable tools for gathering evidence: “Function words are almost impossible to hear and your stereotypes about how they work may well be wrong” (2011, 28). Reviewing Pennebaker's book for the New York Times, Ben Zimmer notes that “mere mortals, as opposed to infallible computers, are woefully bad at keeping track of the ebb and flow of words, especially the tiny, stealthy ones” (2011, n.p.). At its most basic, the macroanalytic approach is simply another method of gathering details, bits of information that may have escaped our attention because of their sheer multitude. At a more sophisticated level, it is about accessing details that are otherwise unavailable, forgotten, ignored, or impossible to extract. The information provided at this scale is different from that derived via close reading, but it is not of lesser or greater value to scholars for being such. Flanders goes on: “Burrows’ approach, although it wears its statistics prominently, foreshadows a subtle shift in the way the computer's role vis-à-vis the detail is imagined.
It foregrounds the computer not as a factual substantiator whose observations are different in kind from our own—because more trustworthy and objective—but as a device that extends the range of our perceptions to phenomena too minutely disseminated for our ordinary reading” (2005, 57). For Burrows, and for Flanders, the corpus being explored is still relatively small—in this case a handful of novels by Jane Austen—compared to the large corpora available today. This increased scale underscores the importance of extending our range of perception beyond ordinary reading practices. Flanders writes specifically of Burrows's use of the computer to help him see more in the texts that he was then reading or studying. The further step, beyond Burrows, is to allow the computer to help us see even more, even deeper, to go beyond what we are capable of reading as solitary scholars.*
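The basic act of counting that underlies studies like Burrows's, tallying the relative frequencies of common function words across a text, can be sketched in a few lines of Python. This is only a minimal illustration of the counting step, not Burrows's actual statistical method; the short function-word list and the single Austen sentence are illustrative choices of my own.

```python
import re
from collections import Counter

# A small, illustrative subset of the high-frequency "function words"
# that close readers tend to pass over.
FUNCTION_WORDS = ["the", "of", "and", "a", "in", "that", "it", "to"]

def relative_frequencies(text, words=FUNCTION_WORDS):
    """Return each target word's share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude regex tokenizer
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in words}

# The opening sentence of Pride and Prejudice as a tiny sample "corpus."
sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")
freqs = relative_frequencies(sample)
```

Scaled up from a single sentence to whole novels, such frequency tables become the raw material for the kinds of comparative statistics that Burrows reports.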
The result of such macroscopic investigation is contextualization on an unprecedented scale. The underlying assumption is that by exploring the literary record writ large, we will better understand the context in which individual texts exist and thereby better understand those individual texts. This approach offers specific insights into literary historical questions, including insights into:
• the historical place of individual texts, authors, and genres in relation to a larger literary context
• literary production in terms of growth and decline over time or within regions or within demographic groups
• literary patterns and lexicons employed over time, across periods, within regions, or within demographic groups
• the cultural and societal forces that impact literary style and the evolution of style
• the cultural, historical, and societal linkages that bind or do not bind individual authors, texts, and genres into an aggregate literary culture
• the waxing and waning of literary themes
• the tastes and preferences of the literary establishment and whether those preferences correspond to general tastes and preferences
Furthermore, macroanalysis provides a practical method for approaching questions such as:
• whether there are stylistic patterns inherent to particular genres
• whether style is nationally determined
• whether and how trends in one nation's literature affect those of another
• the extent to which subgenres reflect the larger genres of which they are a subset
• whether literary trends correlate with historical events
• whether the literature that a nation or region produces is a function of demographics, time, population, degrees of relative freedom, degrees of relative education, and so on
• whether literature is evolutionary
• whether successful works of literature inspire schools or traditions
• whether there are differences between canonical authors and those who have been traditionally marginalized
• whether factors such as gender, ethnicity, and nationality directly influence style and content in literature
A macroanalytic approach helps us not only to see and understand the operations of a larger “literary economy,” but, by means of scale, to better see and understand the degree to which literature and the individual authors who manufacture that literature respond to or react against literary and cultural trends. Not the least important, as I explore in chapter 9, the method allows us to chart and understand “anxieties of influence” in concrete, quantitative ways.
For historical and stylistic questions in particular, a macroanalytic approach has distinct advantages over the more traditional practice of studying literary periods and genres by means of a close study of “representative” texts. Franco Moretti has noted how “a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn't a sum of individual cases: it's a collective system, that should be grasped as a whole” (2005, 4). To generalize about a “period” of literature based on a study of a relatively small number of books is to take a significant leap from the specific to the general. Naturally, it is also problematic to draw conclusions about specific texts based on some general sense of the whole. This, however, is not the aim of macroanalysis. Rather, the macroscale perspective should inform our close readings of the individual texts by providing, if nothing else, a fuller sense of the literary-historical milieu in which a given book exists. It is through the application of both approaches that we reach a new and better-informed understanding of the primary materials.
An early mistake or misconception about what computer-based text analysis could provide scholars of literature was that computers would somehow provide irrefutable conclusions about what a text might mean. The analysis of big corpora being suggested here is not intended for this purpose. Nor is it a strictly scientific practice that will lead us to irrefutable conclusions. Instead, through the study and processing of large amounts of literary data, the method calls our attention to general trends and missed patterns that we must explore in detail and account for with new theories. If we consider that this macroanalytic approach simply provides an alternative method for accessing texts and simply another way of harvesting facts from and around texts, then it may seem less threatening to those who worry that a quantification of the humanities is tantamount to the destruction of the humanities.
In literary studies, we are drawn to and impressed by grand theories, by deep and extended interpretations, and by complex speculations about what a text—or even a part of a text—might mean: the indeterminacies of deconstruction, the ramifications of postcolonialism, or how, for example, the manifold allusions in Joyce's Ulysses extend the meaning of the core text. These are all compelling. Small findings, on the other hand, are frequently relegated to the pages of journals that specialize in the publication of “notes.” Craig Smith and M. C. Bisch's small note in the Explicator (1990), for example, provides a definitive statement on Joyce's obscure allusion to the Iliad in Ulysses, but who reads it and who remembers it?* Larger findings of fact, more objective studies of form, or even literary biography or literary history have, at least for a time, been “out of style.” Perhaps they have been out of style because these less interpretive, less speculative studies seem to close a discussion rather than to invite further speculation. John Burrows's fine computational analysis of common words in the fiction of Jane Austen is an example of a more objectively determined exploration of facts, in this case lexical and stylistic facts. There is no doubt that the work helps us to better understand Austen's corpus, but it does so in a way that leaves few doors open for further speculation (at least within the domain of common word usage, or “idiolects,” as Burrows defines them). A typical criticism leveled against Burrows's work is that “most of the conclusions which he reaches are not far from the ordinary reader's natural assumptions” (Wiltshire 1988, 380). Despite its complexity, the result of the work is an extended statement of the facts regarding Austen's use of pronouns and function words.
This final statement, regardless of how interesting it is to this reader, has about it a simplicity that inspires only a lukewarm reaction among contemporary literary scholars who are evidently more passionate about and accustomed to deeper theoretical maneuverings. To Burrows's credit, Wiltshire acknowledges that the value of Burrows's study is “not that it produces novel or startling conclusions—still less ‘readings’—as that it allows us to say that such ‘impressions’ are soundly based on verifiable facts” (ibid.).
Arguments like those made by Burrows have been, and perhaps remain, underappreciated in contemporary literary discourse precisely because they are, or appear to be, definitive statements. As “findings,” not “interpretations,” they have about them a deceptive simplicity, a simplicity or finality that appears to render them “uninteresting” to scholars conditioned to reject the idea of a closed argument. Some years ago, my colleague Stephen Ramsay warned a group of computing humanists against “present[ing] ourselves as the people who go after the facts.”* He is right, of course, in the sense that we ought to avoid contracting that unpleasant disease of quantitative arrogance. It is not the facts themselves that we want to avoid, however; we certainly still want and need “the facts.”
Among the branches of literary study, there are many in which access to and apprehension of “the facts” about literature are exactly what is sought. Most obvious here are biographical studies and literary history, where determining what the facts are has a great deal of relevance not simply in terms of explaining context but also in terms of determining how we understand and interpret the literary works within that context: the works of a given author or the works of a given historical period. Then there is the matter of stylistics and of close reading, which are both concerned with ascertaining, by means of analysis, certain distinguishing features or facts about a text.
Clearly, literary scholars do not have problems with the facts about texts per se. Yet there remains a hesitation—or in some cases a flat-out rejection—when it comes to the usefulness of quantification. This hesitation is more than likely the result of a mistaken impression that the conclusions following from a computational or quantitative analysis are somehow to be preferred to conclusions that are arrived at by other means. A computational approach need not be viewed as an alternative to interpretation—though there are some, such as Gottschall (2008), who suggest as much. Instead, and much less controversially, computational analysis may be seen as an alternative methodology for the discovery and the gathering of facts. Whether derived by machine or through hours in the archive, the data through which our literary arguments are built will always require the careful and imaginative scrutiny of the scholar. There will always be a movement from facts to interpretation of facts. The computer is a tool that assists in the identification and compilation of evidence. We must, in turn, interpret and explain that derivative data. Importantly, though, the computer is not a mere tool, nor is it simply a tool of expedience. Later chapters will demonstrate how certain types of research exist only because of the tools that make them possible.
Few would object to a comparative study of Joyce and Hemingway that concludes that Hemingway's style is more minimalist or more “journalistic” than Joyce's. One approach to making this argument would be to pull representative sentences, phrases, and paragraphs from the works of both authors and from some sampling of journalistic prose in order to compare them and highlight the differences and similarities. An alternative approach would involve “processing” the entire corpus of both authors, as well as the journalistic samples, and then computing the differences and similarities using features that computers can recognize or calculate, features such as average sentence length, frequent syntactical patterns, lexical richness, and so on. If the patterns common to Hemingway match more closely the patterns of the journalistic sample, then new evidence and new knowledge would have been generated. And the latter, computational, approach would be all the more convincing for being both comprehensive and definitive, whereas the former approach is anecdotal and speculative. The conclusions reached by the first approach are not necessarily wrong, only less certain and less convincing. Likewise, the second approach may be wrong, but that possibility is less likely, given the method.* Far more controversial and objectionable would be an argument along the lines of “Moby Dick is God, and I have the numbers to prove it.” The issue, as this intentionally silly example makes clear, is not so much about the gathering of facts but rather what it is that we are doing with the facts once we have them.
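The computational route described above can be illustrated with a short Python sketch. It is deliberately minimal, not the comprehensive procedure the paragraph envisions: only two of the named features are computed (average sentence length and lexical richness, here a simple type-token ratio), the sentence splitter is a crude punctuation heuristic, and the two sample strings are invented stand-ins, not actual quotations from either author.

```python
import re

def sentence_lengths(text):
    """Word counts per sentence, split on a crude punctuation heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def avg_sentence_length(text):
    lengths = sentence_lengths(text)
    return sum(lengths) / len(lengths)

def type_token_ratio(text):
    """Lexical richness: distinct word forms divided by total tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

# Invented stand-ins for the two corpora (not actual quotations).
minimalist = "He sat. He drank. The river was cold. The fish did not come."
ornate = ("The ineluctable and manifold resonances of the city, refracted "
          "through memory and allusion alike, unfolded in a single winding "
          "sentence that refused every full stop until the very end.")

short_avg = avg_sentence_length(minimalist)  # many short sentences
long_avg = avg_sentence_length(ornate)       # one long sentence
```

Given such feature values computed over full corpora, the comparison to a journalistic sample reduces to asking which author's numbers sit closer to the sample's.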
It is this business of new knowledge, distant reading, and the potentials of a computer-based macroanalysis of large literary corpora that I take up in this book. The chapters that follow explore methods of large-scale corpus analysis and are unified by a recurring theme of probing the quirks of literary influence that push and pull against the creative freedom of writers. Unlike Harold Bloom's anecdotal, and for me too frequently impenetrable, study of influence, the work presented here is primarily quantitative, primarily empirical, and almost entirely dependent upon computation—something that Bloom himself anticipated in writing The Anxiety of Influence back in 1973. Bloom, with some degree of derision, wrote of “an industry of source-hunting, of allusion-counting, an industry that will soon touch apocalypse anyway when it passes from scholars to computers” (31). Though my book ends up being largely about literary influence—or, if you prefer, influences upon literary creativity—and to a lesser extent about the place of Irish and Irish American writers in the macro system of British and American literature, it is meant fundamentally to be a book about method and how a new method of studying large collections of digital material can help us to better understand and contextualize the individual works within those collections. The larger argument I wish to make is that the study of literature should be approached not simply as an examination of seminal works but as an examination of an aggregated ecosystem or “economy” of texts. Some may wish to classify my research as “exploratory” or as “experimental” because the work I present here does more to open doors than it does to close them. I hope that this is true, that I open some doors. I hope that this work is also provocative in the sense of provoking more work, more exploration, and more experimentation.
I am also conscious that work classified under the umbrella of “digital humanities” is frequently criticized for failing to bring new knowledge to our study of literature. Be assured, then, that this work of mine is not simply provocative. There are conclusions, some small and a few grand. This work shows, sometimes in dramatic ways, how individual creativity—the individual agency of authors and the ability of authors to invent fiction that is stylistically and thematically original—is highly constrained, even determined, by factors outside of what we consider to be a writer's conscious control. Alongside minor revelations about, for example, Irish American writing in the early 1900s and the nature of the novelistic genre in the nineteenth century, I continually engage this matter of “influence” and the grander notions of literary history and creativity that so concerned Eliot, Bloom, and the more or less forgotten Russian formalists whose bold work in literary evolution was so far ahead of its time.
The chapters that follow share a common theme: they are not about individual texts or even individual authors. The methods described and the results reported represent a generational shift away from traditional literary scholarship, and away even from traditional text analysis and computational authorship attribution. The macroanalysis I describe represents a new approach to the study of the literary record, an approach designed for probing the digital-textual world as it exists today, in digital form and in large quantities.
* Ian Watt's impressive study The Rise of the Novel (1957) is an example of what I mean in speaking of macro-oriented studies that do not rise far beyond the level of anecdote. Watt's study of the novel is indeed impressive and cannot and should not be dismissed. Having said that, it is ultimately a study of the novel based on an analysis of just a few authors. These authors provide Watt with convenient touchstones for his history, but the choice of these authors cannot be considered representative of the ten to twenty thousand novels that make up the period Watt attempts to cover.
† The human aggregation of multiple case studies could certainly be considered a type of macroanalysis, or assimilation of information, and there are any number of “macro-oriented” studies that take such an approach, studies, for example, that read and interpret economic history by examining various case studies. Alan Liu pointed me to Shoshana Zuboff's In the Age of the Smart Machine (1988) as one exemplary case. Through discussion of eight specific businesses, Zuboff warns readers of the potential downsides (dehumanization) of computer automation. Nevertheless, although eight is better than one, eight is not eight thousand, and, thus, the study is comparatively anecdotal in nature.
* This approach again resonates with the approaches taken by the Annales historians. Patrick H. Hutton writes that whereas “conventional historians dramatize individual events as landmarks of significant change, the Annales historians redirect attention to those vast, anonymous, often unseen structures which shape events by retarding innovation” (1981, 240).
* Smith and Bisch note how Joyce's use of “bronze b[u]y[s] gold” in Sirens mirrors a “minor encounter in the Iliad…[in which] Diomedes trades his bronze armor for the gold armor of Glaucos” (1990, 206). See Joyce's Ulysses 11.1–4.
* Ramsay made these comments at the “Face of Text” conference hosted by McMaster University in November 2004. See http://tapor1.mcmaster.ca/~faceoftext/index.htm.
* There are those who object to this sort of research on the grounds that these methods succeed only in telling us what we already know. In a New York Times article, for example, Kathryn Schulz (2011) responded to some similar research with a resounding “Duh.” I think Schulz misses the point here and misreads the work she is discussing (my blog post explaining why can be found at http://www.matthewjockers.net/2011/07/01/on-distant-reading-and-macroanalysis/). To me, at least, her response indicates a lack of seriousness about literature as a field of study. Why should further confirmation of a point of speculation engender a negative response? If the matter at hand were not literary, if it were global warming, for example, and new evidence confirmed a particular “interpretation” or thesis, surely this would not cause a thousand scientists to collectively sigh and say, “Duh.” A resounding “I told you so,” perhaps, but not “Duh.” But then Schulz bears down on the straw man and thus avoids the real revelations of the research being reviewed.