In a revolution, as in a novel, the most difficult part to invent is the end.
—Alexis de Tocqueville, 1896
I began this book with a call to arms, an argument that what we have today in terms of literary and textual material and computational power represents a moment of revolution in the way we study the literary record. I suggested that our primary tool of excavation, close reading, is no longer satisfactory as a solitary method of literary inquiry. I have argued throughout these chapters that large-scale text analysis, text mining, “macroanalysis,” offers an important and necessary way of contextualizing our study of individual works of literature. I hope I have made it clear that the macroanalysis I imagine is not offered as a challenge to, or replacement for, our traditional modes of inquiry. It is a complement.*
I ended, in chapter 9, with an exploration of literary influence and the idea—just an idea and some preliminary experiments to test it—that literature can and perhaps even must be read as an evolving system with certain inherent rules. Evolution is the word I am drawn to, and it is a word that I must ultimately eschew. Although my little corpus appears to behave in an evolutionary manner, surely it cannot be as flawlessly rule-bound and elegant as evolution. There are those, of course, who question the veracity of biological evolution. I am not among them; it is a convincing argument. Nevertheless, it is true that there are further dimensions to explore, and it may be prudent to consider evolution as an idea, or alternatively “Darwin's dangerous idea,” as Daniel Dennett puts it in the title of his book.
In the Wikipedia entry for Dennett's book, it is noted that “Darwin's discovery was that the generation of life worked algorithmically, that processes behind it work in such a way that given these processes the results that they tend toward must be so” (Wikipedia 2011a; emphasis added). The generation of life is algorithmic. What if the generation of literature were also so? Given a certain set of environmental factors—gender, nationality, genre—a certain literary result may be reliably predicted; many results may even be inevitable. This is another dangerous idea, perhaps a crazy one. Even after writing an entire book about how these “environmental” factors can be used to predict certain elements of creative expression, I am reluctant to reach for such a grand conclusion. There is much more work to be done.*
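To make that speculation concrete, a hypothetical sketch may help. The records, factors, and outcome below are invented for illustration and do not describe the models used in this study; the point is only what “prediction from environmental factors” might look like when expressed as a simple classification over author and genre metadata.

    # Hypothetical illustration: invented metadata and labels, not data from this study.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each record encodes the "environmental" factors named above.
    books = [
        {"gender": "female", "nation": "British",  "genre": "gothic"},
        {"gender": "male",   "nation": "American", "genre": "sea"},
        {"gender": "female", "nation": "Irish",    "genre": "gothic"},
        {"gender": "male",   "nation": "British",  "genre": "sea"},
    ]
    # Invented outcome: whether a given theme is prominent in the novel (1) or not (0).
    theme_prominent = [1, 0, 1, 0]

    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(books)  # one-hot encode the categorical factors
    model = LogisticRegression().fit(X, theme_prominent)

    new_book = {"gender": "female", "nation": "American", "genre": "sea"}
    # Estimated probability that the theme is prominent in the unseen book.
    print(model.predict_proba(vectorizer.transform([new_book]))[0, 1])

Whether such a model actually predicts anything, and for which outcomes, is precisely the empirical question the preceding chapters have only begun to address.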
These ideas about the inevitability of literary change and about the inevitability of the form and shape of creativity are still ideas: ideas supported by 3,346 observations and 2,032,248 data points, but ideas nevertheless. Despite some convincing evidence of systematic organization within my corpus, I am not satisfied with my results. Darwin wasn't either. He writes:
I look at the natural geological record, as a history of the world imperfectly kept, and written in a changing dialect; of this history we possess the last volume alone, relating only to two or three countries. Of this volume, only here and there a short chapter has been preserved; and of each page, only here and there a few lines. Each word of the slowly-changing language, in which the history is supposed to be written, being more or less different in the interrupted succession of chapters, may represent the apparently abruptly changed forms of life, entombed in our consecutive, but widely separated formations. (1964, 290)
It is ironic, and heartening, to find Darwin likening the evidence for evolution to an incomplete text. My corpus of 3,346 books is similar: incomplete, interrupted, haphazard, and at the same time revealing in ways that first suggest, then taunt, and ultimately demand. The comprehensive work is still to be done. Unlike evolutionary biologists, however, we literary scholars are at an advantage in terms of the availability of our source material. Having cited the same passage from Darwin, Franco Moretti points out that “the ‘fossils’ of literary evolution are often not lost, but carefully preserved in some great library” (2009, 158). It is fitting, therefore, to follow a chapter on literary genealogies with a few thoughts on digital preservation, orphan works, and the future for macroanalysis.
The fact of the matter is that text miners need digital texts to mine, and “modern copyright law,” as Loyola law professor Matthew Sag puts it, “ensures that this process of scanning and digitization is ensnared in a host of thorny issues” (2012, 2). Well, that is a nice way of putting it! Today's digital-minded literary scholar is shackled in time; we are all, or are all soon to become, nineteenth-centuryists. We are all slaves to the public domain. Perhaps Google, or the HathiTrust, is the “great library” that Moretti imagines? With some twenty million books, Google certainly seems the likely candidate for this title of “great library.”* Yet it is not so great. Google is not a library at all. We cannot study literary history from within the arbitrary constraints of “snippet view.” Access to a few random lines from a few random pages is tempting pabulum but not a meal. So here, perhaps, Darwin has the advantage: the artifacts of human evolution are not copyrighted! Sadly, the same cannot be said for the artifacts of creative evolution. The HathiTrust digital repository, a self-described “partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future,” offers some hope that the record will be preserved.† The associated HathiTrust Research Center offers hope that the record will be accessible. For now, however, the pages of the HathiTrust Research Center are filled with verbs in the future tense: there will be X and there will be Y. And of what there will be, access to some, at least, will most certainly require authentication through a portal.‡ The project is young, and the initial phase focuses on works already identified as being within the public domain. For the time being, access to in-copyright material must be forbidden.§ There are technical challenges, to be sure, but the real hurdle is legal.
Figuring out how to jump this hurdle is the preoccupation of many a good lawyer, and this topic was the focus of a recent University of California-Berkeley Law School symposium titled “Orphan Works and Mass Digitization: Obstacles and Opportunities.”¶ The final session of the conference posed the question: “Should data mining and other nonconsumptive uses of in-copyright digital works be permissible, and why?” As a contributor to the panel session, I offered an incomplete answer to the second part of the question: I discussed some of the findings from this book. It is not often that a literary scholar gets to address a crowd of several hundred copyright lawyers, so I ended with an appeal: “After seven years of digging in this corpus of thirty-five hundred books, I've come to the conclusion that there are still dozens of stones unturned, scores of dissertations and papers still to be written. Not one of these papers or dissertations, however, will be conclusive; none will answer the really big questions. For those questions we need really big data, big corpora, and these data cannot be in the form of snippets. To do this work, we need your help.” Ultimately, and fittingly, it was the lawyers at the conference who stole the show. They are a motivated crew; they believe in what they are doing, and they are the ones poised to be the unsung heroes of literary history. Matthew Sag's argument in favor of “nonexpressive” use is most compelling.* Sag prefers the term nonexpressive to the competing and more ambiguous and sickly-sounding term nonconsumptive. For Sag and other legal thinkers, the crux of the entire copyright entanglement can be distilled into a discussion of the so-called “orphan works.” Books classified as “orphans” are those whose copyright status is in question because their copyright owners cannot be easily or economically identified. This is the hard nut to crack. Sag reports several estimates of the scale of the orphan-works problem:
• 2.3 percent of the books published in the United States between 1927 and 1946 are still in print.
• Five out of seven books scanned by Google are not commercially available.
• Approximately 75 percent of books in U.S. libraries are out of print and have ceased earning income.
That last bullet point is cause for contemplation: Google's digital holdings are drawn primarily from books held in libraries.
Sag's essay highlights one of many ironies in this debate: the express purpose of copyright is to “promote the Progress of Science and useful Arts.”† Irony indeed. I suspect modern readers are under the impression, as I was, that the aim of copyright is the protection of scientific and creative output, not the promotion of it! I encourage readers to look up Sag's article in the Berkeley Technology Law Journal. For those of us seeking to mine the world's books, the arguments are compelling, and we ought to have these arguments in our arsenals. In short, Sag argues, “If the data extracted [from books] does not allow for the work to be reconstructed there is no substitution of expressive value. Extracting factual information about a work in terms of its linguistic structure or the frequency of the occurrences of certain words, phrases, or grammatical features is a nonexpressive use” (ibid., 21; emphasis added). Despite the clarity and obvious logic of this argument, the matter remains unresolved. We are stuck in a legal limbo, trying to resolve the continuum fallacy: at what point does a nonexpressive use become expressive? Good fodder for lawyers and legal scholars; bad news for humanists. The sad result of this legal wrangling is that scholars wishing to study the literary record at scale are forced to ignore almost everything that has been published since 1923. This is the equivalent of telling an archaeologist that he cannot explore in the Fertile Crescent.
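To make Sag's distinction concrete, consider a deliberately trivial sketch of nonexpressive extraction; the code below is my own illustration and is not drawn from Sag's article. Reducing a passage to a table of word frequencies records facts about the text while discarding the word order needed to reconstruct the author's expression.

    # A minimal illustration of "nonexpressive" extraction: a bag-of-words
    # frequency table records facts about a text but discards the word order
    # needed to reconstruct the author's expression.
    from collections import Counter
    import re

    def word_frequencies(text):
        """Return a frequency table of word tokens; order is not preserved."""
        tokens = re.findall(r"[a-z']+", text.lower())
        return Counter(tokens)

    sample = "Call me Ishmael. Some years ago, never mind how long precisely."
    print(word_frequencies(sample).most_common(3))
    # [('call', 1), ('me', 1), ('ishmael', 1)]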
In the opening chapter of this book, I cited Rosanne Potter's comments from 1988. She wrote, “Until everything has been encoded…the everyday critic will probably not consider computer treatments of texts.” It is a great shame that today, twenty-four years later, everything has been digitized, and still the everyday critic cannot consider computer treatments of texts.* Faced with the reality of copyright, many of my colleagues have simply given up. Recently, I was asked to speak to a group of graduate students and literary scholars about what can be done with a large corpus of texts. As I talked, the familiar lights came on; the excitement in the room was palpable. Shortly after the talk ended, the questions began. I watched this same group move from excitement to despair. Can we study Hemingway? No. Fitzgerald? No. Virginia Woolf? Again, no. Their hopes for enhancing their own research were deflated. From a list of several dozen twentieth-century canonical writers, the library had only a single digital text, a “bootleg” copy of Faulkner's The Sound and the Fury that could not really be “released” for fear of legal retribution. Released? What would it mean to have a digital copy of The Sun Also Rises on my hard drive? Nothing, of course; fair use permits me to make a digital copy and use it for research purposes. The fear is that I might, in turn, release it into the proverbial wild. And so goes the story… Despite these challenges, I remain hopeful. Ventures such as the HathiTrust Research Center and the Book Genome Project seem poised to make a difference.† Perhaps this nonexpressive book of mine will help.
* There remains, of course, a place for close reading, and some tasks demand it. Years ago, in the black-opal mines of Lightning Ridge, Australia, I learned the value of “close mining.” To extract an opal, one uses a small hand pick and a delicate touch to carefully chip away at the clay that may conceal a hidden gem. A more aggressive approach will most certainly destroy the delicate opals before they are even seen.
* In a coauthored paper, David Mimno and I begin to tackle some of the work that this present study demands. External factors such as author gender, author nationality, and date of publication affect both the choice of literary themes in novels and the expression of those themes, but the extent of this association is difficult to quantify. In this forthcoming work, we address the extent to which themes are statistically significant and the degree to which external factors predict word choice within individual themes. The paper is under review in the journal Poetics.
* According to Jennifer Howard's 2012 report in the Chronicle of Higher Education, Google claims to have scanned “more than 20 million books.”
† See http://www.hathitrust.org/about.
‡ See http://www.hathitrust.org/htrc_access_use.
§ In fairness, there is some language on the website indicating that access to a full-text index of the textual data may be provided in the future.
¶ See http://www.law.berkeley.edu/orphanworks.htm.
* Sag's article, “Orphan Works as Grist for the Data Mill,” appeared in the Berkeley Technology Law Journal. I am grateful to Professor Sag for allowing me access to his unpublished research. See Sag 2012, available at http://ssrn.com/abstract=2038889.
† Article 1, Section 8, Clause 8 of the United States Constitution.
* No, not everything, but compared to 1988, yes, everything imaginable.
† At the time of this writing, Novel Projects’ chief executive officer, Aaron Stanton, is working on a university-affiliated research project to allow academic researchers to access information about books in the corpus of the Book Genome Project. If all goes well, researchers will have access to transformative, nonexpressive data mined from portions of the Book Genome's in-copyright corpus. See http://bookgenome.com/.