Alexander Hamilton, James Madison, or John Jay?

For more than 150 years, historians argued over the authorship of 12 essays in The Federalist Papers, founding documents in the American march toward democracy. Though the essays are world-famous hallmarks in the lexicon of American history, the specific authors of each one remained unknown. The question of which Founding Father penned the essays had sparked such endless debate that it had devolved into a popular parlor game among historians. Just who exactly wrote the stirring arguments upon which our governing structure was based?

The answer was hidden in the words themselves, but to find it, scholars needed not a close reading but a close counting. They needed to look only at the numbers.

The mystery began in late 1787, when a series of essays advocating the ratification of the Constitution was published in New York newspapers under the pen name “Publius.” Shielding the true identities of the authors with the patriotic nom de plume was a somewhat farcical endeavor. In fact, of the nearly 4 million people living in the United States in 1787, all but three could be eliminated from contention.

It was an open secret that Hamilton, Madison, and Jay were the authors, but none of the three wanted to step forward and admit to writing any particular essays. Each had political ambitions, later rising to the ranks of Secretary of the Treasury, President, and Chief Justice of the Supreme Court, respectively, so they weren’t without good reason. But their excess of caution left the mystery of authorship intact, titillating history professors and armchair enthusiasts alike for many years to come.

You might think that the scholars and astute politicos of the day would have been able to determine the authorship on their own. There were only three potential candidates, after all, each with his own political slant and style of communication. It would have been the equivalent of an anonymous editorial in the New York Times, penned by Barack Obama, Hillary Clinton, or Bernie Sanders. Or an unsigned manifesto by George W. Bush, John McCain, or Donald Trump. All might be coming from the same side, but they were certainly not all identical.

In 1804, a solution finally seemed to emerge. Hamilton wrote a letter to his friend Egbert Benson listing the author of each essay. Hamilton was preparing to duel Aaron Burr. He sensed both the historical significance of The Federalist Papers and the chances of his survival. He decided not to let his knowledge of the authorship die with him.

This should have been the end of the mystery. A nation of curious observers had no reason to doubt Hamilton’s firsthand knowledge. Yet 13 years later, soon after ending his second term as President, Madison put out his own list of authorship—one that differed from Hamilton’s. Twelve of the essays that Hamilton claimed to have written were also claimed by Madison.

This reopened the debate with a new fervor, fueling spats among historians for more than a century. In 1892, future senator Henry Cabot Lodge wrote on the topic siding with Hamilton, while noted historian E. G. Bourne went with Madison.

Most historians tried to tease out the authors based on the political ideology presented in each essay. Would Madison really have argued for a central bank in such certain terms? Would Hamilton have supported limits on Congress so freely? Or maybe that’s something John Jay would have written?

It wasn’t until 1963, nearly two centuries later, that the mystery was at long last solved. The definitive answer came from respected professors Frederick Mosteller of Harvard University and David Wallace of the University of Chicago. However, unlike the many professors who had attempted to solve the question before them, Mosteller and Wallace were not historians. They were not known for their scholarly work on early America. They had never published a paper on historical figures at all. Mosteller and Wallace were statisticians.

One of Mosteller’s most noteworthy papers dealt with the World Series and whether or not seven games was enough to statistically find the best baseball team. Just a few years prior to looking into the authorship problem, Wallace had published a paper named “Bounds on Normal Approximations to Student’s and the Chi-Square Distributions,” which probably sounds as close to nonsense to you as the thought of probability functions solving a historical mystery sounded to history professors in 1963.

Mosteller and Wallace’s methodology for ending the authorship debate had nothing to do with politics or ideologies. Instead, they were two of the first statisticians to leverage word frequency and probability.

Their process was in some ways complex, featuring equations with factorials, exponents, summations, logarithms, and t-distributions. But the heart of their methods was strikingly simple:

• Count the frequency of common words in essays that we know either Hamilton or Madison wrote.

• Count the frequency of those same words in essays where the author is unknown.

• Compare these frequencies to determine the author of the disputed essays.
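The three steps above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not Mosteller and Wallace's actual procedure, which rested on a far more careful probabilistic model. The toy sentences, the short marker-word list, the "Author A" and "Author B" labels, and the function names `rate_per_thousand` and `closest_author` are all invented for the example.

```python
import re
from collections import Counter

# Illustrative marker words only; the real study chose its words systematically.
MARKER_WORDS = ["while", "whilst", "upon", "also"]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def rate_per_thousand(text, words=MARKER_WORDS):
    """Step 1 and 2: how often each marker word appears per 1,000 words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {w: 1000 * counts[w] / total for w in words}

def closest_author(disputed_text, known_texts):
    """Step 3: pick the author whose marker-word rates are nearest."""
    target = rate_per_thousand(disputed_text)

    def distance(author):
        profile = rate_per_thousand(known_texts[author])
        return sum(abs(profile[w] - target[w]) for w in MARKER_WORDS)

    return min(known_texts, key=distance)

# Hypothetical stand-ins for essays of known authorship.
known = {
    "Author A": "Upon reflection, while the people deliberate, the states act upon it.",
    "Author B": "Whilst the people deliberate, the states also act, and also the courts.",
}
disputed = "Whilst the convention sat, the legislature also met."

print(closest_author(disputed, known))  # prints "Author B"
```

Summing absolute differences in rates is the simplest possible way to "compare these frequencies." It is used here as a stand-in for the statisticians' probability calculations, which weigh each word by how reliably it separates the two authors.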

Even before any of the fancy probabilistic equations come into play, the results of the statisticians’ approach seem wonderfully obvious in retrospect. In The Federalist Papers, Madison used the word whilst in over half the essays in which his authorship had been confirmed—but he never once used the word while. Hamilton, meanwhile, used the word while in about one-third of his essays but never once used whilst.
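A figure such as "in over half the essays" is a share of documents rather than a raw word count. Here is a small sketch of how that share could be computed. The helper names `contains_word` and `share_of_essays_containing` and the four toy sentences are assumptions made up for illustration; they are not text from The Federalist Papers.

```python
import re

def contains_word(text, word):
    """True if `word` appears as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def share_of_essays_containing(word, essays):
    """Fraction of essays in which `word` appears at least once."""
    if not essays:
        return 0.0
    hits = sum(1 for essay in essays if contains_word(essay, word))
    return hits / len(essays)

# Hypothetical miniature "essays" standing in for one author's known work.
toy_essays = [
    "Whilst the states deliberate, the union waits.",
    "The union waits for no one.",
    "Meanwhile the people watch whilst the delegates argue.",
    "A worthwhile debate followed.",
]

print(share_of_essays_containing("whilst", toy_essays))  # 0.5
print(share_of_essays_containing("while", toy_essays))   # 0.0
```

The word-boundary match matters: without it, "meanwhile" and "worthwhile" would be miscounted as uses of while.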

Mosteller and Wallace did not rely on a single word for their analysis, however. That would not have been statistically sound. Instead, they systematically chose dozens of basic words and then found the frequency of each in the disputed essays. Many words, entirely nonpolitical in meaning, turned out to have drastically different usage rates between the two authors. For example, Madison used also twice as often as Hamilton, while Hamilton used according much more frequently than Madison.

Mosteller and Wallace had falsifiability on their side. They could show that by using the same methods on papers where the author was known, they could determine the authorship with perfect results. Of the 12 disputed essays, Mosteller and Wallace concluded that James Madison was the actual author of all 12.

In the written summary of their results the two statisticians proceeded with caution, perhaps out of fear of angering historians who had been scratching their heads for generations. The numbers presented in their experiment told a different story; the two had complete confidence in the method. It was flawless in all the test cases where authorship was known, and its results were consistent in the essays with unknown authorship. Hamilton’s claim of authorship was wrong.

Today, after countless more studies of the papers in both statistical and nonstatistical manners, Mosteller and Wallace’s findings—that Madison was the author—have become the consensus among statisticians and historians alike. Mosteller and Wallace were ahead of their time. Their study, though it involved some formulaic complexity, relied essentially on counting words. With today’s computers, word counts and frequencies are trivial pursuits. In 1963, this was not the case.

Word counts were done by hand; to find the number of times the word upon appeared in each of the essays, for example, they tallied the usage page by page. To understand what Mosteller and Wallace went through (or at least what their research assistants went through), I printed out a complete collection of The Federalist Papers and set out to count the number of times upon appeared. After 30 minutes I was only one-eighth of the way through—about 40 pages—and had counted 37 instances of the word upon. It wasn’t long before my eyes were pounding and my brain went numb. “Where’s Upon?” was like a devilish version of “Where’s Waldo?”

I gave up on pretending I was in 1963. Instead I did some counting only possible with twenty-first-century technology: I went to Google, searched “Federalist Papers Complete Text File,” downloaded the file from the first result, and opened it in Microsoft Word. After a grand total of two minutes, a “Find All” on upon turned up 46 occurrences of the word in the section I had covered. Not only was the computerized method 28 minutes faster, it was far more accurate than my weary eyes could be.

Even more staggering: Though the amount of time needed for a person to scan The Federalist Papers in full for an additional word would hover around four hours, scanning via computer for all words would take a negligible amount of time. Doing a similar analysis on the complete works of Shakespeare, the Bible, Moby Dick, or even the corpus of English literature would have been unfathomable to Mosteller and Wallace. Today, using computers to count the instances of a single word in a large text is a task mastered by most teenagers.
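To make the point concrete, here is a minimal sketch of counting every word in a text in a single pass. The function name `word_counts` and the sample sentence are assumptions for illustration; the sentence is invented, not quoted from the essays.

```python
import re
from collections import Counter

def word_counts(text):
    """Tally every word in the text at once, ignoring case."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

# Invented sample text; a real run would read a full text file instead.
sample = (
    "Upon this point, and upon this point alone, the question turns. "
    "Thereupon we rest."
)

counts = word_counts(sample)
print(counts["upon"])       # 2
print(counts["thereupon"])  # 1
```

Because the whole tally is built in one pass, looking up a second or a hundredth word costs essentially nothing extra, which is the contrast with the four hours of hand counting per word. For a full book, the same function could be applied to the contents of a downloaded plain-text file.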

In the fifty years since Mosteller and Wallace published their study, the field of computer-aided text processing has grown rapidly. Google uses text analysis both in its search results and in deciding what ads to show you. Researchers have tried to use text analysis to determine what makes a tweet go viral, while media outlets often run similar versions of the same headline with slight tweaks in wording to maximize page views. But the uses thought up so far by tech companies are only one possible route.

Mosteller and Wallace used statistics to investigate a single question of authorship, but the implications of their success were more profound. Writers have distinct styles that are both consistent and predictable. As it turns out, it’s not just eighteenth-century politicians who leave a stylistic fingerprint. Authors of all books, whether they be popular and renowned or obscure and reviled, repeat their words and structure over decades of writing.

The question Mosteller and Wallace asked and answered was limited in its scope, but text analysis can answer a huge range of questions that have intrigued curious writers and readers for generations. Did Ernest Hemingway actually use fewer adverbs than other writers? How does reading level affect the popularity of a book? Do men and women write differently? Do writers follow their own advice, and is that advice any good? What, besides superficial spellings, distinguishes American and British novelists? From Vladimir Nabokov to E L James, what are our favorite authors’ favorite words?

While there has been a slowly growing movement in academia to investigate the writing patterns of successful authors, there are still enormous questions that have yet to be explored. And for everyone from the casual reader to the literature major to the aspiring writer, these questions are both fascinating and useful. You probably don’t care about the Poisson distribution or the parsing programs used to decipher parts of speech, but you probably do want to know how your favorite author writes—and what that might mean about you as a reader.

The analytical approach to writing can be amusing and informative and often downright funny. Moreover, it can teach us about the writers we read every day and the words we use in our own writing. That’s what we’ll delve into in this book, devoting each chapter to a new literary experiment.

The research won’t be painfully complex. It doesn’t need to be, and shouldn’t be, in order to be worthwhile. Many obvious and intriguing questions about classic literature or the modern bestseller can be viewed through a statistical lens but just haven’t been framed that way yet. This book is about tackling these simple yet unique questions in a new way. It’s a book about words that is, paradoxically, written with numbers.