DIGITIZATION

In 1951, David H. Shepard invented what the newspapers called a “robot reader-writer.” He spent $4,000 of his own money and worked in his attic for over a year. Two years later he took out a patent for what would eventually be known as the Scandex, one of the earliest *optical character recognition machines. He initially called it the “Gismo.”

Shepard’s invention was part of a growing postwar connection between the world of business and that of machines. Automated reading, or what we might think of more primitively as the processing of information from one written format to another, emerged as a core business concern with the rise of consumer-centered businesses. One of Shepard’s first clients was Reader’s Digest, which by the mid-1950s was shipping fifteen to twenty million books per year. Digitizing customer orders allowed companies to achieve an unprecedented scale and speed of service.

Alongside the booming postwar culture of business machines (think IBM), Shepard’s work also participated in a much longer tradition of thinking about the *conversion of information more generally. The Scandex, or the process of “scanning” that lent it its name, was based on the principle of turning written letters into electric current through the reflection of light. As light reflected off a surface onto photoconductive cells, a current of varying strength was produced by the absence or presence of letters (dark or light shapes). This current could then be translated from its *analog voltage into a digital or *binary representation.
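
The logic of that conversion can be sketched in a few lines of Python. The reflectance readings and the threshold below are invented for illustration; the point is simply that a continuously varying signal becomes a string of ones and zeros.

```python
# Illustrative sketch: turning a row of hypothetical reflectance readings
# into a binary pattern, as an array of photoconductive cells might.
# The values and the threshold are invented for demonstration.

readings = [0.92, 0.88, 0.15, 0.12, 0.90, 0.10, 0.14, 0.89]  # bright paper vs. dark ink

THRESHOLD = 0.5  # anything darker than this counts as "ink"

bits = [1 if r < THRESHOLD else 0 for r in readings]
print(bits)  # [0, 0, 1, 1, 0, 1, 1, 0] -- a binary trace of dark and light shapes
```

Everything that falls near the threshold, faint ink, foxed paper, a printer's smudge, has to be forced onto one side or the other.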

The principle of conversion behind Shepard’s invention was intimately related to other conversional forms of transmission that had emerged most notably over the course of the nineteenth century, from telegraphy (the conversion of letters into electromagnetic signals) to telephony (the conversion of spoken words into electrical signals). “Seeing by electricity,” as the editors of Scientific American called it in 1880, was first demonstrated as a functioning device by George Carey in the late 1870s. According to the editors, “The art of transmitting images by means of electric currents is now in about the same state of advancement that the art of transmitting speech by telephone had attained in 1876, and it remains to be seen whether it will develop as rapidly and successfully as the art of telephony.” It did not. Nevertheless, Carey’s device underlies all subsequent attempts at developing machines that can reproduce text or images through the use of photoelectric cells, from Shepard’s Gismo to the printer/scanner in your office today.

If the history of digitization owes a deep debt to such nineteenth-century forebears, the fascination with signal conversion that underlay it was part of an even longer humanistic tradition of thinking about converting information from one form to another, whether from image to letter (or vice versa), from manuscript to print (or vice versa), or from one language to another. In its earliest form, dating back to the late fourteenth century, “to scan” meant to analyze a line of verse for its metrical qualities (now falling under the heading of prosody). It signaled an attempt to quantify speech, to transform letters into their appropriate numbers. When we think about digitization as a form of binarization, it is important to remember the scholastic origins of the quantification of information. By the nineteenth century, scanning could also mean “to look searchingly,” as in scanning the heavens or someone’s facial features. There was a synthetic quality to this aspect of scanning, the attempt to take the parts of an observed world and condense them into a single judgment. Scanning was about not just taking things apart, but also putting them back together into a more unified whole. By the middle of the twentieth century, to scan had assumed its modern meaning of distilling information—it combined the swift synthetic qualities of its nineteenth-century visual heritage with the discretizing and quantifying aspects of its bibliographic and scholastic origins.

Much of the historiography on media and information has been guided by a notion of *reproducibility. Each subsequent technological revolution brings with it a more intense level of reproductive potential, whether of text, image, or sound. As copies abound, space is shrunk. This is how the editors of Scientific American, for example, couched the novelty of Carey’s device. “By this means any object so projected and so transmitted will be reproduced in a manner similar to that by which the letter, A [the initial projection], was reproduced.” In focusing on the notion of reproducibility, historians have tended to think more about questions of proliferation and access—“more” as both a problem and a solution. But in doing so we lose a sense of the differences produced between techniques of inscription, the transformations that occur as we move from one form or format to another. Reproducibility, like its theoretical offshoot remediation, presupposes that information doesn’t change as it travels. There is a fundamental notion of presence that resides within a theory of reproducible or remediable information.

One of the ideas that digitization introduces into the history of information, I would argue, is a new way of organizing that history around the notion of conversion rather than that of reproducibility. Conversion is the means of moving between *semiotic and material systems, of representing something in a new way rather than simply reproducing it in the same way. Conversion has been integral to how we have thought about and handled the idea of information. Similar to the practice of translation, conversion allows us to experience what it is to hold two incommensurable things together, to feel what it is like to be in between different worlds of signs. But it is also a powerful engine of discovery, much as in its theological origins, where conversion implied a profound personal transformation. In translating information from one form to another we enter into a space of novelty, of something unknown.

While we often think about digital culture in terms of increasing resolution or fidelity (“high definition” [HD], “dots-per-inch” [dpi], or the next “generation”), at its base, digitization involves a process of sampling and quantization. It brings notions of error and uncertainty, but also of redundancy, into the definition of cultural objects, whether as images, sounds, or texts. Digitization consists of a two-step process wherein an analog signal is sampled at discrete time intervals (discretization) and then those values are rounded or standardized (quantization) to produce a binary representation. As with any translational process, information is lost in the course of making it accessible in new ways. Conversion, the act of turning around, also involves the act of leaving behind. It highlights all the losses that occur and uncertainties that are introduced through the practice of transformation.
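
To make the two steps concrete, the following Python sketch samples an invented analog signal (a sine wave standing in for a fluctuating voltage) at discrete intervals and then rounds each sample to one of a handful of levels; the sampling rate and bit depth are arbitrary choices for demonstration, not a standard.

```python
import math

# A stand-in "analog" signal: a 5 Hz sine wave (purely illustrative).
def analog_signal(t):
    return math.sin(2 * math.pi * 5 * t)

SAMPLE_RATE = 40   # samples per second (the discretization step); arbitrary for the demo
BIT_DEPTH = 3      # bits per sample (the quantization step): 2**3 = 8 levels
LEVELS = 2 ** BIT_DEPTH

# Step 1: sample the continuous signal at discrete time intervals (one second's worth).
samples = [analog_signal(n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# Step 2: round each value in [-1, 1] onto one of eight integer levels,
# discarding whatever detail falls between the levels.
quantized = [min(int((s + 1) / 2 * LEVELS), LEVELS - 1) for s in samples]

print(quantized)  # forty integers, each storable in three bits
```

Reconstructing the wave from these eight levels can only ever approximate the original curve; that gap between signal and sample is the loss that conversion leaves behind.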

The problem of information loss would become central to Claude Shannon’s Mathematical Theory of Communication, one of the most influential works in the history of information theory. “The fundamental problem of communication,” writes Shannon, “is that of reproducing at one point either exactly or approximately a message selected at another point.” Understanding the reliability of communication channels—whether a message could be exactly or approximately reproduced—depended for Shannon on understanding the underlying redundancy *encoded in the initial message. For Shannon, knowledge of this redundancy allowed for more efficient engineering of communication channels. His work would engender an entire industry devoted to the problem of *compression, the practice of encoding information using fewer *“bits” than the original representation (which could be either “lossless” or “lossy” depending on one’s goals or the nature of the underlying information). This is one more way that the principle of conversion, and the ideas of loss and uncertainty, stand behind the practices of digitization. For a subsequent generation of linguists and cultural theorists, Shannon’s insights would drive a growing consensus about the probabilistic nature of communication more generally—the way an information-theoretic model underlies human communication and cognition. According to this thinking, meaning isn’t a function of what stands out as radically novel or unique but emerges in the space between deviation from norms and the array of repetitions that shape cultural practices and human behavior.
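
Shannon’s link between redundancy and compression can be made tangible with a toy calculation. The Python sketch below estimates the entropy of a short, arbitrary string from its letter frequencies and compares it with the cost of a fixed-length code; the difference between the two is, roughly, the redundancy that a compressor can exploit.

```python
import math
from collections import Counter

message = "in the beginning was the word"  # an arbitrary example string

# Empirical probability of each character in the message.
counts = Counter(message)
total = len(message)
probs = [c / total for c in counts.values()]

# Shannon entropy: the average number of bits per character implied
# by these frequencies.
entropy = -sum(p * math.log2(p) for p in probs)

# A naive fixed-length code spends the same number of bits on every
# distinct character, common or rare.
fixed_length = math.ceil(math.log2(len(counts)))

print(f"entropy per character:      {entropy:.2f} bits")
print(f"fixed-length code per char: {fixed_length} bits")
print(f"redundancy a compressor can exploit: {fixed_length - entropy:.2f} bits per character")
```

A lossless encoder can approach the entropy figure but not beat it; a lossy one goes further by deciding which differences are not worth keeping.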

Error is fundamental to any understanding of digital culture. We can measure the degree of error of OCR (optical character recognition), that is, the amount of mistranscription when machines read documents. But accounting for error is also essential in all *machine-learning-based approaches to understanding culture at large scale. If we want to write the history of scientific notation using a collection of over thirty million pages, for example, we can never be entirely certain as to the overall presence of the graphical practices (the tables, footnotes, diagrams, and figures) that underpin modern truth claims. We can only ever estimate and then account for the degree of our uncertainty. If you have not already heard of terms like precision and recall, which measure different kinds of error, they will become increasingly central to *humanities research. They allow us to quantify and thus communicate the degree of one kind of uncertainty about the past (there are of course multiple kinds of historical uncertainty).
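
For readers meeting these terms for the first time, the Python sketch below computes precision and recall for a hypothetical page-classification task; the labels are invented. Precision asks how many of the pages a model flagged as containing a diagram actually do; recall asks how many of the diagram-bearing pages the model managed to find.

```python
# Hypothetical ground truth and predictions for ten page images:
# 1 = "contains a diagram", 0 = "does not". All values are invented.
truth     = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

true_pos  = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
false_pos = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
false_neg = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)

precision = true_pos / (true_pos + false_pos)  # of the pages flagged, how many were right?
recall    = true_pos / (true_pos + false_neg)  # of the true diagram pages, how many were found?

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the model is fairly precise (0.75) but misses two of the five diagram pages (a recall of 0.60), two different kinds of uncertainty about the same archive.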

In order to combat this aspect of uncertainty, conversion also typically involves the practice of standardization. Standards are essential to control the losses of conversion. In a literary context, we talk about the definitive translation or the authorized or critical edition. Over the course of the nineteenth century, scholars like Friedrich von der Hagen or François-Juste-Marie Raynouard devised a variety of techniques to transform the various manuscript sources of national cultural heritage into definitive print editions. As von der Hagen scoured the royal library in Copenhagen for handwritten versions of the medieval Icelandic Edda to produce a new print edition, he was, in the process, establishing the rules through which one could translate information from one format to another. Today, textual scholars like Ryan Cordell are drawing attention to the ways in which digitization produces even more variability and error in the archival record. Far from universalizing the inherited print record, the vagaries of optical character recognition mean that we end up with a polyphony of potentially heretical versions of texts. Von der Hagen reemerges in a digital world, this time as an algorithm to collate digital variants.

In the field of music, the development of digital standards follows a similar logic. The 44.1 kilohertz sampling frequency developed for CDs was based on compatibility with video formats that were used for audio storage in the 1970s. As the sound historian Jonathan Sterne has written, standards are often translational tools to move between and make commensurable different material formats. The history of optical character recognition provides another case in point. Faced with the realization of the multiplicity of written characters across historical time and different world cultures, pioneers in document scanning advocated for the creation of a single standard font, the Esperanto of typefaces. A committee was formed in 1966 as part of the American National Standards Institute (ominously called Committee X3A), whereupon it produced a template for the most efficient machine-readable typeface. If you look at the numbers on your credit card you will see it in action. At roughly the same historical juncture that produced large amounts of social unrest in Europe and North America in the late 1960s, engineers were busily working on reducing the unruliness of written documents. Along with what is now informally called OCR-A, or officially ISO 1073–1:1976, there also emerged, unsurprisingly, a second (European) version at roughly the same time (OCR-B, or ISO 1073–2:1976). The human efficiency of standardization always runs into the problem of cultural difference. Babel continues to lurk in our machines.

But such problems also run through our bodies. The story of digitization is not simply one of miraculous machines. It is also the set of practices through which knowledge of the human sensorium is encoded into material form (known as “perceptual encoding”). The .mpeg, .jpeg, and other formats of compression are designed to remove information based on the limits of how we think we see and hear. Standards of digitization encode a theory of human embodiment.

If uncertainty, standardization, and embodiment are key terms by which to think about digitization as conversion, a final one would be the idea of the model. The digitized representations circulating in the world today are models, either of things out in the world or, like compressed files, of other digitized representations. And like a compressed file, they are meant to make life more manageable by being smaller or more manipulable (the latter-day version of “hand”-books). In this sense, digital files are technologies of scale. And like any good model (*early modern globes or nineteenth-century ships in bottles), they can oscillate between being tools and being toys, useful techniques for understanding a highly complex world or giant wastes of time.

With the growing digitization of so many facets of contemporary life, computational modeling has entered into a vast array of academic fields, including the humanities. We might think of modeling as a second wave of digitization, where the first wave involved the act of scanning, converting, and reproducing historical records (a very lossy process). Computational models allow humanists to simulate not only different versions of the past, but different explanatory mechanisms. They explicitly place a degree of fictionality within the historical method, what Hans Vaihinger called the philosophy of the “as if,” which he saw as the foundation of all knowledge. Just as the modeling of possible worlds has become essential to understanding the future, whether it is predicting the path of a storm or the end of the Anthropocene, so too do digitized historical documents allow for a more hypothetical relationship to the past. Computational modeling emphasizes the idea of “possible histories,” once again foregrounding the prominence of uncertainty to the process of digitization. Possible history not only emphasizes the sense of constructedness about the past (by no means a novel idea); it also emphasizes the importance of testability (far more novel). It shifts the accountability of verification from one based exclusively on personal authority to one distributed between a person and a model (a person-model, to use language inspired by the philosopher-anthropologist Bruno Latour). Models acknowledge their contingency but also open themselves up to verification by others. As forms of conversion, models interweave humans and machines.

When it comes to the history of digitization, the media of sound, text, and image have their own genealogies. They are far from fully fleshed out. And yet one of the effects of digitization is to put pressure on these disciplinary boundaries that have historically been so mediacentric. Bits (“binary digits,” the fundamental unit of digital forms) are emerging as an analytical lingua franca. The field of optical character recognition, for example, has expanded to include the larger heading of “document image analysis,” where the page image replaces the letter or character as the ultimate referent of a text. It incorporates the multisensorial ways we engage with documents beyond a notion of immersive reading. In the field of musicology, researchers are using vector-space models pioneered by linguists in the field of information retrieval to think about the large-scale relationships of musical forms. And acoustic measures are now being applied to the study of texts through digitized collections of poetry readings. While each of these domains requires subspecialization, they also work together to form more general theories of culture and human-machine cognition. One of the consequences of digitization—by no means a mandate—is that it affords the ability to seek more general understandings of human creativity.
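
As a rough indication of what a vector-space model involves, the Python sketch below represents two invented “documents” as vectors of feature counts and measures their resemblance by the cosine of the angle between those vectors; the same arithmetic works whether the features are words on a page or note sequences in a score, and everything in the example is made up for illustration.

```python
import math
from collections import Counter

# Two invented "documents" -- word sequences here, but the same arithmetic
# works for any countable features (note n-grams, chord labels, and so on).
doc_a = "light signal light current signal".split()
doc_b = "signal current current noise".split()

vec_a, vec_b = Counter(doc_a), Counter(doc_b)
vocabulary = set(vec_a) | set(vec_b)

# Cosine similarity: the angle between the two count vectors,
# ignoring their overall lengths.
dot = sum(vec_a[w] * vec_b[w] for w in vocabulary)
norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
norm_b = math.sqrt(sum(v * v for v in vec_b.values()))

print(f"cosine similarity = {dot / (norm_a * norm_b):.2f}")
```

What matters is less the particular score than the fact that texts, scores, and images all become comparable once they are expressed as counts of features, which is to say as bits.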

When the *Internet Archive was founded in 1996, it promised “universal access to all knowledge.” We now know that access to digitized information is not universal, nor does digital information encompass all forms of knowledge. Conversions involve losses as well as gains. The more general insights to be gained from digitization necessarily come at the expense of more particularized knowledge (and vice versa). Perhaps even more important is the fact that just because a machine has read or seen or heard something does not mean a human has understood it. Digitization does not by itself produce new knowledge. We also need ways of understanding the growing heaps and stacks of information stored on “drives” or *“clouds” around the world. This is yet another way that conversion underlies the history of information: the art of how we convert information into knowledge. As the engineers would say, this problem remains nontrivial.

Andrew Piper

See also commodification; computers; data; databases; documentary authority; error; files; information, disinformation, misinformation; photocopiers; quantification; storage and search

FURTHER READING

  • Mohamed Cheriet, Nawwaf Kharma, Cheng-Lin Liu, and Ching Y. Suen, eds., Character Recognition Systems, 2007; Julia Flanders and Fotis Jannidis, eds., The Shape of Data in Digital Humanities: Modeling Texts and Text-Based Resources, 2018; Andrew Piper, Enumerations: Data and Literary Study, 2018; Herbert F. Schantz, The History of OCR, 1982; Claude Shannon and Warren Weaver, The Mathematical Theory of Communication, 1998; Jonathan Sterne, MP3: The Meaning of a Format, 2012; Ted Underwood, Distant Horizons: Digital Evidence and Literary Change, 2019; Hans Vaihinger, The Philosophy of “As If”: A System of the Theoretical, Practical, and Religious Fictions of Mankind, 1911, translated by C. K. Ogden, 1924, repr. 1949.