`DATABASES`

A database is a structured collection of aggregated, commensurable data capable of being sorted and accessed for some purpose of knowledge production, normally through the application of algorithms. A common feature of all databases—and perhaps the central one—is that they permit the recontextualization of information derived from multiple, often heterogeneous sources. It is common in the digital age to define databases as only those collections of data stored on electronic computers (and accessed via computerized algorithms), but this overly narrow and restrictive definition ignores the history, usage, and even etymology of the term. In the first instance, many of the key structural features of today’s electronic databases (and the algorithms used to operate with them) were present in data collections dating back hundreds of years and encompassing a variety of (nonelectronic) media. Second, while the term database can be used as a noun, it also invokes a variety of practices of information management and analysis that again long predate electronic computers: databases are defined as much by how they are used as by what format they take. Some scholars have even suggested that the term can be employed as a verb: “to database.” There is no question, however, that in the modern computer era databases have emerged as perhaps the preeminent technology both for managing increasing economies of scale of information and for structuring and managing our social, political, and economic lives. Nonetheless, attention to the broader historical trajectory of the database is vital to understanding how this particular form of information management has acquired such enormous importance in the current *“big data” ecology.

It is generally accepted that the first published use of the term in the specific context of computing occurred in a 1962 technical memorandum produced by System Development Corporation (a spin-off of the RAND Corporation devoted to developing command and control systems for the US military), which stated that a “ ‘data base’ is a collection of entries containing item information that can vary in its storage media and in the characteristics of its entries and items.” While this reference is an important point of departure in the genealogy of today’s modern electronic databases, the term was in fact adopted from a wider contemporary usage not strictly limited to computer applications. Since at least the early 1950s, the term data base was often used to refer to large collections of economic or social data in any form that served as the basis for analysis. For example, in what may be the earliest example of the term, a 1953 article in the journal Sociometry on the measurement of interpersonal communications described large collections of survey reports and interviews as “a data base” for conducting sociological analysis. Indeed, during the period between the early 1950s and the mid-1960s (by which point it acquired its more narrow, specific association with electronic computers), the term data base appeared in more than a dozen publications across a variety of social science disciplines without reference to computers. Thus, while the Oxford English Dictionary reflects common current usage by defining a database as “a structured set of data held in computer storage and typically accessed or manipulated by means of specialized software,” it appears that the term was adapted from a wider convention in which it referred simply to any large, structured collection of information.

Why is this significant? Beyond simply making an etymological point, the broader application of database (or “data base”) supports the legitimacy of applying the term to practices and technologies that predate electronic computers, perhaps by a century or more. While technically anachronistic, referring to earlier data collections as databases helps recover important continuities in information management across material cultures, and it highlights developments in the social and political functions of data collections that directly influenced the culture of data we live in today. Databases are not just tools for storing and processing large volumes of data: they are also systems of rational organization that create categories of people and phenomena, define relationships, and impose structure on our perceptions of reality. The technical and social functions of today’s massive electronic databases are the product of a genealogy that began long before computers, and which encompasses a variety of forms and practices that are not specific to any particular technology.

In suggesting the value of this deliberate anachronism, it should be noted that there are probably limits to its usefulness. In the broadest possible sense, it could be argued that a great many collections of data, perhaps dating back centuries or millennia, could be considered “databases,” including astronomical observations, accountings of economic goods and taxes, botanical and zoological taxonomies, medical records and reports, and a wide variety of other kinds of compendia of information recorded on paper, *parchment, *papyrus, and even clay tablets. While it is true that many of these collections are antecedents and analogs to modern databases, they do not always share some of the most vital characteristics of modern databases. One such characteristic relates to the structure of databases: they are organized in such a fashion that accommodates the dis- and reaggregation of information—in other words, they are amenable to some degree of recombination, sorting, and random access. If this is technically possible (though laborious) with a static list or catalog recorded on paper or other media (particularly if one is willing to recompose it to accommodate new information or order), it is far more practical if some type of movable medium is employed—whether paper slips, index cards, or another transposable system.

Techniques involving the collection of information on slips of papyrus or parchment may have been used in Roman antiquity, although few of the large reference works composed then have survived. Some have argued that such practices were also employed in the composition of indexes to scriptural and philosophical works during the European Middle Ages. By the sixteenth century, these methods were evidently common, and reference works like Conrad Gessner’s Bibliotheca universalis (Universal library, 1545–48) were no doubt composed, in keeping with Gessner’s own advice, by collecting and organizing slips of paper containing bibliographical entries that could be rearranged in preparation for the production of the printed text. In the eighteenth century, Carl Linnaeus composed his massive Systema naturae (System of nature, 1735) and other taxonomic works thanks, in part, to an elaborate system of slips and cards, and there is evidence that he used this technique not just to lay out and rearrange the text for publication, but as a private data repository (he even had a cabinet constructed for the purpose) to which he could continually add and refer.

Practices like Linnaeus’s card index appear to have been inspired by new systems for the organization of libraries developed during the second half of the eighteenth century, particularly in France, Germany, and Austria. The use of standardized cards (at first these were playing cards) for the storage of information was becoming widespread by the end of the century, and the first example of a general card catalog—at the court library in Vienna—was assembled during the 1780s. Indeed, libraries (and bibliographic practices more generally) have always been on the forefront of data management, from the catalogs (or pinakes) of the famed Library of Alexandria to the first publicly accessible electronic catalog systems (Online Public Access Catalogs, or OPACs) of the 1970s. In addition to introducing techniques of movability and random access to data, library catalogs also contributed to another key feature of databases: the organization of data by tagging or encoding them with *metadata (or “data about data”). Metadata—like references to the shelf on which a book is located or additional works by an author—allow data to be recombined and recontextualized in new ways and provide a means for random access. Above all, they help structure the data in a collection in terms of their relationships to one another, a crucially important feature of today’s massive electronic databases that was, nonetheless, available in more limited forms in card and paper collections as well.
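The way metadata allow a collection to be recombined can be illustrated with a brief sketch in Python. It is a hypothetical illustration rather than a reconstruction of any historical catalog: the `Card` structure, the `build_index` helper, and the shelf marks are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """One catalog card: a title plus metadata describing it."""
    title: str
    author: str
    year: int
    shelf: str  # shelf marks are invented for this example


# Cards are kept in no particular order, like loose slips in a drawer.
cards = [
    Card("Systema naturae", "Linnaeus", 1735, "B2"),
    Card("Bibliotheca universalis", "Gessner", 1545, "A1"),
    Card("Historiae animalium", "Gessner", 1551, "B1"),
]


def build_index(cards, field):
    """Group card titles under each value of one metadata field."""
    index = defaultdict(list)
    for card in cards:
        index[getattr(card, field)].append(card.title)
    return dict(index)


# The same cards, regrouped by author and re-sorted by year,
# without rewriting any of the underlying entries.
by_author = build_index(cards, "author")
by_year = [card.title for card in sorted(cards, key=lambda c: c.year)]
```

Because each card carries its own metadata, a new ordering or grouping requires only a different choice of field. The cards themselves remain unchanged, which is the property that a fixed list written on a page lacks.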

While schemes for the mechanical sorting of collections had been proposed in the eighteenth century (by G. W. Leibniz and others), the first truly effective technology for automated sorting was developed by Herman Hollerith, who invented an electromechanical punched-card sorting machine that was used to compile the 1890 US census (and which was immediately adopted by census takers in Europe and elsewhere). The Hollerith machine is rightly considered an immediate precursor to the first electronic computers—indeed, Hollerith’s company became part of a conglomerate later renamed International Business Machines, or IBM—and the card collections processed by mechanical tabulating machines certainly qualify as databases in the modern, more narrow sense. But a variety of other practices emerged in the nineteenth century that were equally important in the development of the broader technical and cultural characteristics of modern databases.

One especially important feature of late eighteenth- and nineteenth-century data collections is their role in creating and defining new phenomena and categories of natural and social relationships. Linnaeus’s data collection, for example, helped to produce the view that the organic world is structured as a nested hierarchy of taxonomic categories (families, genera, species, etc.) that to some degree have a real, if abstract, existence. Data collected about the fossil record and organized in massive compendia during the first half of the nineteenth century revealed genealogical relationships between organisms that led to theories of evolution, which helped scientists identify processes like diversification and extinction that contributed to the understanding of the natural world as having a directional history. In a more sobering case, analysis of organized collections of data on human physical and cultural characteristics—such as Samuel Morton’s famous collection of human skulls—contributed to the belief in a hierarchy of human races and underwrote social and economic practices of division and discrimination. Such “databases,” then, helped construct and reify categories that were implicit in other kinds of data collections, such as census tabulations (which to this day use “race” as a category for sorting data) and early life and medical insurance databases (which often informed restriction or denial of coverage to people who fell into undesirable racial categories).

Databases are not just collections of information; as tools for the accounting of resources and people, they are also sources of power. As statistical practices that allowed new kinds of analysis and prediction to be applied to large data sets became widespread during the second half of the nineteenth century, databases became a central tool for governments to regulate the lives of citizens, and for private interests (insurance companies, banks, etc.) to extend or deny opportunities to individuals. A central feature of the emergence of databases in this era is that the logic of aggregation (that is, the belief that patterns identified in large numbers can support meaningful generalizations that cannot be observed in individual cases) became entrenched in modern statistical rationality. Databases did not produce this rationality on their own, but as the main tools for decision making in a wide variety of social spheres, they came to be encoded with it. This feature was inherited during the transition to the era of computers and is not specific to any particular form of technology.

The first computerized electronic databases were developed during the 1950s and 1960s, as engineers developed data-processing applications for military and administrative uses. Strictly speaking, what are commonly referred to as “databases” are in fact an interaction between “database management systems” (DBMS), the software that allows access to and control of data, and the data tables stored on punch cards or magnetic media. The first electronic databases were hierarchical, meaning that data were stored in nested categories that could be accessed only by following a specific path, and the only way the database could be re-sorted was essentially to re-create the order in which the data were arranged, an often complicated and laborious process. A major innovation was the introduction of “relational” databases in the 1970s, in which data are arranged in a matrix of multiple tables organized according to particular “keys,” allowing access to individual data without needing to proceed linearly through the entire database. Most complex databases today are variations on “object-oriented” or “object-relational” databases, in which data and associated metadata are treated as independent “objects” rather than being keyed to predetermined fields. This allows for much greater flexibility and speed in querying enormous sets of data, and object- or document-oriented DBMS software is the primary application for everything from simple address books on personal computers or smartphones to massive distributed databases used in big data computing.
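The contrast between hierarchical and relational organization can be made concrete with a short sketch in Python. It is only an illustration under stated assumptions: the nested dictionary stands in for a hierarchical store, the relational side uses Python's built-in `sqlite3` module, and the table names, column names, and sample entries are invented for the example. It shows the difference in how a question is posed to each structure, not any difference in performance.

```python
import sqlite3

# Hierarchical layout: every record sits at the end of a fixed path
# (subject, then author). The path order is chosen once, in advance.
hierarchy = {
    "natural history": {
        "Linnaeus": ["Systema naturae"],
        "Gessner": ["Historiae animalium"],
    },
    "bibliography": {
        "Gessner": ["Bibliotheca universalis"],
    },
}


def titles_by_path(tree, subject, author):
    """Reach records by following the one path the hierarchy provides."""
    return tree[subject][author]


def titles_by_author_in_hierarchy(tree, author):
    """A question that ignores the path must visit every branch."""
    found = []
    for authors in tree.values():
        found.extend(authors.get(author, []))
    return found


# Relational layout: two tables linked by a key (author id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE books (title TEXT, subject TEXT, "
    "author_id INTEGER REFERENCES authors(id))"
)
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [(1, "Gessner"), (2, "Linnaeus")],
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Systema naturae", "natural history", 2),
        ("Historiae animalium", "natural history", 1),
        ("Bibliotheca universalis", "bibliography", 1),
    ],
)


def titles_by_author(name):
    """Ask by author directly; the key links the two tables."""
    rows = conn.execute(
        "SELECT books.title FROM books "
        "JOIN authors ON books.author_id = authors.id "
        "WHERE authors.name = ? ORDER BY books.title",
        (name,),
    )
    return [title for (title,) in rows]
```

In the hierarchical version the order of the path (subject before author) is built into the structure, so asking by author alone means walking the whole tree. In the relational version the same data can be queried by author, subject, or title without rearranging anything, because relationships are expressed through keys rather than through position.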

Despite the increasing complexity and sophistication of DBMS software over the last several decades, however, there remains a fairly direct and tangible genealogy connecting paper and electronic databases. Though a distinction is sometimes made between *“analog” and “digital” collections of data, in point of fact nearly all databases, regardless of medium, are “digital,” in the sense that they are composed of discrete symbols or numbers. A collection of numerical (or even textual) data stored on paper can be transferred to electronic storage media to be accessed by DBMS, and indeed many of the earliest electronic databases in a variety of fields were simply transcriptions of paper databases to electronic format. When census bureaus began using electronic data, records held on mechanical punched card storage were often directly transferred to magnetic tape. A number of important early scientific databases, including the records of the US National Weather Records Center, the National Library of Medicine bibliographic index, the Atlas of Protein Sequences and Structures, the fossil record, and other collections, began their lives in paper format or other transportable media (e.g., microfilm), and users of these data collections still often refer to pre- and postelectronic versions as “databases.” Indeed, no crucial distinction need be drawn between the media on which data are stored; rather, the advent of computers is notable for the software applications that allow access to data, reporting, and analysis, most of which is now performed by complex automated software algorithms. The database is a dominant feature of modern life in countless, inescapable ways, but its emergence as a “central cultural form” is part of a much longer history of collecting, storing, analyzing, and valuing data stretching back centuries.

David Sepkoski

`FURTHER READING`

Ann Blair, Too Much to Know, 2010; Geoffrey Bowker, Memory Practices in the Sciences, 2008; Lisa Gitelman, ed., “Raw Data” Is an Oxymoron, 2013; Thomas Haigh, “ ‘A Veritable Bucket of Facts’: Origins of the Data Base Management System,” SIGMOD Record 35 (2006): 33–49; Albert H. Rubenstein, “Problems in the Measurement of Interpersonal Communication in an Ongoing Situation,” Sociometry 16 (1953): 78–100; David Sepkoski, “The Database before the Computer?,” Osiris 32 (2017): 175–201; Bruno J. Strasser and Paul Edwards, “Big Data Is the Answer … but What Is the Question?,” Osiris 32 (2017): 328–45; System Development Corp., “Technical Memo,” TM-WD-16/007/00, 1962; Alex Wright, Cataloging the World, 2014.