2 Document and Evidence

The word information commonly refers to bits, bytes, books, and other signifying objects, and it is convenient to refer to this class of objects as documents, using a broad sense of that word. Documents are important because they are considered as evidence, and so there are cognitive and cultural as well as physical aspects to them. Writing, printing, telecommunications, and copying allow documents to be made more available across space and time, and there has been an enormous increase in documents of many kinds, most recently in the form of vast digital data sets (“big data”) for which we are not well prepared. Techniques are needed to organize this rising mass of material so that one can discover the most suitable resources for any given purpose. There are several quite diverse requirements for later use of documents and, as in any developing field, terminology has been inconsistent and often quite figurative.

Information as Thing

We have noted that during the twentieth century the word information became fashionable and was used in many ways. Some writers extended information to denote patterns unrelated to human knowing. Others have limited the meaning to true statements or to the reduction of uncertainty. Most of the meanings that have to do with human knowing fall into one of three categories:

  1. information as knowledge, meaning the knowledge imparted;
  2. information as process, the process of becoming informed; and
  3. information as thing, denoting bits, bytes, books, and other physical media. In effect, this, the commonest use of the word, includes any material thing or physical action perceived as signifying. In this third sense, information becomes a synonym for a broad view of document.

Document, as a verb, means to make evident, to provide an explanation. Document, as a noun, was historically something you learned from, including a lesson, a lecture, or an example. Gradually, document came increasingly to mean a written text, while retaining a sense of evidence. Nevertheless, the definition of document has remained unsettled, and three views of it can be identified:

  1. a conventional, material view. The everyday, conventional view of documents is of graphic records, usually text, written on a flat surface (paper, clay tablet, microfilm, word processor files, etc.) that are material, local, and, generally, transportable. These objects are made as documents. The limits of inclusion are unclear.
  2. an instrumental view. Almost anything can be made to serve as a document, to signify something, to be held up as constituting evidence of some sort. Natural history collections and archeological traces can be included in this view. Before the adoption of military uniforms, it was hard for a soldier in battle to know who was a friend and who was an enemy. In a sixth-century battle between Welsh and Saxons, fought in a field of leeks, Saint David instructed the Welsh to indicate their identity to each other by wearing a leek. The leek documented Welsh identity to those who understood the code and remains today a national symbol of Wales. In her manifesto What Is Documentation?, Suzanne Briet examined documents made of or from objects. She famously asserted that a newly discovered species of antelope, when positioned in a taxonomy and placed in a cage, was made to serve as a document. This view follows from her assertion that bibliography is properly considered to be concerned with access to evidence, not just to texts.
  3. a semiotic view. The two previous views emphasize the creation of documents and imply intentional creation. So they are incomplete from a semiotic view, in which anything could be considered a document if it is regarded as evidence of something, regardless of what the creator (if any) of that object intended (if anything).

The importance of a document depends on how we understand it, since the ability to make sense and respond is what enables living organisms to survive. Documents and documentation constitute evidence that may be useful to us in making sense of our situation and options. Documents are used as intermediaries between ourselves and others, and we judge documents in varied ways. We try to understand what we see. We decide how far we trust what we perceive, and how we feel about what we see influences us. How accessible it appears to be and how easy to use both strongly influence whether we bother with it. We “make do” (satisfice) rather than optimize.

We also need to distinguish between meaning and sense. A sentence can be meaningful, yet not make sense. “A mouse swallowed the elephant” is a grammatically meaningful statement, but it does not make sense in any realistic context, although it could in cartoons or other imaginary contexts. We commonly construct sense when there is uncertain or incomplete meaning, such as abstract art or Rorschach images.

Documents and Document Anatomy

Documentation—the management of documents—leads to the question: with what kinds of document is documentation concerned? Clearly, digital and printed texts have been the primary concern, but once one accepts the notion of documents as objects from which one may learn, handwritten manuscripts should also be included, and there is no basis for limiting the scope to texts. Since diagrams, drawings, maps, and photographs are used to describe or explain, images should not be excluded. If printed maps are included, then there is no rational basis for excluding terrestrial globes, which are three-dimensional maps. And, if diagrams are included, why not also three-dimensional models and educational toys? If three-dimensional objects are included, museum specimens and expressive sculpture cannot reasonably be excluded. If written language is included, then why not recorded, spoken language or music? And if recorded speech and music are included, why not recorded performances? And if recorded performances, why not live performances?

Anything regarded as a document can be seen as having the following four aspects.

  1. Significance. There is a phenomenological aspect to documents. So long as documents are objects perceived as signifying something, the status of being a document is not inherent (essential), but attributed to an object. Meaning is always constructed by a viewer.
  2. Cultural codes. All forms of communicative expression depend on some shared understanding, which can be thought of as language in a broad sense.
  3. Media type. Different types of expression have evolved: texts, images, numbers, diagrams, art, music, dance, and so on.
  4. Physical medium. Media include clay tablets, paper, film, magnetic tape, punch cards, and so on, sometimes in combination, as in a passport.

The status of being a document, therefore, is attributive (1), and every document has cultural (2), type (3), and physical (4) aspects. Genres are culturally and historically situated combinations. Being digital directly affects only the physical medium, but, like the invention of paper and of printing, the consequences are extensive.

It is reasonable to refer to any object that has documentary characteristics as a document, but, of course, that does not mean it should be considered only as a document. Leeks are not always and only symbols of Welshness. The same is true in reverse, even for an archetypal document: a printed book can make a convenient doorstop, a role that depends on its physicality, not on any documentary aspect.

The History of Information Technology

The use of gesture and speech is transitory and highly localized. To see or hear you must be present there and then, but technologies have steadily reduced these limitations.

Writing

By recording, writing provides an alternative to speech. Writing can put speech, which is local and ephemeral, into a new and enduring form. Since the statement or image remains, it can be seen at a future time, and, being portable, it can overcome effects of distance as well as time. And writing is not limited to the recording of speech and gesture. It can simply be an original inscription, commonly a record that something has happened (history) or that something should be done (agenda).

In all cases, the effect is to establish a trace, evidence that can be perceived by others or serve as a reminder for oneself. In this way, the written record can endure and overcome the passage of time for as long as the record remains legible.

A single record can, in principle, be read by anyone anywhere. Although it can be in only one place at any given point in time, the effect is to provide continuity. Writing, then, diminishes the effect of time and so provides a partial alternative to human memory, an artificial “external memory.” Much has been written on the consequences of the invention of writing in providing an enduring form of evidence, thereby facilitating communication, control, and commerce. It is very hard now to imagine life without writing.

Writing in the sand is washed away. Ink fades. Paper can burn or disintegrate. Electronic records are very fragile. But, within its limits, writing exceeds speech or gesture with the advantage of being able to counteract the effects of time and, by also being portable, of distance.




Printing

Printing provides multiplication of writing, and so extends the effect of writing with two consequences. First, while a piece of writing can conquer distance by being moved, it can still only ever be in one place at one time. Printing makes copies that can be in as many different places as copies have been made. The more widely copies are scattered, the more convenient for geographically dispersed individuals. This matters, because convenience of access is a powerful determinant of use.

Second, since any individual record is vulnerable to alteration, including falsification and destruction, there is safety in numbers. The more copies that are made and the more widely they are distributed, the more difficult it becomes to alter them and the more likely that one or more copies will survive.

Making communication permanent in a record that is an alternative to human memory and distributing many copies has far-reaching consequences. The use of print facilitated the Renaissance, the development of science, and the rise of the modern state. Much has been written on the impact of printing.

Telecommunication

Until well into the nineteenth century, telecommunication was a person on foot, horse, or ship bringing good news or bad. The rise of transmission technologies, notably railways, telegraph, telephone, radio, and now the Internet, has reduced the effects of distance and diminished the delays associated with travel. Telecommunications, like printing, facilitated managerial coordination and commercial propaganda and here, too, there is a large literature.

Copying

Transcribing texts is as old as writing. In the eighteenth century, handwritten documents were copied by “letter press.” A thin moist sheet of paper was pressed against the original so that some of the ink of the original would transfer into the moist sheet. Documents were occasionally photographed during the nineteenth century, but generating rapid, reliable, economical copies of documents is a twentieth-century development, with three important techniques: photostat, microfilm, and electrostatic copying (xerography, dry writing). (The numerous forms of duplicating that involve the creation of a new original are more properly regarded as small-run printing.) There has been much less historical and social commentary on the impact of copying technology.

Photostat, direct-projection photography onto sensitized paper without an intermediate negative, was pioneered by René Graffin of the Institut Catholique in Paris to facilitate his editing of early Christian writings in Syriac. The image produced was negative (white writing on a black ground). The left-to-right reversal was corrected by using a 45-degree mirror. His equipment received a prize in the International Exposition in Paris in 1900, and a few photostat cameras were built for European libraries, but there was little impact until photostat equipment became commercially available in 1910. The speed, accuracy, and efficiency of photostats for both text and images compared with manual transcription or typewritten copies were quickly recognized. The photostat process was widely adopted and became the copying process of choice at least until the late 1930s.

Microfilm carried by carrier pigeons was famously used by René Dagron to transport messages across enemy lines during the siege of Paris in 1870–71, but widespread use came only in the 1930s, when compact precision cameras, standard film speeds, and 35 mm safety film became available. Banks, newspapers, libraries, and other organizations adopted microfilm and its variants on a large scale.

Electrostatic copying, better known as xerography (“dry writing”), was developed to replace photostats, became widely used during the 1960s, and is today the technology of choice for copying and for printing digital documents.

Making legible copies from varied originals is not, in practice, separable from image enhancement. A faded document may need contrast enhancement to be made more legible. Using ultraviolet light can reveal erased text on a reused medieval manuscript, and infrared light may reveal text that a censor has inked over. Thus, copying involves more than merely making copies. We see by means of light radiated or reflected from some surface, so a legible image is one where the contrast in light is suitable for human vision. It is not surprising, then, that the photographic copying of documents quickly expanded to the use of techniques (different light sources, filters, fluorescence, and special emulsions) to render legible copies of humanly illegible originals.

The primary effect of these technologies is to reduce the effects of both time and space. Records became increasingly accessible in any place and at any time, making it easy to read what would otherwise have been forgotten. These technologies amplify the effects of each other and can be combined. A good example is the use of photography to make an image of something on a printing plate (photolithography) to print many copies of it.

These developments were greatly enhanced by successive advances in engineering, such as the use of steam engines, electricity, photography and, especially now, digital computing and communication. These technologies have enabled massive increases in the number of documents. In the nineteenth century, people worried about the “information flood.” In the twentieth century, it became the “information explosion”—and now, everything that came before is dwarfed by “big data.”




The Rise of Data Sets

Academic research projects typically generate data sets, but it is generally impractical for anyone else to make further use of these data, even though major funders of research now mandate that researchers have a data management plan to preserve generated data sets and make them accessible.

Science and engineering are constructive enterprises evolving through hypotheses and model building, trial and error, and testing and revision. For this reason, shared access to the record of prior work is critical. Historically, the record has been primarily textual, in the form of published technical reports, articles, conference papers, books, and other genres, although there were always some nontextual records, such as images, numerical tables, and collected specimens.

Print-on-paper materials are made accessible through a slowly evolved infrastructure of scholarly norms (including acknowledgment and citation), genres of technical writing, specialized publishers and distribution channels, libraries and bibliographies, catalogs and indexes. The infrastructure for publishing and bibliographical access was established by scholars, societies, librarians, and publishers. During the second half of the twentieth century, digital methods made additional search techniques feasible. (One thinks of Chemical Abstracts, Medline, and the Science Citation Index.) It is a creaky system, but it works.

No comparable infrastructure is yet in place for data sets. If one were to pick a random selection of papers reporting the results of projects completed five or ten years ago and ask to reuse the data sets they were based on, the effort would probably generate more embarrassment and frustration than success.

If these data sets are regarded as illustrative appendages to a definitive textual record, then this situation is regrettable. However, the situation is worse than that, because the practice of science and engineering has also been transformed by the pervasive adoption of digital computation. The potentially useful record of scholarship, even in the humanities, is increasingly not written reports, but (mainly nontextual) digital data sets of many kinds. The raw material, the operations upon it, and progressively more refined derivations can be beneficially shared and built upon by other researchers, not only in the same field but also in adjacent fields. This extends the impact and broadens the evidence in ways not practical using textual reports alone and has enabled a rise in computationally intensive, data-centric scholarship. The potential now exists, therefore, for a far greater return on investment in research. But there is a requirement. The infrastructure of well-developed work practices, publication norms, libraries, and bibliographical access that evolved to create and sustain an accessible archive of the literature of each field has to be complemented by a corresponding set of work practices and infrastructure for the archive of digital resources that constitute an ever-increasing proportion of the record.

Researchers tend to work within domains and in relatively narrow research fronts, with informal interpersonal interaction within each specialty. Within a specialized field, researchers commonly know each other, or, at least, of each other. They graduate from similar programs, work in teams, meet in conferences, read the same journals, and correspond by email. These informal social networks strongly complement or replace formal channels of communication and documentation. In interaction between research fronts, however, this informal social network is largely absent. Without membership in the same “invisible college,” researchers are unlikely to know what they could ask for or whom they could ask, and they are less likely to receive cooperation.

Rich results can be obtained when researchers explore at or over the boundaries of their fields and encounter ideas or data that are relevant but new and different (for them). This is why research funders and academic planners have long tried to encourage interdisciplinary interactions in the strongly discipline-based academic environment. A resource that can be made to provide benefit to more than one group yields a greater return on the investment.

There are examples of good practices in the very largest of science projects and in social science numeric data series, but widespread and largely undocumented deficiencies exist elsewhere. The significance of the problem can be understood by imagining that a large proportion of the textual record was written but never published, and so remains largely unknown or inaccessible and likely to be lost. What a waste!

Some Practical Initiatives

Our memories cannot retain everything we might wish to remember. Organizations as well as individuals need records. Thus, increasing quantities of documents are retained as a kind of artificial “external memory,” with two consequences. First, the explosion of documents and their complexity increase problems of knowing what to trust. Second, to cope with this explosion, a fifth vector of technical development, alongside writing, printing, telecommunication, and copying, became necessary for finding and selecting the most suitable documents as and when needed. This fifth vector has had various names, including bibliography, documentation, information retrieval, and information science.

Collections of documents (libraries) were traditionally supervised by knowledgeable scholars, whose familiarity with the collection enabled them to recommend the most suitable documents for any purpose. This approach is unreliable, however, because scholar-librarians become forgetful, move away, or die. Martin Schrettinger, a Bavarian monk turned librarian, was conscious of this problem and asserted the need for systems, instead of persons, for finding and retrieving documents. He coined the term “Library science” (Bibliothek-Wissenschaft) for his textbook of 1808. A political refugee from Italy, Sir Anthony Panizzi developed sound cataloging practices for the library of the British Museum. In the United States, Melvil Dewey promoted efficiency and standardized procedures. The techniques of modern librarianship were well developed by the end of the nineteenth century.

Libraries, however, ordinarily deal with only a limited range of published documents and with limited attention to the detail of their contents. In 1895, two Belgians, Paul Otlet and Henri La Fontaine, decided to provide a more complete solution. They started a complete and detailed index, the Universal Bibliographic Repertory, to everything in every medium everywhere for everyone anywhere: texts, images, maps, government records, statistical data, manuscripts, films, museum objects—everything.

Otlet rightly considered that most authors were too wordy, that published texts were duplicative and inefficient, and that the bound book (“codex”) was an unsatisfactory design because pages and lines did not coincide with intellectual units. Also, the printed book is static, fixed, and so cannot be corrected or updated. Otlet wanted to extract facts from printed books and to transfer them into a better, more flexible kind of “book” using new media. One vision, shared with the German chemist Wilhelm Ostwald and the English writer H. G. Wells, was an encyclopedia of concise factual statements, each updated as and when needed, and all linked in a web of subject indexing. At that time, filing cards were the most flexible and most promising storage technology. Otlet and La Fontaine used an elaborate artificial language, the Universal Decimal Classification, to describe each item in detail and to show how each was related by topic, date, and origin to each other item. The result was a kind of hypertextual network. Card technology becomes increasingly labor-intensive as more cards are added, however, and after decades of work and many millions of cards, their system could no longer be sustained.

Wilhelm Ostwald was inspired by Otlet to use his Nobel prize money to establish an institute in Munich named The Bridge (Die Brücke) to develop technology for intellectual work. Ostwald wanted to extract facts and concepts from books and periodicals in order to create efficient recall of recorded knowledge. New concepts and facts would be added and each element updated as needed. By 1912 Ostwald and his colleagues were writing lyrically about this “World Brain” (Das Gehirn der Welt). Novelist H. G. Wells promoted the same idea. Imagine a rigorously edited Wikipedia with concise factual records on cards. “World Memory” would have been more accurate than World Brain, but Ostwald wanted more. He hoped that, just as individual letters in Gutenberg’s moveable type could be rearranged to form new words, so also, rather in the spirit of data mining, rearranging concepts might yield new knowledge.

Ostwald and Otlet represented a modernist view based on systems, logic, standards, machinery, efficiency, and progress, but the technology available to them was inadequate. It was also a utopian view based on a simplistic view of knowledge. Even scientific facts cannot be properly understood out of context, which has large consequences for anyone imagining that a technological system of total recall could be sufficient.

Problems of Later Use

The use of data sets generated by others in the past can be impeded in many different ways: the hard drive crashed and there was no backup, the person who could give permission cannot be found, and so on. As a result, there are several barriers to overcome. Here is one typology.

  1. Discovery. Does a suitable data set exist?
  2. Location. Where is a copy?
  3. Deterioration. Is the copy too deteriorated or obsolete to be usable?
  4. Permission. May it be used?
  5. Interoperability. Is it standardized enough to be usable with acceptable effort?
  6. Description. Is it clear enough what the data represent?
  7. Trust. Are the lineage, version, and error rate understood and acceptable?
  8. Use. Should I use it for my purpose?

These questions form a chain: if you learn that a data set existed, you may not be able to locate a copy; if you can locate a copy, it may not be usable; if it is usable, you may not be able to obtain permission; and so on. Any problem might prevent reuse.
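The chain of questions can be sketched in a few lines of Python. This is only an illustration: the DataSetCandidate record, the first_blocking_barrier function, and the sample data set are hypothetical names invented here, and each question is treated as a plain yes or no, which is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import Optional

# The eight questions of the typology, in chain order.
BARRIERS = [
    "discovery", "location", "deterioration", "permission",
    "interoperability", "description", "trust", "use",
]


@dataclass
class DataSetCandidate:
    """Hypothetical record of which barriers have been cleared for one data set."""
    name: str
    cleared: set = field(default_factory=set)


def first_blocking_barrier(candidate: DataSetCandidate) -> Optional[str]:
    """Return the first barrier in the chain that is not cleared, or None if all are."""
    for barrier in BARRIERS:
        if barrier not in candidate.cleared:
            return barrier
    return None


# Illustrative example: the data set is known, located, and readable,
# but nobody can be found to grant permission.
survey = DataSetCandidate(
    "old field survey",
    cleared={"discovery", "location", "deterioration"},
)
print(first_blocking_barrier(survey))  # permission
```

Because the function stops at the first uncleared barrier, clearing later barriers (say, trust) does not help while an earlier one (permission) still blocks reuse.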

In practice, the answers are unlikely to be a simple Yes or No. A positive answer is not, in itself, enough. The effort required to achieve a positive outcome may be too great. The decision is always situational. The willingness to invest effort depends on the perceived benefits of success and the known alternatives as well as the cost and the resources available. One may satisfice: a less perfect result requiring less effort will often be a reasonable course of action.
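The difference between satisficing and optimizing can be made concrete with a small sketch. The Option record, the benefit and effort figures, and the "good enough" threshold are all assumptions chosen for illustration; real decisions are situational and rarely reduce to two numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Option:
    """Hypothetical course of action with a perceived benefit and required effort."""
    name: str
    benefit: float  # perceived benefit of success, in arbitrary units
    effort: float   # effort required, in the same units


def optimize(options: List[Option]) -> Option:
    """Examine every option and return the one with the largest net benefit."""
    return max(options, key=lambda o: o.benefit - o.effort)


def satisfice(options: List[Option], good_enough: float) -> Optional[Option]:
    """Return the first option whose net benefit meets the threshold, else None."""
    for option in options:
        if option.benefit - option.effort >= good_enough:
            return option
    return None


# Options listed in the order they would be encountered.
options = [
    Option("reuse a published summary table", benefit=5.0, effort=1.0),
    Option("recover and clean the original data set", benefit=9.0, effort=4.0),
]

print(optimize(options).name)        # recover and clean the original data set
print(satisfice(options, 3.0).name)  # reuse a published summary table
```

The satisficer stops at the first acceptable option and never evaluates the rest, which is why a less perfect result that costs less effort can be the reasonable choice.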

These barriers are different in kind and require different kinds of solutions: policies, work practices, infrastructure, remedial processing, and so on. For example, one repository has accepted data sets with the condition that the permission of the depositing researcher was required for third-party use, but with no contingency policy for when that researcher died or was unavailable. Some remedies are more feasible or more affordable than others.

A particular problem is that descriptive metadata sufficient for the original compiler of the data is unlikely to be sufficient for someone else who comes to use it, years later, who may not know what the original compiler took for granted and so did not provide an explanation.

The final question (should I use it for my purpose?) is different from the other seven because in this case responsibility rests with the potential user, the individual scholar himself or herself. Yet the decision is influenced by the answers to the other seven questions, for each of which there are identifiable specialists and institutions capable of providing support. These are currently more clearly identifiable for the textual record than for data sets. Traditionally, bibliographies identify what resources exist, catalogs list where copies can be found, and now search engines support both tasks. Publishers provide copies in the short term; libraries provide copies in the long term; and so on. Arrangements for sustained access to data sets are, as yet, far less well developed.

Bibliography Reconsidered

In ancient Greece, a bibliographer was a “book-writer,” a copyist who transcribed an existing text to make a new copy. When the word “bibliographer” came into use in Europe, it was used more or less interchangeably with “librarian” until library science developed as a distinct technical field in the nineteenth century. A century ago, more rigorous bibliographical techniques were developed. Although an interest in the intellectual and cultural “contents” of books was asserted, the emphasis was on technical analysis and description of the physical printed book itself, and the “new bibliography” came to be known as analytical or historical bibliography. Nevertheless, by the mid-twentieth century, “bibliographical access” or, simply, “bibliography” (used in a broad sense) were terms of choice in the print-on-paper world for the issues associated with the questions listed above. This is reflected in the subtitle of Patrick Wilson’s classic 1968 analysis of the problems of organizing and selecting documents, Two Kinds of Power: An Essay on Bibliographical Control. But terminology changes, and this broad sense of bibliography was largely displaced by “organization of information” and similar phrases. By default, the word “bibliography” increasingly had a narrower meaning: the detailed examination of printed books as physical objects. An eloquent protest against this narrow view can be found in Donald McKenzie’s Bibliography and the Sociology of Texts. McKenzie, a specialist in historical bibliography and textual criticism, argues persuasively for a broader approach in two ways. First, bibliography should extend beyond the text in the book to include its interpretation and social context. This has happened. Second, “text” should be interpreted widely to extend beyond writing in printed books to include other media (notably films, maps, and digital data sets) in the sense of “document” discussed above. On this second goal, much more needs to be done.

Whether they’re called bibliography or not, there are numerous areas needing attention in addition to the central issue of preservation of digital data.

These issues apply to all kinds of resources.

World Brain and Other Imagery

Ostwald, Wells, and others liked to refer to their grand encyclopedic design as a world brain, but this is a metaphor. It did not really resemble a brain or do what a living brain does. Referring to an encyclopedia as an “external memory” is closer, but no human remembering is involved. Records, if found and read, might serve as a partial alternative to human memory. Disk drives and other storage devices are referred to as “memory,” but remembering is a creative act. We typically recall something of the context of what is remembered, and we tend to remember a little differently each time, either in the details or in our understanding of them. Humans can remember; technology can be used to record. Humans express meaning; documents mention.

When one looks, one quickly sees that discourse about information is very rich in figurative language that both helps and hinders: “external memory,” “world brain,” and many other examples attribute active, human-like behavior to inanimate objects or imply that information is somehow a vital, active force. Text has “content,” documents “inform” us, computers “think,” and “memes” are ideas that fly around infecting minds. Metaphors as figurative speech can help understanding and are commonly a step toward more adequate terminology, but forgetting that they are figurative leads easily to confusion and nonsense.

Summary

The word information commonly refers to physical stuff such as bits, books, and other physical media, or any physical thing perceived as signifying something: that is, documents, in a broad sense. Ordinarily, documents are graphic records, usually text, created or used to express some meaning. However, almost anything can be made to serve as a document, such as a leek to express Welsh identity. On a semiotic view in which meaning is constructed in the mind of the viewer, any object might be perceived as signifying something and, in that sense, could be considered a document. So if we hold to the idea of documents being evidence, a wide variety of objects and actions could be regarded as being “documents” in this extended sense. Anything regarded as a document must be perceived as signifying something, must depend on shared understandings (“cultural codes”), and must have a physical form. Since prehistoric times, four kinds of technology have become increasingly important: writing, printing, telecommunication, and copying. The rising tide of documents has brought initiatives to organize them, the challenge of knowing which to trust, and imaginative metaphorical language to describe both problems and opportunities. Data sets are a type of document, but the infrastructure making digital data sets accessible for use over time is much less developed than for printed material. The requirements are in principle the same. Scholarly practices and the field known as bibliography need updating accordingly.

In the next chapter, we take a closer look at the use we make of documents, both as individuals and socially. Physical, mental, and social aspects are all always present in the use of information.