Finding operations depend heavily on the names assigned to document descriptions and on the named categories to which documents are assigned. Naming is a language activity, and so inherently a cultural activity. Here we provide a brief introduction to the issues, tensions, and compromises involved in describing collected documents. The notation can be codes or ordinary words. Search will be more reliable if related words are linked. Combinations of terms will be needed for complex topics, and the amount of detail needed varies with the situation. Naming draws on already established terminology for future searching, but problems arise because language continually changes and because new concepts need new names, which are often, at first, unreliable. Systems can only work on physical marks—in effect, on mentions, not meanings. Because language is cultural, changes in culture can affect the acceptability of names as well as their meaning.
Once collected, documents have to be made accessible in an organized way. Librarians, for example, make descriptions of documents in their catalogs and also through classified subject arrangements on their shelves. Assigning topic names to documents and assigning documents to named topical categories are central.
Names (marks) are essential for collected documents to be findable. Names are, necessarily, linguistic expressions and—as we shall see—they create tensions and difficulties. Libraries are cultural institutions concerned with recorded knowledge, and their mission is to support learning, both research (knowing more) and teaching (sharing understanding). Libraries exist to advance learning, knowledge, understanding, and belief. But what people know, what they would like to know, and what others have learned and written about, all resist mechanical treatment. If it were otherwise, knowledge management could be reduced to data processing.
Searchers seeking documents relevant to their interests have to locate what they need on the system’s terms. There is, or should be, a collaboration in which information service providers seek to anticipate their users’ interests and vocabulary, and users try to make sense of the category names in the catalog, classification, or bibliography being used. Even if a limited vocabulary or an artificial notation, such as the Dewey Decimal Classification system, is used, all description is a language activity. Description is always and necessarily based in culture, because descriptions are based on the concepts, definitions, and understandings that have developed in a community.
This naming (bibliographic description) follows rules. For more than a century, there has been gradual international standardization of rules for representing the imprint (where and by whom published), collation (physical features of a document), proper names (authors, institutions, and places), and other attributes of documents. The real difficulty, however, for both librarians and library users is in describing what a document is about, in naming its topic, which is usually presented as a two-stage process: first, the cataloger examines a document to determine what concepts it is about; then he or she assigns terms (linguistic expressions) from a vocabulary to denote those concepts. The literature has little to say about the first stage and concentrates on the second. Research has revealed that different indexers will commonly assign different index terms to the same document, as will a single indexer at different times.
There are a variety of methods for representing what documents are about: subject classifications, lists of subject headings, thesauri, ontologies, and so on. A traditional collective term for all of them is documentary languages. We need not examine each type, but will note four dimensions along which they vary.
Verbal approaches using natural language words are a simple and popular way to create descriptions. However, using ordinary vocabulary has disadvantages, and ease of creation does not lead to effective use. The multiplicity and fluidity of natural language vocabulary make for unpredictable results: should I look under violin or fiddle, or both? The variations of natural language can be mitigated by adopting a restricted (“controlled”) vocabulary, as explained below.
Natural language words do not arrange themselves in a helpful way. Alphabetical filing order is determined by accidents of spelling rather than related meanings. “If the names of the classes, in a natural language, are used to arrange them, we do not get a helpful order. In fact names scatter classes in a most unhelpful chaotic order. It will give us an order like algebra, anger, apple, arrogance, asphalt, and astronomy,” wrote the Indian librarian, S. R. Ranganathan (1951, 34). Also, natural language indexes are ordinarily created only in a single language.
Both of these problems can be addressed by using an artificial notation for the descriptive names (such as the Dewey Decimal Classification) designed to achieve some desired arrangement, with natural language indexes leading to the class numbers in as many different natural languages as desired. Having an artificial notation of letters, numerals, and other symbols does not mean that it is no longer a language. It is an artificial language and is not immune to the problems of obsolescence and perspective discussed below. It is the same approach as the artificially constructed, restricted languages used, for example, in botanical and chemical nomenclature.
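The contrast between alphabetical order and an artificial notation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class codes (`A1`, `B1`, and so on), the grouping they imply, and the small French index are invented for the example and are not taken from the Dewey Decimal Classification or any real scheme. The class names are the ones in Ranganathan's example.

```python
# Illustrative only: the class codes below are invented placeholders,
# not numbers from the Dewey Decimal Classification or any real scheme.
CLASS_NAMES = {
    "A1": "algebra",
    "A2": "astronomy",
    "B1": "apple",
    "C1": "asphalt",
    "D1": "anger",
    "D2": "arrogance",
}

# One natural language index per language, each leading to the same codes.
INDEXES = {
    "en": {name: code for code, name in CLASS_NAMES.items()},
    "fr": {
        "algèbre": "A1",
        "astronomie": "A2",
        "pomme": "B1",
        "asphalte": "C1",
        "colère": "D1",
        "arrogance": "D2",
    },
}


def alphabetical_order():
    """Arrange classes by the spelling of their English names."""
    return sorted(CLASS_NAMES.values())


def notation_order():
    """Arrange classes by their artificial notation."""
    return [CLASS_NAMES[code] for code in sorted(CLASS_NAMES)]


def look_up(word, language):
    """Return the class code for a word in one language's index, or None."""
    return INDEXES.get(language, {}).get(word.lower())


print(alphabetical_order())
print(notation_order())
print(look_up("pomme", "fr"), look_up("apple", "en"))
```

Sorting by name reproduces the scattered order Ranganathan complains of, while sorting by code keeps the assumed related classes adjacent, and each language's index leads to the same code.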
Everyday language is characterized by multiplicity, such as singular and plural forms, variant spellings, or synonyms and antonyms (opposites). The same topic could be assigned any number of names, or represented in an indefinite number of ways (“unlimited semiosis”), so documents on the same topic could be scattered under any of several different headings. A searcher might find some but not others. The standard solution is “vocabulary control,” whereby one form of name, for example, violins, is “preferred,” and used exclusively. Other commonly used but “nonpreferred” terms are listed, but only to redirect the searcher to the preferred term: for example, fiddles, see violins. An “authority file,” a list of the carefully differentiated preferred and nonpreferred terms, is compiled and followed.
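As a minimal sketch of how an authority file redirects a searcher, the following Python fragment maps each term, preferred or not, to its preferred form. The dictionary contents and the function names `preferred_term` and `see_reference` are hypothetical and chosen only to mirror the violins and fiddles example; a real authority file would hold far more information about each heading.

```python
# Hypothetical authority file: every entry term points to its preferred term.
AUTHORITY_FILE = {
    "violins": "violins",   # preferred term points to itself
    "violin": "violins",    # singular form
    "fiddles": "violins",   # nonpreferred synonym
    "fiddle": "violins",
}


def preferred_term(term):
    """Return the preferred heading for a term, or None if it is not listed."""
    return AUTHORITY_FILE.get(term.strip().lower())


def see_reference(term):
    """Describe how the authority file treats a term."""
    normalized = term.strip().lower()
    preferred = preferred_term(term)
    if preferred is None:
        return f"{normalized}: not in authority file"
    if preferred == normalized:
        return f"{normalized} (preferred term)"
    return f"{normalized}, see {preferred}"


print(see_reference("Fiddles"))
print(see_reference("violins"))
print(see_reference("cello"))
```

All documents are then indexed only under the preferred term, so a search that starts from any listed variant ends at the same heading.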
Vocabulary control can take care of synonyms, near-synonyms, antonyms, and variant spellings. Exact synonyms are quite rare. It is near-synonyms that are frequent. Birds and ornithology, for example, are closely related but not quite the same. Even so, someone interested in birds might look under ornithology and vice versa. Near-synonyms require endless situational judgments concerning what to combine and what to keep separate.
In practice, vocabulary control also extends to hierarchical and other relationships (“see also”). Vocabulary control extends beyond semantic to functional relationships, which differentiates this kind of word list from traditional dictionaries. Biogas, pig manure, and water hyacinths, for example, are very different from each other, but since pig manure and water hyacinths are important ingredients in making biogas, anybody interested in one is likely to be interested in the others. Thus, see also references in both directions between each and biogas are justifiable.
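Two-way see also references of this kind can be represented as a symmetric mapping. The sketch below is one assumed representation, with the hypothetical helper `build_see_also`; it records each related pair once and generates the reference in both directions.

```python
from collections import defaultdict


def build_see_also(pairs):
    """Build see-also links in both directions from a list of related pairs."""
    links = defaultdict(set)
    for first, second in pairs:
        links[first].add(second)
        links[second].add(first)
    return links


# Functional relationships from the biogas example; no link is recorded
# directly between the two ingredients.
SEE_ALSO = build_see_also([
    ("biogas", "pig manure"),
    ("biogas", "water hyacinths"),
])

for heading in sorted(SEE_ALSO):
    related = ", ".join(sorted(SEE_ALSO[heading]))
    print(f"{heading} see also {related}")
```

Whether the two ingredients should also refer to each other is exactly the kind of situational judgment the indexer has to make; the sketch leaves them unlinked.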
Many documents are concerned with complex topics, needing at least a phrase to express the scope. A simple approach is to list the terms needed to convey the meaning, in any order. Documents about the “Parents of handicapped children” would have three keywords: children, handicapped, and parents. There are also some documents on “children of handicapped parents,” which would be retrieved by the same keywords, but, being relatively few, would probably not be noticed in the retrieved set. Computers can easily handle keyword searches, but the earlier technology of catalog cards cannot: with cards, any such combination has to be “precoordinated” using some syntax at the time of cataloging to differentiate and to express relationships among the terms. The US Library of Congress subject headings have two quite separate headings, children of handicapped parents and parents of handicapped children, which, because they constitute grammatical phrases, are not confused by human searchers. This is a simple case. Syntactic rules are used to generate quite elaborate headings in which a primary term is progressively qualified, either as a complex phrase, such as hand-to-hand fighting, oriental, in motion pictures, or with a chain of qualifying terms, as in God—knowableness—history of doctrines—early church, ca. 30–600—congresses. The latter is a single subject heading in which the grammar is expressed by the positioning of the terms. For an English speaker accustomed to adjectives preceding the nouns they qualify, it sounds more natural if such headings are read in reverse order, with some conjunctions and prepositions added: “Congresses on the history of doctrines in the early church, ca. 30–600, concerning the knowableness of God.” Fortunately, the use of numbers and letters in the artificial notation of classification schemes allows elaborately coordinated topics to be expressed much more concisely.
In this way, all but the simplest documentary languages for naming topics have a grammar as well as a vocabulary.
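The difference between unordered keywords and precoordinated headings can be made concrete with a small Python sketch. It assumes a deliberately crude treatment: headings are split on spaces, a short hypothetical stopword list removes the connecting words, and a precoordinated heading is matched as a whole phrase. None of this reflects how any particular catalog is implemented.

```python
# Hypothetical stopword list: the connecting words dropped in keyword indexing.
STOPWORDS = {"of", "the", "and", "in"}

HEADINGS = [
    "parents of handicapped children",
    "children of handicapped parents",
]


def keywords(heading):
    """Reduce a heading to an unordered set of keywords."""
    return frozenset(w for w in heading.lower().split() if w not in STOPWORDS)


def keyword_search(terms, headings):
    """Return every heading that contains all the query terms, in any order."""
    wanted = {t.lower() for t in terms}
    return [h for h in headings if wanted <= keywords(h)]


def heading_search(phrase, headings):
    """Return only the heading that matches the whole phrase, word order included."""
    return [h for h in headings if h == phrase.lower()]


# Both headings collapse to the same keyword set, so keywords cannot tell them apart.
print(keywords(HEADINGS[0]) == keywords(HEADINGS[1]))
print(keyword_search(["children", "handicapped", "parents"], HEADINGS))
print(heading_search("Parents of handicapped children", HEADINGS))
```

The keyword search retrieves both headings, whereas the phrase match distinguishes them, because the word order of the heading carries the grammar that the keyword set discards.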
A collection composed of one or very few documents needs no catalog. At the other extreme, distinguishing every little nicety in order to differentiate every topic becomes cumbersome. Collections of millions do need very detailed description in order to achieve the fineness of sifting required to select a handful rather than a flood of records. In practice, the level of detail in subject indexing is situational, depending on how many different items are acquired in each topic.
Subject indexing can be formulated as a matter of fitting descriptions. The challenge is to create descriptions that will enable those to be served to identify and select the best documentary means to whatever their ends may be. By definition, the descriptions used by indexers are for future use. This requires thinking about likely needs and describing (naming) in a forward-looking way. To do this, the indexer constructs, consciously or not, some mental narrative about future use, some story in which the document in hand could be relevant to future needs. It is not simply a matter of what the document is about, but of where it might be useful in an imagined future. Familiarity with the community and with its purposes, ways of thinking, and terminology is an important requirement for the effective indexer.
Vesa Suominen asked the question, “What is it that makes a good librarian?” Drawing on the ideas of linguist Ferdinand de Saussure, he answered that the task is one of “filling empty space.” The good librarian is one who is effective in arranging documents in relation to each need of each library user. That the populations of documents, of library users, and of needs are all very large and quite unstable makes the task more difficult, but does not undermine the principle. Suzanne Briet extended the idea of this forward-looking stance with her image of the librarian as a hunting dog guided by the hunter (researcher), but also prospecting ahead and pointing to prey invisible to the hunter in a dynamic partnership: “Like a hunter’s dog—really in front, guided and guiding.” (“Comme le chien du chasseur–tout à fait en avant, guidé, guidant.”)
The effort to be forward-looking is, however, affected by the describing (naming) process. Topical description is a matter of naming what a document is about, and describing is a matter of summarizing. Assigning subject headings is an extreme of summarizing what a document is about. But what, actually, is “aboutness” about? Stating that a subject heading represents a topic or a concept is valid, but unhelpful, because saying that merely points to another name and does not explain. An explanation of what a subject heading (and, therefore, a document) is “about” must be derived from the discourse from which the indexing term originates. A subject description assigned to a document says that this discourse (document) relates to that discourse (literature, discussion, dialogue), which means that the subject description is invariably based in the past. Similarly, library users don’t want topics, they want discourse: a statement, a description, an explanation, or at least a discussion of whatever they are curious about. Thus, a subject heading expressing a topic derives its importance from past discourse.
Meanings are established by usage and so always draw on the past. The indexer, then, is creating descriptions by drawing on the past, but expressing them with an eye to the future. This Janus-like stance might seem difficult enough in a stable world, but the reality of indexing is made much worse by time, technology, the nature of language, and social change.
The indexer’s formal act of naming, of recording the topical description of a document or of specifying a relationship between named topics, is necessarily performed at some point in time and inscribed into the apparatus of indexes and catalogs. As time passes, that act recedes from the present into the past. During the same flow of time the prior discourse, from which the choice of name was derived, has continued, evolved, and changed, and indexing practices will evolve with those changes. Also, as the future becomes the present, new futures continue to be foreseen, and the forward-looking perspective would increasingly be related to changed future expectations. However, an assigned name, once inscribed, is fixed. So, with the passing of time, its relationship with both the then-past discourses and also the then-future expected needs drifts away from relevance to the perceptions of an advancing present. Assigned names are, therefore, inherently obsolescent with respect to both the past and the future. Discourses and the indexer flow forward with time, but the assigned names have been inscribed for, and fixed in, a receding past. Because old indexing relates to past situations, updating it is difficult and not generally attempted.
It is not simply that a new document has to be positioned in relation to both past discourse and future needs. Additional complexity arises because there are, of course, not one but many simultaneous communities of discourse.
New names arise, especially for new topics, through figurative use of language, especially through metaphor. Well-established terms are used figuratively, based on some perceived similarity, for emerging concepts—for example, a computer “mouse.” Then, through usage, the new meaning becomes fixed, at first within its context, then more widely. The instability of language is not of the indexer’s making. Indexers must follow changes in word usage. They take a conservative approach because changes in terminology call into question older terminology, and the task of making retroactive alterations to the marks in an index takes resources away from other worthy purposes.
Information services depend heavily on technology. Documents are physical objects on paper, film, magnetic disk, or other physical media. Libraries could not operate as they do if the tasks to be performed were not heavily routinized and most of them reduced to clerical procedures performed by support staff or delegated to machines. The modern library arose in the spirit of late nineteenth-century technological modernism as “library economy,” imbued by Melvil Dewey and others with an emphasis on standards, system, efficiency, and collective progress that lives on in visions of digital libraries, the “semantic web,” and the “virtual.” Detailed control is needed for effectiveness and for efficiency, and librarians, pioneers of new technology for filing and record processing, inspired modern office-management procedures.
In subject indexing, the machinic and the cultural collide like two tectonic plates, and naming lies at the fault line where indexers use vocabulary control to try to mitigate the linguistic ruptures and slidings they can neither prevent nor avoid. Thus, there is an endemic battle between the incorrigibly cultural and aesthetic character of the underlying mission and the machinic tendencies essential for cost-effective performance. The central battle line of these tensions is in naming what documents are about.
The fact that the documents are overwhelmingly textual has allowed the heavy use of natural language processing techniques to infer semantic relationships between documents and between documents and queries. But this is a matter of lexical entities, of character strings, not of meanings. Fairthorne (1961) analyzed this difference by saying that these techniques deal with mentions, not meanings. For example, if information and retrieval commonly co-occur in that order, then they are presumed to constitute a phrase. And if the phrase information retrieval and the phrase vector space tend to co-occur in the same texts, they are computed as being close in “document space,” and a topical relationship is inferred from this “spatial” proximity. If relationships between marks are statistically significant, semantic affinities are implied but not explained. Machines can be programmed to detect regularities and inconsistencies among marks, even if they cannot distinguish sense from nonsense.
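A toy version of this kind of inference shows how little it knows about meaning. The sketch below is an assumption-laden illustration, not a description of any particular retrieval system: it treats a phrase as a plain character string, finds the documents that mention it, and scores two phrases by the overlap of those document sets (a Jaccard coefficient, chosen here only for simplicity). The four sample texts are invented.

```python
# Invented sample texts standing in for a document collection.
DOCS = [
    "Information retrieval with the vector space model",
    "The vector space approach to information retrieval",
    "Violins and fiddles in folk music",
    "Information retrieval evaluation",
]


def documents_mentioning(phrase, docs):
    """Return the positions of documents whose text contains the character string."""
    return {i for i, text in enumerate(docs) if phrase in text.lower()}


def cooccurrence(phrase_a, phrase_b, docs):
    """Jaccard overlap of the sets of documents mentioning each phrase."""
    a = documents_mentioning(phrase_a, docs)
    b = documents_mentioning(phrase_b, docs)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(cooccurrence("information retrieval", "vector space", DOCS))
print(cooccurrence("information retrieval", "violins", DOCS))
```

The first pair scores high and the second scores zero, yet at no point does the code consult what any of the strings mean. It counts mentions, and any affinity between topics is inferred from the counts.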
It is further evidence of the inherently linguistic character of bibliographical access that formulaic natural language processing techniques work quite well, but not always and not very reliably. It is the textual (lexical) similarity between documents that allows relatedness between discourses and descriptions to be inferred, since the same words are mentioned when the same or very similar language is in use. Because the method matches character strings, homographs with different meanings, such as host (landlord) and host (crowd), will dilute the precision of retrieval. The compelling economic attraction of this approach is, of course, that it is mechanical and so can be delegated to machines. Its weakness appears when different vocabularies are used to refer to the same topic without using (mentioning) the same terms. For this and for cross-lingual search, formal structures such as bilingual dictionaries or statistical associations help. The important and useful specialized vocabularies relating to places, events, and persons, which are partly cultural and partly physical, will be discussed in the next chapter.
Technical writing on information retrieval draws heavily on natural language processing to identify personal and institutional names, and on many types of frequency counts and statistical associations. This needs to be complemented by attention to the way that words are used and the unlimited number of ways of saying something (Blair 1990). Both categories and the language used to label them are deeply subjective (e.g., Lakoff 1987). Research on the social practices of science is contributing to understanding the use and role of documents and document description (e.g., Frohmann 2004). Sorting Things Out: Classification and its Consequences by Geoffrey Bowker and Susan Leigh Star (1999) provides revealing case studies of how social agendas influence the design of supposedly objective categorization systems.
Language evolves within communities of discourse and produces and evokes those communities. Every such community has its own more or less specialized, stylized practice of language. Attempts at controlled or stabilized vocabulary must deal with multiple and dynamic discourses and the resulting multiplicity and instability of meanings. Most bibliographies and catalogs have a single topical index, but include material of interest to more than one community. Since each community has slightly different linguistic practices, no one index will be ideal for everyone and, perhaps, not for anyone. In vernacular discussion of health, for example, the terms cancer and stroke are commonly used, but in professional medical writing neoplasm and cerebrovascular accident are preferred names. So, in theory, multiple, dynamic indexes, one for each community, would be ideal. It is not, however, only a matter of linguistic variation, but also of perspective. Different discourses discuss different issues, or discuss the same issue from different perspectives. A rabbit can be discussed as a pet, as a pest, as food, or as a character in a book.
Aside from these “dialect” differences, the vocabulary used by indexers to characterize their documents can become problematic for other reasons as the world changes. There are cognitive developments: new ideas and new inventions need new names. Horseless carriages were invented, then renamed automobiles. Also, new referents emerge for existing names. Some sixty years ago the word computer meant a human who performed calculations, but now always means a machine. More recently, the word printer made the same transition.
There are also consequences for library naming from affective changes. Even when the meaning (denotation) is stable, the associated context (connotation) or attitudes to what is referred to may change. Always, some linguistic expressions are socially unacceptable. That might not matter much, except that what is deemed acceptable or unacceptable not only differs from one cultural group to another, but changes over time, and, especially during changes, may be unpleasantly controversial. The phrase yellow peril was once widely used to denote what was seen as excessive immigration from the Far East, but it is now considered too offensive to use, even though there is no convenient and acceptable replacement term for this view and the phrase yellow peril is needed in historical discussion.
Much has been written concerning the social acceptability of subject headings, both the terms used and how they are related to each other. “Sexual perversion see also Homosexuality” was once, but is now no longer, acceptable. Sanford Berman’s Prejudices and Antipathies: A Tract on the LC Subject Heads Concerning People (1971) is an excellent introduction. Berman picks out scores of subject headings, explains why each is, in his opinion, offensive, and recommends alternative terms he considers more acceptable. His examples and commentary show how naming always reflects a cultural perspective, that terminology acceptable to one group may be offensive to another, and that attitudes change over time. For example, Jewish question implies untenable assumptions; Gypsies are not from Egypt and prefer to be called Roma; the cross-reference “Rogues and vagabonds see also Gypsies” exhibits prejudice; the headings Mammies and Negroes are offensive to those so named; Eskimos are properly called Inuit; and so on. His examples are far too many and too interesting to summarize here.
One’s own behavior is reflected as superior to that of others: rebellions by slaves are named insurrections, but rebellions by citizens are more positively named revolutions. Indians of North America, Civilization of did not refer to the culture of Native Americans, but to progress in the eradication of their culture and its replacement with European settlers’ lifestyle, as the Library of Congress instruction made clear: “Here is entered literature dealing with efforts to civilize the Indians.” European powers have colonies; the United States has offshore “territories and possessions” not called colonies. Many of Berman’s examples reflect a male and Christian world view, the social attitudes of past times, and obsolete medical and psychological terminology (e.g., Idiocy). In some cases, counterarguments can be made. For example, using Roma for Gypsies is counterproductive or inefficient if the library’s users are unfamiliar with that term.
Tracing shifts in subject indexing back through time is an instructive form of cultural and linguistic archeology. The Library of Congress Subject Headings is more than a hundred years old, has well over a hundred thousand different headings, and is difficult to update. It is an easy target in spite of many reforms, and a good example of a problem that is endemic in indexes and categorization systems. Linguistic expressions are necessarily culturally grounded and so unstable and, for that reason, are in conflict with the need to have stable, unambiguous marks if systems are to perform efficiently.
In addition to the problems of naming, much of the naming is of concepts that are themselves abstract or problematic, and there is no linguistic solution for conceptual vagueness or confusion.
Describing is a matter of naming characteristics of documents, especially what they are about. Descriptions vary by notation (words or codes), vocabulary control (standardized terminology), coordination of combinations for complex topics (e.g., venetian blind), and fineness (how detailed). Describing is a language activity drawing on already established terminology for future searches, but since language evolves, descriptions are necessarily obsolescent. Because language is cultural, descriptions of sensitive topics may be contested. The next chapter will examine in more detail how descriptions are organized and used.