Naming

Topic Descriptions

Once collected, documents have to be made accessible in an organized way. Librarians, for example, make descriptions of documents in their catalogs and also through classified subject arrangements on their shelves. Assigning topic names to documents and assigning documents to named topical categories are central.

Names (marks) are essential for collected documents to be findable. Names are, necessarily, linguistic expressions and—as we shall see—they create tensions and difficulties. Libraries are cultural institutions concerned with recorded knowledge, and their mission is to support learning, both research (knowing more) and teaching (sharing understanding). Libraries exist to advance learning, knowledge, understanding, and belief. But what people know, what they would like to know, and what others have learned and written about, all resist mechanical treatment. If it were otherwise, knowledge management could be reduced to data processing.

Searchers seeking documents relevant to their interests have to locate what they need on the system’s terms. There is, or should be, collaboration between information service providers seeking to anticipate their users’ interests and vocabulary, and users trying to make sense of the category names in the catalog, classification, or bibliography being used. Even if a limited vocabulary or an artificial notation, such as the Dewey Decimal Classification system, is used, all description is a language activity. Description is always and necessarily based in culture, because descriptions are based on the concepts, definitions, and understandings that have developed in a community.

This naming (bibliographic description) follows rules. For more than a century, there has been gradual international standardization of rules for representing the imprint (where and by whom published), collation (physical features of a document), proper names (authors, institutions, and places), and other attributes of documents. The real difficulty, however, for both librarians and library users is in describing what a document is about, in naming its topic, which is usually presented as a two-stage process: first, the cataloger examines a document to determine what concepts it is about; then he or she assigns terms (linguistic expressions) from a vocabulary to denote those concepts. The literature has little to say about the first stage and concentrates on the second. Research has revealed that different indexers will commonly assign different index terms to the same document, as will a single indexer at different times.

Documentary Languages for Naming Topics

There are a variety of methods for representing what documents are about: subject classifications, lists of subject headings, thesauri, ontologies, and so on. A traditional collective term for all of them is documentary languages. We need not examine each type, but will note four dimensions along which they vary.

Notation

Verbal approaches using natural language words are a simple and popular way to create descriptions. However, using ordinary vocabulary has disadvantages, and ease of creation does not lead to effective use. The multiplicity and fluidity of natural language vocabulary make for unpredictable results: should I look under violin or fiddle, or both? The variations of natural language can be mitigated by adopting a restricted (“controlled”) vocabulary, as explained below.

Natural language words do not arrange themselves in a helpful way. Alphabetical filing order is determined by accidents of spelling rather than related meanings. “If the names of the classes, in a natural language, are used to arrange them, we do not get a helpful order. In fact names scatter classes in a most unhelpful chaotic order. It will give us an order like algebra, anger, apple, arrogance, asphalt, and astronomy,” wrote the Indian librarian, S. R. Ranganathan (1951, 34). Also, natural language indexes are ordinarily created only in a single language.

Both of these problems can be addressed by using an artificial notation for the descriptive names (such as the Dewey Decimal Classification) designed to achieve some desired arrangement, with natural language indexes leading to the class numbers in as many different natural languages as desired. Having an artificial notation of letters, numerals, and other symbols does not mean that it is no longer a language. It is an artificial language and is not immune to the problems of obsolescence and perspective discussed below. It is the same approach as the artificially constructed, restricted languages used, for example, in botanical and chemical nomenclature.
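The contrast between alphabetical scattering and a notation-based arrangement can be illustrated with a minimal sketch. It uses the terms from Ranganathan's example; the class numbers and the translated index entries are invented for illustration and are not real Dewey Decimal Classification numbers.

```python
# Hypothetical class numbers, invented for this sketch (not real Dewey numbers).
CLASS_NUMBERS = {
    "algebra": "510.2",
    "anger": "150.4",
    "apple": "630.1",
    "arrogance": "150.5",
    "asphalt": "660.3",
    "astronomy": "520.1",
}

# Hypothetical index entries in other natural languages.
TRANSLATIONS = {"apple": ["pomme", "Apfel"], "anger": ["colère"]}


def alphabetical_order(terms):
    """File terms by spelling alone, as an alphabetical index does."""
    return sorted(terms)


def classified_order(terms, notation):
    """File terms by class number so that related classes sit together."""
    return sorted(terms, key=lambda term: notation[term])


def build_index(notation, translations):
    """Lead from words in any listed language to the same class number."""
    index = {}
    for term, number in notation.items():
        index[term] = number
        for foreign_word in translations.get(term, []):
            index[foreign_word] = number
    return index
```

In this sketch, filing by class number brings anger next to arrogance and algebra next to astronomy, while the index lets a searcher start from apple, pomme, or Apfel and reach the same class.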

Vocabulary Control

Everyday language is characterized by multiplicity, such as singular and plural forms, variant spellings, or synonyms and antonyms (opposites). The same topic could be assigned any number of names, or represented in an indefinite number of ways (“unlimited semiosis”), so documents on the same topic could be scattered under any of several different headings. A searcher might find some but not others. The standard solution is “vocabulary control,” whereby one form of name, for example, violins, is “preferred,” and used exclusively. Other commonly used but “nonpreferred” terms are listed, but only to redirect the searcher to the preferred term: for example, fiddles, see violins. An “authority file,” a list of the carefully differentiated preferred and nonpreferred terms, is compiled and followed.
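A very small authority file can be sketched as a lookup table from every listed term to its preferred form. The entries, function names, and the wording of the reference are assumptions made for this illustration, not the format of any actual catalog.

```python
# Hypothetical authority file: every listed term maps to its preferred form.
AUTHORITY_FILE = {
    "violins": "violins",   # preferred term
    "violin": "violins",    # singular variant
    "fiddle": "violins",    # nonpreferred synonym
    "fiddles": "violins",
}


def resolve(term, authority=AUTHORITY_FILE):
    """Return the preferred heading for a term, or None if it is not listed."""
    return authority.get(term.strip().lower())


def see_reference(term, authority=AUTHORITY_FILE):
    """Format the redirection a searcher would be shown for a term."""
    normalized = term.strip().lower()
    preferred = authority.get(normalized)
    if preferred is None:
        return f"{normalized}: no entry"
    if preferred == normalized:
        return preferred
    return f"{normalized}, see {preferred}"
```

Because every variant resolves to one heading, documents indexed under that heading are no longer scattered, although deciding which terms belong in the table remains a matter of judgment.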

Vocabulary control can take care of synonyms, near-synonyms, antonyms, and variant spellings. Exact synonyms are quite rare. It is near-synonyms that are frequent. Birds and ornithology, for example, are closely related but not quite the same. Even so, someone interested in birds might look under ornithology and vice versa. Near-synonyms require endless situational judgments concerning what to combine and what to keep separate.

In practice, vocabulary control also extends to hierarchical and other relationships (“see also”). Vocabulary control extends beyond semantic to functional relationships, which differentiates this kind of word list from traditional dictionaries. Biogas, pig manure, and water hyacinths, for example, are very different from each other, but since pig manure and water hyacinths are important ingredients in making biogas, anybody interested in one is likely to be interested in the others. Thus, “see also” references in both directions between each and biogas are justifiable.
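Reciprocal references of this kind can be sketched as a symmetric mapping built from pairs of related terms. The function name and data layout are assumptions chosen for the illustration.

```python
from collections import defaultdict


def build_see_also(related_pairs):
    """Record each relationship in both directions."""
    see_also = defaultdict(set)
    for first, second in related_pairs:
        see_also[first].add(second)
        see_also[second].add(first)
    return see_also


# Functional relationships from the biogas example.
pairs = [("biogas", "pig manure"), ("biogas", "water hyacinths")]
references = build_see_also(pairs)
```

Here biogas points to both ingredients and each ingredient points back to biogas, but the two ingredients are not linked to each other, since the relationship recorded is between each of them and biogas.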

Coordination

Many documents are concerned with complex topics, needing at least a phrase to express the scope. A simple approach is to list the terms needed to convey the meaning, in any order. Documents about the “Parents of handicapped children” would have three keywords: children, handicapped, and parents. There are also some documents on “children of handicapped parents,” which would be retrieved by the same keywords, but, being relatively few, would probably not be noticed in the retrieved set. Computers can easily handle keyword searches, but the earlier technology of catalog cards cannot: any such combination has to be “precoordinated” using some syntax at the time of cataloging to differentiate and to express relationships among the terms. The US Library of Congress subject headings have two quite separate headings, children of handicapped parents and parents of handicapped children, which, because they constitute grammatical phrases, are not confused by human searchers. This is a simple case. Syntactic rules are used to generate quite elaborate headings in which a primary term is progressively qualified, either as a complex phrase, such as hand-to-hand fighting, oriental, in motion pictures, or with a chain of qualifying terms, as in God—knowableness—history of doctrines—early church, ca. 30–600—congresses. The latter is a single subject heading in which the grammar is expressed by the positioning of the terms. For an English speaker accustomed to adjectives preceding the nouns they qualify, it sounds more natural if such headings are read in a reverse order, with some conjunctions and prepositions added: “Congresses on the history of doctrines in the early church, ca. 30–600, concerning the knowableness of God.” Fortunately, the use of numbers and letters in the artificial notation of classification schemes allows elaborately coordinated topics to be expressed much more concisely.
In this way, all but the simplest documentary languages for naming topics have grammar as well as a vocabulary.
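The difference between unordered keywords and precoordinated headings can be shown with a minimal sketch. The records, the stopword list, and the function names are invented for illustration; the third heading in particular is made up and not taken from any subject heading list.

```python
# Hypothetical records; only the first two headings come from the text above.
RECORDS = [
    {"id": 1, "heading": "parents of handicapped children"},
    {"id": 2, "heading": "children of handicapped parents"},
    {"id": 3, "heading": "parents of gifted children"},
]

STOPWORDS = {"of", "in", "the", "and"}


def keywords(heading):
    """Reduce a heading to an unordered set of content words."""
    return {word for word in heading.lower().split() if word not in STOPWORDS}


def keyword_search(records, terms):
    """Retrieve records containing every term, ignoring word order."""
    wanted = {term.lower() for term in terms}
    return [r["id"] for r in records if wanted <= keywords(r["heading"])]


def heading_search(records, heading):
    """Retrieve records whose precoordinated heading matches exactly."""
    target = heading.lower()
    return [r["id"] for r in records if r["heading"].lower() == target]
```

A keyword search for children, handicapped, and parents returns records 1 and 2 alike, because the set of words carries no syntax. Searching on the full heading keeps them apart, since the word order of the phrase expresses who is related to whom.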

Fineness

A collection composed of one or very few documents needs no catalog. At the other extreme, distinguishing every little nicety in order to differentiate every topic becomes cumbersome. Collections of millions do need very detailed description in order to achieve the fineness of sifting required to select a handful rather than a flood of records. In practice, the level of detail in subject indexing is situational, depending on how many different items are acquired in each topic.

Mention and Meaning

Information services depend heavily on technology. Documents are physical objects on paper, film, magnetic disk, or other physical media. Libraries could not operate as they do if the tasks to be performed were not heavily routinized and most of them reduced to clerical procedures performed by support staff or delegated to machines. The modern library arose in the spirit of late nineteenth-century technological modernism as “library economy,” imbued by Melvil Dewey and others with an emphasis on standards, system, efficiency, and collective progress that lives on in visions of digital libraries, the “semantic web,” and the “virtual.” Detailed control is needed for effectiveness and for efficiency, and librarians, pioneers of new technology for filing and record processing, inspired modern office-management procedures.

In subject indexing, the machinic and the cultural collide like two tectonic plates, and naming lies at the fault line where indexers use vocabulary control to try to mitigate the linguistic ruptures and slidings they can neither prevent nor avoid. Thus, there is an endemic battle between the incorrigibly cultural and aesthetic character of the underlying mission and the machinic tendencies essential for cost-effective performance. The central battle line of these tensions is in naming what documents are about.

The fact that the documents are overwhelmingly textual has allowed the heavy use of natural language processing techniques to infer semantic relationships between documents and between documents and queries. But this is a matter of lexical entities, of character strings, not of meanings. Fairthorne (1961) analyzed this difference by saying that these techniques deal with mentions, not meanings. For example, if information and retrieval commonly co-occur in that order, then they are presumed to constitute a phrase. And if the phrase information retrieval and the phrase vector space tend to co-occur in the same texts, they are computed as being close in “document space,” and a topical relationship is inferred from this “spatial” proximity. If relationships between marks are statistically significant, semantic affinities are implied but not explained. Machines can be programmed to detect regularities and inconsistencies among marks, even if they cannot distinguish sense from nonsense.

It is further evidence of the inherently linguistic character of bibliographical access that formulaic natural language processing techniques work quite well, but not always and not very reliably. It is the textual (lexical) similarity between documents that allows relatedness between discourses and descriptions to be inferred, since the same words are mentioned when the same or very similar language is in use. Because the method matches character strings rather than meanings, homographs with different meanings, for example host (landlord) and host (crowd), will dilute the precision of retrieval. The compelling economic attraction of this approach is, of course, that it is mechanical and so can be delegated to machines. The poverty of this approach becomes apparent when different vocabularies are used to refer to the same topic without using (mentioning) the same terms. For this and for cross-lingual search, formal structures such as bilingual dictionaries or statistical associations help. The important and useful specialized vocabularies relating to places, events, and persons, which are partly cultural and partly physical, will be discussed in the next chapter.
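The mention-based matching described in the last two paragraphs can be sketched with simple term counts and cosine similarity. This is one common way of computing closeness in “document space,” offered as an assumption for illustration rather than as the specific technique any author cited here had in mind; the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Count mentions of each lowercased, whitespace-separated string."""
    return Counter(text.lower().split())


def cosine(vector_a, vector_b):
    """Cosine of the angle between two term-count vectors."""
    shared = vector_a.keys() & vector_b.keys()
    dot = sum(vector_a[term] * vector_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in vector_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vector_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def similarity(text_a, text_b):
    """Lexical similarity of two texts, based on mentions only."""
    return cosine(term_vector(text_a), term_vector(text_b))
```

Under this measure, “cancer treatment” and “neoplasm therapy” score zero because no strings are shared, although the topic is the same, while “the host of the inn” and “a host of angels” score above zero because the string host is mentioned in both, although the senses differ.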

Technical writing on information retrieval draws heavily on natural language processing to identify personal and institutional names, and on many types of frequency counts and statistical associations. This needs to be complemented by attention to the way that words are used and the unlimited number of ways of saying something (Blair 1990). Both categories and the language used to label them are deeply subjective (e.g., Lakoff 1987). Research on the social practices of science is contributing to understanding the use and role of documents and document description (e.g., Frohmann 2004). Sorting Things Out: Classification and its Consequences by Geoffrey Bowker and Susan Leigh Star (1999) provides revealing case studies of how social agendas influence the design of supposedly objective categorization systems.

Naming Is Cultural

Language evolves within communities of discourse and produces and evokes those communities. Every such community has its own more or less specialized, stylized practice of language. Attempts at controlled or stabilized vocabulary must deal with multiple and dynamic discourses and the resulting multiplicity and instability of meanings. Most bibliographies and catalogs have a single topical index, but include material of interest to more than one community. Since each community has slightly different linguistic practices, no one index will be ideal for everyone and, perhaps, not for anyone. In vernacular discussion of health, for example, the terms cancer and stroke are commonly used, but in professional medical writing neoplasm and cerebrovascular accident are preferred names. So, in theory, multiple, dynamic indexes, one for each community, would be ideal. It is not, however, only a matter of linguistic variation, but also of perspective. Different discourses discuss different issues or, when the same issue, from different perspectives. A rabbit can be discussed as a pet, as a pest, as food, or as a character in a book.

Aside from these “dialect” differences, the vocabulary used by indexers to characterize their documents can become problematic for other reasons as the world changes. There are cognitive developments: new ideas and new inventions need new names. Horseless carriages were invented, then renamed automobiles. Also, new referents emerge for existing names. Some sixty years ago the word computer meant a human who performed calculations, but now always means a machine. More recently, the word printer made the same transition.

Summary

Describing is a matter of naming characteristics of documents, especially what they are about. Descriptions vary by notation (words or codes), vocabulary control (standardized terminology), coordination of combinations for complex topics (e.g., venetian blind), and fineness (how detailed). Describing is a language activity drawing on already established terminology for future searches, but since language evolves, descriptions are necessarily obsolescent. Because language is cultural, descriptions of sensitive topics may be contested. The next chapter will examine in more detail how descriptions are organized and used.
