Having looked at how naming is used in describing, we now examine how descriptions are used. Metadata (literally beyond or with data) is a common name for descriptions of documents, records, and data: it is data about data. Here we do not distinguish between data and documents. The first and most obvious use of metadata is description, but inverting the relationship so that description becomes central rather than peripheral enables metadata to also serve as the preferred basis for search and discovery. However, depending on metadata for search also runs into difficulties caused by differing vocabularies used in different contexts, the unlimited variety and instability of language, and the need to relate different but comparable terms and to distinguish different uses of the same words. It is useful to distinguish fundamentally different aspects (facets), such as what, where, when, and who, and to treat them separately.
The first and original use of metadata is to describe documents. There are different kinds of descriptive metadata:
These descriptions help in understanding a document’s character and in deciding whether to make use of it. Description can be very useful, even if nonstandard terminology is used. Almost any description is better than none. However, it is always strongly recommended that descriptive metadata follow standard forms in order to facilitate comparison.
Metadata has two components: a format and a set of values. Well-known formats include XML, the Dublin Core, and MARC (for sharing library catalog records). Each is associated with specific standards for defining the kinds of descriptions that may be used with it. Standardized formats for storing and displaying metadata make it easier to use, and standard vocabularies bring consistency and aid understanding.
When documents are browsed, especially digital documents, descriptive metadata is used to understand what kind of document it is, what it is about, and how to use it. This process resembles the way one can look at the cover of a book to help assess the text inside.
To establish a meaningful connection between a query and a document or between two different documents, two actions are required: first, make a connection between them, and then express the nature of the relationship between them. For example, one might assign the same topic description to both, or a subject heading can be assigned to a text and the same subject heading also assigned to an image, as shown in figure 3.
The next step is to invert this relationship, so that one can go from the subject heading both to the text and also to the image. This allows a unified search of both texts and images relating to the same topic, starting with a query leading to a subject heading (value) leading to documents, as discussed in chapter 5 and shown in figure 4.
This maneuver inverts the original structure. Instead of descriptions being attached to documents, the documents are attached to the descriptions. The vocabulary of the descriptions becomes primary, and the documents become peripheral. This inversion is clearly seen in citation indexes. When you examine books and articles, the references are peripheral, in footnotes or at the end, and are often in smaller type. But a citation index inverts that relationship. The citations themselves and the relationships between them become primary. Only when a citation of interest has been selected is a document, at the periphery, consulted.
This relationship also allows a transverse search from a text through a subject heading to an image assigned the same subject heading, or, equally, from an image to texts, as shown in figure 5.
In this way, two or more documents on the same topic, however different their format or content, can be related to each other as a network. This process depends on having a single vocabulary to describe topics, or, at least, interoperable vocabularies.
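The inversion and the transverse search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular catalog system: the document identifiers, the headings assigned to them, and the function names `invert` and `related` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: each document carries its own subject headings.
documents = {
    "text-001": {"kind": "text", "headings": ["Lighthouses", "Navigation"]},
    "image-042": {"kind": "image", "headings": ["Lighthouses"]},
    "text-007": {"kind": "text", "headings": ["Navigation"]},
}


def invert(docs):
    """Attach documents to headings instead of headings to documents."""
    index = defaultdict(set)
    for doc_id, record in docs.items():
        for heading in record["headings"]:
            index[heading].add(doc_id)
    return {heading: sorted(ids) for heading, ids in index.items()}


def related(doc_id, docs, index):
    """Transverse search: from a document, through its headings, to others."""
    found = set()
    for heading in docs[doc_id]["headings"]:
        found.update(index[heading])
    found.discard(doc_id)  # a document is not "related" to itself
    return sorted(found)


index = invert(documents)
print(index["Lighthouses"])                     # texts and images together
print(related("image-042", documents, index))   # from an image to texts
```

Because the index is keyed by heading rather than by document, the format of each document (text or image) plays no part in the lookup, which is what allows the unified search.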
Tagging—inviting anybody to assign any words that seem appropriate—has become popular. This practice is convenient and can be helpful as a starting basis for more formal indexing vocabulary. It can also help identify symbolic and emotional aspects of images and texts. However, best professional indexing practice is based on the following three principles.
These three techniques permit the development of very precise descriptive systems.
Thinking of metadata as a means for describing individual documents reflects only one of the two roles of metadata. The second use of metadata is different: it emerges when you start with a query or with the description rather than the document—with the metadata rather than the data—when searching in an index.
This second use of metadata is for finding, for search and discovery. In a digital environment for texts, it is common and convenient to use textual queries and to search for the occurrence of text fragments in the available documents, as web search engines do. So, a search for the topic “mouse” is expressed as the character string “m o u s e,” and every document containing that sequence of characters will be retrieved, whether it discusses a small mammal or refers to its (originally figurative) use for a computer input device or other uses of the word. The technique of searching text by character strings works quite well, but not always and not perfectly, because text resources are not entirely homogeneous. Some words have multiple meanings (polysemy); sometimes different words use the same character string but have different meanings (homographs); and different words may be used with the same meaning (synonyms, such as cancer and neoplasm).
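A small sketch may make these limits concrete. The four sample sentences and the function name `string_search` are invented for illustration, and the matching is a plain case-insensitive substring test rather than the ranking a real search engine would apply.

```python
# Hypothetical mini-collection of texts.
documents = {
    "d1": "The field mouse is a small mammal found in meadows.",
    "d2": "Click the left mouse button to select the file.",
    "d3": "Early detection of a neoplasm improves outcomes.",
    "d4": "New cancer treatments were announced this year.",
}


def string_search(query, docs):
    """Return ids of documents whose text contains the query string."""
    q = query.lower()
    return sorted(doc_id for doc_id, text in docs.items() if q in text.lower())


# One character string, two meanings: both documents are retrieved.
print(string_search("mouse", documents))

# Synonyms are missed: the document that says "neoplasm" is not found.
print(string_search("cancer", documents))
```

The first query retrieves too much and the second too little, and neither failure can be detected from the character strings alone.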
Simple text searches break down in multilingual environments as well as when nontextual resources are included, such as images, sounds, and numeric data sets. An image can be compared with other images, and a sound can be compared with other sounds, but an image cannot be compared directly with a sound or other media forms. One cannot ordinarily use a query composed of a few pixels, or a sound, as a query in a text file. The usual remedy is to add to each nontext object a textual description that can be searched by a textual query.
Infrastructure is a collective term for the subordinate parts of an undertaking. It was initially used to refer to fixed resources used for transportation and military operations and has been gradually extended to include services ancillary to, or in support of, the performance of a central task. Minimally, travel by train requires tracks, a locomotive, and wagons, but an effective and reliable railroad service also depends on other auxiliary resources: systems for signaling, ticketing, communication between stations, fuel supplies, a management structure, publication of timetables, and so on. The collective name for these auxiliary resources is infrastructure.
Infrastructure is always some kind of structure, but which structures should be considered infrastructure is situational. A bank needs the support of data processing services to provide its banking services, and this computing support is considered part of the bank’s infrastructure. For the computer services industry, auxiliary resources—the infrastructure—include reliable banking services for handling payments. So banking services are, in turn, part of the infrastructure of the computer services industry.
Standards and protocols are an intangible form of infrastructure with very tangible consequences. Since infrastructure is considered to be the environment of support that enables and empowers, social conventions and mentalities—the structures of thought discussed in Michel Foucault’s The Order of Things (1970)—could be considered a form of infrastructure.
To summarize, the first and original use of metadata is for describing documents, and the name metadata (beyond or with data) along with its popular definition, “data about data,” are based on this use. A second use of metadata is to form organizing structures by means of which documents can be arranged. These structures can be used both to search for individual documents and also to identify patterns within a population of documents. The second role of metadata involves an inversion of the relationship between document and metadata. These structures can be considered infrastructure.
The remainder of this chapter examines features of metadata when used for search and discovery within a reference work, in a library, or when searching online.
Since language evolves within cultural contexts, this becomes more marked when one ventures into the indexes of other disciplines, of different media, and of foreign countries. One can search effectively and efficiently only when one is dealing with a vocabulary with which one is familiar. How many people know that search terms for automobiles should include, among many others, the following?
The whole point of a network environment is to make more and different resources accessible, so the number of resources with unfamiliar vocabulary increases both absolutely and as a proportion of what is accessible. This is a recipe for less effective and less efficient searching. An important (though neglected) response to this problem is to provide search term recommender services. A simple form is a mapping from the familiar to the unfamiliar. In the subject index to the first edition of Dewey’s Decimal Classification in 1876 is:
Railroads 385.
The sixth edition of Dewey’s Decimal Classification and Relativ Index for Libraries, Clippings, Notes, etc. of 1899, used railroad to illustrate that the best link may vary according to the context (“in different connections”; Dewey 1899, 10), including:
Since searchers come from different backgrounds, they do not have a single familiar vocabulary, so there should be a different set of subject indexing for each group of users. Until now that has not been economically feasible, but this can be approximated by creating multiple search term recommender services to any given vocabulary.
Imagine three doctors (an anesthesiologist, a drug therapy specialist, and a geriatrician) who each wanted recent literature on cardiac arrest (a medical term for a heart attack). “Cardiac arrest” itself is not a heading used in the standard Medical Subject Headings, or MeSH, vocabulary, so what would be the most effective MeSH headings to use? The three doctors are specialists. They do not have the same kind of interest in cardiac arrests. Each inhabits a different medical subculture. None would be interested in (and might not understand) the specialized literature of interest to the others. Suitably biased training sets can generate specialized search term recommendations for each.
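One way to picture a group-specific recommender is as a table of counts built from a training set for each group. The sketch below is only an assumption about how such a service might work: the training pairs are invented, the labels "Heading A", "Heading B", and "Heading C" are placeholders rather than real MeSH headings, and simple co-occurrence counting stands in for whatever statistical method a real service would use.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs: (user group, familiar term, heading assigned
# to a document that members of the group found relevant).
# The heading labels are placeholders, not real MeSH headings.
training = [
    ("anesthesiology", "cardiac arrest", "Heading A"),
    ("anesthesiology", "cardiac arrest", "Heading A"),
    ("anesthesiology", "cardiac arrest", "Heading B"),
    ("geriatrics", "cardiac arrest", "Heading C"),
    ("geriatrics", "cardiac arrest", "Heading C"),
    ("geriatrics", "cardiac arrest", "Heading A"),
]


def build_recommenders(pairs):
    """Count, per group and familiar term, how often each heading occurs."""
    counts = defaultdict(Counter)
    for group, term, heading in pairs:
        counts[(group, term.lower())][heading] += 1
    return counts


def recommend(counts, group, term, limit=2):
    """Return the headings most often associated with the term for a group."""
    ranked = counts.get((group, term.lower()), Counter()).most_common(limit)
    return [heading for heading, _ in ranked]


counts = build_recommenders(training)
print(recommend(counts, "anesthesiology", "cardiac arrest"))
print(recommend(counts, "geriatrics", "cardiac arrest"))
```

The same familiar term yields a different ranking for each group because each group's counts come from its own training pairs, which is the sense in which the training sets are biased.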
As our environment is increasingly a networked environment, the citing of authoritative resources in a print-on-paper environment gives way to linking to such resources. This has advantages. The online resource will have much more detailed information than a local authority list can. Consider place names. A local list can do little more than specify the preferred form of the name and as much detail as is needed to differentiate it from other places with the same name—typically, the type of place (“geographical feature type”) and the geopolitical entity in which it is located. However, a place-name gazetteer will have all that and more, notably the latitude and longitude. Similarly with personal names, a local name authority list will have the preferred name, a note of other names also used, and just enough detail to differentiate it from other people encountered locally with the same name. A large, authoritative list, such as the name authority file of a national library or a biographical dictionary, will have a great deal of detail about the individual’s life and career. The information is richer and can be drawn on as needed, for example, to make a map display or a time line.
Another benefit is that an invoked online resource can and should be continuously updated in a way that a printed volume cannot be, and the link makes this updating available whenever the link is invoked.
So far we have discussed topics (“what”) in general terms. There are, of course, different kinds of topic and some are different enough to justify specialized treatment. In the rest of this chapter, we show how three special cases (who, where, and when) are quite different and then how they are related.
Personal names are important for authorship and biographical texts. The need to differentiate between different persons with the same name and to aggregate different names for the same person is well understood in archives, libraries, museums, and elsewhere. However, the techniques for handling interpersonal relationships appear to have been rather neglected. Genealogists have experience with encoding family relationships (parent-child, spouse, etc.), but people can be related to each other in other important ways (e.g., teacher-pupil, business partner) for which techniques and terminology need further development.
Searching in a text environment is dominated by topical keywords or undifferentiated keywords, possibly including the names of persons, places, and institutions. However, for searching in some resources, such as socioeconomic data series and photographs, it becomes important to specify geographical location reliably and exactly. “Place” is a cultural construct, and this is reflected in place names, which, like topic names, are often multiple (e.g., Lisboa, Lisbon, Lisbona, Lisbonne, Lissabon); ambiguous (Galicia, Poland; Galicia, Spain); and unstable (e.g., St Petersburg became Leningrad then St Petersburg again).
Space, in contrast, is defined in physical terms of latitude and longitude, which provides descriptions that are neither ambiguous nor unstable. A large advantage of spatial coordinates is that they allow places to be shown on a map. There is, therefore, for geographical areas, a dual naming system of place and space: place names and spatial coordinates. A place-name gazetteer can be considered a kind of bilingual dictionary between places and spaces. A gazetteer enables place names to be disambiguated and places to be located on a map. A well-designed gazetteer will indicate when a place name was in use, thereby supporting changes over time.
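The idea of a gazetteer as a bilingual dictionary between places and spaces can be sketched as a small data structure. The field names, the functions `lookup` and `names_at`, and the layout of the entries are assumptions made for illustration. The coordinates are rounded, approximate values, and the entries include Petrograd (1914 to 1924), a name not mentioned in the text, so that the periods of use are continuous.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    name: str
    feature_type: str                  # kind of place
    within: str                        # containing geopolitical entity
    lat: float                         # approximate latitude
    lon: float                         # approximate longitude
    used_from: Optional[int] = None    # first year the name was in use
    used_until: Optional[int] = None   # year the name ceased to be used


# Illustrative entries only; coordinates are rough approximations.
GAZETTEER = [
    Entry("Lisbon", "city", "Portugal", 38.72, -9.14),
    Entry("Lisboa", "city", "Portugal", 38.72, -9.14),
    Entry("Galicia", "region", "Spain", 42.8, -8.0),
    Entry("Galicia", "region", "Poland", 50.0, 21.0),
    Entry("St Petersburg", "city", "Russia", 59.94, 30.31, 1703, 1914),
    Entry("Petrograd", "city", "Russia", 59.94, 30.31, 1914, 1924),
    Entry("Leningrad", "city", "Russia", 59.94, 30.31, 1924, 1991),
    Entry("St Petersburg", "city", "Russia", 59.94, 30.31, 1991, None),
]


def lookup(name, within=None):
    """From place to space: entries for a name, optionally disambiguated."""
    return [e for e in GAZETTEER
            if e.name == name and (within is None or e.within == within)]


def names_at(lat, lon, year):
    """From space to place: names in use at these coordinates in a year."""
    names = []
    for e in GAZETTEER:
        if (e.lat, e.lon) != (lat, lon):
            continue
        started = e.used_from is None or e.used_from <= year
        not_ended = e.used_until is None or year < e.used_until
        if started and not_ended:
            names.append(e.name)
    return names


print(len(lookup("Galicia")))                # ambiguous without a qualifier
print(lookup("Galicia", within="Spain")[0])  # disambiguated
print(names_at(59.94, 30.31, 1950))          # the name in use in 1950
```

The coordinates stay fixed while the names attached to them change, so a query by year can recover the name that a document of that period would have used.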
Events and time tend to be mutually defining. Time is calibrated by physical events, and cultural epochs by cultural events. But physical events and cultural epochs are also calibrated by calendar time. In speech and in writing, we commonly mark time by reference to events, as in “after I graduated” or “before the Second World War.” This duality of events and of time resembles the duality of place and space and invites a similar approach: the use of a directory relating named events to calendar time. Associating events with dates supports the construction of time lines and chronologies in the same way that a place-name gazetteer relates place names to spatial coordinates and map displays.
So far we have spoken of indexes for topic, place, time, and persons as if the indexes for these facets were separate and independent, but in practice they are not, except in primitive examples. In a mature topical index such as the Library of Congress Subject Headings system, the topic heading will be commonly combined with geographical and chronological qualifiers, e.g., Architecture—Japan—Meiji period, 1868–1912. In other words, subject headings may have geographical and temporal components as well as topical.
A place-name gazetteer ordinarily indicates the kind of place (geographic “feature type”) it is: castle, church, lake, city, etc. A physical feature is not the same as a topic, but any kind of feature can be treated as a topic. An individual castle is an instance of the category castles. Documents about castles generally may be helpful as well as any documents concerning this particular castle. And a discussion of the topic castles can be enriched by moving from the subject heading to the geographical feature type codes in the gazetteer in order to identify and to locate instances of castles in any region, so a mapping between feature types and subject headings can be useful. Since a well-designed gazetteer will also have an indication of when that name was in use, entries in gazetteers, like subject headings, can have temporal and topical as well as geographical aspects.
A time-period directory modeled on gazetteer designs would have a coding for kind of event or period. So, as with gazetteer entries, a specific event (e.g., an earthquake) can be linked to subject headings both by its proper name (e.g., Lisbon Earthquake 1755) and through the literature on that class of events (e.g., Earthquakes). Events are specific to geographic areas, and so a proper time-period directory will have geographical codings, and it should be possible to link each event both to geographic subject headings and to gazetteer entries.
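A time-period directory of this kind might be sketched as follows. The record layout, the kind labels other than Earthquakes, the place codings, and the function names are assumptions for illustration; the named events and their dates are the familiar ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    name: str    # proper name of the event or period
    kind: str    # class of event, usable as a topical heading
    start: int   # first calendar year
    end: int     # last calendar year
    place: str   # geographical coding


DIRECTORY = [
    Period("Lisbon Earthquake 1755", "Earthquakes", 1755, 1755, "Lisbon"),
    Period("Meiji period", "Historical periods", 1868, 1912, "Japan"),
    Period("Second World War", "Wars", 1939, 1945, "World"),
]


def overlapping(start, end, directory=DIRECTORY):
    """Named events or periods that overlap a range of calendar years."""
    return [p.name for p in directory if p.start <= end and start <= p.end]


def links_for(name, directory=DIRECTORY):
    """Headings an entry can be linked to: proper name, class, and place."""
    for p in directory:
        if p.name == name:
            return {"proper_name": p.name, "class": p.kind, "place": p.place}
    return {}


print(overlapping(1900, 1940))
print(links_for("Lisbon Earthquake 1755"))
```

The `place` value is the point at which such a directory would connect to a gazetteer entry, and the `class` value is the point at which it would connect to a subject index.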
The texts of entries in biographical dictionaries are very rich in mentions of (1) kinds of activities, which could be linked to subject headings for that kind of activity; (2) places that could be linked to gazetteer entries and to geographic subject headings; (3) periods of time that could be linked to other, contemporaneous events via time-period directories, time lines, and chronologies; and (4) other people with whom the biographee interacted and for which biographical information could be found in biographical dictionaries and encyclopedias.
Although there are effective methods for handling people’s names, methods for handling the events in their lives are much less developed. One approach is to categorize each biographical event or life activity as a four-aspect unit of what kind of activity (topical aspect), when (temporal aspect), where (geographical aspect), and with whom (biographical aspect). An attraction of this approach is that life events could be encoded with the terminology and methods already established, or being developed, for subject indexing, time periods, place names, and biographical dictionaries.
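The four-aspect unit lends itself to a simple record structure. The sketch below is one possible encoding, offered as an assumption rather than an established scheme: the persons are fictitious placeholders, and the field and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LifeEvent:
    person: str                       # the biographee
    what: str                         # kind of activity (topical aspect)
    when: Tuple[int, int]             # first and last year (temporal aspect)
    where: str                        # place name (geographical aspect)
    with_whom: Tuple[str, ...] = ()   # other people (biographical aspect)


# Fictitious life events for illustration.
events = [
    LifeEvent("Person A", "Studying", (1890, 1894), "Lisbon", ("Person B",)),
    LifeEvent("Person A", "Teaching", (1895, 1910), "St Petersburg", ("Person C",)),
    LifeEvent("Person C", "Studying", (1895, 1899), "St Petersburg", ("Person A",)),
]


def by_facet(all_events, **criteria):
    """Select events whose named aspects equal the given values."""
    return [e for e in all_events
            if all(getattr(e, key) == value for key, value in criteria.items())]


def overlaps(a, b):
    """True when two (start, end) year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def contemporaries(person, all_events):
    """People active in the same place during overlapping years."""
    own = [e for e in all_events if e.person == person]
    found = set()
    for mine in own:
        for other in all_events:
            if (other.person != person and other.where == mine.where
                    and overlaps(mine.when, other.when)):
                found.add(other.person)
    return sorted(found)


print([e.person for e in by_facet(events, what="Studying")])
print(contemporaries("Person A", events))
```

Each field could hold a term drawn from the corresponding resource: a subject heading for `what`, a time-period directory entry for `when`, a gazetteer entry for `where`, and name authority records for `with_whom`.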
Subject indexes, place-name gazetteers, time-period directories, and biographical dictionaries are quite different genres for quite different aspects of reality, but we find geographical connections, chronological links, and topical affinities across all four. There is a large and useful agenda in finding ways to build effective infrastructures of connections between these genres, because understanding requires a knowledge of context.
We have discussed different kinds of topic using what, who, where, and when. The technical term for such differentiated categories is facets and the use of facets is central in classification and knowledge organization. Linked data, being normally a <sameAs> or similar relationship, are ordinarily mappings within a single facet. Likewise, in a library reference collection, the reference works are classified by facet-specific genres: biographies, geography (maps and place-name gazetteers), histories (and chronologies), and so on. But when we look beyond the heading in a catalog or a reference work and examine the content of the entry or beyond it in an explanation, we find no such limit to a single facet, but rather, multiple facets:
Actual instances will vary greatly, but the important point is that any main heading in a bibliography or catalog and any entry in any facet-limited reference work is likely to have qualifiers or explanations using any or all other facets. Figure 6 shows what one might expect, with lines connecting instances from different facets: time, place, who, and what. We see the same effect in complex precoordinate systems such as Library of Congress Subject Headings and the Universal Decimal Classification.
There may be reasons for the sequence of the facets in each row, but if, for the purposes of illustration, we disregard those reasons and we rearrange the elements in each row such that the facets align vertically, we get figure 7.
The realignment of the contents of each row in figure 6 to the arrangement by facet lines in figure 7 shows more clearly the potential for using vertical and horizontal links. For example, a library catalog subject heading “Lighthouses” could be linked to the geographical description code “Lthse” (Lighthouse) in a place-name gazetteer. The gazetteer would give locations of actual lighthouses, and the catalog would list publications about lighthouses. This combination provides far more information than either does separately by coupling the two quite different kinds of resource. As presented in figure 7, vertical mappings provide links to additional vocabularies, which will lead to additional resources. Horizontal links provide additional context.
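The lighthouse example can be sketched as a mapping between two vocabularies. Only the pairing of the heading "Lighthouses" with the code "Lthse" comes from the discussion above. The publication titles, the gazetteer rows, the code "Cstl", the coordinates, and the function name `combined_view` are hypothetical.

```python
# Mapping between catalog subject headings and gazetteer feature type codes.
# Only the Lighthouses/Lthse pair is taken from the text.
HEADING_TO_FEATURE = {"Lighthouses": "Lthse"}

# Hypothetical catalog: subject heading -> publications.
CATALOG = {
    "Lighthouses": ["A history of lighthouses", "Lighthouse engineering"],
}

# Hypothetical gazetteer rows: (name, feature type code, latitude, longitude).
GAZETTEER = [
    ("Lighthouse X", "Lthse", 38.7, -9.5),
    ("Lighthouse Y", "Lthse", 43.4, -8.4),
    ("Castle Z", "Cstl", 38.8, -9.4),
]


def combined_view(heading):
    """Couple publications about a topic with located instances of it."""
    code = HEADING_TO_FEATURE.get(heading)
    instances = [row for row in GAZETTEER
                 if code is not None and row[1] == code]
    return {
        "heading": heading,
        "publications": CATALOG.get(heading, []),
        "instances": instances,
    }


view = combined_view("Lighthouses")
print(view["publications"])                   # what has been written
print([row[0] for row in view["instances"]])  # where actual examples are
```

A heading with no mapped feature type returns an empty list of instances rather than failing, which reflects the practical situation in which only some headings have a counterpart in the gazetteer.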
Document descriptions (“metadata”) cover technical, administrative, and topical aspects and help us understand a document’s character and whether it is of interest. Descriptions are created by assigning descriptive fragments, such as subject headings, to each document. Inverting this relationship (in effect, assigning documents to subject headings) creates indexes, thereby supporting a second purpose: discovery of documents of any given character. Problems arise from the differences between the many languages in use, both natural and artificial (codes and classifications). As a result, we need links that lead us from familiar terms to unfamiliar terms, especially in unfamiliar languages. In some cases, there are dual naming systems, such as place and space in geography, calendar and event in time, and formula and narrative explanation in mathematics, where the two aspects can be usefully combined. An important simplifying technique is the division into fundamentally different concepts (facets), such as who, what, when, and where. Terms in each facet can be usefully linked across different languages, but these conceptually different elements are always combined in real contexts, and there are many opportunities for taking advantage of these complex relationships.
This and the previous chapter examined how documents are described and how these descriptions can be organized and linked. In the next chapter, we look more closely at the mechanics of discovery and selection.