7 Discovery and Selection

Finding aids of many kinds exist to support the discovery, location, and selection of records and documents. A printed bibliography is a static structured interface for finding records describing documents. Here we examine more dynamic selection aids: search engines, filtering systems, and retrieval systems, including full-text search, library catalogs, web search, and enterprise search (search within the resources of large organizations).

Search engines operate by identifying documents that have some specified characteristics. The basic mechanism is the matching of query and document. Retrieval systems match queries against a stable collection of documents, while filtering services use a stable query to select from a flow of documents. The term information retrieval was coined in 1950 and became widely adopted, but the earlier term selection machine provides a more accurate description of what is done.

Search practices vary by context, and four different contexts will be considered: search of text files, a library catalog, web search, and search within the resources of an organization (enterprise search).

Retrieval and Selection

The word retrieval is ordinarily used to include quite different procedures: identifying, the discovery of documents, in the sense of establishing their existence; locating (“look-up”), when identified objects have known addresses; fetching, bringing an object from a known address; and selecting, in the sense of choosing. Locating and fetching are relatively straightforward procedures. It is the fourth, selecting, that is more interesting, especially when we do not know ahead of time what the choices are or even whether anything suitable is, in fact, available to be found, and so the challenge becomes to find a way to identify the least unsuitable documents.

These tasks can be divided into the following two situations:

  1. locating and fetching known items using unambiguous requests, sometimes called data retrieval. This is relatively straightforward, although the scale and complexity of the records can pose technical challenges.
  2. discovery and selection when the resources available are not precisely or unambiguously known, sometimes called document retrieval. Selection systems tend to be complex. They are usually proprietary, and their mechanisms opaque. All such systems have to be complete in order to work at all. These factors seem to have deflected attention from the simple nature of their components.

The core function of all finding and selecting systems is the matching of needs with documents and trying to separate (partition) those suitable for selection from the unsuitable. In mental selection, we think about different options and select one “within our head.” On vacation we may decide to send postcards to some friends and relatives, but if we cannot remember their addresses confidently, we may need to look in a list of names and addresses, which we might also scan to see if we have forgotten someone who would want to receive a card. Selection aids become progressively more necessary as the size of a list or collection increases. Web search engines, searchable databases, library catalogs, and similar devices perform this function and have become everyday tools.

Selection machines are devices used to identify, locate, and fetch records. Because the choice is often so large and our familiarity with the options available inadequate, selection machines may also be used to choose for us. Web search engines, for example, select web pages for us. In nearly all cases, however, an initial shortlist selected by machine is followed by a human, mental selection from the choice provided by the machine.

In filtering systems, commonly used on incoming email, objects are represented, filtered (searched), and then selected for attention, relegated to other storage, or discarded. In this case the query, once developed, remains indefinitely in place as a stored instruction (selection rule) and is used to select incoming documents. Filters using stored queries on flows of data objects are symmetrical with retrieval systems, which use stably stored data objects and transient queries.
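
The symmetry can be made concrete with a small sketch in Python (the rule, query terms, and messages here are all invented for illustration; real mail filters are far more elaborate):

```python
# A minimal filtering system: the query is stored once as a selection rule,
# then applied to each document as it flows past. Illustrative only.

def make_filter(stored_query):
    """Return a selection rule built from a stored query."""
    terms = set(stored_query.lower().split())
    def rule(document):
        # Select the document if every query term appears in it.
        return terms <= set(document.lower().split())
    return rule

# The query remains indefinitely in place; the documents flow past it.
select = make_filter("invoice payment")
incoming = [
    "please find the invoice for last month's payment",
    "lunch on friday?",
    "payment reminder: invoice overdue",
]
matched = [doc for doc in incoming if select(doc)]
```

In a retrieval system the roles would be reversed: the documents would be stored stably and each transient query matched against them.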

The Anatomy of Selection Machinery

General models of information retrieval systems are commonly found in information retrieval textbooks in the form shown in figure 8, with varying amounts of additional descriptive detail depending on the purpose of the description.


Figure 8 General model of selection systems.

There is a symmetry between queries and documents. Queries can be posed to a collection of documents (“ad hoc” search) or a flow of documents can be matched against a standing query (filtering).

There are many different procedures for matching queries with records: exact match; partial match; match using shortened words (truncation); positional and other relationships; logical combinations of query components (Boolean matching, e.g., cats AND dogs); and so on. There are limitless degrees of partial (“weaker”) matching, and multiple techniques can be combined. Much of the very extensive technical literature on information retrieval is composed of descriptions of the use of minor variations. Description of the operation of any individual operational selection system is likely to require detailed diagrams of that particular system’s components and workflow. Here, however, we provide only a general description, and we start with a simple distinction between searching within a document and searching for a document.
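
A few of these matching procedures can be sketched over a toy index (the words and document identifiers below are invented for illustration):

```python
# A toy index mapping words to the set of documents containing them.
index = {
    "cats": {1, 3}, "cat": {2}, "dogs": {1, 4}, "fish": {2, 3},
}

def exact(term):
    """Exact match: documents containing the term as given."""
    return index.get(term, set())

def truncated(stem):
    """Truncation: match any indexed word beginning with the stem."""
    hits = set()
    for word, docs in index.items():
        if word.startswith(stem):
            hits |= docs
    return hits

# Boolean combinations of query components:
both = exact("cats") & exact("dogs")    # cats AND dogs
either = exact("cats") | exact("dogs")  # cats OR dogs
```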

Searching Text

A simple example of selection is the Find command provided in word processing software, which takes whatever string of characters has been entered into the “find” field, scans the entire text serially, word by word, comparing the query string with each word, and draws attention to each match in the order found.
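
The serial scan can be sketched in a few lines of Python (the function name and sample text are illustrative):

```python
# A sketch of the Find command's serial scan: compare the query string with
# each word of the text in turn and report matches in the order found.

def find_all(text, query):
    positions = []
    for i, word in enumerate(text.split()):
        if word == query:
            positions.append(i)
    return positions

text = "to be or not to be"
find_all(text, "be")  # matches at word positions 1 and 5
```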

Where the amount of searchable text is large, the speed and efficiency of searching can be greatly improved by preprocessing the text to create an index that lists all words found in the text and provides the location within the text at which each word occurs. All words are indexed as found. In computing terms, this rearrangement of the elements is called an inverted file.
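
A minimal inverted file can be built in one pass (a sketch; real systems add compression and much else):

```python
# Preprocess the text once, recording for each word the positions at which
# it occurs; later look-ups then need not rescan the text.

def build_inverted_index(text):
    index = {}
    for position, word in enumerate(text.split()):
        index.setdefault(word, []).append(position)
    return index

index = build_inverted_index("to be or not to be")
# A search is now a dictionary look-up rather than a serial scan.
```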

The traditional name for an index to the occurrences of words in a text is a concordance, but strictly speaking text searching uses strings of characters separated by spaces, not words. Words can have alternative spellings as well as ambiguous meanings, and quite different words with other meanings can have the same spelling; these variations are not distinguished. Uppercase and lowercase letters are treated as identical and punctuation marks are excluded, but otherwise every sequence of characters is indexed automatically as found, regardless of meaning.

Full-text searching for character strings is very economical because the high cost of human indexing is avoided. It works well for discovery and selection to the extent that the query words used by the searcher match the terminology used in the texts being searched.

For efficiency, concordances may exclude (“stop”) very frequently found words that are not expected to be useful when searching, such as articles (e.g., a, an, and the) and prepositions (e.g., at, from, and to). Many additional refinements can be provided by, for example, combining known variations in spelling and, where many different documents are being searched, the relative frequency of words in different documents can be used to give priority in the display of search results to documents containing the searched word more frequently.
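
Both refinements can be sketched together (the stop list, documents, and identifiers below are invented for illustration):

```python
# Exclude stop words when indexing, and rank documents by how often the
# searched word occurs in each. Illustrative only.

STOP = {"a", "an", "the", "at", "from", "to"}

def index_terms(text):
    """Index every word as found, except stop words."""
    return [w for w in text.lower().split() if w not in STOP]

def rank_by_frequency(docs, term):
    """Return document ids, most occurrences of the term first."""
    counts = {doc_id: index_terms(text).count(term)
              for doc_id, text in docs.items()}
    return sorted((d for d in counts if counts[d] > 0),
                  key=lambda d: -counts[d])

docs = {
    "d1": "the cat sat on the mat",
    "d2": "a cat and another cat",
    "d3": "dogs at the park",
}
rank_by_frequency(docs, "cat")  # d2 (two occurrences) before d1 (one)
```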

More elaborate algorithms can be used to infer different meanings by looking at nearby words. For example, the character string bank is likely to refer to a river bank if used close to words such as water, fish, or boat, but is more likely to refer to a financial institution if close to words such as finance, mortgage, or manager. Another elaboration would be to represent more pairs of terms that commonly occur next to each other, such as information and retrieval, so that the significance of the combination can be represented. Much ingenuity has been invested in developing complex, algorithmically generated representations of combinations of words in a text.
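
The bank example can be sketched by counting which sense's cue words appear nearby (the cue lists and window size are illustrative assumptions; real disambiguation algorithms are far more sophisticated):

```python
# Infer the likely meaning of an ambiguous term from nearby words.
CUES = {
    "river": {"water", "fish", "boat"},
    "finance": {"finance", "mortgage", "manager"},
}

def guess_sense(words, target, window=4):
    """Score each sense by cue words within `window` words of the target."""
    scores = {sense: 0 for sense in CUES}
    for i, w in enumerate(words):
        if w == target:
            nearby = words[max(0, i - window): i + window + 1]
            for sense, cues in CUES.items():
                scores[sense] += len(cues & set(nearby))
    return max(scores, key=scores.get)

sentence = "the boat drifted past the bank where fish gathered".split()
guess_sense(sentence, "bank")  # "river": boat and fish occur nearby
```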

A Library Catalog

A typical online library catalog provides a quite different approach. A carefully structured index—a database—is created. A precise and detailed description of each book or periodical in the library’s collection is created in a carefully defined and standardized way as discussed in chapter 4, where an object-attribute-value approach was illustrated with the examples person-age-45 and book-topic-economics.

In library catalogs, each aspect considered important—notably authorship, title, topic, date, size, publisher, and place of publication—is noted, using consistent terminology to create a precise representation of each one in a way that makes the catalog records compatible with records in other library catalogs worldwide. So, as an example of a set of attributes and values:
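
Such a set of attributes and values might look like the following sketch, shown here as a Python dictionary. The author and title echo the search examples later in this chapter; the remaining values are invented for illustration:

```python
# A hypothetical catalog record as attribute-value pairs, following the
# object-attribute-value pattern discussed in chapter 4. Values other than
# author and title are illustrative, not real bibliographic data.
record = {
    "author": "Wright, Alex",
    "title": "Glut",
    "topic": "Information science",
    "date": "2007",
    "size": "296 p.",
    "publisher": "Example Press",
    "place": "Washington, DC",
}
```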

As noted in the previous chapter, descriptive metadata of this kind serves two different purposes: description and search. There are many other attributes that could be included, such as the binding, typeface, paper, and weight, that might be of interest for some library users, but catalogs and cataloging cost money, so investment is made only in the attributes considered most important. Further, since the creation and maintenance of searchable indexes consume significant resources, only some of the attributes are made searchable. In this example, often only the first four attributes (author, title, topic, and call number) would be searchable. The remaining four attributes (date, size, publisher, and place of publication) are usually not searchable but are displayed as description when a record has been found.

In early online catalogs, as with card catalogs, one specified an attribute and a value in the form FIND AUTHOR WRIGHT, ALEX. So to search by title one had to know the title exactly, or, at least, how it started: FIND TITLE GLUT would find all titles that started with the word “glut.” As computing became more affordable, the technique of full-text search described above was added to allow search for individual words within titles. In addition, support became available for compound queries, so-called Boolean searches, in the form FIND TITLE GLUT AND AUTHOR WRIGHT, ALEX, where only records that satisfied both conditions would be retrieved, or FIND TITLE GLUT OR AUTHOR WRIGHT, ALEX, where records satisfying either condition would be retrieved.
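
A fielded Boolean search of this kind can be sketched as follows (the records and the starts-with matching rule are illustrative assumptions):

```python
# Each record stores attribute-value pairs; FIND matches a value against a
# named attribute, and AND/OR combine the resulting sets of record ids.

records = {
    1: {"author": "wright, alex", "title": "glut"},
    2: {"author": "wright, alex", "title": "cataloging"},
    3: {"author": "smith, jan", "title": "glut and scarcity"},
}

def find(attribute, value):
    """FIND ATTRIBUTE VALUE: match records whose value starts as given."""
    return {rid for rid, rec in records.items()
            if rec[attribute].startswith(value.lower())}

# FIND TITLE GLUT AND AUTHOR WRIGHT, ALEX
both = find("title", "glut") & find("author", "wright, alex")
# FIND TITLE GLUT OR AUTHOR WRIGHT, ALEX
either = find("title", "glut") | find("author", "wright, alex")
```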

The choice of attributes and the presentation of values follows long-established cataloging codes designed to achieve consistency and interoperability. The creation of the records, the preparation of searchable indexes, the procedure for posing queries, and the display of search results involve a series of steps. Analysis of these steps shows a chain of records and operations alternating. The books are subjected to a cataloging process resulting in catalog records; the catalog records are operated upon to generate the searchable index; the user’s query is formalized to become a formal query; the formal query and the searchable index are matched to yield the selected set; the selected set is sorted and processed for display, and so on. Each process derives a new set. These processes are of two quite different types: those that modify the objects being processed, and those that rearrange the objects. These two processes correspond to Fairthorne’s marking and parking, noted in chapter 4.

In practice, searching is commonly a series of selection stages. One might start by browsing subject headings (a first search) and when a suitable subject heading has been selected, search for documents associated with that subject heading (a second search) to find one or more selected documents, which when inspected suggest that a modified search (a third search) would be more useful, and so on.

Appendix A provides a more detailed explanation of this precisely designed system, with its carefully edited records.

Searching the Web

The carefully prepared records, standardized formats, and precise search options found in library catalogs and in searchable, well-edited databases were developed before the emergence of the World Wide Web. The web posed a new challenge because web pages do not have the standardized, carefully edited, and well-structured content that characterizes library catalog records. Subject description is commonly lacking or, if provided, not standardized. The creators of web pages and other documents may also add description that may be intentionally misleading in order to attract attention. The lack of control over the creation of web pages and the sheer number of them make it impossible to catalog the web in any way comparable to library cataloging.

The basic solution adopted is simple. The web is downloaded and treated as text. Software designed to crawl around the web downloads as many pages as possible. Each page found by a web-crawler is copied and stored. Each word in each stored page is used to produce an index to that page. All the index entries to all the pages are combined to form a unified index to all of the pages collected. Every query is expressed in the form of one or more words and leads through the unified index to web pages containing one or more of those words.
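
The combination into a unified index can be sketched as follows (the pages and URLs are invented for illustration):

```python
# Store each crawled page, index every word in it, and merge everything
# into one unified index over all pages collected.

pages = {
    "http://a.example": "cheap flights to rome",
    "http://b.example": "rome travel guide",
    "http://c.example": "flights and hotels",
}

unified_index = {}
for url, text in pages.items():
    for word in set(text.split()):
        unified_index.setdefault(word, set()).add(url)

def search(query):
    """Return pages containing at least one of the query words."""
    hits = set()
    for word in query.split():
        hits |= unified_index.get(word, set())
    return hits

search("rome flights")  # all three pages contain one of the words
```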

This approach allows extremely economical and rapid selection from vast numbers of web pages, but because of the size of the web, the result of any query is likely to be inconveniently large and in no useful order. In addition to knowing which documents are “on topic,” it would be helpful to know which ones are in some sense preferable to others and to give those priority in the presentation of the search results. The solution was to adapt a principle from academic scholarship. Since writings considered significant are more likely to be cited than those that are not, the frequency of citations to any given book or article can be considered an indicator of importance or, at least, of popularity. Web pages cite each other with links, so a count of the links to any given web page can be used in the same way to sort and to rank the pages found in a web search. This combination of downloading, index building, and page ranking provides a powerful and efficient selection service, even though major simplifications (reliance on character strings and page ranking) are made.
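
The principle of counting links can be sketched with plain inlink counting (full link-analysis algorithms are more elaborate; the link data here is invented for illustration):

```python
# Pages linked to by more other pages are given priority in the display
# of search results.

links = {  # page -> pages it links to (illustrative)
    "a": ["b", "c"],
    "b": ["c"],
    "d": ["c", "b"],
}

# Count the links pointing to each page.
inlinks = {}
for source, targets in links.items():
    for target in targets:
        inlinks[target] = inlinks.get(target, 0) + 1

def rank(found_pages):
    """Order search results by inlink count, highest first."""
    return sorted(found_pages, key=lambda p: -inlinks.get(p, 0))

rank(["a", "b", "c"])  # c (3 inlinks), then b (2), then a (0)
```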

The enormous demand for web searching allows for substantial revenue from advertisements inserted into the displayed results and from payments for more prominent display of sponsored pages. Although this revenue would be insufficient for the conventional cataloging of every web page, it does allow the development of very sophisticated software to improve the search results. Spell-checking software and other techniques used for full-text search can be used to suggest alternative search results. Dictionaries and thesauri can be used to elaborate the searchable index with some vocabulary control (e.g., connecting synonyms and variant spellings). When recorded, the history of an individual’s previous searches can be analyzed to infer the searcher’s intention, to suggest related options, and to display advertisements likely to be of interest.

Other Examples

Large organizations in both public and private sectors need control over their corporate records and depend on the ability to find the right material when needed (“enterprise search”). The scale, while far less than the web, can be large. More importantly, over time, changes in standards, terminology, and software lead to both records and software becoming obsolescent and so maintaining access, security, and preservation becomes increasingly difficult. Mergers and takeovers result in the need to cope with alien material and previously unsupported software. In this difficult environment, different approaches will be combined. As in library catalogs, controlled vocabularies, often called thesauri or ontologies, will be used where affordable along with the techniques used in full-text search. Document management software may be used to provide a more or less coherent environment for all or most of the corporate records.

In searching text, one can look for words that tend to occur together. Similarly, data sets of any kind can be examined to look for statistically significant relationships that might suggest unknown, unexpected, or interesting relationships and anomalies. The “data mining” of sales records, of social media, and of news reports are examples.
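
A simple form of such mining is counting which items occur together, as in this sketch over toy sales transactions (the baskets are invented for illustration):

```python
# Count every pair of items appearing together in a transaction and
# surface the most frequent pairs.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "tea"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

pair_counts.most_common(1)  # ("bread", "butter") occurs in three baskets
```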

Summary

The phrase information retrieval is used for finding and fetching already known documents, as well as the more difficult task of search and discovery of previously unknown resources. The basic approach is to match queries against documents or their descriptions. Texts can be searched serially, but it is more efficient to use the words in the text to make an index. Alternatively, as in the case of library catalogs, a database of carefully prepared descriptions can be searched. All selection machinery, both filtering services and retrieval systems, can be seen to be composed of just two types of component: objects (data sets) and operations on them. There are only two kinds of operations: transforming, or deriving modified versions or representations of objects, and arranging (or rearranging) objects by combining, dividing, ranking, and other comparable sorting operations. These two operations are Fairthorne’s marking and parking, and they can be described as semantic and syntactic, respectively. More familiar terms would be description and arrangement.

In the next chapter, we will examine how selection methods are evaluated.