8 Evaluation of Selection Methods

When serious studies of retrieval evaluation started in the 1960s, relevance was adopted as the criterion of selection success and measured in two ways: completeness in selecting relevant documents (recall) and purity in selecting only relevant documents (precision). In practice, a trade-off was found. Achieving a more complete selection of relevant documents tended also to increase the number of nonrelevant documents selected, while seeking to reduce the number of nonrelevant documents selected tended to reduce the completeness of the selection of relevant ones.

Relevance is a central concept in information retrieval and dominates the evaluation of selection systems, but there are problems with it. In practice, treating documents as simply relevant or nonrelevant is a very convenient but unrealistic simplification. Documents are often somewhat relevant; or their relevance is situational depending on what documents have already been selected; or they are relevant to one person but not to another, or at one time but not at another. The explanation lies in the evidential nature of documents and the cognitive needs relative to that evidential role.

Relevance, Recall, and Precision

  1. Recall is a measure of completeness. For any given query, how completely were relevant documents retrieved? Were all the relevant documents retrieved? If not, how many? What proportion? The answer is usually expressed as the percentage of the relevant documents in a collection that were found by the retrieval system in response to a query. So, for example, if there were 10 relevant documents in the collection but only eight of them were retrieved, the recall performance was eight out of 10, or 80%. The results for multiple queries could be averaged to provide a more broad-based assessment.
  2. Precision is a measure of purity. Did the retrieved set include only relevant documents, or were some unwanted, nonrelevant documents (“false drops”) also retrieved in error? Precision is used as a technical term for the proportion of the documents in a retrieved set that are relevant to the query. If 10 documents were retrieved but only six were relevant and four were not, then precision was six out of 10, or 60%. Both measures are computed in the short sketch following this list.
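
A minimal sketch in Python makes the two definitions concrete. The document identifiers are invented for illustration; the counts match the recall example above.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

relevant = {f"d{i}" for i in range(1, 11)}                 # 10 relevant documents
retrieved = [f"d{i}" for i in range(1, 9)] + ["x1", "x2"]  # 8 relevant, 2 not

print(recall(retrieved, relevant))     # 0.8 -> eight out of 10, or 80%
print(precision(retrieved, relevant))  # 0.8 -> 8 of the 10 retrieved are relevant
```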

The obvious objective of retrieving all of the relevant items (perfect recall) and only relevant items (perfect precision) is rarely achieved in practice. Efforts to increase completeness in retrieval performance (higher recall) tend to increase the number of nonrelevant documents also retrieved (lower precision). Efforts to avoid nonrelevant items in order to achieve high precision tend also to increase the number of relevant items that are not retrieved (lower recall). One may want selection systems that would retrieve all and only relevant documents, but in practice it seems that one has to choose between seeking all but not only relevant items or only but not all relevant items. Either way, the results are less than perfect. This trade-off occurs consistently enough to be accepted as normal. Appendix B explains why this happens.

Recall with Random, Perfect, and Realistic Retrieval

If documents are retrieved at random from a collection, the odds are always the same that the next document retrieved will be relevant. Suppose, for example, that a hundred documents in a collection of a thousand documents (just 10%) are relevant to a given query. Then, if documents are retrieved from the collection at random, the odds of the next document retrieved being relevant would remain, in this example, 100 in 1,000, which is 10 in 100, or 10%. As a result, the number of relevant documents retrieved would grow slowly, and complete retrieval of them (100% recall) would not be reached until all or almost all of the documents in the collection had been retrieved.

A perfect retrieval system would retrieve only relevant documents until no more were left. In this ideal case, the recall measure would increase rapidly to 100% when the first 100 documents (all relevant) had been retrieved. Of course, any additional documents retrieved beyond the first hundred would necessarily be nonrelevant, but since all the relevant documents would have been retrieved by then, the recall measure would remain perfect at 100%.

It is realistic to assume that any actual retrieval system will be less than perfect but better than retrieval at random, so performance will be somewhere between those two theoretical extremes. What happens is that since retrieval is better than random, there is early success: the first documents will tend to be relevant ones, and so the recall performance will initially rise quickly as retrieval proceeds. But a consequence of this early success is that the proportion of relevant documents in the pool of not-yet-retrieved documents steadily decreases. As a result, although a realistic retrieval starts well, improvement gradually slows as the proportion of still-retrievable relevant documents in the collection diminishes. Retrieval of every last relevant document (100% recall) might not be achieved until most or even all of the documents in the collection have been retrieved.
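
A small simulation sketches these three recall curves. The parameters are invented for illustration (a collection of 1,000 documents, 100 of them relevant, and a noisy score standing in for a "realistic" system), matching the running example.

```python
import random

N, R = 1000, 100                       # collection size; relevant documents
docs = [True] * R + [False] * (N - R)  # True marks a relevant document

def recall_curve(ranking):
    """Cumulative recall after each document is retrieved, in order."""
    found, curve = 0, []
    for is_relevant in ranking:
        found += is_relevant
        curve.append(found / R)
    return curve

random.seed(1)
random_order = random.sample(docs, N)       # retrieval at random
perfect_order = sorted(docs, reverse=True)  # all relevant documents first
# "Realistic": relevant documents tend to score higher, but only noisily,
# so they are likelier (not certain) to be retrieved early.
realistic_order = sorted(
    docs, key=lambda rel: random.gauss(1.0 if rel else 0.0, 0.7), reverse=True
)

for k in (100, 300, 1000):  # recall after k documents retrieved
    print(k, [round(recall_curve(order)[k - 1], 2)
              for order in (random_order, perfect_order, realistic_order)])
```

Perfect retrieval reaches 100% recall at 100 documents; random retrieval climbs at a steady 10% rate; the realistic ranking rises quickly at first and then slows, as described above.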

Precision with Random, Perfect, and Realistic Retrieval

Similarly, if documents are retrieved at random, the odds are always the same—only one in 10 in our example—that the next document retrieved will be relevant, so precision will tend to be around 10%, however many documents are retrieved.

With perfect retrieval, however, the first 100 documents retrieved will all be relevant, and so precision will start and remain at 100%. Of course, any additional documents retrieved beyond that first hundred would necessarily have to be nonrelevant, and thus, if retrieval is continued, precision will gradually decline to the limit of 10% if all the documents in the collection are retrieved.

With any actual retrieval system, which is less than perfect but better than random, there is again early success: the first documents retrieved will tend to be relevant ones. As a result, precision will start high but then gradually decline if retrieval continues and the proportion of relevant documents among those retrieved diminishes.
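
Continuing the simulation sketched above (and reusing its invented orderings), a precision counterpart shows the same three behaviors.

```python
def precision_curve(ranking):
    """Cumulative precision after each document is retrieved, in order."""
    found, curve = 0, []
    for i, is_relevant in enumerate(ranking, start=1):
        found += is_relevant
        curve.append(found / i)
    return curve

for k in (100, 300, 1000):  # precision after k documents retrieved
    print(k, [round(precision_curve(order)[k - 1], 2)
              for order in (random_order, perfect_order, realistic_order)])
# Random hovers near 0.10 throughout; perfect starts at 1.0 and decays
# toward 0.10 as nonrelevant documents are forced in; realistic starts
# high and declines somewhere in between.
```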

Trade-off between Recall and Precision

Ideally one would retrieve all relevant documents (perfect recall) and retrieve only relevant documents (perfect precision), and to the extent that recall and precision are not perfect, both should be improved. However, experience shows that in practice there is a trade-off. Wider searching to retrieve more relevant documents will tend to result in retrieving more nonrelevant documents, too, so improved recall tends to be at the expense of precision. Meanwhile, more careful retrieval intended to exclude nonrelevant documents from retrieval (improved precision) will also tend to yield less complete retrieval, so improved precision tends to be at the expense of recall. One has a choice of emphasizing all-but-not-only or only-but-not-all, but not both at the same time. (For a more detailed explanation, see Appendix B.)
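
The same simulated "realistic" ranking makes the trade-off concrete: sweeping the retrieval cutoff, recall rises while precision falls, and no cutoff delivers both at once.

```python
# Reusing realistic_order, recall_curve, and precision_curve from the
# sketches above: a wide cutoff favors recall, a tight cutoff favors
# precision, and neither achieves both.
for k in (50, 100, 200, 500, 1000):
    r = recall_curve(realistic_order)[k - 1]
    p = precision_curve(realistic_order)[k - 1]
    print(f"cutoff {k:4d}: recall {r:.2f}, precision {p:.2f}")
```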

Some Problems with Relevance

Relevance is the traditional criterion in the evaluation of selection systems and is widely considered to be the most central concept in the field. But relevance is deeply problematic in several ways, starting with its definition. “Relevant” could mean those items wanted by the inquirer, those that will please, or those that will be most useful. However, what is wanted, what pleases, and what is useful are not the same. Further, assessments will be highly subjective and, since the searcher is presumed to be inadequately informed, are likely to be unreliable.

Relevance is highly situational, depending on what the inquirer already knows. And relevance is unstable because the inquirer is, or should be, actively learning, so the very fact of retrieving informative documents should change the relevance status of documents for the searcher.

The standard assumption that all items are independent in the sense that the relevance of one item does not affect the relevance of any other item is a convenient but unconvincing simplification. If two documents are very similar, one usually does not need both. Further, the relative relevance of documents is unstable because the population of documents is always changing.

The discussion above also assumes that relevance is binary: either a document is relevant, or it is not. This, too, is unrealistic. In practice, documents are more likely to be somewhat relevant, partially relevant, marginally relevant, or of uncertain relevance. Added to all this, judgments concerning relevance tend to be inconsistent between different judges and also over time for the same judge.

Information services are purposive, and a document is said to be relevant if it serves someone’s mental activity beneficially. This raises further difficulties: whose benefit? Who determines what is beneficial? How is the benefit to be measured?

Why Relevance Is Difficult

With all these difficulties, it is not surprising that, although relevance has been regarded as central to information science, it remains problematic despite sustained attention by many talented minds. Howard White (2010) provides an excellent account of relevance theory. He states, correctly, that although relevance is intuitively understood, it resists satisfying definition, observation, or scientific treatment, as critics have noted all along.

To be relevant, a document must be useful to an actual human being’s mental activity. Therefore, relevance is subjective, idiosyncratic, hard to predict, and unstable. (Relevance to a specific need of a specific person is sometimes called pertinence.) Ordinarily, one can only make a judicious guess that a given document is likely to be relevant to a given query for a supposed population of users at some point in time.

The basic problem is that documents have both physical and mental aspects. Scientific measurement depends on there being something physical to measure. The physical aspects of documents can be measured and so treated scientifically, but the highly situational, unstable, idiosyncratic, and subjective mental aspect cannot. Thus, because every document also has a significant but inaccessible mental aspect, its relevance cannot be measured scientifically. For this reason, relevance can never satisfactorily be a scientific matter in the normative sense of the formal and physical sciences, such as mathematics and physics, based on formal conjecture and refutation.

In practice, we fall back on distant substitutes. We can use only the physical aspect, primarily coded character strings, and use character strings in a query to discover similar character strings in documents that might be discourse on the same topic. The matching of character strings works quite well, but not entirely reliably. We can ask a jury to predict whether a document is likely to be relevant to a hypothetical inquirer. We can ask an inquirer, after a search, whether a document was relevant. But either judgment might not be valid for someone else or for the same person at another time.

A scientific approach to relevance could work very well if a document had only a physical aspect and not also a mental one. We see this situation in the modeling of signaling reliability developed by Claude Shannon as communication theory and now better known as information theory. The scientific quality and practical utility of this model are beyond question, and they are achievable because no mental or social properties are involved, only physical ones. The desire to make this information theory a central component of library and information science has not proven successful, and the reason is not hard to see: any information science concerned with what individuals know requires a mental angle, and Shannon-Weaver information theory is powerful precisely because it excludes the mental angle. It can be useful as a tool, just as queuing theory and other quantitative tools can be, but despite its name it cannot claim any greater special status.

Ultimately, then, relevance is a convenient conjectured relationship, and it is not surprising that despite 50 years of hard work by talented researchers, it remains ill-defined and not measurable in any direct way. Nevertheless, such a measure is needed, so convenient substitutes, usually matches on similar words, serve instead: if someone asks for documents about bicycles, then one infers that any document including the word “bicycle” is likely to be, at least in part, about bicycles and so worth adding to the set of selected documents.
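
A toy sketch of this word-matching substitute, with an invented three-document collection, shows both its convenience and its crudeness: it finds documents containing the query string, but it would miss a relevant document that speaks only of "cycling."

```python
def crude_topic_match(query_term, documents):
    """Select documents whose text contains the query term as a character
    string. A stand-in for the word-matching substitute described in the
    text, not a real retrieval system."""
    term = query_term.lower()
    return [doc_id for doc_id, text in documents.items()
            if term in text.lower()]

documents = {  # invented miniature collection
    "d1": "Maintenance of bicycle gears and chains.",
    "d2": "A history of motorized transport.",
    "d3": "Urban cycling: the bicycle as a commuter vehicle.",
}
print(crude_topic_match("bicycle", documents))  # ['d1', 'd3']
```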

Summary

Relevance is the most central concept in the evaluation of selection systems, but its use as a measure depends on very severe simplification of a complex reality. There is necessarily a trade-off between the completeness of selection (recall) and the purity of selection (precision). A more fundamental problem is that status as a document involves more than physical existence: there is a cognitive element as well, which resists measurement. As a result, treating relevance quantitatively can be very useful, yet it remains unscientific.