We’ve mentioned human labeling in parts of this book. In this chapter we will consider how humans can actually do labeling for different kinds of NLP tasks. Some of the principles—for example, guidelines—are applicable to labeling in general. Most of the special considerations required for NLP labeling tasks are around the technical aspects and hidden caveats of dealing with language. For example, asking someone to label parts of speech requires that they understand what parts of speech are. Let’s first consider some basic issues.
It is probably worth some thought as to what your actual input is. For example, if you are labeling a document for a classification task, the input is obvious—the document. However, if you are marking named entities, humans do not need to see the whole document to find them, so you can break the document up by paragraphs or even by sentences. On the other hand, coreference resolution, which we discussed in Chapter 9, may have long-distance coreferents, so you likely need to show the labeler the whole document.
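To make this concrete, here is a minimal sketch (in Python, with a made-up helper name and document format) of how you might generate labeling items at different granularities: whole documents for classification or coreference, and individual paragraphs for entity tagging.

```python
# A minimal sketch (names and format are assumptions) of generating labeling
# items at different granularities for different tasks.

def make_labeling_items(doc_id, text, granularity="document"):
    """Yield (item_id, text) pairs at the chosen unit of labeling."""
    if granularity == "document":
        # Classification and coreference need the full context.
        yield doc_id, text
    elif granularity == "paragraph":
        # Entity tagging can usually be done one paragraph at a time.
        for i, para in enumerate(p for p in text.split("\n\n") if p.strip()):
            yield f"{doc_id}-p{i}", para

# Example: one classification item per document, many tagging items per document.
doc = "First paragraph about physics.\n\nSecond paragraph with named entities."
print(list(make_labeling_items("doc42", doc, granularity="paragraph")))
```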
Another thing to think about is whether your task requires domain expertise or just general knowledge. If you require expertise, gathering labels is likely to take more time and money. If you are unsure, you can run an experiment to find out. Have a group of nonexperts, as well as an expert (or a group of experts if possible), label a subset of the data. If the nonexperts and experts have a high enough level of agreement, then you can get by without expert labeling. We will go over inter-labeler agreement, measuring how often labelers agree, later in this chapter.
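As a rough illustration of such an experiment, the following sketch compares a single expert’s labels against the majority vote of a group of nonexperts, reporting both raw agreement and Cohen’s kappa (a chance-corrected agreement score). The data and the choice of majority voting are assumptions made for the sake of the example.

```python
# A minimal sketch of the expert-vs-nonexpert experiment described above.
# Assumes each list holds one label per document, in the same order;
# the data here is made up for illustration.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

expert = ["physics", "math", "biology", "physics", "math"]
nonexpert_votes = [  # one inner list per nonexpert labeler
    ["physics", "math", "biology", "math", "math"],
    ["physics", "math", "physics", "physics", "math"],
]

# Reduce the nonexperts to a majority vote per document.
majority = [Counter(votes).most_common(1)[0][0] for votes in zip(*nonexpert_votes)]

raw_agreement = sum(e == m for e, m in zip(expert, majority)) / len(expert)
kappa = cohen_kappa_score(expert, majority)  # chance-corrected agreement
print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```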
The first thing we need to think about when doing labeling is defining the task for our labelers. This is a sometimes controversial subject, as opinions differ on how much instruction you should give.
There are many terms for labeling and for the people who do the labeling: labeling/labelers, rating/raters, judging/judges, etc. There are also many terms for prelabeled examples that are used for evaluating labelers—ground truth set, golden set, and so on. In this chapter I will use “labeling/labelers” and “golden set.”
Guidelines are instructions that tell your labelers how to do the task. The amount of detail required is often debatable. Fortunately, there are some rules of thumb that you can keep in mind. First, make sure that your guidelines reflect what is expected for the product. For example, if you are gathering labels for a spam email classifier, you will need to be clear about what you mean by spam. People sometimes refer to newsletters and other automated emails as spam. Your model will be expected not only to approximate the process of the human labelers but also to serve as a product feature. This means that there are two sets of expectations that we can use to clearly define our task. I like to begin with a thought experiment. What if I ignored all time and budgetary constraints and hired an army of labelers to work on my product? What would I tell them is necessary for the customer? The answer is the basis for the guidelines.
Now that we have a good definition of the task, we still have some other considerations. The second rule is to avoid overconstraining what a correct label is. Some tasks are naturally ambiguous. If you attempt to constrain this natural ambiguity, you may introduce a number of problems. The first problem is that you will introduce bias into the model. The second problem is that if you unnaturally constrain the problem, you may cause labelers to give wrong results in situations you did not consider. Let’s consider a scenario to make this idea more concrete.
We will take our scenario from the previous chapter. We are building an application that takes research papers in multiple languages (English, French, German, and Russian) and classifies them by which academic department they belong to—for example, mathematics, biology, and physics. Our labeling pool consists of undergraduate and graduate students from various departments. We will be giving documents out randomly, except we will make sure that the labeler speaks the language of the research paper. This means that an undergraduate from the linguistics department who speaks English and French may get a physics paper that is in French but will never get a paper in German.
Let’s apply our first rule of thumb. The users of our product will expect a correctly assigned department tag for each document. However, there are interdisciplinary papers, so perhaps there is not a single correct answer to each document. This creates a somewhat vague boundary between correct and incorrect. We can define some simple rules to constrain the problem reasonably. Physics papers will always include something mathematical, with possible exceptions being philosophical and pedagogical papers. However, this does not mean that every paper with a physics tag should have a mathematics tag. In fact, it would be much worse to proliferate false positives than false negatives. The user of an application like this is likely searching or browsing papers. If almost every physics paper has a mathematics tag, then people looking at mathematics papers will need to wade through all the physics papers. If we do not support multiple labels, it means that interdisciplinary papers will have less discoverability. We can address the latter problem with techniques related to inter-labeler agreement and iterative labeling techniques. For now, though, we should make clear in our guidelines that labelers are not allowed to specify multiple labels. We will instead instruct labelers to pick the department that is the best match for a given document.
The second rule of thumb is about not unnaturally constraining the task. It may seem that we have already violated it by following the first rule, since we are forcing a single label onto papers that may genuinely belong to more than one department. We can start to reduce this problem by making sure that every paper is seen by more than one person. This does mean that the workload will double or more, depending on how many eyes we want on each paper.
So our guidelines will instruct our labelers to pick the best department tag for each document. It will warn them that ambiguity is possible. We also need to include examples in our guidelines. I usually like to show a couple of clear examples where the label is easy to discern and one ambiguous example. For example, include Einstein’s paper on special relativity, “On the Electrodynamics of Moving Bodies,” as a clear example of a physics paper. You want to prepare your labelers for ambiguity early, so that they are not derailed when they come across ambiguous examples.
Even when using external labelers (labelers who work for a different organization), it is good to test the guidelines in-house. I recommend getting a few people from your team and having them read the guidelines and judge a handful of examples. After this, review the examples with your product owner and, if possible, a stakeholder. Writing guidelines forces you to write down many of your assumptions. By having other people use your guidelines and evaluate the results, you get to check these assumptions.
Now that we have guidelines, let’s talk about some of the techniques that can improve our use of labels.
Where you find labelers depends on what your task is. If you are looking to gather labels for a task using public data that requires general knowledge, you can use crowd-sourcing solutions like Amazon Mechanical Turk or Figure Eight. If you need specialized knowledge, you may be able to use crowd sourcing, although it will be more expensive. If the skill is rare enough, you will likely need to seek out labelers.
If your data can’t be made public, then you may need to recruit within your own organization. Some organizations have their own full-time labelers for this purpose.
Inter-labeler agreement is the agreement between labelers. The term is also used for a metric: the proportion of examples labeled identically by different labelers. This concept has many uses in human labeling. First, we can use it to determine how well our models can realistically be expected to do. For example, if we find that in our scenario 85% of documents labeled by multiple labelers are labeled identically, then we know that a model performing at a human level on this task can be expected to reach about 85% accuracy. This does not always have to be true. If the task requires only that labelers approve of the model-based recommendation, then you may very well see a much higher accuracy. This is due to the model-based recommendation biasing the human.
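As a simple illustration, the following sketch computes that proportion-of-identical-labels metric over a toy label store; the dictionary format mapping document IDs to lists of labels is an assumption.

```python
# A minimal sketch of the simple agreement metric described above: the share of
# examples, among those labeled by more than one person, where everyone agreed.
labels_by_doc = {
    "doc1": ["physics", "physics"],
    "doc2": ["math", "physics"],
    "doc3": ["biology", "biology", "biology"],
    "doc4": ["math"],  # single-labeled docs are excluded from the metric
}

multi = {d: labs for d, labs in labels_by_doc.items() if len(labs) > 1}
agreement = sum(len(set(labs)) == 1 for labs in multi.values()) / len(multi)
print(f"inter-labeler agreement: {agreement:.0%}")  # 2 of 3 -> 67%
```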
Another use for inter-labeler agreement is to find labelers who are having difficulty with the task, or who are not putting in the effort to actually label. If you find that a labeler has a low rate of agreement with the other labelers, you may want to review their work. There can be many possible explanations for this. The following are some possible reasons:

- The labeler misunderstood the guidelines.
- The guidelines do not cover the kinds of examples the labeler is seeing.
- The labeler understands the task but is coming to different conclusions than the other labelers.
- The labeler is not genuinely attempting the task (bad intent).
You should probably rule out other explanations before jumping to bad intent. If it is one of the first two explanations, you can tune your guidelines appropriately. If it is the third explanation, then you should consider whether their conclusions are valid for your product. If so, then the labels are fine, but the problem may be more difficult than you originally thought. If the conclusions are not valid, then you should put guidance about these kinds of examples in your guidelines. If you think it is due to bad intent, then you should discard these labels because they will add noise to your data.
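One way to surface such labelers is to compute, for each labeler, how often they agree with the majority of the other labelers on the documents they share. The following is a minimal sketch; the record format and the review threshold are assumptions made for illustration.

```python
# Flag labelers whose agreement with the majority of their co-labelers is low.
from collections import Counter, defaultdict

records = [  # (doc_id, labeler_id, label) -- toy data
    ("doc1", "ann", "physics"), ("doc1", "bob", "physics"), ("doc1", "cara", "math"),
    ("doc2", "ann", "math"),    ("doc2", "bob", "math"),    ("doc2", "cara", "physics"),
    ("doc3", "ann", "biology"), ("doc3", "bob", "biology"), ("doc3", "cara", "biology"),
]

by_doc = defaultdict(dict)
for doc, labeler, label in records:
    by_doc[doc][labeler] = label

agree, total = Counter(), Counter()
for answers in by_doc.values():
    for labeler, label in answers.items():
        others = [lab for other, lab in answers.items() if other != labeler]
        if not others:
            continue  # cannot compare a labeler who has no co-labelers
        total[labeler] += 1
        if label == Counter(others).most_common(1)[0][0]:
            agree[labeler] += 1

for labeler in total:
    rate = agree[labeler] / total[labeler]
    flag = "  <- low agreement, review their work" if rate < 0.5 else ""
    print(f"{labeler}: {rate:.0%}{flag}")
```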
You can also measure labeling quality using a golden set of validated-label examples that you mix into your unlabeled data. This golden set can come from a public data set that is similar to your data, or you can hand-curate it yourself. This will let you find labelers who are producing potentially problematic labels, even if you do not show examples to multiple labelers. Remember that these validated labels may still be based on your assumptions, so if your assumptions are wrong it may falsely appear that labelers are producing incorrect labels.
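A minimal sketch of this kind of quality check might look like the following, where each labeler is scored only on the golden items that appear in their queue; the golden labels and the record format here are made up.

```python
# Score labelers against golden items mixed into their ordinary work.
from collections import defaultdict

golden = {"doc7": "physics", "doc9": "math"}  # doc_id -> validated label
submissions = [  # (labeler_id, doc_id, label) -- golden and ordinary docs mixed
    ("ann", "doc7", "physics"), ("ann", "doc8", "biology"), ("ann", "doc9", "math"),
    ("bob", "doc7", "math"),    ("bob", "doc9", "math"),
]

correct, seen = defaultdict(int), defaultdict(int)
for labeler, doc, label in submissions:
    if doc in golden:  # only golden items count toward the quality score
        seen[labeler] += 1
        correct[labeler] += (label == golden[doc])

for labeler in seen:
    print(f"{labeler}: {correct[labeler] / seen[labeler]:.0%} on the golden set")
```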
Perhaps the most helpful use case for inter-labeler agreement is finding ambiguous examples. Once you have reviewed the inter-labeler agreement and believe the labels are of good quality, you can treat differing labels on an example as an indication that it is ambiguous. First, you should find out how prevalent this is. If a quarter of the research papers receive differing labels, then you may want to treat this as a multilabel problem rather than a multiclass problem. If only a small number of documents receive differing labels, you can simply take the label with majority support, picking one at random in case of a tie (see the sketch below). Alternatively, you can keep the multiple labels in your validation and hold-out sets. This keeps your training data consistent while not penalizing the model for predicting a valid alternative. Another technique we can use to deal with ambiguity in labels is iterative labeling. This lets us use labelers to anonymously check each other’s work.
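Before moving on to iterative labeling, here is a minimal sketch of the majority-vote-with-random-tie-break resolution mentioned above (the function name is hypothetical).

```python
# Collapse multiple labels into one training label by majority vote,
# breaking ties at random.
import random
from collections import Counter

def resolve(labels, rng=random):
    counts = Counter(labels)
    top = max(counts.values())
    return rng.choice([label for label, c in counts.items() if c == top])

print(resolve(["physics", "physics", "math"]))  # physics
print(resolve(["physics", "math"]))             # physics or math, at random
```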
Iterative labeling can be used to improve the quality of labels without increasing your workload much. The idea is that you break your labeling task into at least two steps. The first step is for the labelers to quickly assign an appropriate label, with the understanding that there may be errors due to ambiguity and perhaps to lack of domain expertise. Then you have another labeler with more expertise validate or invalidate the label. Let’s see what this would look like in our scenario.
In the first task, a research paper is sent to an undergraduate labeler who knows the language of the paper. The paper is then sent to a graduate student who also knows the language and who is in the department assigned by the first labeler. The second labeler, the graduate student, only validates the first labeler’s work. This approach has pros and cons. The workload on graduate students, who may be more expensive, is lighter, which saves us money, and a research paper assigned to a department is reviewed by someone in that department. The con is that it requires each department-language pair to have a graduate student, which may not be possible. You can ease this requirement by allowing graduate students to volunteer to represent departments they are familiar with. For example, the physics graduate who speaks Russian might volunteer to cover mathematics in Russian as well.
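A small sketch of this routing step, assuming a registry of graduate validators keyed by (department, language), might look like the following; the registry and the names are invented for illustration.

```python
# Route a first-pass label to a second-pass validator by department and language.
validators = {
    ("physics", "ru"): "grad_ivan",
    ("physics", "en"): "grad_alice",
    ("mathematics", "ru"): "grad_ivan",  # a volunteer covering a second department
}

def route_second_pass(first_pass_label, language):
    """Return the graduate validator for this department/language pair, if any."""
    return validators.get((first_pass_label, language))

task = {"doc_id": "doc42", "language": "ru", "first_pass_label": "physics"}
print(route_second_pass(task["first_pass_label"], task["language"]))  # grad_ivan
```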
Iterative labeling can also be used to simplify complex tasks into a sequence of simple tasks. This is especially useful in text-related labeling. Let’s look at some of the special considerations of text-related labeling.
Most of what we have covered so far applies to labeling in general. Let’s look at some special considerations we should keep in mind when labeling text.
You should be mindful of the size of the documents you are classifying. If you are classifying tweets or similar small pieces of text, your labelers should be able to work through tasks quickly. If the texts are larger, as in our scenario, there is the danger of labeler fatigue. Labeler fatigue occurs when the individual task (classifying a document) is very time-consuming and the labeler becomes less attentive after many tasks. It can be argued that humans are naturally lazy; after all, that is why we built computers to do things for us. This means that labelers will, sometimes unintentionally, find shortcuts for the tasks—for example, searching for specific words in the document. These labels will be of poor quality. If you want to make the task smaller, you can do so in two ways. First, have the labelers classify the abstracts. This means that the labelers have less information for the task, but they will also get through tasks more quickly. The other possibility is to use the guidelines to advise labelers not to spend too much time on an individual task. With the latter approach, you should definitely try to get the documents in front of multiple people.
The second kind of text-labeling task is tagging. This is where you ask labelers to identify pieces of the text. Finding the named entities in a document would be an example of this. In our scenario, we might use this to find technical words in the document. These could then be used to build a concept extraction annotator, which is fed downstream to the classification model. If your documents are longer than a few sentences, individual tasks can become extremely laborious. If you are doing something like named-entity recognition, you should consider breaking the documents into sentences and making the task identifying entities in sentences instead of in whole documents. Another caveat to consider with this kind of task is that it may require linguistics knowledge. For example, let’s say that we will also be accepting papers written in Polish. The other languages are supported by a processing pipeline that includes a part-of-speech tagger, but we have no such model for Polish. Identifying parts of speech may not be general knowledge, so you will need to find people who not only speak Polish but also know the technicalities of Polish grammar. Some Polish speakers will have this knowledge, but you should specify this requirement when you are looking for labelers.
Consider these questions about your project:
Guidelines checklist
Inter-labeler agreement checklist
Iterative labeling checklist
Labeling text checklist
Gathering labels is a valuable skill needed for any application that can be helped by measuring human judgment. Being able to create your own labeled data can make an otherwise impossible task possible.
Now that we have talked about gathering labels, let’s look at what we should do to release our application.