Techniques for name recognition

Several NER techniques are available. Some use regular expressions and others rely on a predefined dictionary. Regular expressions are expressive and can isolate entities that follow a predictable format, such as phone numbers or email addresses. A dictionary of entity names can be compared against the tokens of a text to find matches.
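As a minimal sketch of these two techniques, the following Python example applies a pair of regular expressions and a small dictionary lookup to one sentence. The patterns, the dictionary contents, and the function names are hypothetical and chosen only for illustration; realistic patterns and dictionaries would be far larger and more tolerant.

```python
import re

# Hypothetical patterns for illustration; real patterns must handle more formats.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical dictionary mapping lowercase entity names to entity types.
DICTIONARY = {"london": "LOCATION", "paris": "LOCATION", "acme": "ORGANIZATION"}


def regex_entities(text):
    """Return (type, matched text, start, end) for every pattern match."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    # Report the entities in the order they appear in the text.
    return sorted(found, key=lambda entity: entity[2])


def dictionary_entities(text):
    """Compare each word token against the dictionary of entity names."""
    found = []
    for match in re.finditer(r"\w+", text):
        label = DICTIONARY.get(match.group().lower())
        if label:
            found.append((label, match.group(), match.start(), match.end()))
    return found


text = "Contact jane.doe@example.com or 555-123-4567 at the Acme office in London."
for entity in regex_entities(text) + dictionary_entities(text):
    print(entity)
```

The regular expressions find entities by their shape, while the dictionary finds only names it already contains. Neither approach uses context, so a dictionary entry such as Paris is tagged as a location even when it refers to a person.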

Another common NER approach uses trained models to detect entities. These models depend on both the type of entity we are looking for and the target language. A model that works well for one domain, such as web pages, may not work well for a different domain, such as medical journals.

A model is trained on an annotated block of text in which the entities of interest are marked. Several measures indicate how well a model has been trained:

Precision: The percentage of entities found by the model that exactly match the spans marked in the evaluation data
Recall: The percentage of entities defined in the corpus that the model found in the same location
Performance measure (F1 score): The harmonic mean of precision and recall, given by F1 = 2 * Precision * Recall / (Recall + Precision)

We will use these measures when we cover the evaluation of models.
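To make the three measures concrete, here is a small Python sketch that computes them from two sets of spans. It assumes each entity is represented as a (start, end, type) tuple and that only exact matches count; the function name and the sample spans are made up for illustration.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact-match entity spans."""
    predicted, gold = set(predicted), set(gold)
    # A prediction is correct only if its offsets and type match exactly.
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotated spans (gold) and model output (predicted).
gold = {(0, 4, "PERSON"), (14, 20, "LOCATION"),
        (25, 31, "LOCATION"), (40, 44, "ORGANIZATION")}
predicted = {(0, 4, "PERSON"), (14, 20, "LOCATION"), (33, 38, "PERSON")}

precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted entities are correct, so precision is 2/3, and two of the four annotated entities were found, so recall is 1/2. The harmonic mean of these values is 4/7, or about 0.571.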

NER is also known as entity identification and entity chunking. Chunking is the analysis of text to identify its parts, such as nouns, verbs, or other components. As humans, we tend to chunk a sentence into distinct parts. These parts form a structure that we use to determine the sentence's meaning. The NER process will create spans of text such as Queen of England. However, there may be other entities within these spans, such as England.
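The nesting of spans can be illustrated with character offsets. The following Python sketch assumes a span is a (start, end) pair with an exclusive end offset; the helper names find_span and contains are hypothetical.

```python
def find_span(text, phrase):
    """Return the (start, end) offsets of the first occurrence of phrase, or None."""
    start = text.find(phrase)
    if start == -1:
        return None
    return (start, start + len(phrase))


def contains(outer, inner):
    """Return True if the inner span lies entirely within the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


text = "The Queen of England visited Wales."
outer = find_span(text, "Queen of England")
inner = find_span(text, "England")

print(outer, inner, contains(outer, inner))
```

A system that reports only the outer span loses the location England, while one that reports both must decide how to represent overlapping entities.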

NER systems are built using different techniques and can be categorized as follows:

The rule-based approach uses rules crafted by a domain expert to recognize entities. A rule-based system parses the text and generates a parse tree or some other abstract representation. It can be a list-based lookup, where a bag of words is used, or a linguistic approach, which requires deep knowledge of how entities are identified.
The machine learning approach uses pattern-based learning with statistical models, in which the nouns are identified and classified. Machine learning can in turn be divided into three types:
- Supervised learning uses labeled data to build a model
- Semi-supervised learning uses labeled data together with other information, typically unlabeled data, to build a model
- Unsupervised learning uses unlabeled data and learns from the input alone
NE extraction is normally used for extracting data from web pages. It not only learns from the data, but also builds a list of entity names that can be used for NER.
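One way to picture how a small amount of labeled data can grow into a list of names is a simple bootstrapping loop. The sketch below is an illustrative assumption rather than a description of any particular system: it starts from a few seed names, learns which words tend to precede them, and then accepts any capitalized word that follows such a context word. The function names, the seed names, and the sentences are all made up.

```python
import re
from collections import Counter


def word_pairs(sentence):
    """Yield (previous word, word) pairs from a sentence."""
    words = re.findall(r"\w+", sentence)
    return zip(words, words[1:])


def bootstrap(sentences, seeds, min_count=2):
    """Grow a list of names from seed names and unlabeled sentences."""
    # Step 1: count the words that immediately precede a known seed name.
    contexts = Counter()
    for sentence in sentences:
        for previous, word in word_pairs(sentence):
            if word in seeds:
                contexts[previous.lower()] += 1
    trusted = {word for word, count in contexts.items() if count >= min_count}

    # Step 2: a capitalized word after a trusted context word joins the list.
    learned = set(seeds)
    for sentence in sentences:
        for previous, word in word_pairs(sentence):
            if previous.lower() in trusted and word[0].isupper():
                learned.add(word)
    return learned


seeds = {"Paris", "London"}  # the small labeled set
sentences = [                # unlabeled text
    "She lives in Paris with her family.",
    "He works in London as an editor.",
    "Their cousin studied in Oslo for a year.",
    "The meeting was moved to Friday.",
]
print(sorted(bootstrap(sentences, seeds)))
```

With these sentences the word "in" is learned as a context, so Oslo is added to the list while Friday is not. A context this simple also admits noise (for example, a month name after "in"), which is why practical systems score and filter the names they learn.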