Tokenizers

In NLP and ML, tokenization is the process that defines what counts as a word. Given a text, tokenization is the task of chopping it up into pieces, called tokens, while at the same time removing certain characters, such as punctuation or delimiters. For example, given this input sentence in English:

To be, or not to be, that is the question

Tokenization would produce the following 10 tokens:

To be or not to be that is the question
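As a minimal sketch (not tied to any particular library), this kind of tokenization can be approximated in Python with a regular expression that keeps runs of alphanumeric characters and discards everything else; the function name simple_tokenize is just for illustration:

    import re

    def simple_tokenize(text):
        # Keep runs of alphanumeric characters as tokens; commas and other
        # punctuation fall away because they never match \w+.
        return re.findall(r"\w+", text)

    print(simple_tokenize("To be, or not to be, that is the question"))
    # ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']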

One big challenge with tokenization is deciding which tokens are the correct ones to use. In the previous example, the decision was easy: we split on white space and removed all the punctuation characters. But what if the input text isn't in English? In some other languages, such as Chinese, there is no white space between words, so the preceding rules don't work. Any ML/DL model trained for NLP therefore has to take the specific rules of the language into account.
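To make the white-space problem concrete, here is a small illustrative snippet (the Chinese sentence is my own example, not from the text): plain white-space splitting works reasonably for the English line above but returns the whole Chinese sentence as a single token, so a language-aware segmenter would be needed instead.

    english = "To be, or not to be, that is the question"
    chinese = "我爱自然语言处理"  # "I love natural language processing"

    # White-space splitting: one chunk per space-delimited piece
    # (punctuation is still attached to some chunks).
    print(english.split())
    # ['To', 'be,', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question']

    # There are no spaces to split on, so the whole sentence comes back
    # as a single, unsegmented token.
    print(chinese.split())
    # ['我爱自然语言处理']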

But even within a single language, say English, there are tricky cases. Consider the following example sentence:

David Anthony O'Leary is an Irish football manager and former player

How do you handle the apostrophe? There are five possible tokenizations of O'Leary in this case:

  1. leary
  2. oleary
  3. o'leary
  4. o' leary
  5. o leary

But which one is the desired one? A simple strategy that quickly comes to mind is to split on all the non-alphanumeric characters in a sentence (a short code sketch of this split follows this discussion). Producing the o and leary tokens would then be acceptable, because a Boolean query search with those tokens would match three cases out of five. But what about the following sentence?

Michael O'Leary has slammed striking cabin crew at Aer Lingus saying “they aren't being treated like Siberian salt miners”.

For aren't, there are four possible tokenizations, which are as follows:

  1. aren't
  2. arent
  3. are n't
  4. aren t

Again, while the o and leary split looks fine, what about the aren and t split? This one doesn't look good: a Boolean query search with those tokens would match only two cases out of four.
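As a quick sketch of the naive strategy discussed above (the helper name split_non_alphanumeric is mine), lowercasing and splitting on every run of non-alphanumeric characters produces exactly the o/leary and aren/t pairs being compared here:

    import re

    def split_non_alphanumeric(text):
        # Lowercase, then split on every run of non-alphanumeric characters,
        # dropping any empty strings left at the edges.
        return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

    print(split_non_alphanumeric("O'Leary"))  # ['o', 'leary']
    print(split_non_alphanumeric("aren't"))   # ['aren', 't']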

Challenges and issues with tokenization are language-specific: deep knowledge of the language of the input documents is required.