In NLP ML algorithms, tokenization is essentially the task of defining what a word is. Given a text, tokenization cuts it into pieces, called tokens, while at the same time removing particular characters (such as punctuation or delimiters). For example, given this input sentence in the English language:

To be, or not to be, that is the question
The result of tokenization would produce the following 10 tokens:

To be or not to be that is the question
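As a rough sketch, the white-space-plus-punctuation rule described above can be written in a few lines of Python (the simple_tokenize function and its regular expression are illustrative choices, not taken from any specific library):

```python
import re

def simple_tokenize(text):
    # Remove punctuation characters, then split on runs of white space.
    # Lower-casing is skipped here so the output matches the tokens listed above.
    cleaned = re.sub(r"[^\w\s]", "", text)
    return cleaned.split()

sentence = "To be, or not to be, that is the question"
print(simple_tokenize(sentence))
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```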
One big challenge with tokenization is deciding what the correct tokens to use are. In the previous example, it was easy to decide: we split on white spaces and removed all the punctuation characters. But what if the input text isn't in English? For some other languages, such as Chinese, where there are no white spaces between words, the preceding rules don't work. So, any ML/DL model trained for NLP should take into account the specific rules of the language.
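The following minimal sketch shows the problem: plain white-space splitting works reasonably on the English line, but it returns the whole Chinese sentence as a single token (the Chinese text is one common translation of the same Hamlet line, used here purely for illustration):

```python
english = "To be, or not to be, that is the question"
chinese = "生存还是毁灭，这是一个问题"  # a common Chinese rendering of the same line

print(english.split())  # ten pieces, though punctuation is still stuck to some of them
print(chinese.split())  # ['生存还是毁灭，这是一个问题'], a single unsegmented token
```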
But even when limited to a single language, let's say English, there can be tricky cases. Consider, for example, a sentence containing the name O'Leary. How do you manage the apostrophe? There are five possible tokenizations in this case for O'Leary. These are as follows (a short sketch after the list shows how different splitting rules produce most of them):
- leary
- oleary
- o'leary
- o' leary
- o leary
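Here is a small, purely illustrative sketch (the rule names and patterns are made up for this example) showing how a few different splitting rules land on most of the variants listed above:

```python
import re

name = "O'Leary"

# Each rule below produces one of the candidate tokenizations (lower-cased).
# The bare "leary" variant would correspond to dropping the o' prefix entirely.
strategies = {
    "keep the apostrophe":         lambda s: [s.lower()],                      # o'leary
    "drop the apostrophe":         lambda s: [s.replace("'", "").lower()],     # oleary
    "split on the apostrophe":     lambda s: s.lower().split("'"),             # o, leary
    "keep apostrophe with prefix": lambda s: re.findall(r"\w+'?", s.lower()),  # o', leary
}

for rule, tokenize in strategies.items():
    print(f"{rule:30s} -> {tokenize(name)}")
```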
But which one is the desired one? A simple strategy that quickly comes to mind is to just split on all the non-alphanumeric characters in the sentence. Getting the o and leary tokens would then be acceptable, because a Boolean query search with those tokens would match three cases out of five. But what about a sentence containing the word aren't? For aren't, there are four possible tokenizations, which are as follows:
- aren't
- arent
- are n't
- aren t
Again, while the o and leary split looks fine, what about the aren and t split? This last one doesn't look good; a Boolean query search with those tokens would match only two of the four cases.
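To make the comparison concrete, the following sketch applies the same split-on-non-alphanumeric strategy to both cases and counts how many of the candidate variants a Boolean AND query over the resulting tokens would match, assuming the variants themselves are indexed with the same rule (the split_non_alnum and matches helpers are hypothetical, written just for this illustration):

```python
import re

def split_non_alnum(text):
    # The simple strategy discussed above: split on every non-alphanumeric character.
    return [t for t in re.split(r"[^0-9a-zA-Z]+", text.lower()) if t]

print(split_non_alnum("O'Leary"))  # ['o', 'leary']
print(split_non_alnum("aren't"))   # ['aren', 't']

def matches(query_tokens, variants):
    # Toy Boolean AND: a variant matches if it contains every query token.
    return [v for v in variants if set(query_tokens) <= set(split_non_alnum(v))]

print(len(matches(["o", "leary"], ["leary", "oleary", "o'leary", "o' leary", "o leary"])))  # 3
print(len(matches(["aren", "t"], ["aren't", "arent", "are n't", "aren t"])))                # 2
```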
Challenges and issues with tokenization are language-specific. A deep knowledge of the language of the input documents is required in this context.