Sentence segmentation is the process of splitting a text into sentences. The definition makes it sound straightforward, but several difficulties arise in practice. One example is a punctuation mark that can indicate different things depending on context:
Streamsets Inc. has released the new Data Collector 3.5.0. One of the new features is the MongoDB lookup processor.
In the preceding text, the same punctuation mark (.) serves three different purposes: it closes an abbreviation (Inc.), separates the parts of a version number (3.5.0), and ends a sentence. Some languages, such as Chinese, have unambiguous sentence-ending markers, while others don't, so a disambiguation strategy is needed. The quick-and-dirty approach to locating the end of a sentence in a case like the previous example is the following:
- If the punctuation mark is a full stop, then it ends a sentence
- If the token preceding a full stop is present in a hand-precompiled list of abbreviations, then the full stop doesn't end the sentence
- If the next token after a full stop is capitalized, then the full stop ends the sentence
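The three rules above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text is tokenized on whitespace, the `ABBREVIATIONS` set is a hypothetical hand-compiled list, and the rules are applied in the order listed, so a later rule overrides an earlier one. The function name `split_sentences` is likewise made up for this example.

```python
# Hypothetical hand-compiled abbreviation list (an assumption for this sketch),
# stored lowercase and without the trailing full stop.
ABBREVIATIONS = {"inc", "ltd", "corp", "mr", "mrs", "dr", "etc"}


def split_sentences(text):
    """Split text into sentences using the three-rule heuristic."""
    tokens = text.split()  # naive whitespace tokenization
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        # Rule 1: a full stop ends a sentence.
        ends_sentence = True
        # Rule 2: unless the token before the full stop is a known abbreviation.
        if token[:-1].lower() in ABBREVIATIONS:
            ends_sentence = False
        # Rule 3: a capitalized next token means the full stop ends the sentence.
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is not None and next_token[0].isupper():
            ends_sentence = True
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a final full stop
        sentences.append(" ".join(current))
    return sentences


text = ("Streamsets Inc. has released the new Data Collector 3.5.0. "
        "One of the new features is the MongoDB lookup processor.")
for sentence in split_sentences(text):
    print(sentence)
```

On the example text, this sketch yields two sentences: the full stop after "Inc" is skipped because of the abbreviation list, and the dots inside "3.5.0" are never considered because they are not token-final. Under this reading of the rules, an abbreviation followed by a capitalized word (for example, "Mr. Smith") is still split incorrectly, which is one reason the heuristic falls short of the learned approaches.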
This heuristic gets more than 90% of sentences correct, but smarter approaches exist. Rule-based boundary disambiguation techniques automatically learn a set of rules from input documents where the sentence breaks have been pre-marked. Better still, a neural network can achieve more than 98% accuracy.