Stemming with LingPipe

LingPipe provides the PorterStemmerTokenizerFactory class for finding stems. In this example, we will use the same words array as in the Using the Porter Stemmer section. The IndoEuropeanTokenizerFactory class performs the initial tokenization, and the PorterStemmerTokenizerFactory class then applies the Porter Stemmer to each token. Both classes are found in the com.aliasi.tokenizer package. The two factories are created as follows:

TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory porterFactory =
    new PorterStemmerTokenizerFactory(tokenizerFactory);

Each word is processed individually. A Tokenization object tokenizes the word using porterFactory, and its tokens method returns the resulting stems, which are held in the stems array, as shown in the following code block. Each word is then displayed along with its stem:

for (int i = 0; i < words.length; i++) {
    // Tokenize a single word; the Porter factory stems each token
    Tokenization tokenization = new Tokenization(words[i], porterFactory);
    // tokens() returns a new array, so none is allocated beforehand
    String[] stems = tokenization.tokens();
    System.out.print("Word: " + words[i]);
    for (String stem : stems) {
        System.out.println("  Stem: " + stem);
    }
}

When executed, we get the following output:

Word: bank  Stem: bank
Word: banking  Stem: bank
Word: banks  Stem: bank
Word: banker  Stem: banker
Word: banked  Stem: bank
Word: bankart  Stem: bankart  
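
Notice that banker is left unchanged while banked is reduced to bank. One likely explanation, assuming LingPipe follows the standard Porter rules, is that the rule removing the -er suffix only applies when the remaining stem has a measure greater than 1, where the measure counts the vowel-to-consonant sequences in the stem. The following sketch is our own illustration and is not part of LingPipe; the PorterMeasureSketch class and its methods are hypothetical names used only to show how the measure can be computed:

```java
class PorterMeasureSketch {

    // A letter is a vowel if it is a, e, i, o, or u, or if it is
    // a 'y' that is preceded by a consonant
    static boolean isVowel(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) {
            return true;
        }
        return c == 'y' && i > 0 && !isVowel(s, i - 1);
    }

    // Counts the vowel-to-consonant transitions in the stem,
    // which is the measure m used by the Porter rules
    static int measure(String stem) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean vowel = isVowel(stem, i);
            if (previousWasVowel && !vowel) {
                m++;
            }
            previousWasVowel = vowel;
        }
        return m;
    }

    public static void main(String[] args) {
        // "bank" is what would remain if -er were removed from "banker"
        System.out.println("bank: m = " + measure("bank"));
        // A stem with two vowel-to-consonant sequences, for contrast
        System.out.println("airlin: m = " + measure("airlin"));
    }
}
```

The stem bank has a measure of 1, so a rule that requires a measure greater than 1 would not fire, which is consistent with banker being returned as is.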

We have now demonstrated the Porter Stemmer using both OpenNLP and LingPipe. It is worth noting that other types of stemmers are available, including n-gram stemmers and various mixed probabilistic/algorithmic approaches.
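
To give a feel for the n-gram idea, the following is a minimal sketch that compares two words by the character bigrams they share, using the Dice coefficient as the similarity score. It does not use OpenNLP or LingPipe, and the NgramSimilaritySketch class, its method names, and the choice of the Dice coefficient are our own assumptions made for illustration. An n-gram approach would typically group words whose similarity exceeds some threshold rather than strip suffixes:

```java
import java.util.HashSet;
import java.util.Set;

class NgramSimilaritySketch {

    // Collects the distinct character n-grams of a word
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2 * |shared n-grams| / (|A| + |B|)
    static double dice(String a, String b, int n) {
        Set<String> gramsA = ngrams(a, n);
        Set<String> gramsB = ngrams(b, n);
        if (gramsA.isEmpty() && gramsB.isEmpty()) {
            return 0.0; // both words are shorter than n
        }
        Set<String> shared = new HashSet<>(gramsA);
        shared.retainAll(gramsB);
        return 2.0 * shared.size() / (gramsA.size() + gramsB.size());
    }

    public static void main(String[] args) {
        System.out.println("banking/banks: " + dice("banking", "banks", 2));
        System.out.println("banking/bankart: " + dice("banking", "bankart", 2));
    }
}
```

Here banking and banks share three of their bigrams (ba, an, and nk). Whether two words are treated as the same term depends on the threshold chosen, which is a tuning decision this sketch leaves open.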