LingPipe provides the PorterStemmerTokenizerFactory class for finding stems. In this example, we will use the same words array as in the Using the Porter Stemmer section. The IndoEuropeanTokenizerFactory class performs the initial tokenization, and the Porter Stemmer is then applied to each token. The two factories are created here:
// The LingPipe classes used in this example are in the com.aliasi.tokenizer package
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory porterFactory = new PorterStemmerTokenizerFactory(tokenizerFactory);
An array to hold the stems is declared next. Each word in the words array is processed individually: the word is tokenized with porterFactory and the resulting stems are stored in stems, as shown in the following code block. Each word is then displayed together with its stem:
String[] stems = new String[words.length];
for (int i = 0; i < words.length; i++) {
    // Tokenize the word; the Porter factory stems each token it produces
    Tokenization tokenizer = new Tokenization(words[i], porterFactory);
    stems = tokenizer.tokens();
    System.out.print("Word: " + words[i]);
    for (String stem : stems) {
        System.out.println(" Stem: " + stem);
    }
}
When executed, we get the following output:
Word: bank Stem: bank
Word: banking Stem: bank
Word: banks Stem: bank
Word: banker Stem: banker
Word: banked Stem: bank
Word: bankart Stem: bankart
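The pattern used in this example, a base tokenizer wrapped by a step that rewrites each token, can be sketched without LingPipe. The following is only an illustration under stated assumptions: the class StemmingPipelineSketch and its methods tokenize, toyStem, and withModifier are hypothetical names, and toyStem is a toy suffix stripper, not the Porter algorithm and not LingPipe's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch of "tokenize first, then stem each token".
final class StemmingPipelineSketch {

    // Stand-in for a base tokenizer: splits the text on non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Toy suffix stripper (assumption: NOT the Porter algorithm).
    // Removes one of a few suffixes if at least three characters remain.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Wraps a base tokenizer so that every token it produces is passed
    // through the modifier before being returned.
    static Function<String, List<String>> withModifier(
            Function<String, List<String>> base, UnaryOperator<String> modifier) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : base.apply(text)) {
                out.add(modifier.apply(token));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Function<String, List<String>> stemmingTokenizer =
                withModifier(StemmingPipelineSketch::tokenize, StemmingPipelineSketch::toyStem);
        String[] words = {"bank", "banking", "banks", "banker", "banked", "bankart"};
        for (String word : words) {
            System.out.println("Word: " + word + " Stem: " + stemmingTokenizer.apply(word));
        }
    }
}
```

Keeping the tokenizer and the token modifier separate means either one can be swapped without touching the other, which is the same reason the LingPipe example passes one factory to the constructor of another.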
We have demonstrated the Porter Stemmer using OpenNLP and LingPipe examples. Other types of stemmers are also available, including n-gram stemmers and various mixed probabilistic/algorithmic approaches.
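To give a rough idea of the n-gram approach, the following sketch scores how related two words are by the character bigrams they share, using the Dice coefficient. It is one simple formulation chosen for illustration, not a complete stemmer; the class BigramSimilaritySketch and its methods bigrams and dice are hypothetical names, and any threshold for treating two words as related would have to be chosen separately.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: compare words by their shared character bigrams.
final class BigramSimilaritySketch {

    // Returns the set of distinct two-character substrings of the word.
    static Set<String> bigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Dice coefficient: 2 * |shared| / (|A| + |B|), between 0.0 and 1.0.
    // Words too short to have any bigram are given a score of 0.0.
    static double dice(String a, String b) {
        Set<String> gramsA = bigrams(a);
        Set<String> gramsB = bigrams(b);
        if (gramsA.isEmpty() || gramsB.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(gramsA);
        shared.retainAll(gramsB);
        return 2.0 * shared.size() / (gramsA.size() + gramsB.size());
    }

    public static void main(String[] args) {
        String[] words = {"banking", "banks", "banker", "bark"};
        for (String word : words) {
            System.out.println("bank vs " + word + ": " + dice("bank", word));
        }
    }
}
```

Unlike the Porter Stemmer, this approach does not produce a stem at all; it only groups words that look alike, so it needs no language-specific suffix rules.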