Removing stop words and punctuation

Stop words are words that don't add much information to a sentence. For example, the previous sentence can be shortened to: stop words don't add useful information sentence. Although that doesn't look like a proper English sentence, you'd likely still understand its meaning if you heard it somewhere. That's why in many cases we can simplify our models by ignoring these words. Stop words are usually the most common words in natural texts. For English, a list of them can be found in nltk.corpus.stopwords:
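To make the idea concrete, here's a minimal sketch of dropping stop words from that example sentence, with a tiny hand-written stop-word set standing in for the much longer NLTK list:

```python
# A tiny illustrative stop-word set; the real list in
# nltk.corpus.stopwords is much longer.
demo_stop = {'are', 'all', 'those', 'that', "don't", 'much', 'to', 'the'}

sentence = "stop words are all those words that don't add much information to the sentence"
content_words = [w for w in sentence.split() if w not in demo_stop]
print(' '.join(content_words))
```

The surviving tokens still convey the gist of the sentence, which is exactly why the model can afford to discard the rest.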

In [32]: 
sentences_to_train_on = [[word for (word, pos) in words] for words in lowercased_pos_sentences] 
In [33]: 
print(sentences_to_train_on[203:205]) 
[[u'everybody', u'wa', u'going', u'to', u'the', u'famous', u'Paris', u'Exposition', u'--', u'i', u',', u'too', u',', u'wa', u'going', u'to', u'the', u'Paris', u'Exposition', u'.'], [u'the', u'steamship', u'line', u'were', u'carrying', u'Americans', u'out', u'of', u'the', u'various', u'port', u'of', u'the', u'country', u'at', u'the', u'rate', u'of', u'four', u'or', u'five', u'thousand', u'a', u'week', u'in', u'the', u'aggregate', u'.']] 
In [34]: 
import itertools 
In [35]: 
filtered = map(filter_meaningful, lowercased_pos_sentences) 
flatten = list(itertools.chain(*filtered)) 
words_to_keep = set(flatten) 
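filter_meaningful was defined earlier in the chapter; as a rough stand-in, assuming it keeps only words whose POS tag marks a content word (a hypothetical reconstruction, not the book's exact definition), it might look like:

```python
# Hypothetical stand-in for filter_meaningful: keep only words whose
# Penn Treebank POS tag prefix marks a content word.
MEANINGFUL_TAG_PREFIXES = ('NN', 'VB', 'JJ', 'RB')  # nouns, verbs, adjectives, adverbs

def filter_meaningful(tagged_words):
    return [word for (word, pos) in tagged_words
            if pos.startswith(MEANINGFUL_TAG_PREFIXES)]

sample = [('the', 'DT'), ('steamship', 'NN'), ('line', 'NN'), ('were', 'VBD')]
print(filter_meaningful(sample))
```

Whatever its exact definition, it maps each tagged sentence to the subset of words worth keeping, which is why the set built from its flattened output is called words_to_keep.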
In [36]: 
del filtered, flatten, lowercased_pos_sentences 
In [37]: 
from nltk.corpus import stopwords 
import string 
In [38]: 
stop_words = set(stopwords.words('english') + list(string.punctuation) + ['wa'])
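With the set assembled, each token list can be filtered the same way as the earlier example. A minimal sketch, using a short hand-written stand-in for the NLTK English list so it runs without the corpus download:

```python
import string

# Illustrative stand-in for stopwords.words('english'); the real list
# is much longer.
english_stop_words = ['the', 'of', 'to', 'a', 'at', 'or', 'in', 'out', 'were']

# Same construction as above: stop words, punctuation, and the
# lemmatizer artifact 'wa' (the lemma WordNet produces for 'was').
demo_stop_words = set(english_stop_words + list(string.punctuation) + ['wa'])

tokens = ['the', 'steamship', 'line', 'were', 'carrying', 'americans',
          'out', 'of', 'the', 'various', 'port', '.']
print([t for t in tokens if t not in demo_stop_words])
```

Note the extra 'wa' entry: it isn't an English stop word, but the lemmatized corpus contains it wherever 'was' appeared (visible in the output of In [33] above), so it is added to the set to be discarded along with the rest.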