12.1 (Web Scraping with the Requests and Beautiful Soup Libraries) Web pages are excellent sources of text to use in NLP tasks. In the following IPython session, you'll use the requests library to download the www.python.org home page's content. This is called web scraping. You'll then use the Beautiful Soup library to extract only the text from the page. Eliminate the stop words in the resulting text, then use the wordcloud module to create a word cloud based on the text.
In [1]: import requests
In [2]: response = requests.get('https://www.python.org')
In [3]: response.content # gives back the page's HTML
In [4]: from bs4 import BeautifulSoup
In [5]: soup = BeautifulSoup(response.content, 'html5lib')
In [6]: text = soup.get_text(strip=True) # text without tags
In the preceding code, snippets [1]–[3] get a web page. The get function receives a URL as an argument and returns the corresponding web page as a Response object. The Response's content property contains the web page's content. Snippets [4]–[6] get only the web page's text. Snippet [5] creates a BeautifulSoup object to process the text in response.content. BeautifulSoup method get_text with the keyword argument strip=True returns just the text of the web page, without the structural information your web browser uses to display the page.
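To complete the exercise, you might continue the session by removing stop words and generating the word cloud. The following is a minimal sketch, assuming the NLTK stopwords corpus has already been downloaded (via nltk.download('stopwords')); the output file name pythonorg.png is just an example:

from nltk.corpus import stopwords
from wordcloud import WordCloud

stop_words = set(stopwords.words('english'))

# keep only the words that are not stop words
filtered_text = ' '.join(
    word for word in text.split() if word.lower() not in stop_words)

# generate a word cloud from the filtered text and save it to a file
wordcloud = WordCloud(width=1000, height=600, colormap='prism',
    background_color='white')
wordcloud = wordcloud.generate(filtered_text)
wordcloud.to_file('pythonorg.png')  # example output file name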
12.2 (Tokenizing Text and Noun Phrases) Using the text from Exercise 12.1, create a TextBlob, then tokenize it into Sentences and Words, and extract its noun phrases.
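A minimal sketch of one possible solution, assuming text holds the page text from Exercise 12.1:

from textblob import TextBlob

blob = TextBlob(text)  # text from Exercise 12.1

blob.sentences     # list of Sentence objects
blob.words         # WordList of Word objects
blob.noun_phrases  # WordList of noun phrases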
12.3 (Sentiment of a News Article) Using the techniques in Exercise 12.1, download a web page for a current news article and create a TextBlob. Display the sentiment for the entire TextBlob and for each Sentence.
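One possible approach, assuming article_text holds the article's text obtained as in Exercise 12.1:

from textblob import TextBlob

blob = TextBlob(article_text)  # article_text: the downloaded article's text

print(blob.sentiment)  # polarity and subjectivity for the entire TextBlob

for sentence in blob.sentences:
    print(sentence.sentiment)  # sentiment of each Sentence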
12.4 (Sentiment of a News Article with the NaiveBayesAnalyzer) Repeat the previous exercise, but use the NaiveBayesAnalyzer for sentiment analysis.
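The only change from the previous exercise is passing a NaiveBayesAnalyzer to the TextBlob, as in this sketch (article_text is again assumed to hold the article's text):

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob(article_text, analyzer=NaiveBayesAnalyzer())

print(blob.sentiment)  # classification with positive/negative probabilities

for sentence in blob.sentences:
    print(sentence.sentiment)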
12.5 (Spell Check a Project Gutenberg Book) Download a Project Gutenberg book and create a TextBlob. Tokenize the TextBlob into Words and determine whether any are misspelled. If so, display the possible corrections.
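One way to check each word is with Word's spellcheck method, which returns a list of (word, confidence) tuples. A minimal sketch, assuming book_text holds the downloaded book's text (checking every word of a full book can take a while):

from textblob import TextBlob

blob = TextBlob(book_text)  # book_text: the Project Gutenberg book's text

for word in blob.words:
    suggestions = word.spellcheck()  # list of (word, confidence) tuples
    # if the top suggestion is not the word itself, it may be misspelled
    if suggestions[0][0].lower() != word.lower():
        print(word, suggestions)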
12.6 (Word Frequency Bar Chart and Word Cloud from Shakespeare's Hamlet) Using the techniques you learned in this chapter, create a top-20 word frequency bar chart and a word cloud based on Shakespeare's Hamlet. Use the mask_oval.png file provided in the ch12 examples folder as the mask.
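A sketch of one possible approach, assuming the play's text has been saved to a file named hamlet.txt (a hypothetical name) and that the NLTK stopwords corpus is available:

from pathlib import Path
from nltk.corpus import stopwords
from textblob import TextBlob
from wordcloud import WordCloud
import imageio
import pandas as pd

hamlet = TextBlob(Path('hamlet.txt').read_text())  # hypothetical file name
stop_words = set(stopwords.words('english'))

# word frequencies, excluding stop words
items = [(word, count) for word, count in hamlet.word_counts.items()
         if word not in stop_words]
top20 = sorted(items, key=lambda item: item[1], reverse=True)[:20]

# top-20 word frequency bar chart
df = pd.DataFrame(top20, columns=['word', 'count'])
axes = df.plot.bar(x='word', y='count', legend=False)

# word cloud shaped by the oval mask
mask_image = imageio.imread('mask_oval.png')
wordcloud = WordCloud(colormap='prism', mask=mask_image,
    background_color='white')
wordcloud = wordcloud.generate_from_frequencies(dict(items))
wordcloud.to_file('hamlet_cloud.png')  # example output file name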
12.7 (Textatistic: Readability of News Articles) Using the techniques in the first exercise, download current news articles on the same topic from several news sites. Perform readability assessments on them to determine which sites' articles are the most readable. For each article, calculate the average number of words per sentence, the average number of characters per word and the average number of syllables per word.
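One way to compute those averages is from the counts in a Textatistic object's dictionary, as in this sketch (article_text is assumed to hold one article's text, and the dictionary key names are based on Textatistic's counts):

from textatistic import Textatistic

stats = Textatistic(article_text).dict()  # counts and readability scores

print('average words per sentence:',
      stats['word_count'] / stats['sent_count'])
print('average characters per word:',
      stats['char_count'] / stats['word_count'])
print('average syllables per word:',
      stats['sybl_count'] / stats['word_count'])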
12.8 (spaCy: Named Entity Recognition) Using the techniques in the first exercise, download a current news article, then use the spaCy library's named entity recognition capabilities to display the named entities (people, places, organizations, etc.) in the article.
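A minimal sketch, assuming article_text holds the article's text and that an English spaCy model such as en_core_web_sm is installed:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed
document = nlp(article_text)

for entity in document.ents:  # the named entities spaCy found
    print(f'{entity.text}: {entity.label_}')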
12.9 (spaCy: Similarity Detection) Using the techniques in the first exercise, download several news articles on the same topic and compare them for similarity.
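A sketch of pairwise comparison, assuming article_texts is a list of the downloaded articles' texts; a model with word vectors (such as en_core_web_md) generally gives more meaningful similarity scores:

import spacy

nlp = spacy.load('en_core_web_md')  # assumes this model with word vectors is installed

documents = [nlp(text) for text in article_texts]

# compare every pair of articles
for i, document1 in enumerate(documents):
    for document2 in documents[i + 1:]:
        print(document1.similarity(document2))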
12.10 (spaCy: Shakespeare Similarity Detection) Using the spaCy techniques you learned in this chapter, download a Shakespeare comedy from Project Gutenberg and compare it for similarity with Romeo and Juliet.
12.11 (textblob.utils Utility Functions) TextBlob's textblob.utils module offers several utility functions for cleaning up text, including strip_punc and lowerstrip. You call strip_punc with a string and the keyword argument all=True to remove punctuation from the string. You call lowerstrip with a string and the keyword argument all=True to get a string in all lowercase letters with whitespace and punctuation removed. Experiment with each function on Romeo and Juliet.
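For example, you might try the functions on a single line from the play before applying them to the full text:

from textblob.utils import strip_punc, lowerstrip

line = 'But, soft! what light through yonder window breaks?'

strip_punc(line, all=True)   # punctuation removed
lowerstrip(line, all=True)   # lowercase, whitespace and punctuation removed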
12.12 (Research: Funny Newspaper Headlines) To understand how tricky it is to work with natural language and its inherent ambiguity issues, research “funny newspaper headlines.” List the challenges you find.
12.13 (Try the Demos: Named Entity Recognition) Search online for the Stanford Named Entity Tagger and the Cognitive Computation Group’s Named Entity Recognition Demo. Run each with a corpus of your choice. Compare the results of both demos.
12.14 (Try the Demo: TextRazor) TextRazor is one of many paid commercial NLP products that offer a free tier. Search online for TextRazor Live Demo. Paste in a corpus of your choice and click Analyze to analyze the corpus for categories and topics, highlighting key sentences and words within them. Click the links below each piece of analyzed text for more detailed analyses. Click the Advanced Options link to the right of Analyze and Clear for many additional features, including the ability to analyze text in different languages.
12.15 (Project: Readability Scores with Textatistic) Try Textatistic with famous authors’ books from Project Gutenberg.
12.16 (Project: Who Authored the Works of Shakespeare) Using the spaCy similarity detection code introduced in this chapter, compare Shakespeare's Macbeth to one major work from each of several other authors who might have written Shakespeare's works (see https:/). Locate works on Project Gutenberg from a few authors listed at https:/, then use spaCy to compare their works' similarity to Macbeth. Which of the authors' works are most similar to Macbeth?
12.17 (Project: Similarity Detection) One way to measure similarity between two documents is to compare frequency counts of the parts of speech used in each. Build dictionaries of part-of-speech frequencies for two Project Gutenberg books by the same author and two by different authors, then compare the results.
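A sketch of one approach using TextBlob's tags property, assuming the books have been saved locally (the file names here are hypothetical); lower difference scores suggest more similar part-of-speech usage:

from collections import Counter
from pathlib import Path
from textblob import TextBlob

def pos_frequencies(path):
    """Return normalized part-of-speech tag frequencies for a text file."""
    blob = TextBlob(Path(path).read_text())
    counts = Counter(tag for word, tag in blob.tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

freq1 = pos_frequencies('book1.txt')  # hypothetical file names
freq2 = pos_frequencies('book2.txt')

# a simple difference score: sum of absolute frequency differences
all_tags = set(freq1) | set(freq2)
difference = sum(abs(freq1.get(tag, 0) - freq2.get(tag, 0))
                 for tag in all_tags)
print(f'difference score: {difference:.3f}')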
12.18 (Project: Text Visualization Browser) Use http:/ to view hundreds of text visualizations. You can filter the visualizations by categories like the analytic tasks they perform, the kinds of visualizations, the kinds of data sources they use and more. Each visualization's summary provides a link to where you can learn more about it.
12.19 (Project: Stanford CoreNLP) Search online for “Stanford CoreNLP Python” to find Stanford’s list of Python modules for using CoreNLP, then experiment with its features to perform tasks you learned in this chapter.
12.20 (Project: spaCy and spacy-readability) We used Textatistic for readability assessment in this chapter. There are many other readability libraries, such as readability-score, textstat, readability, pylinguistics and spacy-readability, which works in conjunction with spaCy. Investigate the spacy-readability module, then use it to evaluate the readability of King Lear (from Project Gutenberg).
12.21 (Project: Worldwide Peace) As you know, TextBlob language translation works by connecting to Google Translate. Determine the range of languages Google Translate recognizes. Write a script that translates the English word "Peace" into each of the supported languages. Display the translations in the same size text in a circular word cloud using the mask_circle.png file provided in the ch12 examples folder.
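A sketch of the translation step, assuming a TextBlob version that still provides the translate method and using only a handful of example language codes (the full list comes from your research into Google Translate's supported languages):

from textblob import TextBlob

# hypothetical subset of Google Translate language codes
language_codes = ['es', 'fr', 'de', 'it', 'ja', 'ar', 'ru']

peace = TextBlob('Peace')
translations = []

for code in language_codes:
    # translate connects to Google Translate, so this requires Internet access
    translations.append(str(peace.translate(from_lang='en', to=code)))

print(translations)  # pass these strings to WordCloud with mask_circle.png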
12.22 (Project: Self Tutor for Learning a New Language) With today’s natural language tools, including inter-language translation, speech-to-text and text-to-speech in various languages, you can build a self-tutor that will help you learn new languages. Find Python speech-to-text and text-to-speech libraries that can handle various languages. Write a script that allows you to indicate the language you wish to learn (use only the languages supported by Google Translate through TextBlob). The script should then allow you to say something in English, transcribe your speech to text, translate it to the selected language and use text-to-speech to speak the translated text back to you so you can hear it. Try your script with words, colors, numbers, sentences, people, places and things.
12.23 (Project: Accessing Wikipedia with Python) Search online for Python modules that enable you to access content from Wikipedia and similar sites. Write scripts to exercise the capabilities of those modules.
12.24 (Project: Document Summarization with Gensim) Document summarization involves analyzing a document and extracting content to produce a summary. For example, with today's massive flow of information, this could be useful to busy doctors studying the latest medical advances in order to provide the best care. A summary could help them decide whether a paper is worth reading. Investigate the summarize and keywords functions from the Gensim library's gensim.summarization module, then use them to summarize text and extract the important words. You'll need to install Gensim with
conda install -c conda-forge gensim
Assuming that text is a string representing a corpus, the following Gensim code summarizes the text and displays a list of keywords in the text:
from gensim.summarization import summarize, keywords
print(summarize(text))
print(keywords(text))
12.25 (Challenge Project: Six Degrees of Separation with Wikipedia Corpus) You may have heard of “six degrees of separation” for finding connections between any two people on the planet. The idea is that as you look at a person’s connections to friends and family, then look at their friends’ and family’s connections, and so on, you’ll often find a connection between two people within the first six levels of connection.
Research “six degrees of separation Python” online. You’ll find many implementations. Execute some of them to see how they do. Next, investigate the Wikipedia APIs and Python modules for using them. Choose two famous people. Load the Wikipedia page for the first person, then use named entity recognition to locate any names in that person’s Wikipedia page. Then repeat this process for the Wikipedia pages of each name you find. Continue this process six levels deep to build a graph of people connected to the original person’s Wikipedia page. Along the way, check whether the other person’s name appears in the graph and print the chain of people. In the “Big Data: Hadoop, Spark, NoSQL and IoT” chapter, we’ll discuss the Neo4j graph database, which can be used to solve this problem.
12.26 (Project: Synonym Chain Leading to an Antonym) As you follow synonym chains—that is, synonyms of synonyms of synonyms, etc.—to arbitrary levels, you’ll often encounter words that do not appear to be related to the original. Though rare, there actually are cases in which following a synonym chain eventually results in an antonym of the initial word. For several examples, see the paper “Websterian Synonym Chains”:
https:/
Choose a synonym chain in the paper above. Use the WordNet features introduced in this chapter to get the first word's synonyms and antonyms. Next, for each of the words in the Synsets, get their synonyms, then the synonyms of those synonyms and so on. As you get the synonyms at each level, check whether any of them is one of the initial word's antonyms. If so, display the synonym chain that led to the antonym.
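A simplified sketch using TextBlob's WordNet integration; it detects when an antonym appears at some level, but the starting word is hypothetical, and displaying the full chain would also require recording each synonym's predecessor:

from textblob import Word

start = Word('increase')  # hypothetical starting word from the paper

# the starting word's antonyms, gathered from its Synsets
antonyms = {antonym.name()
            for synset in start.synsets
            for lemma in synset.lemmas()
            for antonym in lemma.antonyms()}

current_level = {str(start)}

for level in range(1, 6):  # follow the synonym chain a few levels deep
    next_level = set()
    for word in current_level:
        for synset in Word(word).synsets:
            for lemma in synset.lemmas():
                next_level.add(lemma.name())
    found = next_level & antonyms
    if found:
        print(f'antonym(s) reached at level {level}: {found}')
        break
    current_level = next_level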
12.27 (Project: Steganography) Steganography hides information within other information. The term literally means “covered writing.” Research online “text steganography” and “natural language steganography.” Write scripts that use various steganography techniques to hide information in text and to extract that information.