An interesting use of natural language processing is assessing text readability, which is affected by the vocabulary used, sentence structure, sentence length, topic and more. While writing this book, we used the paid tool Grammarly to help tune the writing and ensure the text’s readability for a wide audience.
In this section, we’ll use the Textatistic library24 to assess readability.25 There are many formulas used in natural language processing to calculate readability. Textatistic uses five popular readability formulas—Flesch Reading Ease, Flesch-Kincaid, Gunning Fog, Simple Measure of Gobbledygook (SMOG) and Dale-Chall.
To install Textatistic, open your Anaconda Prompt (Windows), Terminal (macOS/Linux) or shell (Linux), then execute the following command:
pip install textatistic
Windows users might need to run the Anaconda Prompt as an Administrator for proper software installation privileges. To do so, right-click Anaconda Prompt in the Start menu and select More > Run as administrator.
First, let’s load Romeo and Juliet into the text variable:
In [1]: from pathlib import Path
In [2]: text = Path('RomeoAndJuliet.txt').read_text()
Calculating statistics and readability scores requires a Textatistic object that’s initialized with the text you want to assess:
In [3]: from textatistic import Textatistic
In [4]: readability = Textatistic(text)
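To give a feel for the kind of raw counts such an object has to gather, here is a deliberately naive sketch that uses only the standard library. The function name naive_counts and its regular expressions are our own illustration, not Textatistic's algorithm, so its counts for a real text will generally differ from the library's.

```python
import re

def naive_counts(text):
    """Return rough word and sentence counts (illustrative only)."""
    # Treat any run of letters and apostrophes as one word
    words = re.findall(r"[A-Za-z']+", text)
    # Treat any run of ., ! or ? as a sentence boundary; drop empty pieces
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return {'word_count': len(words), 'sent_count': len(sentences)}

sample = 'But soft! What light through yonder window breaks? It is the east.'
print(naive_counts(sample))
```

A real implementation also has to cope with abbreviations, numbers, hyphenation and syllable counting, which is why a dedicated library is the better choice.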
Textatistic method dict returns a dictionary containing various statistics and the readability scores26:
In [5]: %precision 3
Out[5]: '%.3f'
In [6]: readability.dict()
Out[6]:
{'char_count': 115141,
 'word_count': 26120,
 'sent_count': 3218,
 'sybl_count': 30166,
 'notdalechall_count': 5823,
 'polysyblword_count': 549,
 'flesch_score': 100.892,
 'fleschkincaid_score': 1.203,
 'gunningfog_score': 4.087,
 'smog_score': 5.489,
 'dalechall_score': 7.559}
Each value in the dictionary is also accessible via a Textatistic property with the same name as its key in the preceding output. The statistics produced include:
char_count—The number of characters in the text.
word_count—The number of words in the text.
sent_count—The number of sentences in the text.
sybl_count—The number of syllables in the text.
notdalechall_count—A count of the words that are not on the Dale-Chall list, which is a list of words understood by 80% of 5th graders.27 The higher this number is compared to the total word count, the less readable the text is considered to be.
polysyblword_count—The number of words with three or more syllables.
flesch_score—The Flesch Reading Ease score, which can be mapped to a grade level. Scores over 90 are considered readable by 5th graders. Scores under 30 require a college degree. Ranges in between correspond to the other grade levels.
fleschkincaid_score—The Flesch-Kincaid score, which corresponds to a specific grade level.
gunningfog_score—The Gunning Fog index value, which corresponds to a specific grade level.
smog_score—The Simple Measure of Gobbledygook (SMOG), which corresponds to the years of education required to understand text. This measure is considered particularly effective for healthcare materials.28
dalechall_score—The Dale-Chall score, which can be mapped to grade levels from 4 and below to college graduate (grade 16) and above. This score is considered to be the most reliable for a broad range of text types.29,30
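To see how the counts feed the scores, the following sketch recomputes all five scores from the counts in the dictionary shown earlier. It assumes the commonly published forms of the formulas, so treat the constants as an assumption and confirm them against the Textatistic documentation. The variable names are our own. With these counts, each result agrees with the corresponding dictionary value to three decimal places.

```python
from math import sqrt

# Counts copied from the readability.dict() output shown earlier
words, sentences, syllables = 26120, 3218, 30166
not_dale_chall, polysyllables = 5823, 549

words_per_sentence = words / sentences
syllables_per_word = syllables / words

# Flesch Reading Ease: higher scores mean easier text
flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Flesch-Kincaid and Gunning Fog: approximate grade levels
flesch_kincaid = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
gunning_fog = 0.4 * (words_per_sentence + 100 * polysyllables / words)

# SMOG: approximate years of education required
smog = 1.043 * sqrt(30 * polysyllables / sentences) + 3.1291

# Dale-Chall: add 3.6365 when more than 5% of the words are off the list
difficult_pct = 100 * not_dale_chall / words
dale_chall = 0.1579 * difficult_pct + 0.0496 * words_per_sentence
if difficult_pct > 5:
    dale_chall += 3.6365

scores = {'flesch_score': flesch, 'fleschkincaid_score': flesch_kincaid,
          'gunningfog_score': gunning_fog, 'smog_score': smog,
          'dalechall_score': dale_chall}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

Note that only two ratios, words per sentence and syllables per word, drive the two Flesch scores, which is why short sentences and short words push Romeo and Juliet's Flesch Reading Ease score above 100.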
You can learn about each of the readability scores produced here, and several others, at
https://en.wikipedia.org/wiki/Readability
The Textatistic documentation also shows the readability formulas used:
http://www.erinhengel.com/software/textatistic/
(Fill-In) ________ indicates how easy it is for readers to understand text.
Answer: Readability.
(IPython Session) Using the results in this section’s IPython session, calculate the average numbers of words per sentence, characters per word and syllables per word.
Answer:
In [7]: readability.word_count / readability.sent_count # sentence length
Out[7]: 8.117
In [8]: readability.char_count / readability.word_count # word length
Out[8]: 4.408
In [9]: readability.sybl_count / readability.word_count # syllables
Out[9]: 1.155