Text and Language

c38-fig-5001.jpg

String Search

Find a table of color names and their corresponding RGB values. Create an interaction in which a square becomes colored when a user types a known color name. (Is your program case-insensitive?)
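One possible starting point is sketched below. The four-entry `COLOR_TABLE` and the `lookup_color` helper are illustrative assumptions, not a prescribed solution; a real project would load a fuller table and feed the result to whatever drawing environment you use.

```python
# Hypothetical, abbreviated color table; substitute a fuller list of your own.
COLOR_TABLE = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "rebeccapurple": (102, 51, 153),
}


def lookup_color(typed):
    """Return the RGB tuple for a typed color name, or None if it is unknown."""
    # Trimming whitespace and lowercasing makes the search case-insensitive.
    return COLOR_TABLE.get(typed.strip().lower())


if __name__ == "__main__":
    print(lookup_color("  Red "))
    print(lookup_color("mauve"))
```

Returning `None` for unknown names lets the drawing code decide what to do (for example, leave the square unchanged) while the user is still typing.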

c38-fig-5002.jpg

Nonsense Words

Find some lists of common English prefixes, word roots, and suffixes. Select a random item from each list and combine them in a simple syntax (prefix+root+suffix) to generate plausible nonsense words. What might these words mean?
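The prefix+root+suffix syntax might be sketched as follows. The three short lists are placeholder assumptions; longer lists will give more varied results.

```python
import random

# Placeholder lists; replace them with the lists you find.
PREFIXES = ["anti", "micro", "sub", "trans"]
ROOTS = ["chron", "graph", "morph", "phon"]
SUFFIXES = ["ology", "ism", "able", "ist"]


def nonsense_word(rng=random):
    """Join one random prefix, root, and suffix into a single word."""
    return rng.choice(PREFIXES) + rng.choice(ROOTS) + rng.choice(SUFFIXES)


if __name__ == "__main__":
    for _ in range(5):
        print(nonsense_word())
```

The optional `rng` parameter accepts a seeded `random.Random` instance, which is handy when you want repeatable output.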

c38-fig-5003.jpg

Letter Frequency

Write a program to calculate the frequencies of the letters in a provided text. (Be careful to make your program “case-insensitive.”) Write code to generate a visualization (such as a bar chart or pie chart) of the letters’ frequencies.
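A minimal sketch of the counting step, with a plain-text bar chart standing in for a graphical one. It assumes that only the 26 unaccented English letters matter; the function names are illustrative.

```python
import string
from collections import Counter


def letter_frequencies(text):
    """Count each letter a-z in the text, ignoring case and other characters."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)


def text_bar_chart(counts, width=40):
    """Render the counts as one row of '#' marks per letter."""
    peak = max(counts.values(), default=0)
    lines = []
    for letter in string.ascii_lowercase:
        n = counts.get(letter, 0)
        # Scale each bar so that the most frequent letter fills the width.
        bar = "#" * (n * width // peak) if peak else ""
        lines.append(f"{letter} {bar} {n}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_bar_chart(letter_frequencies("Hello, World!")))
```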

c38-fig-5004.jpg

Letter-Pair Frequency

Write a program to calculate the frequencies of letter pairs (character 2-grams, such as “aa,” “ab,” “ac”) in a large text source. Plot the frequencies in a 26×26 matrix.
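The matrix could be built along these lines. This sketch assumes that a pair only counts when both characters are letters a-z, so pairs that span a space or punctuation mark are skipped; plotting is left to your graphics toolkit.

```python
import string


def letter_pair_matrix(text):
    """Return a 26x26 list of lists; matrix[i][j] counts letter i followed by letter j."""
    index = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
    matrix = [[0] * 26 for _ in range(26)]
    lowered = text.lower()
    # Walk over every adjacent pair of characters in the text.
    for first, second in zip(lowered, lowered[1:]):
        if first in index and second in index:
            matrix[index[first]][index[second]] += 1
    return matrix


if __name__ == "__main__":
    m = letter_pair_matrix("Banana an")
    print(m[0][13])  # how often "a" is followed by "n"
```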

c38-fig-5005.jpg

Average Word Length

Write a program that calculates the average word length of a provided text. This is a useful approximation of a text's “reading level.” Run your program on several different source materials.
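A sketch of the calculation, assuming that a "word" is a run of English letters (optionally containing apostrophes, which are not counted toward its length):

```python
import re


def average_word_length(text):
    """Return the mean number of letters per word, or 0.0 for text without words."""
    # Treat contractions such as "don't" as single words.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not words:
        return 0.0
    letters = sum(len(word.replace("'", "")) for word in words)
    return letters / len(words)


if __name__ == "__main__":
    print(average_word_length("The cat sat."))
```

How you define a word (hyphens, numerals, accented letters) will shift the result, so it is worth stating your definition when comparing texts.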

c38-fig-5006.jpg

Sorting Words

Load a document and display its words (a) sorted alphabetically, (b) sorted by their length, and (c) sorted according to their frequency in the text.
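The three orderings can be sketched with Python's built-in `sorted`. The assumptions here are that words are lowercased, that each distinct word is listed once, and that ties are broken alphabetically.

```python
import re
from collections import Counter


def sorted_views(text):
    """Return the distinct words sorted (a) alphabetically, (b) by length, (c) by frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    alphabetical = sorted(counts)
    by_length = sorted(alphabetical, key=lambda w: (len(w), w))
    # Negating the count puts the most frequent words first.
    by_frequency = sorted(alphabetical, key=lambda w: (-counts[w], w))
    return alphabetical, by_length, by_frequency


if __name__ == "__main__":
    for view in sorted_views("a rose is a rose is a rose indeed"):
        print(view)
```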

c38-fig-5007.jpg

Cut-Up Machine

In the Dada Manifesto, Tristan Tzara describes using a newspaper, scissors, and some gentle shaking to generate irrational poetry. Do the same with code. Write a program that randomizes the lines or sentences of a newspaper article to make a Dada-style poem.
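A sentence-level version might look like the sketch below. The sentence splitter is a deliberately crude assumption (it breaks after `.`, `!`, or `?` followed by whitespace) and will stumble on abbreviations.

```python
import random
import re


def cut_up(article, rng=random):
    """Shuffle the sentences of an article and return them one per line."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", article)
    sentences = [s.strip() for s in pieces if s.strip()]
    rng.shuffle(sentences)
    return "\n".join(sentences)


if __name__ == "__main__":
    print(cut_up("The council met on Tuesday. Rain is expected! Who will win?"))
```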

c38-fig-5008.jpg

Bigram Calculator

Write a program to calculate the frequency of all bigrams (word-pairs) in a document. Advanced students: develop a program that judges the similarity of two text files based on how many bigrams they have in common.
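Both parts of the exercise are sketched below. The similarity measure used here, the Jaccard index (shared bigrams divided by all distinct bigrams), is one assumption among many possible choices.

```python
import re
from collections import Counter


def word_bigrams(text):
    """Count every pair of adjacent words in the text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


def bigram_similarity(text_a, text_b):
    """Return the Jaccard index of the two texts' bigram sets (0.0 to 1.0)."""
    set_a, set_b = set(word_bigrams(text_a)), set(word_bigrams(text_b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(word_bigrams("the cat sat on the cat").most_common(1))
    print(bigram_similarity("the cat sat", "the cat ran"))
```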

c38-fig-5009.jpg

Dammit Jim

Find a list of occupations. Use these in a generative grammar that produces sentences in this format: “Dammit, Jim, I'm an X, not a Y!” (popularized by the sci-fi TV show Star Trek). Be sure to write “an X” if X begins with a vowel and “a X” if it begins with a consonant.[1]
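The article rule might be sketched as follows. The occupation list is a placeholder, and the vowel test looks only at spelling, so it is an approximation: words like "hour" or "university" would get the wrong article.

```python
import random

# Placeholder list; replace it with the occupations you find.
OCCUPATIONS = ["astronaut", "bricklayer", "doctor", "engineer", "illustrator", "plumber"]


def article_for(word):
    """Choose "an" before a vowel letter and "a" otherwise (spelling-based only)."""
    return "an" if word[0].lower() in "aeiou" else "a"


def dammit_jim(rng=random):
    """Fill the sentence template with two different random occupations."""
    x, y = rng.sample(OCCUPATIONS, 2)
    return f"Dammit, Jim, I'm {article_for(x)} {x}, not {article_for(y)} {y}!"


if __name__ == "__main__":
    print(dammit_jim())
```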

c38-fig-5010.jpg

Knock-Knock Joke Generator

Create a program that generates knock-knock jokes. At a minimum, your program should select a random word as a response to “Who's there?” and add to this word to create the final line of the joke. Generate ten jokes.

c38-fig-5011.jpg

Pig Latin Translator

Create a program that translates provided text into Pig Latin. In this playful scheme, the initial consonant (or consonant cluster) of each word is transferred to the end of that word, after which the syllable “ay” is added.
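The rule described above can be sketched with a regular expression. This version makes some simplifying assumptions: it lowercases every word, treats only a, e, i, o, and u as vowels, and simply appends "ay" to words that already start with a vowel.

```python
import re


def pig_latin_word(word):
    """Move the leading consonant cluster of a lowercase word to the end, then add "ay"."""
    # Group 1 is the run of non-vowels at the start; group 2 is the rest of the word.
    onset, rest = re.match(r"([^aeiou]*)(.*)", word).groups()
    return rest + onset + "ay"


def pig_latin(text):
    """Translate every run of letters in the text, leaving punctuation in place."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group(0).lower()), text)


if __name__ == "__main__":
    print(pig_latin("Pig Latin is fun!"))
```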

c38-fig-5012.jpg

Argots and Language Games

Investigate argots, secret languages, and word games like Ubbi Dubbi, Tutnese, Pirate English, and Dizzouble Dizzutch. Select one and write a program that translates a provided text into it.

c38-fig-5013.jpg

Noun Swizzler

Load a text, and replace each noun with a randomly selected noun taken from a second text. You may need to use a “part-of-speech tagger” to identify the nouns. In your substitutions, try to match the use of plural and proper nouns in the first text.

c38-fig-5014.jpg

Rhyming Couplets

Select and load a large expository text. Create a program that finds rhyming couplets within it, in order to make new poems. You may need to use an additional library (such as RiTa) to tell you what the words sound like.[2]

c38-fig-5015.jpg

Haiku Finder

Write a program that automatically discovers “inadvertent haikus”: sentences in a chosen text whose words happen to fall in groups of five, seven, and five syllables. A basic solution will discover haikus with awkward breaks; add heuristics to improve the quality of your results.

c38-fig-5016.jpg

Markov Text Generator

In a Markov chain, a dataset of letter-pair, bigram, or n-gram frequencies is used as a “probability transition matrix” to synthesize new text that statistically resembles the dataset's source. Build a Markov generator using the data you collected earlier.[3]
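A word-level version of the idea is sketched below. Instead of an explicit matrix, it assumes a dictionary that maps each word to the list of words observed after it; because repeated followers appear repeatedly in the list, a uniform random choice reproduces the observed frequencies.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length=20, rng=random):
    """Walk the chain from a start word, producing at most `length` words."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    source = "the cat sat on the mat and the dog sat on the cat"
    print(generate(build_chain(source), "the", length=12))
```

The same structure works for letters: pass a list of characters instead of `text.split()` and join the output without spaces.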

c38-fig-5017.jpg

Keyword Extraction with TF-IDF

Obtain a collection of related documents, such as poems, recipes, or obituaries. Write a program that uses the TF-IDF (“Term Frequency–Inverse Document Frequency”) algorithm to determine the keywords that best characterize each document.[4]
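TF-IDF has several common variants. The sketch below assumes one simple form: term frequency is a word's count divided by the document's length, and inverse document frequency is the natural logarithm of (number of documents / number of documents containing the word). Words that appear in every document therefore score zero.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf_keywords(documents, top_n=3):
    """Return, for each document, its top_n words ranked by TF-IDF score."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


if __name__ == "__main__":
    docs = [
        "apple pie with apple sauce",
        "cherry pie with cream",
        "lemon pie with lemon zest",
    ]
    print(tf_idf_keywords(docs, top_n=1))
```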

c38-fig-5018.jpg

Limerick Generator

Limerick poems have five lines in an AABBA rhyming pattern. The building block of these lines is the anapest: a foot of verse consisting of three syllables, the third of which is accented: da-da-DA. Lines 1, 2, and 5 consist of three anapests; they end with a similar phoneme in order to create the rhyme. Lines 3 and 4 also rhyme with each other, but are shorter, consisting of two anapests each. Write a program to generate limericks. Use a code library such as RiTa to help evaluate your words’ rhymes, syllables, and stress patterns.[5]

Notes