COMP 200: Elements of Computer Science
Spring 2013

Assignment 4

Put all your answers in one CodeSkulptor file. Put text answers in Python comments. Use Python comments to clearly mark each problem number.

Work in assigned pairs on this assignment. Submit it on OWL-Space, under Assignment 4. Only the first person listed in the pair should submit the CodeSkulptor URL. Put the pair designation as posted, (NetID1, NetID2), at the top of your submission as well as in Python comments at the top of your CodeSkulptor code. Including both people's actual names would be nice, too.

As with the previous assignment, your grade will partly depend on your code style.

Be sure to read the course policies.

Regular expressions (30 points total)

In the following, consider the provided number of BNC hits as guidance, rather than a hard target that you must reach. In our use of REs, we have seen that deciding exactly what we want to match is a separate question from writing an RE that will match that goal.

You will be providing REs both for re.findall() and for the BNC web search. As we have seen, the REs we use in these two search contexts are sometimes slightly different.
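
For reference, here is a minimal sketch of how an RE can be tested with re.findall(). It is not one of the assignment's answers; the sample text and the pattern for words beginning with “str” are made up purely for illustration.

    import re

    # A small sample text, made up purely for illustration.
    SAMPLE = "A stray string of words streamed past the strong, striped straw."

    # One possible way to match whole words starting with a given prefix:
    # \b anchors the match at a word boundary, and \w* consumes the rest
    # of the word.
    STR_WORDS = r"\bstr\w*"

    # re.findall() returns every non-overlapping match as a list of strings.
    print(re.findall(STR_WORDS, SAMPLE))
    # ['stray', 'string', 'streamed', 'strong', 'striped', 'straw']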

  1. (5 points)

    Many words beginning with the letters “sl” have related meanings. Consider “slip”, “slide”, “slosh”, “slick”, and “slather”, for example. The “sl” root comes from Proto-Indo-European, the proposed language ancestor of Greek, Latin, and Sanskrit.

    Create a RE for searching a large text string with re.findall() for words beginning with “sl”. All you need to provide is the RE. Which text strings you test it on is your own choice.

    Create a RE for searching BNC for words beginning with “sl”. (my version: 91335 BNC hits)

  2. (10 points)

    Create a RE for searching a large text string with re.findall() for words, abbreviations, and Roman numerals that contain three or more consecutive, identical vowels.

    Create a RE for searching BNC for words, abbreviations, and Roman numerals that contain three or more consecutive, identical vowels. (my version: 7656 BNC hits)

  3. (15 points)

    We want to find entries in the BNC that relate to the 1980s. Create a RE for searching BNC for references to the decade or its years.

    Among the things you would want to find are “1984”, “'80s”, and “eighties”. You are not expected to search for terms relating to things that happened during the decade, such as the Berlin Wall being torn down. Only search for explicit references to the years.

    Think about what search terms are relevant, and combine them into one RE.
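
If it helps to prototype the decades pattern before trying it on the BNC, here is a minimal Python-side sketch of the alternation idea. The alternatives shown are deliberately incomplete placeholders, not a full answer, and the BNC web search has its own conventions that may differ slightly from re.findall().

    import re

    # Made-up test string, for illustration only.
    SAMPLE = "Back in 1984, and again in the '80s, the eighties were everywhere."

    # Alternation (|) lets one RE combine several kinds of reference.
    # These alternatives are intentionally incomplete placeholders.
    EIGHTIES = r"198\d|'80s|eighties"

    print(re.findall(EIGHTIES, SAMPLE))
    # ['1984', "'80s", 'eighties']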

Text analysis (70 points total)

Word statistics

For this first group of problems, combine your file reading, word counting, and statistics functions to examine some properties of John F. Kennedy's “Moon Speech”, as found in "comp200-Kennedy_Moon_Speech.txt". Of course, you should test your code on smaller files, where you can determine the appropriate answers by hand.

For this set of problems, consider numbers as words. Do not count punctuation as words. Consider words to be case-sensitive, so that “the” and “The” are distinct. Use the word-splitting RE provided in class.

Here, we are primarily interested in the answers, but you should also provide the code you used to obtain them.
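
As a reminder of how the pieces fit together, here is a minimal sketch of that pipeline. The helper name and the WORD_RE placeholder are hypothetical stand-ins: use your own functions and the word-splitting RE provided in class, and read the file's contents into text however you did in class.

    import re

    # Placeholder only -- substitute the word-splitting RE provided in class.
    WORD_RE = r"[a-zA-Z0-9]+"

    def word_counts(text):
        """Map each distinct (case-sensitive) word in text to its count."""
        counts = {}
        for word in re.findall(WORD_RE, text):
            counts[word] = counts.get(word, 0) + 1
        return counts

    # Tiny hand-checkable example standing in for the speech's text.
    text = "The cat saw the cat"
    counts = word_counts(text)
    print(len(counts))                    # number of distinct words -> 4

    total = sum(counts.values())
    the_count = counts.get("the", 0) + counts.get("The", 0)
    print(100.0 * the_count / total)      # percentage of all occurrences -> 40.0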

  1. (5 points)

    How many distinct words occur in the text?

  2. (5 points)

    Find the frequency and a word that represent the median of all the word frequencies. (As before in the course, use the lower median.) Note that multiple words can have the median frequency, and you need only report one of them. A sketch of the lower-median idea appears after this group of problems.

  3. (5 points)

    Taking the number of occurrences of “the” and “The” together, what percentage of the total number of word occurrences do they represent?
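
Here is the promised sketch of the lower-median idea. It assumes a counts dictionary like the one sketched above; the function name is hypothetical, and since several words can share the median frequency, which word comes back depends on the sort order.

    def lower_median_frequency(counts):
        """Return (frequency, word) for the lower median of the word counts."""
        # Sort the (count, word) pairs by count; the lower median is the middle
        # element, rounding down when there is an even number of words.
        pairs = sorted((count, word) for word, count in counts.items())
        middle = (len(pairs) - 1) // 2
        return pairs[middle]

    print(lower_median_frequency({"The": 1, "cat": 2, "saw": 1, "the": 1}))
    # (1, 'saw') -- the median frequency is 1; any word with that frequency is fine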

Word-sequence analysis

In class, you are generalizing your original word-counting program to count word sequences. You have also written code in class to find word frequencies and word successors. In the following exercises, you'll write code that combines all these ideas to determine the frequencies of word-sequence successors, also known as a Markov chain, as described in your notes. As with the in-class exercises, parameters will indicate whether to treat punctuation as words and whether to treat the words as case-sensitive.

Again, you've written most of the necessary code before. You now need to combine the pieces appropriately.

On the next assignment, you'll use Markov chains to generate text, as described in your notes.
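
To make the target data structure concrete, here is a minimal sketch of the core loop on a small hard-coded word list, leaving out the file reading, punctuation handling, and case handling that your functions must add. The names here are hypothetical, not the required ones.

    def successors_sketch(words, seq_size):
        """Map each seq_size-tuple of consecutive words to its distinct successors."""
        successors = {}
        # Stop early enough that every key still has a successor word,
        # so we never look past the end of the list.
        for start in range(len(words) - seq_size):
            key = tuple(words[start : start + seq_size])
            successor = words[start + seq_size]
            if key not in successors:
                successors[key] = []
            if successor not in successors[key]:
                successors[key].append(successor)
        return successors

    words = ["love", "babe", ",", "love", "babe", "just"]
    print(successors_sketch(words, 2))
    # ('love', 'babe') -> [',', 'just'], ('babe', ',') -> ['love'],
    # (',', 'love') -> ['babe']   (dictionary printing order may vary)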

  1. (20 points)

    Define a function wordseq_successors(filename, seq_size, include_punc, is_case_sensitive). It returns a dictionary, where each key is a seq_size-element tuple of words. Each key's value is a list of distinct successor words. For example, in “Eight Days a Week”, the phrase “love babe” occurs eight times. Six times it is followed by a comma, and twice by “just”. If the sequence size is two and we are counting punctuation, then ("love", "babe") should be a key which maps to the list [",", "just"]. (Note that the comma comes first here since “love babe,” occurs before “love babe just”.)

    As with wordseq_counts_file_case(), make sure you don't look past the end of the list of words.

  2. (20 points)

    Define a function wordseq_successor_counts(filename, seq_size, include_punc, is_case_sensitive). It returns a dictionary, where each key is a seq_size-element tuple of words. Each key's value is a dictionary mapping distinct successor words to their counts. Continuing our example, ("love", "babe") should be a key which maps to a dictionary {"," : 6, "just" : 2}.

    This should be similar to the previous function. We strongly encourage you to write your code so that the portions that are in common (e.g., the file reading and word splitting) are in their own functions that you call.

  3. (15 points)

    Define a function wordseq_successor_frequencies(filename, seq_size, include_punc, is_case_sensitive). It returns a dictionary, where each key is a seq_size-element tuple of words. Each key's value is a dictionary mapping distinct successor words to their frequencies. Continuing our example, ("love", "babe") should be a key which maps to a dictionary {"," : 0.75, "just" : 0.25}.

    Define this in terms of the previous function: call it to get the counts, then calculate the frequencies from those counts.
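
For that last step, here is a minimal sketch of converting one successor-count dictionary into frequencies. The function name is hypothetical; your wordseq_successor_frequencies() would apply this idea to every key of the dictionary returned by the previous function.

    def counts_to_frequencies(successor_counts):
        """Convert {successor: count} into {successor: fraction of the total}."""
        total = sum(successor_counts.values())
        frequencies = {}
        for successor, count in successor_counts.items():
            # float() keeps the division from truncating to an integer.
            frequencies[successor] = float(count) / total
        return frequencies

    print(counts_to_frequencies({",": 6, "just": 2}))
    # {',': 0.75, 'just': 0.25}   (dictionary printing order may vary)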