Word-Sequence Counting

Previously, you wrote word_counts_file() to count the words read from a file. It built on your earlier word_counts().

Smaller files useful for testing: "comp200-test.txt", "comp200-Eight_Days_a_Week.txt", "comp200-Johnny_B_Goode.txt", "comp200-Gettysburg_Address.txt"
Large files: "comp200-Odyssey.txt", "comp200-War_and_Peace.txt" (You may get timeout errors on these files.)

In the following exercises, you will extend and generalize word_counts_file(). On your assignment, you will similarly modify word_successors_file(). You'll have three full class days to work on these exercises.

Define is_whitespace(string) that returns whether or not the given string consists only of whitespace. Note that re.findall does not properly process empty text strings, so you will need to explicitly check for the empty string to make sure that your function returns the proper result.

For example, is_whitespace("") and is_whitespace(" \n \t") should each return True, but is_whitespace(" x ") should return False.
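One possible shape for this function is sketched below. It assumes the pattern \S+ (one or more non-whitespace characters) as the test; the explicit empty-string check follows the note above. Treat it as an illustration rather than the expected solution.

```python
import re

def is_whitespace(string):
    """Return True if string is empty or contains only whitespace."""
    # Handle the empty string explicitly, as the exercise asks.
    if string == "":
        return True
    # findall collects every run of non-whitespace characters;
    # an empty result means the string held nothing but whitespace.
    return re.findall(r"\S+", string) == []

print(is_whitespace(""))
print(is_whitespace(" \n \t"))
print(is_whitespace(" x "))
```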

Test your is_whitespace function using OwlTest
Redefine word_counts_file(filename, include_punc) to use our more accurate word-splitting. It also takes a Boolean indicating whether punctuation should be included in the word count.

Hints: Use the previous function to identify the whitespace strings and remove them from the list (Hint-in-a-hint: What did you write recently about breaking your problem down into smaller functions?). Similarly, if we are not including punctuation, identify those strings that only contain punctuation (or, equivalently, don't contain a letter or digit), and remove them from the list. Or, even better, can you make a simple modification to your splitting regular expression so that you don't include the punctuation in the first place?
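The last hint can be sketched as follows. The course's own splitting expression is not reproduced here, so the two patterns below are stand-ins, and the helper names split_words and word_counts_text are hypothetical. They work on a string rather than a file to keep the example self-contained.

```python
import re

def split_words(text, include_punc):
    """Split text into words, optionally keeping punctuation marks."""
    if include_punc:
        # Runs of letters/digits, or any single character that is
        # neither a letter, a digit, nor whitespace (assumed pattern).
        pattern = r"[A-Za-z0-9]+|[^A-Za-z0-9\s]"
    else:
        # Only runs of letters/digits; punctuation never enters the list.
        pattern = r"[A-Za-z0-9]+"
    return re.findall(pattern, text)

def word_counts_text(text, include_punc):
    """Count how often each word (and maybe punctuation mark) occurs."""
    counts = {}
    for word in split_words(text, include_punc):
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counts_text("Hi, hi there!", True))
print(word_counts_text("Hi, hi there!", False))
```

With these stand-in patterns, a contraction such as "don't" is split at the apostrophe, which may or may not match the splitting used in class.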

Test your word_counts_file function using OwlTest
Define word_counts_file_case(filename, include_punc, is_case_sensitive). It is just like the previous version, except that it takes an extra parameter that indicates whether the counting should be case sensitive or not. If not, each word should be converted to lower-case before counting.
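The case handling on its own might look like the sketch below. The helper name normalize_case is hypothetical, and it assumes the words have already been split into a list.

```python
def normalize_case(words, is_case_sensitive):
    """Return the words unchanged, or lower-cased if case is ignored."""
    if is_case_sensitive:
        return list(words)
    # Lower-casing before counting makes "The" and "the" the same key.
    return [word.lower() for word in words]

print(normalize_case(["The", "the", "END"], False))
print(normalize_case(["The", "the", "END"], True))
```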

Test your word_counts_file_case function using OwlTest
Define next_pair(pair, new_element). It takes a pair, i.e., a tuple of two elements, (x₀, x₁). It returns a pair (x₁, new_element).

This should be a useful helper function for the next exercise.
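A minimal sketch of such a helper, assuming the argument really is a 2-tuple:

```python
def next_pair(pair, new_element):
    """Drop the first element of pair and append new_element."""
    # pair[1] becomes the first element of the new pair.
    return (pair[1], new_element)

print(next_pair(("a", "b"), "c"))
```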

Test your next_pair function using OwlTest
Define wordpair_counts_file_case(filename, include_punc, is_case_sensitive). It is like the previous function, except that instead of counting each word, it counts each pair of adjacent words.

For example, given the text "a b c b b c b", it should return {("a", "b") : 1, ("b", "c") : 2, ("c", "b") : 2, ("b", "b") : 1}. Observe that if there are n words in the text, then the total count of pairs is n-1, since the last word can't be paired with a following word. E.g., this text has 7 words, and the total of the counts in the dictionary is 6.

As a reminder, we can't use lists as dictionary keys, so we use tuples, instead.
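The counting step can be sketched on an already-split list of words. The function name wordpair_counts is hypothetical, file reading and punctuation/case handling are left out, and next_pair is repeated here only so the example runs on its own.

```python
def next_pair(pair, new_element):
    """Drop the first element of pair and append new_element."""
    return (pair[1], new_element)

def wordpair_counts(words):
    """Count each pair of adjacent words in the list."""
    counts = {}
    if len(words) < 2:
        # Fewer than two words means there are no pairs to count.
        return counts
    pair = (words[0], words[1])
    counts[pair] = 1
    for word in words[2:]:
        # Slide the window one word to the right.
        pair = next_pair(pair, word)
        counts[pair] = counts.get(pair, 0) + 1
    return counts

result = wordpair_counts("a b c b b c b".split())
print(result)
print(sum(result.values()))
```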

Test your wordpair_counts_file_case function using OwlTest
Define next_seq(prev_tuple, new_element). It takes an n-tuple, i.e., a tuple of n elements, (x₀, …, xₙ₋₁). It returns an n-tuple (x₁, …, xₙ₋₁, new_element).

Hint: It's convenient to convert from a tuple to a list, make the desired change, and then convert back to a tuple.

This should be a useful helper function for the next exercise.
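Following the hint above, one way to write this helper is sketched below. It assumes the tuple has at least one element.

```python
def next_seq(prev_tuple, new_element):
    """Drop the first element of prev_tuple and append new_element."""
    # Convert everything after the first element to a list...
    items = list(prev_tuple[1:])
    # ...add the new element at the end...
    items.append(new_element)
    # ...and convert back, since tuples can be dictionary keys.
    return tuple(items)

print(next_seq(("a", "b", "c"), "d"))
print(next_seq(("a",), "b"))
```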

Test your next_seq function using OwlTest
Define wordseq_counts_file_case(filename, seq_size, include_punc, is_case_sensitive). It is like the previous function, except that instead of counting each word pair (a sequence of length 2), it counts each word sequence of length seq_size. The sequence size should be a positive integer.

For example, given the text "a b c b b c b" and sequence size 3, it should return {("a", "b", "c") : 1, ("b", "c", "b") : 2, ("b", "b", "c") : 1, ("c", "b", "b") : 1}. Observe that if there are n words in the text, then the total count of sequences is n-(seq_size-1), since none of the last seq_size-1 words can start a full sequence. E.g., this text has 7 words, and the total of the counts in the dictionary is 5.

Thus, a sequence size of 1 should count individual words (except that the dictionary keys are 1-tuples of strings, instead of strings), and a sequence size of 2 should count word pairs.
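The general case can be sketched on an already-split list of words. This version uses list slicing instead of next_seq, which is a different route from the one the exercises suggest. The function name wordseq_counts is hypothetical, and file reading and punctuation/case handling are omitted.

```python
def wordseq_counts(words, seq_size):
    """Count each run of seq_size consecutive words in the list."""
    counts = {}
    # There are len(words) - (seq_size - 1) starting positions;
    # range() is empty when the text is shorter than seq_size.
    for start in range(len(words) - seq_size + 1):
        seq = tuple(words[start:start + seq_size])
        counts[seq] = counts.get(seq, 0) + 1
    return counts

words = "a b c b b c b".split()
print(wordseq_counts(words, 3))
print(wordseq_counts(words, 1))
```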

Test your wordseq_counts_file_case function using OwlTest

COMP 200: Elements of Computer Science Spring 2013
