Word-Sequence Counting
Previously, you wrote word_counts_file() to count words read from a file. This used your previous word_counts().
-
Smaller files useful for testing: "comp200-test.txt", "comp200-Eight_Days_a_Week.txt", "comp200-Johnny_B_Goode.txt", "comp200-Gettysburg_Address.txt"
-
Large files: "comp200-Odyssey.txt", "comp200-War_and_Peace.txt" (You may get timeout errors on these files.)
In the following exercises, you will extend and generalize word_counts_file(). On your assignment, you will similarly modify word_successors_file().
You'll have three full class days to work on these exercises.
-
Define is_whitespace(string) that returns whether or not the given string consists only of whitespace. Note that re.findall does not properly process empty text strings, so you will need to explicitly check for the empty string to make sure that your function returns the proper result.
For example, is_whitespace("") and is_whitespace(" \n \t") should each return True. But is_whitespace(" x ") should return False.
-
Redefine word_counts_file(filename, include_punc) to use our more accurate word-splitting. It also takes a Boolean indicating whether punctuation should be included in the word count.
Hints: Use the previous function to identify the whitespace strings and remove them from the list. (Hint-in-a-hint: What did you write recently about breaking your problem down into smaller functions?) Similarly, if we are not including punctuation, identify those strings that only contain punctuation (or, equivalently, don't contain a letter or digit), and remove them from the list. Or, even better, can you make a simple modification to your splitting regular expression so that you don't include the punctuation in the first place?
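As a rough sketch of how the two exercises above might fit together, the following works on a text string rather than a file so that it is self-contained. The names split_tokens() and word_counts_text() are hypothetical, and the splitting pattern is only an assumption standing in for the word-splitting code from class, which may differ.

```python
import re

def is_whitespace(string):
    """Return True if string is empty or consists only of whitespace."""
    # Check the empty string explicitly, as the exercise asks.
    if string == "":
        return True
    return re.fullmatch(r"\s+", string) is not None

def split_tokens(text):
    # Hypothetical splitter (an assumption, not the course's pattern):
    # runs of word characters, runs of whitespace, or single punctuation marks.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def word_counts_text(text, include_punc):
    """Count tokens in text, skipping whitespace and, optionally, punctuation."""
    counts = {}
    for token in split_tokens(text):
        if is_whitespace(token):
            continue
        # A token with no letter or digit is treated as punctuation.
        if not include_punc and not any(ch.isalnum() for ch in token):
            continue
        counts[token] = counts.get(token, 0) + 1
    return counts
```

Filtering after splitting follows the first hint. Changing the pattern so that punctuation is never matched, as the last hint suggests, would make the second check unnecessary.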
-
Define word_counts_file_case(filename, include_punc, is_case_sensitive). It is just like the previous version, except that it takes an extra parameter that indicates whether the counting should be case sensitive or not. If not, each word should be converted to lowercase before counting.
-
Define next_pair(pair, new_element). It takes a pair, i.e., a tuple of two elements, (x0, x1). It returns a pair (x1, new_element).
This should be a useful helper function for the next exercise.
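A minimal sketch of one possible next_pair(), followed by a short loop showing how it can slide a two-word window across a list. The list words and the variable seen are made up for illustration.

```python
def next_pair(pair, new_element):
    """Return (x1, new_element) for a pair (x0, x1)."""
    # Unpack the pair; the first element is dropped.
    _, x1 = pair
    return (x1, new_element)

# Illustration only: slide the pair window across a small word list.
words = ["a", "b", "c", "b"]
pair = (words[0], words[1])
seen = [pair]
for word in words[2:]:
    pair = next_pair(pair, word)
    seen.append(pair)
```

After the loop, seen holds every adjacent pair in order, which is the set of keys a pair counter needs to tally.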
-
Define wordpair_counts_file_case(filename, include_punc, is_case_sensitive). It is like word_counts_file_case(), except that instead of counting each word, it counts each word pair.
For example, given a text "a b c b b c b", it should return {("a", "b") : 1, ("b", "c") : 2, ("c", "b") : 2, ("b", "b") : 1}. Observe that if there are n words in the text, then the total count of pairs is n-1, since the last word can't be paired with a following word. E.g., this text has 7 words, and the total of the counts in the dictionary is 6.
As a reminder, we can't use lists as dictionary keys, so we use tuples instead.
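To check the example by hand, here is a sketch of the counting step alone. It assumes the words have already been read and split into a list, and the name wordpair_counts() is hypothetical. It pairs each word with its successor using zip() instead of next_pair(), which is just one alternative way to form the pairs.

```python
def wordpair_counts(words, is_case_sensitive):
    """Count adjacent word pairs in a list of words (hypothetical helper)."""
    if not is_case_sensitive:
        words = [w.lower() for w in words]
    counts = {}
    # zip stops at the shorter list, so the last word starts no pair.
    for pair in zip(words, words[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

example = wordpair_counts("a b c b b c b".split(), True)
```

Summing example.values() gives 6, matching the n-1 observation for the 7-word text.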
-
Define next_seq(prev_tuple, new_element). It takes an n-tuple, i.e., a tuple of n elements, (x0, …, xn-1). It returns an n-tuple (x1, …, xn-1, new_element).
Hint: It's convenient to convert from a tuple to a list, make the desired change, and then convert back to a tuple.
This should be a useful helper function for the next exercise.
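Following the hint, one possible sketch of next_seq() converts to a list and back. It assumes the tuple has at least one element, which holds whenever the sequence size is a positive integer.

```python
def next_seq(prev_tuple, new_element):
    """Return prev_tuple with its first element dropped and new_element appended."""
    items = list(prev_tuple)   # tuples are immutable, so work on a list copy
    items.pop(0)               # drop x0 (assumes a non-empty tuple)
    items.append(new_element)  # add the new last element
    return tuple(items)
```

Tuple slicing, prev_tuple[1:] + (new_element,), would give the same result without the intermediate list.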
-
Define wordseq_counts_file_case(filename, seq_size, include_punc, is_case_sensitive). It is like wordpair_counts_file_case(), except that instead of counting each word pair (a sequence of length 2), it counts each word sequence of length seq_size. The sequence size should be a positive integer.
For example, given a text "a b c b b c b" and sequence size 3, it should return {("a", "b", "c") : 1, ("b", "c", "b") : 2, ("b", "b", "c") : 1, ("c", "b", "b") : 1}. Observe that if there are n words in the text (with n at least seq_size), then the total count of sequences is n-(seq_size-1), since none of the last seq_size-1 words can start a full sequence. E.g., this text has 7 words, and the total of the counts in the dictionary is 5.
Thus, a sequence size of 1 should count individual words (except that the dictionary keys are 1-tuples of strings, instead of strings), and a sequence size of 2 should count word pairs.
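The example and the count formula can be checked with a small sketch of the counting step. It assumes the words are already in a list, and the name wordseq_counts() is hypothetical. It uses slicing to build each sequence rather than next_seq(), so treat it as one possible approach rather than the intended one.

```python
def wordseq_counts(words, seq_size):
    """Count each run of seq_size consecutive words (hypothetical helper)."""
    counts = {}
    # There are len(words) - (seq_size - 1) starting positions;
    # range() is empty when the text is shorter than seq_size.
    for start in range(len(words) - seq_size + 1):
        seq = tuple(words[start:start + seq_size])
        counts[seq] = counts.get(seq, 0) + 1
    return counts

words = "a b c b b c b".split()
triples = wordseq_counts(words, 3)
singles = wordseq_counts(words, 1)
```

Here the counts in triples sum to 5 and those in singles sum to 7, and the keys of singles are 1-tuples such as ("b",).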