COMP 200 Elements of Computer Science &
COMP 130 Elements of Algorithms and Computation
Spring 2012

Extending our textual analysis

Previously, we developed code to count word frequencies. Your current assignment uses those word frequencies as the basis of some statistical analyses. Now, we'll consider word-sequence frequencies.

Previously, by analyzing words, we were considering an author's word choice, but not how an author would put those words together. By analyzing word sequences, we are looking at the phrase and sentence structure of the text. It is a relatively simple way of looking at the textual structure, however, since we are still not considering the semantics, i.e., meaning, of the text, or even the structure in terms of parts of speech and such. And yet, this approach is one that is used in practice, as we'll explain more, later.

Word-sequence frequency counting, part 1

What we want to accomplish is very similar to before. For each length-n word sequence in the text, we want to have a count of how many times it occurs.

For example, if our text is "This example is silly. That example is also silly.", then some of our previous code would produce the list ["This","example","is","silly",".","That","example","is","also","silly","."]. If we choose n=2, then we want to get the following frequency counts:

["This","example"] 1
["example","is"] 2
["is","silly"] 1
["silly","."] 2
[".","That"] 1
["That","example"] 1
["is","also"] 1
["also","silly"] 1

Using sequences of only some fixed length n is a common convention. It is also simpler than building counts for sequences of all lengths.

Previously, our algorithm for word counting was essentially

Our new algorithm is almost identical.

There is no built-in Python way of looping over such word sequences, so we'll have to choose a way to accomplish that.

Also, we need to consider the exact representation of the resulting dictionary. For our previous example, the following choice seems obvious:

Try that in Python. Unfortunately, this obvious choice is not allowed in Python. Python does not allow the dictionary keys to be lists.

However, we can do something very similar. We can convert the lists into tuples and produce the dictionary mapping tuples of words into their counts:

Syntactically, all we have done is change the list square brackets into tuple parentheses. To explain the difference, we need to temporarily change gears and discuss tuples.

Tuples

Tuples are another way of packaging multiple items into a larger unit, along with lists and dictionaries. For example, a four-item tuple:

and a two-item tuple, more commonly called a pair:

A common usage of pairs would be to represent 2-dimensional points in space, using their x and y coordinates:

Similarly, 3-tuples or triples would be used to represent 3-dimensional points in space, using their x, y, and z coordinates.

We have briefly seen tuples in several contexts without much discussion. They were in the plotting code we gave you for the Predator/Prey problem. The function aDict.items() returns a list of pairs, where each pair contains a key and value. Returning two items from a function is implicitly returning a pair, even though you don't need to wrap them in parentheses.

So, to be more explicit, you can return a tuple from a function. And you can assign its result to a tuple.

You can loop over a list of tuples.

We can easily convert between lists and tuples:

Tuples are like lists

In many ways, tuples are like lists.

Writing an explicit tuple is identical to writing an explicit list, except for the parentheses. Like lists, they can contain any kind of data.

You can index a tuple by its index or position and also use slices.

You can loop over the items in a tuple and loop over its indices.

As also illustrated by the previous example, tuples support some of the same functions as list, including len().

Tuples are not like lists

However, tuples are more limited that lists in several ways. These amount to the simple idea that tuples are immutable; they cannot be changed.

You cannot update an element of a tuple.

You cannot add an element to a tuple.

You cannot remove an element from a tuple.

Why would we want a restricted form of list? Because it often matches our intentions better. A good example is the previous 2-dimensional point. If we are writing code for 2-dimensional graphics, we would rather use 2-tuples than 2-item lists. Doing so prevents us from making certain kinds of mistakes. For example, adding a third element to a point just doesn't make any sense.

We often have multiple tools that can get the job done. Pick the one that conceptually best corresponds to the task at hand. It will lead to better code and an easier time writing the code.

Why does Python allow dictionaries to have keys that are tuples, but not dictionaries? In short, the technique that Python uses to implement dictionaries efficiently doesn't allow for the keys to be mutable. Thus, we need to use the immutable tuples, instead.

Optional readings about Python tuples

Word-sequence frequency counting, part 2

So, back to our problem of counting the occurrences of word sequences in a list of words. As a reminder, our basic algorithm is

Our n-word sequences will be represented as tuples, so a sample output would be

There is one piece of the problem we haven't yet addressed: how to find each n-word sequence in the text. Actually, COMP 130 students have solved this before, albeit in a different context! Can you remember where?

We've now described solved each piece of the problem, often by just referring back to where we previously solved the same piece. Can you now retrieve and assemble the corresponding pieces of code into a Python function countSequences(words,n)? As always, try to solve this on your own or with help before looking at our solution.