Extending our textual analysis

Previously, we developed code to count word frequencies. Your current assignment uses those word frequencies as the basis of some statistical analyses. Now, we'll consider word-sequence frequencies.

Previously, by analyzing words, we were considering an author's word choice, but not how the author puts those words together. By analyzing word sequences, we are looking at the phrase and sentence structure of the text. It is still a relatively simple view of textual structure, since we are not considering the semantics, i.e., meaning, of the text, or even its structure in terms of parts of speech and such. Even so, this approach is used in practice, as we'll explain later.

Word-sequence frequency counting, part 1

What we want to accomplish is very similar to before. For each length-n word sequence in the text, we want to have a count of how many times it occurs.

For example, if our text is "This example is silly. That example is also silly.", then some of our previous code would produce the list ["This","example","is","silly",".","That","example","is","also","silly","."]. If we choose n=2, then we want to get the following frequency counts:

`["This","example"]`	→	1
`["example","is"]`	→	2
`["is","silly"]`	→	1
`["silly","."]`	→	2
`[".","That"]`	→	1
`["That","example"]`	→	1
`["is","also"]`	→	1
`["also","silly"]`	→	1

Using sequences of only some fixed length n is a common convention. It is also simpler than building counts for sequences of all lengths.

Previously, our algorithm for word counting was essentially

countOccurrences:
    Given a list of words, returns a dictionary of the word counts.

    Initialize a dictionary to have zero counts.
    For each word in the text,
        Increment the word's count in the dictionary.
    Return the dictionary.
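
As a reminder, that pseudocode might be written in Python along the following lines. This is a minimal sketch, not necessarily the exact code we developed earlier; it assumes defaultdict from the standard collections module, which supplies the zero counts implicitly.

```python
from collections import defaultdict

def countOccurrences(words):
    """Given a list of words, returns a dictionary of the word counts."""
    # defaultdict(int) treats any missing key as having count 0.
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return counts

wordCounts = countOccurrences(["This", "example", "is", "silly", ".",
                               "That", "example", "is", "also", "silly", "."])
```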

Our new algorithm is almost identical.

countSequences:
    Given a list of words and a length n, returns a dictionary of the
    counts of all length-n word sequences.

    Initialize a dictionary to have zero counts.
    For each length-n word sequence in the text,
        Increment the sequence's count in the dictionary.
    Return the dictionary.

There is no built-in Python way of looping over such word sequences, so we'll have to choose a way to accomplish that.

Also, we need to consider the exact representation of the resulting dictionary. For our previous example, the following choice seems obvious:

{["This","example"]:1, ["example","is"]:2, ["is","silly"]:1, ["silly","."]:2, [".","That"]:1, ["That","example"]:1, ["is","also"]:1, ["also","silly"]:1}

Try that in Python. Unfortunately, this obvious choice is not allowed: Python does not allow dictionary keys to be lists.
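
If you would rather see the failure without your program stopping, a small experiment such as the following should do. It is only an illustrative sketch with variable names of our own choosing. Python reports the problem as a TypeError, complaining that a list cannot be hashed.

```python
# Attempt to use a list as a dictionary key, and catch the resulting error.
try:
    counts = {["This", "example"]: 1}
    outcome = "allowed"
except TypeError as err:
    # Lists are mutable, so Python refuses to use them as keys.
    outcome = "TypeError: " + str(err)

print(outcome)
```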

However, we can do something very similar. We can convert the lists into tuples and produce the dictionary mapping tuples of words into their counts:

{("This","example"):1, ("example","is"):2, ("is","silly"):1, ("silly","."):2, (".","That"):1, ("That","example"):1, ("is","also"):1, ("also","silly"):1}

Syntactically, all we have done is change the list square brackets into tuple parentheses. To explain the difference, we need to temporarily change gears and discuss tuples.

Tuples

Tuples are another way of packaging multiple items into a larger unit, along with lists and dictionaries. For example, a four-item tuple:

```
tuple4 = ("abc",3,True,3)
```

and a two-item tuple, more commonly called a pair:

```
pair = ("abc",3)
```

A common usage of pairs would be to represent 2-dimensional points in space, using their x and y coordinates:

```
point = (7,12)
```

Similarly, 3-tuples or triples would be used to represent 3-dimensional points in space, using their x, y, and z coordinates.

We have briefly seen tuples in several contexts without much discussion. They were in the plotting code we gave you for the Predator/Prey problem. The method aDict.items() returns a list of pairs, where each pair contains a key and its value. Returning two items from a function implicitly returns a pair, even though you don't need to wrap them in parentheses.
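
For instance, the pairs produced by items() can be unpacked directly in a loop header. The following is a small sketch with a made-up dictionary; sorted() is used only so that the order of the pairs is predictable.

```python
wordCounts = {"example": 2, "silly": 2, "also": 1}

# Each element produced by items() is a (key, value) pair.
pairs = sorted(wordCounts.items())

# A pair can be unpacked into two variables in the loop header.
for word, count in pairs:
    print(word + " occurs " + str(count) + " time(s)")
```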

So, to be more explicit, you can return a tuple from a function. And you can assign its result either to a single variable or to a tuple of variables.

def silly1(x):
    return x,x+2,x+4

def silly2(x):
    return (x,x+2,x+4)

a,b,c = silly1(3)
y = silly1(3)
print a
print b
print c
print y

(d,e,f) = silly2(4)
z = silly2(4)
print d
print e
print f
print z

You can loop over a list of tuples.

tuples = [(1,2,3),(4,5,6),(7,8,9)]
for a,b,c in tuples:
    print a+b+c

Observe that Python uses parentheses in a lot of different ways. They are for grouping mathematical operations 3*(4+5), for function arguments abs(-4), for function parameters in a declaration def f(x): …, and now for tuples (7,12). There is even one more usage we haven't seen, called “generator expressions”. One potential confusion is that the syntax (8) is considered to be the mathematical grouping, and thus the result is just 8; it is not a 1-tuple. Instead, to specify a 1-tuple, you need a comma, as in (5,).
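
The grouping-versus-tuple distinction is easy to check for yourself. A short sketch, with variable names chosen just for illustration:

```python
notATuple = (8)     # parentheses used only for grouping; this is the int 8
oneTuple = (5,)     # the trailing comma makes this a 1-tuple
emptyTuple = ()     # empty parentheses make a 0-tuple

print(type(notATuple))
print(type(oneTuple))
print(len(oneTuple))
print(len(emptyTuple))
```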

We can easily convert between lists and tuples:

aList1 = [5,8,1,0,2]
aTuple = tuple(aList1)
print aTuple
aList2 = list(aTuple)
print aList2

Tuples are like lists

In many ways, tuples are like lists.

Writing an explicit tuple is identical to writing an explicit list, except for the parentheses. Like lists, they can contain any kind of data.

aTuple = (5,8,1,0,2)
aList = [5,8,1,0,2]

You can access a tuple's items by index, i.e., position, and also use slices.

print aTuple[1]
print aList[1]
print aTuple[2:4]
print aList[2:4]
print aTuple[-1]
print aList[-1]

You can loop over the items in a tuple and loop over its indices.

for x in aTuple:
    print x
for x in aList:
    print x

for i in range(len(aTuple)):
    print aTuple[i]
for i in range(len(aList)):
    print aList[i]

As also illustrated by the previous example, tuples support some of the same functions as lists, including len().
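
A few other built-in functions that only read their argument behave the same way on both. The following sketch is not an exhaustive list; it just tries several common ones.

```python
aTuple = (5, 8, 1, 0, 2)
aList = [5, 8, 1, 0, 2]

# These built-ins only read their argument, so they accept either type.
tupleSummary = (len(aTuple), min(aTuple), max(aTuple), sum(aTuple))
listSummary = (len(aList), min(aList), max(aList), sum(aList))

# Membership tests also work on both.
hasEight = (8 in aTuple) and (8 in aList)

# sorted() accepts a tuple, but always returns a new list.
sortedItems = sorted(aTuple)
```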

Tuples are not like lists

However, tuples are more limited than lists in several ways. These amount to the simple idea that tuples are immutable; they cannot be changed.

You cannot update an element of a tuple.

```
aList[3] = 9     # Allowed: lists are mutable.
aTuple[3] = 9    # Error: raises a TypeError.
```

You cannot add an element to a tuple.

```
aList.append(12)     # Allowed.
aTuple.append(12)    # Error: raises an AttributeError, since tuples have no append method.
```

You cannot remove an element from a tuple.

```
del aList[0]     # Allowed.
del aTuple[0]    # Error: raises a TypeError.
```
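
To confirm all three restrictions in one run, you could catch each error as it happens. This is a sketch of our own; the errors list and the changed variable are just for illustration. It also shows that building a new tuple from pieces of an old one is perfectly fine.

```python
aTuple = (5, 8, 1, 0, 2)
errors = []

try:
    aTuple[3] = 9          # updating an element
except TypeError:
    errors.append("update")

try:
    aTuple.append(12)      # adding an element
except AttributeError:
    # Tuples simply have no append method.
    errors.append("append")

try:
    del aTuple[0]          # removing an element
except TypeError:
    errors.append("delete")

# Building a new tuple is allowed; the original is left untouched.
changed = aTuple[:3] + (9,) + aTuple[4:]
```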

Why would we want a restricted form of list? Because it often matches our intentions better. A good example is the previous 2-dimensional point. If we are writing code for 2-dimensional graphics, we would rather use 2-tuples than 2-item lists. Doing so prevents us from making certain kinds of mistakes. For example, adding a third element to a point just doesn't make any sense.

We often have multiple tools that can get the job done. Pick the one that conceptually best corresponds to the task at hand. It will lead to better code and an easier time writing the code.

Why does Python allow dictionaries to have keys that are tuples, but not lists? In short, the technique that Python uses to implement dictionaries efficiently doesn't allow for the keys to be mutable. Thus, we need to use the immutable tuples, instead.
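
That technique is hashing, and Python's built-in hash() function lets you test whether a value is acceptable as a key. The helper isHashable below is a hypothetical name of our own, not something Python provides. Note that a tuple is only usable as a key if everything inside it is also immutable.

```python
def isHashable(value):
    """Returns True if value can be used as a dictionary key."""
    try:
        hash(value)
        return True
    except TypeError:
        return False

tupleOk = isHashable(("This", "example"))      # immutable tuple of strings
listOk = isHashable(["This", "example"])       # mutable list
mixedOk = isHashable((["This"], "example"))    # tuple containing a list
```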

Optional readings about Python tuples

Python Tutorial: Tuples
Wikibooks: Python tuples

Word-sequence frequency counting, part 2

So, back to our problem of counting the occurrences of word sequences in a list of words. As a reminder, our basic algorithm is

countSequences:
    Given a list of words and a length n, returns a dictionary of the
    counts of all length-n word sequences.

    Initialize a dictionary to have zero counts.
    For each length-n word sequence in the text,
        Increment the sequence's count in the dictionary.
    Return the dictionary.

Our n-word sequences will be represented as tuples, so a sample output would be

{("This","example"):1, ("example","is"):2, ("is","silly"):1, ("silly","."):2, (".","That"):1, ("That","example"):1, ("is","also"):1, ("also","silly"):1}

There is one piece of the problem we haven't yet addressed: how to find each n-word sequence in the text. Actually, COMP 130 students have solved this before, albeit in a different context! Can you remember where?

Recall the moving average problem of Assignment 1.  You needed to find each
length-n sublist, in order to calculate the mean of the sublist.
In our current problem, we want to find each length-n sublist, in order
to increment its count in a dictionary.

Refer to the provided Assignment 1 solutions for how to do that.
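
If you don't have those solutions at hand, the general idea looks like the following. This is our own minimal sketch of a moving average, not the official Assignment 1 solution, and the function name movingAverage is hypothetical. The part to notice is how the loop and the slice together visit each length-n sublist.

```python
def movingAverage(numbers, n):
    """Returns the list of means of each length-n sublist of numbers."""
    means = []
    # The last length-n sublist starts at index len(numbers)-n.
    for i in range(len(numbers) - n + 1):
        window = numbers[i:i+n]
        means.append(sum(window) / float(n))
    return means

averages = movingAverage([1, 2, 3, 4, 5], 2)
```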

We've now solved each piece of the problem, often by just referring back to where we previously solved the same piece. Can you now retrieve and assemble the corresponding pieces of code into a Python function countSequences(words,n)? As always, try to solve this on your own or with help before looking at our solution.

from collections import defaultdict

def countSequences(words,n):
    """Given a list, returns a dictionary mapping each n-element sequence tuple to its number of occurrences in the list."""

    # Initialize all counts implicitly to 0.
    countDict = defaultdict(int)

    # The last length-n sequence starts at index len(words)-n.
    for i in range(len(words)-n+1):
        # Slices of a list are lists, so convert to a tuple to use as a key.
        key = tuple(words[i:i+n])
        countDict[key] = countDict[key] + 1

    return countDict
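
To check the function against our running example, you could run something like the following. The function is repeated here only so that the sketch runs on its own. The names words and pairCounts are ours, and the counts are printed in sorted order simply to make the output predictable.

```python
from collections import defaultdict

def countSequences(words, n):
    """Same function as above, repeated so this example runs on its own."""
    countDict = defaultdict(int)
    for i in range(len(words) - n + 1):
        key = tuple(words[i:i+n])
        countDict[key] = countDict[key] + 1
    return countDict

words = ["This", "example", "is", "silly", ".",
         "That", "example", "is", "also", "silly", "."]
pairCounts = countSequences(words, 2)

# Print the counts in sorted order so the output is predictable.
for sequence, count in sorted(pairCounts.items()):
    print(str(sequence) + " -> " + str(count))
```

The resulting counts should match the n=2 table from part 1, with ("example","is") and ("silly",".") each occurring twice and the other six pairs occurring once.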