COMP 200 Elements of Computer Science &
COMP 130 Elements of Algorithms and Computation
Spring 2012

Very basic textual analysis

With our regular expression knowledge, we return to our goal of textual analysis, especially for comparing authorship. Our goal for today is a complete, albeit very simple, analysis: counting word frequencies. Such a frequency analysis will be the basis for later, more sophisticated analyses. Later, we will also measure word sequence frequencies, which will allow us to model phrase or sentence structure as well.

Our current analysis will have three steps:

  1. Read text from a file.
  2. Split the text into words.
  3. Count word occurrences.
The first and third steps will require that we introduce additional Python features.

Reading text from a file

We have provided a bunch of text files that you can use. Save one of these to your own computer or network drive. For consistency, save it with the same “.txt” file extension at the end of the name. Pay attention to where you stored it. Note that the file sizes are listed; we recommend using small files at first. In the following example, we'll use Quick fox text.txt, but you can substitute any file name in the following steps.

We will need to tell Python what folder (a.k.a. directory) the file is in. If you are using IPython, it will default to looking in the same folder as your code. If you are using IDLE/Python, it will default to a folder you don't care about, so we need to explicitly tell it where to look.

On Windows, you can right-click on a file, select “Properties” and look at “Location” to see the name of the folder. On a Mac, the equivalent is to right-click, select “Get Info” and look at “Where”. As an example, let's say it is U:\COMP 200, but you can substitute any folder name in the following steps.

We need a string for the filename. Again, if you're using IPython and your file is in the same directory as your code, then just the filename should work, e.g., "Quick fox text.txt". Otherwise, you need to put the folder and file names together. On Windows, substitute a forward slash “/” for any backslash “\”. Thus, you get something like "U:/COMP 200/Quick fox text.txt".
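As a small sketch of that substitution, assuming Python 3 and the example folder above, the full file name can be built from its two parts. The variable names folder, name, and filename are only illustrative.

```python
# The folder exactly as Windows displays it. The r prefix makes this a raw
# string, so the backslash is kept as an ordinary character.
folder = r"U:\COMP 200"
name = "Quick fox text.txt"

# Swap each backslash for a forward slash, then join folder and file name.
filename = folder.replace("\\", "/") + "/" + name
print(filename)   # U:/COMP 200/Quick fox text.txt
```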

Once you know what file name to use, actually reading the file is easy.
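A minimal sketch, assuming Python 3: open the file, read everything into one string, and close the file again. So that the sketch runs anywhere, it first writes a tiny sample file into a temporary folder; the sample sentence is made up and is not necessarily what the provided file contains. With a real file, skip that step and set filename yourself.

```python
import os
import tempfile

# Create a tiny sample file so this sketch is self-contained.
# With a real file, set filename directly instead,
# e.g. filename = "U:/COMP 200/Quick fox text.txt".
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "Quick fox text.txt")
with open(filename, "w") as sampleFile:
    sampleFile.write("The quick brown fox jumps over the lazy dog.\n")

# Open the file, read its entire contents into one string, and close it.
# The with statement closes the file automatically.
with open(filename) as textFile:
    text = textFile.read()

# For a big file, look at the length rather than printing the whole string.
print(len(text))
```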

This gives you one long string of everything in the file. If you are opening a big file, this string will also be big, and printing it out may be very slow and unhelpful.

Of course, there are other things you can do with files, but this is all we need for now.

Splitting the text into words

At the end of the previous discussion of regular expressions, we saw some examples of splitting a string into words. In short, we will use re.split to split the text string into pieces.

Our overall goal is to generate statistics that represent the style of the text. The words are only part of the style; the punctuation is another part. So, we will keep not only the words, but also the punctuation, and generate statistics for both. The whitespace, however, we will throw away (much to e.e. cummings' chagrin).

Thus, if our input is

then we want to split it into the following pieces:
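To make this concrete, here is an illustration with an assumed sample sentence (the names text and wanted are hypothetical). Words and punctuation each become their own piece, and only the whitespace disappears.

```python
# An assumed sample input.
text = "Hello, world -- are you there?!  Yes; it's a well-known fact."

# The pieces we would like to end up with: words and punctuation, in order.
wanted = ['Hello', ',', 'world', '--', 'are', 'you', 'there', '?!',
          'Yes', ';', "it's", 'a', 'well-known', 'fact', '.']

# Sanity check: gluing the pieces together gives the input minus whitespace.
print("".join(wanted) == "".join(text.split()))
```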

In order to do this, we still need to be able to distinguish what symbols should be considered part of a word, and which are punctuation. We will briefly consider what decisions to make here.

What will we consider punctuation, or more generally, not part of a word? We'll start with whitespace and the following characters: comma, colon, semi-colon, period, double-quotation marks, parentheses, slash (as in “and/or”), and ampersand. For exclamation marks and question marks, we'll consider combinations of them (such as ?!?) to be a single punctuation symbol. We'll consider two or more dashes together to be punctuation, whereas we'll assume a single one is a hyphen inside a word. We'll assume all single quotation marks are just apostrophes and thus part of a word. This isn't completely accurate, but it's pretty good, and about the best we can accomplish without a more sophisticated tool than regular expressions.

Can you devise a regular expression that corresponds to this description of whitespace and punctuation?

Now, use re.split as we've done previously:
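One possible pattern for the description above, offered as a sketch rather than the only correct answer (assuming Python 3; the name separators and the sample sentence are illustrative):

```python
import re

# Whitespace, OR one of , : ; . " ( ) / &, OR a run of ! and ?,
# OR two or more dashes. A single dash and an apostrophe are not matched,
# so they stay inside words.
separators = r'\s+|[,:;."()/&]|[!?]+|-{2,}'

text = "Hello, world -- are you there?!  Yes; it's a well-known fact."

# Split wherever the pattern matches; the matched text is discarded.
pieces = re.split(separators, text)
print(pieces)
```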

Unfortunately, that's not what we want, as it has removed the punctuation, as well. It also contains some empty strings.

Instead, we can wrap the whole regular expression in parentheses, which makes it a capturing group. Perhaps surprisingly, this tells re.split to keep the text matched by the group in the result instead of discarding it. Thus, the whitespace and punctuation are in the resulting list. Try it:
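A sketch of the parenthesized version, using the same assumed pattern and sample sentence as an illustration (assuming Python 3):

```python
import re

# The same alternatives as before, now wrapped in one pair of parentheses.
separators = r'(\s+|[,:;."()/&]|[!?]+|-{2,})'

text = "Hello, world -- are you there?!  Yes; it's a well-known fact."

# The matched separators now appear in the list, between the words.
pieces = re.split(separators, text)
print(pieces)

# Nothing was thrown away: joining the pieces rebuilds the original string.
print("".join(pieces) == text)
```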

Now, we want to remove all the whitespace from this list. Define a function called isNotWhitespace that takes a string and returns False if the string is empty or contains only whitespace, and True otherwise.
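One way to write it, as a sketch: strip the whitespace from both ends of the string and check whether anything is left. A regular expression would work equally well here.

```python
def isNotWhitespace(s):
    """Return False if s is empty or only whitespace, True otherwise."""
    # strip() removes leading and trailing whitespace, so a string that is
    # empty or all whitespace becomes the empty string.
    return s.strip() != ''

print(isNotWhitespace("fox"))
print(isNotWhitespace("  \n"))
```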

So, now we can use that to tell us what to remove from the list. Define a function removeWhitespace that takes a list of strings and returns a list with just the non-whitespace strings.
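A possible definition, sketched with a list comprehension; an explicit loop that appends to a new list would do the same job. isNotWhitespace is repeated here so the example runs on its own.

```python
def isNotWhitespace(s):
    """Return False if s is empty or only whitespace, True otherwise."""
    return s.strip() != ''

def removeWhitespace(strings):
    """Return a new list holding only the non-whitespace strings."""
    return [s for s in strings if isNotWhitespace(s)]

print(removeWhitespace(['Hello', ',', '', ' ', 'world']))
```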

Put these together to split the input string apart and remove the whitespace.
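Combined, the steps might look like the following sketch. The helper name splitText, the pattern, and the sample sentence are assumptions made for illustration (assuming Python 3).

```python
import re

# Assumed separator pattern, parenthesized so re.split keeps what it matches.
separators = r'(\s+|[,:;."()/&]|[!?]+|-{2,})'

def isNotWhitespace(s):
    """Return False if s is empty or only whitespace, True otherwise."""
    return s.strip() != ''

def removeWhitespace(strings):
    """Return a new list holding only the non-whitespace strings."""
    return [s for s in strings if isNotWhitespace(s)]

def splitText(text):
    """Split text into words and punctuation, dropping whitespace."""
    return removeWhitespace(re.split(separators, text))

text = "Hello, world -- are you there?!  Yes; it's a well-known fact."
words = splitText(text)
print(words)
```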

Note that we are not doing anything special with capitalized words. If we were strictly doing a word frequency analysis, we might want to count "The" and "the" as the same. But, given our overall purpose, keeping such pairs distinct can give us information about the text style, such as what words are most likely to start sentences.

Let's put all our pieces together, and define a function inputText that takes a filename and returns a list of the text strings we want.
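Here is one way the whole pipeline could fit together, as a sketch assuming Python 3. The pattern and the helper splitText are illustrative choices, and the sample file written at the end exists only so the example can run by itself.

```python
import os
import re
import tempfile

# Assumed separator pattern, parenthesized so re.split keeps what it matches.
separators = r'(\s+|[,:;."()/&]|[!?]+|-{2,})'

def isNotWhitespace(s):
    return s.strip() != ''

def removeWhitespace(strings):
    return [s for s in strings if isNotWhitespace(s)]

def splitText(text):
    return removeWhitespace(re.split(separators, text))

def inputText(filename):
    """Read the named file and return its words and punctuation as a list."""
    with open(filename) as textFile:
        return splitText(textFile.read())

# Demonstration on a small, made-up sample file.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "sample.txt")
with open(filename, "w") as sampleFile:
    sampleFile.write("The quick brown fox -- is it quick?! -- jumps.\n")

words = inputText(filename)
print(words)
```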

Counting word occurrences

How can we store the information that, for example, "you" occurs one time? With what we've seen so far, we have two main options, neither of which is very good.

With either of these approaches, to find the count corresponding to "you", we need to loop through a list until we find the desired string.
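As an illustration of why such a lookup is clumsy, here is a sketch of two list-based representations. These are assumed examples of the kind of option meant, a list of [word, count] pairs and a pair of parallel lists, and the function names are hypothetical.

```python
# Assumed option A: a list of [word, count] pairs.
pairs = [["the", 2], ["you", 1], ["fox", 1]]

def lookupCount(pairs, word):
    """Search the pairs one by one until the word is found."""
    for entry in pairs:
        if entry[0] == word:
            return entry[1]
    return 0   # the word never occurred

# Assumed option B: two parallel lists, matched up by position.
words = ["the", "you", "fox"]
counts = [2, 1, 1]

def lookupCountParallel(words, counts, word):
    """Search the word list, then use the same index in the count list."""
    for i in range(len(words)):
        if words[i] == word:
            return counts[i]
    return 0

print(lookupCount(pairs, "you"))
print(lookupCountParallel(words, counts, "you"))
```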

Given a list aList and an integer key, we can look up a corresponding value with aList[key]. We would like to be able to do the same thing when key is a string. We can, using what Python calls a dictionary.

Dictionaries

Here is a quick introduction to Python dictionaries. For more information, see the dictionary finger exercises.
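The basic operations, sketched with made-up words and counts (assuming Python 3):

```python
# A literal dictionary: each key is followed by a colon and its value.
counts = {"the": 2, "you": 1}

# Look up a value by its key, just like indexing a list.
print(counts["you"])

# Add a new key, or change the value of an existing one.
counts["fox"] = 1
counts["the"] = counts["the"] + 1

# Test whether a key is present before looking it up.
print("fox" in counts)
print("dog" in counts)

# Loop over the keys.
for word in counts:
    print(word, counts[word])

# An empty dictionary.
empty = {}
```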

Note that dictionary syntax is much like list syntax, except with curly braces for literal dictionaries.

Word frequency counting

All we need to do is to loop through our input text strings and either initialize or increment each word's count.
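A sketch of such a function, assuming Python 3. It uses defaultdict(int) from the standard collections module, which handles the "initialize" case automatically by treating a missing word as having count 0; with a plain dictionary you would test for the key with "in" before incrementing.

```python
from collections import defaultdict

def countOccurrences(words):
    """Return a dictionary mapping each string to how often it occurs."""
    counts = defaultdict(int)   # a missing key starts at int(), i.e. 0
    for word in words:
        counts[word] += 1       # initialize-or-increment in one step
    return counts

counts = countOccurrences(["the", "quick", "fox", ",", "the", "fox", "."])
print(counts["the"])
print(counts["fox"])
```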

Try countOccurrences(inputText(filename)) on some of the provided files. For more information on defaultdict, see the Python documentation for the collections module.

Counters (Optional for COMP 200)

Dictionaries are very useful, and we will be using them throughout the rest of the course. But, we can count the word occurrences even more simply using Counters.
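A brief sketch using Counter from the standard collections module, with a made-up word list (assuming Python 3):

```python
from collections import Counter

words = ["the", "quick", "fox", ",", "the", "fox", ".", "the"]

# Counter does all the counting when given a list of items.
counts = Counter(words)

print(counts["the"])            # 3
print(counts["dog"])            # 0, since missing items count as zero
print(counts.most_common(1))    # [('the', 3)]
```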

No looping or anything else needed — everything is done for you!
