Very basic textual analysis

With our regular expression knowledge, we return to our goal of textual analysis, especially for comparing authorship. Our goal for today is a complete, albeit very simple, analysis: counting word frequencies. Such a frequency analysis will be the basis for later, more sophisticated analyses. Later, we will also measure word sequence frequencies, which will allow us to model phrase or sentence structure as well.

Our current analysis will have three steps:

Read text from a file.
Split the text into words.
Count word occurrences.

The first and third steps will require that we introduce additional Python features.

Reading text from a file

We have provided a bunch of text files that you can use. Save one of these to your own computer or network drive. For consistency, save it with the same “.txt” file extension at the end of the name. Pay attention to where you stored it. Note that the file sizes are listed; we recommend using small files at first. In the following example, we'll use Quick fox text.txt, but you can substitute any file name in the following steps.

We will need to tell Python what folder (a.k.a. directory) the file is in. If you are using IPython, it will default to looking in the same folder as your code. If you are using IDLE/Python, it will default to a folder you don't care about, so we need to explicitly tell it where to look.

On Windows, you can right-click on a file, select “Properties” and look at “Location” to see the name of the folder. On a Mac, the equivalent is to right-click, select “Get Info” and look at “Where”. As an example, let's say it is U:\COMP 200, but you can substitute any folder name in the following steps.

We need a string for the filename. Again, if you're using IPython and your file is in the same directory, then just the filename should work, e.g., Quick fox text.txt. Otherwise, you need to put the folder and file names together. On Windows, substitute a forward slash “/” for any backslash “\”. Thus, you get something like "U:/COMP 200/Quick fox text.txt".
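As an alternative to typing the combined name by hand, the standard library's os.path.join can put the folder and file names together. This is only a sketch: the folder and file names are the running examples from the text, and the variable names folder and name are ours.

```python
import os

folder = "U:/COMP 200"                  # Example folder from the text; substitute your own
name = "Quick fox text.txt"             # Example file name from the text
filename = os.path.join(folder, name)   # Inserts the separator your operating system uses
print(filename)
```

The exact separator in the printed result depends on your operating system, which is why no expected output is shown.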

Once you know what file name to use, actually reading the file is easy.

filename = "U:/COMP 200/Quick fox text.txt"
textFile = open(filename,"r")      # The "r" means we will read the file.
inputString = textFile.read()

Alternatively, we can change the “working directory” to be what we want.

import os
os.chdir("U:/COMP 200")            # chdir = CHange the current working DIRectory
filename = "Quick fox text.txt"
textFile = open(filename,"r")      # The "r" means we will read the file.
inputString = textFile.read()

This gives you one long string of everything in the file. If you are opening a big file, this string will also be big, and printing it out may be very slow and unhelpful.
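The snippets above never close the file. A common way to handle that is Python's with statement, sketched below. To keep the example runnable anywhere, it first writes a stand-in file to a temporary folder; that file and the name sampleFile are assumptions for illustration, not one of the provided files.

```python
import os
import tempfile

# Hypothetical stand-in for one of the provided text files, so the example runs anywhere.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "Quick fox text.txt")
with open(filename, "w") as sampleFile:
    sampleFile.write("The quick brown fox jumps over the lazy dog.\n" * 1000)

# The with statement closes the file automatically when the block ends.
with open(filename, "r") as textFile:
    inputString = textFile.read()

print(len(inputString))      # How many characters were read
print(inputString[:44])      # Only the first sentence, not the whole string
```

Printing the length or a short slice is a quick way to check a large file without flooding the screen.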

Of course, there are other things you can do with files, but this is all we need for now.

Splitting the text into words

At the end of the previous discussion of regular expressions, we saw some examples of splitting a string into words. In short, we will use re.split to split the text string into pieces.

Our overall goal is to generate statistics that represent the style of the text. The words are only part of the style; the punctuation is another part. So, we will keep not only the words but also the punctuation, and generate statistics for both. The whitespace, however, we will throw away (much to e.e. cummings's chagrin).

Thus, if our input is

```
"Are you crazy, or brilliant!?!"
```

then we want to split it into the following pieces:

["Are", "you", "crazy", ",", "or", "brilliant", "!?!"]

In order to do this, we still need to be able to distinguish which symbols should be considered part of a word and which are punctuation. We will briefly consider what decisions to make here.

What will we consider punctuation, or more generally, not part of a word? We'll start with whitespace and the following characters: comma, colon, semi-colon, period, double-quotation marks, parentheses, slash (as in “and/or”), and ampersand. For exclamation marks and question marks, we'll consider combinations of them (such as ?!?) to be a single punctuation symbol. We'll consider two or more dashes together to be punctuation, whereas we'll assume a single one is a hyphen inside a word. We'll assume all single quotation marks are just apostrophes and thus part of a word. This isn't completely accurate, but it's pretty good, and about the best we can accomplish without a more sophisticated tool than regular expressions.

Can you devise a regular expression that corresponds to this description of whitespace and punctuation?

r"\s+|[,:;.\"()/&]|[?!]+|-{2,}"

This has four parts:
\s+           A sequence of one or more whitespaces
[,:;.\"()/&]  A punctuation symbol that isn't combined with anything else
[?!]+         A sequence of ? and !
-{2,}         A sequence of 2 or more dashes
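To see which pieces of a string this pattern picks out as delimiters, one option is re.findall. The following is a small sketch; the sample string and the names delimiters, sample, and matches are made up for illustration.

```python
import re

delimiters = r"\s+|[,:;.\"()/&]|[?!]+|-{2,}"
sample = "yes/no -- well-known?!"

# re.findall returns every piece of the sample that the pattern matches.
matches = re.findall(delimiters, sample)
print(matches)   # ['/', ' ', '--', ' ', '?!']
```

The single hyphen in "well-known" is not matched, so it stays inside the word, while the double dash and the "?!" run each count as one delimiter.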

Now, use re.split as we've done previously:

import re
re.split(r"\s+|[,:;.\"()/&]|[?!]+|-{2,}","Are you crazy, or brilliant!?!")

Unfortunately, that's not what we want, as it has removed the punctuation, as well. It also contains some empty strings.
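For reference, here is a sketch of what this unparenthesized split returns for the example string. The variable name pieces is ours.

```python
import re

pieces = re.split(r"\s+|[,:;.\"()/&]|[?!]+|-{2,}", "Are you crazy, or brilliant!?!")
print(pieces)   # ['Are', 'you', 'crazy', '', 'or', 'brilliant', '']
```

The first empty string sits between the comma and the space that follows it, since both are delimiters. The second comes after the final "!?!", where the string ends right after a delimiter.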

Instead, we can add parentheses to the regular expression. Perhaps surprisingly, this tells re.split not to remove anything when splitting. Thus, the whitespace and punctuation are in the resulting list. Try it:

re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})","Are you crazy, or brilliant!?!")
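A sketch of the result for the example string, with the variable name pieces chosen for illustration:

```python
import re

pieces = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})", "Are you crazy, or brilliant!?!")
print(pieces)
# ['Are', ' ', 'you', ' ', 'crazy', ',', '', ' ', 'or', ' ', 'brilliant', '!?!', '']
```

Because the whole pattern is wrapped in one capturing group, each matched delimiter appears in the list between the pieces it separated. The empty strings are still there.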

Now, we want to remove all the whitespace from this list. Define a function called isNotWhitespace that takes a string and returns False if the string is empty or contains only whitespace, and True otherwise.

def isNotWhitespace(s):
    """Returns False if the string s is empty or consists only of whitespace, True otherwise."""
    # re.match starts at the beginning of s, and '$' marks the end of s,
    # so r"\s*$" matches only when the whole string is whitespace (or empty).
    if re.match(r"\s*$",s):
        return False
    else:
        return True
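A quick way to check the behavior is to try the function on a few strings. The sketch below restates the function in one line so that it runs on its own; the test strings and the name results are made up.

```python
import re

def isNotWhitespace(s):
    """Compact restatement of the function above, repeated so this example runs on its own."""
    return re.match(r"\s*$", s) is None

# Pair each test string with the function's answer for it.
results = [(piece, isNotWhitespace(piece)) for piece in ["fox", ",", "", " ", "\t\n", " a "]]
print(results)
# [('fox', True), (',', True), ('', False), (' ', False), ('\t\n', False), (' a ', True)]
```

Note that " a " counts as non-whitespace: the function only rejects strings with nothing but whitespace in them.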

So, now we can use that to tell us what to remove from the list. Define a function removeWhitespace that takes a list of strings and returns a list with just the non-whitespace strings.

def removeWhitespace(stringList):
    """Returns a list like stringList, except with the whitespace-only strings removed."""
    newList = []
    for s in stringList:
        if isNotWhitespace(s):
            newList.append(s)
    return newList

A short form using a list comprehension:

def removeWhitespace(stringList):
    """Returns a list like stringList, except with the whitespace-only strings removed."""
    return [s for s in stringList if isNotWhitespace(s)]

A short form using a filter:

def removeWhitespace(stringList):
    """Returns a list like stringList, except with the whitespace-only strings removed."""
    return list(filter(isNotWhitespace,stringList))   # list(...) is needed in Python 3, where filter returns an iterator.

Put these together to split the input string apart and remove the whitespace.

text = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})",inputString)
text = removeWhitespace(text)
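As a check, here is a sketch that runs these two steps on the earlier example string instead of a file. The helper is repeated in condensed form so the block runs on its own.

```python
import re

def removeWhitespace(stringList):
    """Condensed copy of the function above, repeated so this example runs on its own."""
    return [s for s in stringList if not re.match(r"\s*$", s)]

inputString = "Are you crazy, or brilliant!?!"
text = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})", inputString)
text = removeWhitespace(text)
print(text)   # ['Are', 'you', 'crazy', ',', 'or', 'brilliant', '!?!']
```

This is exactly the list of pieces we set out to produce.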

Note that we are not doing anything special with capitalized words. If we were strictly doing a word frequency analysis, we might want to count "The" and "the" as the same. But, given our overall purpose, keeping such pairs distinct can give us information about the text style, such as what words are most likely to start sentences.

Let's put all our pieces together, and define a function inputText that takes a filename and returns a list of the text strings we want.

def inputText(filename):
    """Returns a list of text and punctuation strings, omitting whitespace, read from a file named filename."""
    textFile = open(filename,"r")      # The "r" means we will read the file.
    inputString = textFile.read()
    textFile.close()                   # Close the file once we have its contents.
    text = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})",inputString)
    text = removeWhitespace(text)
    return text

Or, here's a version that rearranges the code a bit:

def inputText(filename):
    """Returns a list of text and punctuation strings, omitting whitespace, read from a file named filename."""
    delimiters = r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})"
    inputString = open(filename,"r").read()
    return removeWhitespace(re.split(delimiters,inputString))
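To try inputText without one of the provided files, the sketch below writes a one-line sample file to a temporary folder and reads it back. The sample text, the file name sample.txt, and the names sampleName, sampleFile, and tokens are assumptions for illustration; the function is a condensed copy so the block runs on its own.

```python
import os
import re
import tempfile

def inputText(filename):
    """Condensed copy of the function above, repeated so this example runs on its own."""
    delimiters = r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})"
    with open(filename, "r") as textFile:
        inputString = textFile.read()
    return [s for s in re.split(delimiters, inputString) if not re.match(r"\s*$", s)]

# Hypothetical sample file, written to a temporary folder so the example runs anywhere.
sampleName = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sampleName, "w") as sampleFile:
    sampleFile.write("Don't panic -- it's a well-known fox.\n")

tokens = inputText(sampleName)
print(tokens)
# ["Don't", 'panic', '--', "it's", 'a', 'well-known', 'fox', '.']
```

The apostrophes and the single hyphen stay inside their words, while the double dash and the period come out as separate punctuation pieces.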

Counting word occurrences

How can we store the information that, for example, "you" occurs one time? With what we've seen so far, we have two main options, neither of which is very good.

Use two lists, one of strings, and one of the corresponding counts.
(COMP 130 only) Use a list of pairs, where each pair consists of a string and its count.

With either of these approaches, to find the count corresponding to "you", we need to loop through a list until we find the desired string.
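A minimal sketch of the two-list approach shows what that lookup involves. The data and the function name lookupCount are hypothetical.

```python
words = ["Are", "you", "crazy"]    # Hypothetical example data
counts = [1, 1, 1]                 # counts[i] is the count for words[i]

def lookupCount(word, words, counts):
    """Returns the count stored for word, or 0 if word is not in the words list."""
    for i in range(len(words)):
        if words[i] == word:
            return counts[i]
    return 0

print(lookupCount("you", words, counts))     # 1
print(lookupCount("zebra", words, counts))   # 0
```

Every lookup may have to scan the whole list, and the two lists must be kept in step by hand.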

Given a list aList and an integer key, we can look up a corresponding value with aList[key]. We would like to be able to do the same thing when key is a string. We can, using what Python calls a dictionary.

Dictionaries

Here is a quick introduction to Python dictionaries. For more information, see the dictionary finger exercises.

An empty dictionary:
```
{ }
```
An example dictionary, indexed by strings, containing integers, as in the word counts that we will be using. Note that the elements can be printed in a different order and still mean the same thing.
```
d = {"a": 3, "b": 5, "xyz": 2, "silly": 42}
d
```
Looking up a value:
```
d["b"]
```
Updating a value:
```
d["b"] = 17
d
```
Adding a value:
```
d["newone"] = -9
d
```

Note that dictionary syntax is much like list syntax, except with curly braces for literal dictionaries.
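A few more operations come in handy when counting: testing whether a key is present, asking for the size, and looping over the keys. The sketch below reuses the example dictionary d; the name sortedKeys is ours.

```python
d = {"a": 3, "b": 5, "xyz": 2, "silly": 42}

print("b" in d)         # True: "b" is a key
print("zebra" in d)     # False: looking up d["zebra"] would raise a KeyError
print(len(d))           # 4 key/value pairs

# Looping over a dictionary visits its keys; sorting them gives a predictable order.
sortedKeys = sorted(d)
for key in sortedKeys:
    print(key + ": " + str(d[key]))
```

Testing with in before a lookup avoids the KeyError that a missing key would otherwise cause.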

Word frequency counting

All we need to do is to loop through our input text strings and either initialize or increment each word's count.

def countOccurrences(aList):
    """Given a list, returns a dictionary mapping each element to its number of occurrences in the list."""
    countDict = {}
    for x in aList:
        if x in countDict:
            countDict[x] = countDict[x] + 1
        else:
            countDict[x] = 1
    return countDict

Here's a slick way to initialize the dictionary all at once:

from collections import defaultdict

def countOccurrences(aList):
    """Given a list, returns a dictionary mapping each element to its number of occurrences in the list."""
    countDict = defaultdict(int)    # Initializes all values implicitly to 0.
    for x in aList:
        countDict[x] = countDict[x] + 1
    return countDict
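To see what defaultdict(int) does on its own, here is a small sketch with a made-up key:

```python
from collections import defaultdict

countDict = defaultdict(int)
print(countDict["you"])     # 0: looking up a missing key creates it with the value int(), which is 0
countDict["you"] = countDict["you"] + 1
print(countDict["you"])     # 1
print(dict(countDict))      # {'you': 1}
```

This is why the loop in countOccurrences no longer needs to check whether the key is already present.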

Try countOccurrences(inputText(filename)) on some of the provided files. For more information on defaultdict, see the Python documentation for the collections module.

Counters (Optional for COMP 200)

Dictionaries are very useful, and we will be using them throughout the rest of the course. But, we can count the word occurrences even more simply using Counters.

from collections import Counter
Counter(inputText(filename))

No looping or anything else needed — everything is done for you!
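A Counter can also be queried like a dictionary, and it can report its most frequent elements. The sketch below uses a made-up token list in place of inputText(filename).

```python
from collections import Counter

# Hypothetical token list standing in for inputText(filename).
text = ["the", "quick", "fox", ",", "the", "lazy", "dog", ",", "the", "end"]
counts = Counter(text)
print(counts["the"])            # 3
print(counts["cat"])            # 0: a Counter returns 0 for elements it has not seen
print(counts.most_common(2))    # [('the', 3), (',', 2)]
```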

Some related resources

Python documentation: Dictionaries, Counters
Python tutorial: Reading and Writing Files, Dictionaries
Wikibooks: File input
Dive Into Python: Files
Project Gutenberg — a source of many free books, all(?) of which are available in plain text form