Very basic textual analysis
With our regular expression knowledge, we will return to our goal of textual analysis, especially for comparing authorship. Our goal for today is to do a complete, albeit very simple, analysis: counting word frequencies. Such a frequency analysis will be the basis for later, more sophisticated analyses. Later, we will also measure word sequence frequencies, which will allow us to model phrase or sentence structure as well.
Our current analysis will have three steps:
- Read text from a file.
- Split the text into words.
- Count word occurrences.
Reading text from a file
We have provided a number of text files that you can use. Save one of these to your own computer or network drive. For consistency, save it with the same “.txt” file extension at the end of the name. Pay attention to where you stored it. Note that the file sizes are listed; we recommend using small files at first. In the following example, we'll use Quick fox text.txt, but you can substitute any file name in the following steps.
We will need to tell Python what folder (a.k.a. directory) the file is in. If you are using IPython, it will default to looking in the same folder as your code. If you are using IDLE/Python, it will default to a folder you don't care about, so we need to explicitly tell it where to look.
On Windows, you can right-click on a file, select “Properties” and look at “Location” to see the name of the folder. On a Mac, the equivalent is to right-click, select “Get Info” and look at “Where”. As an example, let's say it is U:\COMP 200, but you can substitute any folder name in the following steps.
We need a string for the filename. Again, if you're using IPython and
your file is in the same directory as your code, then just the filename should work,
e.g., Quick fox text.txt. Otherwise, you need to
put the folder and file names together.
On Windows, substitute a forward slash “/” for any
backslash “\”.
Thus, you get something like "U:/COMP 200/Quick fox text.txt".
Once you know what file name to use, actually reading the file is easy.
-
filename = "U:/COMP 200/Quick fox text.txt"
textFile = open(filename, "r")    # The "r" means we will read the file.
inputString = textFile.read()
-
Alternatively, we can change the “working directory” to
be what we want.
import os
os.chdir("U:/COMP 200")    # chdir = CHange the current working DIRectory
filename = "Quick fox text.txt"
textFile = open(filename, "r")    # The "r" means we will read the file.
inputString = textFile.read()
This gives you one long string of everything in the file. If you are opening a big file, this string will also be big, and printing it out may be very slow and unhelpful.
Of course, there are other things you can do with files, but this is all we need for now.
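As a minimal sketch of the reading step, the version below uses a with block, which closes the file automatically once the read is done. To stay self-contained, it first writes a small sample file into a temporary folder. The folder, the sample sentence, and the name sampleText are assumptions made for illustration and are not part of the provided files.

```python
import os
import tempfile

# Assumption: a throwaway folder and sample sentence stand in for your own file.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "Quick fox text.txt")
sampleText = "The quick brown fox jumps over the lazy dog."

with open(filename, "w") as textFile:    # "w" means we will write the file.
    textFile.write(sampleText)

with open(filename, "r") as textFile:    # "r" means we will read the file.
    inputString = textFile.read()
# The file is closed automatically when the with block ends.

# Print the length rather than the string itself, which may be huge.
print(len(inputString))
```

With your own file, only the second with block is needed, using the filename string built in the steps above.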
Splitting the text into words
At the end of
the previous discussion of regular expressions, we saw
some examples of splitting a string into words.
In short, we will use re.split
to split the text string
into pieces.
Our overall goal is to generate statistics that represent the style of the text. The words are only part of the style; the punctuation is another part. So, we will keep not only the words, but also the punctuation, and generate statistics for them as well. The whitespace, however, we will throw away (much to e.e. cummings's chagrin).
Thus, if our input is
-
"Are you crazy, or brilliant!?!"
then we want our result to be the list
-
["Are", "you", "crazy", ",", "or", "brilliant", "!?!"]
In order to do this, we still need to be able to distinguish which symbols should be considered part of a word and which are punctuation. We will briefly consider what decisions to make here.
What will we consider punctuation, or more generally, not part of a word? We'll start with whitespace and the following characters: comma, colon, semi-colon, period, double-quotation marks, parentheses, slash (as in “and/or”), and ampersand. For exclamation marks and question marks, we'll consider combinations of them (such as ?!?) to be a single punctuation symbol. We'll consider two or more dashes together to be punctuation, whereas we'll assume a single one is a hyphen inside a word. We'll assume all single quotation marks are just apostrophes and thus part of a word. This isn't completely accurate, but it's pretty good, and about the best we can accomplish without a more sophisticated tool than regular expressions.
Can you devise a regular expression that corresponds to this description of whitespace and punctuation?
-
r"\s+|[,:;.\"()/&]|[?!]+|-{2,}"
This has four parts:
- \s+ : a sequence of one or more whitespace characters
- [,:;.\"()/&] : a single punctuation symbol that isn't combined with anything else
- [?!]+ : a sequence of one or more ? and ! characters
- -{2,} : a sequence of two or more dashes
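One way to check the four parts is to ask which pieces of a string the pattern matches. The sketch below uses re.findall for that. The test string is made up for illustration and the variable names are assumptions.

```python
import re

delimiters = r"\s+|[,:;.\"()/&]|[?!]+|-{2,}"

# Assumption: a made-up test string that exercises each of the four parts.
sample = "well-known -- yes/no?!"

# re.findall returns every piece of the string that the pattern matches.
matches = re.findall(delimiters, sample)
print(matches)    # [' ', '--', ' ', '/', '?!']
```

Note that the single hyphen in "well-known" is not matched, so the hyphenated word stays in one piece, while the double dash and the combined "?!" each count as one punctuation symbol.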
Now, use re.split
as we've done previously:
-
re.split(r"\s+|[,:;.\"()/&]|[?!]+|-{2,}","Are you crazy, or brilliant!?!")
This removes the whitespace and the punctuation from the result, but we want to keep the punctuation.
Instead, we can add parentheses around the regular expression.
Perhaps surprisingly, this tells re.split
to keep the pieces that it splits on.
Thus, the whitespace and punctuation are in the
resulting list. Try it:
-
re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})","Are you crazy, or brilliant!?!")
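To make the difference concrete, here is a small side-by-side sketch of the two calls on the same sentence. The variable names are assumptions chosen for illustration.

```python
import re

sentence = "Are you crazy, or brilliant!?!"

# Without parentheses, the matched delimiters are discarded.
withoutGroup = re.split(r"\s+|[,:;.\"()/&]|[?!]+|-{2,}", sentence)
print(withoutGroup)
# ['Are', 'you', 'crazy', '', 'or', 'brilliant', '']

# With parentheses, the matched delimiters are kept in the list.
withGroup = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})", sentence)
print(withGroup)
# ['Are', ' ', 'you', ' ', 'crazy', ',', '', ' ', 'or', ' ', 'brilliant', '!?!', '']
```

Both results contain empty strings. These appear wherever two delimiters are adjacent (the comma followed by a space) or a delimiter ends the string.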
Now, we want to remove all the whitespace from this list.
Define a function called isNotWhitespace
that takes a string and returns
False
if the string is empty or contains only whitespace, or True
otherwise.
-
import re

def isNotWhitespace(s):
    """Returns False if the string s is empty or only whitespace, True otherwise."""
    # re.match starts at the beginning of s, and '$' matches at the end,
    # so the pattern has to account for the whole string.
    if re.match(r"\s*$", s):
        return False
    else:
        return True
So, now we can use that to tell us what to remove from the list.
Define a function removeWhitespace
that takes a list of
strings and returns a list with just the non-whitespace strings.
-
def removeWhitespace(stringList):
    """Returns a list like stringList, except with the whitespace-only strings removed."""
    newList = []
    for s in stringList:
        if isNotWhitespace(s):
            newList.append(s)
    return newList
-
A short form using a
list comprehension:
def removeWhitespace(stringList):
    """Returns a list like stringList, except with the whitespace-only strings removed."""
    return [s for s in stringList if isNotWhitespace(s)]
-
A short form using
filter:
def removeWhitespace(stringList):
    """Returns a list like stringList, except with the whitespace-only strings removed."""
    # In Python 3, filter returns an iterator, so we convert it to a list.
    return list(filter(isNotWhitespace, stringList))
Put these together to split the input string apart and remove the whitespace.
-
text = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})", inputString)
text = removeWhitespace(text)
Note that we are not doing anything special with capitalized words.
If we were strictly doing a word frequency analysis, we might want
to count "The"
and "the"
as the same.
But, given our overall purpose, keeping such pairs distinct can give
us information about the text style, such as what words are most likely
to start sentences.
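The effect of that choice can be seen on a tiny example. The sketch below counts one word with and without folding everything to lowercase. The token list and the name lowered are assumptions made up for illustration.

```python
# Assumption: a short, made-up list of tokens.
tokens = ["The", "dog", "saw", "the", "cat", ".", "The", "cat", "ran", "."]

# Keeping capitalization: "The" and "the" are counted separately.
print(tokens.count("The"))    # 2
print(tokens.count("the"))    # 1

# Folding to lowercase first: both forms are counted together.
lowered = [t.lower() for t in tokens]
print(lowered.count("the"))   # 3
```

In the distinct counts, the capitalized form dominates, which hints that this word often starts a sentence. That information is lost after lowercasing.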
Let's put all our pieces together, and define a function inputText
that takes a filename and returns a list of the text strings we want.
-
def inputText(filename):
    """Returns a list of text and punctuation strings, omitting whitespace, read from a file named filename."""
    textFile = open(filename, "r")    # The "r" means we will read the file.
    inputString = textFile.read()
    text = re.split(r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})", inputString)
    text = removeWhitespace(text)
    return text
-
Or, here's a version that rearranges the code a bit:
def inputText(filename):
    """Returns a list of text and punctuation strings, omitting whitespace, read from a file named filename."""
    delimiters = r"(\s+|[,:;.\"()/&]|[?!]+|-{2,})"
    inputString = open(filename, "r").read()
    return removeWhitespace(re.split(delimiters, inputString))
Counting word occurrences
How can we store the information that, for example,
"you"
occurs one time?
With what we've seen so far, we have two main options, neither of which
is very good.
- Use two lists, one of strings, and one of the corresponding counts.
- (COMP 130 only) Use a list of pairs, where each pair consists of a string and its count.
In either case, to find the count for a string such as "you", we need to loop through a list until we find
the desired string.
Given a list aList
and an integer key,
we can look up a corresponding value with aList[key].
We would like to be able to do the same thing when key
is a string.
We can, but with what Python calls a dictionary.
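To see why the list-based options are awkward, here is a rough sketch of a lookup using the two-list approach. The function name lookupCount and the sample data are hypothetical, introduced only for this illustration.

```python
# Assumption: made-up sample data, with counts[i] holding the count of words[i].
words = ["the", "fox", "jumps"]
counts = [2, 1, 1]

def lookupCount(word, words, counts):
    """Returns the count stored for word, or 0 if word is not in words."""
    # We have to scan the list of words until we find the one we want.
    for i in range(len(words)):
        if words[i] == word:
            return counts[i]
    return 0

print(lookupCount("fox", words, counts))    # 1
print(lookupCount("dog", words, counts))    # 0
```

Every lookup may scan the whole list, and the two lists must be kept in step by hand. A dictionary removes both problems.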
Dictionaries
Here is a quick introduction to Python dictionaries. For more information, see the dictionary finger exercises.
-
An empty dictionary:
{ }
-
An example dictionary, indexed by strings, containing integers,
as in the word counts that we will be using.
Note that the elements can be printed in a different order and
still mean the same thing.
d = {"a": 3, "b": 5, "xyz": 2, "silly": 42}
d
-
Looking up a value:
d["b"]
-
Updating a value:
d["b"] = 17
d
-
Adding a value:
d["newone"] = -9
d
Note that dictionary syntax is much like list syntax, except with curly braces for literal dictionaries.
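A few more dictionary operations are handy when counting words. The sketch below shows testing whether a key is present, looking up a key that may be missing, and looping over the keys. The variable names are assumptions for illustration.

```python
d = {"a": 3, "b": 5, "xyz": 2, "silly": 42}

hasB = "b" in d              # True: "b" is a key
hasZebra = "zebra" in d      # False: "zebra" is not a key
print(hasB, hasZebra)

# d["zebra"] would raise a KeyError, since there is no such key.
# The get method returns a fallback value (here 0) instead.
zebraCount = d.get("zebra", 0)
print(zebraCount)

# Looping over a dictionary visits its keys.
for key in d:
    print(key, d[key])
```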
Word frequency counting
All we need to do is to loop through our input text strings and either initialize or increment each word's count.
-
def countOccurrences(aList):
    """Given a list, returns a dictionary mapping each element to its number of occurrences in the list."""
    countDict = {}
    for x in aList:
        if x in countDict:
            countDict[x] = countDict[x] + 1
        else:
            countDict[x] = 1
    return countDict
-
Here's a slick way to initialize the dictionary all at once:
from collections import defaultdict

def countOccurrences(aList):
    """Given a list, returns a dictionary mapping each element to its number of occurrences in the list."""
    countDict = defaultdict(int)    # Any missing key implicitly starts at 0.
    for x in aList:
        countDict[x] = countDict[x] + 1
    return countDict
Try countOccurrences(inputText(filename))
on some
of the provided files. For more information on
defaultdict, see the Python documentation for the collections module.
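Once the counts are in a dictionary, a natural next step is to ask which strings occur most often. The following is one possible sketch, using sorted with the counts as the sort key. The sample list and the names byCount and mostFrequent are assumptions for illustration.

```python
def countOccurrences(aList):
    """Given a list, returns a dictionary mapping each element to its number of occurrences in the list."""
    countDict = {}
    for x in aList:
        if x in countDict:
            countDict[x] = countDict[x] + 1
        else:
            countDict[x] = 1
    return countDict

# Assumption: a short, made-up list standing in for inputText(filename).
text = ["the", "fox", "saw", "the", "dog", ".", "the", "dog", "ran"]
counts = countOccurrences(text)

# Sort the keys by their counts, largest count first.
byCount = sorted(counts, key=counts.get, reverse=True)
mostFrequent = byCount[:2]
print(mostFrequent)    # ['the', 'dog']
```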
Counters (Optional for COMP 200)
Dictionaries are very useful, and we will be using them throughout the
rest of the course. But, we can count the word occurrences even more
simply using Counters.
-
from collections import Counter
Counter(inputText(filename))
No looping or anything else needed — everything is done for you!
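A Counter can be used like a dictionary, with a few conveniences for counting. The short sketch below uses a made-up list in place of inputText(filename), so the sample data is an assumption.

```python
from collections import Counter

# Assumption: a short, made-up list standing in for inputText(filename).
text = ["the", "fox", "saw", "the", "dog", ".", "the", "dog", "ran"]
counts = Counter(text)

print(counts["the"])      # 3
print(counts["zebra"])    # 0: a missing item counts as zero, with no KeyError

# most_common(n) returns the n most frequent items with their counts.
print(counts.most_common(2))    # [('the', 3), ('dog', 2)]
```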
Some related resources
- Python documentation: Dictionaries, Counters
- Python tutorial: Reading and Writing Files, Dictionaries
- Wikibooks: File input
- Dive Into Python: Files
- Project Gutenberg — a source of many free books, all(?) of which are available in plain text form