COMP 200: Elements of Computer Science
Spring 2013

First, a couple quick notes on previous material. Originally, our suggested function for building a dictionary of word counts was not fast enough in CodeSkulptor to handle large text files. That was due to a CodeSkulptor bug that has now been fixed. Also, urllib2.urlopen() currently only works on files that we provide, not any URL on the web.


Python Regular Expression Basics

Using methods such as string.count(), string.find(), and string.index(), it is easy to search a text for a “fixed” pattern, i.e., a single string. For example, we can search for the string "help" with the call my_string.index("help"). Similarly, it is easy to split a text on a fixed separator with string.split().
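As a small illustration of these fixed-pattern methods, here is a minimal sketch. The sample sentence is made up for this example; only the method names come from the notes above.

```python
# A made-up sample sentence (not from the course materials).
my_string = "Please help me. I need help!"

# count() gives the number of non-overlapping occurrences.
help_count = my_string.count("help")

# find() and index() both give the position of the first occurrence.
first_find = my_string.find("help")
first_index = my_string.index("help")

# They differ when the string is absent: find() returns -1,
# while index() raises a ValueError.
missing = my_string.find("xyz")

# split() with a fixed separator breaks the text at every space.
pieces = my_string.split(" ")

print(help_count)    # 2
print(first_find)    # 7
print(first_index)   # 7
print(missing)       # -1
print(pieces)        # ['Please', 'help', 'me.', 'I', 'need', 'help!']
```

Note that the punctuation stays attached to "me." and "help!", which is one reason a fixed separator is often not enough.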

Often, however, we want to search or split using a more general pattern. We can describe many kinds of patterns with regular expressions (REs). Regular expressions are used a lot by computational linguists. We will also use them to search the British National Corpus.

Consider looking for the Semitic root “S–L–M”, as found in words such as “Islam”, “Muslim”, “salam”, “shalom”, “Solomon”, and “Suleiman”. To search for this root in a piece of text, we want to specify that the pattern is "s" or "S", followed by zero or more lowercase letters, followed by "l", followed by zero or more lowercase letters, followed by "m". This is captured by the RE "[sS][a-z]*l[a-z]*m". Let's break this down into parts to understand it:
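The pieces of this RE, read left to right, can be sketched as comments in a short program. The variable names and the extra word "help" are assumptions added for illustration; the RE and the other words are the ones given above.

```python
import re

# The RE from the text, one piece at a time:
#   [sS]     exactly one character, either "s" or "S"
#   [a-z]*   zero or more lowercase letters
#   l        the letter "l"
#   [a-z]*   zero or more lowercase letters
#   m        the letter "m"
slm_pattern = "[sS][a-z]*l[a-z]*m"

# The example words from the text, plus "help" as a word without the root.
words = ["Islam", "Muslim", "salam", "shalom", "Solomon", "Suleiman", "help"]

# Map each word to the list of substrings that match the pattern.
matches = {}
for word in words:
    matches[word] = re.findall(slm_pattern, word)
    print("%s -> %s" % (word, matches[word]))
```

The match is only the part of the word that fits the pattern: "Islam" gives ['slam'], "Solomon" gives ['Solom'], and "help" gives the empty list.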

Thus, this represents the desired pattern. As you can see, REs are like a separate mini-language within Python. (Most programming languages these days have similar functions for matching with REs, and have similar syntax for REs.)

We will introduce two functions: re.findall() for searching for RE patterns and re.split() for splitting based upon RE patterns. We will start to explore REs in these examples, and follow up with more later. A lot of features available in regular expression libraries are downright confusing, even to experts. We're going to limit ourselves to the commonly used basics, plus what we specifically need for our example of accurately splitting text into words.

Regular Expressions in Python

How do we use regular expressions (REs) in Python code? We will illustrate re.findall() and re.split() on REs of increasing complexity. For now, we will focus on the core ideas in REs. For each example, think about what you expect the result to be, then run the code and think about what the results actually are.
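Here is one possible progression, offered as a sketch rather than the original course examples. The sample sentence and variable names are assumptions; the patterns use only the features introduced so far (plain characters, [...] character sets, and *).

```python
import re

# A made-up sample sentence (not from the course materials).
text = "The cat sat on the mat."

# 1. A plain string is the simplest RE: it matches itself.
plain = re.findall("at", text)

# 2. A character set matches any one of the listed characters.
char_set = re.findall("[cm]at", text)

# 3. A star allows zero or more repetitions of the preceding item.
starred = re.findall("[a-z]*at", text)

# 4. re.split() with a plain-string RE behaves like a fixed separator.
split_space = re.split(" ", text)

# 5. re.split() with a character set splits on a space OR a period.
split_set = re.split("[ .]", text)

print(plain)        # ['at', 'at', 'at']
print(char_set)     # ['cat', 'mat']
print(starred)      # ['cat', 'sat', 'mat']
print(split_space)  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
print(split_set)    # ['The', 'cat', 'sat', 'on', 'the', 'mat', '']
```

The last result ends with an empty string because the text ends with a separator: re.split() reports the (empty) piece that follows the final period.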

Now let's try to combine these ideas into some more useful examples.

One way of using re.findall() is just to test whether a pattern occurs in the string:
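A minimal sketch of such a test follows. The function name mentions_slm_root and the sample sentences are hypothetical, chosen for illustration; the pattern is the S-L-M RE from earlier in these notes.

```python
import re

# The S-L-M pattern from earlier in these notes.
SLM_PATTERN = "[sS][a-z]*l[a-z]*m"

def mentions_slm_root(text):
    """Hypothetical helper: return True if the pattern occurs in text."""
    # re.findall() returns a list of matches; it is empty if none are found.
    if re.findall(SLM_PATTERN, text):
        return True
    else:
        return False

print(mentions_slm_root("They greeted us with shalom."))  # True
print(mentions_slm_root("They greeted us with hello."))   # False
```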

This particular code relies on the fact that, in Python, the empty list acts like False in a conditional. This is a very “Pythonic” way of writing this test.