COMP 200 Elements of Computer Science &
COMP 130 Elements of Algorithms and Computation
Spring 2012

More with Regular Expressions

In Python

In the previous finger exercises, we used re.search to search a text string to see whether any part of it matched a given regular expression. Today, we'll focus on other functions in the Python re library.

First, let's import the re module before any examples.

Finding all matches

With re.search, we found out if the text had a match to our pattern. As briefly noted before, it actually returned a Match Object, which we could use to get more information, such as exactly what in the text was matched. But, it would only find the first match. With re.findall, we can see all the matches.

We could use this to search through a large corpus to find the pieces of text that we want to study. The following are a couple of small examples.
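As a minimal sketch of the difference (the sample sentence and the pattern are made up for illustration, not taken from the original examples), compare what re.search and re.findall report for the same text:

```python
import re

text = "The cat sat on a bat's hat."

# re.search stops at the first match and returns a Match Object (or None).
first = re.search(r"[cb]at", text)
first_match = first.group()

# re.findall returns a list of all non-overlapping matches, as strings.
all_matches = re.findall(r"[cb]at", text)

print(first_match)  # cat
print(all_matches)  # ['cat', 'bat']
```

Note that "sat" and "hat" are not reported, since the pattern only allows "c" or "b" as the first letter.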

Let's practice with re.search and re.findall. Can you create regular expressions that find what you want?

Find a regular expression that matches a digit. Test your answer, e.g., we want re.search(____,"There are 8 dogs here.") to be successful. Below are four different answers, longest first:
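The four patterns in this sketch are plausible answers ordered from longest to shortest; they are an assumption, not necessarily the answers given in class. Each one should find the "8" in the test sentence:

```python
import re

text = "There are 8 dogs here."

# Four ways to say "a single digit", longest first (assumed answers).
digit_patterns = [
    r"0|1|2|3|4|5|6|7|8|9",  # explicit alternation
    r"[0123456789]",         # character class listing every digit
    r"[0-9]",                # character class using a range
    r"\d",                   # predefined digit class
]

found = [re.search(pattern, text).group() for pattern in digit_patterns]
print(found)  # ['8', '8', '8', '8']
```

One caveat: on Python 3 strings, \d also matches digits from other scripts, so it is slightly broader than [0-9].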

Find a regular expression that matches a U.S. Social Security number, i.e., nine digits with a dash after the third and fifth. Test it using re.search or re.findall. Below are two different answers. Before looking at the second, look up the {m} notation in the Python documentation for regular expressions.
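Here is one possible pair of answers, sketched under the assumption that any digit is allowed in every position. The sample text and the numbers in it are invented for the test:

```python
import re

# First answer: spell out each digit position.
ssn_long = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"

# Second answer: {m} means "exactly m repetitions of the preceding item".
ssn_short = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

text = "Alice: 123-45-6789, Bob: 987-65-4321"

long_hits = re.findall(ssn_long, text)
short_hits = re.findall(ssn_short, text)

print(long_hits)   # ['123-45-6789', '987-65-4321']
print(short_hits)  # ['123-45-6789', '987-65-4321']
```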

Matching the entire text

What if we were using these previous answers to check whether an input string is a syntactically correct Social Security number? re.search(_____,"123-45-6789") would succeed, as desired, but so would re.search(_____,"0123-45-67890")! That's because re.search looks for the pattern to occur anywhere in the text string. If we want to do this sort of verification, it's better to turn to re.match.

The function re.match looks for a match only at the beginning of the string. A dollar sign at the end of the pattern matches the end of the string. Together, these ensure that we only succeed if the pattern matches the entire string, rather than simply occurring somewhere in the string. Like re.search, re.match returns a Match Object on success and None otherwise, so its result can be tested directly in a condition.
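A minimal sketch of this kind of verification follows. The names SSN_PATTERN and is_ssn are hypothetical, introduced only for this illustration:

```python
import re

# The {m}-style pattern with a trailing $ to require the end of the string.
SSN_PATTERN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}$"

def is_ssn(text):
    """Hypothetical helper: True if text is syntactically a Social Security number."""
    # re.match anchors at the start; the $ in the pattern anchors at the end.
    return re.match(SSN_PATTERN, text) is not None

print(is_ssn("123-45-6789"))    # True
print(is_ssn("0123-45-67890"))  # False: extra digits at both ends
print(is_ssn("123-45-67890"))   # False: extra digit at the end

# For contrast, an unanchored search still finds a match inside the bad input.
inner = re.search(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "0123-45-67890").group()
print(inner)  # 123-45-6789
```

One subtlety: $ also matches just before a newline at the very end of the string, so "123-45-6789\n" would be accepted by this sketch.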

As another exercise, write a regular expression that matches what is syntactically a word, i.e., it has one or more lower-case letters, but nothing else. Use re.match to test your answer. Below are two answers. Before looking at the second, look up the + notation in the Python documentation for regular expressions.
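Two possible answers are sketched here; they are assumptions rather than the official solutions. The helper name is_lower_word is hypothetical:

```python
import re

# First answer: one lower-case letter, then zero or more further ones.
word_star = r"[a-z][a-z]*$"

# Second answer: + means "one or more repetitions of the preceding item".
word_plus = r"[a-z]+$"

def is_lower_word(text, pattern=word_plus):
    """Hypothetical helper: True if text consists only of lower-case letters."""
    return re.match(pattern, text) is not None

for candidate in ["hello", "Hello", "hello world", ""]:
    print(repr(candidate), is_lower_word(candidate, word_star), is_lower_word(candidate))
```

Only "hello" passes both tests. The empty string fails because both patterns require at least one letter.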

As a variation, what about a word with the first letter possibly capitalized?
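One way to approach this variation, again only a sketch with an assumed pattern, is to widen the character class for the first letter and leave the rest unchanged:

```python
import re

# The first letter may be upper- or lower-case; the rest must be lower-case.
capitalized_word = r"[A-Za-z][a-z]*$"

results = {
    candidate: re.match(capitalized_word, candidate) is not None
    for candidate in ["This", "this", "THIS", "tHis"]
}
print(results)  # {'This': True, 'this': True, 'THIS': False, 'tHis': False}
```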

After class, ponder how to generalize these previous two exercises to search for any syntactically-valid word. Ideally, we would want to allow for words such as "This", "can't", "fo'c's'le" (a common nautical contraction), "O'Rourke", "OK", "pince-nez", "'n'" (contraction for “and”), and "1970s".

Splitting the text into pieces

When reading in a large text, we want to break it apart into a list of words. For example, we might start with "This is a small example." and we want to get ["This","is","a","small","example"]. This is exactly what re.split is for.

For a first example, let's assume words are simply delimited by single spaces.
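A minimal sketch of this first version, using a single space as the delimiter pattern:

```python
import re

text = "This is a small example."

# The first argument is the delimiter pattern, the second is the text to split.
words = re.split(" ", text)
print(words)  # ['This', 'is', 'a', 'small', 'example.']
```

Notice that the period is still attached to the last word, since only spaces are treated as delimiters here.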

Modify that so that it can be any amount of whitespace between words. Look back at the previous finger exercises about whitespace.
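One possible modification, sketched with the \s whitespace class and the + repetition operator (the sample text is made up to include several kinds of whitespace):

```python
import re

text = "This  is\ta small\nexample."

# \s matches any whitespace character; + allows one or more in a row.
words = re.split(r"\s+", text)
print(words)  # ['This', 'is', 'a', 'small', 'example.']
```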

Now, what if we also want to consider the following punctuation: period, comma, semicolon, colon, exclamation mark, and question mark?
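A sketch of one answer puts whitespace and the listed punctuation marks into a single character class. The function name split_words is hypothetical, and dropping the empty strings is a design choice of this sketch, not something the exercise requires:

```python
import re

# Inside [...], characters such as . and ? are literal, so no escaping is needed.
DELIMITERS = r"[\s.,;:!?]+"

def split_words(text):
    """Hypothetical helper: split on whitespace and basic punctuation."""
    pieces = re.split(DELIMITERS, text)
    # A delimiter at the start or end of the text leaves an empty string; drop it.
    return [piece for piece in pieces if piece != ""]

raw = re.split(DELIMITERS, "This is a small example.")
print(raw)  # ['This', 'is', 'a', 'small', 'example', '']
print(split_words("This is a small example."))  # ['This', 'is', 'a', 'small', 'example']
print(split_words("Wait, what? Yes: this; works!"))  # ['Wait', 'what', 'Yes', 'this', 'works']
```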

Again, for after class, ponder how to deal with the complicated cases. Is “'” a quotation mark (punctuation) or an apostrophe (part of a word)? Is “-” a dash (punctuation) or a hyphen (part of a word)?

Additional optional readings about Python regular expressions

Regular expression searching on the web

The British National Corpus is one site that allows you to search using regular expressions. In its box labeled “Look up:” you can put a regular expression. Just be sure to surround the regular expression with curly braces: { regexp } . While the search displays at most 50 “hits”, it also reports how many hits it found.

As a simple example, searching for { house } finds lots (49424) of instances of the word “house” or “House”. It matches the latter, since its searches are case-insensitive. It does not find words such as “houses” or “household”, so this behavior is more similar to re.match than re.search.

For each of the following problems, find an appropriate regular expression. Test whether your answer has the expected number of hits.

To fully understand these examples, see the British National Corpus' regular expression syntax.

Some related linguistic resources