Practice with Regular Expressions

Two common uses of `re.findall`

As a reminder, we previously saw two basic uses of re.findall.

First, as its name implies, we can use it to find all of the matches of a pattern within a string.
```
import re

# Prints ['you'], the only word here with two adjacent vowels.
print(re.findall("[a-z]*[aeiou][aeiou][a-z]*", "Can you help me?", re.IGNORECASE))
```

Second, we can use it to find whether there were any matches of a pattern within a string.

```
import re

# A non-empty list of matches counts as true, an empty list as false.
if re.findall("[a-z]*[aeiou][aeiou][a-z]*", "Can you help me?", re.IGNORECASE):
    print("Found match!")
else:
    print("No match.")
```

When we only care about the second case, we can often simplify the RE. In the previous example, the RE looks for any word containing two adjacent vowels. If we don't need to see what the matches were, we can equivalently look for just the two adjacent vowels, since any two adjacent vowels must lie within some word. This leads us to the following simpler RE.

```
import re

# We only test whether any match exists, so the surrounding word is not needed.
if re.findall("[aeiou][aeiou]", "Can you help me?", re.IGNORECASE):
    print("Found match!")
else:
    print("No match.")
```

It shouldn't be too surprising that matching with a simpler RE is faster. In general, it can be tough to say that one RE is simpler than another, but it should be clear that "[aeiou][aeiou]" is simpler than "[a-z]*[aeiou][aeiou][a-z]*", because it lacks the outer "[a-z]*" parts. As a rule of thumb, a RE will typically be faster if it doesn't start with a repeated pattern like "[a-z]*".
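
To make the comparison concrete, here is a minimal sketch that runs both patterns on the sample sentence and then times them with the standard `timeit` module. The names `WORD_RE`, `VOWELS_RE`, and `time_pattern`, and the repetition count, are our own choices for illustration. The actual timings will vary from machine to machine, so none are shown.

```python
import re
import timeit

TEXT = "Can you help me?"
WORD_RE = "[a-z]*[aeiou][aeiou][a-z]*"   # a whole word containing two adjacent vowels
VOWELS_RE = "[aeiou][aeiou]"             # just the two adjacent vowels

# The matches differ, but both lists are non-empty, so an "if" test agrees.
print(re.findall(WORD_RE, TEXT, re.IGNORECASE))    # ['you']
print(re.findall(VOWELS_RE, TEXT, re.IGNORECASE))  # ['ou']

def time_pattern(pattern, repetitions=10000):
    """Return the seconds taken to run re.findall `repetitions` times (hypothetical helper)."""
    return timeit.timeit(
        lambda: re.findall(pattern, TEXT, re.IGNORECASE),
        number=repetitions)

print(time_pattern(WORD_RE))
print(time_pattern(VOWELS_RE))
```

On a short string like this one the difference is small. It tends to matter more when the same RE is applied to a lot of text.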

Exercises

For each of the following, create a RE for searching for the specified pattern. Test your RE with both re.findall and an appropriate sample text string. Try both styles of matching described above.

A primary color, i.e., red, green, or blue
A Rice course number: four capital letters, a space, and then a three-digit number
A non-negative integer, such as 13, 6723, or 0
Any integer, such as 13, 6723, 0, or -3785
A number with a decimal point, such as 13.35, 6723.17, 0.05, or -3785.3
Your name, including common misspellings
Valid Roman numerals — For simplicity, limit yourself to just “x”, “v”, and “i”.

You might find it useful to first write down example Roman numerals as a guide.
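
When testing your REs, a small helper can save some typing. The following is only a sketch: the function name `try_pattern` is our own invention, and it is demonstrated with the two-adjacent-vowels RE from above rather than with an answer to any exercise.

```python
import re

def try_pattern(pattern, text):
    """Try both styles of matching: show the matches, then report whether any exist."""
    matches = re.findall(pattern, text)
    print(matches)
    if matches:
        print("Found match!")
    else:
        print("No match.")
    return matches

try_pattern("[aeiou][aeiou]", "Can you help me?")   # one match: 'ou'
try_pattern("[aeiou][aeiou]", "Not this string")    # no matches
```

Remember to try each of your REs on sample text that should match and on sample text that should not.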

One more common use of `re.findall`

Often, rather than looking for a pattern in one string, we are looking for a pattern in a list of strings.

Write a function grep(pattern, texts), where pattern is a RE string, and texts is a list of strings. It prints all the strings in texts in which the pattern appears.

Why the funny function name? grep is the traditional name in Unix/Linux for this operation. (Well, technically, grep traditionally takes a filename. It then considers each “line” of text to be a string.) The name comes from the command g/re/p in the ed text editor, which stands for “global regular expression print”.
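
If you get stuck, here is one possible sketch of such a function. Try it yourself first. The helper `matching_texts` is our own addition, not part of the assignment. It separates finding the matching strings from printing them, which makes the result easier to check.

```python
import re

def matching_texts(pattern, texts):
    """Return the strings in texts in which pattern appears (hypothetical helper)."""
    return [text for text in texts if re.findall(pattern, text)]

def grep(pattern, texts):
    """Print each string in texts in which pattern appears."""
    for text in matching_texts(pattern, texts):
        print(text)

# Prints "Can you help me?" and "Good day", each on its own line.
grep("[aeiou][aeiou]", ["Can you help me?", "Not this", "Good day"])
```

Note that this uses `re.findall` in the second style described earlier: only whether a match exists matters, not what the matches are.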

We'll later see that the British National Corpus search, for one, has the same basic form.

Where We're Headed

We've covered what you can do with the RE basics. So far, we've stayed clear of most of the more confusing points, although it's entirely possible that you've accidentally stumbled into some of them. Next, we'll introduce a few of the more subtle points that are very useful. As a side benefit, the ideas we'll introduce are applicable beyond just REs.

We'll use the British National Corpus search, which provides a realistic use for REs. This database has been used substantially in computational linguistics, including research in language education and natural language processing. While it is not the only such database, it provides a simple RE search suitable for our purposes.

We'll return to the word-splitting problem that was a prime motivation for introducing REs.

Then we'll return to our text analysis issues, once we can accurately distinguish words from punctuation.

COMP 200: Elements of Computer Science Spring 2013
