COMP 200: Elements of Computer Science
Spring 2013

Practice with Regular Expressions

Two common uses of re.findall

As a reminder, we previously saw two basic uses of re.findall.

When we are only interested in whether the pattern appears (the second case), we can often simplify the REs. In the previous example, our RE essentially says we are looking for any word containing two adjacent vowels. But, as long as we don't need to see what those matches were, we can equivalently look for just the two adjacent vowels. Obviously, if there are two adjacent vowels, they will be within a word. This leads us to the following simpler RE.

It shouldn't be too surprising that matching with a simpler RE is faster. In general, it can be tough to say that one RE is simpler than another, but it should be clear that "[aeiou][aeiou]" is simpler than "[a-z]*[aeiou][aeiou][a-z]*", because it lacks the outer "[a-z]*" parts. As one rough rule of thumb, a RE will tend to be faster if it doesn't start with a repeated pattern like "[a-z]*".
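To illustrate the difference, here is a quick sketch comparing the two REs with re.findall (the sample text is made up):

```python
import re

text = "the rain in spain stays mainly on the plain"

# The full-word RE returns the entire words containing two adjacent vowels.
words = re.findall("[a-z]*[aeiou][aeiou][a-z]*", text)
print(words)   # ['rain', 'spain', 'mainly', 'plain']

# The simpler RE finds just the adjacent vowel pairs themselves.
pairs = re.findall("[aeiou][aeiou]", text)
print(pairs)   # ['ai', 'ai', 'ai', 'ai']
```

Both REs find matches in exactly the same places, so if all we care about is whether (or how often) the pattern occurred, the simpler RE does the job.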

Exercises

For each of the following, create a RE for searching for the specified pattern. Test your RE using re.findall on an appropriate sample text string. Try both styles of matching described above.

One more common use of re.findall

Often, rather than looking for a pattern in one string, we are looking for a pattern in a list of strings.

Write a function grep(pattern, texts), where pattern is a RE string, and texts is a list of strings. It prints all the strings in texts in which the pattern appears.

Why the funny function name? grep is the traditional name in Unix/Linux for this operation. (Well, technically, grep traditionally takes a filename. It then considers each “line” of text to be a string.) The name “grep” comes from the g/re/p command in the ed text editor, and stands for “global regular expression print”.
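One possible sketch of such a grep function, using re.search to test each string (the sample texts here are made up):

```python
import re

def grep(pattern, texts):
    """Print each string in texts in which pattern appears."""
    for text in texts:
        # re.search returns a match object if pattern appears
        # anywhere in text, or None if it does not.
        if re.search(pattern, text):
            print(text)

grep("[aeiou][aeiou]", ["rain in spain", "dry sky", "a good book"])
# Prints:
#   rain in spain
#   a good book
```

Note that re.search is a natural fit here: we only need to know whether the pattern appears in each string, not what the matches were.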

We'll later see that the British National Corpus search, for one, has the same basic form.

Where We're Headed

We've covered what you can do with the RE basics. So far, we've stayed clear of most of the more confusing points, although it's entirely possible that you've accidentally stumbled into some of them. Next, we'll introduce a few of the more subtle points that are very useful. As a side benefit, the ideas we'll introduce are applicable beyond just REs.

We'll use the British National Corpus search. This provides a very realistic use for REs. This database has been used substantially in computational linguistics, including research in language education and natural language processing. While not the only such database, it provides a simple RE search suitable for our purposes.

We'll return to the word-splitting problem that was a prime motivation for introducing REs.

Then we'll return to our text analysis issues, once we can accurately distinguish words from punctuation.