Practice with Regular Expressions
Two common uses of re.findall
As a reminder, we previously
saw two basic uses of re.findall
.
-
First, as its name implies, we can use it to find all of the matches of a pattern within a string.
import re print re.findall("[a-z]*[aeiou][aeiou][a-z]*", "Can you help me?", re.IGNORECASE)
-
Second, we can use it to find whether there were any matches of a pattern within a string.
import re if re.findall("[a-z]*[aeiou][aeiou][a-z]*", "Can you help me?", re.IGNORECASE): print "Found match!" else: print "No match."
When we are interested in the second case, we can often simplify the REs. In the previous example, our RE basically says we are looking for any words with two adjacent vowels. But, as long as we're not wanting to see what those matches were, we can equivalently look for just two adjacent vowels. Obviously, if there are two adjacent vowels, they will be within a word. But, this leads us to the following simpler RE.
-
import re if re.findall("[aeiou][aeiou]", "Can you help me?", re.IGNORECASE): print "Found match!" else: print "No match."
It shouldn't be too surprising that matching with a simpler RE is faster.
In general, it can be tough to say one RE is simpler than other, but it
should be clear that "[aeiou][aeiou]"
is simpler than
"[a-z]*[aeiou][aeiou][a-z]*"
, because it lacks the outer
"[a-z]*"
parts.
As one simple rule of thumb, a RE will be faster if it doesn't start
with a repeated pattern like "[a-z]*"
.
Exercises
For each of the following, create a RE for searching for the specified
pattern. Test your RE with both re.findall
and an
appropriate sample text string.
Try both styles of matching described above.
-
A primary color, i.e., red, green, or blue
-
A Rice course number — four capital letters, a space, and then a three digit number
-
A non-negative integer, such as 13, 6723, or 0
-
Any integer, such as 13, 6723, 0, or -3785
-
A number with a decimal point, such as 13.35, 6723.17, 0.05, or -3785.3
-
Your name, including common misspellings
-
Valid Roman numerals — For simplicity, limit yourself to just “x”, “v”, and “i”.
You might find it useful to first write down example Roman numerals as a guide.
One more common use of re.findall
Often, rather than looking for a pattern in one string, we are looking for a pattern in a list of strings.
Write a function grep(pattern, texts)
,
where pattern
is a RE string, and texts
is a list of strings. It prints all the strings in texts
in which the pattern
appears.
Why the funny function name? grep
is the traditional name in
Unix/Linux for this operation. (Well, technically, grep
traditionally takes a filename. It then considers each “line”
of text to be a string.) “grep”, originally a command
in the Vi text editor, stands for
“global regular expression print”.
We'll later see that the British National Corpus search, for one, has the same basic form.
Where We're Headed
We've covered what you can do with the RE basics. So far, we've stayed clear of most of the more confusing points, although it's entirely possible that you've accidentally stumbled into some of them. Next, we'll introduce a few of the more subtle points that are very useful. As a side benefit, the ideas we'll introduce are applicable beyond just REs.
We'll use the British National Corpus search. This provides a very realistic use for the REs. This database has been used substantially in computational linguistics, including research in language education and natural language processing. While not the only such database, it provides a simple RE search suitable for our purposes.
We'll return to the word-splitting problem that was a prime motivation for introducing REs.
Then we'll return to our text analysis issues, once we can accurately distinguish words from punctuation.