Python Regular Expressions “Finger Exercises”
In class, we searched in a text for a “fixed” pattern,
i.e., a single string. E.g., we searched in "Can I help you?" for the string "help".
Often, however, we want to search for a more general pattern.
We will start to explore regular expressions in these
finger exercises, and follow this up in class.
Consider looking for the Semitic root “S–L–M”, as found in words such as “Islam”, “Muslim”, “salam”, “shalom”, “Solomon”, and “Suleiman”.
To search for this, we want to specify that the pattern is "s" or "S", followed by any number of lower-case letters, followed by "l", followed by any number of lower-case letters, followed by "m".
This is captured by the regular expression "[sS][a-z]*l[a-z]*m".
Let's break this down into its parts to understand it:
- [sS]: Represents any one of the letters in the square brackets, i.e., either a lower- or upper-case "s".
- [a-z]: Represents any one of the letters in the range in the square brackets, i.e., any lower-case letter.
- *: Repeat the previous part of the pattern any number of times, including zero. Thus, [a-z]* means any sequence of lower-case letters, possibly an empty one.
- l: Represents the given letter.
- [a-z]*: Again, any sequence of lower-case letters.
- m: Again, represents the given letter.
Regular Expressions in Python
How do we use regular expressions in Python code?
All of the examples are going to use the re package,
so first let's import that:
-
import re
In these finger exercises, we will focus on the function re.search. It is mostly straightforward:
its first argument is a regular expression to search for, and its
second argument is the text to search in.
-
re.search("help","Can you help me?")
The result of this is a bit odd, however. It isn't True,
but something called a Match Object. A Match Object can be used
in various ways, including to find where the pattern matched
in the text, but we'll consider a very simple way of using it.
-
if re.search("help","Can you help me?"):
    # Whatever you want to do if there is a match.
    print("It matched.")
else:
    # Whatever you want to do if there is not a match.
    print("It didn't match.")
Let's see one more example like this.
-
re.search("help","Help!")
When there is no match, re.search doesn't return False. Instead, it returns None. That's a
kind of weird value that the interactive Python prompt doesn't even display.
Again, an easy way to use this is the following.
-
if re.search("help","Help!"):
    # Whatever you want to do if there is a match.
    print("It matched.")
else:
    # Whatever you want to do if there is not a match.
    print("It didn't match.")
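To make the two possible results less mysterious, here is a small sketch that looks inside them. The variable names found and missing are our own; group, start, and end are methods that every Match Object provides.

```python
import re

# A successful search returns a Match Object describing what was found and where.
found = re.search("help", "Can you help me?")
print(found.group())   # the text that matched: help
print(found.start())   # index where the match begins: 8
print(found.end())     # index just past the end of the match: 12

# An unsuccessful search returns None, which counts as false in an if.
missing = re.search("help", "Help!")
print(missing is None)  # True
```

This is why the if-else above works: a Match Object counts as true, and None counts as false.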
In the following examples, we'll just show the re.search call. As in the above examples, it might be more useful to put the
code within an if-else. Try each of these.
-
re.search("[Hh]elp","Help!")
-
re.search("help","Help!",re.IGNORECASE)
-
re.search("help","It's snowing.")
-
re.search("[sS][a-z]*l[a-z]*m","King Solomon's mines")
-
re.search("s[a-z]*l[a-z]*m","King Solomon's mines",re.IGNORECASE)
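As a rough check of the “S–L–M” pattern, the sketch below runs both versions of it over the words from the introduction, plus two extra test words of our own ("SALAM" and "Sunday"). The helper matched_text is a hypothetical name introduced only for this illustration.

```python
import re

WORDS = ["Islam", "Muslim", "salam", "shalom", "Solomon", "Suleiman",
         "SALAM", "Sunday"]

def matched_text(pattern, text, flags=0):
    """Return the part of text that matched pattern, or None if nothing matched."""
    m = re.search(pattern, text, flags)
    return m.group() if m else None

for word in WORDS:
    explicit = matched_text("[sS][a-z]*l[a-z]*m", word)
    ignoring_case = matched_text("s[a-z]*l[a-z]*m", word, re.IGNORECASE)
    print(word, explicit, ignoring_case)
```

Two things are worth noticing in the output. The match is only part of the word (for "Solomon" it is "Solom"), because re.search looks for the pattern anywhere in the text. And the two versions are not quite equivalent: with re.IGNORECASE, [a-z] also accepts upper-case letters, so "SALAM" matches the second pattern but not the first.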
Another thing you can do with regular expressions is to specify
that you want to match any one of multiple options. Here, let's
see if the text contains either "green", "red", or "blue".
-
re.search("(green|red|blue)","I'm wearing a green shirt.")
-
re.search("(green|red|blue)","I'm wearing a red shirt.")
-
re.search("(green|red|blue)","I'm wearing a black shirt.")
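If we also want to know which of the options was found, one possible approach is to ask the Match Object for the matched text. The function name find_color is our own invention for this sketch.

```python
import re

def find_color(text):
    """Return the first of green, red, or blue found in text, or None."""
    m = re.search("(green|red|blue)", text)
    return m.group() if m else None

print(find_color("I'm wearing a green shirt."))  # green
print(find_color("I'm wearing a black shirt."))  # None
print(find_color("I'm bored."))                  # red
```

The last line is a reminder that the pattern matches anywhere in the text, even inside another word: "bored" contains "red".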
One of our future uses will be to find delimiters between words.
Words can be separated by spaces and punctuation. Let's ignore
punctuation for the moment and consider just spaces. The surprising
thing is that we'll need to consider not just spaces, but space-like
things. The two most easily explained space-like characters are
“tab” (corresponding to the Tab key) and the end-of-line
“new line” (roughly corresponding to the
Enter key). These are denoted "\t" and "\n", respectively.
There are several others, denoted "\r" (carriage return), "\f" (form feed), and "\v" (vertical tab).
Together, these are known as whitespace.
So, we can see if a text string contains whitespace.
-
re.search(r"[ \t\n\r\f\v]","There is whitespace here.")
The mysterious r at the beginning of the string is not a typo! It tells Python this is a “raw” string, so that it won't treat these backslashes specially.
-
re.search(r"[ \t\n\r\f\v]","Thereisnowhitespacehere.")
-
re.search(r"\s","There is whitespace here.")
This \s is just a short-hand for the whitespace characters.
-
re.search(r"\s","Thereisnowhitespacehere.")
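The sketch below illustrates both points from the examples above: what the r prefix changes, and that the explicit list and the \s short-hand agree on some sample texts. The names EXPLICIT, SHORTHAND, and samples are ours, and the last two sample texts are made up for this illustration.

```python
import re

# Without the r, Python turns \t into a single tab character.
# With the r, the backslash and the t are kept as two separate characters.
print(len("\t"))    # 1
print(len(r"\t"))   # 2

EXPLICIT = r"[ \t\n\r\f\v]"
SHORTHAND = r"\s"

samples = [
    "There is whitespace here.",
    "Thereisnowhitespacehere.",
    "tab\there",      # contains a tab character
    "line\nbreak",    # contains a new line character
]

for text in samples:
    explicit = re.search(EXPLICIT, text) is not None
    shorthand = re.search(SHORTHAND, text) is not None
    print(repr(text), explicit, shorthand)
```

For each of these texts, the two patterns give the same answer.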
In class, we'll do a couple more examples of searching, plus
we'll see a couple more functions in the regular expression library.
To find all the matches of a pattern in a text, we'll use re.findall.
To split a text into a bunch of words separated by delimiters, we'll
use re.split.
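As a brief preview, and only a sketch of what we'll cover in class, here is one call to each function. The sample texts and the variable names colors and words are made up for this illustration.

```python
import re

# re.findall returns a list of every match, not just the first one.
colors = re.findall("green|red|blue", "A red shirt, blue jeans, and a green hat.")
print(colors)  # ['red', 'blue', 'green']

# re.split cuts the text wherever the pattern matches.
# Here the delimiter is one or more whitespace characters.
words = re.split(r"\s+", "one two\tthree\nfour")
print(words)   # ['one', 'two', 'three', 'four']
```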
Additional optional readings about Python regular expressions
- Python documentation: Regular Expression Library
- Python tutorial: Regular Expressions
- Google's Python class: Regular Expressions
- Python Course: Regular Expressions