Regular Expressions — Beyond the Basics
The following are some additional features found in most RE packages. Unfortunately, these details get rather picky. We've tried to limit you to those that are particularly necessary or useful.
Escape Sequences
We have seen that some characters in regular expressions have special meaning. One important consequence is that, when used in an RE, they don't simply match themselves. For example, if for some reason we want to find instances of the string "*|+" within a large text, we cannot simply call re.findall("*|+", text). We will have to do something special.
When placed outside square brackets, we have seen that *, +, |, [, ], (, and ) have special meaning. With the previous examples, you might have noticed that . also has special meaning. Inside the square brackets, we have seen that - and ] have special meaning. You might also have noticed that, for example, . does not have special meaning within the square brackets, and - does not have special meaning outside the square brackets. Thus, one annoying feature of REs is keeping track of two separate things: which characters have special meaning inside the square brackets and which have special meaning outside them.
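As a small sketch of the inside-versus-outside distinction, the following compares . and - in both positions. The sample text and variable names are made up for illustration.

```python
import re

text = "abc a.c a-c"  # made-up sample text

# Outside brackets, "." matches any character (except a newline).
dot_outside = re.findall(r"a.c", text)
# Inside brackets, "." is just a literal period.
dot_inside = re.findall(r"a[.]c", text)
# Outside brackets, "-" is an ordinary character.
dash_outside = re.findall(r"a-c", text)
# Inside brackets, "-" between two characters denotes a range.
dash_inside = re.findall(r"[a-c]", "a-c")

print(dot_outside)   # ['abc', 'a.c', 'a-c']
print(dot_inside)    # ['a.c']
print(dash_outside)  # ['a-c']
print(dash_inside)   # ['a', 'c']
```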
Special characters via escapes
The solution is to precede the special characters by an “escape” character. The escape character simply means that the next character should be treated specially. In this case, it means that the next character has its usual non-RE meaning, instead of its usual RE meaning. In REs, the escape character is the backslash (\).
import re
print(re.findall(r"\*\|\+", "A test string *|+ blah *| blah |+."))
Note the r at the beginning of the RE pattern string. Use it before any RE pattern with these escape sequences. For an explanation why, see the Raw strings section at the bottom of the page.
import re
print(re.findall(r"[a\-z]", "a tezt string - blah."))
The most common mistake I make in REs is to forget to escape special characters. It is very easy to forget which characters have special meanings.
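One way to guard against forgotten escapes, when the whole pattern is meant literally, is to let the library do the escaping. The sketch below uses the standard re.escape function; the variable names are only for illustration, and this is an alternative to hand-written escapes rather than something used elsewhere on this page.

```python
import re

literal = "*|+"                # the exact text we want to find
pattern = re.escape(literal)   # re.escape adds the backslashes for us
matches = re.findall(pattern, "A test string *|+ blah *| blah |+.")

print(matches)  # ['*|+']
```

This only helps when every character should match itself. If part of the pattern needs its RE meaning, you still have to escape the rest by hand.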
Non-keyboard characters via escapes
There are some standard characters that you can't type directly as a visible symbol. Instead, we can use escape sequences to represent them. These are used not only in REs, but also in ordinary strings, such as those you print.
The most important ones are \n, the “newline” character, and \t, the “tab” character. There are others.
print("This \tis\na\ttest.")
Shortcuts via escapes
REs define some additional escape sequences as convenient shortcuts.
\d represents any digit, i.e., [0-9].
\s represents any whitespace character. This includes a space, \n, and \t.
\w represents any alphanumeric character or underscore, i.e., [a-zA-Z0-9_]. That is useful as it describes the legal characters in a Python variable name.
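Here is a brief sketch of the three shortcuts in use. The sample strings are invented for illustration.

```python
import re

# \d+ : runs of digits
digits = re.findall(r"\d+", "Room 101, floor 3")
# \w+ : runs of letters, digits, and underscores
words = re.findall(r"\w+", "my_var = x1 + 42")
# \s+ : split on any run of whitespace (space, tab, newline)
pieces = re.split(r"\s+", "a b\tc\nd")

print(digits)  # ['101', '3']
print(words)   # ['my_var', 'x1', '42']
print(pieces)  # ['a', 'b', 'c', 'd']
```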
Repetition
We previously saw that * means repeat 0 or more times and + means repeat 1 or more times. But in the previous example of matching Rice course numbers, you should have come up with an RE like "[A-Z][A-Z][A-Z][A-Z] [1-9][0-9][0-9]". A more concise form is "[A-Z]{4} [1-9][0-9]{2}".
{n} means repeat n times.
{m,n} means repeat between m and n times.
{m,} means repeat m or more times.
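The three repetition forms can be tried out as follows. This is only a sketch: the course numbers and other sample strings are made up.

```python
import re

# {n}: exactly four capital letters, then a three-digit course number
courses = re.findall(r"[A-Z]{4} [1-9][0-9]{2}",
                     "COMP 140 and ELEC 220, but not COM 14 or comp 140.")
# {m,n}: two or three a's in a row (as many as possible are taken)
runs = re.findall(r"a{2,3}", "a aa aaa aaaa")
# {m,}: three or more digits
long_numbers = re.findall(r"[0-9]{3,}", "7 42 365 2024")

print(courses)       # ['COMP 140', 'ELEC 220']
print(runs)          # ['aa', 'aaa', 'aaa']
print(long_numbers)  # ['365', '2024']
```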
Write an RE for matching a Social Security number (3 digits, a dash, 2 digits, a dash, 4 digits).
Matching the beginning or end of a string or word
To satisfy some English teachers, we want to find out if a text begins with a conjunction.
This is a first attempt, but we get an extra undesired match.
import re
print(re.findall(r"And|Or", "And Stephen did that, Anderson."))
We'll introduce two new features.
^ matches only at the beginning of the text string.
\b matches any word boundary, i.e., the transition from word characters (letters, digits, and underscore, as in \w) to non-word characters, or vice versa.
Note that neither of these matches any characters of the text. Instead, each matches a position in the text that has some property.
import re
print(re.findall(r"^And\b|^Or\b", "And Stephen did that, Anderson."))
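To see that ^ and \b match positions rather than characters, here is a small sketch. The second and third sample strings are made up, and re.search and re.finditer are standard functions that aren't otherwise used on this page.

```python
import re

text = "And Stephen did that, Anderson."

# The match covers only the three letters; ^ and \b add no characters.
match = re.search(r"^And\b", text)
span = match.span()

# Each word boundary is an empty match at some position in the string.
boundaries = [b.start() for b in re.finditer(r"\b", "Hi there")]

# Digits and underscores count as word characters, so there is no
# boundary between "And" and "_1" or between "And" and "2".
strict = re.findall(r"\bAnd\b", "And And_1 And2 And.")

print(span)        # (0, 3)
print(boundaries)  # [0, 2, 3, 8]
print(strict)      # ['And', 'And']
```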
Here's a similar example for matching a text ending with a period. The $ matches the end of the string. The dot (.) is a special character in an RE, so it needs an escape.
import re
print(re.findall(r"\.$", "U.S.A. is OK."))
To find words ending in dom, again use the word boundary \b.
import re
print(re.findall(r"[A-Za-z]*dom\b", "Freedom dominates the indominable newsdom."))
OK, so I make up some weird sentences for examples.
Lookahead
Let's consider one part of what we'll need for accurate word-splitting: how to deal with the seemingly simple period. A period ends a sentence, and is not part of a word. But that's not the whole story. A period can also be part of an ellipsis, which is also punctuation. A period can also be part of an abbreviation, as in "U.S.A.". And a period can be part of a number as a decimal point, and it might be convenient to consider a number to be a word. How can we distinguish between a period being punctuation and not being punctuation?
Here's my imperfect solution. A period is punctuation if it is not immediately followed by a letter or digit, i.e., an alphanumeric character.
So, we want to be able to look ahead at the next character to decide what to do.
import re
print(re.split(r"\.+(?![A-Za-z0-9])", "The U.S.A. is, um, ..., where I, John Doe, Esq., live."))
Here, x(?!y) means “pattern x is not followed by pattern y”. This correctly matches, and thus splits, on the ellipsis and the final period. It also matches on the final period in the two abbreviations. I don't know how to distinguish those cases more accurately without an understanding of the semantics.
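The same pattern also leaves decimal points alone, since a digit follows them. A minimal sketch with a made-up sentence:

```python
import re

# The "." in 3.14 is followed by a digit, so it is not a split point.
# The period after 3.14 and the ellipsis are each followed by a space.
parts = re.split(r"\.+(?![A-Za-z0-9])", "Pi is about 3.14. Wait... really?")

print(parts)  # ['Pi is about 3.14', ' Wait', ' really?']
```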
As another example of look-ahead, what if we want to find words followed by a comma?
import re
print(re.findall(r"[A-Za-z]+(?=,)", "The U.S.A. is, um, ..., where I, John Doe, Esq., live."))
Here, x(?=y) means “pattern x is followed by pattern y”. The y part is only checked; it is not included in the match.
Look-ahead matters when you want to see the matched text itself. When you're just checking whether there is a match at all, it usually isn't needed.
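To illustrate the difference, the sketch below runs the comma example with and without look-ahead. The variable names are only for illustration.

```python
import re

text = "The U.S.A. is, um, ..., where I, John Doe, Esq., live."

# With look-ahead, the comma is checked but not part of the match.
with_lookahead = re.findall(r"[A-Za-z]+(?=,)", text)
# Without look-ahead, the comma is consumed and shows up in the match.
without_lookahead = re.findall(r"[A-Za-z]+,", text)

# If all we ask is "is there a match?", both give the same answer.
found_with = re.search(r"[A-Za-z]+(?=,)", text) is not None
found_without = re.search(r"[A-Za-z]+,", text) is not None

print(with_lookahead)     # ['is', 'um', 'I', 'Doe']
print(without_lookahead)  # ['is,', 'um,', 'I,', 'Doe,']
print(found_with == found_without)  # True
```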
Grouping
Here's an example that threw me for quite a while. Let's find all words containing ie or ei.
import re
print(re.findall(r"[a-z]*ei[a-z]*|[a-z]*ie[a-z]*", "Is Dr. Greiner a friend or fiend?", re.IGNORECASE))
That works, but let's make it more concise. We'll use parentheses for grouping.
import re
print(re.findall(r"[a-z]*(ei|ie)[a-z]*", "Is Dr. Greiner a friend or fiend?", re.IGNORECASE))
Alas, that doesn't work. In fact, it doesn't seem to make much sense, as it doesn't seem to be using the [a-z]* parts.
Grouping is very useful, so let's not give up yet.
After all, think of how much we use parentheses in arithmetic expressions.
Short answer
Use a less-intuitive syntax for grouping, (?:…), not (…).
import re
print(re.findall(r"[a-z]*(?:ei|ie)[a-z]*", "Is Dr. Greiner a friend or fiend?", re.IGNORECASE))
Optional Longer answer
Parentheses do group. However, they also capture the inner part of the match. That is, in the "[a-z]*(ei|ie)[a-z]*" example, the RE engine remembers not only the overall matches like "Greiner", but also what matched within the parentheses like "ei". This enables a bunch of useful features which we won't use. But, strangely, when the pattern contains a capturing group, re.findall reports the inner matches and not the outer ones that we actually want. Instead, (?:…) is for grouping without any capture.
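If you are curious, both the overall match and the captured part can be seen at once. This sketch uses re.finditer and match objects, which this page doesn't otherwise cover; group(0) is the whole match and group(1) is the first parenthesized group.

```python
import re

text = "Is Dr. Greiner a friend or fiend?"
pattern = r"[a-z]*(ei|ie)[a-z]*"

# Pair each overall match with what the parentheses captured.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer(pattern, text, re.IGNORECASE)]

print(pairs)  # [('Greiner', 'ei'), ('friend', 'ie'), ('fiend', 'ie')]
```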
Raw strings
Short version
Precede RE strings with an r. Just do it, even though it usually doesn't matter.
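import re
# Newer Python 3 versions warn about unrecognized escapes such as "\*" or "\ " in non-raw strings.
print(re.findall(r"\*\|\+", "A test string *|+ blah *| blah |+."))
print(re.findall("\*\|\+", "A test string *|+ blah *| blah |+."))
print(re.findall(r"\\", "A test string \ blah."))
print(re.findall("\\\\", "A test string \ blah."))
# The next call raises an error: the RE engine receives a lone backslash.
print(re.findall("\\", "A test string \ blah."))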
Optional Longer version
Python string literals understand some escape sequences. Note which strings consist of one character versus two in the following examples.
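# "\s" is not a string escape, so it stays two characters; "\b" is a single backspace character.
print(len("\n"), list("\n"))
print(len("\\"), list("\\"))
print(len("\t"), list("\t"))
print(len("\s"), list("\s"))
print(len("\b"), list("\b"))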
So, how do you represent the RE pattern that matches a backslash? Since backslash is a special character, it needs to be escaped, so "\\", right? No, we just saw that "\\" is a single backslash character. So, we need "\\\\". The string literal and the RE pattern matching will each treat the backslash specially, so you have to escape it twice!
print("\\\\" == r"\\")
For escape sequences that string literals don't treat specially, this just doesn't matter, so you're usually OK without raw strings.
print("\\*" == r"\*")
print("\*" == r"\*")
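One case where the r does matter is \b, because string literals treat "\b" as a backspace character. A small sketch, with a made-up test string:

```python
import re

# Without r, "\b" becomes a backspace character before the RE engine
# ever sees it, so the pattern looks for backspace-c-a-t-backspace.
no_raw = re.findall("\bcat\b", "cat")
# With r, the RE engine receives backslash-b and treats it as a word boundary.
with_raw = re.findall(r"\bcat\b", "cat")

print(len("\b"), len(r"\b"))  # 1 2
print(no_raw)                 # []
print(with_raw)               # ['cat']
```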