COMP 200: Elements of Computer Science
Spring 2013

Regular Expressions — Beyond the Basics

The following are some additional features available in most RE packages. Unfortunately, the details get rather picky, so we've tried to limit you to those that are particularly necessary or useful.

Escape Sequences

We have seen that some characters in regular expressions have special meaning. One important consequence is that, when used in an RE, they don't simply match themselves. For example, if for some reason we want to find instances of the string "*|+" within a large text, we cannot simply call re.findall("*|+", text). We will have to do something special.
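A quick check of that claim, as a minimal sketch (the sample text is made up for illustration). The unescaped pattern is rejected outright, because the leading * has nothing to repeat.

```python
import re

text = "The operators *|+ appear here."

try:
    re.findall("*|+", text)      # * is an RE operator, not a literal star
    outcome = "no error"
except re.error as exc:
    outcome = "re.error"
    print("Invalid pattern:", exc)

print(outcome)
```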

When placed outside square brackets, we have seen that *, +, |, [, ], (, and ) have special meaning. With the previous examples, you might have noticed that . also had special meaning. Inside the square brackets, we have seen that - and ] have special meaning. With the previous examples, you might have noticed that, for example, . does not have special meaning within the square brackets, and - does not have special meaning outside the square brackets. Thus, one annoying feature of REs is keeping track of two separate things: which characters have special meaning inside the square brackets and which have special meaning outside the square brackets.
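To see the inside/outside difference concretely, here is a small sketch on a made-up string. The r prefix on the patterns is covered under Raw strings and makes no difference here.

```python
import re

text = "a.b-c"

# Outside brackets, . matches any character (except a newline).
outside_dot = re.findall(r".", text)
# Inside brackets, . is just a period.
inside_dot = re.findall(r"[.]", text)

# Outside brackets, - is just a dash.
outside_dash = re.findall(r"b-c", text)
# Inside brackets, - builds a range: [a-c] means a, b, or c.
inside_dash = re.findall(r"[a-c]", text)

print(outside_dot)    # ['a', '.', 'b', '-', 'c']
print(inside_dot)     # ['.']
print(outside_dash)   # ['b-c']
print(inside_dash)    # ['a', 'b', 'c']
```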

Special characters via escapes

The solution is to precede each special character with an “escape” character. The escape character simply means that the next character should be treated specially. In this case, it means that the next character has its usual non-RE meaning instead of its usual RE meaning. In REs, the escape character is the backslash (\). For example, \* matches a literal asterisk.
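Here is a sketch of the "*|+" search with each special character escaped. The sample text is made up. re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

text = "Operators: a*|+b and *|+ again"

# Each special character is preceded by a backslash.
pattern = r"\*\|\+"
matches = re.findall(pattern, text)
print(matches)            # ['*|+', '*|+']

# re.escape builds the same escaped pattern from the plain string.
escaped = re.escape("*|+")
print(escaped == pattern)
```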

The most common mistake I make in REs is to forget to escape special characters. It is very easy to forget which characters have special meanings.

Non-keyboard characters via escapes

There are some standard characters that don't appear anywhere on your keyboard. Instead, we can use escape sequences to represent them. These are used not only in REs, but also in regular strings for printing.

The most important ones are \n, the “newline” character, and \t, the “tab” character. There are others.
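A brief sketch of both escapes, first in an ordinary string and then in an RE. The sample data is invented for illustration.

```python
import re

table = "name\tscore\nAda\t95\n"
print(table)                       # prints two tab-separated lines

# Each escape sequence stands for a single character.
newline_length = len("\n")         # 1

# The same escapes work inside an RE.
line_breaks = re.findall("\n", table)
fields = re.split("\t|\n", "a\tb\nc")
print(len(line_breaks))            # 2
print(fields)                      # ['a', 'b', 'c']
```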

Shortcuts via escapes

REs define some additional escape sequences as convenient shortcuts. \d represents any digit, i.e., [0-9]. \s represents any whitespace character. This includes a space, \n, and \t. \w represents any alphanumeric character or underscore, i.e., [a-zA-Z0-9_]. That is useful as it describes the legal characters in a Python variable name.
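A small sketch of the three shortcuts on a made-up line of text. One caveat that goes beyond these notes: in Python 3, \d and \w on ordinary strings also match non-ASCII digits and letters unless the re.ASCII flag is given.

```python
import re

text = "x_1 = 42;\ty2 = 7\n"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits, underscores
spaces = re.findall(r"\s", text)      # individual whitespace characters

print(digits)   # ['1', '42', '2', '7']
print(words)    # ['x_1', '42', 'y2', '7']
print(spaces)   # [' ', ' ', '\t', ' ', ' ', '\n']
```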

Repetition

We previously saw that * means repeat 0 or more times and + means repeat 1 or more times. But in the previous example of matching Rice course numbers, you should have come up with a RE like "[A-Z][A-Z][A-Z][A-Z] [1-9][0-9][0-9]". A more concise form is "[A-Z]{4} [1-9][0-9]{2}".

{n} means repeat n times. {m,n} means repeat between m and n times. {m,} means repeat m or more times.
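The three forms can be tried out as follows. This is a sketch, and the sample strings are invented.

```python
import re

# {n}: the concise course-number pattern from above.
course = r"[A-Z]{4} [1-9][0-9]{2}"
text = "I took COMP 200 and MATH 101, but not comp 140."
courses = re.findall(course, text)
print(courses)        # ['COMP 200', 'MATH 101']

runs = "a aa aaa aaaa"
# {m,n}: between 2 and 3 repetitions.
two_to_three = re.findall(r"a{2,3}", runs)
# {m,}: 2 or more repetitions.
two_or_more = re.findall(r"a{2,}", runs)
print(two_to_three)   # ['aa', 'aaa', 'aaa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa']
```

Note that a{2,3} finds only the first three letters of "aaaa"; the single a left over is too short to match.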

Write a RE for matching a social security number (3 digits, a dash, 2 digits, a dash, 4 digits).

Matching the beginning or end of a string or word

To satisfy some English teachers, we want to find out if a text begins with a conjunction.

This is a first attempt, but we get an extra undesired match.
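As a sketch of what such a first attempt might look like (the text and the short list of conjunctions are made up for illustration):

```python
import re

text = "But I like ice cream, and butter too."

# Naive attempt: look for a conjunction anywhere in the text.
first_attempt = r"And|But|Or|and|but|or"
found = re.findall(first_attempt, text)
print(found)    # ['But', 'and', 'but']
```

Only the first match is at the beginning of the text. The others come from later in the sentence, and one of them is just part of the word "butter".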

We'll introduce two new features. ^ matches only at the beginning of the text string. \b matches any word boundary, i.e., the transition from word characters (letters, digits, and underscore) to non-word characters, or vice versa. Note that neither of these matches any characters of the text, but instead matches some property of the text.
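A sketch using both features, with an illustrative list of conjunctions (And, But, Or) and made-up texts. The r prefix matters here, since \b means something else in an ordinary Python string.

```python
import re

# ^ anchors each alternative to the start of the text;
# \b requires the conjunction to end at a word boundary.
starts_with_conjunction = r"^And\b|^But\b|^Or\b"

good = re.findall(starts_with_conjunction, "But I like ice cream, and butter too.")
not_a_word = re.findall(starts_with_conjunction, "Butter is good. But not always.")
not_at_start = re.findall(starts_with_conjunction, "I like it. But not always.")

print(good)          # ['But']
print(not_a_word)    # []
print(not_at_start)  # []
```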

Here's a similar example for matching a text ending with a period. The $ matches the end of the string. The dot (.) is a special character in a RE, so it needs an escape.
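A sketch of that pattern on two invented texts, along with what happens if the escape is forgotten:

```python
import re

ends_with_period = r"\.$"

yes = re.findall(ends_with_period, "This ends properly.")
no = re.findall(ends_with_period, "Does this?")

# Without the escape, the dot matches any final character.
unescaped = re.findall(r".$", "Does this?")

print(yes)         # ['.']
print(no)          # []
print(unescaped)   # ['?']
```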

To find words ending in dom, again use the word boundary \b.
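One possible pattern for this, as a sketch. The choice of [A-Za-z]* for the start of the word and the sample sentence are assumptions.

```python
import re

text = "Freedom and wisdom are seldom found in a domain of boredom."

# Letters, then "dom", then a word boundary.
dom_words = re.findall(r"[A-Za-z]*dom\b", text)
print(dom_words)    # ['Freedom', 'wisdom', 'seldom', 'boredom']
```

The word "domain" is skipped because its "dom" is followed by another letter, so there is no word boundary there.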

Lookahead

Let's consider one part of what we'll need for accurate word-splitting: how to deal with the seemingly simple period. A period ends a sentence and is not part of a word. But that's not the whole story. A period can also be part of an ellipsis, which is also punctuation. A period can also be part of an abbreviation, as in U.S.A. A period can also be part of a number as a decimal point, and it might be convenient to consider a number to be a word. How can we distinguish between a period being punctuation and not being punctuation?

Here's my imperfect solution. A period is punctuation if it is not immediately followed by a letter or digit, i.e., an alphanumeric character.

So, we want to be able to look ahead at the next character to decide what to do.
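In Python's re module, (?!…) is a negative lookahead: it succeeds only if the enclosed pattern does not match at that point, and it consumes no characters. A sketch of the rule above, marking each punctuation period with <P> in a made-up sentence:

```python
import re

text = "Dr. Smith of the U.S.A. spent 3.14 million... or so."

# A period that is NOT immediately followed by a letter or digit.
punctuation_period = r"\.(?![A-Za-z0-9])"

marked = re.sub(punctuation_period, "<P>", text)
print(marked)
# Dr<P> Smith of the U.S.A<P> spent 3.14 million<P><P><P> or so<P>
```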

This correctly matches, and thus splits, on the ellipsis and the final period. It also matches on the final period in the two abbreviations. But I don't know how to more accurately distinguish those cases without an understanding of the semantics.

As another example of look-ahead, what if we want to find words followed by a comma?
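A sketch using the positive lookahead (?=…), which requires the enclosed pattern to match next but leaves it out of the result. The sentence is made up.

```python
import re

text = "Apples, pears, and plums are fruit, mostly."

# Words that are followed by a comma; the comma itself is not matched.
with_lookahead = re.findall(r"[A-Za-z]+(?=,)", text)

# For comparison: without lookahead, the comma is part of each match.
without_lookahead = re.findall(r"[A-Za-z]+,", text)

print(with_lookahead)      # ['Apples', 'pears', 'fruit']
print(without_lookahead)   # ['Apples,', 'pears,', 'fruit,']
```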

Look-ahead isn't useful when you only need to know whether there is a match, rather than what text was actually matched.

Grouping

Here's an example that threw me for quite a while. Let's find all words containing ie or ei.
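A sketch of a straightforward version that spells out both cases in full. The sample sentence and the use of [A-Za-z] for letters are assumptions.

```python
import re

text = "Their friend Greiner received a piece of pie."

# Two full alternatives: a word containing ie, or a word containing ei.
verbose = r"[A-Za-z]*ie[A-Za-z]*|[A-Za-z]*ei[A-Za-z]*"
ie_ei_words = re.findall(verbose, text)
print(ie_ei_words)
# ['Their', 'friend', 'Greiner', 'received', 'piece', 'pie']
```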

That works, but let's make it more concise. We'll use parentheses for grouping.
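A sketch of that more concise attempt, on an illustrative sentence:

```python
import re

text = "Their friend Greiner received a piece of pie."

# Ordinary parentheses around the two alternatives.
grouped = r"[A-Za-z]*(ei|ie)[A-Za-z]*"
result = re.findall(grouped, text)
print(result)    # ['ei', 'ie', 'ei', 'ei', 'ie', 'ie']
```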

Alas, that doesn't work. In fact, it doesn't seem to make much sense as it doesn't seem to be using the [a-z]* parts. Grouping is very useful, so let's not give up yet. After all, think of how much we use parentheses in arithmetic expressions.

Short answer

Use a less-intuitive syntax for grouping, (?:…), not (…).

Optional Longer answer

Parentheses do group. However, they also capture the inner part of the match. That is, in the "[a-z]*(ei|ie)[a-z]*" example, it remembers both the overall matches like "Greiner" and what matched within the parentheses like "ei". This enables a bunch of useful features which we won't use. But, strangely, re.findall reports the inner matches and not the outer ones that we actually want.
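Both pieces of information are available from a match object, as this sketch shows. re.finditer, group(0) for the whole match, and group(1) for the first parenthesized part are standard features of the re module. The sentence is made up.

```python
import re

text = "Their friend Greiner received a piece of pie."
pattern = r"[A-Za-z]*(ei|ie)[A-Za-z]*"

# group(0) is the overall match; group(1) is what the parentheses captured.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]
for whole, inner in pairs:
    print(whole, inner)
```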

Instead, (?:…) is for grouping without any capture.
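With the non-capturing form, re.findall reports the whole words. This is a sketch on an illustrative sentence.

```python
import re

text = "Their friend Greiner received a piece of pie."

# (?:...) groups the alternatives without capturing them.
non_capturing = r"[A-Za-z]*(?:ei|ie)[A-Za-z]*"
whole_words = re.findall(non_capturing, text)
print(whole_words)
# ['Their', 'friend', 'Greiner', 'received', 'piece', 'pie']
```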

Raw strings

Short version

Precede RE strings with an r. Just do it, even though it usually doesn't matter.

Optional Longer version

Python string literals understand some escape sequences. Note which strings consist of one character versus two in the following examples.
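A sketch of such a comparison, using len to count the characters in a few illustrative literals:

```python
# Ordinary literals: the escape sequence becomes one character.
# Raw literals (r prefix): the backslash stays as its own character.
samples = ["\n", r"\n", "\t", "\\", r"\\"]
lengths = [len(s) for s in samples]
print(lengths)    # [1, 2, 1, 1, 2]
```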

So, how do you represent the RE pattern that matches a backslash? Since backslash is a special character, it needs to be escaped, so "\\", right? No, we just saw that "\\" is a single backslash character. So, we need "\\\\". The string literal and the RE pattern matching will each treat the backslash specially, so you have to escape it twice!
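A sketch comparing the doubly escaped pattern with its raw-string equivalent. The file path is made up.

```python
import re

path = "C:\\temp\\notes.txt"    # the actual string holds two backslashes

plain = "\\\\"    # four typed characters, two in the string
raw = r"\\"       # two typed characters, two in the string

same_pattern = (plain == raw)
found_plain = re.findall(plain, path)
found_raw = re.findall(raw, path)

print(same_pattern)       # True
print(len(found_plain))   # 2
print(len(found_raw))     # 2
```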

For escape sequences that string literals don't treat specially, this just doesn't matter, so you're usually OK without raw strings.
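One case where it does matter is \b. In an ordinary Python string literal, \b is the backspace character, so the RE never sees a word-boundary request. A sketch:

```python
import re

word = "freedom"

# "\b" here is a single backspace character, which the word doesn't contain.
without_raw = re.findall("dom\b", word)
# r"\b" keeps the backslash, so the RE sees a word boundary.
with_raw = re.findall(r"dom\b", word)

print(without_raw)   # []
print(with_raw)      # ['dom']
```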