Python Regular Expressions “Finger Exercises”

In class, we searched in a text for a “fixed” pattern, i.e., a single string. E.g., we searched in "Can I help you?" for the string "help". Often, however, we want to search for a more general pattern. We will start to explore regular expressions in these finger exercises, and follow this up in class.

Consider looking for the Semitic root “S–L–M”, as found in words such as “Islam”, “Muslim”, “salam”, “shalom”, “Solomon”, and “Suleiman”.

To search for this, we want to specify that the pattern is "s" or "S", followed by any number of letters (possibly none), followed by "l", followed by any number of letters (possibly none), followed by "m". This is captured by the regular expression "[sS][a-z]*l[a-z]*m". Let's break this down into its parts to understand it:

[sS] — Represents any one of the letters in the square brackets. I.e., either lower- or upper-case s.
[a-z] — Represents any one of the letters in the range in the square brackets. I.e., any lower-case letter.
* — Repeat the previous part of the pattern any number of times. Thus, [a-z]* means any sequence of lower-case letters.
l — Represents the given letter.
[a-z]*, again: any sequence of lower-case letters, possibly empty.
m — Again, represents the given letter.

Thus, this represents the desired pattern.

Regular Expressions in Python

How do we use regular expressions in Python code? All of the examples are going to use the re module, so first let's import that:

```
import re
```

In these finger exercises, we will focus on the function re.search. It is mostly straightforward: its first argument is a regular expression to search for, and its second argument is the text to search in.

re.search("help","Can you help me?")

The result of this is a bit odd, however. It isn't True, but something called a Match Object. A Match Object can be used in various ways, including to find where the pattern matched in the text, but we'll consider a very simple way of using it: as the condition of an if, where it counts as true.

if re.search("help","Can you help me?"):
    # Whatever you want to do if there is a match.
    print("It matched.")
else:
    # Whatever you want to do if there is not a match.
    print("It didn't match.")

Let's see one more example like this.

```
re.search("help","Help!")
```

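When there is no match, re.search doesn't return False. Instead, it returns None. That is a special value meaning “nothing”, and the interactive interpreter displays nothing at all when an expression evaluates to it. Since None counts as false in an if, an easy way to use this is, again, the following.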

if re.search("help","Help!"):
    # Whatever you want to do if there is a match.
    print("It matched.")
else:
    # Whatever you want to do if there is not a match.
    print("It didn't match.")
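To make the two possible results concrete, here is a minimal sketch that stores what re.search returns and inspects it. The variable names found and missing are only illustrative; group, start, and end are standard Match Object methods.

```python
import re

# A successful search returns a Match Object.
found = re.search("help", "Can you help me?")
print(found.group())   # the text that matched
print(found.start())   # index where the match begins
print(found.end())     # index just past the end of the match

# An unsuccessful search returns None, which counts as false in an if.
missing = re.search("help", "Help!")
print(missing is None)
```

Calling a method such as group on None raises an error, which is one reason the if-else form above is the safer habit.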

In the following examples, we'll just show the re.search call. As in the above examples, it might be more useful to put the code within an if-else. Try each of these.

```
re.search("[Hh]elp","Help!")
```

re.search("help","Help!",re.IGNORECASE)

```
re.search("help","It's snowing.")
```

re.search("[sS][a-z]*l[a-z]*m","King Solomon's mines")

re.search("s[a-z]*l[a-z]*m","King Solomon's mines",re.IGNORECASE)
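The S–L–M pattern can also be tried against the example words listed at the start. The following is a small sketch; the list words (with "help" added as a word that should fail) and the list matched are names chosen just for this illustration.

```python
import re

# The S-L-M pattern from the start of these exercises.
pattern = "[sS][a-z]*l[a-z]*m"
words = ["Islam", "Muslim", "salam", "shalom", "Solomon", "Suleiman", "help"]

matched = []
for word in words:
    if re.search(pattern, word):
        # re.search looks anywhere in the text, so "Islam" matches via "slam".
        matched.append(word)

print(matched)
```

Every word except "help" should end up in matched, because re.search does not require the pattern to begin at the first character of the text.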

Another thing you can do with regular expressions is to specify that you want to match any one of multiple options. Here, let's see if the text contains either "green", "red", or "blue".

re.search("(green|red|blue)","I'm wearing a green shirt.")

re.search("(green|red|blue)","I'm wearing a red shirt.")

re.search("(green|red|blue)","I'm wearing a black shirt.")
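When a pattern offers several options, it is often useful to know which one was found. A hedged sketch, assuming the three sentences above are collected in a list called sentences (a name introduced here), uses the Match Object's group method to report the matching color:

```python
import re

sentences = [
    "I'm wearing a green shirt.",
    "I'm wearing a red shirt.",
    "I'm wearing a black shirt.",
]

colors = []
for sentence in sentences:
    result = re.search("(green|red|blue)", sentence)
    if result:
        colors.append(result.group())   # the option that actually matched
    else:
        colors.append(None)             # none of the options appeared

print(colors)
```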

One of our future uses will be to find delimiters between words. Words can be separated by spaces and punctuation. Let's ignore punctuation for the moment and consider just spaces. The surprising thing is that we'll need to consider not just spaces, but space-like things. The two most easily explained space-like characters are “tab” (corresponding to the Tab key) and the end-of-line “new line” (roughly corresponding to the Enter key). These are denoted "\t" and "\n", respectively. There are several others, denoted "\r" (carriage return), "\f" (form feed), and "\v" (vertical tab). Together, these are known as whitespace.

So, we can see if a text string contains whitespace.

```
re.search(r"[ \t\n\r\f\v]","There is whitespace here.")
```
The mysterious r at the beginning of the string is not a typo! It tells Python this is a “raw” string, so that Python leaves the backslashes alone and passes them on unchanged to the re module, which interprets them itself.

re.search(r"[ \t\n\r\f\v]","Thereisnowhitespacehere.")
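The effect of the r prefix can be checked directly. This is a minimal sketch; the variables text, plain, and raw are illustrative names, not part of the re module.

```python
import re

# Without the r, Python turns backslash-t into one tab character.
print(len("\t"))    # 1
# With the r, the backslash and the t are kept as two characters.
print(len(r"\t"))   # 2

# The re module understands the two-character form itself,
# so both spellings find the tab in this text.
text = "tab\there"
plain = re.search("[\t]", text)
raw = re.search(r"[\t]", text)
print(plain.start())
print(raw.start())
```

For these particular characters both spellings behave the same, but raw strings avoid surprises with other backslash codes, so using r for every pattern is a reasonable habit.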

```
re.search(r"\s","There is whitespace here.")
```
This \s is just shorthand for any one whitespace character, like the bracketed list in the previous examples.

re.search(r"\s","Thereisnowhitespacehere.")
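To see that \s covers more than the space character, the following sketch tries it on texts containing a space, a tab, a new line, and no whitespace at all. The list names samples and has_space are made up for this example.

```python
import re

samples = ["two words", "two\twords", "two\nwords", "twowords"]

has_space = []
for sample in samples:
    if re.search(r"\s", sample):
        has_space.append(True)
    else:
        has_space.append(False)

print(has_space)
```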

In class, we'll do a couple more examples of searching, plus we'll see a couple more functions in the regular expression library. To find all the matches of a pattern in a text, we'll use re.findall. To split a text into a bunch of words separated by delimiters, we'll use re.split.
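As a small preview, and only a sketch of what the in-class examples might look like, the two functions can be tried on short texts. The sample strings and variable names here are invented for illustration.

```python
import re

# re.split cuts the text wherever the pattern matches.
text = "The quick\tbrown fox\njumps"
words = re.split(r"\s", text)
print(words)

# re.findall returns a list of every match, in order.
colors = re.findall("(green|red|blue)", "red fish, blue fish, green eggs")
print(colors)
```

Both functions take the pattern first and the text second, just like re.search.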

All these example regular expressions are relatively simple. We can combine these features in many ways, such as nesting the choice "(green|red|blue)" and repetition "*" operations like this: "(((green)*|(red|blue)*)*|black)". Within our application of text analysis, there is little point in such fancier combinations.

Additional optional readings about Python regular expressions

Python documentation: Regular Expression Library
Python tutorial: Regular Expressions
Google's Python class: Regular Expressions
Python Course: Regular Expressions

COMP 200 Elements of Computer Science & COMP 130 Elements of Algorithms and Computation Spring 2012
