More with Regular Expressions

In Python

In the previous finger exercises, we used re.search to search a text string to see whether any part of it matched a given regular expression. Today, we'll focus on other functions in the Python re library.

First, let's import the re module before any examples.

```
import re
```

Finding all matches

With re.search, we found out if the text had a match to our pattern. As briefly noted before, it actually returned a Match Object, which we could use to get more information, such as exactly what in the text was matched. But, it would only find the first match. With re.findall, we can see all the matches.

We could use this to search through a large corpus to find pieces of text that we want to study. The following are a couple of small examples.

We could search for all uses of color words.

re.findall("green|red|blue","I'm wearing a green shirt and red pants.")

We could search for all uses of the S-L-M word stem.

re.findall("[sS][a-z]*l[a-z]*m","Suleiman was a Muslim.  Solomon says shalom!")
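To see what these two calls return, here is a small runnable sketch. It uses only the patterns and texts shown above. Note that re.findall returns the matched substrings, which need not be whole words, so the S-L-M pattern picks up pieces such as "slim" from inside "Muslim".

```python
import re

# Alternation: each match is one of the three color words.
colors = re.findall("green|red|blue", "I'm wearing a green shirt and red pants.")
print(colors)  # ['green', 'red']

# The S-L-M pattern is not tied to word boundaries, and it must end at an "m",
# so each match is the part of a word running from an s/S through the last
# reachable "m".
slm = re.findall("[sS][a-z]*l[a-z]*m", "Suleiman was a Muslim.  Solomon says shalom!")
print(slm)  # ['Suleim', 'slim', 'Solom', 'shalom']
```

If whole words are wanted instead, the pattern would need to say so explicitly, for example by also allowing letters after the final m.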

As before, once you have figured out Match Objects, using them gives you more information. While re.findall gives you back a list of the matching strings, re.finditer gives you Match Objects.
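As a minimal sketch of the difference, the following loop runs re.finditer over the color example and asks each Match Object for its text and position. The variable names are only for illustration.

```python
import re

text = "I'm wearing a green shirt and red pants."

# finditer yields one Match Object per match, in order of appearance.
found = []
for m in re.finditer("green|red|blue", text):
    # group() is the matched text; start() and end() are its position in text.
    print(m.group(), m.start(), m.end())
    found.append((m.group(), m.start(), m.end()))

# The positions can be used to slice the original text.
for word, start, end in found:
    assert text[start:end] == word
```

The matched strings are the same ones re.findall would return; the Match Objects add where each one occurred.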

Let's practice with re.search and re.findall. Can you create regular expressions that find what you want?

Find a regular expression that matches a digit. Test your answer, e.g., we want re.search(____,"There are 8 dogs here.") to be successful. Below are four different answers, longest first:

```
"0|1|2|3|4|5|6|7|8|9"
```
```
"[0123456789]"
```
```
"[0-9]"
```
```
r"\d"
```
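One way to check the four answers is to run them all against the same test sentence, as in this short sketch. The second text, which contains no digits, is an added assumption used to show the failure case.

```python
import re

digit_patterns = ["0|1|2|3|4|5|6|7|8|9", "[0123456789]", "[0-9]", r"\d"]
text = "There are 8 dogs here."
no_digits = "There are eight dogs here."  # made-up text without digits

for pattern in digit_patterns:
    hit = re.search(pattern, text)
    miss = re.search(pattern, no_digits)
    # hit is a Match Object; miss is None because nothing matches.
    print(pattern, "->", hit.group(), miss)

# Note: for str patterns in Python 3, \d also matches decimal digits from
# other scripts unless the re.ASCII flag is given; the first three
# patterns match only 0 through 9.
```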

Find a regular expression that matches a U.S. Social Security number, i.e., nine digits with a dash after the third and fifth. Test it using re.search or re.findall. Below are two different answers. Before looking at the second, look up the {m} notation in the Python documentation for regular expressions.

r"\d\d\d-\d\d-\d\d\d\d"

r"\d{3}-\d{2}-\d{4}"
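The two answers can be compared directly with re.findall. The sample text and the numbers in it are made up for this sketch.

```python
import re

ssn_long = r"\d\d\d-\d\d-\d\d\d\d"
ssn_short = r"\d{3}-\d{2}-\d{4}"  # {m} means "exactly m repetitions"

text = "Made-up numbers: 123-45-6789 and 987-65-4321."

long_hits = re.findall(ssn_long, text)
short_hits = re.findall(ssn_short, text)
print(long_hits)
print(short_hits)
```

Both patterns describe the same set of strings, so the two lists are identical.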

Matching the entire text

What if we were using these previous answers to check whether an input string is a syntactically correct Social Security number? re.search(_____,"123-45-6789") would succeed, as desired, but so would re.search(_____,"0123-45-67890")! That's because re.search looks for the pattern to occur anywhere in the text string. If we want to do this sort of verification, it's better to turn to re.match.

def isSevenDigitPhoneNumber(string):
    """Returns whether the input string is a seven digit phone number in standard U.S. format."""
    if re.match(r"\d{3}-\d{4}$",string):
        return True
    else:
        return False

The function re.match looks for a match at the beginning of the string. The dollar sign in the pattern matches the end of the string. Together, these ensure that we only succeed if the pattern matches the entire string, rather than simply occurring somewhere in it. This function returns a Boolean, rather than the Match Object or None that re.match returns.
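The following sketch tries the idea on a few inputs. The function here is a hypothetical variant written with re.fullmatch (available since Python 3.4) rather than re.match plus $, and the test strings are invented. It also contrasts re.search and re.match on the over-long Social Security example.

```python
import re

def is_seven_digit_phone_number(string):
    """Hypothetical variant of the function above, written with re.fullmatch."""
    # fullmatch succeeds only if the pattern covers the whole string,
    # so no trailing $ is needed.
    return re.fullmatch(r"\d{3}-\d{4}", string) is not None

for candidate in ["555-1234", "5555-1234", "555-12345", "call 555-1234"]:
    print(repr(candidate), is_seven_digit_phone_number(candidate))

ssn = r"\d{3}-\d{2}-\d{4}"
too_long = "0123-45-67890"
print(re.search(ssn, too_long) is not None)       # True: found in the middle
print(re.match(ssn + "$", too_long) is not None)  # False: must cover the whole string

# One subtlety: $ also matches just before a trailing newline.
print(re.match(r"\d{3}-\d{4}$", "555-1234\n") is not None)  # True
print(is_seven_digit_phone_number("555-1234\n"))             # False
```

Whether the trailing-newline case matters depends on where the input comes from; for lines read from a file it often does.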

As another exercise, write a regular expression that matches what is syntactically a word, i.e., it has one or more lower-case letters, but nothing else. Use re.match to test your answer. Below are two answers. Before looking at the second, look up the + notation in the Python documentation for regular expressions.

```
"[a-z][a-z]*$"
```
```
"[a-z]+$"
```

As a variation, what about a word with the first letter possibly capitalized?

```
"[a-zA-Z][a-z]*$"
```
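A quick way to test these answers is to run them over a handful of candidate strings. The candidates below are chosen for this sketch and are not part of the exercise.

```python
import re

lower_word_v1 = "[a-z][a-z]*$"
lower_word_v2 = "[a-z]+$"          # + means "one or more"
capitalized_ok = "[a-zA-Z][a-z]*$"

candidates = ["word", "Word", "WORD", "can't", ""]

results = {}
for c in candidates:
    results[c] = (
        bool(re.match(lower_word_v1, c)),
        bool(re.match(lower_word_v2, c)),
        bool(re.match(capitalized_ok, c)),
    )
    print(repr(c), results[c])
```

Only "word" passes the lower-case patterns, and only "word" and "Word" pass the capitalized variation; the apostrophe in "can't" defeats all three, which is the issue the next paragraph raises.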

After class, ponder how to generalize these previous two exercises to search for any syntactically-valid word. Ideally, we would want to allow for words such as "This", "can't", "fo'c's'le" (a common nautical contraction), "O'Rourke", "OK", "pince-nez", "'n'" (contraction for “and”), and "1970s".

Splitting the text into pieces

When reading in a large text, we want to break it apart into a list of words. For example, we might start with "This is a small example." and we want to get ["This","is","a","small","example"]. This is exactly what re.split is for.

For a first example, let's assume words are simply delimited by single spaces.

re.split(" ","This is a small example without punctuation")

Modify that so that it can be any amount of whitespace between words. Look back at the previous finger exercises about whitespace.

re.split(r"\s+","This is a small \n\r example \t   without     punctuation")
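To see why the change matters, this sketch applies both delimiters to the same messy text. Splitting on a single space leaves empty strings and stray whitespace characters in the result, while splitting on a run of whitespace does not.

```python
import re

text = "This is a small \n\r example \t   without     punctuation"

# Each single space is a delimiter, so consecutive spaces produce empty
# strings, and "\n\r" and "\t" survive as "words".
by_space = re.split(" ", text)
print(by_space)

# \s+ treats any run of whitespace as one delimiter.
by_whitespace = re.split(r"\s+", text)
print(by_whitespace)
# ['This', 'is', 'a', 'small', 'example', 'without', 'punctuation']
```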

Now, what if we also want to consider the following punctuation: period, comma, semicolon, colon, exclamation mark, and question mark?

re.split(r"[\s.,:;!?]+","This is a very, very, small \n\r example \t   with     punctuation!")
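Running this example shows one detail worth knowing: because the text ends with a delimiter (the exclamation mark), re.split returns an empty string as the last piece. The filtering step below is one possible way to drop it, not the only one.

```python
import re

text = "This is a very, very, small \n\r example \t   with     punctuation!"

pieces = re.split(r"[\s.,:;!?]+", text)
print(pieces)
# ['This', 'is', 'a', 'very', 'very', 'small', 'example', 'with', 'punctuation', '']

# Keep only the non-empty pieces.
words = [p for p in pieces if p]
print(words)
```

The same thing happens at the front of the list if the text begins with whitespace or punctuation.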

Again, for after class, ponder how to deal with the complicated cases. Is “'” a quotation mark (punctuation) or an apostrophe (part of a word)? Is “-” a dash (punctuation) or a hyphen (part of a word)?

Additional optional readings about Python regular expressions

Python documentation: Regular Expression Library
Python tutorial: Regular Expressions
Google's Python class: Regular Expressions

Regular expression searching on the web

The British National Corpus is one site that allows you to search using regular expressions. In its box labeled “Look up:” you can put a regular expression. Just be sure to surround the regular expression with curly braces: { regexp } . While the search displays at most 50 “hits”, it also reports how many hits it found.

As a simple example, searching for { house } finds lots (49424) of instances of the word “house” or “House”. It matches the latter, since its searches are case-insensitive. It does not find words such as “houses” or “household”, so this behavior is closer to re.match with a trailing $ (matching the whole word) than to re.search.

For each of the following problems, find an appropriate regular expression. Test whether your answer has the expected number of hits.

Search for all words beginning with “house”, such as “house”, “houses”, “household”, and “house-husband”. In other words, it must have zero or more characters after house. (70398 hits)
```
house.*
```
Search for all words containing “house”, but which are not just “house”. For example, “houses”, “warehouse”, and “warehouses”. In other words, it must have something before or after house. (31463 hits)
```
.*house.+ | .+house.*
```
Search for all hyphenated words containing “house”, such as “hen-house”, “house-party”, and “household-related”. (2279 hits)
```
.*-.*house.* | .*house.*-.*
```

To fully understand these examples, see the British National Corpus' regular expression syntax.
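For comparison, the same three searches can be imitated in Python. This is only a rough analogue under stated assumptions: the word list is made up, whole-word matching is simulated with re.fullmatch, case-insensitivity with re.IGNORECASE, and the spaces around | are dropped because Python would treat them as literal characters. The corpus site's own syntax may differ in other ways.

```python
import re

# Made-up word list standing in for the corpus.
words = ["house", "House", "houses", "household", "house-husband",
         "warehouse", "warehouses", "hen-house", "house-party",
         "household-related", "home"]

def whole_word_hits(pattern):
    """Return the words that the pattern matches in full, ignoring case."""
    return [w for w in words if re.fullmatch(pattern, w, re.IGNORECASE)]

begins_with = whole_word_hits(r"house.*")
contains_not_just = whole_word_hits(r".*house.+|.+house.*")
hyphenated = whole_word_hits(r".*-.*house.*|.*house.*-.*")

print(begins_with)
print(contains_not_just)
print(hyphenated)
```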

Some related linguistic resources

Regex Dictionary — English dictionary supporting searching by regular expressions and other features. Includes interesting example searches.
Natural Language Toolkit — Python tools for natural language processing and text analysis
Wikipedia: Text corpus — includes links to some similar corpora
Computational Resources for Linguistic Research

COMP 200 Elements of Computer Science & COMP 130 Elements of Algorithms and Computation Spring 2012
