More with Regular Expressions
In Python
In the previous finger exercises, we used re.search to search a text string to see whether any part of it matched a given regular expression. Today, we'll focus on other functions in the Python re library.
First, let's import the re module before any examples.
-
import re
Finding all matches
With re.search, we found out whether the text had a match to our pattern. As briefly noted before, it actually returned a Match Object, which we could use to get more information, such as exactly what in the text was matched. But it would only find the first match. With re.findall, we can see all the matches.
We could use this to search through a large corpus to find pieces of text that we want to study. The following are a couple of small examples.
-
We could search for all uses of color words.
re.findall("green|red|blue","I'm wearing a green shirt and red pants.")
-
We could search for all uses of the S-L-M word stem.
re.findall("[sS][a-z]*l[a-z]*m","Suleiman was a Muslim. Solomon says shalom!")
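To see the difference between the two functions concretely, the following sketch runs both on the color sentence above. The variable names are ours, chosen only for illustration.

```python
import re

text = "I'm wearing a green shirt and red pants."
pattern = "green|red|blue"

# re.search stops at the first match and returns a Match Object (or None).
first = re.search(pattern, text)
first_word = first.group()   # the text that was actually matched

# re.findall returns a list of every non-overlapping match, left to right.
all_words = re.findall(pattern, text)

print(first_word)   # green
print(all_words)    # ['green', 'red']
```

Note that re.findall gives back plain strings rather than Match Objects. If you also need the position of each match, re.finditer yields one Match Object per match.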
Let's practice with re.search and re.findall. Can you create regular expressions that find what you want?
Find a regular expression that matches a digit. Test your answer, e.g., we want re.search(____,"There are 8 dogs here.") to be successful.
Below are four different answers, longest first:
-
"0|1|2|3|4|5|6|7|8|9"
-
"[0123456789]"
-
"[0-9]"
-
r"\d"
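One way to test the four answers is to run them all on the same sentence, as in the sketch below. It also shows one place where the answers are not quite interchangeable: for ordinary Python 3 strings, \d accepts digits from other scripts, while [0-9] does not. The Arabic-Indic digit is our own test case, not part of the exercise.

```python
import re

text = "There are 8 dogs here."
answers = ["0|1|2|3|4|5|6|7|8|9", "[0123456789]", "[0-9]", r"\d"]

# Each answer should find the same single digit in the test sentence.
found = [re.search(answer, text).group() for answer in answers]
print(found)   # ['8', '8', '8', '8']

# \d matches any Unicode decimal digit; [0-9] matches only ASCII digits.
arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE
unicode_digit_matches_d = bool(re.search(r"\d", arabic_three))
unicode_digit_matches_range = bool(re.search("[0-9]", arabic_three))
print(unicode_digit_matches_d)       # True
print(unicode_digit_matches_range)   # False
```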
Find a regular expression that matches a U.S. Social Security number, i.e., nine digits with a dash after the third and fifth. Test it using re.search or re.findall.
Below are two different answers. Before looking at the second, look up the {m} notation in the Python documentation for regular expressions.
-
r"\d\d\d-\d\d-\d\d\d\d"
-
r"\d{3}-\d{2}-\d{4}"
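A quick check that the two answers behave the same is to run both through re.findall on one piece of text. The sample text below is made up for this sketch; the third number in it has its dashes in the wrong places and should be skipped.

```python
import re

# Made-up sample text: two well-formed numbers and one malformed one.
text = "Records list 123-45-6789 and 987-65-4321, but not 12-345-6789."

long_form = r"\d\d\d-\d\d-\d\d\d\d"
short_form = r"\d{3}-\d{2}-\d{4}"   # {m} means "exactly m repetitions"

long_hits = re.findall(long_form, text)
short_hits = re.findall(short_form, text)

print(long_hits)    # ['123-45-6789', '987-65-4321']
print(short_hits)   # ['123-45-6789', '987-65-4321']
```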
Matching the entire text
What if we were using these previous answers to check whether an input string is a syntactically correct Social Security number? re.search(_____,"123-45-6789") would succeed, as desired, but so would re.search(_____,"0123-45-67890")! That's because re.search looks for the pattern to occur anywhere in the text string. If we want to do this sort of verification, it's better to turn to re.match.
-
def isSevenDigitPhoneNumber(string):
    """Returns whether the input string is a seven digit phone number in standard U.S. format."""
    if re.match(r"\d{3}-\d{4}$", string):
        return True
    else:
        return False
The function re.match looks for a match at the beginning of string. The dollar sign in the pattern is to match the end of the string. Together, these ensure that we only succeed if the pattern matches the entire string, rather than it being simply somewhere in string.
This function returns a Boolean, rather than the Match Object or None that re.match returns.
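The same idea applies to the Social Security number exercise. The sketch below compares re.search with re.match plus a dollar sign, and also tries re.fullmatch, a related function in the re library (Python 3.4 and later) that was not covered above. We include it because the dollar sign has one subtlety: it also matches just before a newline at the very end of the string.

```python
import re

ssn = r"\d{3}-\d{2}-\d{4}"

# re.search accepts the pattern anywhere, so extra digits slip through.
search_accepts_long = bool(re.search(ssn, "0123-45-67890"))

# re.match anchors the start of the string; "$" anchors the end.
match_accepts_good = bool(re.match(ssn + "$", "123-45-6789"))
match_accepts_long = bool(re.match(ssn + "$", "0123-45-67890"))

# "$" also succeeds just before a trailing newline.
match_accepts_newline = bool(re.match(ssn + "$", "123-45-6789\n"))

# re.fullmatch requires the pattern to cover the whole string.
fullmatch_accepts_good = bool(re.fullmatch(ssn, "123-45-6789"))
fullmatch_accepts_newline = bool(re.fullmatch(ssn, "123-45-6789\n"))

print(search_accepts_long)         # True
print(match_accepts_good)          # True
print(match_accepts_long)          # False
print(match_accepts_newline)       # True
print(fullmatch_accepts_good)      # True
print(fullmatch_accepts_newline)   # False
```

Whether the trailing-newline case matters depends on where the input comes from; lines read from a file often still carry their newline.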
As another exercise, write a regular expression that matches what is syntactically a word, i.e., it has one or more lower-case letters, but nothing else. Use re.match to test your answer.
Below are two answers. Before looking at the second, look up the + notation in the Python documentation for regular expressions.
-
"[a-z][a-z]*$"
-
"[a-z]+$"
As a variation, what about a word with the first letter possibly capitalized?
-
"[a-zA-Z][a-z]*$"
After class, ponder how to generalize these previous two exercises to search for any syntactically valid word. Ideally, we would want to allow for words such as "This", "can't", "fo'c's'le" (a common nautical contraction), "O'Rourke", "OK", "pince-nez", "'n'" (contraction for “and”), and "1970s".
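Before generalizing, it can help to check what the two word patterns so far accept and reject. The sketch below is only a test harness, not an answer to the question above; the candidate strings are our own picks.

```python
import re

lowercase_word = "[a-z]+$"
capitalizable_word = "[a-zA-Z][a-z]*$"

candidates = ["word", "Word", "WORD", "two words", "can't", ""]

# Record, for each candidate, whether each pattern matches the whole string.
lowercase_results = {c: bool(re.match(lowercase_word, c)) for c in candidates}
capitalizable_results = {c: bool(re.match(capitalizable_word, c)) for c in candidates}

for c in candidates:
    print(repr(c), lowercase_results[c], capitalizable_results[c])
```

Only "word" passes the first pattern, and only "word" and "Word" pass the second. Strings such as "can't" and "WORD" are rejected by both, which is exactly the gap the exercise asks you to think about.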
Splitting the text into pieces
When reading in a large text, we want to break it apart into a list of words. For example, we might start with "This is a small example." and we want to get ["This","is","a","small","example"]. This is exactly what re.split is for.
For a first example, let's assume words are simply delimited by single spaces.
-
re.split(" ","This is a small example without punctuation")
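The single-space assumption is fragile. As a small illustration, the sketch below splits one string with single spaces and another (our own variation) with a doubled space. Each extra space produces an empty string in the result.

```python
import re

single_spaced = re.split(" ", "This is a small example without punctuation")
print(single_spaced)
# ['This', 'is', 'a', 'small', 'example', 'without', 'punctuation']

# With two spaces in a row, the "word" between them is empty.
double_spaced = re.split(" ", "This is  a small example")
print(double_spaced)
# ['This', 'is', '', 'a', 'small', 'example']
```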
Modify that so that it can be any amount of whitespace between words. Look back at the previous finger exercises about whitespace.
-
re.split(r"\s+","This is a small \n\r example \t without punctuation")
Now, what if we also want to consider the following punctuation: period, comma, semicolon, colon, exclamation mark, and question mark?
-
re.split(r"[\s.,:;!?]+","This is a very, very, small \n\r example \t with punctuation!")
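One detail worth noticing in this last example: because the text ends with a delimiter (the exclamation mark), re.split returns an empty string as the final piece. The sketch below shows the raw result and one simple way to drop empty pieces; the filtering step is our addition.

```python
import re

text = "This is a very, very, small \n\r example \t with punctuation!"

pieces = re.split(r"[\s.,:;!?]+", text)
print(pieces)
# ['This', 'is', 'a', 'very', 'very', 'small', 'example', 'with', 'punctuation', '']

# Keep only the non-empty pieces.
words = [piece for piece in pieces if piece]
print(words)
# ['This', 'is', 'a', 'very', 'very', 'small', 'example', 'with', 'punctuation']
```

The same thing happens at the front of the list if the text begins with a delimiter.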
Again, for after class, ponder how to deal with the complicated cases. Is “'” a quotation mark (punctuation) or an apostrophe (part of a word)? Is “-” a dash (punctuation) or a hyphen (part of a word)?
Additional optional readings about Python regular expressions
- Python documentation: Regular Expression Library
- Python tutorial: Regular Expressions
- Google's Python class: Regular Expressions
Regular expression searching on the web
The British National Corpus is one site that allows you to search using regular expressions. In its box labeled “Look up:” you can put a regular expression. Just be sure to surround the regular expression with curly braces: { regexp } . While the search displays at most 50 “hits”, it also reports how many hits it found.
As a simple example, searching for { house } finds lots (49424) of instances of the word “house” or “House”. It matches the latter, since its searches are case-insensitive. It does not find words such as “houses” or “household”, so this behavior is more similar to re.match than re.search.
For each of the following problems, find an appropriate regular expression. Test whether your answer has the expected number of hits.
-
Search for all words beginning with “house”, such as “house”, “houses”, “household”, and “house-husband”. In other words, it must have zero or more characters after house. (70398 hits)
house.*
-
Search for all words containing “house”, but which are not just “house”. For example, “houses”, “warehouse”, and “warehouses”. In other words, it must have something before or after house. (31463 hits)
.*house.+ | .+house.*
-
Search for all hyphenated words containing “house”, such as “hen-house”, “house-party”, and “household-related”. (2279 hits)
.*-.*house.* | .*house.*-.*
To fully understand these examples, see the British National Corpus' regular expression syntax.
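If you would like to try the same three patterns offline, here is a rough Python analogue. It is only a sketch under our own assumptions: the corpus is replaced by a tiny hypothetical word list, each word is tested whole and case-insensitively with re.fullmatch, and the spaces around | are dropped because Python would treat them as literal characters. The corpus site's syntax is not Python's, so do not expect the two to agree in every detail.

```python
import re

# Hypothetical word list standing in for the corpus.
words = ["house", "House", "houses", "household", "warehouse",
         "warehouses", "hen-house", "house-party", "home"]

def whole_word_hits(pattern):
    """Return the words that the pattern matches in full, ignoring case."""
    return [w for w in words if re.fullmatch(pattern, w, re.IGNORECASE)]

begins_with = whole_word_hits(r"house.*")
contains_but_not_only = whole_word_hits(r".*house.+|.+house.*")
hyphenated = whole_word_hits(r".*-.*house.*|.*house.*-.*")

print(begins_with)
# ['house', 'House', 'houses', 'household', 'house-party']
print(contains_but_not_only)
# ['houses', 'household', 'warehouse', 'warehouses', 'hen-house', 'house-party']
print(hyphenated)
# ['hen-house', 'house-party']
```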
Some related linguistic resources
- Regex Dictionary — English dictionary supporting searching by regular expressions and other features. Includes interesting example searches.
- Natural Language Toolkit — Python tools for natural language processing and text analysis
- Wikipedia: Text corpus — includes links to some similar corpora
- Computational Resources for Linguistic Research