First, a couple quick notes on previous material. Originally, our suggested function for building a dictionary of word counts was not fast enough in CodeSkulptor to handle large text files. That was due to a CodeSkulptor bug that has now been fixed. Also, urllib2.urlopen() currently only works on files that we provide, not any URL on the web.
Python Regular Expression Basics
Using methods such as string.count(), string.find(), and string.index(), it is easy to search a text for a “fixed” pattern, i.e., a single string. For example, we can search for the string "help" with the call my_string.index("help"). Similarly, it is easy to split a text on a fixed separator with string.split().
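As a minimal sketch of these fixed-pattern methods, the snippet below applies them to a made-up sample string. The string and the variable names are assumptions for illustration only. print is called with parentheses and a single argument so that the snippet runs the same way under Python 2 and Python 3.

```python
# Fixed-pattern searching and splitting with built-in string methods.
my_string = "I need help. Can you help me?"   # made-up sample text

help_count = my_string.count("help")      # non-overlapping occurrences
first_index = my_string.find("help")      # index of the first occurrence
missing_index = my_string.find("hello")   # find() returns -1 on a miss
words = my_string.split(" ")              # split on a fixed one-space separator

print(help_count)      # 2
print(first_index)     # 7
print(missing_index)   # -1
print(words)           # ['I', 'need', 'help.', 'Can', 'you', 'help', 'me?']

# index() behaves like find(), except that it raises ValueError on a miss.
try:
    my_string.index("hello")
except ValueError:
    print("index() raised ValueError")
```

Note that the fixed separator leaves punctuation attached to the words ("help." and "me?"), which is one reason a more general pattern is useful.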
Often, however, we want to search or split using a more general pattern. We can describe many kinds of patterns with regular expressions (REs). Regular expressions are used a lot by computational linguists. We will also use them to search the British National Corpus.
Consider looking for the Semitic root “S–L–M”, as found in words such as “Islam”, “Muslim”, “salam”, “shalom”, “Solomon”, and “Suleiman”. To search for this root in a piece of text, we want to specify that the pattern is "s" or "S", followed by any number of lower-case letters, followed by "l", followed by any number of lower-case letters, followed by "m". This is captured by the RE "[sS][a-z]*l[a-z]*m".
Let's break this down into parts to understand it:
- [sS]: Represents any one of the letters in the square brackets, i.e., either lower- or upper-case "s".
- [a-z]: Represents any one of the letters in the range in the square brackets, i.e., any lower-case letter.
- *: Repeat the previous part of the pattern any number of times. Thus, [a-z]* means any sequence of zero or more lower-case letters.
- l: Represents the given letter.
- [a-z]*, again.
- m: Again, represents the given letter.
Thus, this represents the desired pattern. As you can see, REs are like a separate mini-language within Python. (Most programming languages these days have similar functions for matching with REs, and have similar syntax for REs.)
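As a rough illustration, the sketch below runs this RE over a made-up sentence using re.findall(), which is introduced next. The sample sentence and variable names are assumptions chosen for the example. Note that the result holds only the portion of each word that the pattern matched, so "Islam" yields "slam" and "Solomon" yields "Solom".

```python
import re

# Pattern from the text: "s" or "S", then any lower-case letters,
# then "l", then any lower-case letters, then "m".
slm_pattern = "[sS][a-z]*l[a-z]*m"

# Made-up sample sentence containing the example words.
sample_text = "Islam, Muslim, salam, shalom, Solomon, and Suleiman"

matches = re.findall(slm_pattern, sample_text)
print(matches)
# ['slam', 'slim', 'salam', 'shalom', 'Solom', 'Suleim']
```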
We will introduce two functions: re.findall() for searching for RE patterns and re.split() for splitting based upon RE patterns. We will start to explore REs in these examples, and follow up with more later. A lot of features available in regular expression libraries are downright confusing, even to experts. We're going to limit ourselves to the commonly-used basics, plus what we specifically need for our example of accurately splitting text into words.
Regular Expressions in Python
How do we use regular expressions (REs) in Python code? We will illustrate re.findall() and re.split() on REs of increasing complexity. For now, we will focus on the core ideas in REs. For each example, think about what you expect the result to be, then run the code and think about what the results actually are.
- A RE can be a simple fixed string.

      import re
      print re.findall("help", "Can you help me?")
      print re.split("help", "Can you help me?")

  Really this is a simple example of combining four one-character patterns via adjacency, i.e., putting them next to each other.
- A RE can be one of a selection of individual characters.

      import re
      print re.findall("[hl]", "Can you help me?")
      print re.split("[ep]", "Can you help me?")
      print re.split("([ep])", "Can you help me?")

  Note what the extra level of parentheses does in re.split(). Inside a RE, parentheses will usually just mean grouping, like in an arithmetic expression, but they have a special meaning on the outside of a RE when using re.split().

  Here are some additional examples, also illustrating a range of characters. Note that the hyphen (-) has special meaning inside the square brackets.

      import re
      print re.findall("[l-p]", "Can you help me?")
      print re.split("[l-p]", "Can you help me?")
      print re.split("([l-p])", "Can you help me?")
      print re.findall("[ ?A-Z]", "Can you help me?")
      print re.split("[ ?A-Z]", "Can you help me?")
      print re.split("([ ?A-Z])", "Can you help me?")

  Note that re.split()'s returned list can include empty strings. Here, the split list starts and ends with the empty string because we split on the first and last characters.

- A RE can check for multiple patterns. A vertical bar represents “or”, and it joins two patterns.

      import re
      print re.findall("help|[A-Z]", "Can you help me?")
      print re.split("help|[A-Z]", "Can you help me?")
- A RE can check for repeated patterns. An asterisk means zero or more repetitions of the preceding pattern, while a plus means one or more.

      import re
      print re.findall("[aeiou]*", "Can you help me?")
      print re.split("[aeiou]*", "Can you help me?")
      print re.findall("[aeiou]+", "Can you help me?")
      print re.split("[aeiou]+", "Can you help me?")
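To make the ideas in this list concrete without giving away the exercises, here is a small sketch on a different, made-up sentence. The sentence and the variable names are assumptions for illustration. The comments show the lists each call returns.

```python
import re

sample = "Do we meet at noon?"   # made-up sample sentence

# Splitting on a character class drops the separators. The list starts
# and ends with '' because the first and last characters are separators.
plain_split = re.split("[ ?A-Z]", sample)
# ['', 'o', 'we', 'meet', 'at', 'noon', '']

# Wrapping the whole RE in parentheses keeps the separators in the list.
keeping_split = re.split("([ ?A-Z])", sample)
# ['', 'D', 'o', ' ', 'we', ' ', 'meet', ' ', 'at', ' ', 'noon', '?', '']

# "|" accepts either alternative; matches are reported left to right.
either = re.findall("meet|[A-Z]", sample)
# ['D', 'meet']

# "+" requires at least one vowel, so no empty strings are matched.
vowel_runs = re.findall("[aeiou]+", sample)
# ['o', 'e', 'ee', 'a', 'oo']

for result in (plain_split, keeping_split, either, vowel_runs):
    print(result)
```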
Now let's try to combine these ideas into some more useful examples.
- For example, words that have two adjacent vowels.

      import re
      print re.findall("[a-z]*[aeiou][aeiou][a-z]*", "Can you help me?", re.IGNORECASE)

  This pattern literally says to find strings that contain any number of letters, which are followed by two adjacent vowels, and then followed by any number of letters. Oh, and by the way, ignore case when matching.

  Note that the pattern matching automatically matches as much as possible. For example, "ou" would have been a correct match, but "you" is a larger match.

- Similarly, what if we just want to find words with two vowels, adjacent or not?

      import re
      print re.findall("[a-z]*[aeiou][a-z]*[aeiou][a-z]*", "Can you help me, Jonathan?", re.IGNORECASE)
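The "matches as much as possible" behavior noted above can be seen in a short sketch. The sample strings and variable names here are made up for illustration and are not the ones used in the exercises.

```python
import re

# Whole-word pattern from the text: only words with two vowels match.
two_vowels = "[a-z]*[aeiou][a-z]*[aeiou][a-z]*"
words = re.findall(two_vowels, "Ripe bananas smell fresh", re.IGNORECASE)
print(words)     # ['Ripe', 'bananas']

# Without the outer [a-z]*, the match runs from the first vowel to the
# LAST vowel it can reach ("anana"), not to the nearest one ("ana").
longest = re.findall("[aeiou][a-z]*[aeiou]", "bananas")
print(longest)   # ['anana']
```

In the second call, "ana" would also satisfy the pattern, but the repeated [a-z]* takes as many letters as it can while still allowing the rest of the pattern to match.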
One way of using re.findall() is just to test whether a pattern occurs in the string:

    …
    if re.findall(pattern, string):
        print "Match found!"
    else:
        print "Match not found."
    …

This particular code uses the Python fact that the empty list acts like False in a conditional. This is a very “Pythonic” way of writing this test.
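As a small sketch of this idiom, the test can be wrapped in a function. The name has_match and the sample strings are hypothetical, chosen only for this example.

```python
import re

def has_match(pattern, text):
    """Hypothetical helper: report whether pattern occurs anywhere in text."""
    # re.findall() returns a list; an empty list counts as False in a
    # conditional and a non-empty list counts as True.
    if re.findall(pattern, text):
        return True
    else:
        return False

print(has_match("[aeiou][aeiou]", "Can you help me?"))   # True
print(has_match("[aeiou][aeiou]", "Why fly by?"))        # False
```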