Practice with Regular Expressions
Now let's practice more with REs, including the new features
that you've learned about.
These examples will use both re.findall and the British National Corpus.
British National Corpus
The BNC is a large collection of (British) English sentences. This database provides much more realistic examples than the silly sentences that I come up with. It is also much larger than any of the text data files that we have provided as samples. It is often used for research in computational linguistics, especially for work in language education and natural language processing. It is not the only such corpus (a large body of text), but it is convenient for our purposes because it provides a simple RE search on its home page.
There are a few things to know about this search, which we will illustrate by example. Type each of the following examples into the search box on the BNC home page.
-
Let's search just for the word “house”. Enter { house } into the search box. The curly brackets indicate a RE search. We don't need the curly brackets in this simple search, since the RE is just a fixed string, but all of our other examples will. The spaces between the curly brackets and the RE are optional, but I'll use them for clarity.
First, notice that there were 49424 matches, and 50 of those are printed. As previously motivated, this search acts like “grep”. The database is separated into sentences. The RE is matched against each sentence.
Next, you can see that some of the matching sentences have the word “House”, not “house”. The matching is automatically case-insensitive.
Finally, all of the matches are for “house” or “House”, but not variants such as “houses”. The RE must match an entire word in the sentence. This is somewhat as if the RE had \b added at its beginning and end. As a reminder about \b, try the next examples.

    import re
    print(re.findall(r"house", "The in-house counsel for the house is housed in the madhouse."))
    print(re.findall(r"\bhouse\b", "The in-house counsel for the house is housed in the madhouse."))
-
Search for { house, }. No matches?!? From the previous search, we know there are sentences where “house” is followed by a comma. The previous description of adding \b isn't quite right. More accurately, the database has internally already been separated into words, and the RE match is done against each word, not the full sentence text.
-
Search for { house.* }. As a reminder, the period matches any one character, and the * allows zero or more of them. So, this matches any word starting with “house” and ending with anything, including nothing. This includes “house”, “houses”, and “household”.
-
Search for { .*[aeiou][aeiou].* }. The search times out. The site isn't particularly fast, and some searches are just too broad. Try searching for { .*uu.* } instead. (2089 BNC matches)
-
Search for { .*(aa|ii).* }. (32463 BNC matches) Grouping with plain-old parentheses (not (?:…)) works fine with the BNC.
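The behavior seen in these searches can be approximated in Python. The following is only a rough sketch: bnc_like_search is a hypothetical helper, and its tokenization rule is an assumption, not a description of how the BNC site is actually implemented. It splits a sentence into words and requires the RE to match an entire word, ignoring case.

```python
import re

def bnc_like_search(pattern, sentence):
    """Return the words in `sentence` that `pattern` matches in full, ignoring case.

    Hypothetical helper that only approximates the search behavior
    observed above; the real BNC implementation is not known here.
    """
    # Assumption: a word is a run of letters, digits, hyphens, and apostrophes,
    # and each punctuation mark is a separate token.
    tokens = re.findall(r"[\w'-]+|[^\w\s]", sentence)
    # re.fullmatch requires the RE to cover the whole token.
    return [t for t in tokens if re.fullmatch(pattern, t, re.IGNORECASE)]

sentence = ("The House rules say the in-house counsel for the house, "
            "who houses guests, runs the household.")

print(bnc_like_search(r"house", sentence))    # ['House', 'house']
print(bnc_like_search(r"house,", sentence))   # [] since no single word contains the comma
print(bnc_like_search(r"house.*", sentence))  # ['House', 'house', 'houses', 'household']
```

Note that “in-house” is not returned by any of these patterns, because the whole word must match, not just part of it.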
Exercises
For each of the following, create an example to match with re.findall and also search the BNC. Because of the differences noted in the previous examples, you may need slightly different REs for each use.
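As a sketch of why the REs may need to differ, here are three ways re.findall behaves differently from the BNC search. The sample sentences are made up for illustration.

```python
import re

text = "An aardvark went skiing in Hawaii."

# 1. With a capturing group, re.findall returns only the group's text.
grouped = re.findall(r"\w*(aa|ii)\w*", text)
print(grouped)   # ['aa', 'ii', 'ii']

# A non-capturing group (?:...) returns the whole match instead.
whole = re.findall(r"\w*(?:aa|ii)\w*", text)
print(whole)     # ['aardvark', 'skiing', 'Hawaii']

# 2. re.findall is case-sensitive unless re.IGNORECASE is passed.
houses = "The House and the house."
print(re.findall(r"\bhouse\b", houses))                 # ['house']
print(re.findall(r"\bhouse\b", houses, re.IGNORECASE))  # ['House', 'house']

# 3. re.findall sees the whole sentence, so .* runs past the end of the word.
line = "The household is in the house today."
greedy = re.findall(r"house.*", line)
print(greedy)    # ['household is in the house today.']

# \w* stays inside the word.
words = re.findall(r"\bhouse\w*", line)
print(words)     # ['household', 'house']
```

In short, an RE that works in the BNC search box often needs \b, \w, (?:…), or re.IGNORECASE added before it gives the same answers with re.findall.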
-
All hyphenated words that start with “house”. For example, it should match “house-husband” and “house-to-house”. (497 or 501 BNC matches, depending on your RE)
-
All hyphenated words that end with “house”. For example, it should match “in-house” and “house-to-house”. (1549 BNC matches)
-
All hyphenated words that include “house”. (1985 BNC matches)
-
All words containing three or more “u”'s, such as “unusual” and “tumultuous”. (19252 BNC matches)
-
A more typical use of the BNC or similar resources is to research some topic of interest. You each have some interest, such as neuroscience. Search the BNC on the topic of your interest.
Keep in mind that the BNC database is not highly technical. Also, the texts are all from the later part of the 20th century, none newer than 1994.
Additional Systems Searchable by REs
- Wikipedia's list of RE-searchable software
- Bible Analyzer (free download)
- Openquran (free download)