What is a Word?

Big Picture

Our end goal is to analyze a text by looking at its components. Such an analysis could be used to identify the text author, or properties about the author (remember the Gender Guesser?), or to generate similar texts.

The components of a text need not just be its words. We might also consider its punctuation, since an author's use of punctuation can be very distinctive.

So, given the first paragraph above, we would like to be able to split it into a list of strings of its words and punctuation.

["Our", "end", "goal", "is", "to", "analyze", "a", "text", "by",
 "looking", "at", "its", "components", ".",
 "Such", "an", "analysis", "could", "be", "used", "to", "identify",
 "the", "text", "author", ",", "or", "properties", "about", "the",
 "author", "(", "remember", "the", "Gender", "Guesser", "?", ")", ",",
 "or", "to", "generate", "similar", "texts", "."]

To accomplish this, we will use re.split() to split a text string. Its first argument is a regular expression that describes the separators, or delimiters, of the text; its second argument is the string to split. Here, we want both whitespace and punctuation to serve as separators. This will give us a list with words, punctuation, and whitespace. We don't care about the whitespace, so we will then either remove it from the list or simply ignore it.

Developing a Regular Expression

We want the RE to distinguish punctuation as accurately as possible, so that our end analysis is as useful as possible. However, some symbols, such as . and ' have multiple uses, and we will find it difficult to distinguish these uses solely based upon the syntax.

We'll build up an appropriate RE piece by piece. Try each example on sample text strings. Also, these REs will be somewhat inaccurate. They are the best I have come up with. If you have suggestions for improvement, please tell me. However, for consistency in grading, please use the RE developed here on the upcoming assignment.

In our original version of splitting, we used the following to split solely based upon whitespace.

text.split()

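Using re.split(), we can accomplish nearly the same with the following simple RE. The one difference is that re.split() returns an empty string at either end of the list when the text starts or ends with whitespace. As a reminder, \s is RE shorthand for any single whitespace character, including a space, newline, or tab.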

re.split(r"\s+", text)
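Here is a small check of how the two versions compare, as a sketch. The sample string and the variable names are made up for illustration.

```python
import re

# Sample text with leading and trailing whitespace (made up for this check).
text = "  Our end goal \n is to analyze\ta text. "

# str.split() with no argument drops leading and trailing whitespace.
plain = text.split()

# re.split() keeps an empty string on each end where the text
# starts or ends with a separator.
with_re = re.split(r"\s+", text)

print(plain)
print(with_re)

# Dropping the empty strings makes the two results agree.
print([w for w in with_re if w] == plain)
```

On this sample, the two lists differ only by the empty strings at the ends.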

Let's start with some easy cases. Many punctuation characters never appear as part of a word and are never grouped with other characters as part of a larger punctuation mark. We'll put the following in this group: colon, semicolon, slash, (double) quotation mark, backwards (single) quotation mark, parentheses, ampersand, and pound sign. One exception comes to mind: filler such as #*!?#@! used in place of an expletive. But the symbols used that way are so arbitrary that we'll ignore that case.

re.split(r"\s+|[:;\"`()/&#]", text)

As a reminder, we want to generate a list of strings that include the punctuation. As we've seen before, we can accomplish that by adding parentheses around the whole RE.

re.split(r"(\s+|[:;\"`()/&#])", text)
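To see what the parentheses change, we can try both versions on one string. This is only a sketch; the sample sentence and variable names are assumptions for illustration.

```python
import re

text = 'He said "stop" (twice).'

# Without a capturing group, the separators are thrown away.
without_group = re.split(r"\s+|[:;\"`()/&#]", text)

# With the whole RE in parentheses, each separator is kept
# as its own item in the result.
with_group = re.split(r"(\s+|[:;\"`()/&#])", text)

print(without_group)
print(with_group)

# Because every separator is kept, joining the pieces rebuilds the text.
print("".join(with_group) == text)
```

Both lists contain empty strings wherever two separators sit next to each other, such as the space followed by the quotation mark. The period is still attached at the end, since this RE does not handle it yet.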

Next, let's consider question mark and exclamation mark. Again, neither appears as part of a word. Typically, a question mark or exclamation mark occurs singly, but we sometimes also see them combined, such as “??????” or “?!?”.

re.split(r"(\s+|[:;\"`()/&#]|[?!]+)", text)

Now we get into the harder cases. Consider “-”. It can be a hyphen in a word, a hyphen added within a word due to a line-break on a page, part of a dash between words, or a leading minus sign on a number. A dash is often written with two of the characters, but sometimes one, and sometimes three. We'll ignore the line-break issue. Ideally our text strings wouldn't have these, although our text could come from a scanned page.

With numbers, there's a question of whether something like “555-1234” is a telephone number or a mathematical expression involving subtraction. Given that we are expecting text, we'll assume the former.

So, we'll say that the character is punctuation if there are two or more together or if there is one followed immediately by something other than a letter or digit.

re.split(r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|-(?![A-Za-z\d]))", text)
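Trying the hyphen cases from above on a made-up sentence shows which ones get split off. This is a sketch; the sample text and the filtering step are assumptions for illustration.

```python
import re

pattern = r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|-(?![A-Za-z\d]))"

# One sentence covering: hyphenated word, dash, phone number,
# lone hyphen between spaces, and leading minus sign.
text = "A well-known fact -- call 555-1234 - or -5 degrees"

pieces = re.split(pattern, text)

# Drop whitespace and empty strings so only words and punctuation remain.
tokens = [p for p in pieces if p.strip()]

print(tokens)
```

The hyphens inside “well-known”, “555-1234”, and “-5” stay put because each is followed by a letter or digit. The “--” and the lone “-” become items of their own.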

A period is similar. It can be part of an abbreviation, a decimal point, part of an ellipsis (typically, three dots, but not always), or a sentence-ending period. We'll assume it's punctuation if it's not immediately followed by a letter or digit. Note that this is inaccurate for a period at the end of an abbreviation, as in “U.S.A.”. However, if an abbreviation comes at the end of the sentence, it's uncommon to have both the abbreviation's period and the sentence's period.

By the way, remember that the period is a special character in an RE, so it needs an escape.

re.split(r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|-(?![A-Za-z\d])|\.{2,}|\.(?![A-Za-z\d]))", text)

Or, combining cases to simplify…

re.split(r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.](?![A-Za-z\d]))", text)
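As a quick sanity check, we can run the longer and the combined forms on the same string and compare. Agreement on one sample is not a proof that they are equivalent, and the sample text and variable names are made up for illustration.

```python
import re

long_form = r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|-(?![A-Za-z\d])|\.{2,}|\.(?![A-Za-z\d]))"
short_form = r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.](?![A-Za-z\d]))"

# Covers a decimal point, an ellipsis, an abbreviation, and a final period.
text = "Pi is about 3.14... The U.S.A. is big."

long_pieces = re.split(long_form, text)
short_pieces = re.split(short_form, text)

# Both forms should produce the same split on this sample.
print(long_pieces == short_pieces)

# Drop whitespace and empty strings to see the words and punctuation.
tokens = [p for p in short_pieces if p.strip()]
print(tokens)
```

The result also shows the inaccuracy mentioned above: the last period of “U.S.A.” is split off as if it ended a sentence.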

A comma is punctuation unless it is followed immediately by a digit, as in “10,000”.

re.split(r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.](?![A-Za-z\d])|,(?!\d))", text)
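A short check of the comma rule, as a sketch on a made-up sentence:

```python
import re

pattern = r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.](?![A-Za-z\d])|,(?!\d))"

text = "We sold 10,000 apples, pears, and plums in rooms 1,2."

pieces = re.split(pattern, text)

# Drop whitespace and empty strings so only words and punctuation remain.
tokens = [p for p in pieces if p.strip()]

print(tokens)
```

The comma in “10,000” stays inside the number, as intended. So does the comma in “1,2”, even though there it separates two items, which is another place where this RE is somewhat inaccurate.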

Lastly, we'll consider the forward single quote, which doubles as the apostrophe. We'll treat it like the hyphen and period. It is presumably a quotation mark if it is not immediately followed by a letter or digit. If it is followed by a letter or digit, then it's part of a word, as in “can't” or “'70s”. However, that is inaccurate when it should be an apostrophe at the end of the word, as in “the two boys' parents”.

Single quotation marks should come in pairs, as in “`This is a quote.'” or “'This is a quote.'”, so a potentially more accurate approach would be to match them in pairs. Any others should be apostrophes. Alas, such pairing can't be done in general with REs.
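These notes do not give an RE for the forward single quote rule stated above (punctuation unless followed immediately by a letter or digit). One way it could be written is to add ' to the character class that already holds the hyphen and period. The pattern below is an assumption for illustration only; for graded work, the RE developed in these notes still applies.

```python
import re

# Hypothetical variant: the only change is the added ' in [\-.'].
quote_pattern = r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.'](?![A-Za-z\d])|,(?!\d))"

text = "She can't recall the '70s, or the two boys' parents."

pieces = re.split(quote_pattern, text)

# Drop whitespace and empty strings so only words and punctuation remain.
tokens = [p for p in pieces if p.strip()]

print(tokens)
```

On this sample, “can't” and “'70s” stay whole, while the apostrophe in “boys'” is split off as if it were a closing quotation mark. That is the inaccuracy described above.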

By ignoring the dollar sign, percent sign, and plus, I've implicitly included them as letter-like symbols that are parts of words, as in “$50”, “50%”, and “50+”. I've ignored some other symbols on the keyboard that rarely occur in text, such as the tilde, vertical bar, caret, less than, greater than, equals, and underscore. As is, they can also be parts of words. However, it could make sense to add them to the first set of punctuation symbols.

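So, once again, our final version is as follows. Note that it contains no case for the forward single quote, so as written that character is never split off as a separator.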

import re

text = …

words = re.split(r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.](?![A-Za-z\d])|,(?!\d))", text)

For use in text analysis, one thing we'll want to do is to remove all the instances of whitespace from this list, along with the empty strings that re.split() produces wherever two separators are adjacent.
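One way to do that cleanup is with a list comprehension, sketched here. The function name tokenize and the sample sentence are assumptions for illustration, not part of the assignment.

```python
import re

# The final RE from above.
PATTERN = r"(\s+|[:;\"`()/&#]|[?!]+|-{2,}|\.{2,}|[\-.](?![A-Za-z\d])|,(?!\d))"

def tokenize(text):
    """Split text into words and punctuation, dropping the rest.

    Hypothetical helper: p.strip() is empty both for whitespace-only
    pieces and for empty strings, so one test removes both.
    """
    pieces = re.split(PATTERN, text)
    return [p for p in pieces if p.strip()]

sample = "Who is the author (remember the Gender Guesser?), or not."
words = tokenize(sample)
print(words)
```

Filtering once, right after the split, means the later analysis code never has to check for whitespace or empty strings.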

COMP 200: Elements of Computer Science Spring 2013
