Text Problems Overview

This outlines our next set of real-world problems, each of which will deal with textual data such as in books.

We are going to focus on three inter-related problems.

String/text matching & searching — Given a bunch of text, we want to find some specific word, word stem, or pattern.
Text analysis & author identification — Given a text, we want to determine some properties of the text or of the author
Text generation — We want to be able to automatically generate meaningful text.

As we explore these problems with some potential solutions, we will also introduce more Python features.

String/text matching & searching

Given a bunch of text, we want to find some specific word, word stem, or pattern.

One typical motivation is that a word is really a placeholder for a topic that you are researching. A music historian may search for all instances of “Hildegard von Bingen”, or variations such as “Hildegardis Bengensis”. In some fields, scholars have found this so important that they have created concordances, indices mapping words to their occurrences (examples: OpenSource Shakespeare Concordance, Strong's Exhaustive Bible Concordance). More modern examples would include most web searches with Google, or looking on IMDB for movies, or Amazon for products you want to buy.

Another motivation is that you are researching the word itself, its meanings, and its etymology. You want to see how the word is used over time. You also want to have a flexible search that can allow for alternate spellings. This is a typical motivation for using the British National Corpus, which we will use for some examples.

Alternate motivations arise when we think of text that doesn't represent words. Consider DNA or RNA encoded as text like “AGTCCG”. We can search for matching genes or alleles in multiple organisms. We again may want flexible matching to allow for genetic variation from mutation. Also, consider music encoded as text. Text searching can then where melodies have been reused within a piece or by other artists. You could use this basic idea to implement something like the popular Shazam app.

In the next class, we will look at some simple algorithms for matching strings. These will serve as introductions to working with text, as well as more experience writing Python functions with loops and conditionals. Soon in the course, we will look at regular expressions, a common tool for describing text patterns and often used for more flexible string matching.

Text analysis & author identification

Our motivational problem here is that given a text, we would like to identify who wrote it. Some examples:

Who wrote Shakepeare's plays?
The authorship of Biblical books has led to numerous historical and religious controversies. For example, read the introduction to this technical paper (Putnins et al. '05).
Did Samuel Coleridge write the anonymous 1821 translation of Goethe's Faust?
Determining Authorship of Ron Paul Newsletters
Did Bill Ayers write Barack Obama's book?

Alas, a solution would also have more sinister purposes, such as a totalitarian state identifying anonymous bloggers.

How can we approach such a problem? Actually, a wide variety of approaches are possible, including

Historical — Find and interpret documents that describe the text in question or its potential authors.
Chemical — Analyze the chemical properties of the ink and paper (or papyrus, etc.) used in the document. Carbon-date the materials. Look for DNA in sweat or blood stains. Try anything else that might appear on CSI.
Handwriting — Analyze the handwriting (or, similarly, typewriting) of the original document. If possible, compare to known exemplars of suspected authors.
Literary — Analyze the literary style of the text. If possible, compare to known exemplars of suspected authors.
Linguistic (stylometry) — Look for statistical patterns in the word use of the text. Compare to known exemplars of suspected authors.

We will focus on making some basic statistical analyses that can be used as the basis of more sophisticated stylometric approaches. We will also use such analysis as a basis for automatic text generation, as outlined in the next section.

More broadly, such analysis has uses other than for attributing authorship. This paper (Juola '06) provides a good, mostly non-technical, overview. We might not be able to attribute authorship to an individual, but instead want a profile of the author, such as demographics. Read the introduction to this technical paper (Argamon et al, '09) for further discussion. As an amusing application of the technique desccribed by this paper, try this Gender Guesser on writings by yourself and your friends.

Closely related to determining authorship is detecting plagiarism. Many schools now use automated plagiarism detectors for essays and code.

Text generation

Numerous interactive systems attempt to generate meaningful text or speech in order to communicate in a user-friendly manner. Most simpler systems use a library of canned text pieces which are glued together. More sophisticated systems incorporate some sort of knowledge about semantics (what words mean) and syntax (how to put words together).

Historically, one of the early goals of computer science was to develop a system that passed the Turing Test. Simply put, the goal was to have a computer program that could have a conversation and pass as human. The underlying assumption was that this was a more testable and well-defined problem than judging whether a computer could “think”.

Conversational programs, now known as chatbots can take a variety of approaches, but we will develop one based upon statistical text analysis.

For now, try some of the following online chatbots. As you'll see, in general conversation they can be very amusing or quite idiotic. Here's an early example of two chatbots talking to each other. Save your most humerous conversional snippets for class discussion.

For practical use, they are best when their topics of conversation are more constrained. You've probably used similar systems on automated phone systems, in computer games, or on your cell phone (e.g., Siri or Google Voice).

COMP 200: Elements of Computer Science Spring 2013

Text Problems Overview

String/text matching & searching

Text analysis & author identification

Text generation

COMP 200: Elements of Computer Science
Spring 2013