COMP 200: Elements of Computer Science
Spring 2013

Who was Mark V. Shaney?

Mark was a member of a UseNet News group called net.singles, a users group chock full of dating tips, lonely heart chatter, frank discussions of sexual problems and high tech missionary gospel about the sins of premarital smut-typing.

Penn Jillette

For a longer intro about Mr. Shaney by the colorful Mr. Jillette, read the rest of his short posting entitled “I Spent an Interesting Evening Recently with a Grain of Salt”.

Example quotes from Mark V. Shaney

I'm not prone to either violence or obscenities, but the outside pressure was too much for both of us. I live for the moment. I find (among yuppies) that the readership of net.singles is the same indecision, embarrassment, and confusion that they refer to themselves as the yuppies- you figure it out! You know lately I have ever direct personally insulting remarks to anyone in my brief (and very enjoyable) net career. Now, let's get the definition of teasing to be meeting a lot of people. We've been together a year now, but I won't go back through old news articles to get your opinions. (I bet you can guess how much mail I've been getting recently:-) Send in your responses and hopefully we can come up with better ones.

Commenting on President Bush (#1):

Mr. Chairman, delegates, fellow citizens, I'm honored to aid the rise of democracy in Germany and Japan, Nicaragua and Central Europe and the freedom of knowing you can take them. Tonight, I remind every parent and every school must teach, so we do to improve health care and a more hopeful America. I am in their days of worry. We see that character in our future. We will build a safer world today. The progress we and our friends and allies seek in the life of our work. The terrorists are fighting freedom with all their cunning and cruelty because freedom is not America's gift to every man and woman in this place, that dream is renewed. Now we go forward, grateful for our older workers. With the huge baby boom generation approaching retirement, many of our work. About 40 nations stand beside us in the next four years.

Mark V. Shaney was a Computer Program

Surprise, surprise!

You can try out this web-based chatbot version.

Like most chatbots, it worked in two phases. First, previous posts on net.singles were used to “train” the program, i.e., to build a statistical description of how phrases and sentences are formed. The program used no knowledge of the English language beyond recognizing words, punctuation, and whitespace. The data structure that it built during training is known as a Markov chain. Markov chains will be discussed more below. On Assignment 4, you will be writing the code for this phase.

In the second phase, the Markov chain guides the random generation of new text. The new text only uses words found in the original training text. Furthermore, each small sequence of text in the generated text also occurs in the training text. On Assignment 5, you will write the code for this phase.

See also “his” Wikipedia entry.

Markov Chains

Very generally, a Markov chain represents statistical data about how past information was followed by successor information. Such data can be used predictively, under the assumption that future behavior will be similar to past behavior. It can also be used to generate new data probabilistically.

More specifically, when applied to word sequences, a Markov chain maps each n-word sequence in the training data to the possible successors and their probabilities. We will represent this as a Python dictionary mapping each n-word tuple to a dictionary which, in turn, maps each successor word to a probability. We can choose which n to use, and we say we are using the nth-order Markov chain.
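For example, here is one minimal sketch of how such a dictionary could be built. The function name build_markov_chain and the tiny example text are only illustrative; they are not the interface required on the assignments.

    def build_markov_chain(words, n):
        """Map each n-word tuple in words to a dict of successor probabilities."""
        counts = {}
        for i in range(len(words) - n):
            key = tuple(words[i:i + n])       # an n-word sequence from the training text
            successor = words[i + n]          # the word that followed it
            counts.setdefault(key, {})
            counts[key][successor] = counts[key].get(successor, 0) + 1

        # Convert raw counts into probabilities.
        chain = {}
        for key, successors in counts.items():
            total = sum(successors.values())
            chain[key] = {word: count / float(total)
                          for word, count in successors.items()}
        return chain

    # For example, with n = 1:
    #   build_markov_chain("the cat sat on the mat".split(), 1)
    # produces
    #   {('the',): {'cat': 0.5, 'mat': 0.5}, ('cat',): {'sat': 1.0},
    #    ('sat',): {'on': 1.0}, ('on',): {'the': 1.0}}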

We've provided full examples for the songs Eight Days a Week and Johnny B. Goode.

Intuitively, with a smaller n, the word sequences are shorter, so the Markov chain captures less about the structure of the training data. Each of these shorter word sequences typically has more possible successors. Conversely, with a larger n, each word sequence has fewer possible successors; often there will be only one successor for a given word sequence.

So, when we turn to generating text, a smaller n will generate text that is more random. It will typically be less grammatical and make less sense, since the only grammatical knowledge is implicit in which word sequences occur in the training data. On the other hand, a larger n will generate text that is more similar to the training data. By definition, every n-word sequence in the generated text is taken straight from the training text. Moreover, because the Markov chain will often map a word sequence to only one possible successor, longer word sequences will also often appear verbatim from the training text.
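As a rough sketch of this phase (again with illustrative names, assuming a chain in the dictionary-of-dictionaries form described above), generation repeatedly samples a successor for the most recent n words:

    import random

    def generate_text(chain, num_words):
        """Generate roughly num_words words by walking the Markov chain."""
        current = random.choice(list(chain.keys()))   # start at a random n-word sequence
        output = list(current)
        while len(output) < num_words:
            successors = chain.get(current)
            if not successors:                        # dead end: sequence had no recorded successor
                break
            # Choose the next word with the probability recorded in the chain.
            r = random.random()
            cumulative = 0.0
            for word, prob in successors.items():
                cumulative += prob
                if r <= cumulative:
                    break
            output.append(word)
            current = tuple(output[-len(current):])   # slide the n-word window forward
        return " ".join(output)

Because a large n leaves most n-word sequences with only one recorded successor, a walk like this tends to reproduce long verbatim stretches of the training text, which is exactly the behavior described above.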