COMP 200: Elements of Computer Science
Spring 2013

File Input

As always, collaborate with your neighbor and upload your solutions.

The following is a list of the text files that we currently have available for you to use. To open one of these in CodeSkulptor, use urllib2.urlopen(codeskulptor.file2url(filename)), where filename is the file's name as a string.
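As a reminder of what you can do once a file is open, here is a minimal sketch that counts the lines and words in a file-like object. The function name count_lines_and_words and the sample text are made up for illustration. So that the example runs anywhere, it uses io.StringIO as a stand-in for an opened file; in CodeSkulptor you would instead pass in the object returned by urllib2.urlopen(codeskulptor.file2url(filename)).

```python
import io

def count_lines_and_words(file_obj):
    """Return (line_count, word_count) for an open file-like object."""
    line_count = 0
    word_count = 0
    for line in file_obj.readlines():
        line_count += 1
        # split() with no argument breaks on any run of whitespace.
        word_count += len(line.split())
    return line_count, word_count

# Stand-in for an opened file (hypothetical sample text).
sample = io.StringIO(u"To be, or not to be:\nthat is the question.\n")
result = count_lines_and_words(sample)
```

Note that split() leaves punctuation attached to the words ("be," and "be:" above), which is one reason the questions below are worth thinking about.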

Today's exercises build upon the exercises from last class and the examples that you were to read before today.

To ponder

The last exercise was a challenge, but real text gets harder still. Think about how you would take that exercise further.

How would you separate punctuation such as ', which can be either a single quote (not part of a word) or an apostrophe (part of a word)? Similarly, consider -, which can be a dash between words or a hyphen within a word. How can you distinguish between the cases? Writing code for this would be a huge challenge for you at this point.
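One possible starting point, offered only as a rough heuristic and not as a full answer: treat ' or - as part of a word when it has a letter directly on both sides, and as separate punctuation otherwise. The function names is_inside_word and strip_non_word_marks are invented for this sketch.

```python
def is_inside_word(text, index):
    """True if the character at index has a letter directly on both sides."""
    if index <= 0 or index >= len(text) - 1:
        return False
    return text[index - 1].isalpha() and text[index + 1].isalpha()

def strip_non_word_marks(text, marks="'-"):
    """Replace each mark with a space unless it sits between two letters."""
    result = ""
    for i in range(len(text)):
        ch = text[i]
        if ch in marks and not is_inside_word(text, i):
            result += " "
        else:
            result += ch
    return result

example = "'Don't go,' she said - the well-known rule."
words = strip_non_word_marks(example).split()
```

Here words keeps "Don't" and "well-known" intact while dropping the surrounding quotes and the free-standing dash. The heuristic is easy to fool: it would strip the apostrophe from a plural possessive such as "dogs'" or from "'tis", and it cannot tell a closing quote from either of those.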

The Python feature that we will introduce for next class (regular expressions) will help us not only with simpler punctuation, such as periods, but also with more complicated punctuation, such as apostrophes.

Use urllib2.urlopen() on some HTML file on the web, such as these notes. Especially if you've never seen HTML before, observe all the “tags” in angle brackets. How would you remove all the HTML tags, so that you're left with just the textual content?

Solving this would go beyond what regular expressions can do well, but we're going to stick to text file input, anyway.
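To get a feel for the problem, here is a deliberately naive sketch that drops everything between a < and the next >. The function name strip_tags and the sample page are invented for illustration. It assumes every < starts a tag, so it mishandles comments, scripts, a > inside an attribute value, and a literal < in the text.

```python
def strip_tags(html):
    """Drop everything between '<' and the next '>' (naive)."""
    pieces = []
    inside_tag = False
    for ch in html:
        if ch == "<":
            inside_tag = True
        elif ch == ">" and inside_tag:
            inside_tag = False
        elif not inside_tag:
            pieces.append(ch)
    return "".join(pieces)

# Hypothetical sample page, not one of the course files.
page = "<html><body><h1>File Input</h1><p>Use <b>urlopen</b> to read.</p></body></html>"
text = strip_tags(page)
```

For this page, text runs the heading and the paragraph together with no space between them, because the tags that separated them are simply deleted. Try strip_tags("1 < 2") to see the sketch discard ordinary text after a stray <.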