File Input

As always, collaborate with your neighbor and upload your solutions.

The following is a list of the text files that we currently have available for you to use. To open these in CodeSkulptor, use urllib2.urlopen(codeskulptor.file2url(filename)).

Smaller files useful for testing: "comp200-test.txt", "comp200-Eight_Days_a_Week.txt", "comp200-Johnny_B_Goode.txt", "comp200-Gettysburg_Address.txt"
Large files: "comp200-Odyssey.txt", "comp200-War_and_Peace.txt". If your computer or Internet connection is not fast enough, you may receive a "TimeLimitError: Program exceeded run time limit." error when you run your program on these large files.
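
As a minimal sketch of the call described above, the following reads one of these files into a single string and splits it on whitespace. The name read_words is hypothetical, and placing the imports inside the function is a choice made for this sketch so that the definition also loads outside CodeSkulptor, where urllib2 and codeskulptor are not available.

```python
def read_words(filename):
    """Return the whitespace-separated words of one of the course text files.

    read_words is a hypothetical helper name. Calling it only works inside
    CodeSkulptor, because it relies on the codeskulptor module.
    """
    # CodeSkulptor-specific modules, imported here (an assumption of this
    # sketch) so that merely defining the function does not fail elsewhere.
    import urllib2
    import codeskulptor

    url = codeskulptor.file2url(filename)   # turn the filename into a URL
    data_file = urllib2.urlopen(url)        # open that URL like a file
    text = data_file.read()                 # the whole file as one big string
    return text.split()                     # split on any run of whitespace

# In CodeSkulptor, try one of the smaller files first:
# print(read_words("comp200-test.txt"))
```

Starting with a small file makes it easier to check the result by eye before moving on to the large ones.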

Today's exercises build upon the exercises from last class and the examples that you were to read before today.

Use urllib2.urlopen() to read data from a URL not provided by us. You can open any URL, but a text file is most appropriate for our current purposes. Search the web for something interesting.
Use each of your functions from the previous class on the words obtained from these data files.
Define a function word_counts_file(filename). It takes a filename and returns the word count dictionary. It simply packages all of the file reading code together with a call to count_words(). You should not change the definition of count_words() for this.

Test your word_counts_file function using OwlTest.
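
One possible shape for this packaging is sketched below. The count_words() shown here is only a stand-in, written under the assumption that your version takes a list of words and returns a dictionary; keep your own definition from last class. The imports are placed inside the function so that the counting step can be tried anywhere, even though the file reading itself only works in CodeSkulptor.

```python
def count_words(words):
    """Stand-in for last class's count_words(): map each word to its count.

    Assumption: your own count_words() takes a list of words and returns a
    dictionary. Use your version instead of this one.
    """
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def word_counts_file(filename):
    """Read a course text file and return its word count dictionary."""
    # Imported here (a choice of this sketch) so that count_words() above can
    # also be tried outside CodeSkulptor, where these modules do not exist.
    import urllib2
    import codeskulptor

    data_file = urllib2.urlopen(codeskulptor.file2url(filename))
    words = data_file.read().split()    # split only on whitespace
    return count_words(words)           # count_words() itself is unchanged


# The counting step can be checked without reading any file:
print(count_words("the cat saw The cat".split()))
```

Note that "the" and "The" end up as two separate keys in the printed dictionary.
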
The previous exercises counted words that differ only in capitalization, such as "the" and "The", separately. Write a variant of word_counts_file() that first converts all the words to lowercase with string.lower(). Call your new function word_counts_file_case_insensitive().

Test your word_counts_file_case_insensitive function using OwlTest.
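
The lowercasing step on its own might look like the sketch below. The helper name lowercase_words is hypothetical and not part of the provided code.

```python
def lowercase_words(words):
    """Return a new list with every word converted to lowercase.

    lowercase_words is a hypothetical helper name used for illustration.
    """
    lowered = []
    for word in words:
        lowered.append(word.lower())    # "The" becomes "the"
    return lowered


print(lowercase_words(["The", "the", "THE", "Odyssey"]))
# prints ['the', 'the', 'the', 'odyssey']
```

An alternative is to call lower() once on the big string before splitting it, which gives the same words with less code.
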
The provided code splits the big string only based upon “whitespace”. Ideally, we would like to also split based upon punctuation, separating the words from the punctuation.

As a challenge, how would you separate punctuation such as periods or commas? Given your tools so far, this is difficult, but we encourage you to try, or at least to think about it. You might write your own splitter that loops over the string's characters. You can also use string.split() with an argument to split on other characters (see its documentation), and you might find string.index() or string.partition() useful.
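
The first suggestion, a splitter that loops over the string's characters, could be sketched as follows. The function name and the small set of punctuation marks are assumptions chosen for illustration, and this is only one of several reasonable approaches.

```python
PUNCTUATION = ".,;:!?"      # an assumed, deliberately small set of marks
WHITESPACE = " \t\n\r"


def split_words_and_punctuation(text):
    """Split text on whitespace, making each punctuation mark its own item."""
    pieces = []
    current = ""                        # the word being built up so far
    for char in text:
        if char in PUNCTUATION:
            if current != "":
                pieces.append(current)  # finish the word before the mark
                current = ""
            pieces.append(char)         # the mark becomes a separate piece
        elif char in WHITESPACE:
            if current != "":
                pieces.append(current)  # whitespace also ends a word
                current = ""
        else:
            current += char             # any other character extends the word
    if current != "":
        pieces.append(current)          # do not lose the final word
    return pieces


print(split_words_and_punctuation("Go, Johnny, go. Go!"))
# prints ['Go', ',', 'Johnny', ',', 'go', '.', 'Go', '!']
```

If you only want the words, leave out the line that appends the punctuation mark.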

To ponder

The last exercise was a challenge, but things get harder. Think about taking this further.

How would you separate punctuation such as ', which can be either a single quote (not part of a word) or an apostrophe (part of a word)? Similarly, consider -, which can be a dash between words or a hyphen within a word. How can you distinguish between the cases? Writing code for this would be a huge challenge for you at this point.
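
To make the difficulty concrete, here is a sketch of one simple heuristic, not a full solution: treat ' or - as part of a word only when it has a letter immediately on both sides. The function names are hypothetical. The rule gets contractions and hyphenated words right, but it misclassifies cases such as the apostrophe in "dogs'" or in "'tis".

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def is_inside_word(text, i):
    """Return True if the character at index i has a letter on both sides."""
    if i == 0 or i == len(text) - 1:
        return False                    # nothing on one side of it
    before = text[i - 1].lower()
    after = text[i + 1].lower()
    return before in LETTERS and after in LETTERS


def classify_marks(text):
    """Return a (mark, is part of a word) pair for each ' or - in text."""
    results = []
    for i in range(len(text)):
        if text[i] in "'-":
            results.append((text[i], is_inside_word(text, i)))
    return results


# Two quotes, one apostrophe, one hyphen, and one dash:
print(classify_marks("'It's well-known' - she said"))
```

Only the apostrophe in "It's" and the hyphen in "well-known" are reported as part of a word; the two quotes and the dash are not.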

The Python feature that we will introduce for next class (regular expressions) will help us with not only the simpler punctuation like periods, but also the more complicated punctuation like apostrophes.

Use urllib2.urlopen() on some HTML file on the web such as these notes. Especially if you've never seen HTML before, observe all the “tags” in angle brackets. How would you remove all the HTML tags, so that you're left with just the textual content?

Removing HTML tags reliably goes beyond what regular expressions can do well, but we're going to stick to text file input anyway.
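
For simple pages, a first attempt could scan the characters and skip everything between a < and the next >, as in the sketch below. The function name is hypothetical. This deliberately ignores real complications such as comments, scripts, character entities like &amp;, and a > inside an attribute value.

```python
def strip_tags(html):
    """Return html with everything between '<' and the next '>' removed."""
    text = ""
    inside_tag = False
    for char in html:
        if char == "<":
            inside_tag = True           # a tag starts here
        elif char == ">" and inside_tag:
            inside_tag = False          # the tag ends here
        elif not inside_tag:
            text += char                # keep characters outside tags
    return text


print(strip_tags("<p>File <b>Input</b></p>"))
# prints File Input
```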

COMP 200: Elements of Computer Science Spring 2013
