File Input
As always, collaborate with your neighbor and upload your solutions.
The following is a list of the text files that we currently have available for you to use. To open these in CodeSkulptor, use urllib2.urlopen(codeskulptor.file2url(filename)).
- Smaller files useful for testing: "comp200-test.txt", "comp200-Eight_Days_a_Week.txt", "comp200-Johnny_B_Goode.txt", "comp200-Gettysburg_Address.txt"
- Large files: "comp200-Odyssey.txt", "comp200-War_and_Peace.txt". If your computer or Internet connection is not fast enough, you may receive a "TimeLimitError: Program exceeded run time limit." error when trying to run your program on these large files.
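As a minimal sketch of how one of these files might be turned into a list of words: read_words is a hypothetical helper name, not part of the provided code, and it assumes only that the opened file has a read() method that returns the whole file as one string.

```python
def read_words(open_file):
    """Return the list of whitespace-separated words in an opened file."""
    text = open_file.read()   # the whole file as one big string
    return text.split()       # split on spaces, tabs, and newlines


# In CodeSkulptor, the opened file would come from the urlopen call
# described above, for example:
#   import urllib2
#   import codeskulptor
#   a_file = urllib2.urlopen(codeskulptor.file2url("comp200-test.txt"))
#   words = read_words(a_file)
```

Keeping the reading and splitting in one small function makes it easy to try the small test files first and move on to the large ones only once the rest of your program works.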
Today's exercises build upon the exercises from last class and the examples that you were to read before today.
- Use urllib2.urlopen() to read data from a URL not provided by us. You can open any URL, but a text file is most appropriate for our current purposes. Search the web for something interesting.
- Use each of your functions from the previous class on the words obtained from these data files.
- Define a function word_counts_file(filename). It takes a filename and returns the word count dictionary. It simply packages all of the file-reading code together with a call to count_words(). You should not change the definition of count_words() for this.
- The previous exercises counted "the" and "The" separately, for example. Change word_counts_file() to use string.lower() to first convert all the words to lowercase. Call your new function word_counts_file_case_insensitive(). Test your word_counts_file_case_insensitive function using OwlTest.
- The provided code splits the big string based only upon “whitespace”. Ideally, we would also like to split based upon punctuation, separating the words from the punctuation.
  As a challenge, how would you separate punctuation such as periods or commas? This is difficult given your tools so far, but we encourage you to try, or at least think about it. You might write your own splitter that loops over the string's characters. You can use string.split() with an argument to split on other characters (see its documentation). Also, you might find string.index() or string.partition() useful.
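One possible shape for the last few exercises is sketched below, working on a string that has already been read from a file. Several things here are assumptions: count_words() is only a stand-in for your own definition from last class, PUNCTUATION is an arbitrary choice of characters, and split_with_punctuation and word_counts_text_case_insensitive are hypothetical names. The splitter uses the "loop over the string's characters" idea: it puts spaces around each punctuation mark and then lets the ordinary whitespace split do the rest.

```python
# Assumption: the characters we choose to treat as punctuation.
PUNCTUATION = ".,;:!?\"()"


def count_words(words):
    """Stand-in for count_words() from last class: maps word -> count."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def split_with_punctuation(text):
    """Split text into words, keeping each punctuation mark as its own item."""
    spaced = ""
    for char in text:
        if char in PUNCTUATION:
            spaced += " " + char + " "   # surround the mark with spaces
        else:
            spaced += char
    return spaced.split()                # now split on whitespace as before


def word_counts_text_case_insensitive(text):
    """Count the words in text, treating "The" and "the" as the same word."""
    return count_words(split_with_punctuation(text.lower()))
```

For example, split_with_punctuation("Hello, world.") gives ["Hello", ",", "world", "."]. This version counts the punctuation marks as items in their own right; dropping them instead would be an equally reasonable choice. A word_counts_file_case_insensitive(filename) function would read the file's text and pass it to a function like the last one.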
To ponder
The last exercise was a challenge, but things get harder. Think about taking this further.
How would you separate punctuation such as ', which can be either a single quote (not part of a word) or an apostrophe (part of a word)? Similarly, consider -, which can be a dash between words or a hyphen within a word. How can you distinguish between the cases?
Writing code for this would be a huge challenge for you at this point.
The Python feature that we will introduce for next class (regular expressions) will help us with not only the simpler punctuation like periods, but also the more complicated punctuation like apostrophes.
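As a rough first approximation only, and without regular expressions, one heuristic is to treat a ' or - as part of a word exactly when it has a letter immediately on both sides. The function name is_inside_word is hypothetical, and the rule itself is an assumption that gets some cases wrong.

```python
def is_inside_word(text, i):
    """Guess whether the character at index i sits inside a word.

    Heuristic (an assumption, not a complete rule): the character is
    part of a word only if the characters just before and just after
    it are both letters.
    """
    if i <= 0 or i >= len(text) - 1:
        return False                  # first or last character of the text
    return text[i - 1].isalpha() and text[i + 1].isalpha()


# The apostrophe in "don't" has letters on both sides, while the
# quotes around 'hi' do not.
sample = "don't say 'hi'"
apostrophe_is_in_word = is_inside_word(sample, 3)     # the ' in don't
opening_quote_is_in_word = is_inside_word(sample, 10)  # the ' before hi
```

This handles "don't" and "well-known", but it misclassifies a possessive such as dogs' or a contraction such as 'tis, where the apostrophe is at the edge of the word. Cases like these are part of why the problem is hard.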
Use urllib2.urlopen() on some HTML file on the web, such as these notes. Especially if you've never seen HTML before, observe all the “tags” in angle brackets. How would you remove all the HTML tags, so that you're left with just the textual content?
Solving this would go beyond what regular expressions can do well, but we're going to stick to text file input, anyway.
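To make the difficulty concrete, here is a naive sketch that simply drops everything between a < and the next >. The name strip_tags is hypothetical, and the approach is deliberately simplistic: it assumes every < starts a tag, so it does not handle comments, scripts, style sheets, or character entities such as &amp;.

```python
def strip_tags(html):
    """Return html with everything between '<' and '>' removed."""
    result = ""
    inside_tag = False
    for char in html:
        if char == "<":
            inside_tag = True          # a tag starts; stop copying
        elif char == ">" and inside_tag:
            inside_tag = False         # the tag ends; resume copying
        elif not inside_tag:
            result += char             # ordinary text outside any tag
    return result
```

For instance, strip_tags("<p>Hello <b>world</b>!</p>") returns "Hello world!". On a real web page the result would still contain script code and other material that a browser never displays as text.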