COMP 200: Elements of Computer Science
Spring 2013

Introduction to Efficiency and Files

Efficiency

The word_counts() example from last class provides a good opportunity to start a discussion about efficiency. When we write software, our first and foremost goal should be correctness. It needs to solve the problem with which we are faced. But, once we have satisfied that goal, we have many other goals to satisfy, as well. One that we have begun to emphasize is that our code should be well-documented and understandable.

Another goal is for our software to solve our problem quickly. Often, this is not a big issue — computers are fast, and the difference between an answer in half a second versus two seconds is often irrelevant. But, waiting for hours for an answer can be not only annoying, but in time-sensitive situations, completely impractical.

So, as a first look at this issue, let's consider three possible solutions to the word_counts() example.

def word_counts1(words):
    """Returns a dictionary mapping each word to its number of occurrences."""
    counts = {}
    for word in words:
        counts[word] = words.count(word)
    return counts

def word_counts2(words):
    """Returns a dictionary mapping each word to its number of occurrences."""
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = words.count(word)
    return counts

def word_counts3(words):
    """Returns a dictionary mapping each word to its number of occurrences."""
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

Theoretical Observation

The first version is the one I saw students write most commonly. It is clearly the shortest. Much of its strategy is to leverage the already-existing list.count() method to do much of the computation. Reusing existing code and functions is generally a good thing.

However, one disadvantage of the first version is readily apparent. Given a list with duplicate words, it will count and recount those duplicated words. The “extra” if statement of the second version addresses this issue, avoiding such recounting, and is thus more efficient. (Well, it's more efficient assuming that the not in check is cheaper than counting a whole list, which it is.)
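To see how much recounting the "extra" if statement saves, we can instrument the loop and tally how many times words.count() gets called under each strategy. This is my own sketch, not code from class; count_calls and its skip_duplicates flag are names I made up for illustration:

```python
def count_calls(words, skip_duplicates):
    """Return how many times words.count() is called.

    skip_duplicates=True models word_counts2; False models word_counts1.
    """
    calls = 0
    counts = {}
    for word in words:
        if skip_duplicates and word in counts:
            continue  # version 2 skips words already counted
        calls += 1
        counts[word] = words.count(word)
    return calls

# With heavy duplication, version 2 does far fewer counts:
many_dups = ["a", "b", "c"] * 100     # 300 elements, only 3 distinct words
print(count_calls(many_dups, False))  # 300 calls, like word_counts1
print(count_calls(many_dups, True))   # 3 calls, like word_counts2
```

Of course, when every word is distinct, the two versions make exactly the same number of calls, which is why the second version is only sometimes better.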

The third version is clearly longer. Even though it is what I hinted at, few students used this strategy on their own. After all, this version is arguably more challenging conceptually. However, I want to argue that it is the most efficient.

In the third version, we loop on the list. Metaphorically, we're walking down the list, looking at each word. For each word, we count that instance. More importantly, what we do for that one word is essentially a single mathematical operation — either storing a number or incrementing a number.

Let's compare that to the first version. As we walk down the list, we again look at each word. But what do we do with that word? We call words.count(word). Since that method is built-in, we don't know exactly what it does, but we can make a pretty good guess. In order for words.count(word) to count all the instances of word in the list words, it has to walk down the entire list. So, while the third version does what is conceptually a single operation for each word, the first version has to walk the entire list for each word. The second version is sometimes better than the first, but if every word in the list is distinct, then it too walks the entire list for each word. Thus, the third version is more efficient, as it does less computation per list element.

More mathematically, let n represent the unknown length of the list of words. The third version does roughly one operation per list element, or 1 × n = n. The first two versions can walk the entire list for each element, thus roughly n operations per list element, or n × n = n². We say that the third version takes linear time, while the other two versions take quadratic time.
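To make the n versus n² contrast concrete, here is a sketch of my own (not from class) that tallies the rough number of elementary operations each version performs, under the assumption that words.count() examines every element of the list:

```python
def ops_word_counts1(words):
    """Tally the list elements examined by word_counts1."""
    ops = 0
    counts = {}
    for word in words:
        ops += len(words)  # words.count() walks the whole list
        counts[word] = words.count(word)
    return ops

def ops_word_counts3(words):
    """Tally the operations done by word_counts3: one per element."""
    ops = 0
    counts = {}
    for word in words:
        ops += 1  # a single store or increment
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return ops

sample = ["a", "b"] * 50             # n = 100
print(ops_word_counts1(sample))      # 100 × 100 = 10000
print(ops_word_counts3(sample))      # 100
```

Doubling the length of sample doubles the work of the third version but quadruples the work of the first, which is exactly what "linear versus quadratic" predicts.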

Empirical Observation (Exercises)

Empirically, how can we observe this? Simply running each version and mentally estimating how long it takes is horribly inaccurate. Instead, look in the CodeSkulptor documentation at the function time.time() and how it can be used to time an operation.

Define some sample lists, and time how long the above functions take. For example,

…

my_list = …

time1 = time.time()
word_counts1(my_list)
time2 = time.time()
print time2 - time1

Note that this code doesn't print the resulting dictionary. After all, we don't want to know how long it takes to print the dictionary, just how long it takes to compute it.

You'll need long enough lists to see enough of a difference. But, lists that are too long will lead to CodeSkulptor timing out. For example, on my laptop, I used definitions such as my_list = range(3) * 2000 or my_list = range(2000). The former produces a list [0, 1, 2, 0, 1, 2, …], with 2000 repetitions of the three numbers.
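Putting the pieces together, here is one possible shape for the whole experiment. This is a sketch of my own; the helper time_call is a name I invented, and in CodeSkulptor you would write print without parentheses:

```python
import time

def word_counts1(words):
    """Quadratic version: recounts the list for every word."""
    counts = {}
    for word in words:
        counts[word] = words.count(word)
    return counts

def word_counts3(words):
    """Linear version: one store or increment per word."""
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

def time_call(func, arg):
    """Return the seconds taken by one call of func(arg)."""
    start = time.time()
    func(arg)
    return time.time() - start

my_list = list(range(2000))              # 2000 distinct "words"
print(time_call(word_counts1, my_list))  # quadratic: typically much slower
print(time_call(word_counts3, my_list))  # linear: near-instant
```

Try it with both kinds of lists from above; with many duplicates, word_counts2 would land between the other two.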

File input

For our upcoming text analyses, we will use large text files to get our data. First, read the CodeSkulptor documentation for the urllib2 module.

The following reads one of the sample data files.

import urllib2
import codeskulptor

file = urllib2.urlopen(codeskulptor.file2url("comp200-Johnny_B_Goode.txt"))
print file.read()

We'll soon provide a list of all the available data files. We're still getting those ready.

The method file.read() returns a single string of all the text in the file. In order to use functions like those you wrote in last class, we want to break that one long string into a list of word strings. A simple way to do that is with the string's split() method.

import urllib2
import codeskulptor

file = urllib2.urlopen(codeskulptor.file2url("comp200-Johnny_B_Goode.txt"))
file_string = file.read()
words = file_string.split()

print words
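Since urllib2 needs a network connection (and the CodeSkulptor-specific codeskulptor module), here is an offline sketch of the same pipeline, using a made-up stand-in string rather than the real contents of comp200-Johnny_B_Goode.txt:

```python
def word_counts(words):
    """Count occurrences of each word (the linear version from above)."""
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

# Stand-in for file.read(): any multi-line string works the same way.
file_string = "go go go Johnny go go\ngo Johnny go go"
words = file_string.split()  # splits on any whitespace, including newlines
print(word_counts(words))
```

Note that split() with no arguments handles runs of spaces and newlines for us, so the word list comes out clean without any extra work.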