Practice for Exam 2: Comp130-specific topics
The following problems are typical of those that will be on the exam as Comp130-specific problems. For non-Comp130-specific problem examples, see the other practice problem set.
In addition to answering these questions, you should also review the course's online notes and the solutions to the two assignments. We have also made a study guide.
Problems
Explain why you might expect the PCA analysis of word/tuple occurrences in a specific author's text corpus to form a cluster in a many-dimensional space.
Solution (no peeking!):
""" A cluster in the PCA analyzed data indicates that certain combinations of words/tuples are particularly common in the text corpus. Since there is naturally a statistical spread in the occurences of the word/tuple combinations, a cluster is formed as each multi-dimensional data point representing the word/tuple counts for a given text chunk tend to clump around a central point in a high dimensional space. These combinations may possibily involving many words/tuples of words, thus the cluster may exist in a the high dimensional space represented by the counts of the words/tuples involved. It should be noted that hte cluster may however exist in only a small subspace of the even larger dimensional space represented by all of the word/tuple counts used to measure the text corpus. """
Rewrite the countTextChunks() function from the PCA analysis code as a MapReduce operation.
Solution (no peeking!):
import text_analysis as ta
import MapReduce as mr
from collections import defaultdict

def list2dict(aList):
    """ Convert the given list to a dictionary where each key is the index
    of the associated value in the given list. """
    aDict = {}
    for i in range(len(aList)):
        aDict[i] = aList[i]
    return aDict

def countTextChunks(chunks, order):
    """ Count the word occurrences in each text chunk in chunks, a list of
    lists of words.  Returns a list of defaultdict(int) dictionaries that
    relate word->count for each text chunk. """
    chunkDict = list2dict(chunks)  # convert the list of text chunks into a dictionary of text chunks

    def mapCountChunk(k, v, outputFunc):
        """ Calls the given outputFunc(aKey, aValue) where aKey = k and
        aValue is the dictionary of counts of tuples of length order in the
        given text chunk, v. """
        outputFunc(k, ta.countSequences(v, order))  # notice how order is curried in

    def reduceCountChunk(k, v, outputFunc):
        """ Calls outputFunc(aKey, aValue) where aKey = k and
        aValue = makeNormalizedDict(v[0]). """
        outputFunc(k, makeNormalizedDict(v[0]))

    resultDict = mr.mapReduce(chunkDict, mapCountChunk, reduceCountChunk)
    return resultDict.values()

""" Note: If one were really going to use MapReduce, one would probably read
the chunks directly into a dictionary rather than convert the list of chunks
into a dictionary as done here.  The conversion was inserted to maintain
"drop-in" compatibility with the original countTextChunks() function.

One could argue that the normalization could have taken place in the map
function instead of the reduce function.  By the same token, one could argue
that map could be a no-op, simply passing k and v on to its outputFunc, with
all of the counting work done in the reduce function. """

# The following code is just support code to make a fully operational file.
# MapReduce.py and text_analysis.py and text files are required as well.
def readChunks(filename, chunkSize):
    """ Reads filename, assumed to be in the texts subfolder, and breaks the
    parsed text into a list of chunks (lists of words) where each chunk has
    chunkSize elements.  The list of chunks is returned. """
    words = ta.inputText('texts/' + filename)
    return makeChunks(words, chunkSize)

def makeChunks(aList, nElts):
    """ Break a given list into multiple smaller lists, each with the given
    number of elements.  Any leftover elements are ignored.  A list of the
    resultant lists is returned. """
    # The upper bound excludes a final, partial chunk, per the docstring.
    return [aList[i:i+nElts] for i in range(0, len(aList) - nElts + 1, nElts)]

def getTotalValues(aDict):
    """ Returns the sum of all the values in the given dictionary. """
    return sum(aDict.values())

def makeNormalizedDict(aDict):
    """ Return a normalized defaultdict(int) where each value from the given
    dictionary has been divided by the total of all of the values. """
    total = getTotalValues(aDict)
    result = defaultdict(int)
    for k, v in aDict.items():
        result[k] = float(v) / total
    return result

chunks = readChunks("Shakespeare Complete Works.txt", 1000)
countDicts = countTextChunks(chunks, 3)
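The mr.mapReduce() call above comes from the course's MapReduce.py module. To make its semantics concrete, here is a minimal single-process sketch of a mapReduce() function with the signature the solution assumes (map each input key/value pair, group the emitted values by key, then reduce each group); this is our own illustration, not the course's implementation.

```python
from collections import defaultdict

def mapReduce(inputDict, mapFunc, reduceFunc):
    """ Minimal single-process MapReduce sketch (not the course's MapReduce.py).
    mapFunc and reduceFunc each receive (key, value, outputFunc) and emit
    their results by calling outputFunc(newKey, newValue). """
    # Map phase: apply mapFunc to every input pair, grouping emitted values by key.
    grouped = defaultdict(list)
    def mapOutput(k, v):
        grouped[k].append(v)
    for k, v in inputDict.items():
        mapFunc(k, v, mapOutput)

    # Reduce phase: apply reduceFunc to each key and its list of emitted values.
    result = {}
    def reduceOutput(k, v):
        result[k] = v
    for k, vList in grouped.items():
        reduceFunc(k, vList, reduceOutput)
    return result

# Example: word counting across two "documents".
docs = {0: ["to", "be", "or", "not", "to", "be"], 1: ["be", "happy"]}

def mapWords(k, words, out):
    for w in words:
        out(w, 1)          # emit a 1 for every occurrence of each word

def reduceCounts(k, ones, out):
    out(k, sum(ones))      # total occurrences of word k

print(mapReduce(docs, mapWords, reduceCounts))
# e.g. {'to': 2, 'be': 3, 'or': 1, 'not': 1, 'happy': 1}
```

Note that, as in the solution's reduceCountChunk(), the reduce function receives a *list* of the values emitted for its key, which is why the solution indexes v[0] when each key was mapped exactly once.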
Explain the following functions in terms of currying and closures:
def makeAddToFunc(x):
    def aFunc(y):
        return x + y
    return aFunc

def makeMultByFunc(x):
    def aFunc(y):
        return x * y
    return aFunc

def makePowByFunc(x):
    def aFunc(y):
        return pow(y, x)
    return aFunc

def makeFuncCombo(f1, f2):
    def aFunc(x):
        return f1(f2(x))
    return aFunc

f1 = makeFuncCombo(makeMultByFunc(2), makeAddToFunc(5))
f2 = makeFuncCombo(makePowByFunc(2), makeAddToFunc(1))
print f1(3), f2(3)
Solution (no peeking!):
"""All 4 functions curry their input parameters into their returned functions. The returned functions only access the input parameter value that was used to create the returned function because that is the value in the closure of that particualr function. makeAddToFunc(x) returns a function that will return the sum of its input parameter the "x" value used when the returned function was made. makeMultByFunc(x) returns a function that will return the product of its input parameter the "x" value used when the returned function was made. makePowByFunc(x) returns a function that will return the its input parameter raised to the power of the "x" value used when the returned function was made. makeFuncCombo(f1, f2) returns a function that will return the result of the compound function formed by f1 and f2 applied to its input parameter, i.e. f1(f2(y)) """
Explain the behavior of the two returned functions of make2Funcs() in terms of currying and closures:
def make2Funcs(a0):
    a = [a0]  # This has to be a list because of the way that Python creates closures
    def aFunc1(x):
        a[0] += x
        return a[0]
    def aFunc2(x):
        return a[0] * x
    return aFunc1, aFunc2

g1, h1 = make2Funcs(10)
g2, h2 = make2Funcs(-10)
print g1(2), h1(5), g2(2), h2(5)
print g1(2), h1(5), g2(2), h2(5)
print g1(2), h1(5), g2(2), h2(5)
print g1(2), h1(5), g2(2), h2(5)
Solution (no peeking!):
""" make2Funcs() curries its input into the first element of the local list variable, a, which is part of the closures of both the aFunc1() and aFunc2() functions. The input parameter is thus the intial value for a[0], which differs from one call to the next of make2Funcs(). Since aFunc1 and aFunc2 share the same closure, they share the same list variable, a. Thus, when an aFunc1 function, e.g. g1() or g2(), changes the value of the first element of a, then the associated aFunc2 function, e.g. h1() or h2() respectively, will access that same variable and return a value based on it. g1() and h1() are linked together while g2() and h2() are separatedly linked because the closure shared by g1() and h1() is not the same closure as g2() and h2(). Technical note: Python does not allow a curried value (local variable) in a closure to be mutated, so the classic "hack" is to use a variable that refers to a list, not a primitive value such as an integer or float. This way, the variable itself, "a" here, is never mutated because it always refers to the same list. On the other hand, the *contents* of that list are mutated. Java shares this restriction with Python so the same hack works in Java. C# on the other hand, allows mutation of curried variables, so the hack is unnecessary in C#. """