COMP 130 Elements of Algorithms and Computation
Spring 2012

Preparing Text Data for PCA Analysis

This is a continuation of our discussion of Stylometry and PCA.

PCA analysis requires that we present each data point as an N-dimensional vector (list) of numbers representing the "measure" of that data point along each "dimension" being measured. We thus need to prepare our raw data, which comes in the form of text files, by creating these multi-dimensional measures and assembling the results into lists of numbers, i.e. a list of lists, or a matrix.

The computational process described here is typical of the sort of program that a practicing scientist needs to write.   Few scientists write the actual mathematical number-crunching algorithms to process their data.   Instead, they rely on code libraries, either commercial or free (open-source or from government or academic laboratories), that are far more sophisticated than anything they could write themselves.   The scientist's job is thus reduced to translating the raw data from the form in which it is obtained from experiments or other data gathering into the standardized data formats that the generically written libraries expect.   Likewise, the scientist often has to write additional code to translate the results back into a form that makes sense in the context of the research being performed.

We already know how to process a text file into lists of words and count word and word-tuple frequencies; please review that material here and here.

Measures of Texts

Probably the simplest measure of a text is its word frequencies.   Here, each word represents a dimension, and the count of that word in any given text is the measure along that dimension.   One could definitely argue that the measure should in fact be the percentage count of that word relative to the total word count, i.e. the normalized count, to make comparing texts more consistent.   The PCA analysis thus looks for correlations of clusters of words, that is: are there groups of words that are consistently used together in a set of texts?

Another possible measure, which is really just an extension of the simple word frequencies, is the frequencies of n-element word tuples.   Once again the argument for normalization holds.  And generalizing the single word count PCA analysis, we see that the PCA analysis of n-tuple word frequencies would detect correlations of ordered sets (tuples) of words being used together in a data set.  

There are, of course, many, many possible measures of texts that could be created, but we will focus on the above measures for now.    Can you come up with other measures?

Import statements

To write the code described below, you will need to include the following import statements at the top of your Python file:

from collections import defaultdict
import text_analysis as ta   # This is the word and tuple counting text analysis code from lecture.   Your filename may differ. 

A Practical Problem of Identifying Dimensional Axes

For our purposes, there is an unfortunate and complicating mathematical convention in the representation of vectors.     In mathematics, if one writes something like v = (42, 37, 99), it is conventional that this refers to the "named" axes (x, y, z) or some impersonally numbered axes (x1, x2, x3).    Notice that it is the order of the numbers that implicitly corresponds each value to its axis.    But what if these 3 numbers refer to the measures of ('hello', 'bye', 'howdy')?    Remember that we are not dealing with 3 words here, but thousands!   The word frequency counting process puts the counts in a dictionary, which explicitly relates word → count, but a dictionary does not guarantee any order in its list representation.

The PCA analysis functions in Python assume that the data is strictly in a list or matrix of numbers, where the relationship from any value to its axis is implicit in terms of its position in the list (column in the matrix).    Somehow, we must convert from the explicit axis → value relationships defined by a dictionary to the implicit relationship used by mathematical convention and Python's PCA analysis routines.

What we need to do is to create a "standard" mapping of list index values to specific words (axes).    That is, we need to be able to say that the first index of a list always refers to the measure of 'hello', the second index to the measure of 'bye', the third to 'howdy', and so on.  This standard mapping would thus take the form of a tuple (for immutability) of words, where each word's place in the tuple defines the location (index) of the measure of that word in all lists (vectors) of measures.
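For instance (a tiny, made-up illustration; the names axes and measures exist only for this example), with three "axis" words the standard mapping and a matching measure vector might look like:

# A hypothetical standard mapping of list indices to words (axes):
axes = ('hello', 'bye', 'howdy')

# A measure vector for one text, following that mapping:
# index 0 holds the count for 'hello', index 1 for 'bye', index 2 for 'howdy'.
measures = [42, 37, 99]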

Another complicating problem is that different dictionaries of word frequencies are not guaranteed to have the same set of keys; any given text may never use certain specific words.  That is, any given data set of words may not yield a complete set of axes.

To address the above issues, we must analyze the data points (sets of words) in aggregate, that is, we must combine them and look at all the words that are present across all the data points.   Luckily, if defaultdict(int)'s were used to make the word frequency dictionaries, our computational life (sorry, can't help you with the rest of it!) gets much easier because if we ask for a non-existent key from a defaultdict(int), we will get the default value of zero as the corresponding count.   
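As a quick illustration of that defaultdict(int) behavior (a throw-away example, not part of the exercise):

counts = defaultdict(int)
counts['hello'] = 3
counts['hello']   # --> 3
counts['howdy']   # --> 0, even though 'howdy' was never added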

Class exercise: Write a function that takes a list of defaultdict(int)-based dictionaries and combines them into a single dictionary where all the counts for each word have been summed together.

# PSEUDO-CODE
def combine_dicts(dicts):
    """Combines list of defaultdict(int) dictionaries into a single dictionary which has all possible keys and 
    whose corresponding values are the sums of the corresponding values across all the dictionaries.  The single combined 
    dictionary is returned.
    """
    make the result as an empty defaultdict(int)
    Loop over the list of dictionaries:
        Loop over every key-value pair in a dictionary:
            add the value to the corresponding key in the result
    return the result  

combinedDict = combine_dicts(aListOfDicts)

allWordsList = combinedDict.keys()

That is, if combinedDict = combine_dicts(aListOfDicts), then combinedDict.keys() will give us all the words ever used in all the texts from which the dictionaries were derived.
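One possible implementation of the combine_dicts() pseudo-code above (a sketch; your own version may differ):

def combine_dicts(dicts):
    """Combines a list of defaultdict(int) dictionaries into a single defaultdict(int)
    whose values are the sums of the corresponding values across all the dictionaries.
    """
    result = defaultdict(int)
    for aDict in dicts:
        for key, value in aDict.items():
            result[key] += value   # a key not yet in result starts at the default value of 0
    return result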

Using only the most common words

So far so good, but the list of all words ever used by an author is not only huge, and thus a huge dimensionality in our already high-dimensionality measurement space, but it also contains a lot of relatively little-used words, such as character and location names that are specific to certain stories.    We can reduce the dimensionality, and thus the complexity, of our problem considerably by restricting ourselves to only the most commonly used words ("maxWords" below).   This "pruning" of the problem is a very common technique used by scientists to reduce the problem to a more manageable size with a minimal impact on accuracy.

To do this, we must sort the key-value pairs (combinedDict.items()) by the size of the values and select only the keys of the top N values.  The problem is that the aList.sort() technique that we used previously would sort such a list of tuples by the first element of each tuple, and we want to sort by the second element.  There are many ways to accomplish this task -- we will utilize a generally applicable technique here that can be used in a wide variety of situations:

Python supplies a function called "sorted()" that takes 3 parameters:

aSortedList = sorted(anUnsortedList, key=getSortableValueFunctionName, reverse=isDescending)

where anUnsortedList is an unsorted list of key-value pairs, getSortableValueFunctionName is the name of a function of one parameter (a key-value pair) that returns the value to sort by, and isDescending is an optional boolean: True for descending order, False (the default) for ascending.     This makes short work of our sorting problem:

def getValueFromKV(kvPair):
    """ Returns the value from a key-value pair."""
    return kvPair[1]
    
def makeSortedKVs(aDict):
    """From the given dictionary, returns a list of (key, value) pairs where the 
    list is sorted in descending order of the values.
    """
    return sorted(aDict.items(), key=getValueFromKV, reverse=True)	  
		

We can now extract the maxWords most common words used by an author by using our old unzip/zip technique to first transpose the list of key-value pairs and then slice out the first maxWords from just the list of keys:

def getNKeys(kvPairList, n):
    """From the given list of key-value pairs, returns a list of 
    the first n keys.
    """
    return (zip(*kvPairList))[0][:n] 	
    
sortedKVs = makeSortedKVs(combinedDict)

maxWords = 50  # some large, but not too large number. 
sortedKeys = tuple(getNKeys(sortedKVs, maxWords))    # the sorted tuple of the maxWords most common words.  Made into a tuple to ensure immutability.
		

sortedKeys is now the definitive tuple of words that defines which word's measure goes at which index in a list of word counts (measures).

We can now use this definition of our word count vectors to create the matrix of data points needed by Python's PCA analysis functions.

Normalizing the Word Counts

If we simply use the word or word-tuple frequency counts directly, we will run into trouble later, because in order to compare different texts we will have to make sure that we always count the same total number of words.   This means we will always be restricted to analyzing only part of a text if it has more words than our total word count setting, and could be prevented from analyzing texts that don't have enough words in them.    We will also always have to remember the total word count across every analysis we do.

But the analysis never really depends on how many words there are in a text.   The key issue is really about how many times a word/tuple appears as a percentage of the total word count.    Thus we want to "normalize" our data by dividing each and every count by the total word count to get a percentage count (Be careful about integer vs. floating point arithmetic!).    This will make our data independent of the number of words used and we can then compare any texts we wish with no worries at all.

To accomplish this, we need two functions:

# PSEUDO-CODE
def getTotalValues(aDict):
    """Returns the sum of all the values in the given dictionary.
    """
    # Use our old zip trick to transpose aDict.items() and extract the values, then sum them.   Only one line of code needed!

def makeNormalizedDict(aDict):
    """Returns a normalized defaultdict(int) where every value from the given dictionary has been 
    divided by the total of all of the dictionary's values.
    """
    # get the total of the dictionary's values
    # initialize a resultant defaultdict(int)
    # loop through all the items in aDict, making a corresponding entry in the result that holds the normalized count value for that key
    # return the result dictionary
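Possible implementations of both functions (a sketch; note the float() call, which avoids integer division, and the zip trick, which relies on zip() returning a list as it does in Python 2):

def getTotalValues(aDict):
    """Returns the sum of all the values in the given dictionary."""
    # zip(*aDict.items()) transposes the (key, value) pairs; element [1] is the tuple of values.
    return sum(zip(*aDict.items())[1])

def makeNormalizedDict(aDict):
    """Returns a normalized defaultdict(int) where every value from the given
    dictionary has been divided by the total of all the values."""
    total = float(getTotalValues(aDict))   # float() prevents integer division
    result = defaultdict(int)
    for key, value in aDict.items():
        result[key] = value / total
    return result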
    

 

Making Many Data Points from a Single Data File

For efficiency's sake, a number of larger text collections are organized into single "Complete Works Of" files.   Since these files are so large and represent multiple works by a single author, we need to break each file into multiple "chunks" that we can individually process to get multiple data points from a single data file.

The problem of "chunkifying" a data file is similar to the sliding average and word tuples problems we encountered before except that the resultant tuples or lists are not overlapping.   This means that we have to step through the list with a step size that is the size of the desired chunk of data.    To do so, we can utilize that fact that Python's built-in range() function can take parameters to specify start, end and step size values.   (Note: if you want to specify a step size with range(), you must also specify a start value.)

def makeChunks(aList, nElts):
    """ Break a given list into multiple smaller lists each of the given number of elements.
    Any leftover elements are ignored.
    A list of the resultant lists is returned.
    """
    # Write this as a single line of code using list comprehensions and range() with start, stop and step input parameters.  
    
chunkSize = 5000  # Use a large enough number to get decent word count statistics    
wordLists = makeChunks(wordList, chunkSize)
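Following the comment in the pseudo-code, one possible single-line implementation of makeChunks() (a sketch):

def makeChunks(aList, nElts):
    """Breaks the given list into consecutive, non-overlapping sublists of nElts
    elements each; any leftover elements at the end are ignored."""
    # The stop value len(aList) - nElts + 1 ensures that only full-sized chunks are produced.
    return [aList[i:i + nElts] for i in range(0, len(aList) - nElts + 1, nElts)]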

Each chunk of text can now be processed as usual to calculate word or tuples of word frequencies.   Be sure to produce the normalized word counts!  

Class Exercise:  Write a function to count the word frequencies in all the chunks of text in a list of text chunks.

# PSEUDO-CODE
def countTextChunks(chunks):
    """ Count the word occurences in each text chunk in chunks, a list of lists of words.
    Returns a list of defaultDict(int) dictionaries that relates word->count for each text chunk.
    """
    Initialize the result list
    Loop over the text chunks
        Append the normalized dictionary of word counts for the text chunk to result list
    return the result list
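A possible implementation (a sketch). The name ta.count_words is only a stand-in for whatever your text-analysis module's word-counting function is actually called; it is assumed to take a list of words and return a defaultdict(int) of word -> count:

def countTextChunks(chunks):
    """Counts the word occurrences in each text chunk in chunks (a list of lists
    of words) and returns a list of normalized defaultdict(int) dictionaries,
    one per chunk."""
    result = []
    for chunk in chunks:
        # ta.count_words is a stand-in name -- substitute your own counting function.
        result.append(makeNormalizedDict(ta.count_words(chunk)))
    return result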

Can you think of ways to improve your countTextChunks() function?

Making Vectors (Lists) and Matrices of the Frequencies of the Most Common Words (or Tuples of Words)

At this point we have developed the technology/code to count the occurrences of every word or tuple of words in a long text that was broken into smaller chunks, and to calculate an ordered list of the most common words/tuples, creating a standard vector definition of words/tuples.  Now we want to combine these two capabilities to create vectors (lists) of word/tuple counts whose values correspond to the most common values in our defining standard word vector, sortedKeys above.   That is, we need to convert our dictionaries of word counts, which relate each word to its count but in no particular order, into simple lists of counts where the associated word is defined by the index of the count in the list, which is in turn defined by our standard word vector, sortedKeys.

Problem solving tip:   Sometimes you need to approach a problem backwards.

The way to think about a problem like this is to start from the last thing that was said above:

  1. sortedKeys defines the word count order in a data point vector.
  2. Therefore, use sortedKeys to extract the word count values from the dictionary of word counts and place those values in order in a resultant list.

Class Exercise: Write a function called getTopValues that takes a defaultdict(int) dictionary of word/tuple frequencies and a list of keys, and returns a list of the values corresponding to those keys, in the same order as the keys.

def getTopValues(aDict, keyList):
    """ Given a defaultdict dictionary, and a list of keys,
    returns a list of values associated with those keys, in the order of those keys.
    """
    # Only one line of code is required if you use a list comprehension!
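The list comprehension hinted at above might look like this (a sketch):

def getTopValues(aDict, keyList):
    """Given a defaultdict dictionary and a list of keys, returns a list of the
    values associated with those keys, in the order of those keys."""
    # Keys missing from aDict simply yield the defaultdict's default value.
    return [aDict[key] for key in keyList]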
	

Now generalize the above process to create a matrix of values from a list of word frequency dictionaries.    Don't duplicate code!!

Class Exercise: Write a function called getTopValMatrix that takes a list of defaultdict(int) dictionaries of word/tuple frequencies and a list of keys and returns a list of lists of values (a matrix) whose column values correspond to the keys in the same positions in the given list of keys.

def getTopValMatrix(aDicts, keyList):
    """Given a list of defaultdict dictionaries and a list of keys, return a
    matrix of values where each row is a from a single dictionary and each column
    is the value from that matrix corresponding to the matching key in the 
    list of keys."""
    # Only one line of code is required if you use a list comprehension!
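Reusing getTopValues() keeps this to a single line as well (a sketch; no duplicated code):

def getTopValMatrix(aDicts, keyList):
    """Given a list of defaultdict dictionaries and a list of keys, returns a
    matrix (list of lists) where each row comes from one dictionary and each
    column corresponds to the key in the same position in keyList."""
    return [getTopValues(aDict, keyList) for aDict in aDicts]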
	

Question: Why aren't the functions here defined in terms of the sorted list of keys (words/tuples), sortedKeys, that we will actually be using as the input parameter value for these functions?

Putting It All Together

So far we have a lot of small- to medium-sized code pieces to do all of the nitty-gritty processing of text files into a matrix of values that Python's PCA analysis routines want to see.   All we have to do now is assemble these pieces into larger pieces to automate the whole process from text file(s) to matrix of data points.

Let's assume we are working on one large "Complete Works of..." type file.  We will need the following steps to process the file:

  1. Read and parse the file into a list of words.
  2. Chunkify the list of words into a list of lists of words.
  3. Create a list of dictionaries that are the word/tuple counts for each chunk.
  4. Calculate the defining ordered set of word/tuple keys of the most common words/tuples.
  5. Create a matrix of data points (ordered list of count values) from the word frequency dictionaries and the defining ordered key list.

Class Exercise: Write a function called makePCADataFromFile to perform the above steps. It should take a filename and other relevant parameters, such as the size of the chunks, the number of top count values to use, how many words in a tuple, etc., and return the matrix of data points, the normalized word frequency dictionaries and the defining key list.
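A sketch of how the pieces might fit together for plain word counts (tuple-of-words handling is omitted for brevity). The name ta.parse_file is only a stand-in for whatever your text-analysis module's file-reading/parsing function is actually called; everything else is a function developed above:

def makePCADataFromFile(filename, chunkSize, maxWords):
    """Processes a single text file into (matrix, normalizedDicts, sortedKeys):
    the matrix of data points, the list of normalized word frequency dictionaries
    (one per chunk), and the defining ordered key tuple."""
    wordList = ta.parse_file(filename)                 # 1. read and parse the file (stand-in name)
    chunks = makeChunks(wordList, chunkSize)           # 2. chunkify the word list
    normalizedDicts = countTextChunks(chunks)          # 3. normalized word counts per chunk
    combinedDict = combine_dicts(normalizedDicts)      # 4. defining ordered key tuple...
    sortedKeys = tuple(getNKeys(makeSortedKVs(combinedDict), maxWords))
    matrix = getTopValMatrix(normalizedDicts, sortedKeys)   # 5. matrix of data points
    return matrix, normalizedDicts, sortedKeys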

Class Exercise:   We will also need another function, called getNormalizedTopValsFromFile, later on; it takes a text file and the defining ordered key list and returns a single normalized data point vector.   This will be used to prepare "unknown" text data to be tested against the matrix of known data points.
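A corresponding sketch for the "unknown" text case, with the same caveat about the stand-in ta.parse_file and ta.count_words names:

def getNormalizedTopValsFromFile(filename, sortedKeys):
    """Reads a text file and returns a single normalized data point vector whose
    values line up with the given defining ordered key list."""
    wordList = ta.parse_file(filename)                               # stand-in parser name
    normalizedDict = makeNormalizedDict(ta.count_words(wordList))    # stand-in counter name
    return getTopValues(normalizedDict, sortedKeys)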

 

We are now ready to use Python's built-in PCA analysis capabilities and to process the results.