COMP 200 & COMP 130

matplotlib.mlab.PCA Documentation

The official Python documentation can be found here, but it is unfortunately so sparse as to be nearly unusable.

To use PCA, be sure to include the following import statement:

from matplotlib.mlab import PCA

matplotlib.mlab.PCA is the name of a class in Python. Classes are like recipes used to create objects, which can be thought of as intelligent entities that hold data and have built-in functions, called "methods", that perform various processes on the data the object holds.
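
To make the class/object vocabulary concrete, here is a minimal sketch of a toy class. PointCloud and mean_of_column are hypothetical names invented for this illustration; they are not part of matplotlib.

```python
class PointCloud:
    """Hypothetical toy class: holds data points and can summarize them."""

    def __init__(self, points):
        # The constructor stores data on the new object as attributes.
        self.points = points
        self.numrows = len(points)

    def mean_of_column(self, i):
        # A method: a built-in function that works on the object's own data.
        return sum(p[i] for p in self.points) / self.numrows


cloud = PointCloud([[1.0, 10.0], [3.0, 30.0]])  # call the constructor
print(cloud.numrows)             # read an attribute
print(cloud.mean_of_column(1))   # call a method
```

A PCA object is used the same way: the constructor builds it from data, attributes such as numrows hold values, and methods such as project(x) compute with them.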

PCA calculates principal component (PC) axes such that the origin of the PC axes is at the mean of the distribution along each axis; that is, the origin is at the statistical "center" of the cluster. The data are also scaled before the PC axes are computed, so that distances along each measurement axis are in units of the standard deviation of the data point distribution along that axis. This has the advantage of creating a normalized notion of distance, but it also masks whether or not there is even a significant distribution along that axis.
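
The masking effect can be seen in a small sketch that uses only the Python standard library. The helper standardize_column and the sample values are assumptions made for illustration: one column spreads over hundreds of units and the other over thousandths, yet after scaling they look identical.

```python
from statistics import mean, pstdev


def standardize_column(values):
    # Shift to the mean, then divide by the standard deviation.
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


wide = [100.0, 200.0, 300.0]     # large spread
narrow = [0.001, 0.002, 0.003]   # tiny spread

wide_std = standardize_column(wide)
narrow_std = standardize_column(narrow)

# Both columns now have a standard deviation of (approximately) 1,
# so the original difference in spread is no longer visible.
print(wide_std)
print(narrow_std)
```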

Quick Links:

Constructor: PCA(dataMatrix)
Attributes: numrows, numcols
Method: project(x)
Method: center(x)
Attribute: mu
Attribute: sigma
Attribute: Y
Attribute: a
Attribute: fracs
Attribute: Wt

Constructor: PCA(dataMatrix) -- construct a new PCA object from a matrix of data points.

A constructor is very much like a function that returns a new object of a particular type, here a PCA class object.

import numpy as np

dataMatrix = np.array(aListOfLists)   # Convert a list-of-lists into a numpy array.  aListOfLists is the data points in a regular list-of-lists type matrix.
myPCA = PCA(dataMatrix)   # make a new PCA object from a numpy array object

Attributes: numrows, numcols -- the number of data points and the dimension of the measurement space.

numrows is the number of rows in the original data matrix, which corresponds to the number of data points, since each row is a data point.

numcols is the number of columns in the original data matrix, which corresponds to the number of axes in the measurement space, i.e. its dimensionality.

myPCA = PCA(dataMatrix)   
myPCA.numrows == len(dataMatrix)   #True. The number of rows in the data matrix 
myPCA.numcols == len(dataMatrix[i])   #True for all valid index i.  The number of columns in any row of the data matrix

Method: project(x) -- the projection of a vector x onto the principal component axes

The input parameter x is a vector (list) of measurements in terms of the original measurement axes, e.g. words for text analysis. Projecting a vector into a different coordinate system expresses the same point the vector represents in terms of distances along the coordinate axes of that new system. project(x) returns a vector representing the same point in terms of the PC axes. It is important to remember that the origin of the PC axes is NOT the same as the origin of the original measurement axes (e.g. zero word counts). Instead, the origin is at the mean of the data points along each axis, i.e. at the "center" of the cluster. Also, the scale of the PC axes is not the same as that of the measurement axes. Because the data are scaled by the standard deviation along each measurement axis before being rotated, distances along the PC axes are measured in those standardized units rather than in the original measurement units. Thus project(x) results in a normalized position with respect to the center of the cluster.

myPCA = PCA(dataMatrix)   
pcDataPoint = myPCA.project(aDataPoint)   # pcDataPoint is the same point as aDataPoint, but in terms of the PC axes.
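
The following is a minimal sketch of what project(x) computes, written in plain Python so that it runs without matplotlib.mlab.PCA (which was removed in Matplotlib 3.1). The values of mu, sigma and Wt are hypothetical stand-ins for the attributes of a real PCA object, and the free functions center and project are illustrative, not the library's code.

```python
import math


def center(x, mu, sigma):
    # Move the origin to the cluster center and scale by the standard deviations.
    return [(xi - m) / s for xi, m, s in zip(x, mu, sigma)]


def project(x, mu, sigma, Wt):
    centered = center(x, mu, sigma)
    # Each PC coordinate is the dot product of the centered point with one row of Wt.
    return [sum(w * c for w, c in zip(row, centered)) for row in Wt]


# Hypothetical values standing in for myPCA.mu, myPCA.sigma and myPCA.Wt.
mu = [10.0, 50.0]
sigma = [2.0, 5.0]
r = math.sqrt(0.5)
Wt = [[r, r],
      [r, -r]]

at_center = project(mu, mu, sigma, Wt)          # the cluster center maps to the PC origin
off_center = project([12.0, 55.0], mu, sigma, Wt)  # one sigma out along both measurement axes
print(at_center)
print(off_center)
```

The second point lies one standard deviation from the center along both measurement axes, so in this example it falls entirely along the first PC axis.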

Method: center(x) -- the translation and scaling of a vector x to the center of the cluster and scaled as per the standard deviations along the measurement axes

The center() method translates the given vector (list), x, by the mean of all the measurements, mu, and then scales the result by the standard deviation of the data along each measurement axis, sigma. The translation puts the new origin at the center of the cluster.

myPCA = PCA(dataMatrix) 
myPCA.center(x) == (x - myPCA.mu)/myPCA.sigma  # True, note that subtraction and division are on an element by element basis
myPCA.center(myPCA.mu+myPCA.sigma) == [1, 1, 1, ...]   #True. one standard deviation away in all measurement directions.
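
As a rough check of the two identities above, the sketch below computes mu and sigma from a small made-up data matrix using only the standard library. The data values and the free function center are assumptions for illustration; a real PCA object supplies mu, sigma and center() itself.

```python
from statistics import mean, pstdev

# Made-up data matrix: each row is a data point, each column a measurement axis.
data = [[2.0, 10.0],
        [4.0, 30.0],
        [6.0, 50.0]]

columns = list(zip(*data))
mu = [mean(c) for c in columns]        # per-axis mean
sigma = [pstdev(c) for c in columns]   # per-axis standard deviation


def center(x):
    # Element-by-element (x - mu) / sigma.
    return [(xi - m) / s for xi, m, s in zip(x, mu, sigma)]


one_sigma_out = [m + s for m, s in zip(mu, sigma)]
print(center(mu))             # all zeros: mu is the center of the cluster
print(center(one_sigma_out))  # approximately all ones
```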

Attribute: mu -- the vector that points to the origin of the PC axes in terms of the measurement axes.

mu is a vector (list) in the original measurement coordinates that points to the location of the mean values along all measurement axes, which is the origin of the PC axes. The value of any element of mu is the average of the corresponding element in every data point in the measurement coordinates.

import numpy as np

myPCA = PCA(dataMatrix)   
myPCA.mu[i] == np.average([v[i] for v in dataMatrix])  # True for any valid index i
myPCA.project(myPCA.mu) == [0, 0, 0,...]   # True always because mu is the vector pointing at origin of the PC axes. 
myPCA.center(myPCA.mu) == [0, 0, 0,...]   # True always because mu is the vector pointing at the center of the cluster.

Attribute: sigma -- the vector that points to 1 standard deviation from the mean along the measurement axes

sigma is a vector (list) in the original measurement coordinates that, from the mean position (mu), points to the location 1 standard deviation along every measurement axis, in terms of the distribution of values along each axis. The value of any element of sigma is the standard deviation of the corresponding element in every data point in the measurement coordinates.

import numpy as np

myPCA = PCA(dataMatrix)   
myPCA.sigma[i] == np.std([v[i] for v in dataMatrix])  # True for any valid index i
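
Since sigma is described as matching np.std, and np.std by default divides by the number of points N (the "population" standard deviation), it is worth seeing how that differs from the "sample" standard deviation that divides by N - 1. The sketch below uses the standard library's statistics module with a made-up column of values.

```python
from statistics import pstdev, stdev

column = [2.0, 4.0, 6.0]   # made-up values along one measurement axis

population_sigma = pstdev(column)  # divides by N, like np.std with its default settings
sample_sigma = stdev(column)       # divides by N - 1

print(population_sigma)
print(sample_sigma)
```

For small data sets the two differ noticeably, so values computed by hand with the N - 1 formula will not reproduce sigma.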

Attribute: Y -- the original data matrix in terms of the principal component axes

The attribute, Y, of a PCA object is the original data matrix in terms of the principal component basis vectors. That is, Y is the matrix of data points you would get if you were to express the original data points in terms of the principal component axes instead of the original measurement components (e.g. words for text analysis). The i'th row of Y is equivalent to the projection of the i'th row of the original data matrix onto the principal component coordinates:

myPCA = PCA(dataMatrix)   
myPCA.Y[i] == myPCA.project(dataMatrix[i])   #True for all valid index i
myPCA.Y[i] != myPCA.a[i]	#True in general for all valid index i

Attribute: a -- the original data matrix centered on the cluster and scaled by the standard deviation in terms of the measurement axes.

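The attribute, a, of a PCA object is the original data matrix after centering on the cluster and scaling by the standard deviation, still expressed in terms of the measurement axes. The i'th row of a is equivalent to applying center() to the i'th row of the original data matrix: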

myPCA = PCA(dataMatrix)   
myPCA.a[i] == myPCA.center(dataMatrix[i])   #True for all valid index i
myPCA.a[i] == (dataMatrix[i] - myPCA.mu)/myPCA.sigma  # True, note that subtraction and division are on an element by element basis
[np.std(v) for v in np.transpose(myPCA.a)] == [1, 1, 1, ...]  # True, the standard deviation of the "centered" data points is 1 along any measurement axis.
myPCA.a[i] != myPCA.Y[i]	#True in general for all valid index i

Attribute: fracs -- the proportion of variance of each of the principal component axes.

fracs is a vector that measures what fraction of the total variance lies along each of the PC axes. This tells us how much the cluster is aligned along each PC axis. Since the PCA analysis orders the PC axes by descending importance in terms of describing the clustering, fracs is a list of monotonically decreasing values. The fracs vector can be used to give weighting to each axis in terms of its importance in the clustering. Essentially, one wants to ignore the PC axes with low corresponding values in the fracs vector.
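
A minimal sketch of the idea, assuming the variance along each PC axis is already known: each entry of fracs is that axis's share of the total variance, and a cutoff selects the axes worth keeping. The variances, the helper fracs_from_variances and the cutoff min_frac are all hypothetical values chosen for illustration.

```python
def fracs_from_variances(variances):
    # Each axis's share of the total variance; the shares sum to 1.
    total = sum(variances)
    return [v / total for v in variances]


pc_variances = [2.7, 0.25, 0.05]   # hypothetical, already in descending order
fracs = fracs_from_variances(pc_variances)

min_frac = 0.05                    # hypothetical cutoff for "important" axes
kept_axes = [i for i, f in enumerate(fracs) if f >= min_frac]

print(fracs)
print(kept_axes)   # indices of the PC axes that pass the cutoff
```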

Attribute: Wt -- the PC axes in terms of the measurement axes scaled by the standard deviations

Wt is a matrix whose rows are the PC axes in terms of the measurement axes, where the distances along each measurement axis have been scaled by the standard deviation of the data along that axis. In linear algebra terms, Wt is the rotation matrix around the center of the cluster that will transform the measurement axis-expressed vectors whose origins are at the center of the cluster into the PC axis-expressed vectors whose origins are still at the center of the cluster. For instance, Wt can transform the data point vectors in a into the data point vectors in Y.

import numpy as np

myPCA = PCA(dataMatrix)   
myPCA.project(myPCA.Wt[i]*myPCA.sigma + myPCA.mu) == [0,0,...,1,...0,0] #True for all valid i, where the "1" is in the i'th position in the result and the rest of the elements are zero.  
[np.dot(myPCA.a[i], w) for w in myPCA.Wt] == myPCA.Y[i]    # True for all valid index i.  This is just the rotation of a[i] into Y[i]
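
The rotation described above can be sketched in plain Python. The matrix Wt and the rows of a below are hypothetical stand-ins, not values from a real PCA object; the point is only to show what "rotation" implies: the rows of Wt are unit-length and mutually perpendicular, distances from the center are unchanged, and each PC axis maps to a unit vector.

```python
import math

# Hypothetical 2-D stand-in for myPCA.Wt: each row is one PC axis.
r = math.sqrt(0.5)
Wt = [[r, r],
      [r, -r]]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def rotate(a_row, Wt):
    # One PC coordinate per row of Wt.
    return [dot(a_row, w) for w in Wt]


# Hypothetical rows standing in for myPCA.a (centered, scaled data points).
a = [[1.0, 1.0], [-1.0, -0.5], [0.0, -0.5]]
Y = [rotate(row, Wt) for row in a]

gram = [[dot(u, v) for v in Wt] for u in Wt]   # approximately the identity matrix
lengths_a = [math.hypot(*row) for row in a]    # distance from the cluster center
lengths_Y = [math.hypot(*row) for row in Y]    # unchanged by the rotation
first_axis = rotate(Wt[0], Wt)                 # approximately [1, 0]

print(Y)
print(gram)
```

Note that the sign of each PC axis is arbitrary, so a real Wt may have rows that point in the opposite direction from what one expects.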