COMP 130 Elements of Algorithms and Computation
Spring 2012

Google PageRank and Quantum Mechanics?

In the old days, there was Yahoo, Lycos, Excite, Ask Jeeves ("Jeeves" was mercifully retired in 2006) and my personal favorite, AltaVista (now part of Yahoo).    But in 1998, two Stanford graduate students, Larry Page and Sergey Brin, developed a novel view of the World Wide Web as a directed graph of weighted nodes where the weight of each node, i.e. each web page, depended on the weights of all the web pages that linked to it.   Larry Page, in honor of himself, dubbed the notion "PageRank".    Together, the two enterprising students realized that this page ranking technique could be used to order search results from the web.   Naturally, they punted their Ph.D. programs at Stanford, set up some servers in a garage in Menlo Park, CA, dubbed themselves "Google" and the rest, as they say, is history. 

The basic premise of the Google search engine is that the web page that someone wants to see is the one that other people feel is important.   How do people indicate their belief in the importance of a web page?  They link to the web pages they believe are important when they create their own web pages.    The more links a web page has to it, the more important it must be.   In particular, the more links from important web pages, the better.   This idea revolutionized web searching, and more than a decade after Google first started popping up in browsers, some still say that nothing beats it.

The key thing here is less about learning the PageRank algorithm itself than about how inter-related entities can be modeled so that information can be gleaned from them.   The original genesis of the PageRank algorithm was not search engine ranking but rather a technique for deriving how many pages reference a given page, for purposes of understanding the level of citations in scientific papers.

The Model -- A Graph of Weighted Nodes and Edges

In a nutshell, here are the basics:

  1. Each web page (node) has a rank, R, and N out-links (outward-directed edges).
  2. Each out-link carries a weight of R/N.   Self-links are ignored.
  3. The rank, R, is the sum of the weights of all the in-links (inward-directed edges).

Consider the following miniature web that satisfies the above conditions:

[Figure: the miniature web of linked pages A through L, showing each page's rank and the weights on its out-links]

(Edge weights left as unsimplified fractions to emphasize them as a fraction of the source page's rank)

Notice how the values for the ranks and the edge weights are all self-consistent across the graph.   The ranks of pages J, K and L, which have no in-links, are simply given some value, on the notion that a random web surfer always has some probability of landing on any given page.   Here, their value is arbitrarily set to 2.
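
To see the self-consistency concretely, here is a minimal Python sketch (an illustration, not part of the original notes).  The out-link lists below are read off the matrix M given in the next section, and the ranks are the values of the vector V given there.

    # Out-links of each page in the miniature web (read off the matrix M below).
    out_links = {
        "A": ["B", "D", "E"], "B": ["A", "C", "E"], "C": ["B"],
        "D": ["C", "F", "I"], "E": ["B", "G"],      "F": ["A"],
        "G": ["H", "I"],      "H": [],              "I": ["E"],
        "J": ["E"],           "K": ["C"],           "L": ["E", "H"],
    }
    # Ranks of each page (the vector V in the next section).
    rank = {"A": 9, "B": 24, "C": 11, "D": 3, "E": 20, "F": 1,
            "G": 10, "H": 6, "I": 6, "J": 2, "K": 2, "L": 2}

    # Rule 2: each page sends rank/N along each of its N out-links.
    # Rule 3: a page's rank is the sum of the weights arriving on its in-links.
    for page in rank:
        incoming = sum(rank[src] / len(dests)
                       for src, dests in out_links.items() if page in dests)
        if incoming > 0:                   # J, K and L have no in-links;
            assert incoming == rank[page]  # their rank of 2 is simply given.
    print("All ranks are self-consistent.")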

The Analysis -- The Graph as a Matrix Computation

The problem is that the Internet is very large, so how do we calculate the rank of each web page?    At first glance, this seems all but impossible.    But let's look at it as a matrix multiplication problem:

Suppose we put all the page ranks in a vector (list):

V = (A, B, C, D, E, F, G, H, I, J, K, L) = (9, 24, 11, 3, 20, 1, 10, 6, 6, 2, 2, 2)

Let's arrange the edges as a matrix whose rows and columns correspond to the pages; each row collects the contributions from that page's in-links. The value in each cell is the fraction of the column page's rank that contributes to the row page's rank.

M =
      A    B    C    D    E    F    G    H    I    J    K    L
  A   0   1/3   0    0    0    1    0    0    0    0    0    0
  B  1/3   0    1    0   1/2   0    0    0    0    0    0    0
  C   0   1/3   0   1/3   0    0    0    0    0    0    1    0
  D  1/3   0    0    0    0    0    0    0    0    0    0    0
  E  1/3  1/3   0    0    0    0    0    0    1    1    0   1/2
  F   0    0    0   1/3   0    0    0    0    0    0    0    0
  G   0    0    0    0   1/2   0    0    0    0    0    0    0
  H   0    0    0    0    0    0   1/2   0    0    0    0   1/2
  I   0    0    0   1/3   0    0   1/2   0    0    0    0    0
  J   0    0    0    0    0    0    0    0    0    0    0    0
  K   0    0    0    0    0    0    0    0    0    0    0    0
  L   0    0    0    0    0    0    0    0    0    0    0    0

(Note that the letters along the top and side are strictly for reference and are not part of the matrix.)
The total relationship can thus be written as a matrix multiplication:

M*V = V

Notice how the vector of page rank values, V, appears on both sides of the equation.    This means that the matrix, M, has no effect on V.    We call this a stationary state because V is a vector on which M has no effect, i.e. V is stationary, i.e. unchanging, under the influence of M.
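
We can verify this stationary property numerically on the miniature web above.  The sketch below (an illustration, not part of the original notes; numpy is assumed to be available) rebuilds M from the out-link lists and checks that applying M to V gives back V for every page that has in-links (J, K and L keep their assigned rank of 2).

    import numpy as np

    pages = list("ABCDEFGHIJKL")
    V = np.array([9, 24, 11, 3, 20, 1, 10, 6, 6, 2, 2, 2], dtype=float)

    # Out-links read off the matrix M above.
    out_links = {"A": "BDE", "B": "ACE", "C": "B", "D": "CFI", "E": "BG",
                 "F": "A", "G": "HI", "H": "", "I": "E", "J": "E",
                 "K": "C", "L": "EH"}

    # M[row, col] = fraction of the column page's rank sent to the row page.
    M = np.zeros((12, 12))
    for src, dests in out_links.items():
        for dst in dests:
            M[pages.index(dst), pages.index(src)] = 1 / len(dests)

    MV = M @ V
    for i, p in enumerate(pages):
        if p in "JKL":                  # no in-links; rank fixed at 2 by fiat
            continue
        assert np.isclose(MV[i], V[i]), (p, MV[i], V[i])
    print("M*V =", MV)                  # reproduces V for pages A through I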

If we consider the right side of the equation as one (1) times V, we see that the equation is what is called an eigenvalue equation (from the German eigen, meaning "own" or "characteristic").   V here is thus called an eigenvector of M with an eigenvalue of 1.

M*V = 1*V

The great thing about this matrix equation is that M is entirely determined by the web pages themselves, specifically the out-links on each web page.    The problem thus reduces to finding the eigenvector of the matrix whose eigenvalue is 1.
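
On a small web, this eigenvector can be found directly with a linear algebra library.  The sketch below is illustrative only: the 3-page web and its matrix are made up for the example, and numpy is assumed to be available.

    import numpy as np

    # Tiny hypothetical 3-page web: page 1 -> 2, page 2 -> 1 and 3, page 3 -> 1.
    # Column j holds 1/N_j for each of page j's N_j out-links.
    M_small = np.array([[0.0, 0.5, 1.0],
                        [1.0, 0.0, 0.0],
                        [0.0, 0.5, 0.0]])

    eigenvalues, eigenvectors = np.linalg.eig(M_small)
    k = np.argmin(np.abs(eigenvalues - 1))   # pick the eigenvalue closest to 1
    v = np.real(eigenvectors[:, k])
    print(v / v.sum())                       # normalized ranks: [0.4 0.4 0.2]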

One thing to notice about the matrix, M, is that it is mostly zeros.   We call this a "sparse matrix".    There are specialized techniques for working with sparse matrices, which you will most likely encounter in Comp182.   One such technique is called the Power Iteration Method.    The advantage of this technique is that one can start with an arbitrary guess for the eigenvector and the algorithm is guaranteed (under conditions touched on in the caveat below) to converge to the proper eigenvector solution.   Even so, considering that Google indexes about 50 billion web pages (i.e. M is a 50,000,000,000 x 50,000,000,000 element matrix), it takes many days to complete the job, so it's only done about once a month.   As far back as 2002, when the Internet was much smaller, Google's PageRank calculation was already being hailed as the world's largest eigenvector calculation.
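
To make the method concrete, here is a minimal power-iteration sketch in Python (an illustration only, not Google's actual implementation; it assumes a column-stochastic M, i.e. every page has at least one out-link, and real PageRank adds the damping factor and other refinements noted in the caveat below).

    import numpy as np

    def power_iteration(M, num_iters=100, tol=1e-10):
        # Repeatedly apply M to an arbitrary starting guess until it stops changing.
        n = M.shape[0]
        v = np.ones(n) / n                      # arbitrary initial guess
        for _ in range(num_iters):
            v_next = M @ v
            v_next = v_next / v_next.sum()      # keep the ranks normalized
            if np.abs(v_next - v).max() < tol:  # converged to the stationary vector
                return v_next
            v = v_next
        return v

    # The same tiny hypothetical 3-page web as in the previous sketch.
    M_small = np.array([[0.0, 0.5, 1.0],
                        [1.0, 0.0, 0.0],
                        [0.0, 0.5, 0.0]])
    print(power_iteration(M_small))             # converges to [0.4 0.4 0.2]

Each pass costs only time proportional to the number of links (the non-zero entries of the sparse matrix), which is what makes the method practical at web scale.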

Interesting note:   Consider the out-link weights as the relative probabilities of a web surfer randomly clicking on one of the links and thus ending up at another page. From Markov chain theory, it can be shown that a page's rank is thus proportional to the probability of landing on that page after a large number of clicks.
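
As a quick, made-up illustration of this claim, the sketch below simulates a random surfer on the same tiny 3-page web used in the sketches above and counts where the surfer lands; the landing frequencies approach the ranks found there.

    import random

    links = {1: [2], 2: [1, 3], 3: [1]}    # tiny hypothetical 3-page web
    visits = {1: 0, 2: 0, 3: 0}
    page = 1
    for _ in range(100_000):
        page = random.choice(links[page])  # click one out-link at random
        visits[page] += 1

    total = sum(visits.values())
    print({p: round(v / total, 3) for p, v in visits.items()})
    # Landing frequencies approach the eigenvalue-1 ranks (0.4, 0.4, 0.2).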

Mathematics caveat:   The above discussion leaves out a lot of technical details that are beyond the scope of this course, such as the reduction of the out-bound probabilities due to the non-zero probability that a random surfer will stop clicking (the "damping factor"), logarithmically-scaled weighting, convergence and existence-of-solution issues, anti-spoofing techniques, newer improvements, etc.

For a far more complete treatment of Google's page ranking algorithm, including the power iteration technique, please see Prof. Subramanian's Comp140 notes (separate versions for Comp200 students and Comp130 students).

Quantum Mechanics?

So how does this relate to quantum mechanics?   The basic formulation of quantum mechanics is the eigenvalue equation, arising from its roots in wave theory.   Quantum mechanics' famous time-independent Schroedinger wave equation is:

 H*ψ  = E*ψ

where ψ is the "wave function" that describes the physical system, E is the quantized energy level, and H is the "Hamiltonian", which is derived from an energy analysis of the system.    The fundamental problem to solve in quantum mechanics is this eigenvalue problem for the wave equation.

Eigenvalues and eigenvectors correspond to the resonant frequencies and standing waves of classical wave theory; translated into quantum mechanics, the eigenvalues and eigenvectors become discrete, which leads to phenomena such as discrete energy levels and electron orbitals in atoms.
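
To see the parallel concretely, here is a purely illustrative sketch (a toy example, not from these notes): discretize the Hamiltonian of a particle trapped in a one-dimensional box on a grid of points, so that H becomes an ordinary matrix, and solve H*ψ = E*ψ with the same eigenvector machinery used for PageRank.  The grid size, box length and units below are arbitrary choices for the example.

    import numpy as np

    n, L = 200, 1.0                # grid points and box length (arbitrary units)
    dx = L / (n + 1)
    # Finite-difference Hamiltonian H = -d^2/dx^2 with psi = 0 at the box walls
    # (hbar^2/2m set to 1 for simplicity).
    H = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / dx**2
    energies, wavefunctions = np.linalg.eigh(H)    # solve H*psi = E*psi
    print(energies[:3])    # discrete levels, close to (k*pi/L)**2 for k = 1, 2, 3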

Who knew that web page rankings and quantum mechanics were the same problem?