Rice University - Comp 212 - Intermediate Programming
Spring 2006
Lecture 31 - Hash Functions
Hash Functions
  - Recall from Lecture 28 that a hash function, h, maps the
set U of keys into the
slots of a hash table T[0..m-1], i.e., h:U
-> {0, 1, ..., m - 1}.
    - There are many functions that satisfy this requirement. 
Choosing a good function is critical to the performance of a hash table.
      - A good hash function distributes the keys over the range as
uniformly as possible, minimizing collisions.
- A bad hash function is exemplified by h(k)=0.
- Typically, a hash function is implemented in two steps:
    - Map the key to a natural number.  Often, the domain of
this mapping is larger than the range, for example, a function mapping
arbitrary length strings to a subset of the natural numbers, such as
[0, 2^31-1].
      - The Java Object class's method hashCode() is such a
mapping.  The key being the underlying object.
        - Any class can override the Object class's implementation of
this method with a more appropriate implementation.  For example,
we may want our class's hashCode() to use a single field, such as a
string, as the key.  (Thus, two objects that otherwise differ, but
contain the same string, would have the same hashCode() value.)
 
- Select a slot in the hash table based upon the natural number
computed in the previous step.  There are two common methods:
      - The division method:
        - h(k) =
k.hashCode() % m
          - The conventional wisdom says to beware of values of m
that are powers of two.  The rationale for this is based on better
performance, i.e., fewer
collisions, for a poorly chosen hashCode() mapping.
 
- The multiplicative method, which has three steps:
        - Multiply k.hashCode() by a constant A that is
0 < A < 1.  The golden section number 0.61803... is often 
		selected as the value of A. Here is a good URL discussing the golden 
		section:
		
		http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fib.html. 
 The effect of using the golden section is to disperse keys that are 
		adjacent in the hash functions domain to widely separated values in the 
		range.
- Extract the fractional part of this product.
- Multiply the fractional part by m, keeping only the
integral part.
          - Although this method appears complex, there are very
efficient ways of implementing it if m is a power of two.
- Ultimately, you should test your hash function on representative
inputs.  For example, this 
	study compares numerous hashCode()
implementations using the division method.  (Note that m is a
power of two in their study.)
 
Review: Comparing the four IDictionary implementations
  
    
      | 
 | 
 | Expected-case Cost | Worst-case Cost | 
    
      | 
 | 
 | 
 | 
 | 
    
      | 
 | lookup() | O(n) | O(n) | 
    
      | DictLRS | insert() | O(n) | O(n) | 
    
      | 
 | remove() | O(n) | O(n) | 
    
      | 
 | 
 | 
 | 
 | 
    
      | 
 | lookup() | O(log n) | O(n) | 
    
      | DictBT | insert() | O(log n) | O(n) | 
    
      | 
 | remove() | O(log n) | O(n) | 
    
      | 
 | 
 | 
 | 
 | 
    
      | 
 | lookup() | O(log n) | O(log n) | 
    
      | DictArray | insert() | O(n) | O(n) | 
    
      | 
 | remove() | O(n) | O(n) | 
    
      | 
 | 
 | 
 | 
 | 
    
      | 
 | lookup() | O(1) | O(n) | 
    
      | DictHash | insert() | O(1)* | O(n) | 
    
      | 
 | remove() | O(1) | O(n) | 
  
*Calculated by amortizing the cost of doubling the hash table size over
the subsequent inserts.
  - DictHash is an excellent implementation of IDictionary.
    - Hash tables are, however, a poor choice for ranged operations,
such as removeRange().