Rice University - Comp 212 - Intermediate Programming

Spring 2005

Lecture 31 - Hash Functions; Interpolation Search

Hash Functions

Recall from Lecture 28 that a hash function, h, maps the set U of keys into the slots of a hash table T[0..m-1], i.e., h:U -> {0, 1, ..., m - 1}.

There are many functions that satisfy this requirement. Choosing a good function is critical to the performance of a hash table.

A good hash function distributes the keys over the range as uniformly as possible, minimizing collisions.
A bad hash function is exemplified by h(k)=0.

Typically, a hash function is implemented in two steps:

Map the key to a natural number. Often, the domain of this mapping is larger than the range, for example, a function mapping arbitrary length strings to a subset of the natural numbers, such as [0, 2^31-1].

The Java Object class's method hashCode() is such a mapping. The key being the underlying object.

Any class can override the Object class's implementation of this method with a more appropriate implementation. For example, we may want our class's hashCode() to use a single field, such as a string, as the key. (Thus, two objects that otherwise differ, but contain the same string, would have the same hashCode() value.)

Select a slot in the hash table based upon the natural number computed in the previous step. There are two common methods:

The division method:

h(k) = k.hashCode() % m

The conventional wisdom says to beware of values of m that are powers of two. The rationale for this is based on better performance, i.e., fewer collisions, for a poorly chosen hashCode() mapping.

The multiplicative method, which has three steps:

Multiply k.hashCode() by a constant A that is 0 < A < 1. The golden section number 0.61803... is often selected as the value of A. Here is a good URL discussing the golden section: http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fib.html.
The effect of using the golden section is to disperse keys that are adjacent in the hash functions domain to widely separated values in the range.
Extract the fractional part of this product.
Multiply the fractional part by m, keeping only the integral part.

Although this method appears complex, there are very efficient ways of implementing it if m is a power of two.

Ultimately, you should test your hash function on representative inputs. For example, this study compares numerous hashCode() implementations using the division method. (Note that m is a power of two in their study.)

Review: Comparing the four IDictionary implementations

		Expected-case Cost	Worst-case Cost

	lookup()	O(n)	O(n)
DictLRS	insert()	O(n)	O(n)
	remove()	O(n)	O(n)

	lookup()	O(log n)	O(n)
DictBT	insert()	O(log n)	O(n)
	remove()	O(log n)	O(n)

	lookup()	O(log n)	O(log n)
DictArray	insert()	O(n)	O(n)
	remove()	O(n)	O(n)

	lookup()	O(1)	O(n)
DictHash	insert()	O(1)*	O(n)
	remove()	O(1)	O(n)

*Calculated by amortizing the cost of doubling the hash table size over the subsequent inserts.

DictHash is an excellent implementation of IDictionary.

Hash tables are, however, a poor choice for ranged operations, such as removeRange().

Can We Improve On Binary Search?

Consider an ordered, array-based IDictionary implementation, such as DictArray.
Suppose that keys are uniformly distributed.
How do you find a number in a phone book?

Specifically, if I asked you find ``Alan Cox'' in the phone book would you start in the middle?

Interpolation Search

Recall the findIndex() method, implementing binary search.
Rewrite mid = (lo + hi)/2 as mid = lo + (hi - lo)/2
Replace (hi - lo)/2 with an expression that places us closer to what we're looking for mid = lo + ((key - keys[lo + 1])*(hi - lo))/(keys[hi - 1] - keys[lo + 1])
Note: The Comparable interface is insufficient for interpolation search.

Example

Consider the following array of elements: 9, 21, 32, 38, 51, 59, 68, 80, 91, 97, 113, 119, 131, 142, 149
How many steps would binary search require in order to find 68?
How many steps would interpolation search require in order to find 68?

`findIndex()`

Suppose that we used an extended Comparable interface that includes the method int sub(Object key)

private int findIndex(Comparable key) {
    int lo = -1;
    int hi = _firstEmptyKeyValuePair;
    while (lo + 1 != hi) {
      ExtendedComparable loKey = _pairs[lo + 1].getKey();
      ExtendedComparable hiKey = _pairs[hi - 1].getKey();
      int mid = lo + key.sub(loKey)*(hi - lo)/hiKey.sub(loKey);
      int result = _pairs[mid].getKey().compareTo(key);
      if (result == 0)
        return (mid);
      else if (result > 0)
        hi = mid;
      else
        lo = mid;
      }
    }
    return lo;
}

The Computational Cost of Interpolation Search

If the keys are uniformly distributed, the number of steps in an interpolation search is O(log log n).
If, instead, the keys are not uniformly distributed, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 999, and we search for 9, performance is poor.

Can you guarantee O(log n) worst case?