Rice University - Comp 212 - Intermediate Programming

Spring 2007

Recall from Lecture 28 that a hash function, h, maps the set U of keys into the slots of a hash table T[0..m-1], i.e., h:U -> {0, 1, ..., m - 1}.

There are many functions that satisfy this requirement. Choosing a good function is critical to the performance of a hash table.

A good hash function distributes the keys over the range as uniformly as possible, minimizing collisions.
A bad hash function is exemplified by h(k)=0.

Map the key to a natural number. Often, the domain of this mapping is larger than the range, for example, a function mapping arbitrary length strings to a subset of the natural numbers, such as [0, 2^31-1].

The Java Object class's method hashCode() is such a mapping. The key being the underlying object.

Any class can override the Object class's implementation of this method with a more appropriate implementation. For example, we may want our class's hashCode() to use a single field, such as a string, as the key. (Thus, two objects that otherwise differ, but contain the same string, would have the same hashCode() value.)

Select a slot in the hash table based upon the natural number computed in the previous step. There are two common methods:

The conventional wisdom says to beware of values of m that are powers of two. The rationale for this is based on better performance, i.e., fewer collisions, for a poorly chosen hashCode() mapping.

Multiply k.hashCode() by a constant A that is 0 < A < 1. The golden section number 0.61803... is often selected as the value of A. Here is a good URL discussing the golden section: http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fib.html.
The effect of using the golden section is to disperse keys that are adjacent in the hash functions domain to widely separated values in the range.
Extract the fractional part of this product.
Multiply the fractional part by m, keeping only the integral part.

Although this method appears complex, there are very efficient ways of implementing it if m is a power of two.

Ultimately, you should test your hash function on representative inputs. For example, this study compares numerous hashCode() implementations using the division method. (Note that m is a power of two in their study.)

*Calculated by amortizing the cost of doubling the hash table size over the subsequent inserts.

Hash tables are, however, a poor choice for ranged operations, such as removeRange().