Rice University - Comp 212 - Intermediate Programming
Spring 2007
Lecture 31 - Hash Functions
Hash Functions
- Recall from Lecture 28 that a hash function, h, maps the
set U of keys into the
slots of a hash table T[0..m-1], i.e., h:U
-> {0, 1, ..., m - 1}.
- There are many functions that satisfy this requirement.
Choosing a good function is critical to the performance of a hash table.
- A good hash function distributes the keys over the range as
uniformly as possible, minimizing collisions.
- A bad hash function is exemplified by h(k)=0.
- Typically, a hash function is implemented in two steps:
- Map the key to a natural number. Often, the domain of
this mapping is larger than the range, for example, a function mapping
arbitrary length strings to a subset of the natural numbers, such as
[0, 2^31-1].
- The Java Object class's method hashCode() is such a
mapping. The key being the underlying object.
- Any class can override the Object class's implementation of
this method with a more appropriate implementation. For example,
we may want our class's hashCode() to use a single field, such as a
string, as the key. (Thus, two objects that otherwise differ, but
contain the same string, would have the same hashCode() value.)
- Select a slot in the hash table based upon the natural number
computed in the previous step. There are two common methods:
- The division method:
- h(k) =
k.hashCode() % m
- The conventional wisdom says to beware of values of m
that are powers of two. The rationale for this is based on better
performance, i.e., fewer
collisions, for a poorly chosen hashCode() mapping.
- The multiplicative method, which has three steps:
- Multiply k.hashCode() by a constant A that is
0 < A < 1. The golden section number 0.61803... is often
selected as the value of A. Here is a good URL discussing the golden
section:
http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fib.html.
The effect of using the golden section is to disperse keys that are
adjacent in the hash functions domain to widely separated values in the
range.
- Extract the fractional part of this product.
- Multiply the fractional part by m, keeping only the
integral part.
- Although this method appears complex, there are very
efficient ways of implementing it if m is a power of two.
- Ultimately, you should test your hash function on representative
inputs. For example, this
study compares numerous hashCode()
implementations using the division method. (Note that m is a
power of two in their study.)
Review: Comparing the four IDictionary implementations
|
|
Expected-case Cost |
Worst-case Cost |
|
|
|
|
|
lookup() |
O(n) |
O(n) |
DictLRS |
insert() |
O(n) |
O(n) |
|
remove() |
O(n) |
O(n) |
|
|
|
|
|
lookup() |
O(log n) |
O(n) |
DictBT |
insert() |
O(log n) |
O(n) |
|
remove() |
O(log n) |
O(n) |
|
|
|
|
|
lookup() |
O(log n) |
O(log n) |
DictArray |
insert() |
O(n) |
O(n) |
|
remove() |
O(n) |
O(n) |
|
|
|
|
|
lookup() |
O(1) |
O(n) |
DictHash |
insert() |
O(1)* |
O(n) |
|
remove() |
O(1) |
O(n) |
*Calculated by amortizing the cost of doubling the hash table size over
the subsequent inserts.
- DictHash is an excellent implementation of IDictionary.
- Hash tables are, however, a poor choice for ranged operations,
such as removeRange().