Rice University - Comp 212 - Intermediate Programming
Spring 2005
Lecture 31 - Hash Functions; Interpolation Search
Hash Functions
- Recall from Lecture 28 that a hash function, h, maps the
set U of keys into the
slots of a hash table T[0..m-1], i.e., h:U
-> {0, 1, ..., m - 1}.
- There are many functions that satisfy this requirement.
Choosing a good function is critical to the performance of a hash table.
- A good hash function distributes the keys over the range as
uniformly as possible, minimizing collisions.
- A bad hash function is exemplified by h(k)=0.
- Typically, a hash function is implemented in two steps:
- Map the key to a natural number. Often, the domain of
this mapping is larger than the range, for example, a function mapping
arbitrary length strings to a subset of the natural numbers, such as
[0, 2^31-1].
- The Java Object class's method hashCode() is such a
mapping. The key being the underlying object.
- Any class can override the Object class's implementation of
this method with a more appropriate implementation. For example,
we may want our class's hashCode() to use a single field, such as a
string, as the key. (Thus, two objects that otherwise differ, but
contain the same string, would have the same hashCode() value.)
- Select a slot in the hash table based upon the natural number
computed in the previous step. There are two common methods:
- The division method:
- h(k) =
k.hashCode() % m
- The conventional wisdom says to beware of values of m
that are powers of two. The rationale for this is based on better
performance, i.e., fewer
collisions, for a poorly chosen hashCode() mapping.
- The multiplicative method, which has three steps:
- Multiply k.hashCode() by a constant A that is
0 < A < 1. The golden section number 0.61803... is often
selected as the value of A. Here is a good URL discussing the golden
section:
http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fib.html.
The effect of using the golden section is to disperse keys that are
adjacent in the hash functions domain to widely separated values in the
range.
- Extract the fractional part of this product.
- Multiply the fractional part by m, keeping only the
integral part.
- Although this method appears complex, there are very
efficient ways of implementing it if m is a power of two.
- Ultimately, you should test your hash function on representative
inputs. For example, this
study compares numerous hashCode()
implementations using the division method. (Note that m is a
power of two in their study.)
Review: Comparing the four IDictionary implementations
|
|
Expected-case Cost |
Worst-case Cost |
|
|
|
|
|
lookup() |
O(n) |
O(n) |
DictLRS |
insert() |
O(n) |
O(n) |
|
remove() |
O(n) |
O(n) |
|
|
|
|
|
lookup() |
O(log n) |
O(n) |
DictBT |
insert() |
O(log n) |
O(n) |
|
remove() |
O(log n) |
O(n) |
|
|
|
|
|
lookup() |
O(log n) |
O(log n) |
DictArray |
insert() |
O(n) |
O(n) |
|
remove() |
O(n) |
O(n) |
|
|
|
|
|
lookup() |
O(1) |
O(n) |
DictHash |
insert() |
O(1)* |
O(n) |
|
remove() |
O(1) |
O(n) |
*Calculated by amortizing the cost of doubling the hash table size over
the subsequent inserts.
- DictHash is an excellent implementation of IDictionary.
- Hash tables are, however, a poor choice for ranged operations,
such as removeRange().
Can We Improve On Binary Search?
- Consider an ordered, array-based IDictionary implementation, such
as
DictArray.
- Suppose that keys are uniformly distributed.
- How do you find a number in a phone book?
- Specifically, if I asked you find ``Alan Cox'' in the phone
book would
you start in the middle?
Interpolation Search
- Recall the findIndex() method, implementing binary search.
- Rewrite mid = (lo + hi)/2 as mid = lo + (hi - lo)/2
- Replace (hi - lo)/2 with an expression that places us closer to
what
we're
looking for mid = lo + ((key - keys[lo + 1])*(hi - lo))/(keys[hi -
1] - keys[lo + 1])
- Note: The Comparable interface is insufficient for
interpolation
search.
Example
- Consider the following array of elements: 9, 21, 32, 38, 51, 59,
68,
80,
91, 97, 113, 119, 131, 142, 149
- How many steps would binary search require in order to find 68?
- How many steps would interpolation search require in order to
find 68?
findIndex()
- Suppose that we used an extended Comparable interface
that
includes
the method int sub(Object key)
private int findIndex(Comparable key) {
int lo = -1;
int hi = _firstEmptyKeyValuePair;
while (lo + 1 != hi) {
ExtendedComparable loKey = _pairs[lo
+ 1].getKey();
ExtendedComparable hiKey = _pairs[hi
- 1].getKey();
int mid = lo + key.sub(loKey)*(hi
- lo)/hiKey.sub(loKey);
int result =
_pairs[mid].getKey().compareTo(key);
if (result == 0)
return (mid);
else if (result > 0)
hi = mid;
else
lo = mid;
}
}
return lo;
}
The Computational Cost of Interpolation Search
- If the keys are uniformly distributed, the number of
steps in
an
interpolation search is O(log log n).
- If, instead, the keys are not uniformly distributed, e.g., 1, 2,
3, 4,
5, 6, 7, 8, 9, 999, and we search for 9, performance is poor.
- Can you guarantee O(log n) worst case?