- Recall from Lecture 28 that a hash function,
*h*, maps the set U of keys into the slots of a**hash table**T[0..m-1],*i.e.*,*h*:U -> {0, 1, ..., m - 1}. - There are many functions that satisfy this requirement. Choosing a good function is critical to the performance of a hash table.
- A good hash function distributes the keys over the range as uniformly as possible, minimizing collisions.
- A bad hash function is exemplified by h(k)=0.
- Typically, a hash function is implemented in two steps:
- Map the key to a natural number. Often, the domain of this mapping is larger than the range, for example, a function mapping arbitrary length strings to a subset of the natural numbers, such as [0, 2^31-1].
- The Java Object class's method hashCode() is such a mapping. The key being the underlying object.
- Any class can override the Object class's implementation of
this method with a more appropriate implementation. For example,
we may want our class's hashCode() to use a single field, such as a
string, as the key. (Thus, two objects that otherwise differ, but
contain the same string, would have the same hashCode() value.)

- Select a slot in the hash table based upon the natural number computed in the previous step. There are two common methods:
- The division method:
- h(k) = k.hashCode() % m
- The conventional wisdom says to beware of values of m
that are powers of two. The rationale for this is based on better
performance, i.e., fewer
collisions, for a poorly chosen hashCode() mapping.

- The multiplicative method, which has three steps:
- Multiply k.hashCode() by a constant
**A**that is 0 <**A**< 1. The golden section number 0.61803... is often selected as the value of A. Here is a good URL discussing the golden section: http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fib.html.

The effect of using the golden section is to disperse keys that are adjacent in the hash functions domain to widely separated values in the range. - Extract the fractional part of this product.
- Multiply the fractional part by m, keeping only the integral part.
- Although this method appears complex, there are very efficient ways of implementing it if m is a power of two.
- Ultimately, you should test your hash function on representative
inputs. For example, this
study compares numerous hashCode()
implementations using the division method. (Note that m is a
power of two in their study.)

Expected-case Cost | Worst-case Cost | ||

lookup() | O(n) | O(n) | |

DictLRS | insert() | O(n) | O(n) |

remove() | O(n) | O(n) | |

lookup() | O(log n) | O(n) | |

DictBT | insert() | O(log n) | O(n) |

remove() | O(log n) | O(n) | |

lookup() | O(log n) | O(log n) | |

DictArray | insert() | O(n) | O(n) |

remove() | O(n) | O(n) | |

lookup() | O(1) | O(n) | |

DictHash | insert() | O(1)* | O(n) |

remove() | O(1) | O(n) |

- DictHash is an excellent implementation of IDictionary.
- Hash tables are, however, a poor choice for ranged operations, such as removeRange().

- Consider an ordered, array-based IDictionary implementation, such as DictArray.
- Suppose that keys are
*uniformly*distributed. - How do you find a number in a phone book?
- Specifically, if I asked you find ``Alan Cox'' in the phone book would you start in the middle?

- Recall the findIndex() method, implementing binary search.
- Rewrite
`mid = (lo + hi)/2`as`mid = lo + (hi - lo)/2` - Replace (hi - lo)/2 with an expression that places us closer to
what
we're
looking for
`mid = lo + ((key - keys[lo + 1])*(hi - lo))/(keys[hi - 1] - keys[lo + 1])` - Note: The
`Comparable`interface is insufficient for interpolation search.

- Consider the following array of elements: 9, 21, 32, 38, 51, 59, 68, 80, 91, 97, 113, 119, 131, 142, 149
- How many steps would binary search require in order to find 68?
- How many steps would interpolation search require in order to find 68?

- Suppose that we used an extended
`Comparable`interface that includes the method`int sub(Object key)`

- If the keys are
*uniformly*distributed, the number of steps in an interpolation search is O(log log n). - If, instead, the keys are not uniformly distributed, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 999, and we search for 9, performance is poor.
- Can you guarantee O(log n) worst case?