A major theme in computing is storage, retrieval, and removal: store
data somewhere so that it can later be retrieved, and discard it when it is
no longer needed, all as efficiently as possible. The abstraction of these
computing activities is embodied in the notion of what is called a dictionary, expressed in Java
as an interface as follows.
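The original interface listing is not reproduced here; the following is a minimal sketch of what such a dictionary interface might look like. The method names insert, lookup, and remove mirror the operations discussed below and are illustrative assumptions, not the original declaration.

```java
// Hypothetical sketch of a dictionary interface; method names are
// assumptions based on the operations discussed in these notes.
interface IDictionary {
    /** Stores value under key. */
    void insert(Object key, Object value);
    /** Returns the value bound to key, or null if key is absent. */
    Object lookup(Object key);
    /** Removes and returns the value bound to key, or null if absent. */
    Object remove(Object key);
}
```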
IDictionary is an example of what we call an unrestricted
access container (as opposed to a restricted access container). A simple way
to implement IDictionary is to use an LRStruct.
The problem with such an
implementation is that each of the operations insert, lookup, and remove takes
O(N) time, where N is the total number of elements in the dictionary.
Using a self-balancing tree instead guarantees O(log N) time.
Can we do better
than that? The answer is yes and no. With a data structure called a
"hash table" coupled with an appropriate "hash function", we can achieve an
expected performance of O(1), that is, constant
time!
Hash Tables and Hash Functions
Hash Tables
A hash table is a generalization of an ordinary array.
When the number of keys actually stored is small relative to the total
number of possible keys, hash tables become an effective alternative to
directly addressing an array, since a hash table typically uses an array of
size proportional to the number of keys actually stored.
Instead of using the key as an array index directly, the array index is computed
from the key.
With hashing, an element with key k is stored in slot h(k); i.e., a
hash function h is used to compute the slot from the key k.
h maps the set U of keys into the slots of a hash table
T[0..m-1]:
h:U -> {0, 1, ..., m - 1}
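One common way to realize such an h is modular hashing: reduce the key's integer hash code modulo the table size m. The class and method names below are illustrative, not part of the notes' own library; the masking trick handles the fact that hashCode() may be negative in Java.

```java
// A minimal sketch of a modular hash function h : U -> {0, 1, ..., m-1};
// the names ModularHash and h are illustrative assumptions.
class ModularHash {
    private final int m;  // number of slots in the hash table

    ModularHash(int m) { this.m = m; }

    /** Maps an arbitrary key into a slot index in the range [0, m). */
    int h(Object key) {
        // Mask off the sign bit so that negative hashCode() values
        // still yield a non-negative slot index.
        return (key.hashCode() & 0x7fffffff) % m;
    }
}
```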
The Problem: Collisions
Two keys may hash to the same slot. This is called a collision.
Because |U|>m, collisions are unavoidable.
To minimize collisions, h should appear "random", i.e., adjacent
keys should not hash to adjacent slots.
To cope with collisions, the simplest method is chaining.
Chaining
In chaining, we put all the elements that hash to the same slot in a
linked list, i.e., slot j contains a reference to the head of the list of
all stored elements that hash to j; if there are no such elements, slot j
contains an empty list.
To insert an element, we simply put it at the front of the list. So,
the worst case running time is O(1).
To lookup an element, we search the list belonging to the slot for the
corresponding key. So, the worst case running time is proportional to
the length of the list.
Removal begins with the same search as lookup, so its worst case
running time is likewise proportional to the length of the list.
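The three operations above can be sketched as follows. This is a minimal chaining table under assumed names (ChainedTable, Entry), not a definitive implementation; insert prepends unconditionally, exactly as described above.

```java
import java.util.LinkedList;

// Minimal sketch of hashing with chaining; class and method names
// are illustrative assumptions.
class ChainedTable {
    private static class Entry {
        final Object key;
        final Object value;
        Entry(Object k, Object v) { key = k; value = v; }
    }

    private final LinkedList<Entry>[] slots;

    @SuppressWarnings("unchecked")
    ChainedTable(int m) {
        slots = new LinkedList[m];
        for (int j = 0; j < m; j++) slots[j] = new LinkedList<>();
    }

    private int h(Object key) {
        return (key.hashCode() & 0x7fffffff) % slots.length;
    }

    /** O(1) worst case: prepend to the chain for slot h(key). */
    void insert(Object key, Object value) {
        slots[h(key)].addFirst(new Entry(key, value));
    }

    /** Worst case proportional to the length of the chain. */
    Object lookup(Object key) {
        for (Entry e : slots[h(key)])
            if (e.key.equals(key)) return e.value;
        return null;
    }

    /** Same search as lookup, then unlink the found entry. */
    Object remove(Object key) {
        LinkedList<Entry> chain = slots[h(key)];
        for (Entry e : chain)
            if (e.key.equals(key)) { chain.remove(e); return e.value; }
        return null;
    }
}
```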
Performance
Given a hash table with m slots that stores n elements, we define the load
factor alpha as n/m, i.e., the average number of elements in a
chain.
The worst case behavior of hashing with chaining is O(n): All n keys hash
to the same slot, creating a list of length n.
The expected case behavior depends on how well the hash function
distributes the set of keys to be stored among the m slots, on
average. We will assume that
any given element is equally likely to hash into any of the m slots and
the hash value can be computed in O(1) time. Then the expected
case search time is O(1+alpha).
If the number of hash table slots is at least proportional to the number
of elements in the table, we have n=O(m) and consequently, alpha = n/m = O(m)/m
= O(1). Thus, searching takes constant time on average.
Below is our own implementation of hash tables. One of the main
differences between Sun's version and ours is that our elements() method
returns an IList, while Sun's hash table returns an Enumeration.
By returning an IList, we can process our hash table using visitors, whereas
with an Enumeration, one has to write explicit loops to do the
processing. The choice is yours!
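Since neither the original listing nor the course's IList interface is reproduced here, the sketch below illustrates only the elements()-returns-a-list idea, with java.util.List standing in for IList; the class name ListingTable is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a chaining table whose elements() method returns
// a list of all stored elements. java.util.List stands in for the
// course's IList, which is not reproduced here.
class ListingTable {
    private final List<Object>[] slots;

    @SuppressWarnings("unchecked")
    ListingTable(int m) {
        slots = new List[m];
        for (int j = 0; j < m; j++) slots[j] = new ArrayList<>();
    }

    void insert(Object elem) {
        slots[(elem.hashCode() & 0x7fffffff) % slots.length].add(elem);
    }

    /** Concatenates all chains into one list, so callers can process
        the elements without writing an explicit enumeration loop. */
    List<Object> elements() {
        List<Object> all = new ArrayList<>();
        for (List<Object> chain : slots) all.addAll(chain);
        return all;
    }
}
```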