Spring 2007

Lecture #28 - Hash Tables and Hash Functions

Hash Tables

• A hash table is a generalization of an ordinary array.
• When the number of keys actually stored is small relative to the total number of possible keys, hash tables become an effective alternative to directly addressing an array, since a hash table typically uses an array of size proportional to the number of keys actually stored.
• Instead of using the key as an array index directly, the array index is computed from the key.
• With hashing, an element with key k is stored in slot h(k); i.e., a hash function h is used to compute the slot from the key k.
• h maps the set U of keys into the slots of a hash table T[0..m-1]:
• h:U -> {0, 1, ..., m - 1}

The Problem: Collisions

• Two keys may hash to the same slot.  This is called a collision.  Because |U|>m, collisions are unavoidable.
• To avoid collisions, h should appear ``random'', i.e., adjacent keys should not hash to adjacent slots.
• To cope with collisions, the simplest method is chaining.

Chaining

• In chaining, we put all the elements that hash to the same slot in a linked list, i.e., slot j contains a reference to the head of the list of all stored elements that hash to j; if there are no such elements, slot j contains an empty list.
• To insert an element, we simply put it at the front of the list.  So, the worst case running time is O(1).
• To lookup an element, we search the list belonging to the slot for the corresponding key.  So, the worst case running time is proportional to the length of the list.
• Removal is identical to lookup.

Performance

• Given a hash table with m slots that stores n elements, we define the load factor alpha as n/m, i.e., the average number of elements in a chain.
• The worst case behavior of hashing with chaining is O(n): All n keys hash to the same slot, creating a list of length n.
• The expected case behavior depends on how well the hash function distributes the set of keys to be stored among the m slots, on average.  We will assume that
1. any given element is equally likely to hash into any of the m slots and
2. the hash value can be computed in O(1) time.  Then the expected case search time is O(1+alpha).
• If the number of hash table slots is at least proportional to the number of elements in the table, we have n=O(m) and consequently, alpha = n/m = O(m)/m = O(1).  Thus, searching takes constant time on average.

Hash Table Implementations

•  The Java Collection Framework provided by Java SDK, includes two "industrial strength" implementations of the hash table data structure: HashTable and HashMap.
• Below is our own implementation of the hash table structure.  One of the main differences between the Sun's version of HashTable and ours is the our elements() method returns an IList, while the Sun's HashTable returns an Enumeration.  Sun's HashMap has a method called values() that returns a Collection of values.  By returning an IList, we can process our hash table using visitors, while with an Enumeration or a Collection, one will have to write loops to do the processing.  The choice is yours!

package genDict;

import genListFW.*;
/**
* An IDictionary implemented using a hash table. Collisions are handled
* using chaining. The chains are implemented using DictLRS.
*
* Uses the method hashCode() defined by class Object as the hash
* function. Any class may override this method with a new
* implementation.
*
* @author Alan L. Cox
* @author Dung X. Nguyen ("generifier")
* @since 03/23/05
*/

public class DictHash<K, V> implements IDictionary<K, V> {
/*
* Initialize _table to reference a single-element array of
* IDictionary, containing in its single element a reference to an
* empty DictLRS.
*/

private IDictionary<K, V>[] _table = new DictLRS[1];
private int _tableOccupancy = 0;

/*
* An IList factory used to linearalize each internal DictLRS and
* resize the _table array.
*/

private IListFactory< DictPair<K, V> > _lf;

/*
* An upper bound on the load factor.
*/

public DictHash(IListFactory< DictPair<K, V> > lf, double loadFactor) {
_table[0] = new DictLRS<K, V>();
_lf = lf;
}

/**
* Clears the contents of the dictionary leaving it empty.
*
* Implemented by replacing the existing LRStruct with a new,
* empty one.
*/

public void clear() {
_table = new DictLRS[1];
_table[0] = new DictLRS<K, V>();
_tableOccupancy = 0;
}

/**
* Returns true if the dictionary is empty and false otherwise.
*/
public boolean isEmpty() {
return _tableOccupancy == 0;
}

/**
* Returns false always.
*/
public boolean isFull() {
return false;
}

/**
* Returns an IList of DictPair<K, V>s corresponding to the entire
* contents of the dictionary.
*
* Note that the elements are not in order.
*/
public IList< DictPair<K, V> > elements(final IListFactory< DictPair<K, V> > lf) {
IList< DictPair<K, V> > l = lf.makeEmptyList();
for (int i = 0; i < _table.length; i++)
l = _table[i].elements(lf).execute(new IListAlgo<DictPair<K, V>, IList< DictPair<K, V> >, IList< DictPair<K, V> > >() {
public IList< DictPair<K, V> > emptyCase(IMTList<? extends DictPair<K, V> > host, IList< DictPair<K, V> >... input) {
return input[0];
}
public IList< DictPair<K, V> > nonEmptyCase(INEList<? extends DictPair<K, V> > host, IList< DictPair<K, V> >... input) {
return lf.makeNEList(host.getFirst(),(IList< DictPair<K, V> >)host.getRest().execute(this, input[0]));
}
}, l);
return l;
}

/**
* Returns the DictPair<K, V> with the given key. If there is not
* a DictPair<K, V> with the given key, returns null.
*
* Returns a DictPair<K, V> rather than the value alone so that
* the user can distinguish between not finding the key and
* finding the pair (key, null).
*
* This method is O(1) in the expected case and O(n) in the worst
* case.
*
* @param key the key to lookup
* @return the DictPair<K, V> found
*/

public DictPair<K, V> lookup(K key) {
int index = key.hashCode() % _table.length;
return _table[index].lookup(key);
}

/**
* Inserts the given key and value. If the given key is already
* in the dictionary, the given value replaces the key's old
* value.
*
* This method is O(1) in both the expected case and the worst
* case if we amortize the cost of doubling the hash table over
* subsequent insert()'s.
*
* @param key the key to insert
* @param value the value to insert
*/

public void insert(K key, V value) {
if (_tableOccupancy >= (_loadFactor * _table.length)) {
int i;
final IDictionary newTable[] = new IDictionary[2*_table.length];
for (i = 0; i < newTable.length; i++)
newTable[i] = new DictLRS<K, V>();
for (i = 0; i < _table.length; i++) {
_table[i].elements(_lf).execute(new IListAlgo<DictPair<K, V>, Object, Object>() {
public Object emptyCase(IMTList<? extends DictPair<K, V> > host, Object... nu) {
return null;
}
public Object nonEmptyCase(INEList<? extends DictPair<K, V> > host, Object... nu) {
DictPair<K, V> pair = (DictPair<K, V>) host.getFirst();
int index = pair.getKey().hashCode() % newTable.length;
newTable[index].insert(pair.getKey(), pair.getValue());
return host.getRest().execute(this);
}
});
}
_table = newTable;
}
int index = key.hashCode() % _table.length;
_tableOccupancy++;
_table[index].insert(key, value);
}

/**
* Removes the DictPair<K, V> with the given key and returns it.
* If there is not a DictPair<K, V> with the given key, returns
* null.
*
* This method is O(1) in the expected case and O(n) in the worst
* case.
*
* @param key the key to remove
* @return the DictPair<K, V> removed
*/

public DictPair<K, V> remove(K key) {
int index = key.hashCode() % _table.length;
DictPair<K, V> pair = _table[index].remove(key);
if (pair != null)
_tableOccupancy--;
return pair;
}

/**
* Returns a string representing the contents of the dictionary.
*/
public String toString() {
return elements(_lf).toString();
}
}