Lab 5 - Hash Tables

Lab goals:


An Introduction to Hash Tables

A hash table is an efficient data structure for maintaining a collection of key-to-value mappings. The operations supported by a hash table typically include: insert a key-to-value mapping into the collection, lookup a mapping with the specified key, and remove a mapping with the specified key. A hash table is also sometimes referred to as a hash map (as in Java) or a dictionary (as in Python).

A Strawman Approach (with Problems)

An ordinary array can be used to implement a collection of key-to-value mappings. A simple function can be used to compute an array index that is unique over all keys, and the associated value can simply be stored at that index within the array. Generally speaking, this approach is easily implemented, and the operations of insert, lookup, and remove would then run very quickly.

However, this approach has a major drawback. The size of the array has to be equal to the range of possible keys. In the general case, it is just not feasible to allocate such a huge array. Moreover, in general, even if it is possible to allocate an array of that size, the array would be very inefficiently used, as the number of keys that are actually in use is usually much smaller than the number of possible keys.

For example, suppose that an array is used to store some value associated with each of the words in a dictionary, using each word as the key. Then, the array size has to be big enough to be able to use as an index any possible combination of letters for some maximum word length. However, only a fraction of character combinations are valid words (and thus would be in the dictionary), and the rest of the entries in the array would be wasted.

A Better Solution: A Hash Table

Using a hash table helps eliminate or greatly reduce the wasted memory of this strawman approach without sacrificing much in terms of its simplicity or speed. As in the strawman solution above, using a hash table uses a function to compute the array index for a given key, but for a hash table, this array index need not be unique for every possible key.

The function used with a hash table is called a hash function and may produce the same array index for more than one possible input; and the resulting hash table allows more than one key-to-value mapping to use the same location (the same index) within the array. When this happens, the situation is called a hash collision, and there are many possible solutions for handling collisions in a hash table.

One common approach for handling such collisions is to create a linked list of all the mappings that collided at that same location in the array (i.e., for which the hash function resulted in the same array index being used). This approach to handling hash collisions is referred to as chaining:

Hash Table

Clearly, collisions are unavoidable when the number of mappings stored in a hash table is greater than the size of the hash table's array. In general, if the array size is never changed, as the number of stored mappings increases, so will the number of collisions as well as the average length of the collision chain linked lists. Adding a new mapping to the hash table is unaffected because it can be done just by computing the array index from the key using the hash function and adding the mapping at the front of the linked list. However, longer linked lists will result in lookup and remove operations taking longer. To solve this problem, a hash table can be made to dynamically grow (or shrink) the size of the array in proportion to the number of stored mappings. In particular, given a hash table of array size m that in total currently stores n mappings, the current load factor for this hash table is defined to be n/m. The hash table's load factor is thus the average length of the collision chains in that hash table. Dynamically growing (or shrinking) the hash table array size can, for example, be used to bound the hash table's load factor to not exceed some specified threshold value.

Hash Functions

Given a key k and a hash function hash(), the associated value for key k is stored at the index hash(k) within the hash table's array. A good hash function in general maps the keys uniformly across the range of array indices and thereby generally keeps collisions to a minimum (i.e., by spreading the mappings evenly across the hash table array). For some kinds of hash table keys, the set of keys that will be encountered may already naturally have an approximately uniform distribution (e.g., generally for keys that are small integers that are sequentially generated when each new item that will be added to the hash table is created). But in many other uses, this is not the case. For example, for the set of all words in English, many more words start with the letter 'a' than with the letter 'x', etc.

Below is the hash function that we will use in this lab (if you are curious, this is a version of the Fowler–Noll–Vo hash function).  It takes a string as its argument and returns an unsigned integer:

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Computes the Fowler / Noll / Vo (FNV) hash of the string "key".
 */
static unsigned int
hash_value(const char *key)
{
        unsigned int c, hval = 33554467U;       /* A "magic" number. */

        /*
         * The following cast is meant to prevent the sign extension of
         * "*string"'s value before it is assigned to "c".  For example,
         * suppose that "*string" is 0xbc.  Then, without the cast, "c" would
         * be assigned 0xffffffbc instead of 0x000000bc.  The former value,
         * 0xffffffbc, is problematic because "hval ^= c" is only supposed
         * to change the least significant 8 bits of "hval".
         */
        while ((c = *(const unsigned char *)key++) != 0) {
                hval *= 0x01000193U;            /* A "magic" prime number. */
                hval ^= c;
        }
        return (hval);
}

This hash function goes through the given key, one character at a time, updating the value of hval for each new character.

As written, clearly this function can return values that are much larger than the array size of most hash tables. Specifically, here the value returned is not used directly as the index in the hash table. Rather, the hash table actually uses as the array index the remainder after dividing the resulting hash value by the array size. That is:

array index = hash value % array size

Function Pointers in C

The hash table interface described here uses a type of pointer, a function pointer, that we haven't used before in this class. For example, the second argument to the function hash_table_destroy() (described below) is a function pointer:

    void *(*destructor)(const char *key, void *value, void *cookie)

The C type in this declaration of destructor can be read according to the meaning of the different C operators that it uses and the precedence of those operators with respect to each other (just like the precedence of + vs. *, for example):

Basically, just read all of those parts together and you have what the type in this declaration of destructor means: pointer to function taking these arguments and returning a pointer to void.  Compare the C type in this declaration of destructor to the type in the following six example declarations:

    int a;
    int *b;
    int c(char *x);
    int *d(char *x);
    int (*e)(char *x);
    int *(*f)(char *x);

In summary, this expression declares that destructor is a pointer that can point to any function with the following signature (i.e., with these argument types and this return type):

    void *
    function(const char *key, void *value, void *cookie)
    {
            ...
    }

Also, in C, you can use the name of a function by itself in a way analogous to how you can use the name of an array by itself.  In other words, whenever you use the name of a function without calling it, that is, without putting (arguments) after the function name, you get the address of that function. So saying just  function  (again, with no parentheses or arguments after it) is the same as saying  &function,  although saying just  function  is preferred.  Here is an example from the main part of this week's lab that passes the address of the function free_key to the function hash_table_destroy:

void *
free_key(const char *key, void *value, void *cookie)
{
        /* A silly assignment to prevent a compilation warning. */
        value = value;
        /* A cast is needed here to remove the const attribute from "key". */
        free((char *)key);
        return (cookie);
}

...

int
main(int argc, char **argv)
{
        ...
        /* Ignore the return value, which is NULL. */
        (void)hash_table_destroy(ht, free_key, NULL);
        ...
}

Finally, calling a function through a pointer is easy. A call looks a lot like the declaration. Again, here is an example from this week's lab:

void *
hash_table_destroy(struct hash_table *ht,
    void *(*destructor)(const char *key, void *value, void *cookie),
    void *cookie)
{
        struct hash_table_mapping *elem, *next;
        ...
        cookie = (*destructor)(elem->key, elem->value, cookie);
        ...
}

In effect, if destructor is a pointer to the function, then *destructor is the function (not just the pointer to it), and the parentheses around this part of the expression cause it to be evaluated before the rest of the expression (i.e., before the actual calling of the function).


The Provided Hash Table Interface

Below is the interface for the hash table that you'll be working with in this lab. This interface is contained in the header file hash_table.h.

This interface definition also illustrates how the principle of information hiding (also known as data hiding) can be realized in C.  In particular:

This kind of information hiding was briefly discussed in the lectures on Structures and Unions, specifically, on slide 11, Abstraction in C. This technique in C is also similar (within the limits of the C language) to using private fields in Java.

This hash table interface also allows a user-defined cookie (i.e., some extra piece of information) to be passed among functions.  The cookie is a void pointer, allowing you to pass a pointer to any type of data you wish. This feature enables you to provide additional data when removing items from the hash table and/or when iterating over the mappings in the hash table. You won't directly use this feature in this week's lab, but you will see one example of how to use such user data in next week's lab.

/*
 * This file defines the interface for a hash table, which is an efficient
 * data structure for maintaining a collection of "key"-to-"value" mappings.
 * This hash table, in particular, maintains mappings from strings to opaque
 * objects.
 *
 * This file is placed in the public domain by its authors, Alan L. Cox and
 * Kaushik Kumar Ram.
 */

/*
 * Declare the hash table struct type without exposing its implementation
 * details, because the callers to the following functions don't need to know
 * those details in order to declare, assign, or pass a pointer to this struct
 * type.  However, without knowing those details, any attempt to dereference
 * such a pointer will result in a compilation error.  This is an example of
 * "data hiding", which is a good principle to follow in designing and
 * implementing data abstractions.
 */
struct hash_table;

/*
 * Requires:
 *  "load_factor" must be greater than zero.
 *
 * Effects:
 *  Creates a hash table with the upper bound "load_factor" on the average
 *  length of a collision chain.  Returns a pointer to the hash table if it
 *  was successfully created and NULL if it was not.
 */
struct hash_table *hash_table_create(double load_factor);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Destroys a hash table, but first if the function pointer "destructor" is
 *  not NULL calls "destructor" on each mapping in the hash table.  The
 *  mapping's key, its value, and the latest value of the pointer "cookie"
 *  are passed to "destructor" as arguments.  The return value is assigned to
 *  "cookie", becoming "cookie"'s latest value.  In other words,
 *      cookie = destructor(key, value, cookie);
 *  The order in which the mappings are passed to "destructor" is not
 *  defined.  Returns "cookie"'s latest value.
 */
void *hash_table_destroy(struct hash_table *ht,
    void *(*destructor)(const char *key, void *value, void *cookie),
    void *cookie);

/*
 * Requires:
 *  "key" is not already in "ht".
 *  "value" is not NULL.
 *
 * Effects:
 *  Creates an association or mapping from the string "key" to the pointer
 *  "value" in the hash table "ht".  Returns 0 if the mapping was successfully
 *  created and -1 if it was not.
 */
int hash_table_insert(struct hash_table *ht, const char *key, void *value);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Calls "function" on each mapping in the hash table, passing the mapping's
 *  key, its value, and the latest value of the pointer "cookie" as the
 *  arguments.  The return value is assigned to "cookie", becoming "cookie"'s
 *  latest value.  In other words,
 *      cookie = function(key, value, cookie);
 *  The order in which the mappings are passed to "function" is not defined.
 *  Returns "cookie"'s latest value.
 */
void *hash_table_iterate(struct hash_table *ht,
    void *(*function)(const char *key, void *value, void *cookie),
    void *cookie);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Searches the hash table "ht" for the string "key".  If "key" is found,
 *  returns its associated value.  Otherwise, returns NULL.
 */
void *hash_table_lookup(struct hash_table *ht, const char *key);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Searches the hash table "ht" for the string "key".  If "key" is found,
 *  removes its mapping from "ht".  If the function pointer "destructor" is
 *  not NULL, calls "destructor" with the mapping's key, its value, and the
 *  pointer "cookie" as the arguments.  The return value from "destructor" is
 *  assigned to "cookie".  In other words,
 *      cookie = destructor(key, value, cookie);
 *  Returns "cookie"'s value.
 */
void *hash_table_remove(struct hash_table *ht, const char *key,
    void *(*destructor)(const char *key, void *value, void *cookie),
    void *cookie);

The Provided Hash Table Implementation

Before you begin the exercises for this lab, we'll discuss the data structures and their initialization that you'll be working with.

Data Structures

The file hash_table.c contains the following data structure definitions:

/*
 * A hash table mapping:
 *
 *  Stores the association or mapping from a string "key" to its opaque
 *  "value".  Each mapping is an element of a singly-linked list.  This list
 *  is called a collision chain.
 */
struct hash_table_mapping {
        /*
         * The "key" is a pointer to a string.
         */
        const char *key;
        /*
         * The "value" is a pointer to an object of an unknown type, because
         * the hash table doesn't need to know the type in order to store and
         * return the pointer.
         */
        void *value;
        /*
         * The pointer to the next element in the same collision chain.
         */
        struct hash_table_mapping *next;
};

/*
 * A hash table:
 *
 *  Stores a collection of "key"-to-"value" mappings.  Each mapping is an
 *  element of a collision chain.  For efficiency, the number of collision
 *  chains is kept in proportion to the number of mappings so that the average
 *  length of a collision chain remains constant no matter how many mappings
 *  the hash table contains.
 */
struct hash_table {
        /*
         * The array of collision chains.  Really, this is a pointer to an
         * array of pointers.  Each element of that array is the head of a
         * collision chain.
         */
        struct hash_table_mapping **head;
        /*
         * The number of collision chains.
         */
        unsigned int size;
        /*
         * The number of mappings in the hash table.
         */
        unsigned int occupancy;
        /*
         * The upper bound on the average collision chain length that is
         * allowed before the number of collision chains is increased.
         */
        double load_factor;
};

Initialization

And the file hash_table.c contains the following function for initializing a hash table (note the use the function calloc() in this code for allocating and initializing memory):

/*
 * The initial number of collision chains in a hash table:
 */
#define INITIAL_SIZE    2

/*
 * Requires:
 *  "load_factor" must be greater than zero.
 *
 * Effects:
 *  Creates a hash table with the upper bound "load_factor" on the average
 *  length of a collision chain.  Returns a pointer to the hash table if it
 *  was successfully created and NULL if it was not.
 */
struct hash_table *
hash_table_create(double load_factor)
{
        struct hash_table *ht;

        assert(load_factor > 0.0);
        ht = malloc(sizeof(struct hash_table));
        if (ht == NULL)
                return (NULL);
        /*
         * Allocate and initialize the hash table's array of pointers.  In
         * contrast to malloc(), calloc() initializes each of the returned
         * memory locations to zero, effectively initializing the head of
         * every collision chain to NULL.
         */
        ht->head = calloc(INITIAL_SIZE, sizeof(struct hash_table_mapping *));
        if (ht->head == NULL) {
                free(ht);
                return (NULL);
        }
        ht->size = INITIAL_SIZE;
        ht->occupancy = 0;
        ht->load_factor = load_factor;
        return (ht);
}

GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to the starter code for the lab at:

https://classroom.github.com/a/FgyBjLki

Follow the same steps as for previous labs and assignments to to create your repository on GitHub and to then clone it onto CLEAR. The directory for your repository for this lab will be

lab-5-hash-tables-name

where name is your GitHub userid.


Submission

As always, be sure to git push the appropriate C source files for this lab before 11:55 PM tonight, to get credit for this lab.