Lab 5 - Hash Tables

Lab goals:


GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this starting link.
Follow the previous labs and assignments protocol to get your repository link. The repository for this lab is
  lab-5-hash-tables-[YOUR GITHUB ID]

An Introduction to Hash Tables

A hash table is an efficient data structure for maintaining a collection of "key"-to-"value" mappings. The operations supported by a hash table typically include: insert a "key"-to-"value" mapping into the collection, lookup a mapping with the specified "key", and remove a mapping with the specified "key". A hash table is also sometimes referred to as a Hash Map (as in Java) or a Dictionary (as in Python).

A Strawman Approach

An ordinary array can be used to implement a collection of "key"-to-"value" mappings. A simple function can be used to compute a unique array index from a given key, and the associated value can be stored at that location within the array. Generally speaking, this approach is easily implemented and the operations, insert, lookup, and remove, run very quickly.

However, this approach has a major drawback. The size of the array has to be equal to the number of possible keys. In most cases, it is just not feasible to allocate such a huge array. Moreover, even if it is possible to allocate an array of that size, since the number of keys which are actually in use is usually much smaller, the array is very inefficiently used.

For example, suppose that an array is used to store something using each of the words in a dictionary as keys. Then, the array size has to be big enough to accommodate all possible combinations of characters for some maximum word length. However, only a fraction of character combinations are valid words, and the rest of the space in the array is wasted.

A Better Approach

A hash table eliminates the wasted memory of this strawman approach without sacrificing much in simplicity or speed. A hash table also uses a function, called a hash function, that computes an array index from a given key and stores the corresponding value at that location within the array. But, unlike the functions used in the case of an ordinary array, a hash function doesn't need to compute a unique array index for every possible key. Instead, a hash table allows more than one key to use the same location within the array. When this happens, it is called a collision. One approach to handling collisions is to create a linked list of all the mappings that collide. This is called chaining.

Clearly, collisions are unavoidable when the number of mappings stored in a hash table is greater than the size of the hash table's array. In general, if the array size is never changed, as the number of stored mappings increases, so will the number of collisions and the average length of the collision chain linked lists. Adding a new mapping to the hash table is unaffected because it just involves computing the array index from the key using the hash function and adding the mapping at the front of the linked list. However, longer linked lists will result in lookup and remove operations taking longer. Therefore, a hash table can be made to dynamically grow (or shrink) the size of the array in proportion to the number of stored mappings. This approach maintains the average length of the linked lists below a desired threshold called the load factor. Given a hash table array of length m that stores n mappings, the load factor is defined as n/m.

Hash Table

Hash Functions

Given a key k and a hash function h, the associated value for key k is stored in the index h(k) within the hash table's array. A good hash function generally maps the keys uniformly across the range of array indices and thereby generally keeps collisions to a minimum. For some uses, the set of keys that will be encountered may already naturally have an approximately uniform distribution, but in many uses, this is not the case. For example, for the set of words in English, many more words start with an 'a' than with an 'x', etc.

Below is the hash function used in this lab. It takes a string as its argument and returns an unsigned integer as the hash value for that key.

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Computes the Fowler / Noll / Vo (FNV) hash of the string "key".
 */
static unsigned int
hash_value(const char *key)
{
        unsigned int c, hval = 33554467U;       /* A "magic" number. */

        /*
         * The following cast is meant to prevent the sign extension of
         * "*string"'s value before it is assigned to "c".  For example,
         * suppose that "*string" is 0xbc.  Then, without the cast, "c" would
         * be assigned 0xffffffbc instead of 0x000000bc.  The former value,
         * 0xffffffbc, is problematic because "hval ^= c" is only supposed
         * to change the least significant 8 bits of "hval".
         */ 
        while ((c = *(const unsigned char *)key++) != 0) {
                hval *= 0x01000193U;            /* A "magic" prime number. */
                hval ^= c;
        }
        return (hval);
}

Clearly, this function can return values that are much larger than the array size of most hash tables. Thus, the hash table actually uses as the array index the remainder after dividing the resulting hash value by the array size (i.e., array index = hash value modulo array size).


Function Pointers

The hash table interface described here uses a type of pointer, a function pointer, that we haven't used before in this class. For example, the second argument to the function hash_table_destroy() (described below) is a function pointer:

    void *(*destructor)(const char *key, void *value, void *cookie)

The type in this declaration of destructor can be read easily according to the meaning of the different C operators that it uses and the precedence of those operators:

Just read all of those parts together and you have what the type in this declaration of destructor means: "pointer to function taking these arguments and returning pointer to void". Compare the type in this declaration to the type in the following six declarations:
    int a;
    int *b;
    int c(char *x);
    int *d(char *x);
    int (*e)(char *x);
    int *(*f)(char *x);

Thus, this expression declares that destructor is a pointer that can point to any function with the following signature (i.e., with this return type and these argument types):

    void *
    function(const char *key, void *value, void *cookie)
    {
            ...
    }

A good resource for helping with this is https://cdecl.org/.

In C, you can use the name of a function in a way analogous to how you can use the name of an array. In other words, whenever you use the name of the function without calling it, that is, without the "(" ... arguments ... ")", you get the address of that function. Think of it as meaning &function.  Here is an example from the main part of today's programming exercise that passes the address of the function free_key to the function hash_table_destroy:

void *
free_key(const char *key, void *value, void *cookie)
{

        /* A silly assignment to prevent a compilation warning. */
        value = value;
        /* A cast is needed here to remove the const attribute from "key". */
        free((char *)key);
        return (cookie);
}

...

int
main(int argc, char **argv)
{
        ...
        /* Ignore the return value, which is NULL. */
        (void)hash_table_destroy(ht, free_key, NULL);
        ...
}

Finally, calling a function through a pointer is easy. A call looks a lot like the declaration. Again, here is an example from today's programming exercise:

void *
hash_table_destroy(struct hash_table *ht,
    void *(*destructor)(const char *key, void *value, void *cookie),
    void *cookie)
{
        struct hash_table_mapping *elem, *next;
        ...
        cookie = (*destructor)(elem->key, elem->value, cookie);
        ...
}

In effect, if destructor is a pointer to the function, then *destructor "is" the function (not just the pointer), and the parentheses around this part of the expression cause it to be evaluated before the rest of the expression (i.e., the actual calling of the function).


Hash Table Interface

Below is the interface for the hash table that you'll be working with in this lab. This interface is contained in the header file hash_table.h.

This interface definition also illustrates how the principle of "data hiding" (also known as "information hiding") can be realized in C. In particular, the implementation details of the hash table are hidden from its callers, and the hash table doesn't need to know how the memory was allocated for the keys and values that it stores.

The hash table also allows a user-defined cookie to be passed among functions. The cookie is a void pointer, allowing you to pass a pointer to any type of data you wish. This enables you to provide additional data when removing items from the hash table and/or iterating over the hash table. You will see one example of how to use such user data in next week's lab.

/*- -*- mode: c; c-basic-offset: 8; -*-
 *
 * This file defines the interface for a hash table, which is an efficient
 * data structure for maintaining a collection of "key"-to-"value" mappings.
 * This hash table, in particular, maintains mappings from strings to opaque
 * objects.
 *
 * This file is placed in the public domain by its authors, Alan L. Cox and
 * Kaushik Kumar Ram.
 */

/*
 * Declare the hash table struct type without exposing its implementation
 * details, because the callers to the following functions don't need to know
 * those details in order to declare, assign, or pass a pointer to this struct
 * type.  However, without knowing those details, any attempt to dereference
 * such a pointer will result in a compilation error.  This is an example of
 * "data hiding", which is a good principle to follow in designing and
 * implementing data abstractions.
 */
struct hash_table;

/*
 * Requires:
 *  "load_factor" must be greater than zero.
 *
 * Effects:
 *  Creates a hash table with the upper bound "load_factor" on the average
 *  length of a collision chain.  Returns a pointer to the hash table if it
 *  was successfully created and NULL if it was not.
 */
struct hash_table *hash_table_create(double load_factor);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Destroys a hash table, but first if the function pointer "destructor" is
 *  not NULL calls "destructor" on each mapping in the hash table.  The
 *  mapping's key, its value, and the latest value of the pointer "cookie"
 *  are passed to "destructor" as arguments.  The return value is assigned to
 *  "cookie", becoming "cookie"'s latest value.  In other words,
 *      cookie = destructor(key, value, cookie);
 *  The order in which the mappings are passed to "destructor" is not
 *  defined.  Returns "cookie"'s latest value.
 */
void *hash_table_destroy(struct hash_table *ht,
    void *(*destructor)(const char *key, void *value, void *cookie),
    void *cookie);

/*
 * Requires:
 *  "key" is not already in "ht".
 *  "value" is not NULL.
 *
 * Effects:
 *  Creates an association or mapping from the string "key" to the pointer
 *  "value" in the hash table "ht".  Returns 0 if the mapping was successfully
 *  created and -1 if it was not.
 */
int hash_table_insert(struct hash_table *ht, const char *key, void *value);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Calls "function" on each mapping in the hash table, passing the mapping's
 *  key, its value, and the latest value of the pointer "cookie" as the
 *  arguments.  The return value is assigned to "cookie", becoming "cookie"'s
 *  latest value.  In other words,
 *      cookie = function(key, value, cookie);
 *  The order in which the mappings are passed to "function" is not defined.
 *  Returns "cookie"'s latest value.
 */
void *hash_table_iterate(struct hash_table *ht,
    void *(*function)(const char *key, void *value, void *cookie),
    void *cookie);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Searches the hash table "ht" for the string "key".  If "key" is found,
 *  returns its associated value.  Otherwise, returns NULL.
 */
void *hash_table_lookup(struct hash_table *ht, const char *key);

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Searches the hash table "ht" for the string "key".  If "key" is found,
 *  removes its mapping from "ht".  If the function pointer "destructor" is
 *  not NULL, calls "destructor" with the mapping's key, its value, and the
 *  pointer "cookie" as the arguments.  The return value from "destructor" is
 *  assigned to "cookie".  In other words,
 *      cookie = destructor(key, value, cookie);
 *  Returns "cookie"'s value.
 */
void *hash_table_remove(struct hash_table *ht, const char *key,
    void *(*destructor)(const char *key, void *value, void *cookie),
    void *cookie);

Hash Table Implementation

Before you begin the exercises for the lab this week, we'll discuss the data structures that you'll be working with.

Data Structures

The file hash_table.c contains the following data structure definitions.

/*
 * A hash table mapping:
 *
 *  Stores the association or mapping from a string "key" to its opaque
 *  "value".  Each mapping is an element of a singly-linked list.  This list
 *  is called a collision chain.
 */
struct hash_table_mapping {
        /*
         * The "key" is a pointer to a string.
         */
        const char *key;
        /* 
         * The "value" is a pointer to an object of an unknown type, because
         * the hash table doesn't need to know the type in order to store and
         * return the pointer.
         */
        void *value;
        /*
         * The pointer to the next element in the same collision chain.
         */
        struct hash_table_mapping *next; 
};

/*
 * A hash table:
 *
 *  Stores a collection of "key"-to-"value" mappings.  Each mapping is an
 *  element of a collision chain.  For efficiency, the number of collision
 *  chains is kept in proportion to the number of mappings so that the average
 *  length of a collision chain remains constant no matter how many mappings
 *  the hash table contains.
 */
struct hash_table {
        /* 
         * The array of collision chains.  Really, this is a pointer to an
         * array of pointers.  Each element of that array is the head of a
         * collision chain.
         */
        struct hash_table_mapping **head;
        /*
         * The number of collision chains.
         */
        unsigned int size;
        /*
         * The number of mappings in the hash table.
         */
        unsigned int occupancy;         
        /*
         * The upper bound on the average collision chain length that is
         * allowed before the number of collision chains is increased.
         */
        double load_factor;
};

Initialization

The file hash_table.c contains the following function for initializing a hash table. We are showing you this function as an example of the use of calloc() to allocate and initialize memory.

/*
 * The initial number of collision chains in a hash table:
 */
#define INITIAL_SIZE    2

/*
 * Requires:
 *  "load_factor" must be greater than zero.
 *
 * Effects:
 *  Creates a hash table with the upper bound "load_factor" on the average
 *  length of a collision chain.  Returns a pointer to the hash table if it
 *  was successfully created and NULL if it was not.
 */
struct hash_table * 
hash_table_create(double load_factor)
{
        struct hash_table *ht;

        assert(load_factor > 0.0);
        ht = malloc(sizeof(struct hash_table));
        if (ht == NULL)
                return (NULL);
        /* 
         * Allocate and initialize the hash table's array of pointers.  In
         * contrast to malloc(), calloc() initializes each of the returned
         * memory locations to zero, effectively initializing the head of
         * every collision chain to NULL.  
         */
        ht->head = calloc(INITIAL_SIZE, sizeof(struct hash_table_mapping *));
        if (ht->head == NULL) {
                free(ht);
                return (NULL);
        }
        ht->size = INITIAL_SIZE;
        ht->occupancy = 0;
        ht->load_factor = load_factor;
        return (ht);
}

Hash Table Exercises

Preliminaries

As a means to test your work on implementing the hash table, we provide you with a program that implements a rudimentary spelling checker that uses your hash table. You can find the source code files for this program in your repository as spell_check.c. To build the program, use the Unix command:

    make
Using make will first compile the file hash_table.c into hash_table.o and spell_check.c into spell_check.o. Then it will link the result of these two compilations (i.e., the two .o files) to create the final executable file spell_check. Here is how the file spell_check.c describes itself (as indicated using "..." below, you can use various combinations of the different options below on the command line at the same time):

/*- -*- mode: c; c-basic-offset: 8; -*-
 *
 * This file implements a rudimentary spelling checker using a hash table to
 * implement the dictionary.  The command-line options include:
 *
 *  spell_check ... -f  ...
 *
 * to use the specified dictionary,
 *
 *  spell_check ... -p ...
 *
 * to print all of the words in the dictionary,
 *
 *  spell_check ... -r  ...
 *
 * to remove the specified word from the in-memory copy of the dictionary, and
 *
 *  spell_check ... document1 [document2] ...
 *
 * to report the spelling errors in each of the documents.
 */

Insertion and Iterate

Exercise #1

Complete the implementation of the functions hash_table_insert() and hash_table_iterate(). You can test your work with the command:

./spell_check -f small_dictionary -p
Loaded 14 words from the dictionary "small_dictionary".
Neptune
Saturn
Jupiter
Mars
Venus
Mercury
rectangle
square
green
red
Uranus
triangle
circle
blue

This command tells spell_check to use the dictionary file small_dictionary (and thus to insert each word in this dictionary into your hash table)
and to then print all of the words in the dictionary (by iterating over the hash table).

Search

Exercise #2

Complete the implementations of the function hash_table_lookup(). You can test your work with the command:

./spell_check -f small_dictionary small_dictionary
Loaded 14 words from the dictionary "small_dictionary".
Processing the document "small_dictionary":
Every word is in the dictionary.
Processed 14 words from the document "small_dictionary".

This command tells spell_check to use the dictionary small_dictionary and to then check the spelling for all of the words in the file small_dictionary.

Deletion

Exercise #3

Complete the implementation of the function hash_table_remove(). You can test your work with the command:

./spell_check -f small_dictionary -r Uranus -p
Loaded 14 words from the dictionary "small_dictionary".
Neptune
Saturn
Jupiter
Mars
Venus
Mercury
rectangle
square
green
red
triangle
circle
blue

Resize Hash Table

Exercise #4

Complete the implementations of the function hash_table_resize(). You can test your work with the command:

./spell_check -f /usr/share/dict/words /usr/share/dict/words
Loaded 479828 words from the dictionary "/usr/share/dict/words".
Processing the document "/usr/share/dict/words":
Every word is in the dictionary.
Processed 479828 words from the document "/usr/share/dict/words".

Submission

Be sure to git push your lab before 11:55PM on Saturday at the end of this week.