Lab 6 - Advanced Linked Lists

Lab goals:


GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this starting link.
Follow the protocol from previous labs and assignments to get your repository link and clone the repository on CLEAR. The repository for this lab is
  lab-6-advanced-linked-lists-[YOUR GITHUB ID]

PLEASE NOTE: If, using make, you get an error such as

      *** missing separator (did you mean TAB instead of 8 spaces?).  Stop.

then please save this Makefile to your directory on CLEAR to replace the Makefile copy you got from GitHub (do not use copy-and-paste to transfer the contents of this file). There has been a problem with this file on GitHub this week, and this should work around that problem.


More on Resizing Hash Tables

As described in last week's lab, hash tables are often dynamically resized as the occupancy of the hash table increases, to reduce the length of the collision chains. Shorter collision chains enable faster lookups and deletions.

For example, below we illustrate the impact of resizing hash tables in the spell checking program from last week's lab. The figures below show the hash table structure after all 479623 words from /usr/share/dict/words have been inserted into the hash table, with and without table resizing.

Without Resizing

When the hash table is not resized from its initial size of 2, the result is two huge collision chains with 239664 and 239959 elements, respectively. Clearly, traversing these linked lists to perform lookups and deletions is going to be very slow.

[Figure: hash table structure without resizing]

With Resizing

When hash table resizing is enabled, the hash table's size grows from 2 to 524288, as the words are inserted. After all 479623 words have been inserted into the hash table, the average size of the collision chains is only 1.53. This enables very fast hash table lookups and deletions.

[Figure: hash table structure with resizing]
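The redistribution that a resize performs can be sketched as follows. This is only an illustration, not the lab's hash_table.c: the node type, the key field, and the modulo hash are stand-ins for whatever the real implementation uses. The key idea is that every node from every old chain is rehashed with the new size and pushed onto its new chain, so growing the table shortens the average chain.

```c
#include <stdlib.h>

/* A minimal collision-chain node; the field names are illustrative. */
struct node {
	unsigned int key;
	struct node *next;
};

/*
 * Redistribute all nodes from the old chain array into a new array of
 * new_size chains.  Returns the new array, or NULL on allocation
 * failure (leaving the old array untouched).
 */
static struct node **
resize(struct node **old_head, unsigned int old_size, unsigned int new_size)
{
	struct node **new_head;
	struct node *elem, *next;
	unsigned int index, new_index;

	new_head = calloc(new_size, sizeof(struct node *));
	if (new_head == NULL)
		return (NULL);
	for (index = 0; index < old_size; index++) {
		for (elem = old_head[index]; elem != NULL; elem = next) {
			next = elem->next;
			/* Rehash with the new table size. */
			new_index = elem->key % new_size;
			/* Push onto the front of the new chain. */
			elem->next = new_head[new_index];
			new_head[new_index] = elem;
		}
	}
	free(old_head);
	return (new_head);
}
```

Note that resizing visits every element once, but because it happens only when the occupancy crosses a threshold, its cost is amortized across many insertions.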


Exercise #1

The file spell_check_lab5.c in the repository for this lab is a copy of the spell checking program from last week's lab, and the file hash_table.c is a completed implementation of the hash table from that lab.

Use the Unix command

	make spell_check_lab5

to make the program spell_check_lab5 by compiling and linking these two source files together.
Then run the following test to see how quickly it finishes:

	./spell_check_lab5 -f /usr/share/dict/words /usr/share/dict/words

Now, in the provided hash_table.c code, temporarily re-insert the lines

	      /* ELIMINATE THIS: */
	      return (0);

into hash_table_resize(), so that the function returns immediately and hash table resizing is disabled.
The resulting code should look like:

static int 
hash_table_resize(struct hash_table *ht, unsigned int new_size)
{
        struct hash_table_mapping **new_head;
        struct hash_table_mapping *elem, *next;
        unsigned int index, new_index;

        assert(new_size > 0);
        /*
         * Does the hash table already have the desired number of collision
         * chains?  If so, do nothing.
         */
        if (ht->size == new_size)
                return (0);
        /* ELIMINATE THIS: */
        return (0);

	...

Again use the Unix command

	make spell_check_lab5

to remake the program spell_check_lab5, and then run the same test as above to observe its performance without hash table resizing:

	./spell_check_lab5 -f /usr/share/dict/words /usr/share/dict/words

Note that this test will run for a long time. When you get tired of running it, you can terminate it by typing a control-C character.

Now remove the lines from hash_table_resize() that you inserted above, to reenable hash table resizing.


PLEASE NOTE: The remaining exercises in this lab use spell_check.c, not spell_check_lab5.c. Only Exercise #1 uses spell_check_lab5.c.

Also be sure you have removed the lines from hash_table.c that you inserted in Exercise #1 above.


Keeping Counts of Dictionary Word Usage

The goal here is to modify the spell checking program to also maintain a usage count for each dictionary word as it spell checks the input documents. Moreover, we want the program to keep the dictionary words ordered by their usage counts and then to print the dictionary words in that order after all input documents have been checked.

To simplify the task, we will maintain the order only for usage count values of up to MAX_COUNT - 1 (i.e., only for words each used up to that many times). Usage count values greater than or equal to MAX_COUNT will not be ordered and will instead all be treated equally.

The implementation of word counting in spell_check.c uses an array of MAX_COUNT + 1 elements. Each element is the dummy head of a circular doubly-linked list, and each list corresponds to a usage count value that is equal to the array index value. That is, words with a usage count value of x are stored within the circular doubly-linked list whose head is at entry x in the array. However, words whose usage count is greater than or equal to MAX_COUNT are all stored together in the circular doubly-linked list whose head is at entry MAX_COUNT in the array. The benefit of using circular doubly-linked lists in maintaining the usage count data is that when the usage count of a word increases, the word can be removed from the list corresponding to its old usage count in constant time, and can then be inserted (also in constant time) into the list corresponding to the word's new usage count.

The resulting usage counting data structure looks like:



/*
 * A doubly linked list node structure for maintaining word 
 * usage count statistics.
 */
struct word_data {
	const char *word;
	unsigned int count;
	struct word_data *next;
	struct word_data *prev;
};

/*
 * Requires:
 *  Nothing.
 *
 * Effects:
 *  Allocates and initializes an array of struct word_data,
 *  with each entry representing the dummy head of a circular 
 *  doubly linked list. Each list stores words with a particular usage 
 *  count value equal to the array index value. Words with usage count
 *  greater than or equal to MAX_COUNT are all stored in the list at the
 *  array index MAX_COUNT.
 */
struct word_data *
init_word_count_array(void)
{
	struct word_data *wda;
	int i;

	wda = malloc(sizeof(struct word_data) * (MAX_COUNT + 1));
	if (wda == NULL)
		return (NULL);
	for (i = 0; i <= MAX_COUNT; i++) {
		wda[i].word = NULL;
		wda[i].count = i;
		wda[i].next = &wda[i];
		wda[i].prev = &wda[i];
	}
	return (wda);
}
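The rule for which list a word's node belongs in follows directly from its count: the index is the count itself, clamped at MAX_COUNT. A small helper (the name count_index and the MAX_COUNT value below are illustrative only, not part of the lab's code) makes the clamping explicit:

```c
#define MAX_COUNT 32	/* illustrative value; spell_check.c defines its own */

/*
 * Map a usage count to its array index: counts of MAX_COUNT or more
 * all share the final list at index MAX_COUNT.
 */
static unsigned int
count_index(unsigned int count)
{
	return (count >= MAX_COUNT ? MAX_COUNT : count);
}
```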
Exercise #2

Complete the implementation of the function init_word_data(). Also, complete the function call to hash_table_insert() in main() such that copy and wd are stored as a (key, value) pair in the hash table. Use the Unix command make to make spell_check. You can test your work with the command:

   ./spell_check -f small_dictionary small_dictionary
   Loaded 14 words from the dictionary "small_dictionary".
   Processing the document "small_dictionary":
   Every word is in the dictionary.
   Processed 14 words from the document "small_dictionary".
   Printing words sorted by usage count

Circular Doubly-Linked List Insertion

The figures below illustrate the sequence of steps involved in the general case when inserting a new element at the beginning (after the head) of a circular doubly-linked list:






Inserting into an Empty List

The figures below illustrate the sequence of steps involved when inserting the first element into an empty circular doubly-linked list (this is not in any way actually a special case!):






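The insertion steps illustrated above can be sketched on a generic node type. This is a stand-alone illustration, not the lab's insert_word_data_after(): the struct dnode type and the function name are stand-ins. The order of the four pointer updates matters, and because the list is circular with a dummy head, exactly the same four updates handle the empty list.

```c
/* A minimal circular doubly-linked list node; names are illustrative. */
struct dnode {
	int value;
	struct dnode *next;
	struct dnode *prev;
};

/*
 * Insert "elem" immediately after "target".  Note that target->next
 * must not be overwritten until the old successor's prev pointer has
 * been updated.
 */
static void
insert_after(struct dnode *target, struct dnode *elem)
{
	elem->next = target->next;
	elem->prev = target;
	target->next->prev = elem;
	target->next = elem;
}
```

For an empty list the dummy head points to itself, so target->next is the head itself and the same code links the new element in correctly.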
Exercise #3

Complete the implementation of the function insert_word_data_after() to insert the new element wd into the circular doubly-linked list after the given element target. You can test your work with the following command (note that the code is still not functioning properly because the word_data structures are not yet actually being used for counting):

    ./spell_check -f small_dictionary small_dictionary
    Loaded 14 words from the dictionary "small_dictionary".
    Processing the document "small_dictionary":
    Every word is in the dictionary.
    Processed 14 words from the document "small_dictionary".
    Printing words sorted by usage count
    Neptune appeared 0 time(s)
    Uranus appeared 0 time(s)
    Saturn appeared 0 time(s)
    Jupiter appeared 0 time(s)
    Mars appeared 0 time(s)
    Venus appeared 0 time(s)
    Mercury appeared 0 time(s)
    rectangle appeared 0 time(s)
    square appeared 0 time(s)
    triangle appeared 0 time(s)
    circle appeared 0 time(s)
    blue appeared 0 time(s)
    green appeared 0 time(s)
    red appeared 0 time(s)

Circular Doubly-Linked List Deletion

The figures below illustrate the sequence of steps involved in the general case when deleting an element from a circular doubly-linked list:





Deleting The Last Element in the List

The figures below illustrate the sequence of steps involved when deleting the last element from a circular doubly-linked list (this is not in any way actually a special case!):





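The deletion steps illustrated above can likewise be sketched on a generic node type (again a stand-in, not the lab's remove_word_data()). Only the neighbors' pointers need to change, and the dummy head guarantees that every element has two neighbors, so removing the last element is not a special case either.

```c
#include <stddef.h>

/* A minimal circular doubly-linked list node; names are illustrative. */
struct dnode {
	int value;
	struct dnode *next;
	struct dnode *prev;
};

/*
 * Unlink "elem" from its list in constant time.  Clearing the removed
 * node's pointers is optional hygiene that helps catch stale uses.
 */
static void
remove_node(struct dnode *elem)
{
	elem->prev->next = elem->next;
	elem->next->prev = elem->prev;
	elem->next = NULL;
	elem->prev = NULL;
}
```

This constant-time unlink is what makes the count-array scheme efficient: incrementing a word's count is just a remove_node() from one list followed by an insert after another list's head.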
Exercise #4

Complete the implementation of the function remove_word_data(). Also, complete the missing code in main() to update the word's usage count and move its word_data structure to the doubly-linked list for its new usage count. You can test your work with the following command:

    ./spell_check -f small_dictionary small_dictionary
    Loaded 14 words from the dictionary "small_dictionary".
    Processing the document "small_dictionary":
    Every word is in the dictionary.
    Processed 14 words from the document "small_dictionary".
    Printing words sorted by usage count
    Neptune appeared 1 time(s)
    Uranus appeared 1 time(s)
    Saturn appeared 1 time(s)
    Jupiter appeared 1 time(s)
    Mars appeared 1 time(s)
    Venus appeared 1 time(s)
    Mercury appeared 1 time(s)
    rectangle appeared 1 time(s)
    square appeared 1 time(s)
    triangle appeared 1 time(s)
    circle appeared 1 time(s)
    blue appeared 1 time(s)
    green appeared 1 time(s)
    red appeared 1 time(s)

You may also want to try testing with a larger dictionary such as /usr/share/dict/words and spell checking a larger, more interesting text document such as the file test_file.txt that is in the repo for this lab:

    ./spell_check -f /usr/share/dict/words test_file.txt

There are 101 words (46 different words) in test_file.txt; the word that appears most frequently in it ("this") has a usage count of 7.

If you do not want words with usage count 0 to be printed, you can change the initialization in the final for loop in main() to use i = 1 instead of i = 0.

Writing All Dictionary Word Usage Count Statistics to an Output File

Note that we have added a new command line option -o <output_file> to spell_check. The purpose of this option is to write each word in the dictionary together with its usage count to the specified output file. This output is implemented by passing the address of the output_word_data() function to hash_table_iterate(), and passing as the cookie the FILE * for the opened, writable output_file.
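The callback-plus-cookie pattern can be sketched as follows. This is only an illustration: the lab's actual callback prototype is declared in hash_table.h, and the struct name, parameter order, and function name below are assumptions, not the real interface.

```c
#include <stdio.h>

/* An illustrative value type standing in for the lab's word_data. */
struct word_data_ex {
	const char *word;
	unsigned int count;
};

/*
 * Called once per stored value by the table iterator.  The cookie is
 * the FILE * for the opened, writable output file; each call writes
 * one "word count" line.
 */
static void
output_word_data_sketch(void *value, void *cookie)
{
	struct word_data_ex *wd = value;
	FILE *fp = cookie;

	fprintf(fp, "%s %u\n", wd->word, wd->count);
}
```

The cookie lets the same iterator serve many purposes: the iteration code never needs to know what the callback does with each element, and the callback never needs global state to find its output file.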

PLEASE NOTE: Again, using hash_table_iterate() will not print the words in sorted order.

Exercise #5

Complete the implementation of the function output_word_data(). You can test your work with the following command:

   ./spell_check -o output -f small_dictionary small_dictionary

The file output should then contain:

Uranus 1
blue 1
Mercury 1
Venus 1
Saturn 1
red 1
green 1
Neptune 1
triangle 1
Mars 1
square 1
rectangle 1
Jupiter 1
circle 1

Now test the same thing, using a dictionary of /usr/share/dict/words and checking the spelling for test_file.txt:

   ./spell_check -o output2 -f /usr/share/dict/words test_file.txt

Check the output file output2.



Submission

Be sure to git push your lab before 11:55PM on Saturday at the end of this week.