Dictionaries and Hashtables

[an error occurred while processing this directive] [an error occurred while processing this directive] Dictionaries and Hashtables

COMP 200

Elements of Computer Science

Hash Tables

Dictionaries

A dictionary is a data structure used to represent the one-way relationship between entities in one set and entities in another set.

A dictionary is a "mapping" that enables captures the association of a key to a value. For every key, there is an associated value, so if one where given a particular key, one could find the associated value. Do note that a dictionary can only represent the one-way relationship of a key to a value and not the other way around.

key-value mapping

Conversely, for purposes of designing a computer program, if our problem contains entities that can be related by a one-way relationship then a dictionary could be very useful for representing that relationship in our program. Note that two-way relationships can sometimes be represented by two one-way relationships.

Hash Tables

There are many different ways in which one could implement a dictionary. In Scheme, a dictionary is implemented using a specific technique called a "hash table" which is very fast and memory efficient. Unfortunately, the internal algorithms of a hash table are beyond the scope of this course. For more information, you can see, for instance, the Wikipedia article on hash tables or simply ask the instructors, who will be happy to chew your ear off on everything you'd ever want to know about hash tables!

For the remainder of this discussion, since we will be limited to Scheme code, the terms "dictionary" and "hash table" will be synonymous, but bear in mind that a hash table is a dictionary, but not necessarily the other way around.

IMPORTANT CODING INFORMATION:

To use hash tables in DrScheme, you must force DrScheme into a special "Module" mode by including the following lines at the top of your definitions:

#lang Scheme
;; Language level = Module

Unfortunately, you will lose the niceties of the student language levels of DrScheme, such as testing with check-expect, but hey, this is the big leagues now!

Hash Table Operations in DrScheme

Creating an empty hash table:

(make-hash) -- returns an empty hash table

Adding a NEW key-value pair to a hash table:

(hash-set! ht key value) -- sets the given key and value in the given hash table, ht. If the key already exists in the hash table, the current value is replaced with the new value.

Retrieving the value for a given key in a hash table:

(hash-ref ht key) -- returns the value associated with the given key in the given hash table, ht. An exception is raised if the key does not exist in the hash table.

Removing a key-value pair:

(hash-remove! ht key) -- removes the key and its associated value from the given hash table, ht.

Update the value for a given key in a hash table:

(hash-update! ht key existing_key_fn nonexisting_key_fn) -- For the given hash table, ht,

If the given key already exists in the hash table: The new value for that key is the return value of the function existing_key_fn when given the input parameter of the currently associated value for that key. That is, new_value = (existing_key_fn old_value).
If the given key does not yet exist in the hash table: A new key-value pair is created where the value for that key is set to the return value of the function nonexisting_key_fn and then the value is updated as if it already existing (see above). That is, new_value = (existing_key_fn (nonexisting_key_fn)).

Either named functions or lambdas can be used for the existing_key_fn and nonexisting_key_fn functions, so long as they satisfy the following contracts:

existing_key_fn: any --> any
nonexisting_key_fn: --> any

Processing every element of a hash table:

(hash-map ht aFunc) --> (list (aFunc key1 value1) (aFunc key2 value2) (aFunc key3 value3) ...) -- returns a list of values resulting from executing the aFunc for every every key-value pair in the hash table, with the key and value as input parameters. IMPORTANT: The order of the key-value pairs processed by hash-map is NOT guaranteed!

The contract for aFunc is thus

aFunc: any any --> any

(hash-for-each ht aFunc) -- Does the same thing as hash-map, but does not create an output list of results. There is no return value for this hash-for-each, as this function is used exclusively for "side-effects" of aFunc, such as updating or setting the values in the hash table. In such, the contract for aFunc does not require that it return a value. IMPORTANT: The order of the key-value pairs processed by hash-map is NOT guaranteed!

Examples of using hash tables in DrScheme

> (define ht (make-hash)) ; make an empty hash table

> (hash-set! ht "a" 1)   ; add the key-value pair ("a" 1)

> (hash-set! ht "b" 10) ; add the key-value pair ("b" 10)

> (hash-ref ht "a")   ; retrieve the value associated with the key "a".
1

> (hash-ref ht "b") ; retrieve the value associated with the key "b".
10

> (hash-set! ht "d" 100) ; Creates a valid key, "d", associated with the value, 100.
> (hash-remove! ht "d") ; Remove the key, "d" and its associated value.
> (hash-ref ht "d") ; Double check to make sure key is gone.
. . hash-ref: no value found for key: "d"

> (hash-update! ht "a" (lambda (v) (add1 v)) (lambda () 42)) ; add one to the value associated with the key "a".

> (hash-ref ht "a") ; check the value for the key "a".
2

> (hash-update! ht "b" (lambda (v) (add1 v)) (lambda () 42)) ; add one to the value associated with the key "b".

> (hash-ref ht "b") ; check the value for the key "b".
11

> (hash-ref ht "a") ; verify that the value for the key "a" is unaffected.
2

> (hash-ref ht "c") ; attempt to get the value for the non-existent key "c".
. . hash-ref: no value found for key: "c"

> (hash-update! ht "c" (lambda (v) (add1 v)) (lambda () 42)) ; update the value non-existing key, "c", by first setting the value to 42 and then adding 1.

> (hash-ref ht "c") ; check the value for the key "c".
43

> (hash-map ht (lambda (k v) k)) ; get a list of all the keys in the hash table.
("b" "a" "c")

> (hash-map ht (lambda (k v) v)) ; get a list of all the values in a hash table.
(11 2 43)

> (hash-map ht (lambda (k v) (list k v))) ; get a list of all the key-value pairs (2 element lists) in the hash table.
(("b" 11) ("a" 2) ("c" 43))

> (hash-map ht (lambda (k v) ; replace every value less than 20 with 20
                  (cond
                       [(< v 20) (begin
                                     (hash-set! ht k 20)
                                     (list k 20))]
                       [else (list k v)])))
(("b" 20) ("a" 20) ("c" 43))

> (hash-map ht (lambda (k v) (list k v))) ; check the key-value pairs in the hash table.
(("b" 20) ("a" 20) ("c" 43))

> (hash-for-each ht (lambda (k v)
                        (hash-set! ht k (+ v 10))))   ; add 10 to every value in the hash table

>(hash-map ht (lambda (k v) (list k v))) ; check the key-value pairs in the hash table.
(("b" 30) ("a" 30) ("c" 53))

Useful Hash Table Utilities

Turn a hash table into a list of key-value pairs (we did this above--it's so useful, it's nice to have it as a function):

;; hash-items: hashtable(any-->any) --> list of (any, any)
;; Returns a list of all the key-value pairs in a hashtable.
(define (hash-items ht)
(hash-map ht (lambda (k v)
(list k v))))

Simple Substitution Cipher

In this part of the lab, we will develop some functions that will work as a simple "substitution cipher", which is a method of encrypting text messages by substituting each letter with another one, very much like the "secret decoder ring" in your favorite breakfast cereal. (e.g. http://www.exploratorium.edu/ronh/secret/secret.html and http://en.wikipedia.org/wiki/Secret_decoder_ring)

Here's an example of a simple substitution cipher: (for simplicity's sake, I've only shown the letters that I will need)

d --> a
e --> m
H --> o
l --> ?
o --> x
r --> e
w --> d
! --> H
[space] --> g

So we can use these relationships to encode a string by substituting each letter one at a time:

Hello world! --> om??xgdxe?aH

To decode the message, one needs the inverse of the above relationships (note what this says mathematically about the nature of substitution ciphers):

a --> d
d --> w
e --> r
g --> [space]
H --> !
m --> e
o --> H
x --> o
? --> l

With this relationship we can do the same substitution process as before:

om??xgdxe?aH --> Hello world!

Ta da!

The thing to notice here is that the substitution relations above are exactly the same type of one-way key-value relationships we discussed at the top of this lecture. That means that dictionaries are the perfect data structure to use to represent the substitution relationships.

Encoder Dictionary

The first thing we need to do is to create the dictionary that maps the letters and characters in a message. We will create a random mapping for maximum security.

One of the crucial issues however is that any mapping we create must be reversible, that is, each output character must correspond to one and only one input character. For instance, suppoose we have d --> a, then we cannot also have p --> a. Why do you think that this restriction is necessary? Is it also necessary to impose the restriction the other way, i.e. one output character for any input character?

Fundamentally, this is how we will accomplish this goal:

Make a list of all the possible characters that we will encode to, that is, the list of all possible output characters in the encoded string for any input string
Go through all the possible input characters we wish to encode and using a dictionary to match it to a randomly chosen an entry from the list of possible output characters.
If the chosen output character is removed from the list of possibilities at this point, then it will never be chosen again and our reversibility requirement will be automatically met.

Let's call our function makeEncoderDict which will take one parameter, a list of available characters, and return a dictionary where the keys are the original characters and the values are the encoded characters.

Here's what we need to do:

Make a list of all the available characters for encoding:

It's easier to type characters as part of a string, so we can type out a long string with all the desired characters and then use the built-in string->list function to convert the string into a list of individual characters:
(define all_chars (string->list"abcdefghijklmnopqrstuvwxyz .,!")) -- you'll probably want more, such as capital letters, more punctuation marks, etc.

For an even more complete list of characters, use the technique described below in the "Advanced encoding with ASCII characters" section.

Thinking ahead, we need a utility function to remove the n'th element from a list:

We write a simple recursive function to "count down" from n to zero as we recur through the list and return a list without that element by not cons'ing first onto the recursive result from rest. This gives us 3 distinct cases to contend with:

list is empty: this is an error condition because there are no elements to remove.
n is zero: The zero'th element of a list is its first, so the result of removing the zero'th element is simply the rest of the list.
n > 0: The n'th element is farther back in the list, so the result is the first cons'd onto the result of removing the n-1'th element from the rest of the list.

This generates the following code:

;; remove-nth: list-of-any integer --> list-of-any
;; (remove-nth loa n) returns a list where the n'th element (starting at zero) from
;; loa has been removed.
;; Throws an exception if the n'th element doesn't exist.
(define (remove-nth loa n)
    (cond
        [(empty? loa) (raise "remove-nth: Too few elements in list!")] ; no elements to remove here!
        [(zero? n) (rest loa)] ; found the n'th element, return the rest of the list without first
        [else (cons (first loa) (remove-nth (rest loa) (sub1 n)))])) ; not here. Count down into rest.

Make the dictionary to encode the original text string:

The process to randomly associate characters with another character goes something like this:

Start with a source list of characters, "srcChars" e.g. ('a 'b 'c 'd)
Create an empty hash table to hold the encoding data (mappings), dict.
Initially, the list of remaining characters, "remainingChars", is the same as the source list. Since the following is recursive, it is really a helper function that takes the list of remaining characters as its input parameter:
1. From the list of remaining characters, we can figure out how many remaining characters there are, "nAvail". Above, nAvail is 4.
2. If nAvailis zero, then it means that we have exhausted all remaining characters and thus we are done. Return the filled hash table, dict.
3. Otherwise, choose a random integer between 0, inclusive, and nAvail, exclusive. In Scheme, this is simply (random nAvail). Call this "valueIdx". For instance, valueIdx could be 2.
  1. Pick the nAvail-1'th character from the source characters. At first, this will be 'd. This will be the key.
  2. Since valueIdx is a valid element of remainingChars, take the valueIdx element from remainingChars and use it as the value. This would be 'c on the first round.
  3. Add the key-value pair to the output hash table, dict. Here, that key-value pair is ('d, 'c).
  4. Recur, to step 3.a above having set remainingChars to remainingChars with the valueIdx element removed. This means that remainingChars would become ('a 'b 'd).

The process might look like something like this, for a srcChars list of ('a 'b 'c 'd):

Step #	remainingChars	nAvail	valueIdx	key -value pair
1	('a 'b 'c 'd)	4	2	'd --> 'c
2	('a 'b 'd)	3	0	'c --> 'a
3	('b 'd)	2	1	'b --> 'd
4	('b)	1	0	'a --> 'b
5	()	0	Done!	Done!

The above process can be captured in a function called makeEncoderDict by utilizing some local definitions and a helper function:

;; makeEncoderDict: list-of-chars --> hashtable(char-->char)
;; (makeEncoderDict srcChars) returns a hash table who's keys are the characters
;; in srcChars where the associated value is another, randomly chosen character
;; from the same set.
(define (makeEncoderDict srcChars)
    (local
        [(define dict (make-hash)) ; The dictionary that will be returned.
         ;; fillDict: list-of-chars --> hashtable(char-->char)
         ;; (fillDict srcChars) is a recursive helper function that returns the above dict hash table
         ;; who's keys are the characters in srcChars where the associated value is another,
         ;; randomly chosen character from the same set.
         (define (fillDict remainingChars)
             (local
                 [(define nAvail (length remainingChars))] ; this value is needed several times below
                 (cond
                     [(zero? nAvail) dict] ; done! return the filled dictionary
                     [else (local
                               [(define valueIdx (random nAvail))] ; pick a random number 0<= valueIdx < n
                               (begin
                                   ; associate the n-1'th element of the srcChars list with
                                   ; the i'th member of the remaining characters
                                   (hash-set! dict (list-ref srcChars (sub1 nAvail)) (list-ref remainingChars valueIdx))
                                   ; recurse onto the remaining characters with the i'th char removed.
                                   (fillDict (remove-nth remainingChars valueIdx))))])))]
        (fillDict srcChars))) ; Start the process off with the full list of characters

To use makeEncoderDict, simply call it with a list with has all the available characters for encoding, such as all_chars, e.g.

(define myEncoderDict (makeEncoderDict all_chars))

Encoder Function

Encoding is a trivial matter of simply walking through every element of a string and building up a new string by using the encoding dictionary to retrieve the corresponding encoded character. The "walking through the string" can be easily accomplished by mapping a lambda over a list of characters formed from the string. This will produce a list of encoded characters, which can be easily transformed back into a string.

The encoder function, called encode, will take a string and an encoder dictionary (hash table) and return an encoded string.

Convert the string to list of characters using the built-in function "string-->list".
Use map to apply a lambda on every element of the list of characters. The lambda should simply use the given character as the key to lookup the corresponding value in the given encoder dictionary:
(lambda (c)
(hash-ref dict c))
The result of the map is a list of encoded characters, the outputs from each application of the lambda.
Convert the list of encoded characters to a string using the built-in function, "list-->string".

All three steps above can be accomplished in a single line of code. Take a shot at it:

;; encode: string hashtable(char-->char) --> string
;; (encode str dict) returns a string where every character
;; in str has been replaced by the corresponding value in dict.
(define (encode str dict)
...)

You should now check your encoder function:

Print out your encoder dictionary.
Choose a simple, short word or phrase and manually encode it using the data from your encoder dictionary.
Compare your hand-encoded phrase with the same phrase put through your encoder function. Your call to your encoder function should look something like this:
(define myEncodedStr (encode "Yo Yo Ho!" myEncoderDict))

Decoder Dictionary

In order to decode the encoded string, we need to represent the opposite relationship as the encoder dictionary. To do this, we simply have to walk through the encoder dictionary and in the process, fill a new dictionary with key-value pairs that are exactly the opposite of what came from the encoder dictionary. That is, the key becomes the value and the value becomes the key.

Make a function called makeDecoderDict that will take an encoder dictionary and return a decoder dictionary made from it.

There are two similar ways of doing this:

Convert the encoder dictionary into a list of key-value pairs and then map a lambda across it to fill the decoder dictionary.
Use the special built-in hash table map function, hash-map (see above), to map a lambda of two input parameters, the key and value, across the entire encoder dictionary. See the hash-items function above to see how to use hash-map.

In a nutshell, the process is thus:

Use local to create a definition for an empty decoder hash table, which when filled, will be returned as makeDecoderDict's result.
Use begin to
1. Map a lambda across the given encoder dictionary (hash table) that will enter switch the keys and values and add them to the decoder dictionary. Use either technique described above for the mapping process.
2. Return the now filled decoder dictionary.

Give it a shot:

;; makeDecoderDict: hashtable(any-->any) --> hashtable(any-->any)
;; (makeDecoderDict inputDict) inverts the input hashtable, inputDict,
;; returning a new hashtable where the keys and values from inputDict been
;; swapped. For instance if in inputDict, key x --> value y, then in the
;; returned hashtable, key y --> value x.
(define (makeDecoderDict inputDict)
...)

A call to makeDecoder should look something like this:

(define myDecoderDict (makeDecoder myEncoderDict))

Decoder Function

Do we need to write a special decoding function, like we had for the encoding? Why or why not?

Testing Your System

Test your decoder by comparing a manual decoding with a the output of your decoding function, just as you did with the encoder function.
You should also now be able to test a full encode-decode process by taking a phrase, encoding it to an encrypted string, and then decode that string back to its original form.
By making a second encoder and decoder dictionary pair, prove that if you use the wrong encoder or decoder, you cannot decrypt an encoded message.

Download the Code:

After you have obtained a solution for the above exercises, you may download the full code here to compare your algorithms with those from the instructors.

A Challenge!

One of the problems with a single substitution cipher as we just created is the fact that they are relatively easily cracked. Even though each letter has been replaced by another, since each letter is represented by a single letter in the encrypted message, the relative frequency that each letter occurs is unchanged. Since the frequency of each letter in general texts is well known (see, for example http://en.wikipedia.org/wiki/Letter_frequency), then one can easily tell which letter each character in the encrypted message really represents.

One possible way to make a substitution cipher much more difficult to crack is to have multiple possibilities for each letter. For instance,

d --> a or "
e --> m or z or 1
H --> o or :
l --> ? or 8 or ]
o --> x or t or \
r --> e or %
w --> d or q
! --> H or >
[space] --> g or =

Note that the number of possibilities for each letter does not have to be the same. The restriction however is that the number of output characters is much greater than the number of input characters. Using this sort of mapping, randomly choosing what letter to substitute from the list of choices, we get the following sort of encoding. Note how repeated letters such as the "l", do not appear to be repeating any more.

Hello world! --> oz?8tgdxe]a>

The decoding relationship is essentially the same as before, but just with more entries, where some of the entries map back to the same value:

a --> d
d --> w
e --> r
g --> [space]
H --> !
m --> e
o --> H
q --> w
t --> o
x --> o
z --> e
? --> l
" --> d
1 --> e
8 --> l
] --> l
: --> H
\ --> o
% --> r
> --> h
= --> [space]

With this decoding mapping, the encoded message is easily decoded:

oz?8tgdxe]a> --> Hello world!

So your challenge is, can you modify your simple, single substitution code to be able to perform a multiple substitution cipher as described above?

Advanced encoding with ASCII characters

What if you wanted to be able to encode some non-printing characters such as linefeed, carriage return, backspace and tab? These are very difficult to add to your string of available characters.

One way is to represent the characters not as regular printable entities, but as integers. This is done by the use of ASCII characters. Please take a moment to read about them in the addendum below.

By using the escape sequences for the non-printing characters, you could painstakingly add them all to your string of available characters, but there's an easier way: simply define a function that will generate the string of available characters.

Define a function called makeCharList that takes two integers as inputs, a beginning ASCII integer value and an ending ASCII integer value.

The function should simply recurse through all the ASCII integer values, starting at the given start value up to but not including the given end value.
At each iteration of the loop, by using the integer->char function, simply append the character equivalent of the ASCII integer value to a result string (initialized before the loop).
After the loop is done, simply return the result string.

Define two integer variables called ASCII_START and ASCII_END that define the range of ACII characters you are willing to encode. For basic type-able characters, starting at 32 (space character) and ending at 127 (126 is the is the tilde, ~, character) is sufficient. This will insure that your encoded message is easily printed. On the other hand, if you want to encode anything, you could use something like 0 through 128.

(define SPACE_CHAR 32) ;; ASCII code for space character
(define DEL_CHAR 127) ;; ASCII code for del character

;; makeCharList: int int --> list-of-chars
;; (makeCharList startCharInt stopCharInt) Returns a list of characters with consecutive ASCII numerical values,
;; starting with the integer value, startCharInt, inclusive up to stopCharInt, exclusive.
(define (makeCharList startCharInt stopCharInt)
    (cond
        [(>= startCharInt stopCharInt) empty]
        [else (cons (integer->char startCharInt) (makeCharList (add1 startCharInt) stopCharInt))]))

A call to makeCharList should look something like the following, which will create a list of all the printable ASCII characters, except for those that control multiline text, such as newline and carriage return:

(define all_chars (makeCharList ASCII_START, ASCII_END))

Now you can easily make a list of available characters, including non-printing ones, without having to type them all out!

Addendum: ASCII Characters

Before we start working on our substitution cipher, we need to understand a little more about how characters are represented in a computer. The standard binary representation of characters in computers is called the ASCII (American Standard Code for Information Interchange) character set. The ASCII character set associates every character with a specific integer number between zero and 127 (inclusive). The characters represented included upper and lower case letters, numbers, various symbols such as exclamation points and question marks, as well as non-printing characters used to control the display device, such as tabs, line feeds and carriage returns. With 128 total characters, a 7 bit binary number can be used to represent them all.

With the advent of the IBM PC in 1981, with its natural ability to handle multiples of 8-bits, the ASCII character set was extended to 8-bits or 256 possible values. These new "extended ASCII codes" included some Greek letters, letters with diacriticals and various lines and corners used to create "ASCII art" borders and drawings on primitive text screens and printouts. There is not a true standard for an 8-bit ASCII code however, so it may vary from program to program.

Here are some links to ASCII tables on the web:

http://www.asciitable.com/ ((Note that the extended ASCII codes here differ from those in the next link)
http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml (I believe that this is what Python uses)
http://ascii-table.com/
http://en.wikipedia.org/wiki/ASCII

Python provides convenient functions to convert a character to and from its respective ASCII numerical values:

(integer->char anInteger) --> the character represented by the given integer.

(char->integer aCharacter) --> the integer that the ASCII code says represents the given character.

Non-printing characters

Consider the phenomena when trying to display a 3 element list:

> (define s1 (list->string (list (integer->char 65) (integer->char 66) (integer->char 67))))
> s1
"ABC"
> (display s1)
ABC
> (define s2 (list->string (list (integer->char 65) (integer->char 10) (integer->char 67))))
> s2
"A\nC"
> (display s2)
A
B

Note that print(hr(i)) may show what appears to be a different output than simply typing chr(i) at the interactive Python prompt. In particular, this issue will show up for characters whose integer values are less than 32 or greater than 127. For numbers less than 128, this is due to the fact that the characters represented by those integer values do not have nice screen representations. Originally these were used control the primitive teletypes that were in use when the ASCII code was developed. Originally, there were no character codes above 128, but later they were used for international symbols and "ASCII art" symbols for making nicer looking layouts on text-only screens. These control codes do not have representations that are easily rendered on the screen.

MMost of the ASCII control codes are useless these days, except for a few notable exceptions:

backspace, (integer->char 8) -- moves the cursor back one space without deleting what is on the screen
tab, (integer->char 9) -- move the cursor over on tab unit
linefeed, (integer->char 10) -- move the cursor directly down one line. Some systems, e.g. Linux and Python, will also move the cursor to the beginning of the line.
carriage return, (integer->char 13) -- move the cursor to the beginning of the line
delete, (integer->char127) -- moves the cursor back on space, deleting the character that was there.

So how does one specify a character that doesn't print?

Scheme, as many languages do, uses a special "escape sequence" of characters to specify non-printing characters. The escape sequence always starts with a backslash, "\", and is followed by a single letter or numbers. The single letter is a shorthand for common characters and the numbers are the ASCII integer value in either hexadecimal (base 16, preceded by an "x") or octal (base 8). For instance:

linefeed (LF) = span class="style3">\n or \x0a or \012
carriage return (CR) = \r or \x0c or \015
tab = \t or \x09 or \011
backslash = \\ or \x5C or \134

More information on Python escape codes can be found here: http://python.org/doc/2.5.2/ref/strings.html

Unicode Characters:

In our modern world with non-English letters, there was a need for a much richer character system than the old ASCII. Unicode is a multi-byte character set that can represent most of the characters and symbols used in the world today. Any program that expects to be used in an international setting should use Unicode. For more information, see

Last revised 11/23/2009 10:28 AM

Maintained by the professor; see contact information on the course home page