Lecture 34 - Hashing


As usual, create a directory to hold today's activities:

$ mkdir ~/cs170/labs/lab34
$ cd ~/cs170/labs/lab34

Program Outlines

Turn in your program outlines at the beginning of class, as usual. I will have these graded, commented, and uploaded to inquire by tomorrow morning.


Quiz on Wednesday

Your final quiz of the semester will be this Wednesday. It will be on paper, and will be based on today's lecture. Make sure you pay attention to the process of Linear Probing, as you will be asked to execute this process on Wednesday.


Hashing

You were introduced to dictionaries in CS120, as a mechanism for storing data similar to lists. The interesting thing about dictionaries are that they use arbitrary data as your keys, as opposed to integers for lists. We've seen how we could sort of do that, using data structures like linked lists, trees, etc. However, dictionaries are much more efficient than any of those structures.

Dictionaries use a mechanism known as hashing to increase their performance. The underlying structure they use is actually a list. They convert any key you give them into some integer index in their list, which grants them the speed of lists. However, this method is not exactly free. We have to make sure we handle some special cases in the algorithms, to make this effective.


Lab Activity 1
Hash Codes

Step one of hashing requires generating integer based indicies based off arbitrary data types. For today, we are going to focus solely on Strings. However, note that a similar process can be used for arbitrary data types.

The easiest mechanism for generating a hash code for a string is to simply add the ASCII values of all of the characters of the strings. This will generate some number for us, but these values will likely exist outside of the indicies of our table. We need to make sure the value we get is in the bounds of the table.

Details

Create a file called hash_table.py, which will contain a class called HashTable. This class needs two main attributes: hash_table and size. Your constructor should take the size of the table as a parameter, and create the hash_tableas a list that initially only contains None values.

Add a method called generate_hash_code(self, a_string), which takes as a parameter a_string. It should return an integer in the range [0, len(self.size)), which is computed as the sum of the ASCII values of the string's characters.

Example

>>> my_hash_table = HashTable(10)
>>> print(my_hash_table.generate_hash_code("Hello"))
0
>>> print(my_hash_table.generate_hash_code("World!"))
3
>>> print(my_hash_table.generate_hash_code("p@$$w0rD"))
7
>>> print(my_hash_table.generate_hash_code("123456789"))
7
>>> my_hash_table = HashTable(13)
>>> print(my_hash_table.generate_hash_code("Hello"))
6
>>> print(my_hash_table.generate_hash_code("World!"))
7
>>> print(my_hash_table.generate_hash_code("p@$$w0rD"))
12
>>> print(my_hash_table.generate_hash_code("123456789"))
9
>>> my_hash_table = HashTable(23)
>>> print(my_hash_table.generate_hash_code("Hello"))
17
>>> print(my_hash_table.generate_hash_code("World!"))
1
>>> print(my_hash_table.generate_hash_code("p@$$w0rD"))
22
>>> print(my_hash_table.generate_hash_code("123456789"))
17

Hint

  • Recall that the ord function takes a character, and returns the ASCII value for that character.

  • You are going to be using the accumulator pattern that we discussed way back in CS120 for this one. You simply need a for loop that iterates over all of the characters of the string.

  • Your sums will get arbitrarily large, given any non-negligible length string. You will want to use the mod operator (%) to make sure you get an integer in the range specified.

 

Challenge

As we will see in today's activity, the more unique our hash codes are for strings, the better our hash table is going to perform. However, simply adding the ASCII values of the characters is not going to cut it. We need some other function in order to get more unique hash codes.

Imagine we treat the ASCII string as a base-256 number. That means each character can be assigned some place value, based on its position in the string. Given that, compute a new hash code based off this technique. Are previous conflicts now fixed? Does this cause new conflicts?


Lab Assignment 34
Linear Probing

Once you have the hash code specified for some data, all you need to do is look up the index from the hash code to determine where to find the data, right? Well, we still have the problem of collisions. What happens if the two different string map to the same location in the hash table?

There are multiple ways collisions can be handled. The easiest way to accomplish this is through a process known as linear probing. If your hash code maps to a location that already has data in it, you just start iterating through the rest of the list, until you find an open slot. You have to perform this on both insertion and look up.

Details

You will need to write two new methods for the HashTable class. The first method is insert(self, key, value), which takes a string key and some other data value as parameters. This method should insert the tuple (key, value) into the hash table, using the linear probing technique as necessary. For now, we will assume that there is always room in the hash table for this key, value pair.

The other method you need to write is lookup(self, key), which takes a string key and returns the value associated with that key. For now, we will assume that all look ups will succeed: the key is always a valid key in the hash table. This method should return the value associated with that key as well as the index that value was found at.

Notice that you are required to insert the (key, value) pair into the hash table. In an ideal world, you wouldn't have to do this. However, there is a substantial probability of collisions in our data set. Your lookup not only needs to compute the hash code of the key, but then it has to make sure they key stored in the table is the one being looked up.

Example

>>> my_hash_table = HashTable(4)
>>> my_hash_table.insert("Hello", "Hello")
>>> my_hash_table.insert("World!", "World!")
>>> my_hash_table.insert("p@$$w0rD", "p@$$w0rD")
>>> my_hash_table.insert("123456789", "123456789")
>>> print(my_hash_table.lookup("Hello"))
('Hello', 0)
>>> print(my_hash_table.lookup("World!"))
('World!', 1)
>>> print(my_hash_table.lookup("p@$$w0rD"))
('p@$$w0rD', 2)
>>> print(my_hash_table.lookup("123456789"))
('123456789', 3)


>>> my_hash_table = HashTable(23)
>>> my_hash_table.insert("Hello", "Hello")
>>> my_hash_table.insert("World!", "World!")
>>> my_hash_table.insert("p@$$w0rD", "p@$$w0rD")
>>> my_hash_table.insert("123456789", "123456789")
>>> print(my_hash_table.lookup("Hello"))
('Hello', 17)
>>> print(my_hash_table.lookup("World!"))
('World!', 1)
>>> print(my_hash_table.lookup("p@$$w0rD"))
('p@$$w0rD', 22)
>>> print(my_hash_table.lookup("123456789"))
('123456789', 18)


>>> my_hash_table = HashTable(29)
>>> my_hash_table.insert("Hello", "Hello")
>>> my_hash_table.insert("World!", "World!")
>>> my_hash_table.insert("p@$$w0rD", "p@$$w0rD")
>>> my_hash_table.insert("123456789", "123456789")
>>> print(my_hash_table.lookup("Hello"))
('Hello', 7)
>>> print(my_hash_table.lookup("World!"))
('World!', 2)
>>> print(my_hash_table.lookup("p@$$w0rD"))
('p@$$w0rD', 17)
>>> print(my_hash_table.lookup("123456789"))
('123456789', 13)

Hint

  • There are really two cases for both insertion and look ups. If the hash code computed for the key produced the actual location the data was stored in, you are done. If it didn't, however, you have to perform probing to find the correct location.

    Notice that probing requires you to wrap around to the beginning of the list. Your immediate intuition for this problem was probably a for loop. However, a for loop might not make a lot of sense with the wrap around mechanisms. A while loop might be better.

  • Your mod operator is going to get a work-out today. Remember, you can mod any number by the length of the list, and it will give you a value in the specified range of the list. This is invaluable to wrap back around to the beginning of the list.

 

Challenge

The problem with this mechanism of insertion is that it can cause cascading issues. If one (key, value) pair gets placed in an "incorrect" location based off the key, it could cause a collision with unrelated data that gets inserted later.

The better solution for this is to make each location of the hash table a list of (key, value) pairs. This way collisions only affect their own key locations. Modify your hash table code, to follow this new paradigm.


Submission

When you have finished, create a tar file of your lab34 directory. To create a tar file, execute the following commands:

cd ~/cs170/labs
tar czvf lab34.tgz lab34/

To submit your activity, go to cseval.roanoke.edu. You should see an available assignment called Lab Assignment 34. Make sure you include a header listing the authors of the file.


Last modified: Mon Apr 14 14:46:00 EDT 2014