Lecture 35 - Hashing


As usual, create a directory to hold today's activities:

$ mkdir ~/cs170/labs/lab35
$ cd ~/cs170/labs/lab35

Quiz

It's time for your final quiz of the semester! Please lock your computers, as this is a closed computer (as well as closed notes and book) quiz. You may move on to today's lab activities when you finish, but keep in mind that I do have a brief lecture as soon as everyone has finished the quiz.


Hashing

On Monday we wrote hash tables using the linear probing technique. However, you hopefully noticed that the linear probing has an issue with clustering, which causes collisions in unrelated table cells. Today we will go over two additional probing techniques, with the goal of reducing the clustering.

A common question from Monday was also how to handle running out of space in the hash table. We assumed there would always be a location that we could put the key/value pair in. However, this might not always be the case. We will cover table resizing today as well.


Lab Activity 1
Resizing

Determining when to resize the hash table is a bit of a tricky one. We could wait until the table is completely full, but this would cause relatively long lookup times for almost everything in the hash table. Instead, what is commonly done is resizing the table once some load factor is exceeded.

This load factor is a constant defined value, which is a percentage of the hash table locations that are filled in. This is typically 75% of the size of the table, but can vary depending on implementation.

Details

Copy your hash_table.py file from lab 34 into your current directory.

$ cd ~/cpsc170/labs/lab35
$ cp ../lab34/hash_table.py .

Add a method called resize(self), which resizes the hash table. By default, we will always double the size of the hash table. You need to create a new hash table with this new size, and re-insert all of the key/value pairs into their new locations in the new table.

Modify your insert method, so that the table is resized when the load threshold is exceeded. You may need to add other method or attributes to make determining this easier.

Example

>>> my_hash_table = HashTable(5)
>>> my_hash_table.insert("Professor", "Scotty")
>>> print(my_hash_table.hash_table)
[None, None, None, ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Class", "CPSC 170")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Room", "Trexler 363")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), ('Room', 'Trexler 363')]
>>> my_hash_table.insert("Wet", "Ceiling")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), None, None, None, None]
>>> my_hash_table.insert("Slow", "Computers")
>>> print(my_hash_table.hash_table)
[None, ('Slow', 'Computers'), ('Class', 'CPSC 170'), ('Professor', 'Scotty'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), None, None, None, None]

Hint

 

Challenge

As we will see in the next section, having the table size be a prime number can be quite advantageous. Our current resizing technique does not guarantee this. As a matter of fact, it will always generate a composite number. Moving to the next prime is also not necessarily good, as primes are sometimes too close to one another.

Rewrite the resize structure, so that it finds the first prime larger than 2 * self.size. Your table should use that as the size.


Lab Assignment 35
Removing Clustering

We have used linear probing to determine where in the hash table a key/value pair should appear. However, you have likely noticed a clustering trend. There tends to be large, contiguous locations of key/value pairs in the hash table. This is a problem, because it will cause future collisions in unrelated locations to the problem keys.

There are two additional probing techniques that typically get used. Quadratic probing defines a quadratic equation as the probing sequence. This reduces "primary clustering," but has an issue with "secondary clustering." These are clusters around the secondary locations of the probes. The better technique is known as Double Hashing. In double hashing, a secondary hash function is used to modify the probe sequence.

Details

You will need to write at least two more methods to your HashTable class. The first should be quadratic_probe(self, a_key, probe_iteration). a_key is a string, while probe_iteration is an integer. This method should return an integer in the range [0, self.size). It will compute that value using the following equation:

quadratic_probe(key, i) = h(key) + c1 * i + c2 * i2

Where h(key) is your hash function, i is your probe iteration, and c1 and c2 are prime constants. For our purposes, use the values 5 and 7 respectively. This method will tell you where to put the key, based on the ith time you have had to generate a probe key.

Write another method called double_hash(self, a_key, probe_iteration). This method will behave much like quadratic probe, except it will encode the following equation:

double_hash(key, i) = h1(key) + h2(key) * i

Where h1(key) is your original hash function, and h2(key) is a secondary hash function. The value of this second hash function needs to be relatively prime to the size of the hash table. We can fudge this, by using the following equation for h2(key):

h2(key) = (h1(key) + 1) % (size - 1)

Modify your insertion and lookup methods, so you can easily test each of these methods.

Example

Quadratic
>>> my_hash_table = HashTable(5)
>>> my_hash_table.insert("Professor", "Scotty")
>>> print(my_hash_table.hash_table)
[None, None, None, ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Class", "CPSC 170")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Room", "Trexler 363")
>>> print(my_hash_table.hash_table)
[('Room', 'Trexler 363'), None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Wet", "Ceiling")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), ('Professor', 'Scotty'), None, None, None, None]
>>> my_hash_table.insert("Slow", "Computers")
>>> print(my_hash_table.hash_table)
[None, ('Slow', 'Computers'), ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), ('Professor', 'Scotty'), None, None, None, None]
Double Hashing
>>> my_hash_table = HashTable(5)
>>> my_hash_table.insert("Professor", "Scotty")
>>> print(my_hash_table.hash_table)
[None, None, None, ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Class", "CPSC 170")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Room", "Trexler 363")
>>> print(my_hash_table.hash_table)
[('Room', 'Trexler 363'), None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Wet", "Ceiling")
>>> print(my_hash_table.hash_table)
[('Wet', 'Ceiling'), None, ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Professor', 'Scotty'), None, None, None, None, None]
>>> my_hash_table.insert("Slow", "Computers")
>>> print(my_hash_table.hash_table)
[('Wet', 'Ceiling'), ('Slow', 'Computers'), ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Professor', 'Scotty'), None, None, None, None, None]

Hint

 

Challenge

This challenge is a bit unrelated to the probings, but can really increase performance of the hash table. One of the ways we can improve lookup speed is by reordering the elements in the hash table based on their "popularity." As lookups are performed on the elements of the hash table, we can move them up the probing chain.

Whenever you perform a lookup that had to probe to find the correct value, move the key/value pair one step up the probing chain by swapping it with the value immediately proceeding it.


Submission

When you have finished, create a tar file of your lab35 directory. To create a tar file, execute the following commands:

cd ~/cs170/labs
tar czvf lab35.tgz lab35/

To submit your activity, go to cseval.roanoke.edu. You should see an available assignment called Lab Assignment 35. Make sure you include a header listing the authors of the file.


Last modified: Wed Apr 16 13:26:09 EDT 2014