Lecture 35 - Hashing


As usual, create a directory to hold today's activities:

$ mkdir ~/cs170/labs/lab35
$ cd ~/cs170/labs/lab35

Quiz

It's time for your final quiz of the semester! Please lock your computers, as this is a closed computer (as well as closed notes and book) quiz. You may move on to today's lab activities when you finish, but keep in mind that I do have a brief lecture as soon as everyone has finished the quiz.


Hashing

On Monday we wrote hash tables using the linear probing technique. However, you hopefully noticed that the linear probing has an issue with clustering, which causes collisions in unrelated table cells. Today we will go over two additional probing techniques, with the goal of reducing the clustering.

A common question from Monday was also how to handle running out of space in the hash table. We assumed there would always be a location that we could put the key/value pair in. However, this might not always be the case. We will cover table resizing today as well.


Lab Activity 1
Resizing

Determining when to resize the hash table is a bit of a tricky one. We could wait until the table is completely full, but this would cause relatively long lookup times for almost everything in the hash table. Instead, what is commonly done is resizing the table once some load factor is exceeded.

This load factor is a constant defined value, which is a percentage of the hash table locations that are filled in. This is typically 75% of the size of the table, but can vary depending on implementation.

Details

Copy your hash_table.py file from lab 34 into your current directory.

$ cd ~/cpsc170/labs/lab35
$ cp ../lab34/hash_table.py .

Add a method called resize(self), which resizes the hash table. By default, we will always double the size of the hash table. You need to create a new hash table with this new size, and re-insert all of the key/value pairs into their new locations in the new table.

Modify your insert method, so that the table is resized when the load threshold is exceeded. You may need to add other method or attributes to make determining this easier.

Example

>>> my_hash_table = HashTable(5)
>>> my_hash_table.insert("Professor", "Scotty")
>>> print(my_hash_table.hash_table)
[None, None, None, ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Class", "CPSC 170")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Room", "Trexler 363")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), ('Room', 'Trexler 363')]
>>> my_hash_table.insert("Wet", "Ceiling")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), None, None, None, None]
>>> my_hash_table.insert("Slow", "Computers")
>>> print(my_hash_table.hash_table)
[None, ('Slow', 'Computers'), ('Class', 'CPSC 170'), ('Professor', 'Scotty'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), None, None, None, None]

Hint

  • Add an attribute called load to the class. This attribute should store the number of key/value pairs currently held by the table. You can use this value to determine when you need to resize your table.

  • If you create your new table in the hash_table attribute, you can simply call insert for all of the key value pairs in the table. However, you will need to keep a local copy of the old hash table, so you can find all of the original key/value pairs.

 

Challenge

As we will see in the next section, having the table size be a prime number can be quite advantageous. Our current resizing technique does not guarantee this. As a matter of fact, it will always generate a composite number. Moving to the next prime is also not necessarily good, as primes are sometimes too close to one another.

Rewrite the resize structure, so that it finds the first prime larger than 2 * self.size. Your table should use that as the size.


Lab Assignment 35
Removing Clustering

We have used linear probing to determine where in the hash table a key/value pair should appear. However, you have likely noticed a clustering trend. There tends to be large, contiguous locations of key/value pairs in the hash table. This is a problem, because it will cause future collisions in unrelated locations to the problem keys.

There are two additional probing techniques that typically get used. Quadratic probing defines a quadratic equation as the probing sequence. This reduces "primary clustering," but has an issue with "secondary clustering." These are clusters around the secondary locations of the probes. The better technique is known as Double Hashing. In double hashing, a secondary hash function is used to modify the probe sequence.

Details

You will need to write at least two more methods to your HashTable class. The first should be quadratic_probe(self, a_key, probe_iteration). a_key is a string, while probe_iteration is an integer. This method should return an integer in the range [0, self.size). It will compute that value using the following equation:

quadratic_probe(key, i) = h(key) + c1 * i + c2 * i2

Where h(key) is your hash function, i is your probe iteration, and c1 and c2 are prime constants. For our purposes, use the values 5 and 7 respectively. This method will tell you where to put the key, based on the ith time you have had to generate a probe key.

Write another method called double_hash(self, a_key, probe_iteration). This method will behave much like quadratic probe, except it will encode the following equation:

double_hash(key, i) = h1(key) + h2(key) * i

Where h1(key) is your original hash function, and h2(key) is a secondary hash function. The value of this second hash function needs to be relatively prime to the size of the hash table. We can fudge this, by using the following equation for h2(key):

h2(key) = (h1(key) + 1) % (size - 1)

Modify your insertion and lookup methods, so you can easily test each of these methods.

Example

Quadratic
>>> my_hash_table = HashTable(5)
>>> my_hash_table.insert("Professor", "Scotty")
>>> print(my_hash_table.hash_table)
[None, None, None, ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Class", "CPSC 170")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Room", "Trexler 363")
>>> print(my_hash_table.hash_table)
[('Room', 'Trexler 363'), None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Wet", "Ceiling")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), ('Professor', 'Scotty'), None, None, None, None]
>>> my_hash_table.insert("Slow", "Computers")
>>> print(my_hash_table.hash_table)
[None, ('Slow', 'Computers'), ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Wet', 'Ceiling'), ('Professor', 'Scotty'), None, None, None, None]
Double Hashing
>>> my_hash_table = HashTable(5)
>>> my_hash_table.insert("Professor", "Scotty")
>>> print(my_hash_table.hash_table)
[None, None, None, ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Class", "CPSC 170")
>>> print(my_hash_table.hash_table)
[None, None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Room", "Trexler 363")
>>> print(my_hash_table.hash_table)
[('Room', 'Trexler 363'), None, ('Class', 'CPSC 170'), ('Professor', 'Scotty'), None]
>>> my_hash_table.insert("Wet", "Ceiling")
>>> print(my_hash_table.hash_table)
[('Wet', 'Ceiling'), None, ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Professor', 'Scotty'), None, None, None, None, None]
>>> my_hash_table.insert("Slow", "Computers")
>>> print(my_hash_table.hash_table)
[('Wet', 'Ceiling'), ('Slow', 'Computers'), ('Class', 'CPSC 170'), ('Room', 'Trexler 363'), ('Professor', 'Scotty'), None, None, None, None, None]

Hint

  • Partially as a result of the Chinese remainder theorem, we don't have to worry about changing our current return value for generate_hash_code. We can leave it exactly as it was.

  • Performing linear probing was pretty easy. However, if we complicate that method a little bit, it might make the insertion method a bit easier to modify. Imagine instead that the linear probing was it's own method:

          def linear_probe(self, a_key, probe_iteration):
              return (generate_hash_code(a_key) + probe_iteration) % self.size
          
    Then, instead of performing the probe code in insertion, we simply call linear_probe in the probing loop. You will need to do something similar with quadratic and double hashing.

  • You could even take the above hint one step farther. Write a method called probe(self, a_key, probe_iteration), which calls one of the other probe methods. Then, all you need to do is call probe in insert and lookup! This is following a design pattern known as Adapter, which can really make editing your code a lot easier in the long run.

 

Challenge

This challenge is a bit unrelated to the probings, but can really increase performance of the hash table. One of the ways we can improve lookup speed is by reordering the elements in the hash table based on their "popularity." As lookups are performed on the elements of the hash table, we can move them up the probing chain.

Whenever you perform a lookup that had to probe to find the correct value, move the key/value pair one step up the probing chain by swapping it with the value immediately proceeding it.


Submission

When you have finished, create a tar file of your lab35 directory. To create a tar file, execute the following commands:

cd ~/cs170/labs
tar czvf lab35.tgz lab35/

To submit your activity, go to cseval.roanoke.edu. You should see an available assignment called Lab Assignment 35. Make sure you include a header listing the authors of the file.


Last modified: Wed Apr 16 13:26:09 EDT 2014