How hashtables work

Today we’re talking about how hashtables really work–partly because it’s useful to know, and partly because it involves some very interesting ideas from computer science.

A more accurate model for memory

In order to talk about how hashtables work, we’ll have to start by refining our model of program execution one final time. Imagine we have a list representing the current Billboard Top 100. We’d write it out like this:

Dictionary:

name value
top100 loc 1

Memory:

location data
loc 1 [“Truth Hurts”, “Yellow Submarine”, …]

Here’s a more accurate picture.

Memory:

location data
loc 1 <list of length 100>
loc 2 list member: “Truth Hurts”
loc 3 list member: “Yellow Submarine”
loc 4 …

So, what does Python do when we run something like:

> top100[1]

Rather than searching the list for the right value, Python can go directly to it–since top100 is at loc 1, top100[0] is at loc 2, top100[1] is at loc 3, etc.

Notice that this is a very different model than Pyret’s lists. You might see Pyret-style lists (defined using link and empty) called linked lists, while Python-style lists are sometimes called arrays.
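
To make the difference concrete, here is a rough sketch in Python, with a made-up Link class standing in for Pyret’s link and empty. Reaching index 1 in the linked version means following links one at a time; reaching index 1 in a Python list means jumping straight to the right memory location.

from dataclasses import dataclass

@dataclass
class Link:
  first: object
  rest: object     # another Link, or None standing in for empty

def nth(lst, n):
  # A linked list has to be walked one link at a time to reach index n.
  while n > 0:
    lst = lst.rest
    n = n - 1
  return lst.first

songs = Link("Truth Hurts", Link("Yellow Submarine", None))
print(nth(songs, 1))   # follows links until it reaches "Yellow Submarine"

top100 = ["Truth Hurts", "Yellow Submarine"]
print(top100[1])       # jumps straight to the right location (loc 3 above)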

Hashtables in memory

Now that we’ve seen how lists work in memory, we can see how hashtables work in memory. If our keys are numbers, we could imagine using exactly the same scheme as lists. So if we had a hashtable like

{1: "ones",
 10: "tens",
 100: "hundreds"}

We could store it in memory:

location data
loc 1 <hashtable of length 100>
loc 2 <empty>
loc 3 ht member: “ones”
loc 4 <empty>
loc 5 <empty>
…
loc 12 ht member: “tens”
…

Does this seem like a good idea? This gets us into another aspect of program performance: memory usage. In this case, a hashtable with only three elements is using one hundred memory locations! If we added an entry to the table for the trillions place, what would happen?
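
To see the problem concretely, here is a minimal sketch of the “use the key directly as the index” idea, using an ordinary Python list as the table, with one slot for every possible key from 0 through 100:

table = [None] * 101
table[1] = "ones"
table[10] = "tens"
table[100] = "hundreds"

print(table[10])                             # "tens", found without any searching
print(sum(slot is None for slot in table))   # 98 slots are sitting empty

A key for the trillions place would force the table to grow to over a trillion slots, nearly all of them empty.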

Hashtables don’t really work like this. Instead, we run the keys through a hash function: a function that takes a key and returns a number in the range defined by the size of the hashtable.

So memory might be something like

location data
loc 1 <hashtable of length 3>
loc 2 ht member: “hundreds”
loc 3 ht member: “ones”
loc 4 ht member: “tens”

So the hash function might look like:

1 → 1
10 → 2
100 → 0

The details of how this function works are beyond this class’s scope, but we’ve already seen the last step. Python can get numbers in the range 0-2 using the modulus operator (%): x % 3 is always in the range 0-2 regardless of what x is.
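
Here is a tiny sketch of just that last step. The big numbers below are made up; pretend they are what the earlier “mixing” steps of the hash function produced for our three keys:

table_size = 3

for mixed in [1157, 48202, 9]:
  # Whatever number the mixing steps produce, % squeezes it into the
  # range 0-2, which is always a valid slot in a table of length 3.
  print(mixed % table_size)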

The same approach works for strings and other datatypes–with a bit of a wrinkle.
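
For instance, Python exposes its own hash function as the built-in hash(), which turns strings (and many other values) into integers; a table of size 3 would then take that integer mod 3 to pick a slot. The exact numbers vary from run to run, but each result always lands between 0 and 2:

for key in ["Truth Hurts", "Yellow Submarine"]:
  print(key, hash(key) % 3)   # a slot number between 0 and 2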

Hashing datatypes

What if we want to keep track of the locations of objects we’ve put on a map? We might want to use a hashtable where the keys are Posns (as we’ve defined them before):

from dataclasses import dataclass

@dataclass
class Posn:
  x: int
  y: int

So let’s say we have a hashtable like this:

p = Posn(1, 2)
objects = {
  p: "tree",
  Posn(4, 3): "rock"
}

We might think that Python would store this hashtable in memory like this:

location data
loc 1 <hashtable of length 2>
loc 2 ht member: “rock”
loc 3 ht member: “tree”

What happens, though, if we change one of the keys (p.x = 3)? Python might look in the wrong location for the value. Hashtables aren’t going to work if our keys can be changed! Because of this, the objects hashtable above will actually cause Python to throw an error, something like unhashable type: Posn.
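
Trying it out, the failure looks something like this:

> p = Posn(1, 2)
> objects = {p: "tree", Posn(4, 3): "rock"}
TypeError: unhashable type: 'Posn'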

We can get around this by modifying our Posn class:

@dataclass(frozen=True)
class Posn:
  x: int
  y: int

This tells Python that Posns should be immutable (or “frozen”). Now we can use them as keys in a hashtable, but we can’t modify instances:

> p = Posn(1, 2)
> p.x = 3
FrozenInstanceError: cannot assign to field 'x'
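
With the frozen version, the objects hashtable from before works, and a freshly built Posn with the same x and y finds the same entry, since frozen dataclasses hash based on their field values:

> objects = {Posn(1, 2): "tree", Posn(4, 3): "rock"}
> objects[Posn(1, 2)]
'tree'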