How hashtables work
Today we’re talking about how hashtables really work: partly because it’s useful to know, and partly because it involves some very interesting ideas from computer science.
A more accurate model for memory
In order to talk about how hashtables work, we’ll have to start by refining our model of program execution one final time. Imagine we have a list representing the current Billboard Top 100. We’d write it out like this:
Dictionary:

name | value |
---|---|
top100 | loc 1 |

Memory:

location | data |
---|---|
loc 1 | [“Truth Hurts”, “Yellow Submarine”, …] |
This model suggests the whole list lives in a single memory location. In reality, each element of the list gets its own location. Here’s a more accurate picture.
Memory:

location | data |
---|---|
loc 1 | <list of length 100> |
loc 2 | list member: “Truth Hurts” |
loc 3 | list member: “Yellow Submarine” |
loc 4 | … |
So, what does Python do when we run something like:
> top100[1]
Rather than searching the list for the right value, Python can go directly to it: since `top100` is at loc 1, `top100[0]` is at loc 2, `top100[1]` is at loc 3, etc.
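To make this concrete, here’s a small sketch (not Python’s real internals; the `memory` table and `array_get` helper are invented for illustration) of how indexing turns into simple arithmetic on memory locations:

```python
# A toy model of memory: location numbers mapped to their contents.
memory = {
    1: "<list of length 100>",  # loc 1 holds the list itself
    2: "Truth Hurts",           # loc 2 holds element 0
    3: "Yellow Submarine",      # loc 3 holds element 1
}

def array_get(base_loc, index):
    # The element lives just past the list's location, at an address
    # we can compute directly -- no searching required.
    return memory[base_loc + 1 + index]

print(array_get(1, 0))  # Truth Hurts
print(array_get(1, 1))  # Yellow Submarine
```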
Notice that this is a very different model than Pyret’s lists. You might see Pyret-style lists (defined using `link` and `empty`) called linked lists, while Python-style lists are sometimes called arrays.
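For contrast, here’s a sketch of a Pyret-style linked list in Python (the `Link` class and `nth` helper are invented names). Notice that reaching element n means walking n links, unlike the direct arithmetic an array allows:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Link:
    first: Any              # this element
    rest: Optional["Link"]  # the rest of the list; None plays the role of `empty`

def nth(lst, n):
    # Walk n links to reach element n -- there's no shortcut.
    while n > 0:
        lst = lst.rest
        n -= 1
    return lst.first

songs = Link("Truth Hurts", Link("Yellow Submarine", None))
print(nth(songs, 1))  # Yellow Submarine
```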
Hashtables in memory
Now that we’ve seen how lists work in memory, we can see how hashtables work in memory. If our keys are numbers, we could imagine using exactly the same scheme as lists. So if we had a hashtable like

```python
{1: "ones", 10: "tens", 100: "hundreds"}
```

we could store it in memory:

location | data |
---|---|
loc 1 | <hashtable of length 101> |
loc 2 | <empty> |
loc 3 | ht member: “ones” |
loc 4 | <empty> |
loc 5 | <empty> |
… | … |
loc 12 | ht member: “tens” |
Does this seem like a good idea? This gets us into another aspect of program performance: memory usage. In this case, a hashtable with only three elements is using over a hundred memory locations! If we added an entry to the table for the trillions place, we’d need over a trillion locations.
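We can measure the waste directly. Here’s a sketch of the use-the-key-as-the-index scheme (the `direct_address_table` helper is invented for illustration):

```python
def direct_address_table(entries):
    # The table must be long enough for the largest key,
    # no matter how few entries it actually holds.
    size = max(entries) + 1
    table = [None] * size  # None stands in for <empty>
    for key, value in entries.items():
        table[key] = value
    return table

table = direct_address_table({1: "ones", 10: "tens", 100: "hundreds"})
print(len(table))         # 101 locations...
print(table.count(None))  # ...98 of them empty
```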
Hashtables don’t really work like this. Instead, we run the keys through a hash function: a function that takes a key and returns a number in the range defined by the size of the hashtable.
So memory might look something like this:

location | data |
---|---|
loc 1 | <hashtable of length 3> |
loc 2 | ht member: “ones” |
loc 3 | ht member: “tens” |
loc 4 | ht member: “hundreds” |
So the hash function might look like:
key | hash |
---|---|
1 | 0 |
10 | 1 |
100 | 2 |

The details of how this function works are beyond this class’s scope, but we’ve already seen the last step. Python can get numbers in the range 0-2 using the modulus function: `x mod 3` is always in the range 0-2, regardless of what `x` is.
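As a sketch of that last step (the `slot_for` name is invented, and Python’s real hash values won’t match the example mapping above):

```python
def slot_for(key, table_size):
    # Whatever number the hash function produces, the modulus
    # squeezes it into the range 0 .. table_size - 1.
    return hash(key) % table_size

# x % 3 is always in the range 0-2, however large x gets:
print(7 % 3)          # 1
print(1_000_000 % 3)  # 1

for key in [1, 10, 100]:
    assert 0 <= slot_for(key, 3) <= 2
```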
The same approach works for strings and other datatypes, with a bit of a wrinkle.
Hashing datatypes
What if we want to keep track of the locations of objects we’ve put on a map? We might want to use a hashtable where the keys are `Posn`s (as we’ve defined before):

```python
from dataclasses import dataclass

@dataclass
class Posn:
    x: int
    y: int
```
So let’s say we have a hashtable like this:
```python
p = Posn(1, 2)
objects = {
    p: "tree",
    Posn(4, 3): "rock",
}
```
We might think that Python would store this hashtable in memory like this:

location | data |
---|---|
loc 1 | <hashtable of length 2> |
loc 2 | ht member: “rock” |
loc 3 | ht member: “tree” |
What happens, though, if we change one of the keys (`p.x = 3`)? Python might look in the wrong location for the value. Hashtables aren’t going to work if our keys can be changed! Because of this, the `objects` hashtable above will actually cause Python to throw an error, something like `unhashable type: Posn`.
We can get around this by modifying our `Posn` class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Posn:
    x: int
    y: int
```
This tells Python that `Posn`s should be immutable (or “frozen”). Now we can use them as keys in a hashtable, but we can’t modify instances:
```
> p = Posn(1, 2)
> p.x = 3
ERROR
```
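Putting the pieces together, here’s a sketch of the frozen version in action (`FrozenInstanceError` is the exception the `dataclasses` module actually raises on mutation):

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Posn:
    x: int
    y: int

p = Posn(1, 2)
objects = {p: "tree", Posn(4, 3): "rock"}

# Frozen dataclasses hash by their field values, so an equal Posn
# finds the same entry:
print(objects[Posn(1, 2)])  # tree

try:
    p.x = 3  # attempting to mutate a frozen instance fails
except FrozenInstanceError:
    print("ERROR")
```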