Hash
Theme Song: instagram
Sometimes, range queries and sorted order don't matter too much to you, and all you care about is fast lookup. A hash table is a very popular data structure for performing quick lookups, used in everything from coding interviews to industry-standard indexes. In this assignment, you will implement an extendible hash table.
Background Knowledge
Extendible Hashing
As you know from lecture, extendible hashing is a method of hashing that adapts as you insert more and more data. For direct lookups, it can be a very efficient and lightweight indexing solution. However, it does not support range queries or maintain any sorted order, making it less than ideal as a general purpose indexing scheme.
An extendible hash table has two moving parts: the table and the buckets. The table is essentially an array of pointers to buckets. Each bucket has a local depth, denoted \(d_i\) , which is the number of trailing bits that every element in that bucket has in common. The table has a global depth, denoted \(d\) , that is the maximum of the local depths.
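To make the moving parts concrete, here is a minimal in-memory sketch of the two structures. The field names are illustrative only; the stencil's real versions are page-backed and reference buckets by page number rather than by pointer:

```go
// Illustrative only: the stencil's HashBucket and HashTable live on disk
// pages, so their actual fields differ from this in-memory picture.
type HashBucket struct {
	localDepth int64         // trailing hash bits shared by every entry here
	entries    []entry.Entry // the key/value pairs in this bucket
}

type HashTable struct {
	globalDepth int64         // max over all buckets' local depths
	buckets     []*HashBucket // 2^globalDepth slots; several may share a bucket
}
```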
A hash table also has an associated hash function. Ideally, this hash function is fast and uniform across the output space, meaning that keys will be uniformly distributed across the buckets. In this implementation, we use xxhash since it is the state of the art, but really, many other hash functions would work just fine. You won't need to call this hash function directly; just use the Hasher function in the hash_subr.go file.
Basic Operations
When a hash table performs lookup, the search key is hashed, and then the table is used to determine the bucket that we should look at. The last \(d\) bits of the search key's hash are used as the key to the table, and then we do a linear scan over the bucket it points to.
When a hash table performs insertion, the insertion key is hashed, and then the table is used to determine the bucket that our new tuple should be in. The last \(d\) bits of the hash are used as the key to the table, after which we add the new tuple to the end of the bucket.
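Putting those two operations together, a lookup against the illustrative in-memory table above might look like the following; Hasher comes from hash_subr.go, while the entry field names are assumptions:

```go
// Sketch of lookup: hash the key, index the table with the last d bits,
// then linearly scan the bucket the slot points to.
func (table *HashTable) lookup(key int64) (entry.Entry, bool) {
	hash := Hasher(key, table.globalDepth) // last globalDepth bits of the hash
	bucket := table.buckets[hash]
	for _, e := range bucket.entries {
		if e.Key == key {
			return e, true
		}
	}
	return entry.Entry{}, false
}
```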
Splitting
When a value is inserted into a hash table, one of two things can happen: either the bucket doesn't overflow, or the bucket does overflow. If it overflows, then we need to split our bucket. To split a bucket, we create a new bucket with the same search key as the bucket that overflowed, but with a "1" prepended to it. We then prepend a "0" to the original bucket's search key. As an example, if a bucket with a local depth of 3 and search key "101" splits, we would be left with two buckets, each of local depth 4 and with search keys "0101" and "1101". Then, we transfer the values from the old bucket to the new bucket, and we are done.
It's worth noting that while we talk about "prepending" as if we are dealing with strings, in actuality, this action is done entirely through the bit-level representation of integers. Think about what you would have to add to a search key to effectively prepend its bit representation with a 1 or a 0.
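Concretely, prepending a 0 to a search key of local depth \(d_i\) leaves the integer unchanged, while prepending a 1 adds \(2^{d_i}\). A toy helper (not part of the stencil) that captures this:

```go
// splitSearchKeys returns the two search keys produced when a bucket with
// the given search key and local depth splits: prepend-0 keeps the value,
// prepend-1 sets the bit just above the existing localDepth bits.
func splitSearchKeys(searchKey int64, localDepth uint) (zero, one int64) {
	return searchKey, searchKey + (1 << localDepth)
}
```

For the example above, splitSearchKeys(0b101, 3) returns 0b0101 (5) and 0b1101 (13).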
It is possible that no values belong in either the old or the newly created bucket, immediately triggering a second split. This may be a consequence of a bad hash function, but it is a totally possible, albeit unlikely, scenario. Don't worry too much about these cases; trust in the hash :-).
However, if a bucket overflows and ends up with a local depth greater than the global depth, this is no good: we must always maintain a global depth equal to the maximum of the buckets' local depths. To remedy this, we copy the hash table and append it to itself, doubling its size and increasing the global depth by 1. Then, we iterate through the slots and make sure that each one points to the correct bucket.
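Against the in-memory sketch from earlier, the doubling step is just appending the slot array to itself; the page-backed stencil does the moral equivalent:

```go
// Double the directory so the global depth can grow by one. Slot i and
// slot i + 2^(old depth) start out pointing at the same bucket, since
// they agree on the last (old depth) bits of the hash.
func (table *HashTable) double() {
	table.buckets = append(table.buckets, table.buckets...)
	table.globalDepth++
}
```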
Deletion and Fragmentation
When a hash table performs deletion, depending on your scheme, the other values may have to be moved over to fill the space the deleted entry left. A good indexing scheme would use slotted pages to avoid this, but our setup doesn't readily accommodate them. Alternatively, we could populate a free list. There are many ways to handle deletion, since we don't have to maintain any sort order within a bucket. In this assignment, you should shift the remaining entries one slot to the left, so that there is no internal fragmentation in the bucket.
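A sketch of that shift on the in-memory bucket from earlier (the stencil version would do the same thing with its cell-update helpers):

```go
// deleteAt removes the entry at index i by sliding every later entry one
// slot to the left, leaving no gap inside the bucket.
func (bucket *HashBucket) deleteAt(i int) {
	copy(bucket.entries[i:], bucket.entries[i+1:])
	bucket.entries = bucket.entries[:len(bucket.entries)-1]
}
```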
Assignment Spec
Overview
The following are the files you will need to alter:
- cmd/bumble/main.go
- pkg/db/db.go
- pkg/hash/table.go
- pkg/hash/bucket.go
And the following are the files you'll need to read:
- pkg/db/db.go
- pkg/hash/hash.go
- pkg/hash/hash_subr.go
- pkg/hash/entry.go
Subroutine Code
We've provided subroutine code so that you don't have to worry about the fact that all of this data is actually being stored on pages. We handle the marshalling and unmarshalling of data; you handle the actual logic of the hash table.
You will need to read the hash_subr.go and entry.go files to implement this assignment. The hash.go file is just a wrapper for the db.Index interface. You don't need to understand exactly how every function works; just understand what each one does so that you know when to call it. If you ever find yourself manipulating bytes directly, you have done something wrong and need to leverage the subroutine code more. You shouldn't change the subroutine code unless a TA has instructed you to do so. Note that a number of functions are tagged with future assignment names; you won't need them now.
A common pattern you will see is how we get buckets; we use defer to ensure that the page is returned to the pager eventually.
```go
// Get page and put later:
newBucket, err := NewHashBucket(table.pager, bucket.depth)
if err != nil {
	return err
}
defer newBucket.page.Put()
```
Implementation-specific Details
- Your hash table should allow inserts of duplicate values! Don't worry, duplicate keys entered through the repl are already rejected at the database layer, so there won't be any undefined behaviour. However, we do need to allow duplicate entries for later in the course. One known issue: if the bucket size is \(x\) and we insert \(x+1\) elements with the same key, this scheme will lead to infinite bucket splits; ignore this for now, because there is no good solution to this problem without involving overflow chains.
Part 0: Instrumenting Existing Code
Relevant files:
- cmd/bumble/main.go
- pkg/db/db.go
We have provided you an updated db package in your stencil. Once you download the hash-stencil.zip archive, you'll see a db subfolder and a hash subfolder; both of these should be placed as-is in the pkg folder, replacing the old version of the db package. Make sure that your linter doesn't throw any new errors in the main.go file. When finished, your file structure should look like this:
```
./
- cmd/
  - bumble/
- pkg/
  - btree/
  - config/
  - db/
  - utils/
  - hash/
  - list/
  - pager/
  - repl/
```
Next, you will need to uncomment all of the code blocks prefixed with [HASH] in your main.go (there may be none present; this is a failsafe in case the stencil ever changes in the future). If you're not sure whether your main.go is working, we've provided a copy that should work below.
Part 1: Bucket Functions
Implement the following functions:
```go
func (bucket *HashBucket) Find(key int64) (entry.Entry, bool)
func (bucket *HashBucket) Insert(key int64, value int64) (bool, error) // ALLOW DUPLICATE ENTRIES
func (bucket *HashBucket) Update(key int64, value int64) error
func (bucket *HashBucket) Delete(key int64) error
func (bucket *HashBucket) Select() ([]entry.Entry, error)
```
Some hints:
- These functions operate on a single bucket; you can safely assume that the key-value pair belongs in this bucket if these functions are called.
- Use modifyCell, updateValueAt, updateKeyAt, and updateNumKeys; a rough sketch of how they fit together follows this list.
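As one plausible shape, here is a sketch of Insert. The helper names come from the hints above, but their exact signatures, the numKeys field, and the meaning of the returned bool are all assumptions; check hash_subr.go for the real definitions:

```go
// Sketch only: assumes numKeys is the bucket's current entry count and that
// the returned bool tells the caller whether a split is needed.
func (bucket *HashBucket) Insert(key int64, value int64) (bool, error) {
	if bucket.numKeys >= BUCKETSIZE {
		return true, nil // bucket is full; ask the caller to split
	}
	bucket.updateKeyAt(bucket.numKeys, key)     // write the key into the next free cell
	bucket.updateValueAt(bucket.numKeys, value) // write the value beside it
	bucket.updateNumKeys(bucket.numKeys + 1)    // bump the entry count
	return false, nil
}
```

Note that there is deliberately no duplicate-key check; as discussed above, duplicates must be allowed.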
Part 2: Hash Table Functions
Implement the following functions:
```go
func (table *HashTable) Find(key int64) (entry.Entry, error)
func (table *HashTable) Split(bucket *HashBucket, hash int64) error
func (table *HashTable) Insert(key int64, value int64) error // ALLOW DUPLICATE ENTRIES
func (table *HashTable) Update(key int64, value int64) error
func (table *HashTable) Delete(key int64) error
func (table *HashTable) Select() ([]entry.Entry, error)
```
Some hints:
- To hash a value, use Hasher(key, depth).
- To get a bucket by its hash, use table.GetBucket(hash). To get it by its page number, use table.GetBucketByPN(pn).
- Split takes the hash of the bucket that should be duplicated; this makes it easier to call from Insert. A sketch of the overall insert flow follows this list.
- The maximum size of a bucket is BUCKETSIZE.
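Tying those hints together, the top-level insert might be sketched as below. This is not the reference solution: error handling is trimmed, the re-insert after a split is glossed over, and the field name table.depth is an assumption:

```go
// Sketch of table-level insert: hash the key, fetch the bucket, insert,
// and split if the bucket reports that it is full.
func (table *HashTable) Insert(key int64, value int64) error {
	hash := Hasher(key, table.depth)
	bucket, err := table.GetBucket(hash)
	if err != nil {
		return err
	}
	defer bucket.page.Put() // the defer pattern from the subroutine section
	needSplit, err := bucket.Insert(key, value)
	if err != nil {
		return err
	}
	if needSplit {
		return table.Split(bucket, hash)
	}
	return nil
}
```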
Debugging
When using your REPL to debug, we recommend temporarily setting the bucket size to a smaller number. WARNING: REMEMBER TO REVERT THESE CHANGES BEFORE YOU SUBMIT; OTHERWISE, THE AUTOGRADER WILL TIME OUT!
```go
// var BUCKETSIZE int64 = (PAGESIZE - BUCKET_HEADER_SIZE) / ENTRYSIZE // num entries
var BUCKETSIZE int64 = 8
```
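With a bucket size of 8, even a handful of inserts will force splits, so bugs in your Split logic become much easier to reproduce by hand in the REPL.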
Moreover, if you'd like to use the mod operator as the hasher instead of the actual hasher we use, feel free to replace Hasher with the following function:

```go
func Hasher(key int64, depth int64) int64 {
	return int64(key % powInt(2, depth))
}
```
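This works because, for non-negative keys, \(key \bmod 2^{d}\) is exactly the last \(d\) bits of the key itself, so you can predict by eye which bucket each key should land in while debugging.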
Testing
Run our unit tests on Gradescope. We've provided a subset of the tests here.
Getting started
To get started, download the following stencil packages: db and hash. We've provided an updated main.go file here. See Part 0 for how to integrate this with the rest of your code.
Feedback
As this is a new course, we appreciate any feedback you have for us! If you enjoyed this assignment, hated this assignment, or have other thoughts to share, please do so here!