Recovery View as a PDF

Theme Song: end of THE WORLD

What happens when our database crashes? What if we lose power for a moment or run out of memory? In dramatic terms, how can our database be prepared for the end of THE WORLD? The answer is recovery! In this assignment, you'll make your database crash-tolerant using write-ahead logging and recovery.


Background Knowledge

Write-Ahead Logging

When our database crashes, we lose everything that was stored in volatile memory. Write-ahead logging aims to solve this problem by storing enough information to recreate the state of the database after a crash.

In write-ahead logging, we write a log specifying the action we undertook before we actually do it. For example, if we inserted (10, 10) into the database, we would write a log that might look like "< Tx1, INSERT, 10, 10, tablename >", indicating that transaction 1 inserted the value (10, 10) into table "tablename". If we do this for every write action a database undertakes, then we will be able to recover just by replaying the log!

You might be tempted to remark that write-ahead logging seems redundant and slow; why don't we just save the state of the database each time? The nature of write-ahead logging makes writing a log much less expensive than flushing a page: in short, appending a string to a file is much cheaper than mutating a data structure and flushing the corresponding pages.

Checkpoints

Let's say we've crashed and decide to recover our database by replaying history. You may have noticed that it is rather inefficient to replay a database's entire history. Especially since our database already has data stored on disk, we want to do the minimum amount of work required to restore the database.

To achieve this, we use checkpoints. A checkpoint is a point in the logs where we can be sure that all data in memory was flushed to disk. In other words, there is no need to replay information up to a checkpoint, since it already exists on disk. A checkpoint log also contains information specifying which transactions are currently active; as you will see soon, this is imperative for recovery, since it gives us information about what instructions should actually be undone.

When we call checkpoint on the REPL, we first flush all pages to disk, then write a checkpoint log to disk. This order is important; if we wrote the log first, but crashed during the flush itself, our recovery algorithm would not work!

Log Types

Now that we understand checkpoints, we should introduce the other kinds of logs. There are five kinds of logs in our database: TABLE logs, EDIT logs, START logs, COMMIT logs, and CHECKPOINT logs.

TABLE logs signify the creation of a table. The structure of a TABLE log is < create tblType table tblName >, where tblType is either hash or btree, and tblName is the name of the table.

EDIT logs signify actions that have modified the state of the database; for example, before a database insertion, deletion, or update, an EDIT log is written to disk. The structure of an EDIT log is < Tx, table, INSERT|DELETE|UPDATE, key, oldval, newval >, where Tx is the transaction that undertook this edit, table is the affected table, key is the affected key, and oldval and newval after the before and after results. Note that oldval xor newval can be null - think about why!

START logs signify the beginning of a new transaction. The structure of a START log is < Tx start >.

COMMIT logs signify the end of a running transaction. The structure of a COMMIT log is < Tx commit >.

CHECKPOINT logs list the currently running transactions and guarantee that all memory has been flushed to disk. The structure of a CHECKPOINT log is < Tx1, Tx2... checkpoint >.

Using these logs, we can define a recovery algorithm that can recovery from any database crash. We've handled the serialize and deserialize functions for you in the code; this will be useful, though, if you decide to debug by looking at your WAL.

The Recovery Algorithm

Now that we have all of the groundwork we need, let's formalize the recovery algorithm. It goes as follows:

  1. Seek backwards through the logs to the most recent checkpoint and note which transactions are currently active.
  2. Replay all actions from the most recent checkpoint to the end of the log, keeping track of which transactions are active.
  3. Undo all transactions that have failed to commit.

We encourage you to think about the following cases and why this algorithm properly restores the database:


Assignment Spec

Overview

In this project, you'll implement crash recovery! After this final piece, you will have fully completed a disk-oriented, multi-indexed, fully-concurrent and fully-crash tolerant database! Congratulations!

The following are the files you'll need to alter:

And the following are the files you'll need to read:

New REPL Commands

Our REPL now supports the following commands:

Note that you will need transaction management from the Concurrency assignment; however, we will not be testing to ensure that your database can recover from a multi-threaded workload.

Part 0: Instrumenting Existing Code

Relevant files:

Once you download recovery.zip, you'll see a recovery subfolder; this should be placed as is in the pkg folder. You'll also want to replace the db package with the one linked below. Next, you'll want to replace your main.go with the one provided below. Make sure that your linter doesn't throw any new errors in the main.go folder. When finished, your filestructure should look like this:

./
    - cmd/
        - bumble/
        - bumble_client/
        - bumble_stress/
    - pkg/
        - btree/
        - config/
        - db/
        - utils/
        - hash/
        - list/
        - pager/
        - repl/
        - query/
        - concurrency/
        - recovery/

Next, add/change the following line in your config/default.go file:

// Name of log file.
const LogFileName = "./db.log"

You may have to insert config.LogFileName in bumble_stress to compile.

Next, add the following code to your page.go file:

// [RECOVERY] Grab the update lock.
func (page *Page) LockUpdates() {
	page.updateLock.Lock()
}

// [RECOVERY] Release the update lock.
func (page *Page) UnlockUpdates() {
	page.updateLock.Unlock()
}

Next, add/replace the following code to your pager.go file:

// TODO: Replace the following function
// GetFileName returns the file name.
func (pager *Pager) GetFileName() string {
	return filepath.Base(pager.file.Name())
}

// [RECOVERY] Block all updates.
func (pager *Pager) LockAllUpdates() {
	pager.ptMtx.Lock()
	for _, page := range pager.pageTable {
		page.GetKey().(*Page).LockUpdates()
	}
}

// [RECOVERY] Enable updates.
func (pager *Pager) UnlockAllUpdates() {
	for _, page := range pager.pageTable {
		page.GetKey().(*Page).UnlockUpdates()
	}
	pager.ptMtx.Unlock()
}

And finally, run go get github.com/otiai10/copy. Now you should be ready to code!

Part 1: Logging

Implement the following functions:

func (rm *RecoveryManager) Table(tblType string, tblName string)
func (rm *RecoveryManager) Edit(clientId uuid.UUID, table db.Index, action Action, key int64, oldval int64, newval int64)
func (rm *RecoveryManager) Start(clientId uuid.UUID)
func (rm *RecoveryManager) Commit(clientId uuid.UUID)
func (rm *RecoveryManager) Checkpoint()

Some notes:

Part 2: Recovery

Implement the following functions:

func (rm *RecoveryManager) Recover() error
func (rm *RecoveryManager) Rollback(clientId uuid.UUID) error

Some notes:

Part 3: Optimization (Optional)

Since this is the last assignment, we'd like to challenge you to take our database to the next level somehow! Given the length of Databases are complex creatures that often support many features. Sometimes they span millions of lines of code. From this rich field of possibilities, we have some suggestions for potential optimizations/improvements you could undertake; if any of them seem interesting to you, come talk with us about it. We'd love to hear about what you end up building! If you do any optimization, please write about it in your README.

Testing

Run our unit tests on Gradescope.

Getting started

To get started, download the following stencil packages: recovery, db. We've provided an updated main.go file here. See Part 0 on how to integrate this with the rest of your code.


Feedback

As this is a new course, we appreciate any feedback you have for us! If you enjoyed this assignment, hated this assignment, or have other thoughts to share, please do so here!