A great way to understand data-intensive systems deeply—including their challenges, the space of available solutions, and their trade-offs—is to build them. As part of this course students will build a version of a real distributed system — a small but realistic version of Google. The course follows a new approach to teaching such systems termed distribution-as-a-library: as part of this approach, students use a conventional (non-distributed) programming language to build a library of distributed services.

This is a hands-on implementation-heavy cours — students will incrementally build components of a large-scale distributed system. They will start with a high-performance web server implementing HTTP, then expand it towards an actor-style message-passing substrate, then build a scalable queryable key-value store, then build a MapReduce framework, and then use that to implement scalable web crawling, indexing, and retrieval algorithms. Combined, these components will result in the implementation of Google-style search engine — including distributed, scalable crawling; indexing with ranking; stream processing; and even PageRank on the students’ own MapReduce-style platform!

Project Milestones #

All project milestones except the last one are individual. The pace is about one project milestone per week—due on Wednesday—except for the last milestone, which is due in two weeks after the release of the milestone handout. Milestones build on each other, eventually culminating in a large-scale distributed system deployed on the cloud.

M0. Setup & Centralized Computing: The goal of this milestone is to set up the development environment, familiarize students with the steps required for each assignment (e.g., GitHub commits, Gradescope), and confirm everyone’s background meets the requirements of the course. It will also serve as a refresher on JavaScript/Typescript and Shell programming.

M1. Serialization / Deserialization: To implement any form of distributed computation, two or more nodes need to communicate — and to do that, they first need to be able to exchange messages. Serialization often comes built into the programming language or runtime system — but here you will be implementing the core serialization and deserialization functionality yourselves. The goal of this milestone is to build the necessary infrastructure for exchanging messages between nodes — including converting any value from an in-memory structure to an on-wire message and correctly back to an in-memory structure.

M2. Actors & RPCs: The goal of this framework is to build a simple actor framework with associated runtime support. Actors can be thought of as event-driven processes that, in response to incoming messages, take a set of actions — e.g.,compute something locally, respond to a message, send messages to other actors, or interact with persistent local storage.

M3. Node Groups & Gossip: This milestone focuses on abstractions and systems support for addressing a set of nodes as a single system. Additional support includes scalable gossip protocols—a first version of a — for checking the health and status of all nodes in a node group, as well as for relaying messaging to node groups.

M4. Distributed Storage: The goal of this milestone is to implement a distributed and scalable storage subsystem over a set of nodes. The core of this milestone is a distributed key-value store centered around two classic techniques — consistent hashing and rendezvous hashing — to store, retrieve, update, and delete objects.

M5. Distributed Processing: This milestone focuses on implementing scalable processing abstractions modeled after the higher-order map and reduce functions. The goal is to build a distributed processing system that can process large amounts of data in parallel, across many nodes, and that can scale out to handle large amounts of data.

Final Project #

M6. Cloud Deployment: The final milestone is about deploying the entire system on real distributed infrastructure such as Amazon Web Services (AWS). Students here will work in groups, combining best-of-breed milestone implementations, considering additional trade-offs, and tuning several subsystems appropriately to offer a final product that is deployed and runs on the cloud.

M6 and the project poster/report is submitted in groups of, ideally, 4 students. These deliverables will describe the end-to-end system, and will have the form of a short research poster/paper: discussion of the problem statements and techniques applied, some examples, a detailed description of the design and implementation of the system-including the core challenges and how they were solved, and acknowledgement of the related work. It will also include an evaluation and characterization of each system along several dimensions — including correctness, performance, scalability, and fault tolerance — and also address the design and implementation decisions of different members in the team.

As part of the final project, students are expected to collaborate effectively — including comparing and contrasting their approaches to different milestones, thinking about the strengths and weaknesses of their respective approaches, and reflecting on their learning.

Poster Session #

Beyond some basics about the project, students have significant creative freedom in terms of the content and presentation of the poster. Students will present their posters during a poster session during the last week of class. Attendees during the poster session will include an evaluation committee of distributed-systems experts from industry and academia, who will ask each team questions about the implementation of the project, its performance, and other characteristics of interest.