Lecture 21: RPC, Sharding, and Replication
» Lecture code
» Post-Lecture Quiz (due 11:59pm Monday, April 12)
S1: Remote Procedure Call (RPC)
Using network connections and sockets directly involves programming against a byte stream abstraction. Both the server and the client code must encode application messages into bytes and manage delimiting messages (e.g., by prepending them with a length, or by using terminator characters). This requires a lot of "boilerplate" code – for example, about 80% of the lines in WeensyDB relate to request parsing and encoding.
Remote Procedure Call (RPC) is a more convenient abstraction for programmers. An RPC is like an ordinary function call in the client, and like a callback on the server. In particular, RPCs hide the details of the encoding into bytes sent across the network (this is called the "wire format") from the application programmer, allowing application code to be separated from protocol details, and affording the flexibility to change the underlying protocol without having to change the application.
RPCs are often implemented via infrastructure that includes automatically-generated stubs, which take care of encoding operations into the wire format, managing network connections, and handling (some) errors. An example of a widely-used RPC stub compiler is gRPC, an RPC library from Google (others, such as Apache Thrift, also exist), which works with the Protocol Buffers library (a library that provides APIs for encoding into different wire formats, such as efficient binary representations and JSON) to generate efficient and easy-to-use RPC stubs. You will use both gRPC and Protobufs in Lab 8 and the Distributed Store project.
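To make this concrete, here is a minimal sketch of what calling a generated gRPC stub from C++ might look like. The service definition (WeensyDB, GetRequest, GetReply), the server address, and the key are hypothetical illustrations, not the actual Lab 8 interface; the shape of the calls follows gRPC's usual generated-code conventions.

```cpp
// Hypothetical .proto, shown here as a comment (not the actual Lab 8 API):
//
//   service WeensyDB {
//     rpc Get(GetRequest) returns (GetReply);
//   }
//   message GetRequest { string key = 1; }
//   message GetReply   { string value = 1; }
//
// protoc with the gRPC plugin would turn this into generated stub code
// (weensydb.pb.h / weensydb.grpc.pb.h) that the client calls directly.
#include <grpcpp/grpcpp.h>
#include <iostream>
#include <memory>
#include "weensydb.grpc.pb.h"

int main() {
    // The stub manages the connection; no socket syscalls in our code.
    auto channel = grpc::CreateChannel("localhost:50051",
                                       grpc::InsecureChannelCredentials());
    std::unique_ptr<WeensyDB::Stub> stub = WeensyDB::NewStub(channel);

    GetRequest request;
    request.set_key("hello");
    GetReply reply;
    grpc::ClientContext context;

    // Feels like an ordinary function call; the generated code encodes
    // the request into the wire format and sends it to the server.
    grpc::Status status = stub->Get(&context, request, &reply);
    if (status.ok()) {
        std::cout << "value = " << reply.value() << std::endl;
    }
}
```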
As an example, consider how WeensyDB with RPCs works (see the picture above): rather than encoding GET and SET requests as strings with newlines and spaces as delimiters, an RPC-based WeensyDB would use generated stubs to expose a get(key)/set(key, value) API on the client. When the client application calls these functions, it calls into generated stub code, which encodes the request in whichever wire format makes sense, and sends the data to the server via a network connection (which the generated stub also initiates via the socket syscalls and whose file descriptors (FDs) it manages). On the server side, the receiving stub code calls developer-provided implementations of the get() and set() functions, and takes care to send any return values back to the client via the network connection.
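For intuition about what the generated stub is doing under the hood, here is a hand-written sketch of the client-side boilerplate it replaces, assuming a hypothetical length-prefixed wire format (not WeensyDB's actual protocol) and eliding error handling:

```cpp
// Hand-rolled client "stub" over a connected socket fd: the kind of
// encoding and framing code that RPC libraries generate for us.
// Assumes a made-up length-prefixed wire format; error handling and
// partial reads/writes are omitted for brevity.
#include <arpa/inet.h>
#include <unistd.h>
#include <cstdint>
#include <string>

// Send one message, framed with a 4-byte big-endian length prefix.
void send_msg(int fd, const std::string& msg) {
    uint32_t len = htonl(msg.size());
    write(fd, &len, sizeof(len));
    write(fd, msg.data(), msg.size());
}

// Receive one framed message.
std::string recv_msg(int fd) {
    uint32_t len = 0;
    read(fd, &len, sizeof(len));
    std::string buf(ntohl(len), '\0');
    read(fd, &buf[0], buf.size());
    return buf;
}

// To the caller, this looks like a local function; internally it does
// the encoding, framing, and socket I/O that a generated stub would.
std::string get(int fd, const std::string& key) {
    send_msg(fd, "GET " + key);
    return recv_msg(fd);
}
```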
S2: Sharding
WeensyDB is a distributed system with many clients and a single server. But in some settings, a single server is not sufficient to handle all the requests that users generate. Such settings include popular web services like Google, Facebook, Twitter, or Airbnb, which all receive millions of user requests per second – clearly more than a single server can handle. This requires scaling the service.
Vertical vs. Horizontal Scalability
When we talk about scalability, we differentiate between two different ways of scaling a system.
Vertical scalability involves adding more resources to scale the system on a single computer. This might, for example, be achieved by buying a computer with more processors or RAM (something that is pretty easy – if expensive – on cloud infrastructure like AWS or Google Cloud). Vertical scalability is nice because it does not require us to change our application implementation too much: simply adding more threads to a program is sufficient to handle more requests (assuming the computation was parallelizable to begin with).
But the downside of vertical scalability is that it faces fundamental limits: due to the physics of energy conservation, heat dissipation, and the speed of light, a computer cannot have more than a certain number of processors (in the hundreds with current technology) before it runs too hot or would have to slow down processor speed significantly. This puts a practical limit on how far we can scale a system vertically. Another limit (but sometimes also a benefit) is that a vertically-scaled system is a single fault domain: if it loses power, the entire system turns off. This can be a problem (a website run from this computer no longer works), but – as we will see when we discuss the alternatives – it also avoids a lot of complexity associated with more resilient distributed systems.
The alternative is horizontal scaling, which works by adding more computers to the system (i.e., making the server itself a distributed system). This is easy to do in principle: public cloud platforms allow anyone with a credit card to rent hundreds of virtual machines (or more) with a few clicks. This provides practically unlimited scalability, as long as we can figure out a way to split our application in such a way that it can harness many computers. (It turns out that this split, and issues related to fault tolerance, really add a lot of complexity to the system, however.)
Sharding: splitting a service for scalability
To use multiple computers to scale a service, we need a way to split the service's work between many computers. Sharding is the term for such a split: think about throwing our system on the floor and seeing it break into many shards, which are independent but together make up the whole of the system.
To shard a system, we split its workload along some dimension. Possible dimensions include splitting by client, or splitting by the resource that a client seeks to access (e.g., in an RPC). When you talk to a website like google.com, you and other users access the same domain name, but actually talk to different computers, both depending on your geographic location and based on a random selection implemented by the Domain Name System (DNS) server for google.com. This "load balancing" is a form of sharding by client: different front-end servers in Google's data centers receive network connection requests from different clients, based on the IP address that google.com resolved into for each specific client.
But sharding by client requires that every server that a client might talk to be equally able to handle its requests. This is pretty difficult to ensure in a practical distributed system. Consider WeensyDB: if we were to shard it by client, every server would need to be able to serve requests for every single key stored in the WeensyDB. A better alternative for this kind of stateful service is to shard by resource (i.e., by key in the case of WeensyDB).
In practice, this sharding might be realized by splitting the key space of WeensyDB's keys (which are strings) into different regions ("shards") assigned to different servers. In the picture above, server S0 handles keys starting with letters A-H, while S1 handles those starting with I-S, and S2 handles T-Z.
To make this sharding work, the client must know which server is responsible for which range of keys. This assignment (the "sharding function") is either hard-coded into the client, or part of a configuration it dynamically obtains from a coordinator system (in the Distributed Store project, this coordinator is called "shardmaster").
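As an illustration, a client-side sharding function for the A-H/I-S/T-Z split above might look like the sketch below. The server addresses are made up, and the table is hard-coded here; in the project, the client would instead obtain this configuration from the shardmaster.

```cpp
#include <cctype>
#include <string>
#include <vector>

// One entry per shard: the lowest first letter a server is responsible
// for, and that server's (hypothetical) address. Entries are sorted.
struct Shard {
    char first_letter;
    std::string server_addr;
};

const std::vector<Shard> config = {
    {'A', "s0.example.com:5000"},  // A-H
    {'I', "s1.example.com:5000"},  // I-S
    {'T', "s2.example.com:5000"},  // T-Z
};

// The sharding function: map a (non-empty) key to the server whose
// range contains the key's first letter.
std::string server_for(const std::string& key) {
    char first = std::toupper(key.at(0));
    std::string addr = config.front().server_addr;
    for (const Shard& s : config) {
        if (first >= s.first_letter) {
            addr = s.server_addr;  // last range starting at or below first
        }
    }
    return addr;
}
```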
A properly sharded service scales very well, since we can simply add more servers and split the key ranges assigned to a shard in order to add more capacity to the system. But there are some edge cases: for example, many social media services have highly skewed key popularities, causing a few popular keys (e.g., the timeline for a celebrity user) to receive a disproportionately larger number of requests than others. This means that the load different keys induce is no longer equal, and the sharding must take this into account.
S3: Replication
A sharded distributed system achieves its high scalability and simplicity by having a single server take responsibility for an individual resource (a key in WeensyDB). This arrangement has the rather nice property that this single server is always the source of ground truth for the state of this resource (e.g., the value for a key). But it also has a serious downside: what if that server fails?
When a server fails, all resources that it controls in a sharded system become unavailable. This may mean that a web service goes down for certain users, or even entirely. Consider a WeensyDB that has lost all keys starting with I-S: it wouldn't be much use. Recovering from an on-disk or offsite tape backup would take far too long (hours!) to be practical for today's web services.
To fix this problem, we can add some redundancy into the system. Instead of storing each shard (i.e., range of keys) only on one server, we can replicate these keys on multiple servers.
In the example above, each shard is replicated on two servers: keys I-S on S1 and S2. If a server (e.g., S1) fails, the keys and their values are still available on another server (here, S2). A request to GET k sent to S1 would now time out and the client would retry it on S2. Hence, this two-way-replicated design can tolerate the failure of any one server. More generally, an N-way replicated system can tolerate N - 1 failed servers.
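Below is a sketch of the client-side retry logic this design implies. The try_get helper is a hypothetical stand-in for a single-server GET that fails (here, returns std::nullopt) on timeout; a real client would use its RPC stub or socket code with an actual timeout.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical single-server GET; returns std::nullopt on timeout or
// failure. This placeholder always "times out" so the sketch compiles.
std::optional<std::string> try_get(const std::string& server,
                                   const std::string& key) {
    (void)server;
    (void)key;
    return std::nullopt;
}

// Replicated GET: try each replica of the key's shard in turn. With
// N replicas, this tolerates up to N - 1 failed servers.
std::optional<std::string> replicated_get(
        const std::vector<std::string>& replicas, const std::string& key) {
    for (const std::string& server : replicas) {
        if (auto value = try_get(server, key)) {
            return value;  // first live replica answers the request
        }
    }
    return std::nullopt;  // all replicas failed
}
```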
But there is a challenge! Consider this sequence of events:
- The client issues a SET request to set k to 2, which completes fine on S1.
- Server S1 fails immediately after handling this request and responding to the client, but before the new value for k is replicated to S2.
- The client now issues a GET request for k, which times out on failed S1.
- The client retries on S2, which responds with the latest value of k that it knows about, which is 7 (the value from before the SET).
- The client thus reads a stale value for k.
Summary
Today, we learned about three key ideas in distributed systems: RPCs, sharding, and replication.
RPCs make it easier for programmers to write distributed systems code, since auto-generated RPC stubs prevent the application programmer from having to implement byte stream-level protocol code by hand. Instead, an operation on a remote server feels just like a function call from the application perspective.
We then talked about how to scale a system using vertical and horizontal scalability, and discussed a key horizontal scalability technique, sharding. A sharded system partitions its state along some dimension, allowing different servers to maintain independent slices ("shards") of state.
Finally, we talked about the possibility of failures in distributed systems. Since a distributed system uses more than one computer, it must be prepared to handle partial failures, where some of the computers in the distributed system are still operational, but one or more others are not. With a sharded system, the loss of a single computer would make entire parts of the state of the system disappear, with potentially disastrous consequences. We learned that replicating state on multiple computers helps avoid this problem, albeit at the cost of introducing the complexity of maintaining consistent state across different computers.