Lecture 27: Replication, Consistency, Summary and Outlook
» Lecture video (Brown ID required)
» Post-Lecture Quiz (optional)
Remote Procedure Calls
Notes coming soon!
Replication
A sharded distributed system achieves its high scalability and simplicity by having a single server take responsibility for an individual resource (a key in KVStore). This arrangement has the rather nice property that this single server is always the source of ground truth for the state of this resource (e.g., the value for a key). But it also has a serious downside: what if that server fails?
When a server fails, all resources that it controls in a sharded system become unavailable. This may mean that a web service goes down for certain users, or even entirely. Consider a KVStore that has lost all keys starting with I-S: it wouldn't be much use. Recovering from an on-disk or offsite tape backup would take far too long (hours!) to be practical for today's web services.
To fix this problem, we can add some redundancy into the system. Instead of storing each shard (i.e., range of keys) only on one server, we can replicate these keys on multiple servers.
In the example above, each shard is replicated on two servers: keys I-S on S1 and S2. If a server (e.g., S1) fails, the keys and their values are still available on another server (here, S2). A request to GET k sent to S1 would now time out, and the client would retry it on S2. Hence, this two-way-replicated design can tolerate the failure of any one server. More generally, an N-way replicated system can tolerate N - 1 failed servers.
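This failover behavior can be sketched in a few lines of Python. Note that the `Replica` class and `ReplicaDown` exception are illustrative stand-ins for a real KVStore client's RPC and timeout machinery, not actual course code:

```python
# Sketch of client-side failover for an N-way replicated shard.
# Names (Replica, ReplicaDown) are made up for this illustration.

class ReplicaDown(Exception):
    """Stands in for an RPC timeout to a failed replica."""

class Replica:
    def __init__(self, data, alive=True):
        self.data, self.alive = dict(data), alive
    def get(self, key):
        if not self.alive:
            raise ReplicaDown()
        return self.data[key]

def get_with_failover(replicas, key):
    """Try each replica in turn; succeed as long as at least one is up."""
    for server in replicas:
        try:
            return server.get(key)   # normally an RPC; here a method call
        except ReplicaDown:
            continue                 # this replica failed -- retry on the next
    raise RuntimeError("all replicas failed")

# Two-way replication tolerates one failed server:
s1 = Replica({"k": 7}, alive=False)  # S1 has crashed
s2 = Replica({"k": 7})
print(get_with_failover([s1, s2], "k"))  # -> 7, served by S2
```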
But there is a challenge! Consider this sequence of events:
- The client issues a SET request to set k to 2, which completes fine on S1.
- Server S1 fails immediately after handling this request and responding to the client, but before the new value for k is replicated to S2.
- The client now issues a GET request for k, which times out on failed S1.
- The client retries on S2, which responds with the latest value of k that it knows about, which is 7.
- The client thus reads a stale value for k.
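This sequence of events can be reproduced in a toy model. Everything here (the `Server` class, the explicit crash flag) is illustrative, not real KVStore code:

```python
# Toy reproduction of the stale-read sequence: S1 fails after applying a
# SET but before replicating it to S2, so a failover read returns old data.

class ReplicaDown(Exception):
    """Stands in for a request timeout to a failed server."""

class Server:
    def __init__(self, data):
        self.data, self.alive = dict(data), True
    def set(self, key, value):
        if not self.alive: raise ReplicaDown()
        self.data[key] = value
    def get(self, key):
        if not self.alive: raise ReplicaDown()
        return self.data[key]

s1, s2 = Server({"k": 7}), Server({"k": 7})

s1.set("k", 2)           # 1. SET k = 2 completes on S1
s1.alive = False         # 2. S1 fails before replicating to S2

try:
    value = s1.get("k")  # 3. GET on S1 times out...
except ReplicaDown:
    value = s2.get("k")  # 4. ...so the client retries on S2

print(value)             # 5. stale read: prints 7, not 2
```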
Consistent Replication
Coming soon!
Real-world Distributed Systems
Strong vs. Weak Consistency
Maintaining strong consistency guarantees correct results and is convenient for application developers. However, it comes at the cost of performance and scalability: strong consistency reduces the effective concurrency in the system (think about all the communication and locks required!). Consequently, many systems relax their consistency properties in exchange for performance. These systems are called weak consistency systems; they tend to scale better, but you don't want to use them to handle crucial information like monetary balances or user account creation.
Many companies therefore run a mix of strongly consistent and weakly consistent systems. Here are some examples:
Strong Consistency | Weak Consistency |
---|---|
MySQL (Facebook) | memcached (Facebook, many others) |
TAO (Facebook) | |
Spanner (Google) | BigTable (Google) |
Dynamo (Amazon) | |
NFS (CS department) | |
Blockchains | |
The links in the above table point to research papers about these systems and how the companies use them. If you're curious to learn more, take a look!
Infrastructure at Scale
Modern web services can have millions of users, and the companies that operate them run serious distributed systems infrastructure to support these services. The picture below shows a simplified view of the way such infrastructure is typically structured.
End-users contact one of several datacenters, typically the one geographically closest to them. Inside that datacenter, their requests are initially terminated at a load-balancer (LB). This is a simple server that forwards requests onto different frontend servers (FE) that run an HTTPS server (Apache, nginx, etc.) and the application logic (e.g., code to generate a Twitter timeline, or a Facebook profile page).
The front-end servers are stateless, and they contact backend servers for information required to dynamically generate web page data to return to the end-users. Depending on the consistency requirements for this data, the front-end server may either talk directly to a strongly-consistent database, or first check for the data on servers in a cache tier, which store refined copies of the database contents in an in-memory key-value store to speed up access to them. If the data is in the cache, the front-end server reads it from there and continues; if it is not in the cache, the front-end server queries the database.
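The cache-aside lookup a front-end server performs can be sketched as follows, with plain dictionaries standing in for the cache tier and the database (the function and key names are made up for this illustration):

```python
# Sketch of the "check cache, fall back to database" lookup pattern.
# Dicts stand in for memcached and the database; names are illustrative.

def lookup(key, cache, database):
    """Check the cache tier first; query the database only on a miss."""
    if key in cache:
        return cache[key]        # cache hit: fast, in-memory
    value = database[key]        # cache miss: query the source of truth
    cache[key] = value           # populate the cache for future requests
    return value

database = {"user:42": "Alice"}
cache = {}
lookup("user:42", cache, database)         # miss: reads the database
print(lookup("user:42", cache, database))  # hit: served from the cache -> Alice
```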
Note that the database, which is usually itself sharded and which acts as the source of ground truth, is replicated across servers, often with a backup replica in another datacenter to protect against datacenter outages.
Finally, the preceding infrastructure serves end-user requests directly and must produce responses quickly. This is called a service or interactive workload. Other computations in the datacenter are less time-critical, but may process data from many users. Such batch processing workloads include data science and analytics, training of machine learning models, backups, and other special-purpose tasks that run over large amounts of data. The systems executing these jobs typically split the input data into shards and have different servers work on distinct partitions of the input data in parallel. If the computation can be structured in such a way that minimal communication between shards is required, this approach scales very well.
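The shard-parallel batch pattern can be illustrated with a toy word-count job. In a real datacenter the workers would be separate servers rather than threads, and all names here are invented:

```python
# Toy batch job: split the input into shards, process each shard with an
# independent worker, then combine the per-shard results.

from concurrent.futures import ThreadPoolExecutor

def process_shard(lines):
    """Per-shard work: count the words in this partition of the input."""
    return sum(len(line.split()) for line in lines)

data = ["a b c", "d e", "f", "g h i j"]
shards = [data[:2], data[2:]]            # partition the input into two shards

with ThreadPoolExecutor(max_workers=2) as pool:   # one worker per shard
    counts = list(pool.map(process_shard, shards))

print(sum(counts))                       # combine the shard results -> 10
```

Because the workers never communicate until the final combine step, adding more shards (and workers) scales the computation almost linearly.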
Bonus material: Transactions
In a distributed system, handling a client request sometimes requires operations to take place on multiple servers. As an example, consider posting to Facebook: you would rather like your new post to be replicated to multiple servers, so that it doesn't disappear if one of Facebook's servers goes offline for maintenance. Or consider a money transfer at your bank or on an application like Venmo, which needs to take money out of your account and deposit it into your friend's account – even if your and your friend's account balances are stored on different shards, and therefore on different servers.
Where is the client?
Classically, the client of a distributed system was the end-user device and the server was a single remote computer. But in today's applications, we have clients on end-user devices (e.g., a smartphone or laptop) and complex distributed systems infrastructure in a company's datacenters. In these settings, it is often the case that some "front-end" server in the datacenter acts on behalf of the end-user client to avoid sending many messages across the wide-area internet (which can take hundreds of milliseconds per message – a long time in computer terms!). This "proxy" client handles requests from the client device, and decomposes them into operations on different servers in the "back-end" of the web service. Towards these servers, the front-end server acts as a client.
Since the operations happen independently in the distributed system, it may be necessary to abort and undo (or "roll back") earlier operations if a later one fails. For example, consider a SET command for key k replicated over three servers:
- The client sends the SET command and new value for k to the first server.
- The first server applies the SET, changes the stored value for k, and acknowledges the success to the client.
- The client sends the SET command and new value for k to the second server.
- The second server fails and does not respond, or tells the client that it cannot apply the operation.
- The client detects this failure, and knows that it won't be able to update all servers.
- It is crucial that the SET on the first server gets undone at this point; otherwise, we would leave the system in an inconsistent state (namely, the replicas for k no longer agree).
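The undo logic in these steps might look like the following sketch, where the server model and error type are illustrative stand-ins:

```python
# Sketch of rollback: the client remembers old values so it can undo the
# SETs that already applied when a later replica fails. Names are made up.

class ReplicaDown(Exception):
    """Stands in for a failed/unresponsive server."""

class Server:
    def __init__(self, data, alive=True):
        self.data, self.alive = dict(data), alive
    def get(self, key):
        if not self.alive: raise ReplicaDown()
        return self.data[key]
    def set(self, key, value):
        if not self.alive: raise ReplicaDown()
        self.data[key] = value

def replicated_set(servers, key, value):
    done = []                                # (server, old value) pairs
    for s in servers:
        try:
            old = s.get(key)                 # remember the old value for undo
            s.set(key, value)
            done.append((s, old))
        except ReplicaDown:
            for srv, old in reversed(done):  # roll back earlier SETs
                srv.set(key, old)
            return False
    return True

servers = [Server({"k": 7}), Server({"k": 7}, alive=False), Server({"k": 7})]
ok = replicated_set(servers, "k", 2)
print(ok, servers[0].data["k"])  # False 7: the first server was rolled back
```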
The idea of a transaction (TX) captures this: the distributed system either (a) processes a client request in its entirety, without interference from other requests, or (b) fails the request and returns to the state prior to the request's arrival. A transaction wraps a set of separate operations to execute them in unison or not at all.
A transaction always has a defined beginning and end:
BEGIN TX {
    Operations (requests/RPCs to servers) contained in the TX
} COMMIT/ABORT TX

For example, a transaction that transfers $100 from A's account to X's account, whose balances are stored in a key-value store, may be written as follows:
BEGIN TX {
a = GET balance_A
x = GET balance_X
if (a > 100 and x != ERROR) {
SET balance_A = a - 100
SET balance_X = x + 100
} else {
ABORT TX
}
} COMMIT TX
If the transaction succeeds and all operations get completed, we say that the transaction "commits"; if it fails to complete one or more operations and undoes the others, we say the transaction "aborts" (fails).
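The transfer transaction above can be sketched in Python, modeling commit as atomically applying staged writes and abort as discarding them (a toy model of the semantics, not how real transaction systems are implemented):

```python
# Toy model of commit/abort: stage writes against a private copy of the
# store, and apply them all at once only if every step succeeds.

def transfer(store, src, dst, amount):
    staged = dict(store)                 # work on a private copy
    a = staged.get(src)
    x = staged.get(dst)                  # None models the ERROR case
    if a is None or x is None or a < amount:
        return False                     # ABORT TX: nothing is applied
    staged[src] = a - amount
    staged[dst] = x + amount
    store.clear(); store.update(staged)  # COMMIT TX: apply all writes
    return True

accounts = {"balance_A": 150, "balance_X": 50}
print(transfer(accounts, "balance_A", "balance_X", 100))  # True: commits
print(accounts)   # balance_A is now 50, balance_X is now 150
print(transfer(accounts, "balance_A", "balance_X", 100))  # False: aborts
```

An aborted transfer leaves `accounts` untouched, which is exactly the all-or-nothing behavior the pseudocode above describes.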
In the money transfer example, this means that while the above transaction executes, it shouldn't be possible for other requests that modify the account balances for A or X to succeed. Two examples are highlighted in red in the picture below; these are operations that another client may try to execute concurrently and which would mess up the correctness of our transaction above. (Consider what would happen if the SET A = A - 20 completed before our SET balance_A = a - 100 operation; or if DELETE X completed before the deposit to X.)
If we turn these competing requests into their own transactions, we get:
// T2
BEGIN TX {
a = GET balance_A
if (a > 20) {
SET balance_A = a - 20
} else {
ABORT TX
}
} COMMIT TX
// T3
BEGIN TX {
DELETE X
} COMMIT TX
When run correctly, the execution of these transactions is isolated. One way to achieve this isolation is for each transaction to take locks on all the objects accessed in the transaction, as highlighted in yellow in the next picture:

Since the locks ensure that transactions can only execute one after another, the order of execution now determines which transactions succeed and which fail. The picture shows several possible orders in green at the right-hand side: for example, the order T1, T2, T3 results in T1 and T3 succeeding, but T2 failing because A has insufficient funds. Likewise, if T2 runs before T1, T1 fails for the same reason.
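A minimal sketch of this lock-based isolation, assuming both transactions guard balance_A with the same lock (the names and starting balance are our own choices):

```python
# Sketch of lock-based isolation: T1 and T2 both take the lock on
# balance_A, so their reads and writes cannot interleave. Running T1
# strictly before T2 shows how execution order decides which commits.

from threading import Lock, Thread

store = {"balance_A": 110}
lock_A = Lock()
results = {}

def t1():                                # debit 100 if funds suffice
    with lock_A:
        a = store["balance_A"]
        if a > 100:
            store["balance_A"] = a - 100
            results["T1"] = "commit"
        else:
            results["T1"] = "abort"

def t2():                                # debit 20 if funds suffice
    with lock_A:
        a = store["balance_A"]
        if a > 20:
            store["balance_A"] = a - 20
            results["T2"] = "commit"
        else:
            results["T2"] = "abort"

# Force the order T1 then T2 (joining after each start); the lock alone
# is what prevents interleaving when transactions truly run concurrently.
for tx in (t1, t2):
    th = Thread(target=tx); th.start(); th.join()

print(results, store["balance_A"])  # T1 commits, T2 aborts, balance is 10
```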
The ACID properties
We can now define a more formal set of properties that transactions ensure.
- Atomicity, which means that the transaction behaves atomically: it either completes in its entirety (all operations contained in the transaction succeed), or not at all (none of the operations succeed).
- Consistency, which means that the transaction starts with the system in a consistent state and ends with the system in a (potentially different) consistent state. In other words, if a transaction modifies a value, all servers replicating it move from all holding the old value to all holding the new value.
- Isolation, which means that the transaction executes as if it were running on a single-threaded, concurrency-free system: it cannot tell whether any other transactions are concurrently active.
- Durability, which refers to the notion that the transaction's success survives restarts of participating servers, which in practice means that the transaction's effects have to be stored durably on persistent storage (SSD or harddisk).
Transactions and the ACID properties are concepts: merely stating them doesn't make them true in your system! Someone needs to implement mechanisms that ensure that the system maintains the ACID properties, as well as APIs for clients to start transactions and to attempt to commit them. Such mechanisms often include locking (to help with the A, C, and I properties) and strategies for efficient writes to disk (to help with the D property).
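As a sketch of what the D property demands, a commit could be acknowledged only after the write has been flushed and fsync'd to persistent storage; the append-only log format and function name here are our own invention:

```python
# Sketch of durability: append each committed write to a log file and
# fsync it before acknowledging, so the write survives a crash/restart.

import os, tempfile

def durable_set(path, key, value):
    with open(path, "a") as log:    # append-only log of committed writes
        log.write(f"{key}={value}\n")
        log.flush()                 # push data out of user-space buffers...
        os.fsync(log.fileno())      # ...and force the OS to write the disk
    return True                     # only now acknowledge the commit

path = os.path.join(tempfile.mkdtemp(), "kv.log")
durable_set(path, "k", 2)

with open(path) as f:               # a "restarted" server replays the log
    print(f.read().strip())         # -> k=2
```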
Summary
Today, we further explored the complexities introduced by distributed systems. We covered sharding and the use of replication as a fault-tolerance mechanism. In particular, we looked at the common situation where a logical client request is actually split into operations across multiple servers – either because updates need to be replicated across servers, or because the request requires operations on multiple shards (e.g., a money transfer).
We also briefly talked about some examples of real-world distributed systems that exist at different ends of the strong vs. weak consistency spectrum, and we saw how both types of system are typically required in a web company.
EOF
This is the end of CS 300! If you're thinking of courses to take next, here are some courses that dive deeper into the concepts we learned about in this course.
- CSCI 1260 considers the internals of a compiler like GCC or Clang, and teaches you how to build one.
- CSCI 1270 looks at databases, which are an important kind of structured storage system. You'll learn more about transactions and concurrency control too.
- CSCI 1380 dives deeper into distributed systems and how to design systems that can survive even complex failures.
- CSCI 1600 is about embedded systems and software-hardware interactions.
- CSCI 1650 looks at the security issues that exist in low-level systems (such as the buffer overflow from Lab 3), and at how malicious agents can hack into computer systems.
- CSCI 1660 also covers applied computer security, but looks at slightly higher level threats as well as some policy questions about secure system design.
- CSCI 1670 (and its lab variant, CSCI 1690) are all about OS kernel programming, with a more advanced and complete OS than WeensyOS.
- CSCI 1680 covers computer networking, and looks in much more detail at how networks transmit data between computers, as well as how the global internet really works.
- CSCI 1760 is about multiprocessor synchronization, which involves both the theory and low-level details behind synchronization objects like mutexes and condition variables.
- CSCI 2390 is a research seminar that looks at how we can build systems (particularly distributed systems) that better preserve users' data privacy and data ownership rights.