Lecture 21: Synchronization: Atomics, Mutexes, and Condition Variables

» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 11:59pm Sunday, April 24)

Race conditions

Recall that in the last lecture we showed that updating shared memory variables from concurrent threads requires synchronization; otherwise, updates may get lost because of how operations in different threads interleave.

Here's an example in which two threads both try to increment an integer x in memory, originally set to the value 7. Recall that each thread needs to (i) load the value from memory; (ii) increment it; and (iii) write the incremented value back to memory. (In the actual processor, these steps may happen in microcode instructions, but we're showing them as x86-64 assembly instructions operating on a register %tmp here.)
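A sketch of one such racy schedule, with time flowing downward (each thread runs on its own processor, so each has its own %tmp):

T1                          T2
movl (&x), %tmp  // reads 7
                            movl (&x), %tmp  // also reads 7, not 8!
addl $1, %tmp    // %tmp = 8
movl %tmp, (&x)  // writes x = 8
                            addl $1, %tmp    // %tmp = 8
                            movl %tmp, (&x)  // writes x = 8 again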

This interleaving is bad! It loses one of the increments and ends up with x = 8 even though x has been incremented twice. The reason for this problem is that T2's movl (&x), %tmp instruction runs before T1's increment (addl) completes, so the processor running T2 holds (in its cache and registers) an outdated value of x and increments that outdated value. Both threads then write x = 8 back to memory.

To see that this is indeed a race condition, where the outcome only depends on how the executions of T1 and T2 happen relative to each other, consider the below schedule, which happens to run the operations without interleaving between the threads:
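A sketch of this race-free schedule, again with time flowing downward:

T1                          T2
movl (&x), %tmp  // reads 7
addl $1, %tmp    // %tmp = 8
movl %tmp, (&x)  // writes x = 8
                            movl (&x), %tmp  // reads 8
                            addl $1, %tmp    // %tmp = 9
                            movl %tmp, (&x)  // writes x = 9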

In this execution, we get the correct value of x = 9 (7 incremented twice) because T1's increment completes and writes x = 8 back to memory before T2 reads the value of x.

Having the outcome of our program depend on how its threads' executions randomly interleave is terrible, so we had better synchronize the threads to restrict execution to "good" interleavings.

This gives rise to a basic rule of synchronization: if two or more threads can concurrently access an object, and at least one of the accesses is a write, a race condition can occur and synchronization is required.

Atomics

Processors have special, atomic instructions that always execute without racing with any other processor's accesses. These instructions are "indivisible" (which is what the notion of "atomic" refers to in physics!), i.e., the processor does not internally decompose atomic instructions into smaller micro-code instructions. Atomic instructions are generally slower and more expensive (in terms of energy) to execute than normal instructions.

incr-atomic.cc implements synchronized shared-memory access using C++ atomics, which the compiler will translate into atomic instructions. The relevant code in threadfunc() is shown below:

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);
        // `*x += 1` and `(*x)++` also work!
    }
}

C++'s atomics library implements atomic additions using an x86-64 atomic instruction. When we use objdump to inspect the assembly of threadfunc(), we see a lock addl ... instruction instead of just addl .... The lock prefix of the addl instruction asks the processor to hold on to the cache line containing the shared variable (or in Intel terms, to "lock the memory bus") until the entire addl instruction has completed.

C++ atomics and lock-prefixed instructions only work with word-sized variables that cannot span multiple cache lines. (Recall that the alignment unit discussed that one reason for alignment is to avoid primitive types spanning cache blocks; here this notion surfaces again!)
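To see where this limit bites, consider the sketch below (not lecture code): std::atomic still compiles for types larger than a word, but the library may then fall back to an internal lock instead of a single lock-prefixed instruction, and is_lock_free() reveals which case you got. The output is platform- and compiler-dependent, and the 16-byte struct may require linking with -latomic on some systems.

#include <atomic>
#include <cstdio>

struct pair64 { long a, b; };     // 16 bytes: larger than one machine word

int main() {
    std::atomic<unsigned> small;  // word-sized: a single lock addl suffices
    std::atomic<pair64> big;      // may fall back to a hidden lock
    printf("small lock-free: %d, big lock-free: %d\n",
           (int) small.is_lock_free(), (int) big.is_lock_free());
}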

Synchronization Objects

Synchronized updates for primitive types (like integers) can be achieved by using the C++ std::atomic template library, which automatically translates the addl instruction used to perform the update into a lock-prefixed instruction that behaves atomically.

C++'s std::atomic library is powerful (and also great progress in standardization of atomic operations), but it works only on integers that are word-sized or smaller. To synchronize more complex objects in memory or perform more complex synchronization tasks, we need abstractions called synchronization objects.

Synchronization objects are types whose methods can be used to achieve synchronization and atomicity on normal (non-std::atomic-wrapped) objects. Synchronization objects provide various abstract properties that simplify programming with multi-threaded access to shared data. The most common synchronization object is called a "mutex", which provides the mutual exclusion property.

Mutexes

Mutual exclusion means that at most one thread accesses the shared data at a time.

In our multi-threaded incr-basic.cc example from the last lecture, the code does not work because more than one thread can access the shared variable at a time. The code would behave correctly if the mutual exclusion policy were enforced. We can use a mutex object to enforce mutual exclusion (incr-mutex.cc). In this example, it has the same effect as wrapping *x in a std::atomic template:

std::mutex mutex;

void threadfunc(unsigned* x) {
    for (int i = 0; i != 10000000; ++i) {
        mutex.lock();     // only one thread at a time gets past this line
        *x += 1;          // critical section: protected by the mutex
        mutex.unlock();   // let other threads enter
    }
}

The mutex (a kind of lock data structure) has an internal state (denoted by state), which can be either locked or unlocked. The semantics of a mutex object are as follows: lock() waits until the state is unlocked, then atomically sets it to locked and returns; unlock() sets the state back to unlocked. If several threads call lock() at the same time, only one of them acquires the mutex and the others wait.

The mutual exclusion policy is enforced in the code region between the lock() and unlock() invocations. We call this region the critical section.

A mutex ensures that only one thread can be active in a critical section at a time. In other words, the mutex enforces a thread interleaving like the one shown in the picture below, where T1 waits on the mutex m while T2 runs the critical section, and then T1 runs the critical section after it acquires the mutex once T2 releases it via m.unlock().
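A sketch of this schedule, with time flowing downward:

T1                            T2
m.lock()   // must wait...    m.lock()   // acquires m
                              *x += 1    // critical section
                              m.unlock() // releases m
// ...m.lock() returns
*x += 1    // critical section
m.unlock()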

How do atomic operations come into the mutex? Consider the implementation of m.lock(): there must be an operation that allows the thread calling lock() to set the lock's state to locked without any chance of another thread grabbing the lock in between. The way we ensure that is with an atomic operation: in particular, a "compare-and-swap" operation (CMPXCHG in x86-64) atomically reads the lock state and, if it's unlocked, sets it to locked.

Implementing a mutex with a single bit and cmpxchg

Extra material: not examinable in 2021.

Internally, a mutex can be implemented using an atomic counter, or indeed using a single atomic bit! Using a single bit requires some special atomic machine instructions.

A busy-waiting mutex (also called a spin lock) can be implemented as follows:

struct mutex {
    std::atomic<int> spinlock;   // 0 means unlocked, 1 means locked

    void lock() {
        // atomically store 1 and fetch the previous value; if the previous
        // value was already 1, another thread holds the lock, so keep spinning
        while (spinlock.exchange(1) == 1) {}
    }

    void unlock() {
        spinlock.store(0);       // mark the lock as free again
    }
};

The spinlock.exchange() method is an atomic swap: in one atomic step, it stores the specified value into the atomic spinlock variable and returns the variable's old value.

This works because lock() does not return unless spinlock previously contained the value 0 (which means unlocked). In that case, it atomically stores the value 1 (which means locked) to spinlock, preventing other lock() calls from returning and hence ensuring mutual exclusion. While a thread spin-waits, it simply swaps spinlock's old value of 1 with 1, effectively leaving the lock untouched. Please take a moment to appreciate how this simple construct correctly implements mutual exclusion.

x86-64 provides this atomic swap functionality via the lock xchg assembly instruction, and we have shown that it is all we need to implement a mutex with just one bit. x86-64 also provides a more powerful atomic instruction that further expands what can be done atomically on a modern processor: the compare-exchange instruction, lock cmpxchg. It is powerful enough to implement atomic swap, add, subtract, multiply, square root, and many other operations you can think of.

The behavior of the instruction is defined as follows:

// Everything in one atomic step
int compare_exchange(int* object, int expected, int desired) {
    if (*object == expected) {
        *object = desired;
        return expected;
    } else {
        return *object;
    }
}

This instruction is also accessible as the compare_exchange_strong() member method of C++ std::atomic type objects. Instead of returning the old value, it returns a boolean indicating whether the exchange succeeded; on failure, it writes the value it observed into its expected argument.
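As a sketch (this is not code from the lecture), here is how compare_exchange_strong() can build both a spin lock and one of those richer atomic operations, an atomic multiply:

#include <atomic>

std::atomic<int> spinlock(0);   // 0 means unlocked, 1 means locked

void lock() {
    int expected = 0;
    // succeed only if spinlock still holds 0; on failure,
    // compare_exchange_strong() writes the observed value (1) into
    // `expected`, so we must reset it before retrying
    while (!spinlock.compare_exchange_strong(expected, 1)) {
        expected = 0;
    }
}

void unlock() {
    spinlock.store(0);
}

// atomic multiply: retry until no other thread modified *x in between
void atomic_multiply(std::atomic<int>* x, int factor) {
    int old = x->load();
    while (!x->compare_exchange_strong(old, old * factor)) {
        // on failure, `old` was updated to the latest value; loop retries
    }
}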

Condition Variables

A condition variable is another synchronization object. It is useful when a thread holds a lock but cannot proceed because some condition needs to become true before it can make progress: for example, in Vunmo you may need to wait for the work queue to become non-empty (i.e., a worker must wait for the client-handling thread to insert work) or non-full (i.e., the client-handling thread must wait until workers have removed some items from the queue). Importantly, the waiting thread must give up its lock so that other threads can change the condition. This is what a condition variable is for. It supports the following operations: wait(mutex), which atomically releases the mutex and blocks the calling thread until another thread notifies the condition variable, then re-acquires the mutex before returning; notify_one(), which wakes up one thread waiting on the condition variable; and notify_all(), which wakes up all waiting threads. A sketch of the non-empty wait from the Vunmo example appears below.
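This sketch (not Vunmo's actual code) uses a std::condition_variable to make workers wait until the shared queue is non-empty; integers stand in for real work items:

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;                 // protects `work`
std::condition_variable cv;   // signaled when `work` gains an item
std::queue<int> work;         // stand-in for Vunmo's work queue

int get_task() {              // called by workers
    std::unique_lock<std::mutex> guard(m);
    while (work.empty()) {
        cv.wait(guard);       // releases m while blocked; re-acquires on wakeup
    }
    int task = work.front();
    work.pop();
    return task;
}

void put_task(int task) {     // called by the client-handling thread
    {
        std::unique_lock<std::mutex> guard(m);
        work.push(task);
    }
    cv.notify_one();          // wake up one waiting worker
}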

We will talk more about condition variables next time.

Summary

Today, we discussed how to avoid race conditions in our programs. To do so, we add synchronization, which orders accesses by different threads such that they cannot interleave in undefined ways. Synchronization ultimately relies on atomic processor instructions, and we implement it using abstractions built on these instructions. std::atomic (for primitive types) and mutexes (for synchronizing access to more complex data structures) are examples of synchronization objects that a C++ program may use to synchronize access by multiple threads.