Lecture 23: Synchronization, Deadlock, Atomics

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, Apr 23).

Locking Granularity

Our incr-basic program just increments a single integer. When we need to synchronize access to larger amounts of data, we also need to consider how the data will be protected, which has implications for performance.

Let's look at the program incr-array.cc, which synchronizes access to a large array updated by several threads. Each array element is of type struct item, which contains a counter for how many times it has been updated, and the ID of the last thread that updated it:


struct item {
    int count;
    int last_update_by; // Thread that last increased count                                              
};

struct item all_items[ARRAY_SIZE];

void threadfunc(int tid) {
    unsigned int seed = tid;

    for (int i = 0; i != 10000000; ++i) {
        int index = rand_r(&seed) % ARRAY_SIZE; // Get random int [0, ARRAY_SIZE)                        

        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
    }
}

How should we protect access to this data? One option is to use a single mutex, like this:


struct item all_items[ARRAY_SIZE];
std::mutex m;

void threadfunc(int tid) {
    // . . .

    for (int i = 0; i != 10000000; ++i) {
        // . . .
        m.lock();
        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
        m.unlock();
    }
}

The single mutex means that no two threads can access any part of the array at the same time. This will produce correct results, in that there is no race condition, but it has a serious problem: the program cannot take advantage of parallelism, since each thread is forced to perform its operations on the array sequentially! Looking at the data in the array, this restriction is not necessary: each item struct is an independent piece of data, so locking down the whole array in order to modify just one item is too coarse. We call this kind of locking strategy coarse-grained locking, in that we lock significantly more data than is necessary to synchronize the shared data.

For another example, imagine our array held data for a huge number of users in some database. We would like our database to be able to operate on data for multiple users at the same time, but with only one mutex for the whole dataset, this would not be possible. Instead, we need a solution with a finer locking granularity such that we only lock a smaller subset of the data.

To see one way we can improve this, let's consider another locking strategy where we add one mutex per item, like this:


struct item {
    std::mutex m;
    int count;
    int last_update_by; // Thread that last increased count                                              
};

struct item all_items[ARRAY_SIZE];

void threadfunc(int tid) {
    // . . .

    for (int i = 0; i != 10000000; ++i) {
        // . . .
        all_items[index].m.lock();
        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
        all_items[index].m.unlock();
    }
}

We call this strategy fine-grained locking, in that the smallest unit of data that we can synchronize (in this case, one item struct) has its own mutex. This means that multiple threads can operate on the array in parallel, so long as they're operating on different items! This gives much-improved performance vs. coarse-grained locking, but at a cost: each item struct now has its own mutex, which takes up space in the struct! While an individual mutex is small (one integer for the simplest mutexes; around 40 bytes for a C++ std::mutex on Linux), for large amounts of data (e.g., a database with many thousands of users), this extra cost can be very significant.
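
To get a feel for this storage overhead, a quick check of the sizes involved might look like the following sketch (the exact numbers depend on the platform and standard library, so treat the 40-byte figure as an assumption for Linux/x86-64):

#include <iostream>
#include <mutex>

struct item {
    std::mutex m;
    int count;
    int last_update_by;
};

int main() {
    // On Linux/x86-64 with libstdc++, sizeof(std::mutex) is typically 40,
    // so the mutex dominates the size of each item struct.
    std::cout << "sizeof(std::mutex) = " << sizeof(std::mutex) << "\n";
    std::cout << "sizeof(item)       = " << sizeof(item) << "\n";
}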

Which locking strategy is ideal? Making this decision involves a tradeoff between the amount of parallelism we want to support and the storage cost of the synchronization objects (e.g., mutexes). For a given application, the best solution may lie somewhere between the two examples we have seen so far. For example, consider a scenario where we have one mutex for every 100 elements in the array (e.g., one mutex for indices 0..99, another for indices 100..199, and so on). This allows up to one thread per 100-element chunk to update the array at the same time (which is better than coarse-grained locking), while consuming far less storage than one mutex per element with fine-grained locking. A sketch of this strategy appears below.
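
One possible sketch of this middle-ground strategy (sometimes called lock striping) is below. It reuses ARRAY_SIZE, struct item, and all_items from the snippets above; the stripe size of 100 and the helper name mutex_for() are illustrative choices, not part of the lecture code:

// One mutex per group ("stripe") of 100 consecutive items.
std::mutex stripes[(ARRAY_SIZE + 99) / 100];

// Hypothetical helper: returns the mutex protecting all_items[index].
std::mutex& mutex_for(int index) {
    return stripes[index / 100];
}

void threadfunc(int tid) {
    unsigned int seed = tid;

    for (int i = 0; i != 10000000; ++i) {
        int index = rand_r(&seed) % ARRAY_SIZE;

        mutex_for(index).lock();
        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
        mutex_for(index).unlock();
    }
}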

In practice, application developers and performance engineers would need to make this decision based on the requirements of the application (how many users, how many operations should happen concurrently, how much storage overhead can be tolerated, etc.).

Deadlock

Locking is difficult not only because we need to make sure that we include all accesses to shared state in our critical sections, but also because it's possible to get it wrong in ways that cause our programs to get stuck indefinitely.

To explain this, let's consider a program with two mutexes, M1 and M2. An operation in the program (such as transferring money in Vunmo!) requires a thread to lock both mutexes M1 and M2 successfully to proceed.

If the program contains two threads, T1 and T2, which both lock the mutexes in different order, a situation like the following can arise:

  1. T1 tries to lock M1, and succeeds.
  2. T2 concurrently tries to lock M2, and succeeds.
  3. T1 now wants to lock M2, but cannot, since T2 has M2 locked. So, T1 suspends itself ("blocks") until M2 becomes available.
  4. T2 meanwhile wants to lock M1, but cannot, since T1 has M1 locked. So, T2 also suspends itself until M1 becomes available.
  5. Neither thread will be able to proceed, and thus neither thread will ever unlock the mutex it already has locked.

Therefore, no thread can ever make progress again, and the program becomes deadlocked.

The way to avoid deadlock situations (which can occur whenever an operation requires multiple mutexes to be locked) is to take locks in a consistent order across threads. For example, you might have a rule that threads will always lock the lower-numbered mutex first. This ensures that the situation described above cannot occur, as T2 would have attempted (and failed) to lock M1 before trying to lock M2.
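
As a sketch (using an illustrative transfer() function, not code from the lecture), a consistent-order rule for the two-mutex case could look like this; C++17's std::scoped_lock offers an alternative that acquires multiple mutexes without deadlocking:

std::mutex m1;   // "M1"
std::mutex m2;   // "M2"

// Hypothetical operation that needs both mutexes (like a Vunmo transfer).
void transfer() {
    // Rule: every thread locks m1 before m2. With a consistent order,
    // the circular wait described above cannot arise.
    m1.lock();
    m2.lock();
    // ... update the shared state protected by both mutexes ...
    m2.unlock();
    m1.unlock();
}

// Alternative: std::scoped_lock locks both mutexes together and releases
// them when it goes out of scope (C++17).
void transfer_scoped() {
    std::scoped_lock guard(m1, m2);
    // ... update the shared state protected by both mutexes ...
}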

Paying attention to deadlocks is important; it is easier than you think to accidentally write deadlocking code!

Atomics

Processors have special, atomic instructions that always execute without racing with any other processor's accesses. These instructions are "indivisible" (which is what the notion of "atomic" refers to in physics!), i.e., the processor does not internally decompose atomic instructions into smaller micro-code instructions. Atomic instructions are generally slower and more expensive (in terms of energy) to execute than normal instructions.

incr-atomic.cc implements synchronized shared-memory access using C++ atomics, which the compiler will translate into atomic instructions. The relevant code in threadfunc() is shown below:

void threadfunc(std::atomic<int>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);
        // `*x += 1` and `(*x)++` also work!
    }
}

C++'s atomics library implements atomic additions using an x86-64 atomic instruction. When we use objdump to inspect the assembly of threadfunc(), we see a lock addl ... instruction instead of just addl .... The lock prefix on the addl instruction asks the processor to hold on to the cache line with the shared variable (or in Intel terms, "lock the memory bus") until the entire addl instruction has completed.

C++ atomics and lock-prefixed instructions only work with word-sized variables that cannot span multiple cache lines. (Recall from the alignment unit that one reason for alignment is to avoid primitive types spanning cache blocks; that notion surfaces again here!)
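
One way to check whether a given type actually gets lock-free atomic instructions (rather than a hidden fallback lock inside the library) is std::atomic's lock-free query; this is an illustrative check, not part of the lecture code:

#include <atomic>

// Word-sized integers map to true atomic instructions on x86-64.
static_assert(std::atomic<int>::is_always_lock_free);
static_assert(std::atomic<long>::is_always_lock_free);

// A 64-byte struct is too large for lock-prefixed instructions; for such
// types, the library falls back to an internal lock.
struct big { char bytes[64]; };
static_assert(!std::atomic<big>::is_always_lock_free);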

Synchronization Objects

Synchronized updates to primitive types (like integers) can be achieved using the C++ std::atomic template library, which makes the compiler emit a lock-prefixed instruction (such as lock addl) for the update, so that it behaves atomically.

C++'s std::atomic library is powerful (and also great progress in standardization of atomic operations), but it works only on integers that are word-sized or smaller. To synchronize more complex objects in memory or perform more complex synchronization tasks, we need abstractions called synchronization objects.

Synchronization objects are types whose methods can be used to achieve synchronization and atomicity on normal (non-std::atomic-wrapped) objects. Synchronization objects provide various abstract properties that simplify programming with multi-threaded access to shared data. The most common synchronization object is the "mutex", which provides the mutual exclusion property.

Implementing a mutex with a single bit and cmpxchg

Extra material: not examinable in 2024.

Internally, a mutex can be implemented using an atomic counter, or indeed using a single atomic bit! Using a single bit requires some special atomic machine instructions.

A busy-waiting mutex (also called a spin lock) can be implemented as follows:

struct mutex {
    std::atomic<int> spinlock;   // 0 means unlocked, 1 means locked

    void lock() {
        // Atomically store 1 and fetch the old value; keep trying
        // until the old value was 0, i.e., we acquired the lock.
        while (spinlock.exchange(1) == 1) {}
    }

    void unlock() {
        spinlock.store(0);
    }
};

The spinlock.exchange() method performs an atomic swap: in one atomic step, it stores the specified value into the atomic spinlock variable and returns the variable's old value.

It works because lock() will not return unless spinlock previously contained the value 0 (which means unlocked). In that case, it atomically stores the value 1 (which means locked) into spinlock, which prevents other lock() calls from returning and hence ensures mutual exclusion. While it spin-waits, it simply swaps spinlock's old value of 1 with 1, effectively leaving the lock untouched. Please take a moment to appreciate how this simple construct correctly implements mutual exclusion.

x86-64 provides this atomic swap functionality via the lock xchg assembly instruction, and we have just shown that it is all we need to implement a mutex with a single bit. x86-64 also provides a more powerful atomic instruction that further expands what can be done atomically on modern processors: the compare-exchange, or lock cmpxchg. It is powerful enough to implement atomic swap, add, subtract, multiplication, square root, and many other operations you can think of.

The behavior of the instruction is defined as follows:

// Everything in one atomic step
int compare_exchange(int* object, int expected, int desired) {
    if (*object == expected) {
        *object = desired;
        return expected;
    } else {
        return *object;
    }
}

This instruction is also accessible as the compare_exchange_strong() member function of C++ std::atomic objects. Instead of returning the old value, the C++ version takes expected by reference (updating it with the value actually found on failure) and returns a boolean indicating whether the exchange was successful.
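
As a sketch of why compare-exchange is so powerful, here is one way to build an atomic multiply out of it (atomic_multiply is an illustrative name, not a standard function):

// Hypothetical helper: atomically multiply *x by factor using a
// compare-exchange retry loop.
void atomic_multiply(std::atomic<int>* x, int factor) {
    int expected = x->load();
    // If another thread changed *x between our load and the
    // compare-exchange, the exchange fails, `expected` is refreshed with
    // the value actually found, and we simply retry.
    while (!x->compare_exchange_strong(expected, expected * factor)) {
    }
}

The same retry-loop pattern works for any read-modify-write operation, which is why compare-exchange can implement all of the operations listed above.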

Summary

For performance, we prefer to make our critical sections as small as possible: the more code in a critical section, the less parallelism we get, since only one thread can execute inside a critical section at a time. But we must make the critical section large enough to keep the program correct: it needs to contain all accesses to shared state within the operation we're trying to synchronize. Moreover, if we make the synchronization granularity too fine, the number of synchronization objects required becomes huge, which imposes a high memory overhead.

We also saw an example of deadlock, which occurs when a thread blocks on acquiring a lock that is already held and will never be released. Deadlock can occur when two threads each try to take a lock on a resource that the other has already locked. With C++ standard library mutexes, deadlock can also happen in other situations, such as when a thread tries to take a lock it already holds.

Synchronization relies, ultimately, on atomic processor instructions, and we implement it using abstractions built on top of these instructions. std::atomic (for primitive types) and mutexes (for synchronizing access to more complex data structures) are examples of synchronization objects that a C++ program may use to synchronize access by multiple threads.