Lecture 23: Synchronization, Deadlock, Atomics
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, Apr 23).
Locking Granularity
Our incr-basic program just increments a single integer. When we need to synchronize access to larger amounts of data, we also need to consider how the data will be protected, which has implications for performance.
Let's look at the program incr-array.cc, which synchronizes access to a large array updated by several threads. Each array element is of type struct item, which contains a counter for how many times it has been updated, and the ID of the last thread that updated it:
struct item {
    int count;
    int last_update_by; // Thread that last increased count
};
struct item all_items[ARRAY_SIZE];
void threadfunc(int tid) {
    unsigned int seed = tid;
    for (int i = 0; i != 10000000; ++i) {
        int index = rand_r(&seed) % ARRAY_SIZE; // Get random int [0, ARRAY_SIZE)
        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
    }
}
How should we protect access to this data? One option is to use a single mutex, like this:
struct item all_items[ARRAY_SIZE];
std::mutex m;
void threadfunc(int tid) {
    // . . .
    for (int i = 0; i != 10000000; ++i) {
        // . . .
        m.lock();
        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
        m.unlock();
    }
}
The single mutex means that no two threads can access any part of the array at the same time. This will produce correct results, in that there is no race condition, but it has a serious problem: this program will not take advantage of parallelism, since the threads are forced to perform their operations on the array one at a time! Looking at the data in the array, this is not required: each item struct is an independent piece of data, so locking down the whole array in order to modify just one item is too coarse. We call this kind of locking strategy coarse-grained locking, in that we lock significantly more data than is necessary to synchronize the shared data.
For another example, imagine our array held data for a huge number of users in some database. We would like our database to be able to operate on data for multiple users at the same time, but with only one mutex for the whole dataset, this would not be possible. Instead, we need a solution with a finer locking granularity such that we only lock a smaller subset of the data.
To see one way we can improve this, let's consider another locking strategy where we add one mutex per item, like this:
struct item {
    std::mutex m;
    int count;
    int last_update_by; // Thread that last increased count
};
struct item all_items[ARRAY_SIZE];
void threadfunc(int tid) {
    // . . .
    for (int i = 0; i != 10000000; ++i) {
        // . . .
        all_items[index].m.lock();
        all_items[index].count += 1;
        all_items[index].last_update_by = tid;
        all_items[index].m.unlock();
    }
}
We call this strategy fine-grained locking in that the smallest unit of data that we can synchronize (in this case, one item struct) has its own mutex. This means that multiple threads can operate on the array in parallel, so long as they're operating on different items! This will have much-improved performance vs. coarse-grained locking, but at a cost: each item struct now has its own mutex, which takes up space in the struct!
While the size of an individual mutex is small (one integer for the simplest mutexes, typically 40 bytes for a C++ std::mutex on x86-64 Linux), for large amounts of data (e.g., a database with many thousands of users), this extra cost can be very significant.
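The per-item cost depends on the platform and standard library, so it can be worth measuring directly. The sketch below is an illustration only: the struct names plain_item and locked_item and both helper functions are made up for this example and are not part of incr-array.cc.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical item layout without a mutex (as in the coarse-grained version).
struct plain_item {
    int count;
    int last_update_by;
};

// Hypothetical item layout with one mutex per item (fine-grained version).
struct locked_item {
    std::mutex m;
    int count;
    int last_update_by;
};

// Extra bytes that each item pays for carrying its own mutex.
constexpr std::size_t mutex_overhead_bytes() {
    return sizeof(locked_item) - sizeof(plain_item);
}

// Extra bytes for an array of n items under fine-grained locking.
constexpr std::size_t total_overhead_bytes(std::size_t n) {
    return n * mutex_overhead_bytes();
}
```

Printing mutex_overhead_bytes() and total_overhead_bytes(ARRAY_SIZE) from main() shows how much memory fine-grained locking adds for a given array size on your machine.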
Which locking strategy is ideal? Making this decision involves a tradeoff between the amount of parallelism we want to support and the storage cost for the synchronization objects (e.g., mutexes). For a given application, the best solution may be somewhere in the middle of the two examples we have seen so far. For example, consider a scenario where we have one mutex for every 100 elements in the array (e.g., one mutex for indices 0..99, another for indices 100..199, and so on). This would allow threads to update the array in parallel as long as they work on different 100-element blocks (which is better than coarse-grained locking), while consuming less storage than the one mutex per element of fine-grained locking.
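The middle-ground strategy of one mutex per 100 elements might look like the following sketch. It is not taken from the lecture code: the sizes, the names ITEMS_PER_LOCK, stripe_locks, update_item() and run_striped(), and the deterministic index sequence (used here instead of rand_r() so the result is reproducible) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sizes, chosen only for this sketch.
constexpr std::size_t ARRAY_SIZE = 1000;
constexpr std::size_t ITEMS_PER_LOCK = 100;
constexpr std::size_t NUM_LOCKS = (ARRAY_SIZE + ITEMS_PER_LOCK - 1) / ITEMS_PER_LOCK;

struct item {
    int count = 0;
    int last_update_by = -1;  // Thread that last increased count
};

item all_items[ARRAY_SIZE];
// stripe_locks[k] protects items k*100 .. k*100+99.
std::mutex stripe_locks[NUM_LOCKS];

// Update one item while holding only the mutex for its 100-element block.
void update_item(std::size_t index, int tid) {
    assert(index < ARRAY_SIZE);
    std::lock_guard<std::mutex> guard(stripe_locks[index / ITEMS_PER_LOCK]);
    all_items[index].count += 1;
    all_items[index].last_update_by = tid;
}

// Run nthreads threads that each perform updates_per_thread updates,
// then return the sum of all counts.
long run_striped(int nthreads, int updates_per_thread) {
    for (item& it : all_items) {
        it = item{};  // reset the array before the run
    }
    std::vector<std::thread> threads;
    for (int tid = 0; tid < nthreads; ++tid) {
        threads.emplace_back([tid, updates_per_thread] {
            std::size_t index = static_cast<std::size_t>(tid) * 37;
            for (int i = 0; i < updates_per_thread; ++i) {
                update_item(index % ARRAY_SIZE, tid);
                index += 7919;  // step through the array in a fixed pattern
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    long total = 0;
    for (const item& it : all_items) {
        total += it.count;
    }
    return total;
}
```

With these sizes the array needs only 10 mutexes instead of 1000, and two threads block each other only when their indices fall in the same 100-element block. Changing ITEMS_PER_LOCK moves the design toward coarse-grained locking (larger values) or fine-grained locking (a value of 1).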
In practice, application developers and performance engineers would need to make this decision based on the requirements of the application (how many users, how many operations should happen concurrently, how much storage overhead can be tolerated, etc.).
Deadlock
Locking is difficult not only because we need to make sure that we include all accesses to shared state in our critical sections, but also because it's possible to get it wrong in ways that cause our programs to get stuck indefinitely.
To explain this, let's consider a program with two mutexes, M1 and M2. An operation in the program (such as transferring money in Vunmo!) requires a thread to lock both mutexes M1 and M2 successfully to proceed.
If the program contains two threads, T1 and T2, which lock the mutexes in different orders (T1 locks M1 first, while T2 locks M2 first), a situation like the following can arise:
- T1 tries to lock M1, and succeeds.
- T2 concurrently tries to lock M2, and succeeds.
- T1 now wants to lock M2, but cannot, since T2 has M2 locked. So, T1 suspends itself ("blocks") until M2 becomes available.
- T2 meanwhile wants to lock M1, but cannot, since T1 has M1 locked. So, T2 also suspends itself until M1 becomes available.
- Neither thread will be able to proceed, and thus neither thread will ever unlock the mutex it already has locked.
Therefore, no thread can ever make progress again, and the program becomes deadlocked.
The way to avoid deadlock situations (which can occur whenever an operation requires multiple mutexes to be locked) is to take locks in a consistent order across threads. For example, you might have a rule that threads will always lock the lower-numbered mutex first. This ensures that the situation described above cannot occur, as T2 would have attempted (and failed) to lock M1 before trying to lock M2.
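A consistent lock order can be sketched as follows. This is an illustration under assumptions, not the course's Vunmo code: the account struct, its id field (used to define the order), and the functions transfer() and run_transfers() are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical account type for this sketch.
struct account {
    int id;        // unique ID; defines the global lock order
    long balance;
    std::mutex m;  // protects balance

    account(int id_, long balance_) : id(id_), balance(balance_) {}
};

// Move amount from one account to another, always locking the
// lower-ID account first, no matter the direction of the transfer.
void transfer(account& from, account& to, long amount) {
    assert(from.id != to.id);  // locking one std::mutex twice would self-deadlock
    account& first = (from.id < to.id) ? from : to;
    account& second = (from.id < to.id) ? to : from;
    std::lock_guard<std::mutex> g1(first.m);
    std::lock_guard<std::mutex> g2(second.m);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions at the same time.
// Returns the combined balance after both threads finish.
long run_transfers(int rounds) {
    account a(1, 1000);
    account b(2, 1000);
    std::thread t1([&] {
        for (int i = 0; i < rounds; ++i) {
            transfer(a, b, 1);
        }
    });
    std::thread t2([&] {
        for (int i = 0; i < rounds; ++i) {
            transfer(b, a, 1);
        }
    });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Because both threads lock account 1 before account 2, neither can hold one mutex while waiting for the other in the opposite order. If transfer() instead locked from.m first and to.m second, the two threads in run_transfers() could deadlock exactly as in the T1/T2 scenario. The C++ standard library also offers std::lock() and, since C++17, std::scoped_lock, which acquire several mutexes together using a deadlock-avoidance algorithm.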
Paying attention to deadlocks is important; it is easier than you think to accidentally write deadlocking code!
Atomics
Processors have special, atomic instructions that always execute without racing with any other processor's accesses. These instructions are "indivisible" (which is what the notion of "atomic" refers to in physics!), i.e., no other processor can ever observe an atomic instruction half-completed. Atomic instructions are generally slower and more expensive (in terms of energy) to execute than normal instructions.
incr-atomic.cc implements synchronized shared-memory access using C++ atomics, which the compiler will translate into atomic instructions. The relevant code in threadfunc() is shown below:
void threadfunc(std::atomic<int>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);
        // `*x += 1` and `(*x)++` also work!
    }
}
C++'s atomics library implements atomic additions using an x86-64 atomic instruction. When we use objdump to inspect the assembly of threadfunc(), we see a lock addl ... instruction instead of just addl ... . The lock prefix of the addl instruction asks the processor to hold on to the cache line with the shared variable (or in Intel terms, "lock the memory bus") until the entire addl instruction has completed.
C++ atomics and lock-prefixed instructions only work with word-sized variables that cannot span multiple cache lines. (Recall from the alignment unit that one reason for alignment is to avoid primitive types spanning cache blocks; here this notion surfaces again!)
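To check that atomic increments never lose an update, a small harness like the one below can be used. It is a sketch rather than the actual incr-atomic.cc: the counter type std::atomic<int>, the extra increments parameter on threadfunc(), and the helper run_atomic_incr() are assumptions made for this example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Variant of threadfunc() with the iteration count passed in as a parameter.
void threadfunc(std::atomic<int>* x, int increments) {
    for (int i = 0; i != increments; ++i) {
        x->fetch_add(1);  // one indivisible read-modify-write
    }
}

// Start nthreads threads that all increment the same counter,
// wait for them to finish, and return the final value.
int run_atomic_incr(int nthreads, int increments) {
    std::atomic<int> n{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back(threadfunc, &n, increments);
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return n.load();
}
```

With a plain int counter and no mutex, the same harness would contain a data race and would usually return less than nthreads * increments; with std::atomic<int>, the result is always exactly that product.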
Synchronization Objects
Synchronized updates for primitive types (like integers) can be achieved by using the C++ std::atomic template library, which automatically translates the addl instruction used to perform the update into a lock-prefixed instruction that behaves atomically.
C++'s std::atomic library is powerful (and also great progress in standardization of atomic operations), but it works only on integers that are word-sized or smaller. To synchronize more complex objects in memory or perform more complex synchronization tasks, we need abstractions called synchronization objects.
Synchronization objects are types whose methods can be used to achieve synchronization and atomicity on normal (non-std::atomic-wrapped) objects. Synchronization objects provide various abstract properties that simplify programming with multi-threaded access to shared data. The most common synchronization object is the "mutex", which provides the mutual exclusion property.
Implementing a mutex with a single bit and cmpxchg
Extra material: not examinable in 2024.
Internally, a mutex can be implemented using an atomic counter, or indeed using a single atomic bit! Using a single bit requires some special atomic machine instructions.
A busy-waiting mutex (also called a spin lock) can be implemented as follows:
struct mutex {
    std::atomic<int> spinlock{0}; // 0 = unlocked, 1 = locked

    void lock() {
        while (spinlock.exchange(1) == 1) {
        }
    }

    void unlock() {
        spinlock.store(0);
    }
};
The spinlock.exchange() method is an atomic swap: in one atomic step, it stores the specified value to the atomic spinlock variable and returns the old value of the variable.
It works because lock() will not return unless spinlock previously contained the value 0 (which means unlocked). In that case, it atomically stores the value 1 (which means locked) to spinlock and prevents other lock() calls from returning, hence ensuring mutual exclusion. While it spin-waits, it simply swaps spinlock's old value 1 with 1, effectively leaving the lock untouched. Please take a moment to appreciate how this simple construct correctly implements mutual exclusion.
x86-64 provides this atomic swap functionality via the lock xchg assembly instruction. We have shown that it is all we need to implement a mutex with just one bit. x86-64 also provides a more powerful atomic instruction that further expands what can be done atomically on modern processors. The instruction is called a compare-exchange, or lock cmpxchg. It is powerful enough to implement atomic swap, add, subtract, multiplication, square root, and many other operations you can think of.
The behavior of the instruction is defined as follows:
// Everything in one atomic step
int compare_exchange(int* object, int expected, int desired) {
    if (*object == expected) {
        *object = desired;
        return expected;
    } else {
        return *object;
    }
}
This instruction is also accessible as the compare_exchange_strong() member method of C++ std::atomic objects. Instead of returning the old value, it returns a boolean indicating whether the exchange was successful; on failure, it writes the value it found into its expected argument, which is passed by reference.
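As an illustration of how compare-exchange can build operations that have no dedicated atomic instruction, the sketch below implements an atomic multiply with a retry loop around compare_exchange_strong(). The function names atomic_multiply() and run_multiply() are hypothetical and chosen for this example only.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Atomically replace *x with (*x * factor).
// Returns the value *x held just before the multiplication took effect.
int atomic_multiply(std::atomic<int>* x, int factor) {
    int expected = x->load();
    // If another thread changed *x since we read it, the exchange fails
    // and compare_exchange_strong() stores the current value of *x into
    // expected. The loop then retries with a freshly computed product.
    while (!x->compare_exchange_strong(expected, expected * factor)) {
    }
    return expected;
}

// Start nthreads threads that each double a shared value starting at 1.
// Returns the final value.
int run_multiply(int nthreads) {
    std::atomic<int> value{1};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&value] { atomic_multiply(&value, 2); });
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return value.load();
}
```

The same read, compute, compare-exchange, retry pattern works for any function of the old value, which is why compare-exchange is considered more general than atomic swap or atomic add.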
Summary
For performance, we prefer to make our critical sections as small as possible: the more code in a critical section, the less parallelism we can get as only one thread can run in a critical section. But we must make the critical section large enough to make the program correct: it needs to contain all accesses to shared state within the operation we're trying to synchronize. Moreover, if we make synchronization granularity too fine, the number of synchronization objects required is huge, which imposes high memory overhead.
We also saw an example of deadlock, which occurs when a thread blocks on acquiring a lock that is already held and never given up. Deadlock can occur when two threads each try to lock a resource that the other thread has already locked. With C++ standard library mutexes, deadlock can also happen in other situations, such as when a thread tries to take a lock it already holds.
Synchronization relies, ultimately, on atomic processor instructions, and we implement it using abstractions built on these atomic instructions. std::atomic (for primitive types) and mutexes (for synchronizing access to more complex data structures) are examples of synchronization objects that a C++ program may use to synchronize access by multiple threads.