Lecture 22: Race Conditions, Synchronization, and Mutexes
🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Monday, April 22).
Threads
Here is a comparison between what a parent and child process share, and what two threads within the same process share:
| Resource | Processes (parent/child) | Threads (within same process) |
|---|---|---|
| Code | shared (read-only) | shared |
| Global variables | copied to separate physical memory in child | shared |
| Heap | copied to separate physical memory in child | shared |
| Stack | copied to separate physical memory in child | not shared; each thread has a separate stack |
| File descriptors, etc. | shared | shared |
A process can contain multiple threads. All threads within the same process share the same virtual address space and file descriptor table, but each thread must have its own set of registers and its own stack. (Think about all the things that would go horribly wrong if two threads used the same stack and ran in parallel on different processors!)
Threads allow for concurrency within a single process: even if one thread is blocked waiting for some event (such as I/O from the network or from the hard disk), other threads can continue executing. On computers with multiple processors, threads also allow multiple streams of instructions to execute in parallel.
The processes we have looked at so far have all been "single-threaded", meaning that they only had one thread. Each process has, at minimum, a single "main" thread; hence, our processes had one and not zero threads. For multi-threaded processes, the kernel stores a set of registers for each thread, rather than for each process. This is necessary because the kernel needs to be able to independently run each thread on a processor and suspend it again to let other threads run.
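As a quick illustration of this sharing (a hypothetical sketch, not part of the lecture code), the program below starts two threads: the global variable g is visible to both threads at the same address, while each thread's local variable lives on that thread's own private stack and therefore has a different address:
#include <cstdio>
#include <thread>

int g = 0;                  // global: one copy, shared by all threads in the process

void worker(int id) {
    int local = id;         // local: lives on this thread's private stack
    printf("thread %d: &g = %p, &local = %p\n", id, (void*) &g, (void*) &local);
}

int main() {
    std::thread t1(worker, 1);
    std::thread t2(worker, 2);
    t1.join();
    t2.join();
}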
Race Conditions
Let's look at how our example program in incr-basic.cc
uses threads:
#include <cstdio>
#include <thread>

void threadfunc(unsigned* x) {
    // This is a correct way to increment a shared variable!
    // ... OR IS IT?!?!?!?!?!?!??!?!
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;
    }
}

int main() {
    std::thread th[4];
    unsigned n = 0;
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n);
}
In this code, we run the function threadfunc()
concurrently in four
threads. The std::thread::join() function makes the main thread block until the thread on which join() is called finishes execution. Consequently,
the final value of n
will be printed by the main thread after all four threads
finish.
In each thread, threadfunc
increments a shared variable 10 million times.
There are four threads incrementing in total and the variable starts at zero. What should
the final value of the variable be after all incrementing threads finish?
40 million seems like a reasonable answer, but by running the program
(./incr-basic.noopt
) we observe that it prints all sorts of values such as
15285711, and that the value is different every time. What's going on?
There is a race condition in the addl
instruction itself!
(The term "race" refers to the idea that two threads are "racing"
each other to access, modify, and update a value in a single memory location.)
Up until this point in the course, we have been thinking of x86 instructions as indivisible and atomic. In fact, they are not, and their lack of atomicity shows up in a multi-processor environment.
Inside the processor hardware, the addl $1, (%rdi)
is actually
implemented as three separate "micro-op" instructions:
movl (%rdi), %temp (load)
addl $1, %temp (add)
movl %temp, (%rdi) (store)
Imagine two threads executing this addl instruction at the same time (concurrently). Each thread loads the same value from (%rdi) in memory, adds one to it in its own separate temporary register, and then writes the same value back to (%rdi) in memory. The two writes to memory overwrite each other with the same value, so one increment is essentially lost.
What about the optimized version of incr-basic?
With compiler optimizations turned on, the compiler can optimize away the loop in threadfunc(). It recognizes that the loop simply increments the shared variable 10 million times, so it transforms the loop into a single addl instruction with an immediate value of 10 million.
The compiled program appears to run and produce the correct output, but if we run it in a loop (./incr-basic | uniq), it will eventually print a value like 30000000 or even 20000000. This happens because, even with the single addition of ten million, there is still a race condition between the four additions that the different threads execute.
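For reference, the single transformed instruction would look roughly like this (an illustrative sketch; the exact assembly the compiler emits may differ):
addl $10000000, (%rdi)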
This lost-update behavior is what happens when two increments run on the same variable at the x86 assembly level. In C/C++, accessing shared memory from different threads without proper synchronization is undefined behavior, unless all accesses are reads.
Race conditions
Recall that in the last lecture we showed that updating shared memory variables in concurrent threads requires synchronization, otherwise some update may get lost because of the interleaving of operations in different threads.
Here's an example, in which two threads are trying to both increment an integer x
in memory that is originally set to the value 7. Recall that each thread needs to (i)
load the value from memory; (ii) increment it; and (iii) write the incremented
value back to memory. (In the actual processor, these steps may happen in microcode instructions,
but we're showing them as x86-64 assembly instructions operating on a register %tmp
here.)
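Written out step by step (with x starting at 7, and each thread using its own %tmp register), one such schedule looks like this:
T1: movl (&x), %tmp    (T1 loads x = 7)
T2: movl (&x), %tmp    (T2 also loads x = 7, since T1 has not written yet)
T1: addl $1, %tmp      (T1's %tmp is now 8)
T1: movl %tmp, (&x)    (T1 stores x = 8)
T2: addl $1, %tmp      (T2's %tmp is now 8)
T2: movl %tmp, (&x)    (T2 stores x = 8, overwriting T1's update)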
This interleaving is bad! It loses one of the increments, and ends up with x = 8
even though x
has been incremented twice. The reason for this problem is that T2's
movl (&x), %tmp
instruction runs before T1's increment (addl
), so
the processor running T2 stores (in its cache and registers) an outdated value of x
and then increments that value. Both threads write back a value of x = 8
to memory.
To see that this is indeed a race condition, where the outcome only depends on how the executions of T1 and T2 happen relative to each other, consider the below schedule, which happens to run the operations without interleaving between the threads:
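Sketched in the same step-by-step style, with x again starting at 7:
T1: movl (&x), %tmp    (T1 loads x = 7)
T1: addl $1, %tmp      (T1's %tmp is now 8)
T1: movl %tmp, (&x)    (T1 stores x = 8)
T2: movl (&x), %tmp    (T2 loads x = 8)
T2: addl $1, %tmp      (T2's %tmp is now 9)
T2: movl %tmp, (&x)    (T2 stores x = 9)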
In this execution, we get the correct value of x = 9
(7 incremented twice)
because T1's increment completes and writes x = 8
back to memory before T2
reads the value of x
.
Having the outcome of our program depend on how its threads' executions randomly interleave is terrible, so we had better synchronize the threads to restrict executions to "good" interleavings.
This gives rise to a basic rule of synchronization: if two or more threads can concurrently access an object, and at least one of the accesses is a write, a race condition can occur and synchronization is required.
Mutexes
Mutual exclusion means that at most one thread accesses the shared data at a time.
In our multi-threaded incr-basic.cc
example from the last lecture, the code does
not work because more than one thread can access the shared variable at a time. The code would behave correctly if the mutual exclusion policy were enforced. We can use a mutex
object to enforce mutual exclusion (incr-mutex.cc
). In this example, it has the same
effect as wrapping *x
in a std::atomic
template:
std::mutex mutex;

void threadfunc(unsigned* x) {
    for (int i = 0; i != 10000000; ++i) {
        mutex.lock();
        *x += 1;
        mutex.unlock();
    }
}
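For comparison, here is a minimal sketch of the std::atomic alternative mentioned above (hypothetical code; the actual atomic version of this example may look slightly different). The *x += 1 on a std::atomic<unsigned> performs the load, add, and store as a single atomic operation, so no increments are lost:
#include <atomic>

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;   // atomic read-modify-write: no lost updates
    }
}
With this version, main() would declare n as a std::atomic<unsigned> instead of a plain unsigned.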
The mutex (a kind of lock data structure) has an internal state (denoted by state), which can be either locked or unlocked. The semantics of a mutex object are as follows:
- Upon initialization, state = unlocked.
- The mutex::lock() method waits until state becomes unlocked, and then atomically sets state = locked. Note that these two steps must complete as one atomic operation.
- The mutex::unlock() method asserts that state == locked, then sets state = unlocked.
The mutual exclusion policy is enforced in the code region between the
lock()
and unlock()
invocations. We call this region the critical
section.
A mutex ensures that only one thread can be active in a critical section at a time. In other
words, the mutex enforces a thread interleaving like the one shown in the picture below, where
T1 waits on the mutex m
while T2 runs the critical section, and then T1 runs the
critical section after it acquires the mutex once T2 releases it via m.unlock()
.
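Written out as a step-by-step schedule (an illustrative sketch of the scenario just described):
T2: m.lock()      (acquires the mutex; state = locked)
T1: m.lock()      (mutex is already locked, so T1 waits)
T2: *x += 1       (T2 runs the critical section)
T2: m.unlock()    (state = unlocked)
T1:               (T1's m.lock() call now succeeds; state = locked)
T1: *x += 1       (T1 runs the critical section)
T1: m.unlock()    (state = unlocked)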
How do atomic operations come into the mutex? Consider the implementation of
m.lock()
: there must be an operation that allows the thread calling
lock()
to set the lock's state to locked
without any chance of another
thread grabbing the lock in between. The way we ensure that is with an atomic operation: in
particular, a "compare-and-swap" operation (CMPXCHG in x86-64) atomically reads the
lock state and, if it's unlocked
, sets it to locked
.
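To make this concrete, here is a minimal sketch of a spinlock built on compare-and-swap (this is illustrative only and not how std::mutex is actually implemented; a real mutex also puts waiting threads to sleep rather than spinning):
#include <atomic>

struct spinlock {
    std::atomic<bool> state{false};   // false = unlocked, true = locked

    void lock() {
        bool expected = false;
        // Atomically: if state is unlocked (false), set it to locked (true).
        // If another thread holds the lock, the compare-and-swap fails and we retry.
        while (!state.compare_exchange_weak(expected, true)) {
            expected = false;   // compare_exchange wrote the observed value into expected
        }
    }

    void unlock() {
        state.store(false);   // release the lock
    }
};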
Summary
Today, we started to explore the complexities that concurrency within a single process (using threads) can introduce. In particular, we learned about race conditions, which occur when multiple threads access the same memory location and at least one of these accesses is a write.
Race conditions are bad: they cause undefined behavior in the program, and often lead to hard-to-find bugs and seemingly bizarre data corruption.
We then discussed how to avoid race conditions in our programs. To do so, we add synchronization, which orders accesses by different threads such that they cannot interleave in undefined ways. We learned about one synchronization object, a mutex. We'll talk more about this next time.