Lecture 22: Race Conditions, Synchronization, and Mutexes
🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Monday, April 22).
Threads
Here is a comparison between what a parent and child process share, and what two threads within the same process share:
| Resource | Processes (parent/child) | Threads (within same process) |
|---|---|---|
| Code | shared (read-only) | shared |
| Global variables | copied to separate physical memory in child | shared |
| Heap | copied to separate physical memory in child | shared |
| Stack | copied to separate physical memory in child | not shared; each thread has a separate stack |
| File descriptors, etc. | shared | shared |
A process can contain multiple threads. All threads within the same process share the same virtual address space and file descriptor table, but each thread must have its own set of registers and its own stack. (Think about all the things that would go horribly wrong if two threads used the same stack and ran in parallel on different processors!)
Threads allow for concurrency within a single process: even if one thread is blocked waiting for some event (such as I/O from the network or from the hard disk), other threads can continue executing. On computers with multiple processors, threads also allow multiple streams of instructions to execute in parallel.
The processes we have looked at so far have all been "single-threaded", meaning that they only had one thread. Each process has, at minimum, a single "main" thread; hence, our processes had one and not zero threads. For multi-threaded processes, the kernel stores a set of registers for each thread, rather than for each process. This is necessary because the kernel needs to be able to independently run each thread on a processor and suspend it again to let other threads run.
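As a quick illustration of this sharing (a hypothetical sketch, not part of the lecture code), the program below starts two threads: the global variable g is visible to both threads at the same address, while each thread's local variable lives on that thread's own private stack and therefore has a different address:
#include <cstdio>
#include <thread>

int g = 0;                  // global: one copy, shared by all threads in the process

void worker(int id) {
    int local = id;         // local: lives on this thread's private stack
    printf("thread %d: &g = %p, &local = %p\n", id, (void*) &g, (void*) &local);
}

int main() {
    std::thread t1(worker, 1);
    std::thread t2(worker, 2);
    t1.join();
    t2.join();
}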
Race Conditions
Let's look at how our example program in incr-basic.cc
uses threads:
#include <cstdio>
#include <thread>

void threadfunc(unsigned* x) {
    // This is a correct way to increment a shared variable!
    // ... OR IS IT?!?!?!?!?!?!??!?!
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;
    }
}

int main() {
    std::thread th[4];
    unsigned n = 0;
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n);
}
In this code, we run the function threadfunc()
concurrently in four
threads. The std::thread::join() function makes the main thread block until the thread on which join() is called finishes execution. Consequently,
the final value of n
will be printed by the main thread after all four threads
finish.
In each thread, threadfunc
increments a shared variable 10 million times.
There are four threads incrementing in total and the variable starts at zero. What should
the final value of the variable be after all incrementing threads finish?
40 million seems like a reasonable answer, but by running the program
(./incr-basic.noopt
) we observe that it prints all sorts of values such as
15285711, and that the value is different every time. What's going on?
There is a race condition in the addl
instruction itself!
(The term "race" refers to the idea that two threads are "racing"
each other to access, modify, and update a value in a single memory location.)
Up until this point in the course, we have been thinking of x86 instructions as indivisible and atomic. In fact, they are not, and their lack of atomicity shows up in a multi-processor environment.
Inside the processor hardware, the addl $1, (%rdi)
is actually
implemented as three separate "micro-op" instructions:
movl (%rdi), %temp (load)
addl $1, %temp (add)
movl %temp, (%rdi) (store)
Imagine two threads executing this addl instruction at the same time (concurrently). Each thread loads the same value from (%rdi) in memory, adds one to it in its own separate temporary register, and then writes the same value back to (%rdi) in memory. The two writes to memory overwrite each other with the same value, so one increment is essentially lost.
What about the optimized version of incr-basic?
With compiler optimizations turned on, the compiler can optimize away the loop in threadfunc(). It recognizes that the loop simply increments the shared variable 10 million times, so it transforms the loop into a single addl instruction with an immediate value of 10 million.
The compiled program appears to run and produce the correct output, but if we run it in a loop (./incr-basic | uniq), it will eventually print a value like 30000000 or even 20000000. This happens because, even with the single addition of ten million, there is still a race condition between the four additions that the different threads execute.
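For reference, the single transformed instruction would look roughly like this (an illustrative sketch; the exact assembly the compiler emits may differ):
addl $10000000, (%rdi)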
This lost-update behavior is what happens when two increments run on the same variable at the x86 assembly level. In C/C++, accessing shared memory from different threads without proper synchronization is undefined behavior, unless all accesses are reads.
Race conditions
Recall that in the last lecture we showed that updating shared memory variables in concurrent threads requires synchronization, otherwise some update may get lost because of the interleaving of operations in different threads.
Here's an example, in which two threads are trying to both increment an integer x
in memory that is originally set to the value 7. Recall that each thread needs to (i)
load the value from memory; (ii) increment it; and (iii) write the incremented
value back to memory. (In the actual processor, these steps may happen in microcode instructions,
but we're showing them as x86-64 assembly instructions operating on a register %tmp
here.)
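Written out step by step (with x starting at 7, and each thread using its own %tmp register), one such schedule looks like this:
T1: movl (&x), %tmp    (T1 loads x = 7)
T2: movl (&x), %tmp    (T2 also loads x = 7, since T1 has not written yet)
T1: addl $1, %tmp      (T1's %tmp is now 8)
T1: movl %tmp, (&x)    (T1 stores x = 8)
T2: addl $1, %tmp      (T2's %tmp is now 8)
T2: movl %tmp, (&x)    (T2 stores x = 8, overwriting T1's update)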
This interleaving is bad! It loses one of the increments, and ends up with x = 8
even though x
has been incremented twice. The reason for this problem is that T2's
movl (&x), %tmp
instruction runs before T1's increment (addl
), so
the processor running T2 stores (in its cache and registers) an outdated value of x
and then increments that value. Both threads write back a value of x = 8
to memory.
To see that this is indeed a race condition, where the outcome only depends on how the executions of T1 and T2 happen relative to each other, consider the below schedule, which happens to run the operations without interleaving between the threads:
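Sketched in the same step-by-step style, with x again starting at 7:
T1: movl (&x), %tmp    (T1 loads x = 7)
T1: addl $1, %tmp      (T1's %tmp is now 8)
T1: movl %tmp, (&x)    (T1 stores x = 8)
T2: movl (&x), %tmp    (T2 loads x = 8)
T2: addl $1, %tmp      (T2's %tmp is now 9)
T2: movl %tmp, (&x)    (T2 stores x = 9)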
In this execution, we get the correct value of x = 9
(7 incremented twice)
because T1's increment completes and writes x = 8
back to memory before T2
reads the value of x
.
Having the outcome of our program depend on how its threads' executions randomly interleave is terrible, so we had better synchronize the threads to restrict executions to "good" interleavings.
This gives rise to a basic rule of synchronization: if two or more threads can concurrently access an object, and at least one of the accesses is a write, a race condition can occur and synchronization is required.
Mutexes
Mutual exclusion means that at most one thread accesses the shared data at a time.
In our multi-threaded incr-basic.cc
example from the last lecture, the code does
not work because more than one thread can access the shared variable at a time. The code would behave correctly if the mutual exclusion policy were enforced. We can use a mutex
object to enforce mutual exclusion (incr-mutex.cc
). In this example, it has the same
effect as wrapping *x
in a std::atomic
template:
std::mutex mutex;

void threadfunc(unsigned* x) {
    for (int i = 0; i != 10000000; ++i) {
        mutex.lock();
        *x += 1;
        mutex.unlock();
    }
}
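For comparison, here is a minimal sketch of the std::atomic alternative mentioned above (hypothetical code; the actual atomic version of this example may look slightly different). The *x += 1 on a std::atomic<unsigned> performs the load, add, and store as a single atomic operation, so no increments are lost:
#include <atomic>

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;   // atomic read-modify-write: no lost updates
    }
}
With this version, main() would declare n as a std::atomic<unsigned> instead of a plain unsigned.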
The mutex (a kind of lock data structure) has an internal state (denoted by state), which can be either locked or unlocked. The semantics of a mutex object are as follows:
- Upon initialization, state = unlocked.
- The mutex::lock() method waits until state becomes unlocked, and then atomically sets state = locked. Note that these two steps must complete as one atomic operation.
- The mutex::unlock() method asserts that state == locked, then sets state = unlocked.
The mutual exclusion policy is enforced in the code region between the
lock()
and unlock()
invocations. We call this region the critical
section.
A mutex ensures that only one thread can be active in a critical section at a time. In other
words, the mutex enforces a thread interleaving like the one shown in the picture below, where
T1 waits on the mutex m
while T2 runs the critical section, and then T1 runs the
critical section after it acquires the mutex once T2 releases it via m.unlock()
.
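Written out as a step-by-step schedule (an illustrative sketch of the scenario just described):
T2: m.lock()      (acquires the mutex; state = locked)
T1: m.lock()      (mutex is already locked, so T1 waits)
T2: *x += 1       (T2 runs the critical section)
T2: m.unlock()    (state = unlocked)
T1:               (T1's m.lock() call now succeeds; state = locked)
T1: *x += 1       (T1 runs the critical section)
T1: m.unlock()    (state = unlocked)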
How do atomic operations come into the mutex? Consider the implementation of
m.lock()
: there must be an operation that allows the thread calling
lock()
to set the lock's state to locked
without any chance of another
thread grabbing the lock in between. The way we ensure that is with an atomic operation: in
particular, a "compare-and-swap" operation (CMPXCHG in x86-64) atomically reads the
lock state and, if it's unlocked
, sets it to locked
.
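To make this concrete, here is a minimal sketch of a spinlock built on compare-and-swap (this is illustrative only and not how std::mutex is actually implemented; a real mutex also puts waiting threads to sleep rather than spinning):
#include <atomic>

struct spinlock {
    std::atomic<bool> state{false};   // false = unlocked, true = locked

    void lock() {
        bool expected = false;
        // Atomically: if state is unlocked (false), set it to locked (true).
        // If another thread holds the lock, the compare-and-swap fails and we retry.
        while (!state.compare_exchange_weak(expected, true)) {
            expected = false;   // compare_exchange wrote the observed value into expected
        }
    }

    void unlock() {
        state.store(false);   // release the lock
    }
};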
Summary
Today, we started to explore the complexities that concurrency within a single process (using threads) can introduce. In particular, we learned about race conditions, which occur when multiple threads access the same memory location and at least one of these accesses is a write.
Race conditions are bad: they cause undefined behavior in the program, and often lead to hard-to-find bugs and seemingly bizarre data corruption.
We then discussed how to avoid race conditions in our programs. To do so, we add synchronization, which orders accesses by different threads such that they cannot interleave in undefined ways. We learned about one synchronization object, a mutex. We'll talk more about this next time.