Lecture 17: Threads and Race Conditions

» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 11:59pm Monday, April 13)

Threads

A process can contain multiple threads. Threads allow for concurrency within a single process: even if one thread is blocked waiting for some event (such as I/O from the network or from the hard disk), other threads can continue executing. On computers with multiple processors, threads also allow multiple streams of instructions to execute in parallel.

All threads within the same process share the same virtual address space and file descriptor table, but each thread has its own set of registers and stack. The processes we have looked at so far have all been "single-threaded", meaning that they only had one thread. Each process has, at minimum, a single "main" thread – hence, our processes had one and not zero threads. For multi-threaded processes, the kernel stores a set of registers for each thread, rather than for each process. This is necessary because the kernel needs to be able to independently run each thread on a processor and suspend it again to let other threads run.

Let's look at how our example program in incr-basic.cc uses threads:

#include <cstdio>
#include <thread>

void threadfunc(unsigned* x) {
    // This is a correct way to increment a shared variable!
    // ... OR IS IT?!?!?!?!?!?!??!?!
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;
    }
}

int main() {
    std::thread th[4];
    unsigned n = 0;
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n);
}

In this code, we run the function threadfunc() concurrently in four threads. The std::thread::join() function blocks the calling thread (here, the main thread) until the thread on which join() is called finishes execution. Consequently, the main thread prints the final value of n only after all four threads have finished.

In each thread, threadfunc increments a shared variable 10 million times. There are four threads incrementing in total and the variable starts at zero. What should the final value of the variable be after all incrementing threads finish?

40 million seems like a reasonable answer, but by running the program (./incr-basic.noopt) we observe that it prints all sorts of values such as 15285711, and that the value is different every time. What's going on?

There is a race condition in the addl instruction itself! (The term "race" refers to the idea that two threads are "racing" each other to access, modify, and update a value in a single memory location.) Up until this point in the course, we've been thinking of x86 instructions as indivisible and atomic. In fact, they are not, and their lack of atomicity shows up in a multi-processor environment.

Inside the processor hardware, the addl $1, (%rdi) instruction is actually implemented as three separate "micro-op" instructions:

  1. movl (%rdi), %temp (load)
  2. addl $1, %temp (add)
  3. movl %temp, (%rdi) (store)

Imagine two threads executing this addl instruction at the same time (concurrently). Each thread loads the same value of (%rdi) from memory, adds one to it in its own separate temporary register, and then writes the same value back to (%rdi) in memory. Whichever thread writes last overwrites the other thread's store with the same value, so one of the two increments is effectively lost.
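For example, with (%rdi) initially 0, the following interleaving of the micro-ops of two threads T1 and T2 loses an increment:

  1. T1: movl (%rdi), %temp (T1 loads 0)
  2. T2: movl (%rdi), %temp (T2 also loads 0)
  3. T1: addl $1, %temp (T1's temporary is now 1)
  4. T2: addl $1, %temp (T2's temporary is also 1)
  5. T1: movl %temp, (%rdi) ((%rdi) is now 1)
  6. T2: movl %temp, (%rdi) ((%rdi) is still 1, not 2!)

Both threads executed addl once, yet (%rdi) only increased by one.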

What about the optimized version of incr-basic?

With compiler optimizations turned on, the compiler can optimize away the loop in threadfunc(). It recognizes that the loop is simply incrementing the shared variable 10 million times, so it transforms the loop into a single addl instruction with immediate value 10 million.
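Conceptually, the optimized threadfunc() therefore behaves as if it had been written like this (a sketch of the optimization's effect; the compiler actually emits a single addl instruction rather than this source code):

void threadfunc(unsigned* x) {
    // the compiler folds the 10-million-iteration loop into one addition
    *x += 10000000;
}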

The compiled program appears to run and produce the correct output, but if we run it many times in a row (e.g., while true; do ./incr-basic; done | uniq), it will eventually print a value like 30000000 or even 20000000. This happens because even with the single addition of ten million, there is still a race condition between the four additions that the different threads execute.

So far, we have described the behavior of racing increments at the level of x86 assembly. In C/C++, the rules are stricter: accessing shared memory from different threads without proper synchronization is undefined behavior, unless all accesses are reads.

This gives rise to a basic rule of synchronization: if two or more threads can concurrently access an object, and at least one of the accesses is a write, a race condition can occur and synchronization is required. We will look at different synchronization mechanisms next.

There are two ways to synchronize shared memory accesses in C++. We will first describe a low-level approach that uses C++'s std::atomic library, before introducing a higher-level and more general way of performing synchronization.

Atomics

Processors provide special atomic instructions that always execute without racing with any other processor's accesses. These instructions are "indivisible" (which is what the notion of "atomic" refers to in physics!), i.e., the processor does not internally decompose them into smaller micro-ops. Atomic instructions are generally slower and more expensive (in terms of energy) to execute than normal instructions.

incr-atomic.cc implements synchronized shared-memory access using C++ atomics, which the compiler will translate into atomic instructions. The relevant code in threadfunc() is shown below:

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        x->fetch_add(1);
        // `*x += 1` and `(*x)++` also work!
    }
}

C++'s atomics library implements atomic additions using an x86-64 atomic instruction. When we use objdump to inspect the assembly of threadfunc(), we see a lock addl ... instruction instead of just addl .... The lock prefix of the addl instruction asks the processor to hold on to the cache line containing the shared variable (or, in Intel terms, to "lock the memory bus") until the entire addl instruction has completed.

C++ atomics and lock-prefixed instructions only work with word-sized variables that do not span multiple cache lines. (Recall that the alignment unit discussed that one reason for alignment is to avoid primitive types spanning cache blocks; here, this notion surfaces again!)
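You can check whether a given std::atomic specialization actually compiles down to lock-free atomic instructions by calling its is_lock_free() member function. Here is a minimal sketch (the vec3 struct is a hypothetical example of a type too large for a single atomic instruction; the exact results are platform-dependent):

#include <atomic>
#include <cstdio>

struct vec3 { double x, y, z; };  // 24 bytes: larger than a machine word

int main() {
    std::atomic<unsigned> counter{0};
    std::atomic<vec3> position{};

    // A word-sized atomic maps to lock-prefixed instructions on x86-64.
    printf("atomic<unsigned> lock-free? %d\n", (int) counter.is_lock_free());
    // A 24-byte struct cannot be updated by one instruction, so the
    // library falls back to an internal lock instead.
    printf("atomic<vec3> lock-free? %d\n", (int) position.is_lock_free());
}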

Mutexes

But what if you do need to synchronize access to a larger object, such as an array or a struct? A more general way to perform synchronized access to arbitrary data in memory is called a mutex (short for "mutual exclusion"), and it's an example of a synchronization object.
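As a preview, here is a hedged sketch of what threadfunc() from incr-basic.cc might look like with a std::mutex protecting the shared counter (we will cover this pattern in detail next time):

#include <mutex>

std::mutex mtx;  // protects the shared counter *x

void threadfunc(unsigned* x) {
    for (int i = 0; i != 10000000; ++i) {
        mtx.lock();    // at most one thread can hold the mutex at a time,
        *x += 1;       // so the increment executes without racing
        mtx.unlock();  // release the mutex so other threads can proceed
    }
}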

We will talk about mutexes and synchronization objects in more detail next time.

Summary

Today, we started to explore the complexities that concurrency within a single process (using threads) can introduce. In particular, we learned about race conditions, which occur when multiple threads access the same memory location and at least one of these accesses is a write.

Race conditions are bad: they cause undefined behavior in the program, and often lead to hard-to-find bugs and seemingly bizarre data corruption. To avoid race conditions in our programs, we add synchronization, which orders acccesses by different threads such that they cannot interleave in undefined ways. Synchronization relies, ultimately, on atomic processor instructions, and we implement it using abstractions built on these atomic instructions. std::atomic (for primitive types) and mutexes (for synchronizing access to more complex datastructures) are examples of synchronization objects that a C++ program may use to synchronize access by multiple threads.