Lecture 21: Threads and Race Conditions

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Monday, April 24).

Threads

Here is a comparison between what a parent and child process share, and what two threads within the same process share:

Resource                 Processes (parent/child)                       Threads (within same process)
Code                     shared (read-only)                             shared
Global variables         copied to separate physical memory in child   shared
Heap                     copied to separate physical memory in child   shared
Stack                    copied to separate physical memory in child   not shared; each thread has its own stack
File descriptors, etc.   shared                                         shared

A process can contain multiple threads. All threads within the same process share the same virtual address space and file descriptor table. However, each thread must have its own set of registers and stack. (Think about all the things that would go horribly wrong if two threads used the same stack and ran in parallel on different processors!) The processes we have looked at so far all had a single thread running.

Threads allow for concurrency within a single process: even if one thread is blocked waiting for some event (such as I/O from the network or from the hard disk), other threads can continue executing. On computers with multiple processors, threads also allow multiple streams of instructions to execute in parallel.

Each process has, at minimum, a single "main" thread; hence, the "single-threaded" processes we have worked with so far had one and not zero threads. For multi-threaded processes, the kernel stores a set of registers for each thread, rather than one set per process. This is necessary because the kernel needs to be able to independently run each thread on a processor and suspend it again to let other threads run.
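To see this sharing in action, here is a minimal sketch (not part of the lecture code; the worker function is made up for illustration) in which two threads print the address of a global variable and of a local variable:

#include <cstdio>
#include <thread>

int global = 0;                 // shared: one copy, visible to all threads

void worker(int id) {
    int local = id;             // not shared: lives on this thread's own stack
    printf("thread %d: &global=%p &local=%p\n",
           id, (void*) &global, (void*) &local);
}

int main() {
    std::thread t1(worker, 1);
    std::thread t2(worker, 2);
    t1.join();
    t2.join();
}

Both threads print the same address for global, but different addresses for local, because each thread's stack occupies its own region within the shared virtual address space.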

Race Conditions

Let's look at how our example program in incr-basic.cc uses threads:

#include <cstdio>
#include <thread>

void threadfunc(unsigned* x) {
    // This is a correct way to increment a shared variable!
    // ... OR IS IT?!?!?!?!?!?!??!?!
    for (int i = 0; i != 10000000; ++i) {
        *x += 1;
    }
}

int main() {
    std::thread th[4];
    unsigned n = 0;
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n);
}

In this code, we run the function threadfunc() concurrently in four threads. The std::thread::join() function makes the calling thread (here, the main thread) block until the thread on which join() is called finishes execution. Consequently, the main thread prints the final value of n only after all four threads have finished.

In each thread, threadfunc increments a shared variable 10 million times. There are four threads incrementing in total and the variable starts at zero. What should the final value of the variable be after all incrementing threads finish?

40 million seems like a reasonable answer, but by running the program (./incr-basic.noopt) we observe that it prints all sorts of values such as 15285711, and that the value is different every time. What's going on?

There is a race condition in the addl instruction itself! (The term "race" refers to the idea that two threads are "racing" each other to access, modify, and update a value in a single memory location.) Up until this point in the course, we've been thinking of x86 instructions as indivisible and atomic. In fact, they are not, and their lack of atomicity shows up in a multi-processor environment.

Inside the processor hardware, the addl $1, (%rdi) instruction is actually implemented as three separate "micro-op" instructions (%temp here stands for an internal register that is invisible to programmers):

  1. movl (%rdi), %temp (load)
  2. addl $1, %temp (add)
  3. movl %temp, (%rdi) (store)

Imagine two threads executing this addl instruction at the same time (concurrently). Each thread loads the same value of (%rdi) from memory, adds one to it in its own temporary register, and then writes the same value back to (%rdi) in memory. The two stores overwrite each other with the same value, and one increment is effectively lost.
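To make this concrete, here is one possible interleaving of the two threads' micro-ops, assuming (%rdi) initially holds 7 (an arbitrary starting value chosen for illustration):

Thread 1                             Thread 2
movl (%rdi), %temp    (reads 7)
                                     movl (%rdi), %temp    (reads 7)
addl $1, %temp        (computes 8)
                                     addl $1, %temp        (computes 8)
movl %temp, (%rdi)    (stores 8)
                                     movl %temp, (%rdi)    (stores 8)

After both threads finish, memory holds 8 rather than 9: one of the two increments has vanished.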

What about the optimized version of incr-basic?

With compiler optimizations turned on, the compiler can optimize away the loop in threadfunc(). It recognizes that the loop is simply incrementing the shared variable 10 million times, so it transforms the loop into a single addl instruction with immediate value 10 million.
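Concretely, the whole loop then compiles down to roughly the following single instruction (a sketch; the exact output depends on the compiler and its version):

addl $10000000, (%rdi)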

The compiled program appears to run and produce the correct output, but if we run it repeatedly (e.g., while true; do ./incr-basic; done | uniq), it will eventually print a value like 30000000 or even 20000000. This happens because, even though each thread now performs a single addition of ten million, there is still a race condition between the four additions that the different threads execute.

This is the behavior of racing increments on the same variable at the x86 assembly level. In C/C++, accessing shared memory from different threads without proper synchronization is undefined behavior, unless all accesses are reads.
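As a preview of synchronization, here is a sketch of one standard way to make this particular program well-defined (an illustration under the assumption that an atomic counter suffices, not code from the lecture repository): C++'s std::atomic makes each increment a single indivisible operation.

#include <atomic>
#include <cstdio>
#include <thread>

void threadfunc(std::atomic<unsigned>* x) {
    for (int i = 0; i != 10000000; ++i) {
        // fetch_add performs the load, add, and store as one indivisible
        // step, so concurrent increments cannot overwrite each other.
        x->fetch_add(1);
    }
}

int main() {
    std::thread th[4];
    std::atomic<unsigned> n{0};
    for (int i = 0; i != 4; ++i) {
        th[i] = std::thread(threadfunc, &n);
    }
    for (int i = 0; i != 4; ++i) {
        th[i].join();
    }
    printf("%u\n", n.load());   // now reliably prints 40000000
}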

Summary

Today, we started to explore the complexities that concurrency within a single process (using threads) can introduce. In particular, we learned about race conditions, which occur when multiple threads access the same memory location and at least one of these accesses is a write.

Race conditions are bad: they cause undefined behavior in the program, and often lead to hard-to-find bugs and seemingly bizarre data corruption. To avoid race conditions in our programs, we add synchronization, which orders accesses by different threads such that they cannot interleave in undefined ways. We'll talk more about synchronization next time!