Lab 7: Threads

Due March 30, 2021, at 8:00PM

Introduction

In Lab 6, you implemented a concurrent program using multiple processes. With multiple processes, each process gets its own distinct address space. By default, no data is shared, and in Lab 6, you dealt with this by using pipes to send data between processes. (Another approach is to designate a shared region of memory with the mmap system call.) However, sometimes you want concurrency within a process, with the different parallel executions truly sharing memory.

In this lab, you will gain experience creating concurrent programs using multiple threads. Using multiple threads rather than processes is beneficial as threads of the same process share an address space. The only private region of memory for threads is their stack, so any data contained in the heap or global sections are shared across all threads of the same processes, and their changes are visible to each other.

It is important to note the difference between a process and a thread. A process is the general container responsible for the execution of a program. Threads are the individual entities within a process which actually do the execution. A single process can contain multiple threads all working in its address space. Every process contains at least one thread, the “main” thread, whose execution starts in main!

Assignment

Assignment installation

First, ensure that your repository has a handout remote. Type:

$ git remote show handout

If this reports an error, run:

$ git remote add handout 
https://github.com/csci0300/cs300-s21-labs.git

Then run:

$ git pull
$ git pull handout master

This will merge our Lab 7 stencil code with your previous work. If you have any “conflicts” from Lab 6, resolve them before continuing further. Run git push to save your work back to your personal repository.

Assignment Specifications

In this lab, you will again implement a parallel reverse index search; however, this time you will do so by taking advantage of multiple threads rather than multiple processes. You will work with the same basic reverse index program as in Lab 6. The program takes in a search term and produces a list of all occurrences of that term in a corpus of documents.

We provide you with the following files:

wordindex.h: This file contains all shared functions and data structures. This includes the wordindex class for storing the results of a search, all functions involving reading files and directories, and functions to print out the results. You should not modify anything in this file, but you will definitely want to look at the functions and data structures it contains since you will have to use them.
revindex_sequential.cpp: This file contains a single-threaded, sequential implementation of a reverse index search. You should use this as a baseline for the performance of your reverse index, and you may want to look at the code to get an idea of how the reverse index should work. However, you should not modify anything in this file.
revindex_threads.cpp: This file contains a multi-threaded implementation of a reverse index search. This program should take advantage of concurrency by creating multiple threads such that the reverse index of a search term in multiple files can be calculated at once. Each file should be searched by its own thread. You only need to modify this file to complete the lab.

Note: this lab does not require you to use synchronization objects or atomics.

Why?

If you implement the lab correctly, each thread works on a different file and shares no state with other threads.

We also give you some input files that are helpful for testing.

The books directory containing a series of .txt files of classic novels
input.txt: This file contains a series of search terms separated by a newline. It can be used for testing purposes to execute multiple searches by running ./revindex_threads books < input.txt
WORD.txt: all files called WORD.txt (besides input.txt) contain the expected results of a reverse index of the term WORD in the books directory. These files are meant for testing/debugging purposes.

Task:
Implement the run_workers function in revindex_threads.cpp to spawn multiple threads to execute find_word. More details of the implementation of run_workers can be found in the stencil comments.

Hints

It is important to not start the joining threads until after every thread has been created. If you were to create one thread and then wait to join with it before creating another thread, you will only ever be running one thread at a time. This is not actually concurrent programming!
Additionally, it is important that the wordindex pointer used as an argument to find_word is a pointer to a wordindex contained in the files vector. In c++, elements contained in a vector are allocated in the heap. It is necessary that your wordindex objects are contained in the heap so that they are visible across threads.

You may find it helpful to view the Relevant Library Functions portion of this handout when implementing your run_workers function.

Running and Testing:

Compile:

The whole project with make all
Just the provided sequential implementation with make revindex_sequential
Just the parallel implementation with make revindex_threads

Run:

The books directory is provided for you, but feel free to create your own directories containing fewer or shorter files for preliminary testing purposes.

The sequential implementation with ./revindex_sequential <directory>
The parallel implementation with ./revindex_threads <directory>

Test:

We have provided you with the expected output for some search terms over the books directory (all files titled WORD.txt), so that you can compare your results. You can also compare your implementation with the sequential implementation via make check. The results should be the same.

Functionality:

$ make check
./revindex_threads books < input.txt > output_threads.txt
./revindex_sequential books < input.txt > output_sequential.txt
diff output_parallel.txt output_sequential.txt
... You should receive no output/differences ...

Runtime: Additionally, once you have a working reverse index using parallel processes, you should find a performance improvement over the sequential implementation. To verify this, run:

$ time ./revindex_parallel books < input.txt
$ time ./revindex_sequential books < input.txt
...  the sequential implementation should be slower than the parallel ...

You can also compare the performance of your multithreaded reverse index search with the multi-process implementation from Lab 6. Which one do you expect to be faster?

Relevant Library Functions

Below you will find a list of functions you may find helpful in completing this lab. Feel free to also reference the man pages and C++ references for more details.

std::thread

std::thread(func, args...)
- The std::thread constructor spawns a new thread that beings running the function, func, with the specificed arguments
- Arguments:
  - A function to be executed by the new thread.
  - Any number of arguments as needed by the specified function
- Return value: a thread handle that you can use to join the thread (see join()).
- Below is an example of standard std::thread usage:
```
void some_func(int x, int y){
    …
}

void create_threads(){
    // create and run the thread
    std::thread t = thread(some_func, 1, 2);
    
    // wait for it to complete
    t.join();
}
```

std::thread::join

join()
- The std::thread join function waits for the execution of a thread to complete. Calling t.join() will halt the calling thread until thread t, has completed. (join will return immediately if the thread has already completed.)

Debugging Resources/ GDB

Switching between threads:

When your program is stopped in GDB, you can gain information on the current state of your threads running the command info threads (you can stop your program in GDB either through a breakpoint or by typing Ctrl-Z). The info threads command will provide you with a numbered list of every currently running thread. From there, you can use the command thread [number] (where [number] is the number listed by the thread) to inspect the current state of that thread and follow its execution.

Executing commands across multiple threads:

In GDB, you can use the command thread apply all <command> to apply a given command to all threads. For example, you can use thread apply all bt to get a the stack trace of every running thread.

Handin instructions

Turn in your code by pushing your git repository to github.com/csci300/cs300-s20-labs-YOURUSERNAME.git.

Then, head to the grading server. On the “Labs” page, use the “Lab 7 checkoff” button to check off your lab.