Lab 5: Processes

Due Thursday, April 18th at 8:00 PM EST

Assignment

Caution: This lab (and particularly part II) has historically been the most time-consuming lab for students, therefore we highly recommend starting and seeking help early!

Assignment Installation

First, ensure that your repository has a handout remote. Type:

$ git remote show handout

If this reports an error, run:

$ git remote add handout https://github.com/csci0300/cs300-s24-labs.git

Then run:

$ git pull
$ git pull handout main

This will merge our Lab 5 stencil code with your previous work. If you have any "conflicts" from Lab 4, resolve them before continuing further. Run git push to save your work back to your personal repository.

C++ Tips

In both parts of this assignment, you will be using some C++ that might differ slightly from what has been previously used in C.

Arrays: To allocate memory for C++ arrays on the heap, you should use the new keyword. To free that memory, you would use the delete keyword. This is shown in the next section, which describes the function of the new and delete keyword.

New/Delete: C++ lets you allocate heap memory via the new operator, which takes the name of a class (or struct) to instantiate an object in heap memory. For example, to make a heap-allocated array of ints, you would write:

dataType* heapArray = new dataType[size];
delete[] heapArray;

The new keyword is essentially the C++ equivalent of mallocing memory and instantiating a object.

When you no longer need memory created with new, you use delete to destroy it. In this specific case, delete[] is used to delete dynamically allocated arrays.

Vectors: The vector data type from the standard C++ library stores elements of similar data types in an array-like format. However, unlike the array, the vector data type can grow dynamically. You can declare an empty vector variable and add elements as shown:

vector<int> heap_vec;
heap_vec.push_back(1);
heap_vec.push_back(2);
heap_vec.push_back(3);

Assignment Part I: Simple Shell

In the first part of this lab, you will write the process launching code for your own shell, called 300sh. 300sh is similar to the shell you type your commands into (a program called bash) while running your Docker container, but it's a lot simpler.

Introduction

Your shell needs to be able to start new processes via system calls into the OS kernel. For this purpose, you will use the fork and execv system calls discussed in lectures.

What's the deal with execv? Thus far, you have primarily written programs which execute a series of instructions, often organized into functions. However, sometimes you'd like to write programs that have the ability to execute other programs. This can be done on Linux with the exec system call. When one program executes another using execv, the current process image is completely replaced with that of the new program. The process image comprises of the virtual address space of the current process, including the stack, the heap, the global variables (.data) and the code (.text) sections. All of these are replaced by the corresponding sections in the new program. If execv is successful, the original program will never be returned to as that process has been completely overwritten!

Here’s an example usage of execv.

Try running this program in your Docker container with the file name example.cpp.

















#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    char program_name[10] = "/bin/echo"; 
    const char* arg1 = "hello"; 
    const char* arg2 = "world";
    const char* newargv[] = { program_name, arg1 , arg2, NULL };
    // this will replace the process image with the image of
    // executable `/bin/echo`and arguments newargv
    execv("/bin/echo", (char* const*)newargv);
    
    // should only get past here if execv() returns an error
    perror("execv failed");
    exit(EXIT_FAILURE);
}

Running this program will produce this output:

$ gcc -o example example.cpp
$ ./example
hello world

Try changing the first argument of execv on line 12 to an invalid program name (ex: /bin/hello) and see what happens.

What's the deal with fork? What if you want to execute a new program while still being able to return to the original, rather than overwriting it with the new program? This is where the ability to create multiple processes becomes useful. The fork system call will create a new process (the "child process") whose address space is a copy of its parent process’s address space. However, while they map the same virtual addresses, these two processes run in separate address spaces, so changes to one do not affect the other!

Here’s an example usage of fork.

Try running this program in your Docker container or repl.it with the file name example.cpp.

















#include <stdio.h>
#include <unistd.h>

int main() {
  printf("I am in the parent process.\n");
  int x = 5; 
  pid_t child; 
  // fork returns 0 to the child process 
  // and the child's process id (pid_t) to the parent process
  if ((child = fork()) == 0) {
    // only the child process can reach within this if statement //
    x = x + 1; 
    printf("I am in the child process: x is %d.\n", x);
  }
  printf("I am outside the if statement.\n");
  printf("Child is %d and x is %d.\n", child, x);
}

Running this program will produce an output similar to this:

$ gcc -o example example.cpp
$ ./example
I am in the parent process.
I am outside the if statement.
Child is 1753 and x is 5.
I am in the child process: x is 6.
I am outside the if statement.
Child is 0 and x is 6.

Note: Try running the program in your Docker container. (In online environments like repl.it, you may not see all the output.)
Since the child process is a copy of the parent's address space, it will perform the same instructions as its parent starting on line 10, and have a copy of its parent's stack, heap, and data (e.g., it has access to a copy of variable x). fork will return a different value depending on whether it's in the child or parent process – allowing us to execute different instructions in the child and parent process (e.g., the parent process can't execute instructions within the if statement, but the child can). Currently, the child can also execute instructions after the if statement. Let's change that!

Optional Task 1: Immediately after line 13, add the line exit(0);, indicating to exit from the child process with return code 0. Also, include #include <stdlib.h> as a header.

Optional Task 2: Immediately after line 14, declare a local integer called status and then add the line waitpid(child, &status, 0);, telling the parent to wait on the child to terminate before it continues executing instructions. Also, include #include <sys/wait.h> as a header.

Assignment Specifications

For this part of the lab, you will be completing the provided implementation of a simple shell by adding functionality for executing programs. You are provided with the file sh.cpp and its corresponding makefile. Currently, sh.cpp contains a read-evaluate-print loop (REPL) which reads user input, and parses it into individual tokens. If you run ./300sh after make, you'll see that the shell currently doesn't do anything, except for the builtin commands exit and cd.

$ ./300sh 
300sh> ls
300sh> yo
300sh> hello
300sh> /bin/ls
300sh> cd /bin
Current directory is now /bin
300sh> exit

Your task is to extend this functionality to be able to execute any arbitrary program (or fail if the program does not exist). This should all be done within the do_fork function in the stencil.

When a process, in this case the shell, executes another program, the current process is completely replaced by the program it executes. In order for your shell to continue execution after running another program, it needs to create a new child process in which the new program can run as to not replace the shell’s execution. As such, if you create a new process (using fork) in which to execute a different program (using execv), the original program will remain preserved in the parent process and its execution can continue rather than be replaced.

Task:
Implement the do_fork(char *args[]) in sh.cpp. This function should:

Fork into a new child process.
The child process should execute the desired command (the first element in args), with the desired arguments (the following elements in args).
The parent process should wait on the child process to terminate before executing the next command.

The argument of do_fork, args, is such that:

the first argument is the full path to the command to be executed
the following elements are any arguments to that command
the last argument is NULL or nullptr (meaning that args is null terminated)

For example, if args were the array: {"/bin/echo", "hello", NULL}, do_fork would create a child process to execute the program /bin/echo with argument hello.

Understanding the arguments of execv.

The execv system call takes as arguments:

the full path to the desired command
an array of strings in which the first element is the program name (this could just be the full path) and the following elements are arguments to the command.

(Make sure to handle the case where execv fails!)

Check out the example usages of fork and execv above for more help.

When implementing this part of the lab you may find the fork, execv, and wait or waitpid system calls useful. Please read the Relevant System Calls section for more details.

Running and Testing

Compile and Run:

Compile with the make
Run with ./300sh

When your shell is working, it should exhibit the same basic functionality as a regular bash terminal when running processes in the foreground. Just note that when executing programs in your shell, you will need to use the full pathname. For example, instead of typing ls as you usually would, you will have to use /bin/ls. (This is the case because your shell does not automatically expand paths for binaries in standard directories like /bin.)

If you do not know the full pathname of a program you wish to execute in your shell, you can use the which <program> command in a standard bash terminal. For example, try the following:

$ which cat
/bin/cat
$ ./300sh
300sh> /bin/ls
300sh  Makefile  builtin.h  builtin.o  sh.cpp  sh.o
300sh> /bin/cat builtin.h
... prints the contents of builtin.h ...

To get a sense of whether or not your shell is working properly, you can compare the output of your shell to a standard bash terminal for some simple commands such as the following:

cat <file> (or /bin/cat <file>)
echo <args> (or /bin/echo <args>)
cal (or /usr/bin/cal)
ls (or /bin/ls)

Note that 300sh supports passing arguments to programs, but does not support output redirection (e.g., /bin/ls > /tmp/outfile).

Now head to the grading server and run the checkoff for Lab 5.

You should see the "shell" test pass. Don't worry about others failing or partial credit; if you pass the "shell" test, you've done what you need to do for Part I of the lab.

Assignment Part II: Reverse Index

Introduction

One common paradigm of programming, perhaps the one with which you are most familiar, is sequential programming. Sequential programs execute instructions one at a time in a specific order. With the aid of multiple processes, however, we can also have concurrent programs. Concurrent programs execute in parallel, either on multiple processors in your computer or in a time-sliced way on a single processor.

You have experience writing concurrent programs using fork already! fork creates a new process in which a distinct task can be executed in parallel with existing processes.

One benefit to concurrent programming is that it can improve program performance by splitting one big task into smaller subtasks which can be executed at the same time rather than sequentially (assuming you're running on a multiprocessor computer). However, since multiple processes exist in distinct virtual address spaces, the data in one process’s address space is not visible to any other process. This becomes a problem when using multiple processes to speed up a task, as you will eventally want to pool all child process results together.

In this part of the lab, you will learn how to share data between processes via pipes. A pipe is a one-way channel in which data can be sent and received between processes. On Linux, it is represented as an array of two file descriptors with the first (element 0) representing the file descriptor for the read end of the pipe, and the second (element 1) representing the file descriptor for the write end of the pipe. Data can be read from and written to pipes using the standard read and write system calls.

Assignment Specifications

For this lab, you will use multiple processes to improve the performance of a reverse index program, which takes in a word and produces a list of the occurrences of the word in a corpus of documents.

This is the essence of how a search engine like Google works! When you type a keyword into Google, the Google servers look up your keyword in a giant reverse index, finding the web pages that mention this keyword. It then ranks the results according to a highly secret algorithm.

Google's reverse index is split over many servers, and speeds up lookups using concurrency. For this assignment, you will build a more modest, single-machine reverse index, but concurrency will still help speed it up!

In the multiProcs directory of the Lab 6 stencil, we give you the following files:

wordindex.h: This file contains all shared functions and data structures. This includes the wordindex class for storing the results of a search, all functions involving reading files and directories, and functions to print out the results. You should not modify anything in this file, but you will definitely want to look at the functions and data structures it contains since you will have to use them.
revindex_sequential.cpp: This file contains a single-threaded, sequential implementation of a reverse index search. You should use this as a baseline for the performance of your reverse index, and you may want to look at the code to get an idea of how the reverse index should work. However, you should not modify anything in this file.
revindex_procs.cpp: This file contains a multiple process implementation of a reverse index search. This program should take advantage of parallel execution by creating multiple processes such that the reverse index of a term can be calculated for multiple files at once. Each file should be searched by its own process. You only need to modify this file to complete the lab.

In addition, we also give you some input files that are helpful for testing.

The books directory containing a series of .txt files of classic novels
input.txt: This file contains a series of search terms separated by a newline. It can be used for testing purposes to execute multiple searches by running ./revindex_parallel books < input.txt
WORD.txt: all files called WORD.txt (besides input.txt) contain the expected results of a reverse index of the term WORD in the books directory. These files are meant for testing/debugging purposes.

Trying running make revindex_sequential, and typing in a few terms to get a feel for how the revindex_sequential works.

Task:
Implement the following functions in revindex_procs.cpp. Below is a brief overview of what each function is resposible for. The comments in the stencil provide a more detailed explanation of each function.

process_input:

This function is responsible for setting up the pipes for each child process, creating the child processes, pooling the data from each child process, and waiting for them to complete.

process_file:

This function should be run by the child processes to search for occurrences of a given term in the child process’ delegated file (see `wordindex.h` for functions to use for this). This function is also responsible for writing the search results for the file a child process searched back to the parent process through a pipe.

read_process_results:

This function should be run by the parent process to read the results from a child process through a pipe.

When implenting this part of the lab, you may find details in the pipe section of the handout useful.

Hints

If you're getting a timeout error, its not because multiprocessing is slow! Think about where in your code reading from a pipe could be blocked. How might the amount of data written effect this? Consider the case of trying to write a huge amount of data.
Be careful of how you manage your processes. You'll want to know that all of your processes execute successfully, but keep in mind the purpose of this assignment: concurrency!
Each process you fork will be responsible for processing one file, but each iteration of the loop in `process_input` starts several (up to workers) processes. You will need to ensure that you're passing the right filename from the filenames vector to each process.
If you are confused about how the pipes between the parent and child processes work, it may help to sketch a picture of which process runs what code and what data flows from one process to another.
Recall that different processes live in different address spaces. When a worker process sends back results through a pipe, it cannot send the wordindex data structure directly (Why not? Think about where the memory backing the vectors it contains is allocated). You will need to serialize the data before writing it to the pipe; check out the helper functions we provide!
The serialize/deserialize helpers we provide only serialize (encode) some members of the wordindex object. You will need to appropriately set the others after deserializing!
Since the parent process does not know how many results a child worker process found, you'll face the question of how many bytes to read from the pipe using the read syscall. Check out the examples we provide with the explanation of pipe for some ideas on how to determine this quantity.
If you're debugging an issue, it may help to add logging to your code that prints the process ID (available via the getpid() system call) and what that process is trying to read or write to/from a pipe.

Running and Testing:

Compile:

The whole project with make all
Just the provided sequential implementation with make revindex_sequential
Just the parallel implementation with make revindex_parallel

Run:

The books directory is provided for you, but feel free to create your own directories containing fewer or shorter files for preliminary testing purposes.

The sequential implementation with ./revindex_sequential <directory>
The parallel implementation with ./revindex_parallel <directory>

Test:

We have provided you with the expected output for some search terms over the books directory (all files titled WORD.txt), so that you can compare your results. You can also compare your implementation with the sequential implementation via make check. The results should be the same.

Functionality:

$ make check
./revindex_parallel books < input.txt > output_parallel.txt
./revindex_sequential books < input.txt > output_sequential.txt
diff output_parallel.txt output_sequential.txt
... You should receive no output/differences ...

Runtime: Additionally, once you have a working reverse index using parallel processes, you should find a performance improvement over the sequential implementation. To verify this, run:

$ time ./revindex_parallel books < input.txt
$ time ./revindex_sequential books < input.txt
...  the sequential implementation should be slower than the parallel ...

Relevant System Calls

Below you will find a list of functions that you may find useful for this lab. Feel free to also view the man pages if you would like more details:

Fork:

fork()

The fork system call is used to create a new process. It takes no arguments and returns a pid_t (an integer value)
From the perspective of the child process, fork() returns 0
From the perspective of the parent process, fork() returns the process id of the newly created child
Recall that the address space of new child process is an exact copy of the parent until execv is called. Among other things, this means that child process has its own copy of all variables set by the parent prior to the call to fork()!

An example of fork’s usage is as follows:

   int pid;
   if((pid = fork()) == 0){
       /* this is the child */
   }else{
       /* this is the parent */
   }

Execv:

execv(const char* path, const char* args[])

The execv system call is used to execute a program. In doing so, it completely replaces the image of the current process with that of the new program
- Note: this means that, when successful, execv does not return since the program that called it has been replaced and is no longer running
Arguments:
- The first argument to execv should be the path to the desired program
- The second argument to execv is an array beginning with the name of the program, followed by any arguments to that program, and terminated by NULL. An example arguments array is as follows:
```
    char* args[] = {"cat", "file.txt", NULL};
```

An example of execv’s usage is as follows:

    char* path = ""/bin/cat";
    char* args[] = {"cat", "file.txt", NULL};
    execv(path, args);
    
    /* this segment will only be reached if execv 
    failed, so you know there must be some kind of 
    error to report */
    perror("execv");
    exit(1);

Wait and Waitpid:

It is often necessary for a process to wait for the child process to terminate before continuing execution. To accomplish this, you can use either wait or waitpid.

wait(int* status)

The wait system call will return once any child of the calling process terminates
wait has one argument: a pointer to an integer, which will contain the status of the terminated child once wait returns (this argument can be NULL if you do not care about the child's status)
wait returns the process id of the terminated child process on success and -1 on failure

waitpid(pid_t pid, int* status, int options)

The waitpid system call returns once the child process specified by the pid argument has changed state as dictated by the options argument.
By default (when options=0) waitpid will return when the child process terminates, but other options can be specified.
The status argument will hold the exit status of the terminated child, which can be NULL if this value is not needed
waitpid return the process id of the terminated child process of success and -1 upon failure

Pipe:

pipe(int fd[2])

The pipe system call creates a one way channel through which processes can share data.
Upon return, the fd argument will contain 2 file descriptors (integers).
- fd[0] refers to the side of the pipe from which data can be read.
- fd[1] refers to the side of the pipe to which data can be written.
Data that is written to fd[1] is buffered by the kernel until a read from fd[0]

Below are two examples of using the pipe() system call:

    /*** Example 1 ***/
    int fd[2];
    /* set up the pipe */
    if(pipe(fd) == -1){
        perror(“pipe”);
        exit(1);
    }

    if(!fork()){
        /*child process*/
        char *message = “Hello World!”;
        /* the message is written to the
         * write end of the pipe */
        write(fd[1], data, strlen(message));
        
        /* close both ends of the pipe in
         * the child process */
        close(fd[0]);
        close(fd[1]);
        exit(0);
    }else{
        /*parent process*/
        char buf[256];
        /* read the message written to 
         * the pipe into the buffer */
        int loc = 0;
        while(read(fd[0], &buf[loc], 1) > 0){
            loc++;
        }
        fprintf(stdout, “%s\n”, buf);
        
        /* wait for the child to terminate */
        wait(NULL);
        
        /* close both ends of the pipe in
         *  the parent process */
        close(fd[0]);
        close(fd[1]);
    }

    /*** Example 2 ***/
    int fd[2];
    /* set up the pipe */
    if(pipe(fd) == -1){
        perror(“pipe”);
        exit(1);
    }

    if(!fork()){
        /*child process*/
        char *message = “Hello World!”;
        int len = strlen(data);
        /* write the size of the message
         * that will be written to the pipe */
        write(fd[1], &len, sizeof(int));
        
        /* the message itself is written to 
         * the write end of the pipe */
         write(fd[1], data, len);
         
        /* close both ends of the pipe in
         * the child process */
        close(fd[0]);
        close(fd[1]);
        exit(0);
    }else{
        /*parent process*/
        int size;
        /* read from the pipe, the first item 
         * read will be the size of the message 
         * since data is read in the same order 
         * as it is written */
        read(fd[0], &size, sizeof(int));
        char *buf = new char[size];
        
        /* read the message written to the
         * pipe into the buffer */
        read(fd[0], buf, size);
        fprintf(stdout, “%s\n”, buf);
        
        /* wait for the child to terminate */
        wait(NULL);
        
        /* close both ends of the pipe in 
         * the parent process */
        close(fd[0]);
        close(fd[1]);
        delete[] buf;
    }

Note: The second example first writes the size of the data and then the data itself. Since you know the order in which data is written to the pipe, you know that if you first read sizeof(int) bytes, you will read the length of the remaining portion of the data (the message). You then know by how much to read so that you can read the entirety of the message in just one more call to read.

In the first example, data is read from the pipe one byte at a time until read returns 0 (there is no more data left to read). In this example, passing data through the pipe results in 14 system calls, 1 write to the pipe, 12 reads to get the whole message and 1 more read to hit EOF. In contrast, the second example passing data through the pipe only uses 4 system calls: 2 writes and 2 reads. System calls are costly since they involve handing over control to the OS. As such, you will generally find your programs to be more efficient if you cut down on system calls.

Debugging Resources/ GDB

Debugging Multi-process Programs:

If you use GDB to help you debug this lab (as is always recommended!), it is important to note that the default behavior of GDB is to not follow the execution of child processes after they are started with fork. As such, if you set a breakpoint on a line that executes within a child process, GDB won’t break on that line.

However, you can change this behavior by running the GDB command set follow-fork-mode child. This tells GDB to follow the execution of the child process after a fork, and leave the parent process to run freely (but not followed). After setting this, GDB will hit breakpoints you set within a forked process but not any breakpoints set in the parent process beyond the call to fork. If needed, run help set follow-fork-mode within GDB for more details.

Handin instructions

Turn in your code by pushing your git repository to github.com/csci0300/cs300-s24-labs-YOURUSERNAME.git.

Then, head to the grading server. On the "Labs" page, use the "Lab 5 checkoff" button to check off your lab.