Lab 1: Writing and Debugging C Programs

Due February 11, 2020, at 8:00PM

Before attempting this lab, please make sure that you have:

1. Completed Lab 0 – This will ensure that your VM and grading server account are set up properly.
2. Completed the Diversity Survey – Your grades for Lab 0 and Lab 1 will depend on whether you’ve submitted this (though all questions are optional).

Introduction

The purpose of this lab is to give you some experience with the syntax and basic features of the C programming language, as well as introduce you to a C debugging tool called gdb (GNU Debugger). Learning C will help you understand a lot of the underlying architecture of the operating system, and as a whole demystify how programs run.

If you take away anything from this course, hopefully it’s that computer systems are not magic and that much of how they work actually makes a lot of sense. Don’t be afraid to look up questions on Stack Overflow and in the Linux man pages (which provide great documentation on C library functions), and if that doesn’t help, ask on Piazza!

Why C?

Check out this article for more on why C programming is awesome! Here are some of the article’s highlights: C is a procedural programming language that was mainly developed as a systems programming language to write operating systems. The main features of the C language include low-level access to memory, a simple set of keywords, and a clean style. These features make C suitable for systems programming, such as operating system or compiler development.

If you are looking for a detailed tutorial on C, check out the links on our C primer.

Assignment

Assignment installation

Start with the cs131-s20-labs-YOURNAME repository you used for Lab 0.

First, ensure that your repository has a handout remote. Type:

$ git remote show handout

If this reports an error, run:

$ git remote add handout https://github.com/csci1310/cs131-s20-labs.git

Then run:

$ git pull
$ git pull handout master

This will merge our Lab 1 stencil code with your previous work. If you have any “conflicts” from Lab 0 (very unlikely!), resolve them before continuing further. Run git push to save your work back to your personal repository.

Exercise 1: Running and Debugging

Here’s how to run a C program

To run a C program, you first need to compile the source code into a binary. There are several widely used C compilers, but for this lab and CS 131, you will mostly use gcc (the C compiler from the GNU Compiler Collection).

In the next lab, we’ll go over more information on the compilation process.

# compile your C program into an executable binary (ones and zeros)
$ gcc name_of_program.c -o name_of_executable
# run the executable
$ ./name_of_executable

# Smile at the exciting output of your program.

However, sometimes things don’t go as planned, and instead of smiling, you’re rolling up your sleeves to solve a bug!

Like with other programming languages, C programmers frequently make use of print statements to look at the state of their program (in C, you use the printf function for this). This so-called “printf debugging” is an important approach that can get you quite far, and you’ll probably use it a lot.

Often, however, you may wish that you could stop your program in its tracks (e.g., just before you hit a bug) and interactively inspect its state. This is what debugger tools like gdb are for.

Here’s how to debug a C program using the GDB Debugger

# compile your C program using the `-g` flag to compile with debugging info
$ gcc name_of_program.c -g -o name_of_executable
# run the executable in gdb
$ gdb name_of_executable 
# set a breakpoint at a function
(gdb) b name_of_a_function
# run the program optionally with arguments ARGS (if necessary)
(gdb) r ARGS
# display the source code as you debug
(gdb) layout src
# print a variable VAR
(gdb) p VAR

# Run other gdb commands
# Track down your bug 

# quit out of gdb
(gdb) q

As explained on the GNU website, GDB can do four main things (plus other things) to help you catch bugs in the act:

Start your program, specifying anything that might affect its behavior.
Make your program stop on specified conditions.
Examine what has happened, when your program has stopped.
Change things in your program, so you can experiment with correcting the effects of one bug and go on to learn about another.

Here’s a cheatsheet of common gdb commands. Throughout this lab we’ll use a few.

Task:

Take a look at math_prog.c. There are two bugs in this program – don’t fix them quite yet.
Try compiling and running the program. (You’ll notice an unpleasant surprise.)
Try running the program in gdb.
- Set a breakpoint at the function called add_arr, run the program, open the source code, and then print out the variable a.

Example

# compile your C program using the `-g` flag to compile with debugging info
$ gcc math_prog.c -g -o math_prog 
# run the executable
$ ./math_prog
# run the executable in gdb
$ gdb math_prog
# set a breakpoint at a function
(gdb) b add_arr
# run the program, optionally with arguments (if necessary)
(gdb) r
# display the source code as you debug
(gdb) layout src
# print the variable a
(gdb) p a
# Quit gdb
(gdb) q

Finding the Bugs using GDB

Note: For the remainder of this lab, try to refrain from using print statements to debug. The following gdb commands can be very helpful in debugging C programs (particularly the bt command), and the sooner you get familiar with working with gdb, the easier your life will be.

Once you’re stopped at a breakpoint at add_arr, run the following commands:

(gdb) c # continues the program to the next breakpoint or to termination
# ...You should notice a SEGFAULT

# this should show you exactly where the fault occurred
(gdb) layout src 
# this call is accessing invalid memory
(gdb) p *(c + i)
Cannot access memory at address 0xf0b5ff
# ... Hmm where was the variable `c` initialized? 

# Prints a backtrace of the program
# The 'bt' command is incredibly useful anytime you encounter a SEGFAULT. 
(gdb) bt

The bt command shows you the function calls that led up to where you currently are in the program (in our case, the segfault). Each function call comes with a stack frame, which contains information specific to that call (such as arguments and local variables). We will hear more about stack frames later in the course. In gdb, we can check out different frames (i.e. check out different function calls), like so:

# The 'f' command allows you to switch frames
# the below command switches to frame #1, which corresponds to the main function
(gdb) f 1 
(gdb) p c
# ... Oh, `c` was declared in `main`, but never initialized

Hopefully you noticed that the pointer c is never initialized, so it holds an arbitrary address and dereferencing it accesses invalid memory! We can fix this in two ways:

Stack allocate enough space for the whole array – and then pass in a pointer to that array to add_arr.
Heap allocate enough space for the array – and then pass in a pointer to that array to add_arr.

(In this case, because we’re only using the array for a short period of time, the stack allocation makes sense.)

First, try it yourself, but here are some tips if you need help.

To stack allocate the array change the declaration to: int c[6];
To heap allocate the array:

// requires #include <stdlib.h> for malloc and free
int *c = malloc(sizeof(int) * 6);
// ... use the pointer and when you're done ...
free(c);

Once you fix the bug and re-compile your program, you should notice that the program no longer segfaults, but it’s still not working as expected.

Task: Use gdb to find (and then fix) the second bug.

Hint!

Typically when C programmers pass arrays as arguments to functions, they also include the length of the array as another argument to the function. Think about why they might do this.

Exercise 2: Let’s get programming!

Take a look at simple_repl.c. This program reads in input from the terminal and breaks up a single line of text by either a space or comma! Fun fact: “REPL” stands for “read-eval-print loop”, and one place where you may have encountered a REPL before is the Python interpreter: you type a line, it evaluates it, and it prints some result.

As you’re reading through the code, here are some functions and variables you might want to look into:

fgets
printf
stdin and stdout
strtok (This is a wacky function that we’ll use later, so pay special attention to it.)

Task:

Compile and run simple_repl.c. Enter a few lines of text to get a feel for how it works.
- You can also redirect standard input using the < symbol, so that instead of reading in commands from the terminal, it reads them from a file.
  Try:
  $ ./simple_repl < files/three-star.csv or
  $ ./simple_repl < files/A_Christmas_Carol_in_Prose.txt
  Similarly, you can write:
  $ echo "hello world" | ./simple_repl, piping the output from echo into your REPL.
- Typing Ctrl-D tells the terminal to report end-of-file (EOF) to the program, which causes fgets to return NULL and the program to exit.
Run the program in gdb, and perform the following commands:
- Break at main.
- Use the next (n) command until the call to fgets.
- Use the print (p) and examine (x) commands to examine the contents of buf before and after the call to fgets.
Where is the char array buf allocated (the stack or the heap)?

Help

# set a break point at main
(gdb) b main
# show source code, and then run the program
(gdb) layout src
(gdb) r
# use the n command to execute the next line of code 
(gdb) n
# keep using the n command until you're about to execute the `fgets`
(gdb) n 
#...
# print out the buffer before executing fgets and after
(gdb) p buf
# the program will hang
# (it's waiting for input from stdin for the fgets function)
hello there # type a line of text

# print the buffer
(gdb) p buf # you should see the text you inputted
(gdb) x/10c buf # examines (x) 10 characters (/10c) starting at buf

`strtok`

In this section, you will be writing your own version of strtok. It might sound daunting, but we’ll walk you through it. Take a look at the link above if you need clarification on what exactly strtok does.

Note: You may have noticed that strtok maintains state internally from iteration to iteration. It does this by declaring a static local variable. Essentially, the function creates the variable in a region of memory that will persist until the end of the program (almost like a global variable), but the variable is only accessible within the function. This part has been written for you.

Task: Take a look at my_strtok.c. You’ll be implementing your own version of strtok.

In simple_repl.c:
- At the top, #include "my_strtok.h".
- Change the calls to strtok to use my_strtok.
Fill in the my_strtok.c according to the TODOs in the comments.
- Exclusively use pointer operations rather than array notation (brackets []).
- Here are some functions you may want to look into:
  - strtok
  - strspn
  - strlen
  - strcspn
- Note: For the above functions, if you ever want to check out their behavior on edge cases (e.g., what would happen if you pass in an empty string, or a null string?), we highly recommend using repl.it for testing!
You can test your code using simple_repl.c and some test cases in test_runner.c. Compiling and running test_runner.c will run the test cases in the function test_strtok.
- In order to compile with your own implementation of strtok, you will need to add my_strtok.c to the source list. For instance, to compile the REPL with my_strtok(), the command would be:
  - gcc simple_repl.c my_strtok.c -g -o simple_repl

Note: Don’t worry about the interplay between my_strtok.c and my_strtok.h for now. If you are curious, a comment in my_strtok.h explains what it’s about, but we will go over compilation more in Lab 2!

`getline`

This REPL is really good at tokenizing based on commas and spaces now, but you may have realized that the program as a whole might struggle with parsing long sentences.

Task:

Try running:
$ ./simple_repl < files/A_Christmas_Carol_excerpt.txt. This file contains the first two paragraphs of the Christmas Carol text file, and places each sentence on its own line. If you look at the output, you’ll see some weird-looking lines. This is because our program can’t parse more than 99 characters at a time.
- Why can’t our program parse more than 99 characters?

One solution to this problem is to increase our BUFFER_SIZE to something like 1,000,000 (roughly 1 MB), but in the cases where we’re reading smaller lines, this will waste a lot of space on our stack. Plus, what if someone had a really, really long line with more than a million characters? We really need to be able to dynamically adjust the size of our buffer (hint: the heap).

getline is a great function for this! It uses malloc and realloc to dynamically allocate memory as it’s reading in more characters from a file.

Task:

Change your simple_repl.c to use getline!
- Test it on files/A_Christmas_Carol_excerpt.txt.
- Remember to free the character array before the program exits.

Hint

Your char array no longer needs to be allocated on the stack. If you declare a NULL char pointer, getline will initialize it correctly.
However, because getline may modify the char pointer itself (i.e., it can change the address that the pointer holds, not just the characters stored at that address), it needs the address of your char pointer, which can be stack allocated.

[Optional] Lecture Review: How are C Programs Laid Out?

Before you start coding, let’s use the debugger to examine how our C program is laid out in memory.

Variables in C never overlap; each variable occupies distinct storage. Additionally, each variable in C has a lifetime, which is called storage duration by the standard. There are three different kinds of lifetime.

static lifetime: The variable lasts as long as the program runs.
automatic lifetime: The compiler allocates and destroys the memory automatically as the program runs, based on the variable’s scope (the region of the program in which it is meaningful).
dynamic lifetime: The programmer allocates and destroys the object explicitly.

The compiler and operating system work together to put variables at different addresses. A program’s address space (which is the range of addresses accessible to a program) divides into regions called segments. Objects with different lifetimes are placed into different segments. The most important segments are:

Segment	Lifetime	Contains
Code (text, read-only data)	static, unmodifiable	program instructions and constant global variables
Data (data, bss)	static, modifiable	initialized and uninitialized non-constant global variables
Stack	automatic, modifiable	temporary local variables for each function call
Heap	dynamic, modifiable	memory that is explicitly allocated and deallocated

An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all of that data must be in memory for the program’s entire lifetime, it is stored in the executable file on disk; when the program runs, the operating system loads these segments into memory. The stack and heap segments, by contrast, grow on demand.

Note on disks

A disk (HDD, for hard disk drive, or SSD, for solid-state drive) is a persistent form of storage for data. The data on disk is maintained after your computer shuts down or the power fails, but data in memory is not!

Let’s take a look at this in action! We’ll be looking at hello_world.c and the binary compiled from it.

Note: Modern compilers employ many protections that make it difficult for users to examine memory, because malicious users can perform some serious attacks on unprotected programs. We’re using the -fno-pic and -no-pie flags to turn off some of these protections for the purposes of this exercise.

# compile your program with the following flags
$ gcc hello_world.c -no-pie -fno-pic -g -o hello_world
$ gdb hello_world
# before setting any breakpoints, do the following in gdb:
(gdb) info files
# don't quit yet ...

Quick Interjection:

info files will print out the static segments that have been loaded into memory. The segments are formatted as:
[segment-start-address] - [segment-end-address] is [name-of-segment]
We want to pay attention to the .text (the C program’s instructions, i.e., its code), .rodata (read-only data), .data (initialized data), and .bss (uninitialized data) segments. These are static segments of our program that have already been placed into memory.
The line Entry point: 0x400590 refers to an address in the .text region of memory corresponding to the first instruction the program will run (the exact address may vary on your machine).

# ... back to the terminal
(gdb) p GLOBAL_VAR            # print the contents of GLOBAL_VAR
200
(gdb) p &GLOBAL_VAR           # print the address of GLOBAL_VAR
(int *) 0x601058 <GLOBAL_VAR> # the address may vary on your machine

# examine (x) the contents at the address of GLOBAL_VAR as an integer (/d)
(gdb) x/d &GLOBAL_VAR
0x601058 <GLOBAL_VAR>:	200

Notice that the address of the global GLOBAL_VAR variable is in the .data segment – the region where initialized global memory lives.

Task:

Examine the addresses of const_variable, uninitialized_variable, and main in gdb, and identify the segment of memory each has been loaded into.
Now print out the first 3 strings in the .rodata section. Notice that any static strings used in the course of the hello_world program are stored in this section.

Hint:

use x/d to examine as a decimal
use x/s to examine as a string
use x/c to examine as a character
use x/a to examine as an address
use x/i to examine as an instruction
use x/3i to examine next 3 instructions that begin at an address
Similarly x/3s will examine the first 3 strings beginning at an address

Now, let’s continue our program in gdb. Set a breakpoint in main and run.

(gdb) b main
(gdb) r
# Now in main:
(gdb) info proc mappings
# Again, don't quit yet ...

Here, the command info proc mappings shows the address ranges currently accessible to the program and their corresponding regions. Note that the mappings for this process currently include a stack (labeled [stack]), but not a heap.

Task:

Step through to line 18 of hello_world.c (past the declaration of local_var) and then examine the address of local_var. What section is local_var contained in?
Additionally, print local_var. Since it is a char pointer, this should show the address local_var points to and the value (string) at that address. In what section is the address contained in local_var? (Hint: you examined this section in the previous task)

Tip:

You can use the command layout src to see where you are in the code while it is running in gdb
To skip to line 18 of hello_world.c, you can use the next or n command in gdb so that you can step over any function calls
Additionally, you can set a breakpoint on line 18 with b 18 and then use continue or c to continue straight to that line

Now, let’s continue stepping through main until line 22 (past the initialization of heap_allocated).

Task:

Once again, print out the addresses accessible to the program using info proc mappings. Do you notice any differences?
Additionally, examine the address of heap_allocated, and identify the section it is contained in.

Hint:

The first time you examined the addresses accessible to the process right at the start of main, the program had not yet allocated any data in the heap. Hence, the heap was not listed as an accessible section.

Handin Instructions

You will turn in your code by pushing your git repository to github.com/csci1310/cs131-s20-labs-YOURNAME.git.

As a quick recap, you do this by committing your changes: either use git commit -a to commit all changes, or use git add -p to interactively choose which changes to “stage” for commit and then commit them using git commit. Finally, push your changes to your git repository via git push.

Then, head to the grading server. On the “Labs” page, use the “Lab 1 checkoff” button to check off your lab.

Note: Your lab grades are associated with the commit that you used as your lab checkoff, so when you check off your Lab 1, the grade for Lab 0 will no longer be shown. But rest assured: if you switch to the commit you used for the Lab 0 checkoff, you’ll hopefully see a 2/2 next to Lab 0.