Lecture 2: Systems Programming

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Tuesday, February 1).

Context: The CS300 Journey

After the first lecture, you may have wondered why understanding the details of how your computer works is so crucial. How does this understanding affect your goals, such as becoming a software engineer in industry or an academic computer science researcher? The answer is, beyond a natural thirst for knowledge, that this kind of understanding will make you a much better, more versatile, and more valuable computer scientist and engineer.

Some reasons why systems programming and understanding the machine matter:

Interpreting Bytes In Memory

Why are we covering this?
The only place where a computer can store information is memory. It's up to the systems software and the programs that the computer runs to decide what these bytes actually mean. They could be program code, data (integers, strings, images, etc.), or "meta-data" used to build more complex data structures from simple memory boxes. Understanding how complex programs boil down to bytes will help you debug your programs, and will help you appreciate why they behave the way they do.

A common question at this point in the course is how the computer knows which bytes in memory belong together, and what sort of data (e.g., an int, long, or string/image) a given byte is part of. The answer is that the computer does not know, and that the same byte can mean different things depending on how a program chooses to interpret its numeric value. At the end of this section, you'll understand this notion of using the same bytes in different ways.
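To make this concrete before we get to add.c, here is a minimal sketch of one set of bytes read in two ways. The array contents and the helper names (bytes_as_string, bytes_as_u32) are made up for illustration and are not part of the lecture code.

```c
#include <stdint.h>
#include <string.h>

// Four bytes in memory. Nothing about them says "string" or "number".
static const unsigned char bytes[4] = { 0x48, 0x69, 0x21, 0x00 };

// Interpretation 1: a string of ASCII characters ending in a zero byte.
// 0x48 is 'H', 0x69 is 'i', 0x21 is '!'.
const char* bytes_as_string(void) {
    return (const char*) bytes;
}

// Interpretation 2: a single 4-byte unsigned integer. memcpy copies the
// raw bytes into an integer variable without changing them.
uint32_t bytes_as_u32(void) {
    uint32_t value;
    memcpy(&value, bytes, sizeof(value));
    return value;
}
```

The string reading gives "Hi!". The integer reading depends on the byte order the processor uses: a little-endian machine such as x86-64 should see 0x00216948, while a big-endian machine would see 0x48692100. Either way, the bytes in memory are identical; only the interpretation differs.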

We will build this up from first principles. Start with add.c, a C program that serves a simple purpose: it reads two numbers from the command line and adds them. Let's dissect the code. Along the way, you'll get immersed in the basic structure of a C program and meet the crucial add() function that we'll use to explore how programs are just bytes in memory.

A Simple C Program
C Programming Resources
We'll go through the basics of C (and later C++) in lectures, but in an "immersive" way: we'll come across language features as we are trying to understand fundamental concepts of systems. Check out our C/C++ Primers if you're looking for step-by-step language tutorials or links to detailed language references.
#include <stdio.h>   // <== import standard I/O (fprintf, printf)
#include <stdlib.h>   // <== import standard library

int main(int argc, char* argv[]) {  // <== starting point of our program
    if (argc <= 2) {
        fprintf(stderr, "Usage: add A B\n\
    Prints A + B.\n");   // <== print error message if arguments are missing. "\n" is a newline character!
        exit(1);
    }

    int a = strtol(argv[1], 0, 0);  // <== convert first argument (string) to integer
    int b = strtol(argv[2], 0, 0);  // <== same for second argument
    printf("%d + %d = %d\n", a, b, add(a, b));  // <== invoke add() function, print result to console
}

What's going on here? Every C program's execution starts with the main() function. This is one of the things that the C language standard, a long, technical document, prescribes. Our program checks if the user provided enough arguments and prints an error if not; otherwise it converts the first two arguments from strings to integers using strtol() (a standard library function), calls add() on them, and prints the result.
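The two zeros passed to strtol() are easy to gloss over, so here is a small sketch of what they mean. The wrapper name parse_int is hypothetical and does not appear in add.c.

```c
#include <stdlib.h>

// parse_int is a hypothetical helper, not part of add.c.
int parse_int(const char* s) {
    // Second argument 0 (a null pointer): we don't ask strtol where
    // parsing stopped.
    // Third argument 0: strtol picks the base from the prefix. "0x" means
    // hexadecimal, a leading "0" means octal, anything else is decimal.
    return (int) strtol(s, 0, 0);
}
```

With base 0, parse_int("42") gives 42, parse_int("0x10") gives 16, and parse_int("010") gives 8. This also suggests that the add program should accept hexadecimal arguments on the command line.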

How do the argc and argv arguments to main() get set, and how is main() called?
Your computer's operating system (OS) is responsible for starting up the program, and does some prep. The command line program you're using (this is called a "shell") makes sure to put the argument count (argc) and argument values in boxes at well-known memory addresses before the OS starts your program.

Let's try to compile this program.

$ gcc -o add add.c

There's an error, because we haven't actually provided an add() function. Let's write one.
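The notes don't reproduce the function here, so what follows is one possible definition and the lecture's own version may differ. Assuming it is placed in add.c above main(), the compiler has seen add() by the time it reaches the call.

```c
// One possible definition of add(); the lecture's own version may differ.
// It takes two ints and returns their sum.
int add(int a, int b) {
    return a + b;
}
```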

The program now works, and it adds numbers. Yay! But we can also define our add function in another file – something that often happens in larger programs. Let's use the add function in addf.c instead. Since the compiler looks at each source file in isolation, we now need to tell it that there is an add function in some other file, and what arguments it takes. Let's add a line to add.c that specifies the name and arguments of add(), but does not provide an implementation. This is called a declaration: we're telling the compiler "there will be a function called add(), and you'll find out about its implementation later". All functions and variables in C have to be declared before you first use them, but they do not have to be defined at that point. We'll understand the exact difference shortly.
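Here is a sketch of the difference, squeezed into a single file so it can be compiled on its own. In the lecture's setup, the declaration would go in add.c and the definition would live in addf.c. The caller add_three is a hypothetical function used only to show a use of add() that appears before its definition.

```c
// Declaration: name, argument types, and return type, but no body.
// This is the line we would add to add.c.
int add(int a, int b);

// add_three is a hypothetical caller. It compiles because add() has been
// declared above, even though the compiler has not yet seen its body.
int add_three(int a, int b, int c) {
    return add(add(a, b), c);
}

// Definition: the actual implementation. In the lecture's setup, this
// would live in addf.c rather than in the same file.
int add(int a, int b) {
    return a + b;
}
```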

Let's try compiling this version of add.c. A different error! Why? Because we haven't told the compiler to also look at the addf.c file, which actually has our implementation of add(). To do that, let's pass two files to the compiler.

$ gcc -o add add.c addf.c

It works! Great. The compiler first compiles add.c into a file called add.o, and then compiles addf.c into a file called addf.o. These object files don't contain human-readable text, but binary bytes that the computer's CPU (central processing unit) can execute. Finally, they are linked together into the add executable.

Code as data: adding numbers

Why are we covering this?
The only place where a computer can store information is memory. This example illustrates that even the program itself ultimately consists of bytes stored in memory boxes; and that the meaning of a given set of boxes depends on how we tell the computer to interpret them (much like with different-sized numbers). Things that don't seem like programs can actually act as programs!

We talked about storing data – like integers – in memory, but where does the actual code for our programs live? It turns out it is also just bytes in memory. The CPU needs to know what calculations to run, and we tell it by putting bytes in memory that the CPU interprets as machine code, even though in other situations they may represent data like numbers or an image.

And with the right sequence of magic bytes in the right place, we can make almost any piece of data in memory run as code. Consider, for example, the add program in the datarep folder of the lecture code. Its add function is defined in addf.c, and when you compile this code by running make, you get a file called addf.o that actually contains the byte representation of the add function. To look at it, you can run objdump -S addf.o in your course VM, and you'll see that the add function is encoded as four bytes (0x8d 0x04 0x37 0xc3 in hexadecimal notation).

Programs are just bytes!

We can look at the contents of addf.o using a tool called objdump. objdump -d addf.o prints two things below the <add> line: on the left, the bytes in the file in hexadecimal notation (8d 04 37 c3), and on the right, a human-readable version of what these bytes mean in computer machine language (specifically, in a language called "x86-64 assembly", which is the language my laptop's Intel processor understands).

addf.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <add>:
   0:	8d 04 37             	lea    (%rdi,%rsi,1),%eax
   3:	c3                   	retq
        ^                       ^
        | bytes in file         | their human-readable meaning in x86-64 machine language
        | in hexadecimal        | (not stored in the file; objdump generated this)
        | notation

What does the machine language mean?

We don't know machine language yet, and though we will touch on it briefly later in the course, you'll never need to memorize it. But to give you an intuition: lea here adds the two integers, and retq tells the processor to return to the calling function.

Let's focus on the bytes. When the program runs, these bytes are stored somewhere in memory. The processor, which on its own is just a dumb piece of silicon, then reads these bytes and interprets them as instructions as to what to do next.

Now let's change our implementation in addf.c and just store the same bytes directly:

const unsigned char add[] = { 0x8d, 0x04, 0x37, 0xc3 };

We're no longer writing a function in the C programming language, we're just defining an array of bytes called add. Do you think our add program will still work?

It turns out it does work! Why? Because we are manually storing the exact same bytes in memory that the compiler generates when compiling our add function into machine instructions. The processor doesn't care that we were storing an array of data there – if we tell it to go and execute these bytes, the dumb silicon goes and does as it's told!

Now we can figure out how we could add numbers using the course logo: our crucial bytes, 8d 04 37 c3, occur inside the JPEG file of the course logo. If we just tell the processor to look in the right place, it can execute these bytes. To do that, I use the addin.c program, which asks the operating system to load the file specified in its first argument into memory, and then tells the processor to look for bytes to execute at the offset specified as the second argument. If we put the right offset (10302 decimal), the processor executes 8d 04 37 c3 and adds numbers! The image decoder, meanwhile, just interprets these bytes (which I changed manually) as data and turns them into slightly discoloured pixels.
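How would you find such an offset in the first place? One way, sketched below, is to scan the file's bytes for the four-byte pattern. This is not how addin.c works (it takes the offset as an argument); find_bytes is a hypothetical helper, and it assumes the file's contents have already been read into a buffer.

```c
#include <stddef.h>
#include <string.h>

// find_bytes is a hypothetical helper, not part of the lecture's addin.c.
// It returns the offset of the first place in buf where pattern occurs,
// or -1 if pattern does not occur in buf at all.
long find_bytes(const unsigned char* buf, size_t buf_len,
                const unsigned char* pattern, size_t pattern_len) {
    if (pattern_len == 0 || pattern_len > buf_len) {
        return -1;
    }
    // Try every starting position at which the whole pattern still fits.
    for (size_t i = 0; i + pattern_len <= buf_len; ++i) {
        if (memcmp(buf + i, pattern, pattern_len) == 0) {
            return (long) i;
        }
    }
    return -1;
}
```

The search itself treats the image purely as data: it compares bytes and never executes anything. Whether the bytes at the resulting offset act as pixels or as an add function depends entirely on what we ask the computer to do with them afterwards.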

What about the party emoji code? That secret was revealed in the lecture :-)

Pointers!

We previously discussed that memory boxes can store addresses of other memory boxes, and how an address occupies 8 bytes. Where would you store such an address? Well, if you want to hold on to it, you'll want to store it in a variable itself. Where does that variable live? It better be in memory, too! In other words, the 8 bytes corresponding to the address will have a memory location of their own. We refer to such memory locations that hold addresses as pointers, because you can think of them as arrows pointing to other memory boxes.
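In C, this looks roughly as follows. The helper names below are hypothetical and exist only to illustrate following the arrows.

```c
// All three helpers are hypothetical names used only for illustration.

// Follow the arrow and read the int stored in the box it points to.
int read_through(const int* p) {
    return *p;
}

// Follow the arrow and overwrite the int stored in the box it points to.
void write_through(int* p, int value) {
    *p = value;
}

// A pointer is itself stored in memory, so it has an address of its own.
// Given that address (an int**), follow two arrows to reach the int.
int read_through_twice(int** pp) {
    return **pp;
}
```

For example, after int x = 5; and int* p = &x;, calling write_through(p, 7) changes x itself to 7, because p holds the address of x's memory box. The expression &p gives the address of the box in which the pointer lives; on a typical 64-bit machine, that box is 8 bytes in size.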

Summary

Today, we've seen more of how a computer's memory bytes can be interpreted to represent many different kinds of data. For example, the same bytes can be interpreted as program code or as an image's pixels, and a sequence of bytes can represent characters of a string or a large number. We've also seen why addresses are incredibly important: C and C++ locate data and functions in memory by their address, and we learned some basic C syntax. We will talk more about memory representation, strings, and sequences of objects in memory next time!