Lecture 2: Systems Programming

» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 6pm Wednesday, January 29).

Context: The CS131 Journey

After the first lecture, you may have wondered why understanding the details of how your computer works is so crucial. How does this understanding affect your goals, such as becoming a software engineer in industry or an academic computer science researcher? The answer is, beyond a natural thirst for knowledge, that this kind of understanding will make you a much better, more versatile, and more valuable computer scientist and engineer.

Some reasons why systems programming and understanding the machine matter:

So, what will you learn throughout this course? Here's an overview.

Interpreting Bytes In Memory

Why are we covering this?
The only place where a computer can store information is memory. It's up to the systems software and the programs that the computer runs to decide what these bytes actually mean. They could be program code, data (integers, strings, images, etc.), or "meta-data" used to build more complex data structures from simple memory boxes. Understanding how complex programs boil down to bytes will help you debug your programs, and will make you appreciate why they behave the way they do.

Let's recap the final example from last lecture. We used bytes from the course logo image file to add numbers, and other bytes to print messages to the terminal. How is that possible? At the end of this section, you'll understand how.

We will build this up from first principles. Start with add.c, a C program that serves a simple purpose: it reads numbers from the command line and adds them. Let's dissect the code: you'll get immersed in the basic structure of a C program, and you'll meet the crucial add() function that we'll use to explore how programs are just bytes in memory.

A Simple C Program
C Programming Resources
We'll go through the basics of C (and later C++) in lectures, but in an "immersive" way: we'll come across language features as we are trying to understand fundamental concepts of systems. Check out our C/C++ Primers if you're looking for step-by-step language tutorials or links to detailed language references.
#include <stdio.h>   // <== import standard I/O (fprintf, printf)
#include <stdlib.h>   // <== import standard library

int main(int argc, char* argv[]) {  // <== starting point of our program
    if (argc <= 2) {
        fprintf(stderr, "Usage: add A B\n\
    Prints A + B.\n");   // <== print error message if arguments are missing. "\n" is a newline character!
        exit(1);
    }

    int a = strtol(argv[1], 0, 0);  // <== convert first argument (string) to integer
    int b = strtol(argv[2], 0, 0);  // <== same for second argument
    printf("%d + %d = %d\n", a, b, add(a, b));  // <== invoke add() function, print result to console
}

What's going on here? Every C program's execution starts with the main() function. This is one of the things that the C language standard, a long, technical document, prescribes. Our program checks if the user provided enough arguments and prints an error if not; otherwise, it converts the first two arguments from strings to integers using strtol() (a standard library function), calls add() on them, and prints the result.

How do the argc and argv arguments to main() get set, and how is main() called?
Your computer's operating system (OS) is responsible for starting up the program, and does some prep. The command line program you're using (this is called a "shell") makes sure to put the argument count (argc) and argument values in boxes at well-known memory addresses before the OS starts your program.
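
For example, if you run ./add 3 4, main() starts with argc == 3, argv[0] == "./add", argv[1] == "3", and argv[2] == "4" – so the argc <= 2 check in the code above does not trigger the usage message, and the program continues.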

Let's try to compile this program.

$ gcc -o add add.c

There's an error, because we haven't actually provided an add() function. Let's write one.
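
The missing function is tiny. A minimal version might look like this (the lecture code may differ in details):

int add(int a, int b) {
    return a + b;   // add the two arguments and return the result
}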

The program now works, and it adds numbers. Yay! But we can also define our add function in another file – something that often happens in larger programs. Let's use the add function in addf.c instead. Since the compiler looks at each source file in isolation, we now need to tell it that there is an add function in some other file, and what arguments it takes. Let's add a line to add.c that specifies the name and arguments of add(), but does not provide an implementation. This is called a declaration: we're telling the compiler "there will be a function called add(), and you'll find out about its implementation later". All functions and variables in C have to be declared before you first use them, but they do not have to be defined. We'll understand the exact difference shortly.
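
Concretely, the declaration we add near the top of add.c looks something like this:

int add(int a, int b);   // declaration only: no body; the definition lives in addf.c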

Let's try compiling this version of add.c. A different error! Why? Because we haven't told the compiler to also look at the addf.c file, which actually has our implementation of add(). To do that, let's pass two files to the compiler.

$ gcc -o add add.c addf.c

It works! Great. The compiler first compiles add.c into a file called add.o, then compiles addf.c into a file called addf.o, and finally links the two together into the executable add. The object files don't contain human-readable text, but binary bytes (machine code) that the computer's CPU (central processing unit) can execute.
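
Under the hood, this single command is roughly equivalent to running three steps – compiling each source file separately and then linking the resulting object files (a sketch; gcc picks these intermediate file names by default):

$ gcc -c add.c              # produces add.o
$ gcc -c addf.c             # produces addf.o
$ gcc -o add add.o addf.o   # links the object files into the executable "add"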

Programs are just bytes!

We can look at the contents of addf.o using a tool called objdump. objdump -d addf.o prints two things below the <add> line: on the left, the bytes in the file in hexadecimal notation (8d 04 37 c3), and on the right, a human-readable version of what these bytes mean in computer machine language (specifically, in a language called "x86-64 assembly", which is the language my laptop's Intel processor understands).

addf.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <add>:
   0:	8d 04 37             	lea    (%rdi,%rsi,1),%eax
   3:	c3                   	retq
        ^                       ^
        | bytes in file         | their human-readable meaning in x86-64 machine language
        | in hexadecimal        | (not stored in the file; objdump generated this)
        | notation
What does the machine language mean?

We don't know machine language yet, and though we will touch on it briefly later in the course, you'll never need to memorize it. But to give you an intuition: the lea instruction here adds the two integers, and retq tells the processor to return to the calling function.

Let's focus on the bytes. When the program runs, these bytes are stored somewhere in memory. The processor, which on its own is just a dumb piece of silicon, then reads these bytes and interprets them as instructions that tell it what to do next.

Now let's change our implementation in addf.c and just store the same bytes directly:

const unsigned char add[] = { 0x8d, 0x04, 0x37, 0xc3 };

We're no longer writing a function in the C programming language, we're just defining an array of bytes called add. Do you think our add program will still work?

It turns out it does work! Why? Because we are manually storing the exact same bytes in memory that the compiler generates when compiling our add function into machine instructions. The processor doesn't care that we were storing an array of data there – if we tell it to go and execute these bytes, the dumb silicon goes and does as it's told!

Now we can figure out how we could add numbers using the course logo: our crucial bytes, 8d 04 37 c3, occur inside the JPEG file of the course logo. If we just tell the processor to look in the right place, it can execute these bytes. To do that, I use the addin.c program, which asks the operating system to load the file specified in its first argument into memory, and then tells the processor to look for bytes to execute at the offset specified as the second argument. If we pass the right offset (10302 in decimal), the processor executes 8d 04 37 c3 and adds numbers! The image decoder, meanwhile, just interprets these bytes (which I changed manually) as data and turns them into slightly discoloured pixels.
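
To make this more concrete, here is a hedged sketch of the idea behind addin.c – not the actual course code, and with all error checking omitted. It asks the OS to map the file into memory with execute permission (here using mmap(), one way to do this), and then calls the bytes at the given offset as if they were a function:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>

typedef int (*add_fn)(int, int);   // type of a function that takes two ints and returns an int

int main(int argc, char* argv[]) {
    int fd = open(argv[1], O_RDONLY);               // open the file (e.g., the logo JPEG)
    struct stat st;
    fstat(fd, &st);                                 // find out how large the file is
    unsigned char* bytes = mmap(NULL, st.st_size,   // map the file into memory, executable
                                PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
    long offset = strtol(argv[2], 0, 0);            // offset of the bytes we want to execute
    add_fn fn = (add_fn)(bytes + offset);           // pretend those bytes are a function...
    printf("%d\n", fn(2, 3));                       // ...and call it!
}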

What about the party emoji code? That secret was revealed in the lecture :-)

Exploring Data Representation In Memory

Now that we understand how programs are just bytes in memory, let's look in more detail at how data is represented in memory.

Why are we covering this?
We're now building an understanding of where the different parts of a C/C++ program (and, in fact, programs in other languages too!) are stored in memory. This will help you understand how a program obtains and manages memory, something that some programming languages (e.g., Java, OCaml, Pyret) do automatically behind your back, while others (particularly systems programming languages like C and C++) force you as the programmer to do some of this memory management yourself. This gives you a great degree of control, and lets you avoid expensive hidden memory allocation and copying costs.

For this, we will use the program mexplore.c. This program declares and defines (recall the difference!) three variables, all of type char. char is the name for a byte type in the C language; the name refers to the fact that a byte is exactly sufficient to store one character according to the ASCII standard, a way of translating numbers into characters and vice versa. Computers can only store numbers, so all characters in a computer are actually "encoded" as numbers. For example, the uppercase letter "A" in ASCII corresponds to the number 65 (see man ascii for the translation table).
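
For instance (a tiny example of our own, not from the lecture code), these two definitions put exactly the same byte into memory:

char c1 = 'A';   // the character "A"
char c2 = 65;    // the same byte value, written as a number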

What's ASCII, and do we still use it today?

In the early days of computers, every computer had its own way of encoding letters as numbers. ASCII, the American Standard Code for Information Interchange, was defined in the 1960s to find a common way of representing text. Original ASCII uses 7 bits, and can therefore represent 128 distinct characters – enough for the alphabet, numbers, and some funky special characters (e.g., newlines (\n), the NUL character (\0), and the "bell" character that made typewriter bells go off).
But even the 256 characters that fit into a full byte aren't sufficient to support languages that use non-Latin alphabets, and certainly not advanced emoji. So, while all of today's computers still support ASCII, we've mostly moved on to a new standard, Unicode, which supports 1.1 million characters using one to four bytes per character. Fun fact: to be backwards compatible, Unicode is defined such that the original ASCII character encodings remain valid Unicode characters.

What's the difference between these three char variables? Let's take a look. The first one, global_ch, is defined in what we call global scope: it's at the top level of the file, not inside a function or inside the curly braces ({ }) that C and C++ use to delineate scopes in the program. This variable can be referred to from anywhere in the entire program.

The second variable, const_global_ch, is also a global variable, but the const keyword indicates that it is constant, and that the compiler and OS should not allow modifications to it.

Finally, our third variable is inside function f(). It's called local_ch and is a local variable. It's valid only within the scope of f() and other parts of the program (such as main) cannot refer to it.

The hexdump() function that f() calls is defined in hexdump.c and imported via hexdump.h, a "header file". (In a future lecture, we'll talk about why header files exist.) hexdump(ADDR, N) has the effect of printing the contents of N bytes of memory at address ADDR. We're passing our character variables to it, but we prefix each variable with an ampersand character, &. So: hexdump(&global_ch, 1) means "print 1 byte from the box located at the address of global_ch".
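
Putting these pieces together, mexplore.c looks roughly like this (a sketch based on the description above; the actual file may differ in its details):

#include "hexdump.h"

char global_ch = 'A';               // global scope, modifiable from anywhere
const char const_global_ch = 'B';   // global and constant

void f() {
    char local_ch = 'C';            // local variable, only valid inside f()
    hexdump(&global_ch, 1);         // print 1 byte at the address of global_ch
    hexdump(&const_global_ch, 1);
    hexdump(&local_ch, 1);
}

int main() {
    f();
}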

This is an important concept of the C language: you can always get the memory address at which an object is located. The term "object" here means something different from what it means in an object-oriented language like Java: rather than an instance of a class, an "object" according to the C standard is a set of bytes that contain a value. This can be code (a function) or data (a variable).

In other words, local and ptr in the snippet below refer to the same object, i.e., to the same bytes of memory:

void f() {
    int local = 1;
    int* ptr = &local;
}
&local means "the address of local", and the value stored in the memory location where ptr is located is the 8 bytes that make up the address of local. The type of ptr is int*, which signifies that it is the address of an integer in memory (a short* would be the address of a short, a char* the address of a char, etc.).

You can invert the ampersand operator using the asterisk (*) operator: *ptr dereferences the pointer and turns the address back into a value. In other words, *&local is the same as plain, old local. These concepts are very, very important – you'll use them all the time!
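
Here is a small standalone example (our own, not from the lecture code) that shows & and * in action:

#include <stdio.h>

int main() {
    int local = 1;
    int* ptr = &local;                    // ptr stores the address of local
    *ptr = 2;                             // writing through the pointer changes local itself
    printf("%d %d\n", local, *&local);    // prints "2 2"
}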

Back to our mexplore.c program though. Let's look at what it prints when we run it (note that the specific addresses will be different on your computer).

$ ./mexplore
00601038  41                                                |A|
004009a4  42                                                |B|
7ffd4977e80f  43                                                |C|

On the left, we see the addresses of our three variables, printed in hexadecimal notation. Next, just to the right of that, we see the hexadecimal value of the data stored in the byte at each of these addresses. For example, hexadecimal 41 (often written 0x41 for clarity) is equal to ... 16*4 + 1 = 65! Not surprisingly, this is the ASCII code for the character "A", which we see printed on the right.

But let me draw your attention to the addresses on the left. They vary a fair amount! The exact locations of variables in memory are decided by the compiler and OS together, but the general region where an object lives is determined by its lifetime. Think about how long each of our variables needs to stick around before the memory can be reused:

  1. The global variables, global_ch and const_global_ch, need to be around for the entire runtime of the program, as the program could reference them anywhere in the code. This is called a static lifetime.
  2. The local variable, local_ch, needs to stick around until it goes out of scope, which happens when the execution reaches the closing curly brace of f(). After it's out of scope, no code in the program can refer to the variable, so it is fine to reuse its memory. This is called an automatic lifetime. Local variables and function arguments have automatic lifetimes.

But what if a function needs to create an object whose size is not known at the start of the program (so it can't be global) and which also needs to survive beyond the end of the function? In this common situation, neither a static lifetime nor an automatic lifetime are appropriate.
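
For example (an illustration of our own, not from the lecture code), returning the address of a local variable does not work, because the variable's automatic lifetime ends as soon as the function returns:

char* broken_make_char() {
    char local_ch = 'C';
    return &local_ch;   // WRONG: local_ch's memory may be reused once the function returns
}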

Dynamically Allocated Objects

The C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.

To set aside memory, you use the malloc() standard library function (memory allocate). malloc() takes only one argument, which is the number of bytes you're asking for, and it returns a pointer to the newly allocated memory (i.e., the address where the OS has set aside memory boxes for this object). For example,

char* allocated_ch = (char*)malloc(1);
reserves 1 byte of memory and stores the address of that memory in the variable allocated_ch. The char* in brackets is a cast of the returned pointer to the type we expect (a pointer to a char); this is needed because malloc() itself does not have any idea what kind of object you're allocating, so we need to tell the compiler.

Similarities and differences with Java

Calls to malloc() may look clunky, but they effectively do the same thing as the new keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides a new keyword that, under the hood, invokes malloc(). One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.

The big upside of dynamic-lifetime objects is that we can decide at runtime how big they need to be, and that they can outlive the function that creates them. Consider a string that holds the characters a user typed into the program – a quantity that's hard to predict in advance, and data that we certainly want to outlive the function that reads the input! The big downside of dynamic-lifetime objects is that it's the programmer's responsibility to free the memory allocated. You do this by calling the free() function with the address of the allocated boxes as an argument. For example, free(allocated_ch); will free the memory we asked malloc() to set aside for allocated_ch.
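
Putting allocation and deallocation together (a small example of our own; the function and variable names are not from the lecture code):

#include <stdio.h>
#include <stdlib.h>

char* make_char() {
    char* allocated_ch = (char*)malloc(1);   // reserve 1 byte with dynamic lifetime
    *allocated_ch = 'D';                     // store a character in the new box
    return allocated_ch;                     // the object outlives make_char()
}

int main() {
    char* ch = make_char();
    printf("%c\n", *ch);
    free(ch);   // our responsibility: return the memory once we're done with it
}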

Incorrect use of dynamic lifetimes is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature.

Memory Segments

Objects with different lifetimes are grouped into different regions in memory. The program code, global variables, and constant global variables are all stored in static segments, as these all have static lifetimes and known sizes at compile time.

Other objects come and go, and therefore the memory regions that contain them grow and shrink. For example, as functions call each other, they create more and more local variables with automatic lifetimes; and as the program calls malloc() to reserve memory for objects with dynamic lifetimes, more memory is needed for these objects.

If we look at the addresses printed by mexplore, we see that the global variables are stored at relatively low memory addresses (around 0x60'0000 and 0x40'0000 hexadecimal), while the local variable is stored at a high address, close to 0x7fff'ffff'ffff (about 2^47), and the dynamically allocated character is stored in between (albeit closer to the static segments).

There is a reason for this placement: it allows both types of segments to grow without risk of getting in each other's way. In particular, the automatic-lifetime segment grows downwards as more local variables come into scope, while the dynamic-lifetime segment grows upwards. If they ever meet, your computer runs out of memory.
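
You can observe this layout yourself with a few lines of code (a sketch of our own; the exact addresses will differ on every machine and every run):

#include <stdio.h>
#include <stdlib.h>

int global = 1;                                // static lifetime: lives in a static segment

int main() {
    int local = 2;                             // automatic lifetime: in the segment that grows downwards
    int* dynamic = (int*)malloc(sizeof(int));  // dynamic lifetime: in the segment that grows upwards
    printf("static:    %p\n", (void*)&global);
    printf("dynamic:   %p\n", (void*)dynamic);
    printf("automatic: %p\n", (void*)&local);
    free(dynamic);
}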

Finally, these segments have names:

  1. The static segments hold the program code and the global variables (the code, or "text", segment and the data segment).
  2. The segment holding local variables with automatic lifetimes is called the stack.
  3. The segment holding dynamically allocated objects is called the heap.

The stack and heap terms are important, and you will keep seeing them!

Java Similarities Note

In Java, any object created with the new keyword is allocated with dynamic lifetime and lives on the heap. (Java puts only "primitive" types (int, double etc.) on the stack.) Indeed, Java under the hood uses malloc()! Yet, it seems like Java has automatic lifetime for all objects, as you never need to destroy them explicitly! This works because your program runs inside the Java virtual machine (JVM), which "magically" injects code that tracks whether each object is still reachable via an in-scope variable; if this reference count goes to zero, the JVM automatically calls free() to delete the object. But all this magic is not free in performance terms, as there is code to run to keep track of objects' reference counts. C and C++ instead opt to do nothing and give maximum control to the programmer, for better or worse.

Summary

Today, we've seen how a computer's memory bytes can be interpreted to represent many different kinds of data. For example, the same bytes can be interpreted as program code or as an image's pixels, and a sequence of bytes can represent characters of a string or a large number. We've also seen why addresses are incredibly important: C and C++ locate data and functions in memory by their address. We learned some basic C syntax, and built an understanding of how a program's memory is split into different segments that contain objects with different lifetimes. We will talk more about dynamic lifetimes, strings, and sequences of objects in memory next time!