Lecture 2: Systems Programming
🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Tuesday, February 1).
Context: The CS300 Journey
After the first lecture, you may have wondered why understanding the details of how your computer works is so crucial. How does this understanding affect your goals, such as becoming a software engineer in industry or an academic computer science researcher? The answer is, beyond a natural thirst for knowledge, that this kind of understanding will make you a much better, more versatile, and more valuable computer scientist and engineer.
Some reasons why systems programming and understanding the machine matters:
- There are billions of lines of C and C++ code in the world. If you work as a software engineer, you will sooner or later have to deal with these languages. For example, operating systems, web browsers, machine learning toolkits like TensorFlow and PyTorch, and high-performance infrastructure used by companies like Facebook and Google are written in C++. Many companies yearn to hire engineers who know these languages!
- The concepts we're learning here are fundamental and ultimately impact how other languages work. Even if you're programming in Go or Java, you will understand these languages a lot better if you know the underlying infrastructure.
- When you work with concurrent and distributed systems (which most moderately complex applications we use today, including nearly all applications on your phone, are!), mysterious bugs and seemingly impossible behavior will make a lot more sense (and be easier to debug!) if you know how the computer actually executes these programs.
Interpreting Bytes In Memory
Why are we covering this?
The only place where a computer can store information is memory. It's up to the systems software and the programs that the computer runs to decide what these bytes actually mean. They could be program code, data (integers, strings, images, etc.), or "meta-data" used to build more complex data structures from simple memory boxes. Understanding how complex programs boil down to bytes will help you debug your program, and will make you appreciate why they behave the way they do.
A common question at this point in the course is how the computer knows which bytes in memory belong together, and what sort of data (e.g., an int, long, or string/image) a given byte is part of. The answer is that the computer does not know, and that the same byte can mean different things depending on how a program chooses to interpret its numeric value. By the end of this section, you'll understand this notion of using the same bytes in different ways.
We will build this up from first principles. Start with add.c, which is a C program that serves a simple purpose: it reads two numbers from the command line and adds them. Let's dissect the code. You'll get immersed in the basic structure of a C program and see the crucial add() function that we'll use to explore how programs are just bytes in memory.
A Simple C Program
C Programming Resources
We'll go through the basics of C (and later C++) in lectures, but in an "immersive" way: we'll come across language features as we are trying to understand fundamental concepts of systems. Check out our C/C++ Primers if you're looking for step-by-step language tutorials or links to detailed language references.
#include <stdio.h>  // <== import standard I/O (fprintf, printf)
#include <stdlib.h> // <== import standard library (strtol, exit)

int main(int argc, char* argv[]) { // <== starting point of our program
    if (argc <= 2) {
        fprintf(stderr, "Usage: add A B\n\
Prints A + B.\n"); // <== print error message if arguments are missing. "\n" is a newline character!
        exit(1);
    }

    int a = strtol(argv[1], 0, 0); // <== convert first argument (string) to integer
    int b = strtol(argv[2], 0, 0); // <== same for second argument

    printf("%d + %d = %d\n", a, b, add(a, b)); // <== invoke add() function, print result to console
}
What's going on here? Every C program's execution starts with the main() function. This is one of the things that the C language standard, a long, technical document, prescribes. Our program checks if the user provided enough arguments and prints an error if not; otherwise it converts the first two arguments from strings to integers using strtol() (a standard library function), calls add() on them, and prints the result.
How do the argc and argv arguments to main() get set, and how is main() called?
Your computer's operating system (OS) is responsible for starting up the program, and does some prep. The command line program you're using (this is called a "shell") makes sure to put the argument count (argc) and argument values (argv) in boxes at well-known memory addresses before the OS starts your program.
Let's try to compile this program.
$ gcc -o add add.c
There's an error, because we haven't actually provided an add() function. Let's write one.
The program now works, and it adds numbers. Yay! But we can also define our add function in another file – something that often happens in larger programs. Let's use the add function in addf.c instead.
Since the compiler looks at each source file in isolation, we now need to tell it that there is an add function in some other file, and what arguments it takes. Let's add a line to add.c that specifies the name and arguments of add(), but does not provide an implementation. This is called a declaration: we're telling the compiler "there will be a function called add(), and you'll find out about its implementation later". All functions and variables in C have to be declared before you first use them, but they do not have to be defined at that point. We'll understand the exact difference shortly.
Let's try compiling this version of add.c. A different error! Why? Because we haven't told the compiler to also look at the addf.c file, which actually has our implementation of add().
To do that, let's pass two files to the compiler.
$ gcc -o add add.c addf.c
It works! Great. The compiler first compiles add.c into an object file (add.o), then compiles addf.c into another object file (addf.o), and finally links the two into the add executable. Object files don't contain human-readable text, but binary bytes that the computer's CPU (central processing unit) can execute.
Code as data: adding numbers
Why are we covering this?
The only place where a computer can store information is memory. This example illustrates that even the program itself ultimately consists of bytes stored in memory boxes; and that the meaning of a given set of boxes depends on how we tell the computer to interpret them (much like with different-sized numbers). Things that don't seem like programs can actually act as programs!
We talked about storing data, like integers, in memory, but where does the actual code for our programs live? It turns out it is also just bytes in memory. The CPU needs to know what calculations to run, and we tell it by putting bytes in memory that the CPU interprets as machine code, even though in other situations they may represent data like numbers or an image.
And with the right sequence of magic bytes in the right place, we can make almost any piece of data in memory run as code. Consider, for example, the add program in the datarep folder of the lecture code. Its add function is defined in addf.c, and when you compile this code by running make, you get a file called addf.o that actually contains the byte representation of the add function. To look at it, you can run objdump -S addf.o in your course VM, and you'll see that the add function is encoded as four bytes (0x8d 0x04 0x37 0xc3 in hexadecimal notation).
Programs are just bytes!
We can look at the contents of addf.o using a tool called objdump. objdump -d addf.o prints two things below the <add> line: on the left, the bytes in the file in hexadecimal notation (8d 04 37 c3), and on the right, a human-readable version of what these bytes mean in computer machine language (specifically, in a language called "x86-64 assembly", which is the language my laptop's Intel processor understands).
addf.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <add>:
   0:   8d 04 37        lea    (%rdi,%rsi,1),%eax
   3:   c3              retq
        ^               ^
        | bytes in file | their human-readable meaning in x86-64 machine language
        | in hexadecimal| (not stored in the file; objdump generated this)
        | notation
What does the machine language mean?
We don't know machine language yet, and though we will touch on it briefly later in the course, you'll never need to memorize it. But to give you an intuition, lea is used here to add the two integers, and retq tells the processor to return to the calling function.
Let's focus on the bytes. When the program runs, these bytes are stored somewhere in memory. The processor, which on its own is just a dumb piece of silicon, then reads these bytes and interprets them as instructions as to what to do next.
Now let's change our implementation in addf.c and just store the same bytes directly:
const unsigned char add[] = { 0x8d, 0x04, 0x37, 0xc3 };
We're no longer writing a function in the C programming language, we're just defining an array of bytes called add. Do you think our add program will still work?
It turns out it does work! Why? Because we are manually storing the exact same bytes in memory that the compiler generates when compiling our add function into machine instructions. The processor doesn't care that we were storing an array of data there – if we tell it to go and execute these bytes, the dumb silicon goes and does as it's told!
Now we can figure out how we could add numbers using the course logo: our crucial bytes, 8d 04 37 c3, occur inside the JPEG file of the course logo. If we just tell the processor to look in the right place, it can execute these bytes. To do that, I use the addin.c program, which asks the operating system to load the file specified in its first argument into memory, and then tells the processor to look for bytes to execute at the offset specified as the second argument. If we put the right offset (10302 decimal), the processor executes 8d 04 37 c3 and adds numbers! The image decoder, meanwhile, just interprets these bytes (which I changed manually) as data and turns them into slightly discoloured pixels.
What about the party emoji code? That secret was revealed in the lecture :-)
Pointers!
We previously discussed that memory boxes can store addresses of other memory boxes, and how an address occupies 8 bytes. Where would you store such an address? Well, if you want to hold on to it, you'll want to store it in a variable itself. Where does that variable live? It better be in memory, too! In other words, the 8 bytes corresponding to the address will have a memory location of their own. We refer to such memory locations that hold addresses as pointers, because you can think of them as arrows pointing to other memory boxes.
Summary
Today, we've seen more of how a computer's memory bytes can be interpreted to represent many different kinds of data. For example, the same bytes can be interpreted as program code or as an image's pixels, and a sequence of bytes can represent characters of a string or a large number. We've also seen why addresses are incredibly important: C and C++ locate data and functions in memory by their address, and we learned some basic C syntax. We will talk more about memory representation, strings, and sequences of objects in memory next time!