Lecture 2: Systems Programming
🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, January 31).
Context: The CS300 Journey
After the first lecture, you may have wondered why understanding the details of how your computer works is so crucial. How does this understanding affect your goals, such as becoming a software engineer in industry or an academic computer science researcher? The answer is, beyond a natural thirst for knowledge, that this kind of understanding will make you a much better, more versatile, and more valuable computer scientist and engineer.
Some reasons why systems programming and understanding the machine matters:
- There are billions of lines of C and C++ code in the world. If you work as a software engineer, you will sooner or later have to deal with these languages. For example, operating systems, web browsers, machine learning toolkits like TensorFlow and PyTorch, and high-performance infrastructure used by companies like Facebook and Google are written in C++. Many companies yearn to hire engineers who know these languages!
- The concepts we're learning here are fundamental and ultimately impact how other languages work. Even if you're programming in Go or Java, you will understand these languages a lot better if you know the underlying infrastructure.
- When you work with concurrent and distributed systems (which most moderately complex applications we use today, including nearly all applications on your phone, are!), mysterious bugs and seemingly impossible behavior will make a lot more sense (and be easier to debug!) if you know how the computer actually executes these programs.
Interpreting Bytes In Memory
Why are we covering this?
The only place where a computer can store information is memory. It's up to the systems software and the programs that the computer runs to decide what these bytes actually mean. They could be program code, data (integers, strings, images, etc.), or "meta-data" used to build more complex data structures from simple memory boxes. Understanding how complex programs boil down to bytes will help you debug your program, and will make you appreciate why they behave the way they do.
You might observe that our boxes can only hold numbers up to 255, but we need to store numbers larger than that. To achieve this, the computer interprets multiple adjacent boxes as a single number, using the concept of positional notation, albeit in the binary system. Two boxes together consist of a "low" byte (with a value from 0 to 2^8 - 1) and a "high" byte whose value counts multiples of 2^8, so the pair can hold values up to (2^8 × 2^8) - 1 = 2^16 - 1. This way, we can represent numbers between 0 and 2^16 - 1, which is 65,535. The data type of two such boxes is called a short; 4 adjacent bytes are an int and, when interpreted as a non-negative (unsigned) number, represent an integer between 0 and 2^32 - 1 (about 4 billion).
So, we can represent integers up to about 4 billion using 4 bytes each. But what if we needed to store the address of another post-office box (i.e., memory location) – for example, because we're constructing a data structure like a linked list? On my laptop, there are about 8 billion boxes, so we must be able to represent and store addresses up to 8,589,934,592 in boxes adjacent to those four storing the number. To achieve this, we need more than 4 bytes of address.
A common question at this point in the course is how the computer knows which bytes in memory belong together, and what sort of data (e.g., an int, long, or string/image) a given byte is part of. The answer is that the computer does not know, and that the same byte can mean different things depending on how a program chooses to interpret its numeric value. At the end of this section, you'll understand this notion of using the same bytes in different ways.
Pointers!
A "post-office box" in computer memory is identified by an address. On a computer with M bytes of memory, there are M such boxes, each having as its address a number between 0 and M−1. My laptop has 8 GB (gibibytes) of memory, so M = 8 × 2^30 = 2^33 = 8,589,934,592 boxes (and possible memory addresses)!
Memory boxes can store addresses of other memory boxes! How many bytes might be needed to store an address? This, of course, depends on the amount of memory the computer has: 4 bytes (which can represent numbers up to about 4 billion) are insufficient for a computer with 8 billion bytes of memory (approximately 8 GB of RAM). Indeed, this came to bite computing in the mid-2000s, when computers started having 4 GB of memory, an amount that was thought to be ludicrously large in the 1970s and 1980s. Since hardware designers find it convenient to increase the size of types by powers of two, modern computers use 8 bytes to store an address. 8 bytes can represent numbers up to about 18 quintillion (2^64 - 1), so they should be enough for existing computers and those of the foreseeable future!
Where would you store an address? Well, if you want to hold on to it, you'll want to store it in a variable itself. Where does that variable live? It better be in memory, too! In other words, the 8 bytes corresponding to the address will have a memory location of their own. We refer to such memory locations that hold stored addresses as pointers, because you can think of them as arrows pointing to other memory boxes.
A Simple C Program
C Programming Resources
We'll go through the basics of C (and later C++) in lectures, but in an "immersive" way: we'll come across language features as we are trying to understand fundamental concepts of systems. Check out our C/C++ Primers if you're looking for step-by-step language tutorials or links to detailed language references.
We will now build an initial understanding of the C programming language. Start with add.c, which is a C program that serves a simple purpose: it reads numbers from the command line and adds them. Let's dissect the code, and you'll get immersed in the basic structure of a C program, as well as seeing the crucial add() function that we already used to explore how programs are just bytes in memory.
#include <stdio.h>  // <== import standard I/O (fprintf, printf)
#include <stdlib.h> // <== import standard library (strtol, exit)

int main(int argc, char* argv[]) { // <== starting point of our program
    if (argc <= 2) {
        fprintf(stderr, "Usage: add A B\n\
Prints A + B.\n"); // <== print error message if arguments are missing. "\n" is a newline character!
        exit(1);
    }

    int a = strtol(argv[1], 0, 0); // <== convert first argument (string) to integer
    int b = strtol(argv[2], 0, 0); // <== same for second argument
    printf("%d + %d = %d\n", a, b, add(a, b)); // <== invoke add() function, print result to console
}
What's going on here? Every C program's execution starts with the main() function. This is one of the things that the C language standard, a long, technical document, prescribes. Our program checks if the user provided enough arguments and prints an error if not; otherwise it converts the first two arguments from strings to integers using strtol() (a standard library function), calls add() on them, and prints the result.
How do the argc and argv arguments to main() get set, and how is main() called?
Your computer's operating system (OS) is responsible for starting up the program, and does some prep. The command line program you're using (this is called a "shell") makes sure to put the argument count (argc) and argument values in boxes at well-known memory addresses before the OS starts your program.
Let's try to compile this program.
$ gcc -o add add.c
There's an error, because we haven't actually provided an add() function. Let's write one.
The program now works, and it adds numbers. Yay! But we can also define our add function in another file, something that often happens in larger programs. Let's use the add function in addf.c instead.
Since the compiler looks at each source file in isolation, we now need to tell it that there is an add function in some other file, and what arguments it takes. Let's add a line to add.c that specifies the name and arguments of add(), but does not provide an implementation. This is called a declaration: we're telling the compiler "there will be a function called add(), and you'll find out about its implementation later". All functions and variables in C have to be declared before you first use them, but they do not have to be defined at that point. We'll understand the exact difference shortly.
Let's try compiling this version of add.c. A different error! Why? Because we haven't told the compiler to also look at the addf.c file, which actually has our implementation of add().
To do that, let's pass two files to the compiler.
$ gcc -o add add.c addf.c
It works! Great. The compiler first compiles add.c into a file called add.o, and then compiles addf.c into a file called addf.o. These files don't contain human-readable text, but binary bytes that the computer's CPU (central processing unit) can execute. Finally, the two files are combined into the add program that we run.
Addon: Programs are still just bytes in memory!
Now let's have some fun and change our implementation in addf.c and just directly store the four bytes that we know make the processor (on an Intel computer) add numbers together:
const unsigned char add[] = { 0x8d, 0x04, 0x37, 0xc3 };
We're no longer writing a function in the C programming language, we're just defining an array of bytes called add. Do you think our add program will still work?
It turns out it does work! Why? Because we are manually storing the exact same bytes in memory that the compiler generates when compiling our add function into machine instructions. The processor doesn't care that we were storing an array of data there: if we tell it to go and execute these bytes, the dumb silicon goes and does as it's told!
Summary
Today, we've seen more of how a computer's memory bytes can be interpreted to represent many different kinds of data. For example, the same bytes can be interpreted as program code or as an image's pixels, and a sequence of bytes can represent characters of a string or a large number. We've also seen why addresses are incredibly important: C and C++ locate data and functions in memory by their address. Along the way, we learned some basic C syntax. We will talk more about memory representation, strings, and sequences of objects in memory next time!