Lecture 1: Overview and machine organization

» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 6pm Monday, January 27; optional this time only).

Course overview

Welcome to CS 131 / CSCI 1310: Fundamentals of Computer Systems!

This is a new introductory systems course, developed for this year by these folks.

You can find all details on policies in the missive, and the plan for the semester, including assignment due dates, on the schedule page. The first labs (Lab 0 and 1) are out today, and Project 1 will be released later today.

We're also trying a new idea: post-lecture quizzes, which replace clicker questions, and allow you to review the lecture material. The first quiz for this lecture is here. These quizzes are graded for completion and are due at 6pm the day before the next lecture.

We will look at topics arranged in four main blocks during the course, building up your understanding across the whole systems stack:

  1. Computer Systems Basics: programming in systems programming languages (C and C++); memory; how code executes; and why machine and memory organization mean that some code runs much faster than other code that implements the same algorithm.
    Project: you will build your own versions of standard library functions and a vector data structure (strvec), and you will develop your own debugging memory allocator, helping you understand common memory bugs in systems programming languages (DMalloc).

  2. Fundamentals of Operating Systems: how your computer runs even buggy or incorrect programs without crashing; how it keeps your secrets in one program safe from others; and how an operating system like Windows, Mac OS X, or Linux works.
    Project: you will implement memory isolation via virtual memory in your own toy operating system (WeensyOS)!

  3. Concurrency and Parallel Programming: how your computer can run multiple programs at the same time; how you can share memory even if running programs on multiple processors; why programming with concurrency is very difficult; and how concurrency is behind almost all major web services today.
    Project: you will implement the server-side component of a peer payment application called "Vunmo", and make it handle requests from many users.

  4. Distributed Systems: what happens when you use more than one computer and have computers communicate over networks like the Internet; how hundreds of computers can work together to solve problems far too big for one computer; and how distributed programs can survive the failure of participating computers.
    Project: you will implement a sharded key-value storage service that can (in principle) scale over hundreds or thousands of machines.

The first block, Computer Systems Basics, begins today. It's about how we represent data and code in terms the computer can understand. But today's lecture is also a teaser lecture, so we'll see the material raise questions that you cannot yet answer (but you will, hopefully, at the end of the course!).

Machine organization

Why are we covering this?
In real-world computing, processor time and memory cost money. Companies like Facebook, Google, or Airbnb spend billions of dollars a year on their computing infrastructure, and constantly buy more servers. Making good use of them demands more than just fast algorithms – a good organization of data in memory can make an algorithm run much faster than a poor one, and save substantial money! To understand why certain data structures are better suited to a task than others, we need to look into how the computer and, in particular, its memory is organized.

Your computer, in terms of physics, is just some materials: metal, plastic, glass, and silicon. These materials on their own are incredibly dumb, and it's up to us to orchestrate them to achieve a task. Usually, that task is to run a program that implements some abstract idea – for example, the idea of a game, or of an algorithm.

There is an incredible amount of systems infrastructure in our computers to make sure that when you write program code to realize your idea, the right pixels appear on the screen, the right noises come from the speakers, and the right calculations are done. Part of the theme of this course is to better understand how that "magic" infrastructure works.

Why try to understand it, you might wonder? Good question. One answer is that the best programmers are able to operate across all levels of "the stack", seamlessly diving into the details under the hood to understand why a program fails, why it performs as it does, or why a bug occurs. Another reason is that once we pull back the curtain on how systems work, it turns out that even though the details of specific systems (e.g., Windows, OS X, Linux) vary, a surprisingly small number of fundamental concepts explain why computers are as powerful and useful as they are today.

Let's start with the key components of a computer:

These components need to work together to achieve things, and we need to understand them all in order to understand why code and systems behave the way they do.

Today, we will particularly focus on memory. A computer's memory is like a vast set of mailboxes, such as you might find at a post office. Each box can hold one of 256 numbers: 0 through 255. This is called a byte (a byte is a number between 0 and 255, and corresponds to 8 bits, each of which is a 0 or a 1).

A "post-office box" in computer memory is identified by an address. On a computer with M bytes of memory, there are M such boxes, each having as its address a number between 0 and M−1. My laptop has 8 GB (gibibytes) of memory, so M = 8×2^30 = 2^33 = 8,589,934,592 boxes (and possible memory addresses)!

       0     1     2     3     4                         2^33 - 1    <- addresses
    +-----+-----+-----+-----+-----+--     --+-----+-----+-----+
    |     |     |     |     |     |   ...   |     |     |     |      <- values
    +-----+-----+-----+-----+-----+--     --+-----+-----+-----+

Powers of ten and powers of two. Computers are organized all around the number two and powers of two. The electronics of digital computers are based on the bit, the smallest unit of storage, which is a base-two digit: either 0 or 1. More complicated objects are represented by collections of bits: for example, a byte is 8 bits. Using binary bits has many advantages: for example, error correction is much easier if presence or absence of electric current just represents "on" or "off". This choice influences many layers of hardware and system design. Memory chips, for example, have capacities based on large powers of two, such as 2^30 bytes. Since 2^10 = 1024 is pretty close to 1,000, 2^20 = 1,048,576 is pretty close to a million, and 2^30 = 1,073,741,824 is pretty close to a billion, it’s common to refer to 2^30 bytes of memory as "a gigabyte," even though that actually means 10^9 = 1,000,000,000 bytes (SI units are base 10). But when trying to be precise, it's better to use terms that explicitly signal the use of powers of two, such as gibibyte: the "-bi-" component means “binary.”
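The arithmetic above can be checked with a few lines of C++. This is only an illustrative sketch: the function names gibibytes, gigabytes, and highest_address are made up for this example and do not come from the lecture code.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes in n gibibytes (binary unit: 2^30 bytes each).
constexpr std::uint64_t gibibytes(std::uint64_t n) {
    return n << 30;  // shifting left by 30 bits multiplies by 2^30
}

// Number of bytes in n gigabytes (SI unit: 10^9 bytes each).
constexpr std::uint64_t gigabytes(std::uint64_t n) {
    return n * 1000000000ULL;
}

// Highest valid address on a machine with `bytes` bytes of memory.
// Addresses run from 0 to bytes - 1, so there must be at least one byte.
std::uint64_t highest_address(std::uint64_t bytes) {
    assert(bytes > 0);
    return bytes - 1;
}
```

With these definitions, gibibytes(8) is 8,589,934,592 (that is, 2^33), while gigabytes(8) is only 8,000,000,000. The two units differ by about 7%.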

All variables and data structures we use in our programs, and indeed all code that runs on the computer, need to be stored in these byte-sized memory boxes. How we lay them out can have major consequences for safety and performance!

An example: Quicksort

We will illustrate this with an example of one algorithm running over data laid out in memory in different ways.

Most if not all of you will be familiar with the idea of a sorting algorithm, which takes a sequence of integers and returns another sequence containing the same integers, in a defined order (e.g., ascending). There are many sorting algorithms, and on a theoretical level we are interested in their computational complexity.

The QuickSort algorithm is an elegant algorithm that applies the "Divide and Conquer" approach to achieve good average runtime complexity (O(N log N)). It takes the list, picks a "pivot" element, then builds two lists of elements that are smaller than the pivot (the left list) or greater than or equal to the pivot (the right list). QuickSort then recursively calls itself on the left and right list, recursing until it hits an empty list as its base case.
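As a rough illustration of the description above, here is a minimal copying QuickSort over a C++ linked list (std::list). It is a sketch written for these notes, not the code in testqs0.cc; the choice of the first element as the pivot is an assumption made for simplicity.

```cpp
#include <cassert>
#include <list>

// Copying QuickSort: returns a new, sorted list and leaves `input` untouched.
std::list<int> quicksort(const std::list<int>& input) {
    if (input.empty()) {
        return std::list<int>();  // base case: an empty list is already sorted
    }

    // Assumption for this sketch: use the first element as the pivot.
    int pivot = input.front();

    // Copy every element except the pivot into the left or right list.
    std::list<int> left, right;
    auto it = input.begin();
    ++it;  // skip the pivot itself
    for (; it != input.end(); ++it) {
        if (*it < pivot) {
            left.push_back(*it);   // smaller than the pivot
        } else {
            right.push_back(*it);  // greater than or equal to the pivot
        }
    }

    // Recurse on both halves, then build sorted-left + pivot + sorted-right.
    std::list<int> sorted = quicksort(left);
    sorted.push_back(pivot);
    std::list<int> sorted_right = quicksort(right);
    sorted.insert(sorted.end(), sorted_right.begin(), sorted_right.end());

    assert(sorted.size() == input.size());  // same integers, new order
    return sorted;
}
```

Note how many temporary lists this version creates: every level of the recursion copies each element at least once.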

There are elegant implementations of this algorithm in the lecture code in Pyret (qs.arr), OCaml (qs.ml), and Java (qs.java). We'll compare the performance and memory use of these implementations to several implementations in C++, one of the systems programming languages we'll use in this course.

The point here is that the same algorithm, implemented in the same way in different languages, can take vastly different time to run. These differences get abstracted away as "constant factors" when we talk about Big-O complexity, but they can matter a lot to the real-world cost!

Memory Layout of a Linked List
Why are we covering this?
A linked list is a data structure that helps explain memory well: it requires storing elements and addressing the next element. Linked lists are very common in functional programming languages (e.g., OCaml's and Pyret's lists are linked lists) and have nice properties (e.g., their size is dynamic, so you can add and remove elements), but their practical performance is very much affected by the memory layout.

All the previous non-C++ implementations, and the first C++ one we're looking at (testqs0.cc), use some form of linked list data structure. Conceptually, in a linked list each node consists of the element (an integer in our case) and an arrow to the next element in the list.

But how would we lay out such a list in memory? One idea is to put the numbers in the nodes into our little byte-sized boxes, and to store the address of the box containing the next number in the list in the adjacent box. So, for list [1, 80000, 43, 9, 67, 3, 3, 287, 1], we might store 1 in the box at address 6 and the next number in a box at address 40.

You might observe that each box can only hold a number up to 255, but we need to store numbers larger than that. To achieve this, the computer interprets multiple adjacent boxes as a single number, using the concept of positional notation, albeit in the binary system. Two boxes together consist of a "low" byte (worth 0 to 2^8 - 1) and a "high" byte (worth a multiple of 2^8, so that together they count up to (2^8 × 2^8) - 1 = 2^16 - 1). This way, we can represent numbers between 0 and 2^16 - 1, which is 65,535. The data type of two such boxes is called a short; 4 adjacent bytes are an int and, read as an unsigned number, represent an integer between 0 and 2^32 - 1 (about 4 billion).
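The positional-notation idea can be sketched in C++ as follows. The helper name combine_boxes is hypothetical, and the sketch assumes the first box holds the least significant byte (the "little-endian" order used by x86-64 processors).

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Interpret up to four adjacent one-byte boxes as a single unsigned number.
// Assumption: the first box is the "low" byte, the next one is worth
// multiples of 2^8, the next multiples of 2^16, and so on.
std::uint32_t combine_boxes(std::initializer_list<std::uint8_t> boxes) {
    assert(boxes.size() <= 4);  // the result must fit in 32 bits
    std::uint32_t value = 0;
    unsigned shift = 0;
    for (std::uint8_t box : boxes) {
        value |= static_cast<std::uint32_t>(box) << shift;
        shift += 8;  // each further box is worth 2^8 times more
    }
    return value;
}
```

For example, the list element 287 fits in two boxes holding 31 and 1 (since 1 × 256 + 31 = 287), while 80000 needs more than two boxes: combine_boxes({0x80, 0x38, 0x01, 0x00}) yields 80000.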

So, we can represent the integers in our list using 4 bytes each. But what about the arrows pointing to the next element? These are stored as addresses of the next number's box. On my laptop, there are about 8 billion boxes, so we must be able to represent and store addresses up to 8,589,934,591 in boxes adjacent to those four storing the number. To achieve this, we need more than 4 bytes per address. Practically all computers today use 8-byte (64-bit) addresses, and so we'll use 8 bytes to store the address.

How much space does our linked list occupy in memory? Each node consists of 4 + 8 bytes (number + address of next number), and it turns out that the C++ linked list also has back pointers (another 8 bytes) and that alignment (a concept we'll learn more about) requires the 4-byte number to be padded to 8 bytes. This brings us to 24 bytes per element, and since the OS also needs to keep track of the memory we allocate for each list node, the real amount of memory used for each list node is about 32-40 bytes. That means that a list of one million integers takes 32-40 million bytes to store in memory with this representation, even though the raw 1 million integers only make up 4 million bytes of that.
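To make the per-node accounting concrete, here is a hypothetical node layout in C++. The struct Node is an assumption for illustration only; the real std::list node type is internal to the standard library and may differ. On a typical 64-bit machine, sizeof(Node) would be 24: two 8-byte pointers plus a 4-byte int padded to 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical doubly linked list node (not the actual std::list internals).
struct Node {
    Node* next;  // address of the next node (8 bytes on a 64-bit machine)
    Node* prev;  // back pointer (another 8 bytes)
    int value;   // the number itself (4 bytes, usually padded to 8)
};

// Bytes taken up by `count` nodes, ignoring the allocator's bookkeeping.
std::size_t list_bytes(std::size_t count) {
    // Guard against overflow in the multiplication below.
    assert(count <= std::numeric_limits<std::size_t>::max() / sizeof(Node));
    return count * sizeof(Node);
}

// Bytes taken up by the raw integers alone.
std::size_t raw_bytes(std::size_t count) {
    return count * sizeof(int);
}
```

Under those assumptions, list_bytes(1000000) comes to 24 million bytes before any allocator overhead, against raw_bytes(1000000) of 4 million.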

In the lecture code folder, type make. Compare the runtime of ./testqs0 1000000 and ./testqs2 1000000. The latter uses a different data structure, a vector, which lays out the list nodes next to each other in adjacent boxes in memory. If previously the first node was at address 6 and the next at 40, they would now be adjacent at 6 and 10 (as each integer is 4 bytes wide, and no arrows are required). This version runs 5x faster than that based on the linked list!
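The claim that vector elements sit in adjacent boxes can be checked directly by comparing addresses. The helper byte_gap below is a made-up name for this sketch and is not part of the lecture code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Distance in bytes between the addresses of elements i and i + 1.
std::ptrdiff_t byte_gap(const std::vector<int>& v, std::size_t i) {
    assert(i + 1 < v.size());  // both elements must exist
    // Viewing the addresses as byte (char) pointers makes the subtraction
    // count individual memory boxes rather than whole ints.
    const char* first = reinterpret_cast<const char*>(&v[i]);
    const char* second = reinterpret_cast<const char*>(&v[i + 1]);
    return second - first;
}
```

For a std::vector<int>, the gap is always sizeof(int) (4 on most machines), because the standard guarantees that a vector stores its elements contiguously. No such guarantee exists for the nodes of a linked list.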

The reason for this is that even though both implement the same algorithm, the vector-based quicksort uses substantially less memory, and therefore runs faster. In other words, the memory layout of your data structure matters to performance!

The Cost of Copies
Why are we covering this?
One important story of systems programming is that moving data around in memory is not free. Higher-level languages like OCaml, Pyret, or Java often hide from you as the programmer the fact that a program copies data. The experiments in this section will show you how much it matters whether you copy data or not; later in the course, we'll learn how to avoid those copies.

In addition to the memory efficiency of your data structure, how much data moves around in memory also matters for performance. The implementations of QuickSort that we looked at so far all create new lists and copy elements over when they partition the elements into the left and right lists. For large lists, creating new temporary lists by copying values is quite expensive, as each copy requires a few CPU cycles.

Consider testqs1.cc. This implementation still uses a linked list, but instead of copying elements around, it splices them from one list into another by rewriting just the arrows around them. Remember that each arrow is just an 8-byte address, so splicing a node from one list into another requires rewriting three arrows (the node's old list's incoming arrow, the new list's incoming arrow, and the node's own outgoing arrow).
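A partition step based on splicing might look roughly like the sketch below, which uses the standard std::list::splice member function. The function name partition_by_splicing is hypothetical and this is not the code from testqs1.cc. The input is taken by value so that the caller decides whether to hand over its list (with std::move) or a copy.

```cpp
#include <cassert>
#include <list>
#include <utility>

// Split `input` into (elements < pivot, elements >= pivot) by relinking
// nodes instead of copying the integers they hold.
std::pair<std::list<int>, std::list<int>>
partition_by_splicing(std::list<int> input, int pivot) {
    std::list<int> left, right;
    while (!input.empty()) {
        std::list<int>& target = (input.front() < pivot) ? left : right;
        // Move the first node of `input` to the end of `target`;
        // only the arrows change, the element itself stays where it is.
        target.splice(target.end(), input, input.begin());
    }
    assert(input.empty());  // "destructive": the original list is used up
    return std::make_pair(std::move(left), std::move(right));
}
```

For instance, partitioning the list [5, 1, 9, 3] around the pivot 4 gives [1, 3] and [5, 9], with each node moved rather than copied.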

Comparing the runtime of ./testqs0 1000000 and ./testqs1 1000000, we find that the list-splicing approach is about 3x faster than the basic linked-list implementation. And this "destructive" approach (called so because it destroys the original list in the process) can also be applied to the vector-based implementation: see testqs4.cc, which runs about another 4x faster, combining the memory savings of using a vector and the copy savings of destroying the original list while the sort runs.
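For the vector case, a destructive QuickSort can rearrange elements within the vector itself by swapping. The following is a minimal sketch of that idea, assuming the first element of each range as the pivot; it is not the code from testqs4.cc, and both function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sort the half-open range values[lo, hi) in place, without temporary lists.
void quicksort_in_place(std::vector<int>& values, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= values.size());
    if (hi - lo < 2) {
        return;  // zero or one element: already sorted
    }
    int pivot = values[lo];  // assumption: first element is the pivot

    // Invariant: values[lo + 1, boundary) < pivot and
    // values[boundary, i) >= pivot.
    std::size_t boundary = lo + 1;
    for (std::size_t i = lo + 1; i < hi; ++i) {
        if (values[i] < pivot) {
            std::swap(values[i], values[boundary]);
            ++boundary;
        }
    }

    // Put the pivot between the two halves, then sort each half.
    std::swap(values[lo], values[boundary - 1]);
    quicksort_in_place(values, lo, boundary - 1);
    quicksort_in_place(values, boundary, hi);
}

// Convenience wrapper: sorts the vector it is given and hands it back.
std::vector<int> quicksorted(std::vector<int> values) {
    quicksort_in_place(values, 0, values.size());
    return values;
}
```

One design caveat of this simple pivot choice: on input that is already sorted, the recursion becomes as deep as the vector is long, which is one reason production sorts pick pivots more carefully.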

Now we know that the way in which a data structure allocates, copies, and frees memory can greatly affect its performance.

As a final step, compare the runtime of ./testqs4 1000000 to that of ./testqs5 1000000. The latter uses the C++ standard library's sorting routine (std::sort), which is highly optimized. But it seems like our vector-based, destructive (copy-free) version in testqs4.cc actually comes close in terms of performance: both the standard library version and ours are about 20x faster than the basic implementation in testqs0.cc, which was already faster than Java, OCaml, and Pyret.
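Calling the standard library sort on a vector takes a single line. The wrapper below is a hypothetical helper for illustration, not the contents of testqs5.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort a vector with the standard library and hand it back.
std::vector<int> sorted_with_std_sort(std::vector<int> values) {
    std::sort(values.begin(), values.end());  // sorts in place, ascending
    assert(std::is_sorted(values.begin(), values.end()));
    return values;
}
```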

Code as data: adding numbers

Why are we covering this?
The only place where a computer can store information is memory. This example illustrates that even the program itself ultimately consists of bytes stored in memory boxes; and that the meaning of a given set of boxes depends on how we tell the computer to interpret them (much like with different-sized numbers). Things that don't seem like programs can actually act as programs!

We talked about storing data – like a list of integers to be sorted – in memory, but where does the actual code for QuickSort or other programs live? It turns out it is also just bytes in memory. The CPU, when it decides what instructions to run, interprets these bytes as machine code, even though in other situations they may represent data like a linked list or our course logo.

And with the right sequence of magic bytes in the right place, we can make almost any piece of data in memory run as code. For example, the course logo (logo.jpg) contains the bytes for an add function (0x8d 0x04 0x37 0xc3 in hexadecimal notation; run objdump -S addf.o to see those bytes in an actual compiled add function), and if we run ./addin logo.jpg 10302 NUM1 NUM2, we can use those bytes to add numbers!
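To see that the add function really is "just bytes", the sketch below searches a buffer of arbitrary data for that four-byte sequence. On x86-64, these bytes decode to "lea (%rdi,%rsi,1),%eax; ret", which adds the two arguments and returns the sum. The function find_add_bytes is a made-up helper; it only locates the bytes and does not execute them, and it is not how the addin program itself is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Machine code bytes of the add function quoted in the text (x86-64).
const unsigned char add_code[] = {0x8d, 0x04, 0x37, 0xc3};

// Return the offset of the first occurrence of add_code in `data`,
// or -1 if the byte sequence does not appear.
long find_add_bytes(const std::vector<unsigned char>& data) {
    auto it = std::search(data.begin(), data.end(),
                          std::begin(add_code), std::end(add_code));
    if (it == data.end()) {
        return -1;
    }
    std::size_t offset = static_cast<std::size_t>(it - data.begin());
    assert(offset + sizeof(add_code) <= data.size());  // match lies inside data
    return static_cast<long>(offset);
}
```

Given the bytes of an image file, a search like this would report where in the file the sequence sits. Whether those bytes count as picture data or as an add function depends only on how the computer is told to interpret them.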

We will look into these concepts of code as data in more detail next time.