
Lecture 13: Caching

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, March 15).

Measuring Actual Cache Performance

We can see the CPU's cache in action by running our ./arrayaccess program under a tool that reads information from special "performance counter" registers in the processor. The perf.sh script invokes this tool and sets it to measure the last-level cache (LLC) accesses ("loads") and misses. In our example, the L3 cache is the last-level cache.

When we run ./arrayaccess -u -i 1 10000000, we should expect a hit rate smaller than 100%, since the first access to each block triggers a cache miss. Sometimes, but not always, prefetching manages to ensure that the next block is already in the cache (which increases the hit rate); the precise numbers depend on a variety of factors, including what else is running on the computer at the same time.

For example, we may get the following output:

$ ./perf.sh ./arrayaccess -u -i 1 10000000
accessing 10000000 integers 1 times in sequential order:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...]
OK in 0.007934 sec!

 Performance counter stats for './arrayaccess -u -i 1 10000000':

           307,936      LLC-misses                #   25.67% of all LL-cache hits
         1,199,483      LLC-loads

       0.048342578 seconds time elapsed

This indicates that we experienced a roughly 74% cache hit rate (100% minus the 26% miss rate) – a rather decent result.

If we instead run the arrayaccess program with a random access pattern, the hit rate is much lower and there are more misses:

$ ./perf.sh ./arrayaccess -r -i 1 10000000
accessing 10000000 integers 1 times in random order:
[4281095, 3661082, 3488908, 9060979, 7747793, 8711155, 427716, 9760492, 9886661, 9641421, 9118652, 490027, 3368690, 3890299, 4340420, 7513926, 3770178, 5924221, 4089172, 3455736, ...]
OK in 0.152167 sec!

 Performance counter stats for './arrayaccess -r -i 1 10000000':

        19,854,838      LLC-misses                #   78.03% of all LL-cache hits
        25,443,796      LLC-loads

       0.754197032 seconds time elapsed

Here, the hit rate is only 22%, and 78% of cache accesses result in misses. As a result, the program runs 16x slower than when it accessed memory sequentially and benefited from locality of reference and prefetching.

By varying the size of the array and observing the measured miss rate, we can work out the size of the L3 cache on the computer we're running on: once the array is smaller than the cache size, the hit rate will be nearly 100%, since no cache blocks ever get evicted.
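Here is a minimal sketch of such an experiment (this is not the lecture's arrayaccess code; the array sizes, the summing loop, and the timing approach are illustrative assumptions). It times a sequential pass over arrays of increasing size; roughly at the point where the array stops fitting in the L3 cache, the time per element jumps.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Time one sequential pass over an array of n ints and return the seconds taken.
double time_pass(int* a, size_t n) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    volatile long sum = 0;  // volatile so the compiler can't optimize the loop away
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    // Sweep array sizes from 1 MiB to 64 MiB; per-element time rises sharply
    // once the array no longer fits in the last-level cache.
    for (size_t mib = 1; mib <= 64; mib *= 2) {
        size_t n = mib * 1024 * 1024 / sizeof(int);
        int* a = malloc(n * sizeof(int));
        for (size_t i = 0; i < n; i++) {
            a[i] = (int) i;
        }
        time_pass(a, n);                // warm-up pass fills the cache
        double secs = time_pass(a, n);  // measured pass
        printf("%2zu MiB: %.2f ns per element\n", mib, secs / n * 1e9);
        free(a);
    }
    return 0;
}

On a machine with, say, an 8 MB L3 cache, we would expect the per-element time to stay flat up to the 4 MiB run and then climb over the following sizes.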

Caching and File I/O

Caching exists at many layers of computer systems!

The programs diskio-slow and diskio-fast in the lecture code illustrate the huge difference caching can make to performance. Both programs write bytes to a file they create (the file is simply called data; you can see it in the lecture code directory after running these programs).

diskio-slow is a program that writes data to the computer's disk (SSD or hard disk) one byte at a time, and ensures that each byte is written to disk immediately, before the operation returns (the O_SYNC flag passed to open ensures this). It can write only a few hundred bytes per second – hardly an impressive speed, as writing a single picture (e.g., from your smartphone camera) would take several minutes if done this way!

diskio-fast, on the other hand, writes to disk via a series of caches. It easily achieves write throughputs of hundreds of megabytes per second: in fact, it writes 50 MB in about a tenth of a second on my laptop! This happens because the writes don't actually go to the computer's disk immediately. Instead, the program just writes to memory and relies on the operating system to "flush" the data out to stable storage over time, in whatever way it deems efficient. This improves performance, but it does come with a snag: if my computer loses power before the operating system gets around to putting my data on disk, the data may be lost, even though my program was under the impression that the write to the file succeeded.
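To make the contrast concrete, here is a rough sketch of the two approaches (the real diskio-slow and diskio-fast in the lecture code differ in their details; the file names and byte counts below are made up for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // "Slow" approach: O_SYNC forces every one-byte write all the way to the
    // disk before write() returns, so each iteration pays the full device latency.
    int fd = open("data-sync", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    char byte = '!';
    for (int i = 0; i < 100; i++) {
        write(fd, &byte, 1);   // even 100 of these take a noticeable fraction of a second
    }
    close(fd);

    // "Fast" approach: stdio buffers the bytes in user-space memory, and the OS
    // caches them again in RAM; the data reaches the disk later, in large batches.
    FILE* f = fopen("data-buffered", "w");
    for (int i = 0; i < 50 * 1024 * 1024; i++) {
        fputc('!', f);
    }
    fclose(f);
    return 0;
}

The essential difference is where the bytes wait: the first loop forces each byte onto the device before continuing, while the second lets the stdio library and the OS absorb the writes in memory and flush them out later.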

I/O System Calls

When programs want the OS to do I/O on their behalf, their mechanism of choice is a system call. System calls are like function calls, but they invoke OS functionality (which we'll discuss in more detail shortly). read() and write() are examples of system calls.

System calls are not cheap. They require the processor to do significant extra work compared to normal function calls. A system call also means that the program probably loses some locality of reference, and thus may have more processor cache misses after the system call returns. In practice, a system call takes 1-2µs to handle. This may seem small, but compared to a DRAM access (60ns), it's quite expensive – more than 20x the cost of a memory access. Frequent system calls are therefore one major source of poor performance in programs. In Project 3, you implement a set of tricks to avoid having to make frequent system calls!
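The classic way to avoid this cost, and the spirit of what Project 3 asks for, is batching: accumulate data in a user-space buffer and hand it to the kernel in one large system call instead of many small ones. A minimal, hypothetical sketch (the file name and buffer size are arbitrary):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char* msg = "hello, syscalls!\n";
    size_t len = strlen(msg);

    // Unbatched: one write() system call per byte, so every single byte pays
    // the 1-2 microseconds of system call overhead.
    for (size_t i = 0; i < len; i++) {
        write(fd, &msg[i], 1);
    }

    // Batched: copy the bytes into a user-space buffer first, then make a
    // single write() call for the whole buffer, paying the overhead once.
    char buf[4096];
    size_t used = 0;
    for (int rep = 0; rep < 100; rep++) {
        memcpy(buf + used, msg, len);
        used += len;
    }
    write(fd, buf, used);

    close(fd);
    return 0;
}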

Disk I/O

Input and output (I/O) on a computer must generally happen through the operating system, so that it can mediate access and ensure that only one process at a time uses the physical resources affected by the I/O (e.g., a hard disk, or your WiFi interface). This avoids chaos and helps with fair sharing of the computer's hardware. (There are some exceptions to this rule, notably memory-mapped I/O and recent fast datacenter networking, but most classic I/O goes through the operating system.)

File Descriptors

When a user-space process makes I/O system calls like read() or write(), it needs to tell the kernel what file it wants to do I/O on. This requires the kernel and the user-space process to have a shared way of referring to a file. On UNIX-like operating systems (such as macOS and Linux), this is done using file descriptors.

File descriptors are identifiers that the kernel uses to keep track of open resources (such as files) used by user-space processes. User-space processes refer to these resources using integer file descriptor (FD) numbers; in the kernel, the FD numbers index into an FD table maintained for each process, which may contain extra information like the filename, the offset into the file for the next I/O operation, or the amount of data read/written. For example, a user-space process may use the number 3 to refer to a file descriptor that the kernel knows corresponds to /home/malte/cats.txt.

To get a file descriptor number allocated, a process calls the open() syscall. open() causes the OS kernel to do permission checks, and if they pass, to allocate an FD number from the set of unused numbers for this process. The kernel sets up its metadata, and then returns the FD number to user-space. The FD number for the first file you open is usually 3, the next one 4, etc.
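In code, this looks roughly as follows (a sketch; the file name is made up, and the exact numbers returned depend on what the process already has open):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Ask the kernel to open the file; it checks permissions and, on success,
    // hands back the lowest unused FD number for this process (usually 3 here).
    int fd = open("cats.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");   // permission check failed, file doesn't exist, etc.
        return 1;
    }
    printf("the kernel gave us file descriptor %d\n", fd);

    int fd2 = open("cats.txt", O_RDONLY);  // a second open gets the next free number
    printf("opening again gives file descriptor %d\n", fd2);

    close(fd);
    close(fd2);
    return 0;
}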

Why is the first file descriptor number usually 3?

On UNIX-like operating systems such as macOS and Linux, there are some standard file descriptor numbers. FD 0 normally refers to stdin (input from the terminal), 1 refers to stdout (output to the terminal), and 2 refers to stderr (output to the terminal, for errors). You can close these standard FDs; if you then open other files, they will reuse FD numbers 0 through 2, but your program will no longer be able to interact with the terminal.
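For example, a program can write to the terminal through FDs 1 and 2 without ever calling open() (a tiny sketch):

#include <unistd.h>

int main(void) {
    // FD 1 (stdout) and FD 2 (stderr) are already open when main() starts running.
    write(1, "normal output\n", 14);
    write(2, "error output\n", 13);
    return 0;
}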

Now that user-space has the FD number, it uses this number as a handle to pass into read() and write(). The full API for the read system call is: ssize_t read(int fd, void* buf, size_t count). The first argument indicates the FD to work with, the second is a pointer to the buffer (memory region) that the kernel is supposed to put the data read into, and the third is the number of bytes to read. read() returns the number of bytes actually read (0 if there are no more bytes in the file, or -1 if there was an error). write() has an analogous API, except that the kernel reads from the buffer pointed to and copies the data out.
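A sketch of how a program might use this API to copy a file's contents to the terminal, chunk by chunk (the file name and buffer size are arbitrary choices):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("cats.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[128];
    ssize_t n;
    // Each read() continues from the kernel's current offset for this FD, so the
    // loop walks through the file until read() returns 0 (end of file).
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, (size_t) n, stdout);  // use only the bytes actually read
    }
    if (n < 0) {
        perror("read");  // -1 indicates an error
    }

    close(fd);
    return 0;
}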

One important aspect that is not part of the API of read() or write() is the current I/O offset into the file (sometimes referred to as the "read-write head" in man pages). In other words, when a user-space process calls read(), it fetches data from whatever offset the kernel currently knows for this FD. If the offset is 24, and read() wants to read 10 bytes, the kernel copies bytes 24-33 into the user-space buffer provided as an argument to the system call, and then sets the kernel offset for the FD to 34.

A user-space process can influence the kernel's offset via the lseek() system call, but is generally expected to remember on its own where in the file the kernel is at. In Project 3, you'll have to maintain such metadata for your caching in user-space memory. In particular, when reading data into the cache or writing cached data into a file, you'll need to be mindful of the current offset that the I/O will happen at.
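For illustration, here is a small sketch of lseek() in action, mirroring the offset example above (the file name is made up, and we assume the file is at least 34 bytes long):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("cats.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // Move the kernel's offset for this FD to byte 24, then read 10 bytes:
    // this fetches bytes 24-33 and leaves the kernel's offset at 34.
    char buf[10];
    lseek(fd, 24, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof(buf));

    // lseek() returns the resulting offset, so seeking by 0 from the current
    // position asks "where is the kernel's offset right now?" without moving it.
    off_t pos = lseek(fd, 0, SEEK_CUR);
    printf("read %zd bytes; the kernel's offset is now %ld\n", n, (long) pos);

    close(fd);
    return 0;
}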

Summary

Today, we started by looking at caching in action and measured the hit and miss rates of real-world processor caches. We also talked about how the concept of caching shows up again between applications and files on disk, where I/O libraries cache the contents of on-disk files in RAM. There is a lot more to say about caches, and as you complete Project 3, you will build your own cache!