Lecture 13: Caching
🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, March 15).
Measuring Actual Cache Performance
We can see the CPU's cache in action by running our ./arrayaccess
program under a tool that reads
information from special "performance counter" registers in the processor. The perf.sh
script invokes this tool and sets it to measure the last-level cache (LLC) accesses ("loads") and
misses. In our example, the L3 cache is the last-level cache.
When we run ./arrayaccess -u -i 1 10000000, we should expect a hit rate smaller than 100%,
since the first access to each block triggers a cache miss. Sometimes, but not always, prefetching manages to
ensure that the next block is already in the cache (which increases the hit rate); the precise numbers depend
on a variety of factors, including what else is running on the computer at the same time.
For example, we may get the following output:
$ ./perf.sh ./arrayaccess -u -i 1 10000000
accessing 10000000 integers 1 times in sequential order:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...]
OK in 0.007934 sec!

 Performance counter stats for './arrayaccess -u -i 1 10000000':

           307,936      LLC-misses      #  25.67% of all LL-cache hits
         1,199,483      LLC-loads

       0.048342578 seconds time elapsed

This indicates that we experienced a roughly 74% cache hit rate (100% minus the 26% miss rate) – a rather decent result.
If we instead run the arrayaccess
program with a random access pattern, the hit rate is much
lower and there are more misses:
$ ./perf.sh ./arrayaccess -r -i 1 10000000
accessing 10000000 integers 1 times in random order:
[4281095, 3661082, 3488908, 9060979, 7747793, 8711155, 427716, 9760492, 9886661, 9641421, 9118652, 490027, 3368690, 3890299, 4340420, 7513926, 3770178, 5924221, 4089172, 3455736, ...]
OK in 0.152167 sec!

 Performance counter stats for './arrayaccess -r -i 1 10000000':

        19,854,838      LLC-misses      #  78.03% of all LL-cache hits
        25,443,796      LLC-loads

       0.754197032 seconds time elapsed

Here, the hit rate is only 22%, and 78% of cache accesses result in misses. As a result, the program runs 16x slower than when it accessed memory sequentially and benefited from locality of reference and prefetching.
By varying the size of the array and observing the measured miss rate, we can work out the size of the L3 cache on the computer we're running on: once the array is smaller than the cache size, the hit rate will be nearly 100%, since no cache blocks ever get evicted.
Caching and File I/O
Caching exists at many layers of computer systems!
The programs diskio-slow
and diskio-fast
in the lecture code illustrate the huge difference
caching can make to performance. Both programs write bytes to a file they create (the file is simply called data; you can see it in the lecture code directory after running these programs).
diskio-slow
is a program that writes data to the computer's disk (SSD or
harddisk) one byte at a time, and ensures that the byte is written to disk immediately and before the operation
returns (the O_SYNC
flag passed to open
ensures this). It can write a few hundred bytes per
second – hardly an impressive speed, as writing a single picture (e.g., from your smartphone camera) would take
several minutes if done this way!
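In essence, diskio-slow does something like the following sketch (a simplification, not the exact lecture code; the byte value and count are assumptions made for illustration):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // O_SYNC forces every write() to reach the disk before it returns.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0) {
        return 1;
    }
    char byte = '6';
    for (int i = 0; i < 5120; ++i) {
        // One system call AND one full round-trip to the disk per byte.
        if (write(fd, &byte, 1) != 1) {
            return 1;
        }
    }
    close(fd);
    return 0;
}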
diskio-fast, on the other hand, writes to disk via a series of caches. It easily achieves write throughputs
of hundreds of megabytes per second: in fact, it writes 50 MB in about a tenth of a second on my laptop! This happens
because these writes don't actually go to the computer's disk immediately. Instead, the program just writes to memory
and relies on the operating system to "flush" the data out to stable storage over time in a way that it deems
efficient. This improves performance, but it does come with a snag: if my computer loses power before the operating
system gets around to putting my data on disk, it may get lost, even though my program was under the impression that the
write to the file succeeded.
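A rough sketch of the diskio-fast approach (again a simplification, with an assumed chunk size): without O_SYNC, write() returns as soon as the data is in the kernel's in-memory buffer cache, and the OS flushes it to disk later.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    // No O_SYNC: writes land in the OS's buffer cache and are flushed later.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        return 1;
    }
    char buf[4096];
    memset(buf, '6', sizeof(buf));
    // ~50 MB in 4 KiB chunks: 12,800 write() calls instead of ~50 million.
    for (int i = 0; i < 12800; ++i) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
            return 1;
        }
    }
    close(fd);
    return 0;
}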
I/O System Calls
When programs want the OS to do I/O on their behalf, their mechanism of choice is a system call. System
calls are like function calls, but they invoke OS functionality (which we'll discuss in more detail shortly).
read()
and write()
are examples of system calls.
System calls are not cheap. They require the processor to do significant extra work compared to normal function calls. A system call also means that the program probably loses some locality of reference, and thus may have more processor cache misses after the system call returns. In practice, a system call takes 1-2µs to handle. This may seem small, but compared to a DRAM access (60ns), it's quite expensive – more than 20x the cost of a memory access. Frequent system calls are therefore one major source of poor performance in programs. In Project 3, you implement a set of tricks to avoid having to make frequent system calls!
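To see why this adds up, here is a back-of-the-envelope sketch (the 1.5µs per-call cost is an assumed figure within the 1-2µs range above, not a measurement):

#include <stdio.h>

int main(void) {
    const double syscall_cost_us = 1.5;       // assumed cost per system call
    const long file_size = 10 * 1000 * 1000;  // a hypothetical 10 MB file

    // Reading one byte per read(): one system call per byte.
    double byte_at_a_time = file_size * syscall_cost_us / 1e6;
    // Reading 64 KiB per read(): ~153 system calls in total.
    double chunked = ((file_size / 65536) + 1) * syscall_cost_us / 1e6;

    printf("1-byte reads:  ~%.1f seconds of pure syscall overhead\n", byte_at_a_time);
    printf("64 KiB reads:  ~%.4f seconds of pure syscall overhead\n", chunked);
    return 0;
}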
Disk I/O
Input and output (I/O) on a computer must generally happen through the operating system, so that it can mediate and ensure that only one process at a time uses the physical resources affected by the I/O (e.g., a harddisk, or your WiFi). This avoids chaos and helps with fair sharing of the computer's hardware. (There are some exceptions to this rule, notably memory-mapped I/O and recent fast datacenter networking, but most classic I/O goes through the operating system.)
File Descriptors
When a user-space process makes I/O system calls like read()
or write(), it needs to
tell the kernel what file it wants to do I/O on. This requires the kernel and the user-space process to have a
shared way of referring to a file. On UNIX-like operating systems (such as macOS and Linux), this is done using
file descriptors.
File descriptors are identifiers that the kernel uses to keep track of open resources (such as files) used
by user-space processes. User-space processes refer to these resources using integer file descriptor (FD)
numbers; in the kernel, the FD numbers index into a FD table maintained for each process, which may contain
extra information like the filename, the offset into the file for the next I/O operation, or the amount of data
read/written. For example, a user-space process may use the number 3
to refer to a file descriptor
that the kernel knows corresponds to /home/malte/cats.txt.
To get a file descriptor number allocated, a process calls the open()
syscall. open()
causes the OS kernel to do permission checks, and if they pass, to allocate an FD number from the set of unused
numbers for this process. The kernel sets up its metadata, and then returns the FD number to user-space. The
FD number for the first file you open is usually 3, the next one 4, etc.
Why is the first file descriptor number usually 3?
On UNIX-like operating systems such as macOS and Linux, there are some standard file descriptor numbers. FD 0 normally refers to stdin (input from the terminal), 1 refers to stdout (output to the terminal), and 2 refers to stderr (output to the terminal, for errors). You can close these standard FDs; if you then open other files, they will reuse FD numbers 0 through 2, but your program will no longer be able to interact with the terminal.
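A small sketch that demonstrates both points (behavior on Linux and macOS; /dev/null is used here only because it is a file that always exists):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd1 = open("/dev/null", O_RDONLY);
    printf("first open:  fd = %d\n", fd1);   // usually 3: 0, 1, and 2 are taken

    close(0);                                // give up stdin
    int fd2 = open("/dev/null", O_RDONLY);
    printf("second open: fd = %d\n", fd2);   // usually 0: the lowest free number is reused

    close(fd1);
    close(fd2);
    return 0;
}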
Now that user-space has the FD number, it uses this number as a handle to pass into read()
and write(). The full API for the read system call is: ssize_t read(int fd, void* buf, size_t count). The first argument indicates the FD to work with, the second is a pointer to the buffer
(memory region) that the kernel is supposed to put the data read into, and the third is the number of bytes
to read. read()
returns the number of bytes actually read (or 0 if there are no more bytes
in the file; or -1 if there was an error). write()
has an analogous API, except the kernel reads
from the buffer pointed to and copies the data out.
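For example, a minimal sketch that uses this API to copy the beginning of a file to the terminal (it reads the data file the diskio programs create, but any existing file would do):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data", O_RDONLY);
    if (fd < 0) {
        return 1;
    }
    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf));  // ask for up to 128 bytes
    if (n > 0) {
        write(1, buf, n);                    // FD 1 is stdout
    }
    close(fd);
    return 0;
}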
One important aspect that is not part of the API of read()
or write()
is
the current I/O offset into the file (sometimes referred to as the "read-write head" in man pages).
In other words, when a user-space process calls read(), it fetches data from whatever offset the
kernel currently knows for this FD. If the offset is 24, and read()
wants to read 10 bytes, the
kernel copies bytes 24-33 into the user-space buffer provided as an argument to the system call, and then
sets the kernel offset for the FD to 34.
A user-space process can influence the kernel's offset via the lseek()
system call, but is
generally expected to remember on its own where in the file the kernel is at. In Project 3, you'll have to
maintain such metadata for your caching in user-space memory. In particular, when reading data into the
cache or writing cached data into a file, you'll need to be mindful of the current offset that the I/O will
happen at.
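The sketch below illustrates how the kernel's offset moves (assuming the data file exists and holds at least 44 bytes):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data", O_RDONLY);
    if (fd < 0) {
        return 1;
    }
    char buf[10];

    lseek(fd, 24, SEEK_SET);   // move the kernel's offset for this FD to 24
    read(fd, buf, 10);         // reads bytes 24-33; offset is now 34
    printf("offset after first read:  %ld\n", (long) lseek(fd, 0, SEEK_CUR));

    read(fd, buf, 10);         // continues at 34, reads bytes 34-43; offset is now 44
    printf("offset after second read: %ld\n", (long) lseek(fd, 0, SEEK_CUR));

    close(fd);
    return 0;
}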
Summary
Today, we started by looking at caching in action and measured the hit and miss rates of real-world processor caches. We also talked about how the concept of caching shows up again in between applications and files on disk, where I/O libraries cache the contents of files on disk in RAM. There is a lot more to say about caches and as you complete Project 3, you will build your own cache!