Lecture 12: Caching (continued)
» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 11:59pm Sunday, March 13).
Measuring Actual Cache Performance
We can see the cache in action by running our ./arrayaccess
program under a tool that reads
information from special "performance counter" registers in the processor. The perf.sh
script invokes this tool and sets it to measure the last-level cache (LLC) accesses ("loads") and
misses. In our example, the L3 cache is the last-level cache.
When we run ./arrayaccess -u -i 1 10000000, we should expect a hit rate smaller than 100%,
since the first access to each block triggers a cache miss. Sometimes, but not always, prefetching manages to
ensure that the next block is already in the cache (which increases the hit rate); the precise numbers depend
on a variety of factors, including what else is running on the computer at the same time.
For example, we may get the following output:
$ ./perf.sh ./arrayaccess -u -i 1 10000000
accessing 10000000 integers 1 times in sequential order:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...]
OK in 0.007934 sec!

 Performance counter stats for './arrayaccess -u -i 1 10000000':

           307,936      LLC-misses            # 25.67% of all LL-cache hits
         1,199,483      LLC-loads

       0.048342578 seconds time elapsed

This indicates that we experienced a roughly 74% cache hit rate (100% minus the 26% miss rate) – a rather decent result.
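For intuition, here is a minimal sketch of the sequential ("-u") access pattern, assuming a plain int array; this is not the actual arrayaccess source. Because a 64-byte cache block holds 16 four-byte ints, each miss brings in data for the next 15 accesses, and the prefetcher can often pull in the following block before it is needed.

#include <stddef.h>

// Sketch of a sequential pass over an int array: consecutive elements
// share cache blocks, so most accesses hit in the cache.
long sum_sequential(const int* arr, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[i];   // consecutive addresses: strong spatial locality
    }
    return sum;
}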
If we instead run the arrayaccess
program with a random access pattern, the hit rate is much
lower and there are more misses:
$ ./perf.sh ./arrayaccess -r -i 1 10000000
accessing 10000000 integers 1 times in random order:
[4281095, 3661082, 3488908, 9060979, 7747793, 8711155, 427716, 9760492, 9886661, 9641421, 9118652, 490027, 3368690, 3890299, 4340420, 7513926, 3770178, 5924221, 4089172, 3455736, ...]
OK in 0.152167 sec!

 Performance counter stats for './arrayaccess -r -i 1 10000000':

        19,854,838      LLC-misses            # 78.03% of all LL-cache hits
        25,443,796      LLC-loads

       0.754197032 seconds time elapsed

Here, the hit rate is only 22%, and 78% of cache accesses result in misses. As a result, the program runs 16x slower than when it accessed memory sequentially and benefited from locality of reference and prefetching.
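The reason is visible in the access pattern itself. A sketch of the random ("-r") pass, assuming the program pre-computes a shuffled array of indices (again, not the real arrayaccess source):

#include <stddef.h>

// Sketch of a random-order pass: each access likely lands in a different
// cache block, so most LLC loads miss and the prefetcher cannot predict
// the next address.
long sum_random(const int* arr, const size_t* shuffled, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += arr[shuffled[i]];   // jumps around memory: little locality
    }
    return sum;
}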
By varying the size of the array and observing the miss rate measured, we can work out the size of the L3 cache on the computer we're running on: once the array fits within the cache, the hit rate will be nearly 100%, since no cache blocks ever get evicted.
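This experiment amounts to rerunning arrayaccess with different sizes and more than one iteration; a stripped-down sketch of the idea (illustrative only, not the lecture code) looks like this. Running it under ./perf.sh for a range of sizes, the LLC miss rate should stay near zero while the array fits in the L3 cache and rise sharply once it no longer does.

#include <stdio.h>
#include <stdlib.h>

// Takes the array size (number of ints) and an iteration count, then
// does repeated sequential passes. Run under ./perf.sh with different
// sizes to observe the miss rate.
int main(int argc, char** argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s NUM_INTS ITERS\n", argv[0]);
        return 1;
    }
    size_t n = (size_t) atol(argv[1]);
    int iters = atoi(argv[2]);

    int* arr = malloc(n * sizeof(int));
    if (arr == NULL) {
        return 1;
    }
    for (size_t i = 0; i < n; i++) {
        arr[i] = (int) i;
    }

    volatile long sum = 0;   // volatile so the compiler keeps the loads
    for (int it = 0; it < iters; it++) {
        for (size_t i = 0; i < n; i++) {
            sum += arr[i];
        }
    }

    free(arr);
    return 0;
}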
Caching and File I/O
Caching exists at many layers of computer systems!
The programs diskio-slow and diskio-fast in the lecture code illustrate the huge difference caching can make to performance. Both programs write bytes to a file they create (the file is simply called data; you can see it in the lecture code directory after running these programs).
diskio-slow is a program that writes data to the computer's disk (SSD or hard disk) one byte at a time, and ensures that each byte is written to disk immediately, before the operation returns (the O_SYNC flag passed to open ensures this). It can write only a few hundred bytes per second – hardly an impressive speed, as writing a single picture (e.g., from your smartphone camera) would take several minutes at this rate!
diskio-fast, on the other hand, writes to disk via a series of caches. It easily achieves write throughputs of hundreds of megabytes per second: in fact, it writes 50 MB in about a tenth of a second on my laptop! This happens because these writes don't actually go to the computer's disk immediately. Instead, the program just writes to memory and relies on the operating system to "flush" the data out to stable storage over time, in whatever way the OS deems efficient. This improves performance, but it comes with a snag: if my computer loses power before the operating system gets around to putting my data on disk, the data may be lost, even though my program was under the impression that the write to the file succeeded.
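At their core, the difference between the two programs comes down to a few lines. A heavily simplified sketch of diskio-slow's inner loop (not the actual lecture code; the real program's byte counts and error handling differ):

#include <fcntl.h>
#include <unistd.h>

// O_SYNC forces each one-byte write to reach the storage device before
// write() returns, so every iteration pays the full cost of a
// synchronous disk write.
int main() {
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0) {
        return 1;
    }
    const char byte = '!';
    for (long i = 0; i < 5120; i++) {   // 5,120 separate synchronous writes
        if (write(fd, &byte, 1) != 1) {
            break;
        }
    }
    close(fd);
    return 0;
}

diskio-fast, by contrast, presumably leaves out O_SYNC and hands the OS much larger chunks per write, so each call returns as soon as the data sits in the kernel's in-memory buffer cache.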
I/O System Calls
When programs want the OS to do I/O on their behalf, their mechanism of choice is a system call. System
calls are like function calls, but they invoke OS functionality (which we'll discuss in more detail shortly).
read() and write() are examples of system calls.
System calls are not cheap. They require the processor to do significant extra work compared to normal function calls. A system call also means that the program probably loses some locality of reference, and thus may have more processor cache misses after the system call returns. In practice, a system call takes 1-2µs to handle. This may seem small, but compared to a DRAM access (60ns), it's quite expensive – more than 20x the cost of a memory access. Frequent system calls are therefore one major source of poor performance in programs. In Project 3, you implement a set of tricks to avoid having to make frequent system calls!
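The classic remedy, and the spirit of what Project 3 asks you to build (the sketch below is not the project's actual interface), is to batch data in a user-space buffer and issue one system call for many bytes:

#include <stddef.h>
#include <unistd.h>

#define CACHE_SIZE 4096

// A tiny write cache: bytes accumulate in memory, and we only make a
// write() system call when the buffer fills up (or on an explicit flush).
// Writing N bytes one at a time now costs roughly N / CACHE_SIZE system
// calls instead of N. (Error handling omitted for brevity.)
static char cache_buf[CACHE_SIZE];
static size_t cache_used = 0;

void cache_flush(int fd) {
    if (cache_used > 0) {
        write(fd, cache_buf, cache_used);   // one system call for many bytes
        cache_used = 0;
    }
}

void cache_write_byte(int fd, char c) {
    if (cache_used == CACHE_SIZE) {
        cache_flush(fd);
    }
    cache_buf[cache_used++] = c;
}

This is the same idea that makes diskio-fast quick: small writes are absorbed by an in-memory cache, and the expensive transition into the OS happens only occasionally.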
Summary
Today, we started by looking at caching in action and measured the hit and miss rates of real-world processor caches. There is a lot more to say about caches, and you will learn some more in Lab 4 and Project 3.