
Lecture 12: Caching (continued)

» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 11:59pm Sunday, March 13).

Measuring Actual Cache Performance

We can see the cache in action by running our ./arrayaccess program under a tool that reads information from special "performance counter" registers in the processor. The perf.sh script invokes this tool and sets it to measure the last-level cache (LLC) accesses ("loads") and misses. In our example, the L3 cache is the last-level cache.

When we run ./arrayaccess -u -i 1 10000000, we should expect a hit rate smaller than 100%, since the first access to each block triggers a cache miss. Sometimes, but not always, prefetching manages to ensure that the next block is already in the cache (which increases the hit rate); the precise numbers depend on a variety of factors, including what else is running on the computer at the same time.
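
To see why sequential access hits so often, consider the arithmetic at the level of cache blocks (assuming 4-byte ints and 64-byte cache blocks, which are typical values but not guaranteed): 16 consecutive array elements share one cache block, so only the first access to each block can miss, and the following 15 accesses hit thanks to spatial locality. Prefetching then hides many of those first-access misses by fetching the next block ahead of time.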

For example, we may get the following output:

$ ./perf.sh ./arrayaccess -u -i 1 10000000
accessing 10000000 integers 1 times in sequential order:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...]
OK in 0.007934 sec!

 Performance counter stats for './arrayaccess -u -i 1 10000000':

           307,936      LLC-misses                #   25.67% of all LL-cache hits
         1,199,483      LLC-loads

       0.048342578 seconds time elapsed

This indicates that we experienced a roughly 74% cache hit rate (100% minus the 26% miss rate) – a rather decent result.

If we instead run the arrayaccess program with a random access pattern, the hit rate is much lower and there are more misses:

$ ./perf.sh ./arrayaccess -r -i 1 10000000
accessing 10000000 integers 1 times in random order:
[4281095, 3661082, 3488908, 9060979, 7747793, 8711155, 427716, 9760492, 9886661, 9641421, 9118652, 490027, 3368690, 3890299, 4340420, 7513926, 3770178, 5924221, 4089172, 3455736, ...]
OK in 0.152167 sec!

 Performance counter stats for './arrayaccess -r -i 1 10000000':

        19,854,838      LLC-misses                #   78.03% of all LL-cache hits
        25,443,796      LLC-loads

       0.754197032 seconds time elapsed

Here, the hit rate is only 22%, and 78% of cache accesses result in misses. As a result, the program runs 16x slower than when it accessed memory sequentially and benefited from locality of reference and prefetching.
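
The difference comes down purely to the order in which the same array elements are visited. The following sketch approximates the two access loops (it is not the actual arrayaccess code; the array size and the use of rand() are illustrative assumptions):

#include <stdlib.h>

#define N 10000000

// Sequential scan: consecutive elements share cache blocks, so most
// accesses hit, and the prefetcher can fetch the next block ahead of time.
long sum_sequential(const int *a) {
    long sum = 0;
    for (long i = 0; i < N; i++) {
        sum += a[i];
    }
    return sum;
}

// Random scan: successive accesses usually land in different cache blocks,
// so spatial locality and prefetching barely help.
long sum_random(const int *a) {
    long sum = 0;
    for (long i = 0; i < N; i++) {
        sum += a[rand() % N];   // random index chosen for illustration
    }
    return sum;
}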

By varying the size of the array and observing the measured miss rate, we can work out the size of the L3 cache on the computer we're running on: once the array fits within the cache, the hit rate will be nearly 100%, since no cache blocks ever get evicted.

Caching and File I/O

Caching exists at many layers of computer systems!

The programs diskio-slow and diskio-fast in the lecture code illustrate the huge difference caching can make to performance. Both programs write bytes to a file they create (the file is simply called data; you can see it in the lecture code directory after running these programs).

diskio-slow is a program that writes data to the computer's disk (SSD or hard disk) one byte at a time, and ensures that each byte reaches the disk before the write operation returns (the O_SYNC flag passed to open has this effect). It can write a few hundred bytes per second – hardly an impressive speed, as writing a single picture (e.g., from your smartphone camera) would take several minutes if done this way!
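
A minimal sketch of this write-one-byte-and-wait pattern (an approximation, not necessarily the exact lecture code; the file name and byte count are illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // O_SYNC: each write() returns only once the byte has reached the disk.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0) {
        return 1;
    }
    char byte = '!';
    for (int i = 0; i < 5120; i++) {
        (void) write(fd, &byte, 1);   // one system call and one disk write per byte
    }
    close(fd);
    return 0;
}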

diskio-fast, on the other hand, writes to disk via a series of caches. It easily achieves write throughputs of hundreds of megabytes per second: in fact, it writes 50 MB in about a tenth of a second on my laptop! This is possible because these writes don't actually go to the computer's disk immediately. Instead, the program just writes to memory and relies on the operating system to "flush" the data out to stable storage over time in a way that it deems efficient. This improves performance, but it does come with a snag: if my computer loses power before the operating system gets around to putting my data on disk, the data may be lost, even though my program was under the impression that the write to the file succeeded.
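
A sketch of the buffered style (again an approximation rather than the exact lecture code): writing through stdio keeps the bytes in an in-memory buffer, and the operating system's buffer cache holds the data after that, flushing it to disk when convenient.

#include <stdio.h>

int main(void) {
    FILE *f = fopen("data", "w");
    if (!f) {
        return 1;
    }
    // Each fputc lands in stdio's in-memory buffer; the occasional flush
    // goes into the OS's buffer cache, not straight to the disk.
    for (long i = 0; i < 50L * 1024 * 1024; i++) {
        fputc('!', f);
    }
    fclose(f);   // flushes stdio's buffer; the OS may still hold the data in memory
    return 0;
}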

I/O System Calls

When programs want the OS to do I/O on their behalf, their mechanism of choice is a system call. System calls are like function calls, but they invoke OS functionality (which we'll discuss in more detail shortly). read() and write() are examples of system calls.

System calls are not cheap. They require the processor to do significant extra work compared to normal function calls. A system call also means that the program probably loses some locality of reference, and thus may have more processor cache misses after the system call returns. In practice, a system call takes 1-2µs to handle. This may seem small, but compared to a DRAM access (60ns), it's quite expensive – more than 20x the cost of a memory access. Frequent system calls are therefore one major source of poor performance in programs. In Project 3, you implement a set of tricks to avoid having to make frequent system calls!
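
One such trick is to batch many small writes into a single system call by collecting them in a user-space buffer first, much like stdio does internally. A minimal sketch of the idea (the names and the 4,096-byte buffer size are illustrative, not Project 3's actual interface):

#include <stddef.h>
#include <unistd.h>

#define WBUF_SIZE 4096

struct wbuf {
    int fd;               // file descriptor to write to
    size_t used;          // bytes currently waiting in buf
    char buf[WBUF_SIZE];
};

// Hand the accumulated bytes to the OS with a single write() system call.
void wbuf_flush(struct wbuf *w) {
    if (w->used > 0) {
        (void) write(w->fd, w->buf, w->used);
        w->used = 0;
    }
}

// Append one byte; write() is only invoked once every WBUF_SIZE bytes.
void wbuf_putc(struct wbuf *w, char c) {
    if (w->used == WBUF_SIZE) {
        wbuf_flush(w);
    }
    w->buf[w->used++] = c;
}

With this approach, writing N bytes costs roughly N/4096 system calls instead of N, so the per-call overhead is amortized across thousands of bytes.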

Summary

Today, we started by looking at caching in action and measured the hit and miss rates of real-world processor caches. There is a lot more to say about caches, and you will learn some more in Lab 4 and Project 3.