Lecture 9: C++ and Caching
» Lecture code (C++) – Lecture code (Caching)
» Post-Lecture Quiz (due 6pm Wednesday, February 24).
Standard Library Data Structures
One big advantage of C++ over C is that C++ comes with a large standard library with many common data structures implemented. We will use some of these data structures in the rest of the course.
The data structure part of the C++ library is called the Standard Template Library (STL), and it contains various "container" structures that represent different kinds of collections. For example:
- `std::vector` is a vector (dynamically-sized array) similar to the vector you implemented in Project 1.
- `std::map` provides an ordered key-value map, with an API somewhat similar to a Python dictionary, though with much stricter rules (fixed key and value types, no nesting, and others). The ordered map is typically implemented as a balanced search tree (such as a red-black tree), so many operations are O(log N) complexity for a map of size N.
- `std::unordered_map` provides an unordered key-value map, implemented as a hashtable with most operations having O(1) amortized complexity. Again, the API is somewhat similar to a Python dictionary, but with all the constraints of a `map` and the added constraint that the key type must be hashable (true of the primitive C++ types, but requires additional implementation for more complex types).
- `std::set` and `std::unordered_set` provide ordered and unordered set abstractions, with APIs that support addition, removal, membership checking, and other set operations.
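For instance, here is a short sketch of how these containers can be declared and used; the element types and values below are made up purely for illustration.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 4};              // dynamically-sized array of ints
    v.push_back(1);                              // now holds 3, 1, 4, 1

    std::map<std::string, int> ordered;          // keys kept in sorted order
    ordered["banana"] = 2;
    ordered["apple"] = 5;

    std::unordered_map<std::string, int> fast;   // hashtable; no ordering guarantee
    fast["banana"] = 2;

    std::set<int> s = {2, 7, 2};                 // duplicates are ignored; s holds {2, 7}

    // Iterating over a std::map visits keys in sorted order ("apple" before "banana").
    for (auto& [key, value] : ordered) {
        std::cout << key << " -> " << value << "\n";
    }
}
```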
STL collections are generic, meaning that they can hold elements of any type. This is extremely handy, because it means that we don't need separate implementations for, say, a vector of integers and a vector of strings. Recall that in your C vector implementation for Project 1, you had to use `void*` pointers and explicit element size arguments to make the vector generic; fortunately, generic C++ data structures require no such things. To tell the data structure what specific types it should assume, we include the types in angle brackets when we refer to the data structure type: for example, a `std::vector<int>` is a vector of `int`s, while a `std::vector<Animal>` would be a vector of `Animal` objects, and `std::vector<int*>` is a vector of pointers to integers.
How do generic data structures work?
The details of how generic C++ STL data structures work are complex and related to an advanced feature of the C++ language called "templating". You won't need to understand how to write templated classes for this course, but you can think of a templated class as a class with one or more type parameters that the compiler searches and replaces with the actual types before it compiles your code. For example, `std::vector<T>` specifies a type parameter `T` for the type of the vector elements, and all code implementing the vector uses `T` to refer to the element type. Only when you actually use, e.g., a `vector<int>` will the compiler generate and compile code for a vector of integers and appropriately set all element sizes in that code.
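A toy example may help make the "search and replace the type parameter" intuition concrete. The `Pair` class below is made up for illustration; it is a sketch of the mechanism, not how `std::vector` is actually implemented.

```cpp
#include <cstddef>
#include <iostream>

// A toy templated container holding exactly two elements of type T. When the
// compiler sees Pair<int> below, it generates a version of this class with T
// replaced by int (and correspondingly sized storage).
template <typename T>
struct Pair {
    T first;
    T second;

    T& at(size_t i) {
        return (i == 0) ? first : second;
    }
};

int main() {
    Pair<int> p{1, 2};              // compiler instantiates Pair with T = int
    Pair<const char*> q{"a", "b"};  // ...and separately with T = const char*
    std::cout << p.at(1) << " " << q.at(0) << "\n";  // prints "2 a"
}
```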
You can declare both stack-allocated and heap-allocated STL container data structures, and `cpp2.cc` shows some examples. However, one very important thing to realize is that these C++ data structures may themselves allocate memory on the heap (in fact, they usually do!), even if the data structure itself is declared as stack-allocated. If you think about it, this makes sense: all of these data structures are dynamic in size, i.e., you can add and remove elements in your code as you wish. This means that the data structures cannot live entirely on the stack or in the static segment, since both of these segments require object storage sizes to be known at compile time.
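As a small illustration of this distinction (a sketch of my own, not necessarily the examples in `cpp2.cc`):

```cpp
#include <vector>

int main() {
    // `v` itself (a few pointer and size fields) lives on this function's stack
    // frame, but the 1000 ints it holds live in a buffer that the vector
    // allocates on the heap, because the element count can change at runtime.
    std::vector<int> v(1000, 0);
    v.push_back(42);   // may heap-allocate a larger buffer and move the elements into it

    // A heap-allocated vector: both the vector object and its element buffer
    // live on the heap. (With raw `new`, we must remember to `delete` it.)
    std::vector<int>* hv = new std::vector<int>(1000, 0);
    delete hv;
}
```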
We won't be able to cover in detail all the APIs that STL collections offer in lectures, and we encourage you to make use of the reference links on our C++ primer page to explore them. The reference material can seem verbose and confusing at first; often, it's easiest to look at the code examples included in the documentation for specific methods to develop an intuition for how to use them. The methods you want often have relatively obvious names (e.g., `contains(T element)` checks if a `std::set<T>` contains `element`; `push_back(T element)` on a `std::vector<T>` adds an element to the back), but not always (e.g., the easiest way to insert into a `std::map<K, V>` is to use `emplace(K key, V value)`).
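For instance (a made-up snippet, not from the lecture code; note that `contains()` requires C++20):

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> names;
    names.push_back("alice");            // adds an element at the back

    std::map<std::string, int> ages;
    ages.emplace("alice", 30);           // inserts the key-value pair ("alice", 30)
    ages.emplace("alice", 99);           // no effect: the key already exists

    std::set<int> primes = {2, 3, 5, 7};
    std::cout << primes.contains(5) << "\n";   // prints 1 (true); contains() is C++20

    std::cout << ages["alice"] << "\n";  // prints 30
}
```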
S2: Caching and the Storage Hierarchy
We are now switching gears to talk about one of the most important performance-improving concepts in computer systems. This concept is the idea of cache memory.
Why are we covering this?
Caching is an immensely important concept to optimize performance of a computer system. As a software engineer in industry, or as a researcher, you will probably find yourself in countless situations where "add a cache" is the answer to a performance problem. Understanding the idea behind caches, as well as when a cache works well, is important to being able to build high-performance applications.
We will look at specific examples of caches, but a generic definition is the following: a cache is a small amount of fast storage used to speed up access to larger, slower storage.
One reasonable question is what we actually mean by "fast storage" and "slow storage", and why we need both. Couldn't we just put all of the data on our computer into fast storage?
To answer this question, it helps to look at what different kinds of storage cost and how this cost has changed over time.
The Storage Hierarchy
When we learn about computer science concepts, we often talk about "cost": the time cost and space cost of algorithms, memory efficiency, and storage space. These costs fundamentally shape the kinds of solutions we build. But financial costs also shape the systems we build, and the costs of the storage technologies we rely on have changed dramatically, as have their capacities and speeds.
The table below gives the price per megabyte of different storage technologies over time, in inflation-adjusted dollars (see the note below the table). (Note that flash/SSD storage did not exist until the early 2000s, when the technology became available.)
Year | Memory (DRAM) | Flash/SSD | Hard disk |
---|---|---|---|
~1955 | $411,000,000 | | $9,200 |
1970 | $734,000.00 | | $260.00 |
1990 | $148.20 | | $5.45 |
2003 | $0.09 | $0.305 | $0.00132 |
2010 | $0.019 | $0.00244 | $0.000073 |
2021 | $0.003 | $0.00008 | $0.0000194 |
(Prices due to John C. McCallum, and inflation data from here. $1.00 in 1955 had "the same purchasing power" as $9.62 in 2019 dollars.)
Computer technology is amazing – not just for what it can do, but also for just how tremendously its cost has dropped over the course of just a few decades. The space required to store a modern smartphone photo (3 MB) on a harddisk would have cost tens of thousands of dollars in the 1950s, but now costs a fraction of a cent.
But one fundamental truth has remained the case across all these numbers: primary memory (DRAM) has always been substantially more expensive than long-term disk storage. This becomes even more evident if we normalize all numbers in the table to the cost of 1 MB of harddisk space in 2021, as the second table below does.
Year | Memory (DRAM) | Flash/SSD | Hard disk |
---|---|---|---|
~1955 | 219,800,000,000,000 | | 333,155,000 |
1970 | 39,250,000,000 | | 13,900,000 |
1990 | 7,925,000 | | 291,000 |
2003 | 4,800 | 16,300 | 70 |
2010 | 1,000 | 130 | 3.9 |
2021 | 155 | 4.12 | 1 |
As a consequence of this price differential, computers have always had more persistent disk space than primary memory. Harddisks and flash/SSD storage are persistent (i.e., they survive power failures), while DRAM memory is volatile (i.e., its contents are lost when the computer loses power), but harddisk and flash/SSD storage are also much slower to access than memory.
In particular, when thinking about storage performance, we care about the latency to access data in storage. The latency denotes the time it takes until the requested data is available (for a read), or until the data is on the storage medium (for a write). A smaller latency is better, as it means that the computer can complete operations sooner.
Another important storage performance metric is throughput (or "bandwidth"), which is the number of operations completed per time unit. Throughput is often, though not always, the inverse of latency: for example, a device that completes one 64-byte access every 60 ns, one after another, achieves a throughput of about 1 GB/sec, but real devices often process many accesses in parallel, so their throughput can be much higher than this naive calculation suggests. An ideal storage medium would have low latency and high throughput, as it takes very little time to complete a request, and many units of data can be transferred per second.
In reality, though, latency generally grows, and throughput drops, as storage media are further and further away from the processor. This is partly due to the storage technologies employed (some, like spinning harddisks, are cheap to manufacture, but slow), and partly due to the inevitable physics of sending information across longer and longer wires.
The table below shows the typical capacity, latency, and throughput achievable with the different storage technologies available in our computers.
Storage type | Capacity | Latency | Throughput (random access) | Throughput (sequential) |
---|---|---|---|---|
Registers | ~30 (100s of bytes) | 0.5 ns | 16 GB/sec (2×10⁹ accesses/sec) | |
SRAM (processor caches) | 5 MB | 4 ns | 1.6 GB/sec (2×10⁸ accesses/sec) | |
DRAM (main memory) | 8 GB | 60 ns | 100 GB/sec | |
SSD (stable storage) | 512 GB | 60 µs | 550 MB/sec | |
Hard disk | 2–5 TB | 4–13 ms | 1 MB/sec | 200 MB/sec |
This notion of larger, cheaper, and slower storage further away from the processor, and smaller, faster, and more expensive storage closer to it, is referred to as the storage hierarchy, as it is possible to neatly rank storage according to these criteria. The storage hierarchy is often depicted as a pyramid, where wider (and lower) entries correspond to larger and slower forms of storage.

This picture includes processor caches, which are small regions of fast (SRAM) memory that are on the processor chip itself. The storage hierarchy shows the processor caches divided into multiple levels, with the L1 cache (sometimes pronounced "level-one cache") closer to the processor than the L2, L3, and L4 caches. This reflects how processor caches are actually laid out, but we often think of a processor cache as a single unit.
Different computers have different sizes and access costs for these hierarchy levels; the ones in the table above are typical. Here are some more, based on Malte's MacBook Air from ca. 2013: a few hundred bytes of registers; ~5 MB of processor cache; 8 GB of primary memory; 256 GB of SSD. The processor cache divides into three levels: 128 KB of total L1 cache, split into four 32 KB components, each accessed only by a single processor core (which makes it faster, as cores don't need to coordinate); 256 KB of L2 cache; and 4 MB of L3 cache shared by all cores.
Each layer in the storage hierarchy acts as a cache for the following layer.
The programs `diskio-slow` and `diskio-fast` in the lecture code illustrate the huge difference caching can make to performance. Both programs write bytes to a file they create (the file is simply called `data`; you can see it in the lecture code directory after running these programs).
`diskio-slow` is a program that writes data to the computer's disk (SSD or harddisk) one byte at a time, and ensures that the byte is written to disk immediately and before the operation returns (the `O_SYNC` flag passed to `open` ensures this). It can write a few hundred bytes per second – hardly an impressive speed, as writing a single picture (e.g., from your smartphone camera) would take several minutes if done this way!
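The core of such a program might look roughly like this (a sketch based on the description above, not the actual lecture code):

```cpp
#include <fcntl.h>
#include <unistd.h>

int main() {
    // O_SYNC forces every write() to reach the disk before the call returns.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0) {
        return 1;
    }
    const char byte = '!';
    for (int i = 0; i < 512; ++i) {       // even a few hundred bytes take a while
        if (write(fd, &byte, 1) != 1) {   // one byte per synchronous write
            break;
        }
    }
    close(fd);
    return 0;
}
```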
`diskio-fast`, on the other hand, writes to disk via a series of caches. It easily achieves write throughputs of hundreds of megabytes per second: in fact, it writes 50 MB in about a tenth of a second on my laptop! This happens because these writes don't actually go to the computer's disk immediately. Instead, the program just writes to memory and relies on the operating system to "flush" the data out to stable storage over time in a way that it deems efficient. This improves performance, but it does come with a snag: if my computer loses power before the operating system gets around to putting my data on disk, it may get lost, even though my program was under the impression that the write to the file succeeded.
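One way to get this buffered behavior (again a sketch, not necessarily how `diskio-fast` is implemented) is to drop `O_SYNC` and write through buffered I/O, leaving it to the C library and the operating system to batch up the writes:

```cpp
#include <cstdio>

int main() {
    // Writes land in stdio's in-memory buffer and then the OS's buffer cache;
    // the data reaches the disk only when the OS flushes it later.
    FILE* f = fopen("data", "w");
    if (!f) {
        return 1;
    }
    for (long i = 0; i < 50 * 1000 * 1000; ++i) {   // 50 MB, one byte at a time
        fputc('!', f);
    }
    fclose(f);   // flushes stdio's buffer to the OS, but not necessarily to disk
    return 0;
}
```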
Finally, consider how the concept of caching abounds in everyday life, too. Imagine how life would differ without, say, fast access to food storage – if every time you felt hungry, you had to walk to a farm and eat a carrot you pulled out of the dirt. Your whole day would be occupied with finding and eating food! (Indeed, this is what some animals spend most of their time doing.) Instead, your refrigerator (or your dorm's refrigerator) acts as a cache for your neighborhood grocery store, and that grocery store acts as a cache for all the food producers worldwide.
S3: Cache Structure
A generic cache is structured as follows.
The fast storage of the cache is divided into fixed-size slots. Each slot can hold data from slow storage, and is at any point in time either empty or occupied by data from slow storage. Each slot has a name (for instance, we might refer to "the first slot" in a cache, or "slot 0"). Slow storage is divided into blocks, which can occupy cache slots (so the cache slot size and the slow storage block size ought to be identical).
Each block on slow storage has an "address" (this is not a memory address! Disks and other storage media have their own addressing schemes). If a cache slot (in fast storage) is full, i.e., if it holds data from a block in slow storage, the cache must also know the address of that block.
In the context of specific levels in the hierarchy, slots and blocks are described by specific terms. For instance, a cache line is a slot in the processor cache (or, sometimes, a block in memory).
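In code, you could picture this generic structure roughly as follows (a made-up sketch, not any particular real cache; the slot count and block size are arbitrary):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr size_t BLOCK_SIZE = 64;   // bytes per block (and thus per slot)
constexpr size_t NUM_SLOTS = 8;     // number of slots in the cache

struct Slot {
    bool occupied = false;          // is this slot currently holding a block?
    uintptr_t block_address = 0;    // address of that block in slow storage
    std::array<unsigned char, BLOCK_SIZE> data;  // copy of the block's bytes
};

struct Cache {
    std::array<Slot, NUM_SLOTS> slots;
};
```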
Cache Hits and Misses
Read caches must respond to user requests for data at particular addresses. On each access, a cache typically checks whether the specified block is already loaded into a slot. If it is, the cache returns that data; otherwise, the cache first loads the block into some slot, then returns the data from the slot.
A cache access is called a hit if the data accessed is already loaded into a cache slot, and can therefore be accessed quickly, and it's called a miss otherwise. Cache hits are good (they imply fast access), and cache misses are bad: they incur both the cost of accessing the cache and the cost of accessing the slower storage. Ideally, most accesses to the cache should be hits! In other words, we seek a high hit rate, where the hit rate is the fraction of accesses that hit.
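Continuing the hypothetical `Cache`/`Slot` sketch from above, a read access might be structured like this. Here `load_block` is just a stand-in for reading the slower storage, and the slot choice on a miss is deliberately naive (see the "Cache Replacement" section below):

```cpp
#include <cstring>

// Placeholder: pretend we fetched BLOCK_SIZE bytes from slow storage.
void load_block(uintptr_t address, std::array<unsigned char, BLOCK_SIZE>& out) {
    std::memset(out.data(), 0, out.size());
    (void) address;
}

const unsigned char* cache_read(Cache& cache, uintptr_t block_address) {
    for (Slot& slot : cache.slots) {
        if (slot.occupied && slot.block_address == block_address) {
            return slot.data.data();          // hit: data already in fast storage
        }
    }
    Slot& victim = cache.slots[0];            // miss: naively always reuse slot 0
    load_block(block_address, victim.data);   // slow-storage access
    victim.occupied = true;
    victim.block_address = block_address;
    return victim.data.data();
}
```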
For example, consider the `arrayaccess.cc` program. This program generates an in-memory array of integers and then iterates over it, summing the integers, and measures the time to complete this iteration. `./arrayaccess -u -i 1 10000000` runs through the array in linear, increasing order (`-u`), completes one iteration (`-i 1`), and uses an array of 10 million integers (ca. 40 MB).

Given that the array is in primary memory (DRAM), how long would we expect this to take? Let's focus on a specific part: the access to the first three integers in the array (i.e., `array[0]` to `array[2]`). If we didn't have a cache at all, accessing each integer would incur a latency of 60 ns (the latency of DRAM access). For the first three integers, we'd therefore spend 60 ns + 60 ns + 60 ns = 180 ns.
But we do have the processor caches, which are faster to access than DRAM! They can hold blocks from DRAM in their slots. Since no blocks of the array are in any processor cache at the start of the program, the access to the first integer still goes to DRAM and takes 60 ns. But this access will fetch a whole cache block, which consists of multiple integers, and deposit it into a slot of the L3 processor cache (and, in fact, the L2 and L1, but we will focus this example on the L3 cache). Let's assume the cache slot size is 12 bytes (i.e., three integers) for the example; real cache lines are often on the order of 64 or 128 bytes long. If the first access brings the block into the cache, the subsequent accesses to the second and third integers can read them directly from the cache, which takes about 4 ns per read for the L3 cache. Thus, we end up with a total time of 60 ns + 4 ns + 4 ns = 68 ns, which is much faster than 180 ns.
The principle behind this speedup is called locality of reference. Caching is based on the assumption that if a program accesses a location in memory, it is likely to access this location or an adjacent location again in the future. In many real-world programs, this assumption is a good one. In our example, we have spatial locality of reference, as the access to `array[0]` is indeed soon followed by accesses to `array[1]` and `array[2]`, which are already in the cache.
By contrast, a totally random access pattern has no locality of reference, and caching generally does not help much with it. With our `arrayaccess` program, passing `-r` changes the access to be entirely random, i.e., we still access each integer, but we do so in a random order. Despite the fact that the program sums just as many integers, it takes much longer (on the order of 10x longer) to run! This is because it does not benefit from caching.
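The essence of this experiment is just a summation loop whose index order we control. The snippet below is a simplified, hypothetical version of what `arrayaccess` does (the real program also handles command-line flags and repeated iterations):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = 10'000'000;
    std::vector<int> array(n, 1);

    // Access order: 0, 1, 2, ... for the sequential case (like -u), or a
    // random permutation of the indices for the random case (like -r).
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937(42));  // remove for sequential order

    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t idx : order) {
        sum += array[idx];   // same work either way; only the access order differs
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    std::cout << "sum = " << sum << ", time = "
              << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
              << " ms\n";
}
```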
Cache Replacement
In our nice, linear arrayaccess
example, the first access to the array brought the first block
into the processor caches. Once the program hits the second block, that will also be brought into the cache,
filling another slot, as will the third, etc. But what happens once all cache slots are full? In this situation,
the cache needs to throw out an existing block to free up a slot for a new block. This is called cache
eviction, and the way the cache decides which slot to free up is called an "eviction policy" or a
"replacement policy".
What might a good eviction policy be? We would like the cache to be effective in the future, so ideally we want to avoid evicting a block that we will need shortly after. One option is to always evict the oldest block in the cache, i.e., to cycle through slots as eviction is needed. This is a round-robin, or first-in, first-out (FIFO), policy. Another sensible policy is to evict the cache entry that was least recently accessed; this policy is called least recently used (LRU), and it happens to be what most real-world caches use.
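In software, LRU bookkeeping is often done with a recency-ordered list plus a map for fast lookups. Here is a minimal sketch of that idea (the class name and interface are made up; hardware caches use cheaper approximations of LRU rather than these data structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class LRUTracker {
    size_t capacity_;
    std::list<uintptr_t> order_;   // front = most recently used block address
    std::unordered_map<uintptr_t, std::list<uintptr_t>::iterator> where_;

public:
    explicit LRUTracker(size_t capacity) : capacity_(capacity) {}

    // Record an access to `block`. Returns the evicted block's address, or 0
    // if nothing was evicted (assuming 0 is never a valid block address).
    uintptr_t access(uintptr_t block) {
        uintptr_t victim = 0;
        auto it = where_.find(block);
        if (it != where_.end()) {
            order_.erase(it->second);      // hit: unlink; reinserted at the front below
        } else if (order_.size() == capacity_) {
            victim = order_.back();        // miss with a full cache: evict the
            where_.erase(victim);          // least recently used block
            order_.pop_back();
        }
        order_.push_front(block);          // `block` is now the most recently used
        where_[block] = order_.begin();
        return victim;
    }
};
```

Calling `access()` for every block touched keeps the list ordered by recency, so the block at the back is always the least recently used one and is the natural eviction victim.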
But what's the best replacement policy? Intuitively, it is a policy that always evicts the block that will be accessed farthest into the future. This policy is impossible to implement in the real world unless we know the future, but it is provably optimal (no algorithm can do better)! Consequently, this policy is called Bélády's optimal algorithm, named after its discoverer, László Bélády (the article).
It turns out that processors can sometimes make good guesses at the future, however. Looping linearly over a large array, as we do in `./arrayaccess -u`, is such a case. The processor can detect that we are always accessing the neighboring block after the previous one, and it can speculatively bring the next block into the cache while it still works on the summation for the previous block. This speculative loading of blocks into the cache is called prefetching. If it works well, prefetching can remove some of the 60 ns price of the first access to each block, as the block may already be in the cache when we access it!
Summary
Today, we looked into the handy data structures provided by the C++ standard library, and got an initial feel for how you can use them to make your life easier.
We then talked about the notion of caching as a performance optimization at a high level, and we developed our understanding of caches, why they are necessary, and why they benefit performance. We saw that differences in the price of storage technologies create a storage hierarchy with smaller, but faster, storage at the top, and slower, but larger, storage at the bottom. Caches are a way of making the bottom layers appear faster than they actually are!
We then dove deeper into how your processor's cache, a specific instance of the caching paradigm, works and speeds up program execution. We found that accessing data already in the cache (a "hit") is much faster than having to go to the underlying storage to bring the block into the cache (a "miss"). We also considered different algorithms for deciding which slot to free up when the cache is full and a new block needs to be brought in, and found that the widely-used LRU algorithm is a reasonable predictor of future access patterns due to the locality of reference exhibited by typical programs.