Lecture 8: Alignment, Collection Rules, Storage hierarchy #
Alignment continued, and Collection Rules #
In previous lectures, we built up to specifying a set of rules that govern how the C language expects data to be laid out in memory. We’re now ready to write down these rules.
Here are the first two:
- the first member rule says that the address of the collection (array, structure, or union \[see below\]) is the same as the address of its first member;
- the array rule says that all members of an array are laid out consecutively in memory (a short sketch below illustrates both rules).
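Here's a minimal sketch (not course code) that makes both rules visible by printing addresses:

```c
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};

    // First member rule: the address of the array equals the address of
    // its first element.
    printf("a     = %p\n", (void*)a);
    printf("&a[0] = %p\n", (void*)&a[0]);  // prints the same address

    // Array rule: elements sit consecutively in memory, so successive
    // addresses differ by sizeof(int) (4 bytes).
    for (int i = 0; i < 4; i++) {
        printf("&a[%d] = %p\n", i, (void*)&a[i]);
    }
    return 0;
}
```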
How are the members of a struct like list_node_t actually laid out in
memory? This is defined by the struct rule, which says that the
members of a struct are laid out in the order they’re declared in,
without overlap, and subject only to alignment constraints. These
mysterious “alignment constraints” are what makes our list_node_t have
a size of 16 bytes even though it only needs 12.
So, by the first member rule, the struct will be aligned. (It turns out
that, in practice, structures on the heap are aligned on 16-byte
boundaries because malloc() on x86-64 Linux returns 16-byte aligned
pointers; structures on the stack are aligned by the compiler.)
The size of a struct might therefore be larger than the sum of the sizes
of its components due to alignment constraints. Since the compiler must
lay out struct components in order, and it must obey the components'
alignment constraints, and it must ensure different components don’t
overlap, it must sometimes introduce extra space in structs. This space
is called padding, and it’s effectively wasted memory. Our linked
list node is an example of a situation where padding is required: the
struct will have 4 bytes of padding after int v, to ensure that
list_node_t* has a correct alignment (address divisible by 8).
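To see this padding concretely, here is a small sketch; it assumes a list_node_t with an int value followed by a next pointer, like the one from earlier lectures.

```c
#include <stdio.h>
#include <stddef.h>

// Assumed definition, matching the description above: an int value
// followed by a pointer to the next node.
typedef struct list_node {
    int v;                   // 4 bytes, alignment 4
    struct list_node* next;  // 8 bytes, alignment 8
} list_node_t;

int main(void) {
    // On x86-64, this prints 16: 4 bytes for v, then 4 bytes of padding so
    // that next starts at an offset divisible by 8, then 8 bytes for next.
    printf("sizeof(list_node_t)         = %zu\n", sizeof(list_node_t));
    printf("offsetof(list_node_t, v)    = %zu\n", offsetof(list_node_t, v));    // 0
    printf("offsetof(list_node_t, next) = %zu\n", offsetof(list_node_t, next)); // 8
    return 0;
}
```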
So, we can now specify the third rule:
- the struct rule says that members of a struct are laid out in declaration order, without overlap, and with minimum padding as necessary to satisfy the struct members’ alignment constraints.
In addition to these rules, there are three more we haven’t covered or made explicit yet.
Aside: Unions #
⚠️ We did not cover unions in the course this year. The following material is for your education only; we won’t test you on it. Feel free to skip ahead to the other rules.
For the next rule, we need to learn about unions, which are another collection type in C.
A union is a C data structure that looks a lot like a struct, but which contains only one of its members. Here’s an example:
union int_or_char {
int i;
char c;
};
Any variable u of type union int_or_char is either an integer (so
u.i is valid) or a char (so u.c is valid), but never both at the
same time. Unions are rarely used in practice and you won’t need them in
this course. The size of a union is the maximum of the sizes of its
members, and its alignment is the maximum of its members' alignments.
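For instance, the int_or_char union above is only as big as its largest member. A quick sketch (using C11's _Alignof):

```c
#include <stdio.h>

union int_or_char {
    int i;
    char c;
};

int main(void) {
    // Both the size and the alignment are determined by the largest /
    // most strictly aligned member, here the 4-byte int.
    printf("sizeof(union int_or_char)   = %zu\n", sizeof(union int_or_char));   // 4
    printf("_Alignof(union int_or_char) = %zu\n", _Alignof(union int_or_char)); // 4

    union int_or_char u;
    u.i = 65;   // u currently holds an int...
    u.c = 'A';  // ...and now it should be used as a char; only one
                // interpretation is valid at a time.
    return 0;
}
```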
What are unions good for?
Unions are helpful when a data structure’s size is of the essence (e.g., for embedded environments like the controller chip in a microwave), and in situations where the same bytes can represent one thing or another. For example, the internet is based on a protocol called IP, and there are two versions of it: IPv4 (the old one) and IPv6 (the new one, which permits >4B computers on the internet). There are situations where we need to pass an address that either follows the IPv4 format (4 bytes) or the IPv6 format (16 bytes). A union makes this possible without wasting memory or requiring two separate data structures.
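Here's a hypothetical sketch of that idea (real network code uses structures like struct sockaddr, which are more involved):

```c
#include <stdio.h>
#include <stdint.h>

// Hypothetical example: a value that is either an IPv4 address (4 bytes)
// or an IPv6 address (16 bytes), but never both at once.
typedef union ip_addr {
    uint8_t v4[4];
    uint8_t v6[16];
} ip_addr_t;

int main(void) {
    // The union takes 16 bytes (the size of its largest member), rather
    // than the 20 bytes a struct containing both members would need.
    printf("sizeof(ip_addr_t) = %zu\n", sizeof(ip_addr_t));

    ip_addr_t a;
    a.v4[0] = 127; a.v4[1] = 0; a.v4[2] = 0; a.v4[3] = 1;  // store 127.0.0.1
    return 0;
}
```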
Now we can get to the next rule!
- The union rule says that the address of all members of a union is the same as the address of the union.
Back to other rules! #
The remaining two rules are far more important:
- The minimum rule says that the memory used for a collection shall be the minimum possible without violating any of the other rules.
- The malloc rule says that any call to malloc that succeeds returns a pointer that is aligned for any type. This rule has some important consequences: it means that malloc() must return pointers aligned for the maximum alignment, which on x86-64 Linux is 16 bytes. In other words, any pointer returned from malloc points to an address that is a multiple of 16 (the sketch below checks this).
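A quick sketch you can run to check this on your own machine (the allocation size is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    // Even a tiny allocation gets a 16-byte-aligned pointer on x86-64 Linux.
    void* p = malloc(5);
    if (p == NULL) {
        return 1;
    }
    // The remainder modulo 16 should print as 0.
    printf("p = %p, (uintptr_t)p %% 16 = %lu\n",
           p, (unsigned long)((uintptr_t)p % 16));
    free(p);
    return 0;
}
```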
One consequence of the struct rule and the minimum rule is that
reordering struct members can reduce the size of structures! Look at the
example in mexplore-structalign.c. The struct ints_and_chars defined
in that file consists of three ints and three chars, whose
declarations alternate. What will the size of this structure be?
It’s 24 bytes. The reason is that each int requires 4 bytes (so, 12
bytes total), and each char requires 1 byte (3 bytes total), but
alignment requires the integers to start at addresses that are multiples
of four! Hence, we end up with a struct layout like the following:
0x... 00 ... 04 ... 08 ... 0c ... 10 ... 14 ... <- addresses (hex)
+------+--+---+------+--+---+------+--+---+
| i1 |c1|PAD| i2 |c2|PAD| i3 |c3|PAD| <- values
+------+--+---+------+--+---+------+--+---+
This adds 9 bytes of padding – a 37.5% overhead! The padding is needed because the characters only use one byte, but the next integer has to start on an address divisible by 4.
But if we rearrange the members of the struct, declaring them in order
i1, i2, i3, c1, c2, c3, the structure’s memory layout
changes. We now have the three integers adjacent, and since they require
an alignment of 4 and are 4 bytes in size, no padding is needed between
them. After the integers, we can then pack the characters into contiguous
bytes, since their size and alignment are 1.
0x... 00 ... 04 ... 08 ... 0c 0d 0e 0f ... <- addresses (hex)
+------+------+------+--+--+--+--+
| i1 | i2 | i3 |c1|c2|c3|P.| <- values
+------+------+------+--+--+--+--+
We only need a single byte of padding (6.25% overhead), as the struct
must be padded to 16 bytes (why? Consider an array of ints_and_chars
and the alignment of the next element!). In addition, the structure is
now 16 bytes in size rather than 24 bytes – a 33% saving.
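Here is a sketch along the lines of mexplore-structalign.c (the struct and member names are illustrative; the file in the course handout may differ) that compares the two orderings:

```c
#include <stdio.h>

// Interleaved declarations: each char is followed by 3 bytes of padding
// so that the next int can start at an address that is a multiple of 4.
struct interleaved {
    int i1; char c1;
    int i2; char c2;
    int i3; char c3;
};

// Reordered declarations: the ints pack tightly, the chars pack tightly,
// and only 1 byte of padding is needed at the end.
struct reordered {
    int i1; int i2; int i3;
    char c1; char c2; char c3;
};

int main(void) {
    printf("interleaved: %zu bytes\n", sizeof(struct interleaved)); // 24
    printf("reordered:   %zu bytes\n", sizeof(struct reordered));   // 16
    return 0;
}
```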
Caching and the Storage Hierarchy #
We are now switching gears to talk about one of the most important performance-improving concepts in computer systems. This concept is the idea of cache memory.
Why are we covering this?
Caching is an immensely important concept to optimize performance of a computer system. As a software engineer in industry, or as a researcher, you will probably find yourself in countless situations where “add a cache” is the answer to a performance problem. Understanding the idea behind caches, as well as when a cache works well, is important to being able to build high-performance applications.
We will look at specific examples of caches, but a generic definition is the following: a cache is a small amount of fast storage used to speed up access to larger, slower storage.
One reasonable question is what we actually mean by “fast storage” and “slow storage”, and why we need both. Couldn’t we just put all of the data on our computer into fast storage?
To answer this question, it helps to look at what different kinds of storage cost and how this cost has changed over time.
The Storage Hierarchy #
When we learn about computer science concepts, we often talk about “cost”: the time cost and space cost of algorithms, memory efficiency, and storage space. These costs fundamentally shape the kinds of solutions we build. But financial costs also shape the systems we build, and the costs of the storage technologies we rely on have changed dramatically, as have their capacities and speeds.
The table below gives the price per megabyte of different storage technologies, in inflation-adjusted dollars, from the mid-1950s to 2021. (Note that flash/SSD storage did not exist until the early 2000s, when the technology became available.)
| Year | Memory (DRAM) | Flash/SSD | Hard disk |
|---|---|---|---|
| ~1955 | $411,000,000 | | $9,200 |
| 1970 | $734,000.00 | | $260.00 |
| 1990 | $148.20 | | $5.45 |
| 2003 | $0.09 | $0.305 | $0.00132 |
| 2010 | $0.019 | $0.00244 | $0.000073 |
| 2021 | $0.003 | $0.00008 | $0.0000194 |
(Prices due to John C. McCallum, adjusted for inflation; $1.00 in 1955 had “the same purchasing power” as $9.62 in 2019 dollars.)
Computer technology is amazing – not just for what it can do, but also for just how tremendously its cost has dropped over the course of just a few decades. The space required to store a modern smartphone photo (3 MB) on a harddisk would have cost tens of thousands of dollars in the 1950s, but now costs a fraction of a cent.
But one fundamental truth has remained the case across all these numbers: primary memory (DRAM) has always been substantially more expensive than long-term disk storage. This becomes even more evident if we normalize all numbers in the table to the cost of 1 MB of harddisk space in 2019, as the second table below does.
| Year | Memory (DRAM) | Flash/SSD | Hard disk |
|---|---|---|---|
| ~1955 | 219,800,000,000,000 | | 333,155,000 |
| 1970 | 39,250,000,000 | | 13,900,000 |
| 1990 | 7,925,000 | | 291,000 |
| 2003 | 4,800 | 16,300 | 70 |
| 2010 | 1,000 | 130 | 3.9 |
| 2021 | 155 | 4.12 | 1 |
As a consequence of this price differential, computers have always had more persistent disk space than primary memory. Harddisks and flash/SSD storage are persistent (i.e., they survive power failures), while DRAM memory is volatile (i.e., its contents are lost when the computer loses power), but harddisk and flash/SSD are also much slower to access than memory.
In particular, when thinking about storage performance, we care about the latency to access data in storage. The latency is the time until the requested data is available if we read it, or until it is safely on the storage medium if we write it. A longer latency is worse, and a smaller latency is better, as a smaller latency means that the computer can complete operations sooner.
Another important storage performance metric is throughput (or “bandwidth”), which is the number of operations completed per time unit. Throughput is often, though not always, the inverse of latency: for example, a device that handles one request at a time with a 10 ms latency completes at most 100 requests per second, while a device that handles many requests in parallel can achieve a much higher throughput than its latency alone would suggest. An ideal storage medium would have low latency and high throughput, as it takes very little time to complete a request, and many units of data can be transferred per second.
In reality, though, latency generally grows, and throughput drops, as storage media are further and further away from the processor. This is partly due to the storage technologies employed (some, like spinning harddisks, are cheap to manufacture, but slow), and partly due to the inevitable physics of sending information across longer and longer wires.
The table below shows the typical capacity, latency, and throughput achievable with the different storage technologies available in our computers.
| Storage type | Capacity | Latency | Throughput (random access) | Throughput (sequential) |
|---|---|---|---|---|
| Registers | ~30 registers (100s of bytes) | 0.5 ns | 16 GB/sec (2×10⁹ accesses/sec) | |
| DRAM (main memory) | 8 GB | 60 ns | 100 GB/sec | |
| SSD (stable storage) | 512 GB | 60 µs | 550 MB/sec | |
| Hard disk | 2–5 TB | 4–13 ms | 1 MB/sec | 200 MB/sec |
This notion of larger, cheaper, and slower storage further away from the processor, and smaller, faster, and more expensive storage closer to it is referred to as the storage hierarchy, as it’s possible to neatly rank storage according to these criteria. The storage hierarchy is often depicted as a pyramid, where wider (and lower) entries correspond to larger and slower forms of storage.

This picture includes processor caches, which are small regions of fast (SRAM) memory that are on the processor chip itself. The storage hierarchy shows the processor caches divided into multiple levels, with the L1 cache (sometimes pronounced “level-one cache”) closer to the processor than the L2, L3, and L4 caches. This reflects how processor caches are actually laid out, but we often think of a processor cache as a single unit.
Different computers have different sizes and access costs for these hierarchy levels; the ones in the table above are typical. Here are some more concrete numbers, based on Malte’s MacBook Air from ca. 2013: a few hundred bytes of registers; ~5 MB of processor cache; 8 GB primary memory; 256 GB SSD. The processor cache divides into three levels: 128 KB of total L1 cache, divided into four 32 KB components; each L1 cache is accessed only by a single processor core (which makes it faster, as cores don’t need to coordinate). There are 256 KB of L2 cache, and there are 4 MB of L3 cache shared by all cores.
Each layer in the storage hierarchy acts as a cache for the following layer.
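To see a processor cache at work, you can try a rough sketch like the one below (not course code; exact timings depend on your machine). It touches the same number of array elements twice, once sequentially and once in a cache-unfriendly strided order; the sequential pass is typically several times faster.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 26)   // 64M ints = 256 MB, much larger than any processor cache
#define STRIDE 4096L   // jump 16 KB between consecutive accesses

int main(void) {
    int* a = malloc(N * sizeof(int));
    if (a == NULL) {
        return 1;
    }
    for (long i = 0; i < N; i++) {
        a[i] = 1;
    }

    // Sequential pass: consecutive accesses reuse cache lines and benefit
    // from prefetching.
    long sum = 0;
    clock_t start = clock();
    for (long i = 0; i < N; i++) {
        sum += a[i];
    }
    double seq_secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    // Strided pass: the same number of accesses, but each one lands on a
    // cache line that has long since been evicted.
    start = clock();
    for (long s = 0; s < STRIDE; s++) {
        for (long i = s; i < N; i += STRIDE) {
            sum += a[i];
        }
    }
    double strided_secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("sum = %ld, sequential: %.2f s, strided: %.2f s\n",
           sum, seq_secs, strided_secs);
    free(a);
    return 0;
}
```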
Finally, consider how the concept of caching abounds in everyday life, too. Imagine how life would differ without, say, fast access to food storage – if every time you felt hungry, you had to walk to a farm and eat a carrot you pulled out of the dirt. Your whole day would be occupied with finding and eating food! (Indeed, this is what some animals spend most of their time doing.) Instead, your refrigerator (or your dorm’s refrigerator) acts as a cache for your neighborhood grocery store, and that grocery store acts as a cache for all the food producers worldwide.
Summary #
Today, we learned some handy rules about collections and their memory representation, and reviewed how these rules interact with alignment, particularly within structs. We saw that changing the order in which members are declared in a struct can significantly affect its size, meaning that alignment matters for writing efficient systems code.