CS 33

Caches

Many of the slides in this lecture are either from or adapted from slides provided by the authors of the textbook “Computer Systems: A Programmer’s Perspective,” 2nd Edition and are provided from the website of Carnegie-Mellon University, course 15-213, taught by Randy Bryant and David O’Hallaron in Fall 2010. These slides are indicated “Supplied by CMU” in the notes section of the slides.
Cache Memories

- **Cache memories** are small, fast SRAM-based memories managed automatically in hardware
  - hold frequently accessed blocks of main memory
- CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory
- Typical system structure:

Supplied by CMU.
General Cache Organization (S, E, B)

$E = 2^e$ lines per set

$S = 2^s$ sets

$B = 2^b$ bytes per cache block (the data)

**Cache size:**

$C = S \times E \times B$ data bytes

Supplied by CMU.
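The (S, E, B) organization implies that an address splits into a tag, s set-index bits, and b block-offset bits. A minimal sketch of that split follows; the type `addr_fields` and the functions `split_address` and `cache_size` are illustrative names, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of a memory address. */
typedef struct {
    uint64_t tag;     /* remaining high-order bits */
    uint64_t set;     /* s bits selecting one of S = 2^s sets */
    uint64_t offset;  /* b bits selecting a byte within the block */
} addr_fields;

/* Split addr for a cache with 2^s sets and 2^b bytes per block. */
addr_fields split_address(uint64_t addr, unsigned s, unsigned b)
{
    assert(s + b < 64);  /* keep the shifts well defined */
    addr_fields f;
    f.offset = addr & ((1ULL << b) - 1);
    f.set    = (addr >> b) & ((1ULL << s) - 1);
    f.tag    = addr >> (s + b);
    return f;
}

/* Data capacity in bytes: C = S * E * B. */
uint64_t cache_size(uint64_t S, uint64_t E, uint64_t B)
{
    return S * E * B;
}
```

For example, with s = 2 and b = 1, address 7 (binary 0111) falls in set 3 with tag 0 and offset 1.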
Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set
Assume: cache block size 8 bytes

Supplied by CMU.
Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set
Assume: cache block size 8 bytes

No match: old line is evicted and replaced

Supplied by CMU.
Direct-Mapped Cache Simulation

M=16 byte addresses, B=2 bytes/block,
S=4 sets, E=1 blocks/set

Address trace (reads, one byte per read):
0  [0000], miss
1  [0001], hit
7  [0111], miss
8  [1000], miss
0  [0000], miss

<table>
<thead>
<tr>
<th>Set</th>
<th>Tag</th>
<th>Block</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>M[0-1]</td>
</tr>
<tr>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>M[6-7]</td>
</tr>
</tbody>
</table>
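The trace above can be replayed with a small simulator. This is a sketch only, assuming the slide's parameters (S = 4, E = 1, B = 2); the names `dm_line` and `simulate_direct_mapped` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NSETS 4        /* S = 4 sets */
#define BLOCK_BYTES 2  /* B = 2 bytes per block */

typedef struct {
    bool valid;
    unsigned tag;
} dm_line;

/* Simulate reads on a direct-mapped cache (E = 1).
 * Writes 'H' or 'M' per access into out, then a terminating '\0',
 * so out must have room for n + 1 characters. */
void simulate_direct_mapped(const unsigned *trace, int n, char *out)
{
    assert(trace != NULL && out != NULL);
    dm_line cache[NSETS];
    memset(cache, 0, sizeof cache);  /* cold cache: all lines invalid */
    for (int i = 0; i < n; i++) {
        unsigned block = trace[i] / BLOCK_BYTES;
        unsigned set = block % NSETS;
        unsigned tag = block / NSETS;
        if (cache[set].valid && cache[set].tag == tag) {
            out[i] = 'H';
        } else {
            out[i] = 'M';
            cache[set].valid = true;  /* evict old line, load new block */
            cache[set].tag = tag;
        }
    }
    out[n] = '\0';
}
```

Addresses 0 and 8 both map to set 0 with different tags, which is why the final read of address 0 misses again.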
A Higher-Level Example

```c
double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}
```

```c
double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
```

Ignore the variables sum, i, j.
Assume: cold (empty) cache; a[0][0] goes at the position marked in the slide's figure.

Block size: 32 B = 4 doubles

Supplied by CMU.
Note that the cache holds two rows of the matrix.
For each reference to an element of the matrix, its entire row is brought into the cache, even though the rest of the row is not immediately used.
If arrays x and y in the dot-product function below have the same alignment, i.e., both start in the same cache set, then each access to an element of y replaces the cache line containing the corresponding element of x, and vice versa. The result is that the loop executes very slowly: each access to either array results in a conflict miss.
However, if the two arrays start in different cache sets, then the loop executes quickly: there is a cache miss on only every fourth access to each array.

```c
double dotprod(double x[8], double y[8]) {
    double sum = 0.0;
    int i;
    for (i=0; i<8; i++)
        sum += x[i] * y[i];
    return sum;
}
```
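The effect of alignment on this loop can be counted with a toy model. The sketch below assumes a direct-mapped cache with 2 sets of 32-byte blocks and 8-byte doubles; those sizes, the base addresses passed in, and the name `dotprod_misses` are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DP_SETS 2          /* assumed: 2 sets, direct mapped */
#define DP_BLOCK_BYTES 32  /* 32 B = 4 doubles per block */
#define DP_ELEM_BYTES 8    /* assumed size of a double */

/* Count misses for the access pattern x[0], y[0], x[1], y[1], ...
 * given hypothetical byte addresses at which x and y start. */
int dotprod_misses(size_t x_base, size_t y_base, int n)
{
    assert(n >= 0);
    bool valid[DP_SETS] = { false };
    size_t tag[DP_SETS] = { 0 };
    int misses = 0;
    for (int i = 0; i < 2 * n; i++) {
        size_t base = (i % 2 == 0) ? x_base : y_base;
        size_t addr = base + (size_t)(i / 2) * DP_ELEM_BYTES;
        size_t block = addr / DP_BLOCK_BYTES;
        size_t set = block % DP_SETS;
        if (!valid[set] || tag[set] != block / DP_SETS) {
            misses++;                  /* cold or conflict miss */
            valid[set] = true;
            tag[set] = block / DP_SETS;
        }
    }
    return misses;
}
```

With x at address 0 and y at address 64 the two arrays start in the same set and all 16 accesses miss; moving y to address 96 puts them in different sets and only 4 accesses miss.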
E-way Set-Associative Cache (Here: E = 2)

E = 2: two lines per set
Assume: cache block size 8 bytes

valid? + match: yes = hit

Address of short int: | t bits (tag) | 001 (set index) | 100 (block offset) |

Supplied by CMU.
E-way Set-Associative Cache (Here: E = 2)

E = 2: two lines per set
Assume: cache block size 8 bytes

valid? + match: yes = hit

No match:
- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

Supplied by CMU.
### Quiz 1

Given the address above and the cache contents as shown, what is the value of the `int` at the given address?

- a) 1111
- b) 3333
- c) 4444
- d) 7777
2-Way Set-Associative Cache Simulation

\[ t=2 \quad s=1 \quad b=1 \]

M=16 byte addresses, B=2 bytes/block,
S=2 sets, E=2 blocks/set

Address trace (reads, one byte per read):

<table>
<thead>
<tr>
<th>Address</th>
<th>Outcome</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 [0000]</td>
<td>miss</td>
</tr>
<tr>
<td>1 [0001]</td>
<td>hit</td>
</tr>
<tr>
<td>7 [0111]</td>
<td>miss</td>
</tr>
<tr>
<td>8 [1000]</td>
<td>miss</td>
</tr>
<tr>
<td>0 [0000]</td>
<td>hit</td>
</tr>
</tbody>
</table>

Set 0

<table>
<thead>
<tr>
<th>v</th>
<th>Tag</th>
<th>Block</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>00</td>
<td>M[0-1]</td>
</tr>
<tr>
<td>1</td>
<td>10</td>
<td>M[8-9]</td>
</tr>
</tbody>
</table>

Set 1

<table>
<thead>
<tr>
<th>v</th>
<th>Tag</th>
<th>Block</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>01</td>
<td>M[6-7]</td>
</tr>
<tr>
<td>0</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Supplied by CMU.
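The same trace can be replayed on the 2-way cache. This sketch assumes LRU replacement (one of the policies listed earlier) and the slide's parameters S = 2, E = 2, B = 2; `sa_line` and `simulate_two_way` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SA_SETS 2   /* S = 2 sets */
#define SA_WAYS 2   /* E = 2 lines per set */
#define SA_BLOCK 2  /* B = 2 bytes per block */

typedef struct {
    bool valid;
    unsigned tag;
    unsigned last_used;  /* time of most recent access, for LRU */
} sa_line;

/* Writes 'H' or 'M' per access into out, then '\0' (n + 1 chars). */
void simulate_two_way(const unsigned *trace, int n, char *out)
{
    assert(trace != NULL && out != NULL);
    sa_line cache[SA_SETS][SA_WAYS];
    memset(cache, 0, sizeof cache);  /* cold cache */
    for (int i = 0; i < n; i++) {
        unsigned block = trace[i] / SA_BLOCK;
        sa_line *set = cache[block % SA_SETS];
        unsigned tag = block / SA_SETS;
        int way = -1;
        for (int w = 0; w < SA_WAYS; w++)
            if (set[w].valid && set[w].tag == tag)
                way = w;
        out[i] = (way >= 0) ? 'H' : 'M';
        if (way < 0) {
            /* pick a victim: an empty line if any, else the LRU line */
            way = 0;
            for (int w = 1; w < SA_WAYS; w++) {
                if (!set[way].valid)
                    break;
                if (!set[w].valid || set[w].last_used < set[way].last_used)
                    way = w;
            }
            set[way].valid = true;
            set[way].tag = tag;
        }
        set[way].last_used = (unsigned)i + 1;
    }
    out[n] = '\0';
}
```

Because blocks M[0-1] and M[8-9] can now share set 0, the final read of address 0 hits.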
A Higher-Level Example

```c
double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];

    return sum;
}
```

```c
double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;

    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];

    return sum;
}
```

Ignore the variables sum, i, j

Assume: cold (empty) cache, a[0][0] goes here

32 B = 4 doubles

Supplied by CMU.
The cache still holds two rows of the matrix, but each row may go into one of two different cache lines. In the slide, the first row goes into the first lines of the cache sets, the second row goes into the second lines of the cache sets.
A Higher-Level Example

```c
double sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}

double sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
```

There is still a cache miss on each access.
With a 2-way set-associative cache, our dot-product example runs quickly even if the two arrays have the same alignment.
Supplied by CMU.

The L3 cache is known as the *last-level cache* (LLC) in the Intel documentation.
What About Writes?

- **Multiple copies of data exist:**
  - L1, L2, main memory, disk

- **What to do on a write-hit?**
  - write-through (write immediately to memory)
  - write-back (defer write to memory until replacement of line)
    » need a dirty bit (line different from memory or not)

- **What to do on a write-miss?**
  - write-allocate (load into cache, update line in cache)
    » good if more writes to the location follow
  - no-write-allocate (writes immediately to memory)

- **Typical**
  - write-through + no-write-allocate
  - write-back + write-allocate
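To see why write-back can save memory traffic, here is a deliberately tiny model: a cache with a single line, counting how many writes reach memory for a sequence of stores to block numbers. The one-line cache and both function names are simplifying assumptions, not something the slides specify.

```c
#include <assert.h>
#include <stdbool.h>

/* Write-through: every store is written to memory immediately. */
int writes_write_through(const int *blocks, int n)
{
    assert(n >= 0);
    (void)blocks;  /* the count does not depend on which blocks */
    return n;
}

/* Write-back + write-allocate on a one-line cache: a dirty line is
 * written to memory only when it is replaced, plus one final flush. */
int writes_write_back(const int *blocks, int n)
{
    assert(n >= 0);
    bool valid = false, dirty = false;
    int cached = 0, mem_writes = 0;
    for (int i = 0; i < n; i++) {
        if (!valid || cached != blocks[i]) {  /* write miss */
            if (valid && dirty)
                mem_writes++;                 /* evict the dirty line */
            cached = blocks[i];               /* write-allocate */
            valid = true;
        }
        dirty = true;                         /* line now differs from memory */
    }
    if (valid && dirty)
        mem_writes++;                         /* flush at the end */
    return mem_writes;
}
```

For the store sequence 3, 3, 3, 5, 5 the write-through count is 5 while the write-back count is 2, which illustrates the "good if more writes to the location follow" remark.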
Cache Performance Metrics

- **Miss rate**
  - fraction of memory references not found in cache (misses / accesses)
    - equal to 1 - hit rate
  - typical numbers (in percentages):
    - 3-10% for L1
    - can be quite small (e.g., < 1%) for L2, depending on size, etc.

- **Hit time**
  - time to deliver a line in the cache to the processor
    - includes time to determine whether the line is in the cache
  - typical numbers:
    - 1-2 clock cycles for L1
    - 5-20 clock cycles for L2

- **Miss penalty**
  - additional time required because of a miss
    - typically 50-200 cycles for main memory (trend: increasing!)
Let’s Think About Those Numbers

• Huge difference between a hit and a miss
  – could be 100x, if just L1 and main memory

• Would you believe 99% hit rate is twice as good as 97%?
  – consider:
    cache hit time of 1 cycle
    miss penalty of 100 cycles
  – average access time:
    97% hits: \(0.97 \times 1 \text{ cycle} + 0.03 \times 100 \text{ cycles} \approx 4 \text{ cycles}\)
    99% hits: \(0.99 \times 1 \text{ cycle} + 0.01 \times 100 \text{ cycles} \approx 2 \text{ cycles}\)

• This is why “miss rate” is used instead of “hit rate”
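The arithmetic above can be wrapped in a one-line function. This is a sketch using the same weighting as the slide (hit rate times hit time plus miss rate times miss penalty); the name `average_access_time` is illustrative.

```c
#include <assert.h>

/* Average access time in cycles:
 * hit_rate * hit_time + (1 - hit_rate) * miss_penalty. */
double average_access_time(double hit_rate, double hit_time,
                           double miss_penalty)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty;
}
```

Calling it with (0.97, 1, 100) and (0.99, 1, 100) gives roughly 3.97 and 1.99 cycles, so a two-point change in hit rate about halves the average access time.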
Writing Cache-Friendly Code

• Make the common case go fast
  – focus on the inner loops of the core functions

• Minimize the misses in the inner loops
  – repeated references to variables are good (temporal locality)
  – stride-1 reference patterns are good (spatial locality)

Key idea: our qualitative notion of locality is quantified through our understanding of cache memories

Supplied by CMU.
Miss-Rate Analysis for Matrix Multiply

- **Assume:**
  - Block size = 32B (big enough for four 64-bit words)
  - matrix dimension (N) is very large
    » approximate 1/N as 0.0
  - cache is not big enough to hold multiple rows

- **Analysis method:**
  - look at access pattern of inner loop

Supplied by CMU.
Matrix Multiplication Example

- **Description:**
  - multiply N x N matrices
  - O(N^3) total operations
  - N reads per source element
  - N values summed per destination
    » but may be able to hold in register

```c
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

Supplied by CMU.
Layout of C Arrays in Memory (review)

- C arrays allocated in row-major order
  - each row in contiguous memory locations
- Stepping through columns in one row:
  - `for (i = 0; i < N; i++) sum += a[0][i];`
  - accesses successive elements
  - if block size (B) > 4 bytes, exploit spatial locality
    » compulsory miss rate = 4 bytes / B (for 4-byte elements)
- Stepping through rows in one column:
  - `for (i = 0; i < N; i++) sum += a[i][0];`
  - accesses distant elements
  - no spatial locality!
    » compulsory miss rate = 1 (i.e. 100%)

Supplied by CMU.
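Row-major layout and the resulting miss rates can be expressed directly. In this sketch, `row_major_offset` and `compulsory_miss_rate` are hypothetical helpers, and the miss-rate model assumes the data is much larger than the cache, as in the analysis above.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a[i][j] from the start of an array with n columns
 * and elem_bytes-wide elements (row-major order, as in C). */
size_t row_major_offset(size_t i, size_t j, size_t n, size_t elem_bytes)
{
    assert(j < n);
    return (i * n + j) * elem_bytes;
}

/* Fraction of accesses that miss when stepping with a fixed stride
 * (in bytes): one miss per block for small strides, and one miss per
 * access once the stride reaches the block size. */
double compulsory_miss_rate(size_t stride_bytes, size_t block_bytes)
{
    assert(stride_bytes > 0 && block_bytes > 0);
    if (stride_bytes >= block_bytes)
        return 1.0;
    return (double)stride_bytes / (double)block_bytes;
}
```

For 8-byte doubles and 32-byte blocks, stepping along a row (stride 8) gives a miss rate of 0.25, while stepping down a column of a 100-column array (stride 800) gives 1.0.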
Assume we are multiplying arrays of doubles: each element is eight bytes long, so a 32-byte cache line holds four matrix elements.
Matrix Multiplication (jik)

```c
/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```

### Inner loop:
- A: row-wise
- B: column-wise
- C: fixed

### Misses per inner loop iteration:

<table>
<thead>
<tr>
<th></th>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0.25</td>
<td>1.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
Matrix Multiplication (kij)

```c
/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
```

**Inner loop:**

- A: fixed
- B: row-wise
- C: row-wise

**Misses per inner loop iteration:**

<table>
<thead>
<tr>
<th></th>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0.0</td>
<td>0.25</td>
<td>0.25</td>
</tr>
</tbody>
</table>

Supplied by CMU.
Matrix Multiplication (ikj)

```c
/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
```

Misses per inner loop iteration:

<table>
<thead>
<tr>
<th></th>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0.0</td>
<td>0.25</td>
<td>0.25</td>
</tr>
</tbody>
</table>

Supplied by CMU.
Matrix Multiplication (jki)

```c
/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
```

Misses per inner loop iteration:

<table>
<thead>
<tr>
<th></th>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>1.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>

Supplied by CMU.
Matrix Multiplication (kji)

```c
/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
```

**Misses per inner loop iteration:**

<table>
<thead>
<tr>
<th></th>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead>
<tbody>
<tr>
<td>Misses</td>
<td>1.0</td>
<td>0.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>

Supplied by CMU.
Summary of Matrix Multiplication

For \( i = 0 \) to \( n - 1 \)
  For \( j = 0 \) to \( n - 1 \)
    \( \text{sum} = 0.0 \)
    For \( k = 0 \) to \( n - 1 \)
      \( \text{sum} += a[i][k] \times b[k][j] \)
    End for
    \( c[i][j] = \text{sum} \)
  End for
End for

\( ijk \) & \( jik \):
- 2 loads, 0 stores
- misses/iter = 1.25

For \( k = 0 \) to \( n - 1 \)
  For \( i = 0 \) to \( n - 1 \)
    \( r = a[i][k] \)
    For \( j = 0 \) to \( n - 1 \)
      \( c[i][j] += r \times b[k][j] \)
    End for
  End for
End for

\( kij \) & \( ikj \):
- 2 loads, 1 store
- misses/iter = 0.5

For \( j = 0 \) to \( n - 1 \)
  For \( k = 0 \) to \( n - 1 \)
    \( r = b[k][j] \)
    For \( i = 0 \) to \( n - 1 \)
      \( c[i][j] += a[i][k] \times r \)
    End for
  End for
End for

\( jki \) & \( kji \):
- 2 loads, 1 store
- misses/iter = 2.0

Supplied by CMU.
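All loop orders compute the same product; only the memory access pattern differs. The sketch below gives one function per class on flat row-major arrays so they can be compared or timed; the function names and the flat-array layout are choices made here for illustration.

```c
#include <assert.h>

/* ijk: inner loop reads a row of a and a column of b. */
void mm_ijk(const double *a, const double *b, double *c, int n)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* kij: inner loop walks rows of b and c (stride 1). */
void mm_kij(const double *a, const double *b, double *c, int n)
{
    assert(n > 0);
    for (int x = 0; x < n * n; x++)
        c[x] = 0.0;  /* this order accumulates into c */
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = a[i*n + k];
            for (int j = 0; j < n; j++)
                c[i*n + j] += r * b[k*n + j];
        }
}

/* jki: inner loop walks columns of a and c (stride n). */
void mm_jki(const double *a, const double *b, double *c, int n)
{
    assert(n > 0);
    for (int x = 0; x < n * n; x++)
        c[x] = 0.0;
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double r = b[k*n + j];
            for (int i = 0; i < n; i++)
                c[i*n + j] += a[i*n + k] * r;
        }
}
```

Timing these on a large n is one way to observe the 1.25, 0.5, and 2.0 misses-per-iteration estimates; actual results depend on the machine.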
Matrix Multiplication: More Analysis

```c
/* Multiply n x n matrices a and b. */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n+j] += a[i*n + k]*b[k*n + j];
}
```

Supplied by CMU.
Cache-Miss Analysis

• Assume:
  – matrix elements are doubles
  – cache block = 8 doubles
  – cache size $C \ll n$ (much smaller than $n$)

• First iteration:
  – $n/8 + n = 9n/8$ misses
  – afterwards in cache (schematic; figure label: "8 wide")
Cache-Miss Analysis

- Assume:
  - matrix elements are doubles
  - cache block = 8 doubles
  - cache size C << n (much smaller than n)

- Second iteration:
  - again: $n/8 + n = 9n/8$ misses

- Total misses:
  - $9n/8 \times n^2 = (9/8) \times n^3$
Blocked Matrix Multiplication

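```c
/* Multiply n x n matrices a and b using B x B blocks.
 * Assumes B divides n and c is zero-initialized. */
void mm(double *a, double *b, double *c, int n, int B) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplication */
                for (i1 = i; i1 < i + B; i1++)
                    for (j1 = j; j1 < j + B; j1++)
                        for (k1 = k; k1 < k + B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
```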

Supplied by CMU.
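In practice n is not always a multiple of the block size. One possible way to handle that is to clamp each tile at the matrix edge, as sketched below; `mm_blocked`, `min_int`, and the `bsize` parameter are illustrative names, and this is not claimed to be the textbook's version.

```c
#include <assert.h>

/* Hypothetical helper: smaller of two ints. */
static int min_int(int x, int y) { return x < y ? x : y; }

/* c += a * b for n x n row-major matrices, processed in tiles of at
 * most bsize x bsize elements. Tiles are clamped at the edges, so n
 * need not be a multiple of bsize. Zero c first to get c = a * b. */
void mm_blocked(const double *a, const double *b, double *c,
                int n, int bsize)
{
    assert(n > 0 && bsize > 0);
    for (int i = 0; i < n; i += bsize)
        for (int j = 0; j < n; j += bsize)
            for (int k = 0; k < n; k += bsize) {
                int i_end = min_int(i + bsize, n);
                int j_end = min_int(j + bsize, n);
                int k_end = min_int(k + bsize, n);
                /* multiply one tile of a by one tile of b */
                for (int i1 = i; i1 < i_end; i1++)
                    for (int j1 = j; j1 < j_end; j1++) {
                        double sum = c[i1*n + j1];
                        for (int k1 = k; k1 < k_end; k1++)
                            sum += a[i1*n + k1] * b[k1*n + j1];
                        c[i1*n + j1] = sum;
                    }
            }
}
```

Keeping the running sum in a local variable follows the earlier remark that the destination value may be held in a register.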
Cache-Miss Analysis

- **Assume:**
  - cache block = 8 doubles
  - cache size $C \ll n$ (much smaller than $n$)
  - three blocks fit into cache: $3B^2 < C$

- **First (block) iteration:**
  - $B^2/8$ misses for each block
  - $2n/B \times B^2/8 = nB/4$ (omitting matrix $c$)
  - afterwards in cache: (schematic; figure not shown)

Supplied by CMU.
Cache-Miss Analysis

- Assume:
  - cache block = 8 doubles
  - cache size C << n (much smaller than n)
  - three blocks fit into cache: $3B^2 < C$

- Second (block) iteration:
  - same as first iteration
  - $2n/B \times B^2/8 = nB/4$

- Total misses:
  - $nB/4 \times (n/B)^2 = n^3/(4B)$
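The two totals can be compared numerically. This sketch simply evaluates the formulas from the analysis (8 doubles per cache block); the function names are illustrative, and the results are estimates from the model, not measurements.

```c
#include <assert.h>

/* Estimated misses without blocking: (9/8) * n^3. */
double misses_unblocked(double n)
{
    assert(n > 0.0);
    return 9.0 / 8.0 * n * n * n;
}

/* Estimated misses with B x B blocking: n^3 / (4B). */
double misses_blocked(double n, double B)
{
    assert(n > 0.0 && B > 0.0);
    return n * n * n / (4.0 * B);
}
```

The ratio of the two estimates is 4.5B regardless of n; for example, B = 32 predicts 144 times fewer misses under this model.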
Summary

- No blocking: \((9/8) \times n^3\)
- Blocking: \(1/(4B) \times n^3\)

- Suggest largest possible block size \(B\), but limit \(3B^2 < C\)!

- **Reason for dramatic difference:**
  - matrix multiplication has inherent temporal locality:
    - input data: \(3n^2\), computation \(2n^3\)
    - every array element used \(O(n)\) times!
  - but program has to be written properly

Supplied by CMU.
Quiz 2

What is the smallest value of B (in 8-byte doubles) for which the cache-miss analysis works?

a) 1
b) 2
c) 4
d) 8
Concluding Observations

- **Programmer can optimize for cache performance**
  - how data structures are organized
  - how data are accessed
    - nested loop structure
    - blocking is a general technique
- **All systems favor “cache-friendly code”**
  - getting absolute optimum performance is very platform specific
    - cache sizes, line sizes, associativities, etc.
  - can get most of the advantage with generic code
    - keep working set reasonably small (temporal locality)
    - use small strides (spatial locality)