### TOPICS IN COMPUTING WITH EMERGING TECHNOLOGIES

FALL 2020

BROWN

PROF. IRIS BAHAR

SEPTEMBER 16, 2020

LECTURE 3: MEMORY DESIGN

# OVERVIEW OF EMERGING TECHNOLOGIES

Read and comment on 2 survey papers on emerging technologies

- Find under Modules→Week #1
  - Read the following 2 papers:
    - Computing's Energy Problem (and what we can do about it)
    - The era of hyper-scaling in electronics
  - Click on <u>online discussion</u>
- Thanks to those who have already submitted.
- Post your comments ASAP.


### SUBMIT TOPICS OF INTEREST

- Find under Modules→Week #1
  - Click on list of topics
  - Also find under Assignments→Assignment #1
- Think of topics you are most interested in learning about this semester. They may or may not relate to your own research.
- Submit as a text entry with a list of 2-4 topics you would like to cover this semester.
- This will help me plan paper topics for the semester and pair up people with mutual interests.
- Due by Sept 16

### LECTURE SLIDES FOR WEEK #2

- Lecture slides are posted on Canvas (see Modules: week #2) and on the CS course webpages (cs.brown.edu/courses/csci2952j)
- Recommended textbooks:
  - Hennessy, Patterson, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann
  - Neil H. E. Weste and David Harris, CMOS VLSI Design: A Circuit and Systems Perspective, 4th Edition, Addison Wesley Publishers, 2011



- We will continue with our overview of memory design and conventional transistor design
- Paper discussion delayed by 1 week
- Papers for 4<sup>th</sup> week of class:
  - Emerging NVM: A Survey on Architectural Integration and Research Challenges
  - Memory that never forgets: emerging nonvolatile memory and the implication for architecture design
- I will post papers some time next week
  - I will assign teams for reviewing the papers
  - Expect different team assignments weekly
- I will also assign discussion leaders for the week
  - If you want to volunteer, let me know
  - We will rotate discussion leaders throughout the semester. Expect to lead 2-3 times
- Starting the 6<sup>th</sup> or 7<sup>th</sup> week of class, students will be able to choose papers to review


### SRAM: SIZING IS EVERYTHING



- Key takeaways:
  - SRAM is very stable because of reinforcement of cross-coupled inverters
  - Retains value as long as connected to power
  - Correct sizing of transistors prevents read disturb and allows overwrite on write
  - Dual rail adds extra noise resilience
- Use of sense amp to speed up transition of bit lines (!BL, BL) to "full rail" values



### ADVANCED DRAM ORGANIZATION

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs




# DRAM PERFORMANCE FACTORS

- Row buffer
  - Allows several words to be read and refreshed in parallel
- Synchronous DRAM
  - Allows for consecutive accesses in bursts without needing to send each address
  - Improves bandwidth
- DRAM banking
  - Allows simultaneous access to multiple DRAMs
  - Improves bandwidth



### CACHE MEMORY

- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses X<sub>1</sub>, ..., X<sub>n-1</sub>, X<sub>n</sub>
  - How do we know if the data is present?
  - Where do we look?

[Figure: cache contents (a) before the reference to X<sub>n</sub> and (b) after the reference to X<sub>n</sub>]








### CACHE EXAMPLE

- 8 blocks, 1 word/block, direct mapped
- Initial state

| Index | V   | Tag | Data       |
|-------|-----|-----|------------|
| 000   | N   |     |            |
| 001   | N   |     |            |
| 010   | N   |     |            |
| 011   | N   |     |            |
| 100   | N   |     |            |
| 101   | N   |     |            |
| 110   | N   |     |            |
| 111   | N   |     |            |



### CACHE EXAMPLE

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Miss     | 110         |

| Index | V   | Tag | Data       |
|-------|-----|-----|------------|
| 000   | N   |     |            |
| 001   | N   |     |            |
| 010   | N   |     |            |
| 011   | N   |     |            |
| 100   | N   |     |            |
| 101   | N   |     |            |
| 110   | Y   | 10  | Mem[10110] |
| 111   | N   |     |            |

### CACHE EXAMPLE

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 26        | 11 010      | Miss     | 010         |

| Index | V   | Tag | Data       |
|-------|-----|-----|------------|
| 000   | N   |     |            |
| 001   | N   |     |            |
| 010   | Y   | 11  | Mem[11010] |
| 011   | N   |     |            |
| 100   | N   |     |            |
| 101   | N   |     |            |
| 110   | Y   | 10  | Mem[10110] |
| 111   | N   |     |            |

### CACHE EXAMPLE

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Hit      | 110         |
| 26        | 11 010      | Hit      | 010         |

| Index | V   | Tag | Data       |
|-------|-----|-----|------------|
| 000   | N   |     |            |
| 001   | N   |     |            |
| 010   | Y   | 11  | Mem[11010] |
| 011   | N   |     |            |
| 100   | N   |     |            |
| 101   | N   |     |            |
| 110   | Y   | 10  | Mem[10110] |
| 111   | N   |     |            |

### CACHE EXAMPLE

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 16        | 10 000      | Miss     | 000         |
| 3         | 00 011      | Miss     | 011         |
| 16        | 10 000      | Hit      | 000         |

| Index | V   | Tag | Data       |
|-------|-----|-----|------------|
| 000   | Y   | 10  | Mem[10000] |
| 001   | N   |     |            |
| 010   | Y   | 11  | Mem[11010] |
| 011   | Y   | 00  | Mem[00011] |
| 100   | N   |     |            |
| 101   | N   |     |            |
| 110   | Y   | 10  | Mem[10110] |
| 111   | N   |     |            |


### CACHE EXAMPLE

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 18        | 10 010      | Miss     | 010         |

| Index | V   | Tag | Data       |
|-------|-----|-----|------------|
| 000   | Y   | 10  | Mem[10000] |
| 001   | N   |     |            |
| 010   | Y   | 10  | Mem[10010] |
| 011   | Y   | 00  | Mem[00011] |
| 100   | N   |     |            |
| 101   | N   |     |            |
| 110   | Y   | 10  | Mem[10110] |
| 111   | N   |     |            |
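The walkthrough in the tables above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `simulate_direct_mapped` is hypothetical, the cache size is assumed to be a power of two, and the access order (22, 26, 22, 26, 16, 3, 16, 18) is the one implied by the cache contents shown in the tables.

```python
def simulate_direct_mapped(word_addrs, num_blocks=8):
    """Simulate a direct-mapped cache with one word per block.

    Assumes num_blocks is a power of two (illustrative helper, not from the slides).
    """
    index_bits = num_blocks.bit_length() - 1
    # Each entry holds None (valid bit = N) or the stored tag (valid bit = Y).
    tags = [None] * num_blocks
    trace = []
    for addr in word_addrs:
        index = addr % num_blocks   # low-order bits select the cache block
        tag = addr >> index_bits    # remaining high-order bits form the tag
        hit = tags[index] == tag
        if not hit:
            tags[index] = tag       # fetch the block, replacing any previous occupant
        trace.append((addr, format(index, f"0{index_bits}b"), "Hit" if hit else "Miss"))
    return trace, tags


trace, tags = simulate_direct_mapped([22, 26, 22, 26, 16, 3, 16, 18])
for addr, index, outcome in trace:
    print(f"word {addr:2d} -> cache block {index}: {outcome}")
```

The final access (word 18) maps to block 010 with a different tag than the resident word 26, so it misses and replaces Mem[11010] with Mem[10010], matching the last table.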



### CACHE MISSES

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline (wait for data for CPU to proceed)
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access


# BLOCK SIZE CONSIDERATIONS

- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks → fewer unique blocks
    - More competition → increased miss rate
  - Larger blocks → pollution (if spatial locality is weak)
- Larger miss penalty
  - Takes longer to fill block with new data
  - Can override benefit of reduced miss rate
  - Early restart and critical-word-first can help

### ASSOCIATIVE CACHES

- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive)
- n-way set associative
  - Each set contains n entries
  - Block number determines which set
    - (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)



### ASSOCIATIVITY EXAMPLE

- Compare 4-block caches
  - Direct mapped, 2-way set associative, fully associative
  - Block access sequence: 0, 8, 0, 6, 8
- Direct mapped

| Block address | Cache index | Hit/miss | Cache content after access: entry 0 | entry 1 | entry 2 | entry 3 |
|---------------|-------------|----------|-------------------------------------|---------|---------|---------|
| 0             | 0           | miss     | Mem[0]                              |         |         |         |
| 8             | 0           | miss     | Mem[8]                              |         |         |         |
| 0             | 0           | miss     | Mem[0]                              |         |         |         |
| 6             | 2           | miss     | Mem[0]                              |         | Mem[6]  |         |
| 8             | 0           | miss     | Mem[8]                              |         | Mem[6]  |         |

Addresses 0 and 8 conflict in the cache (data thrashing)


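### ASSOCIATIVITY EXAMPLE

- 2-way set associative

| Block address | Cache index | Hit/miss | Set 0, entry 0 | Set 0, entry 1 | Set 1, entry 0 | Set 1, entry 1 |
|---------------|-------------|----------|----------------|----------------|----------------|----------------|
| 0             | 0           | miss     | Mem[0]         |                |                |                |
| 8             | 0           | miss     | Mem[0]         | Mem[8]         |                |                |
| 0             | 0           | hit      | Mem[0]         | Mem[8]         |                |                |
| 6             | 0           | miss     | Mem[0]         | Mem[6]         |                |                |
| 8             | 0           | miss     | Mem[8]         | Mem[6]         |                |                |

- Fully associative

| Block address | Hit/miss | Cache content after access: entry 0 | entry 1 | entry 2 | entry 3 |
|---------------|----------|-------------------------------------|---------|---------|---------|
| 0             | miss     | Mem[0]                              |         |         |         |
| 8             | miss     | Mem[0]                              | Mem[8]  |         |         |
| 0             | hit      | Mem[0]                              | Mem[8]  |         |         |
| 6             | miss     | Mem[0]                              | Mem[8]  | Mem[6]  |         |
| 8             | hit      | Mem[0]                              | Mem[8]  | Mem[6]  |         |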
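The three organizations can be compared with one small simulator. This is a sketch under stated assumptions: the function name `simulate` is hypothetical, and least-recently-used (LRU) replacement is assumed because the slides do not name a replacement policy. The set is chosen as (block number) modulo (#sets in cache).

```python
from collections import OrderedDict


def simulate(block_addrs, num_blocks, ways):
    """Return a hit/miss list for a cache with num_blocks entries, `ways` entries per set.

    ways=1 is direct mapped; ways=num_blocks is fully associative.
    LRU replacement is an assumption of this sketch.
    """
    num_sets = num_blocks // ways
    # One OrderedDict per set; key order tracks recency (least recently used first).
    sets = [OrderedDict() for _ in range(num_sets)]
    outcomes = []
    for block in block_addrs:
        entries = sets[block % num_sets]      # (block number) modulo (#sets in cache)
        if block in entries:
            entries.move_to_end(block)        # mark as most recently used
            outcomes.append("hit")
        else:
            if len(entries) == ways:
                entries.popitem(last=False)   # evict the least recently used entry
            entries[block] = True
            outcomes.append("miss")
    return outcomes


sequence = [0, 8, 0, 6, 8]
direct_mapped = simulate(sequence, num_blocks=4, ways=1)
two_way = simulate(sequence, num_blocks=4, ways=2)
fully_assoc = simulate(sequence, num_blocks=4, ways=4)
for name, result in [("direct mapped", direct_mapped),
                     ("2-way", two_way),
                     ("fully associative", fully_assoc)]:
    print(f"{name}: {result.count('miss')} misses")
```

For this access sequence the miss count falls from 5 (direct mapped) to 4 (2-way) to 3 (fully associative), in line with the tables.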




### MULTILEVEL CACHES

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache

### MULTILEVEL CACHE EXAMPLE

- Given
  - CPU base CPI = 1, clock rate = 4GHz → 0.25ns per clock cycle
  - Miss rate/instruction = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns/0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9


# EXAMPLE (CONT.)

- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit
  - Penalty = 5ns/0.25ns = 20 cycles
- Primary miss with L-2 miss
  - Extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9/3.4 = 2.6
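The arithmetic of the multilevel example can be checked with a short script. This is only a sketch: the helper name `effective_cpi` and its argument layout are illustrative choices, while the numbers are the ones given in the example.

```python
def effective_cpi(base_cpi, stall_terms):
    """Base CPI plus the sum of (misses per instruction x penalty in cycles)."""
    return base_cpi + sum(rate * cycles for rate, cycles in stall_terms)


clock_ns = 0.25                        # 4GHz clock -> 0.25ns per cycle
main_memory_cycles = 100 / clock_ns    # 100ns main memory access -> 400 cycles
l2_cycles = 5 / clock_ns               # 5ns L-2 access -> 20 cycles

# Primary cache only: every primary miss (2%) pays the full main memory penalty.
cpi_l1_only = effective_cpi(1, [(0.02, main_memory_cycles)])

# With L-2: every primary miss pays the L-2 penalty, and the 0.5% of
# instructions that also miss in L-2 pay the main memory penalty on top.
cpi_with_l2 = effective_cpi(1, [(0.02, l2_cycles), (0.005, main_memory_cycles)])

print(f"CPI, primary cache only: {cpi_l1_only:.1f}")
print(f"CPI, with L-2 cache:     {cpi_with_l2:.1f}")
print(f"Performance ratio:       {cpi_l1_only / cpi_with_l2:.1f}")
```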

# MAIN MEMORY SUPPORTING CACHES

- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
  - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
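The miss penalty above generalizes to other block sizes. The following sketch assumes 4-byte words (implied by 16 bytes for a 4-word block); the function name and parameter names are hypothetical.

```python
def miss_penalty_cycles(words_per_block, addr_cycles=1, dram_cycles=15, transfer_cycles=1):
    """Bus cycles to read one block from a 1-word-wide DRAM, one word at a time."""
    return addr_cycles + words_per_block * dram_cycles + words_per_block * transfer_cycles


BYTES_PER_WORD = 4                              # assumption: 4-byte words
penalty = miss_penalty_cycles(4)                # 1 + 4*15 + 4*1
bandwidth = 4 * BYTES_PER_WORD / penalty        # bytes per bus cycle

print(f"Miss penalty: {penalty} bus cycles")
print(f"Bandwidth:    {bandwidth:.2f} B/cycle")
```

Because each word pays the full DRAM access time, the penalty grows linearly with block size in this organization.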



### CACHE PERFORMANCE EXAMPLE

- Given
  - I-cache miss rate = 2%
  - D-cache miss rate = 4%
  - Miss penalty = 100 cycles
  - Base CPI (ideal cache) = 2
  - Load & stores are 36% of instructions
- Miss cycles per instruction
  - I-cache: 0.02 × 100 = 2
  - D-cache: 0.36 × 0.04 × 100 = 1.44
- Actual CPI = 2 + 2 + 1.44 = 5.44
  - Ideal CPU is 5.44/2 = 2.72 times faster
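The same calculation as a small function, which makes it easy to try other miss rates or penalties. This is a sketch; the function and parameter names are illustrative, and the values are those given in the example.

```python
def actual_cpi(base_cpi, icache_miss_rate, dcache_miss_rate, mem_ops_fraction, miss_penalty):
    """Base CPI plus instruction-fetch and data-access miss cycles per instruction."""
    # Every instruction is fetched, so the I-cache miss rate applies to all instructions.
    i_stalls = icache_miss_rate * miss_penalty
    # Only loads and stores access the D-cache.
    d_stalls = mem_ops_fraction * dcache_miss_rate * miss_penalty
    return base_cpi + i_stalls + d_stalls


base = 2
cpi = actual_cpi(base, icache_miss_rate=0.02, dcache_miss_rate=0.04,
                 mem_ops_fraction=0.36, miss_penalty=100)
print(f"Actual CPI: {cpi:.2f}")
print(f"Ideal-cache CPU is {cpi / base:.2f} times faster")
```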




## PERFORMANCE SUMMARY

- When CPU performance increases
  - Miss penalty becomes more significant
- Decreasing base CPI
  - Greater proportion of time spent on memory stalls
- Increasing clock rate
  - Memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance
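To see the effect of decreasing base CPI, the sketch below reuses the 3.44 miss cycles per instruction from the cache performance example and compares the original base CPI of 2 with a hypothetical faster core whose base CPI is 1. The base CPI of 1 and the helper name `stall_fraction` are assumptions made for illustration.

```python
def stall_fraction(base_cpi, stall_cpi):
    """Fraction of execution cycles spent on memory stalls."""
    return stall_cpi / (base_cpi + stall_cpi)


# I-cache plus D-cache miss cycles per instruction from the earlier example.
stall_cpi = 0.02 * 100 + 0.36 * 0.04 * 100

# Base CPI 2 is from the example; base CPI 1 is a hypothetical faster core.
fractions = {base: stall_fraction(base, stall_cpi) for base in (2, 1)}
for base, frac in fractions.items():
    print(f"base CPI {base}: {frac:.0%} of cycles are memory stalls")
```

With the memory system unchanged, halving the base CPI raises the share of time lost to memory stalls, which is why cache behavior cannot be left out of a performance evaluation.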

### INTERACTIONS WITH ADVANCED CPUS

- Out-of-order CPUs can execute instructions during cache miss
  - Pending store stays in load/store unit
  - Dependent instructions wait in reservation stations
  - Independent instructions continue
- Effect of miss depends on program data flow
  - Much harder to analyse
  - Use system simulation