Lecture 16: Virtual Memory and Page Tables

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, April 6).

Virtual Memory

We previously used memory protection to isolate kernel memory from user-space processes, preventing attacks where user-space processes write to kernel memory. But this is not enough – we also need to prevent user-space processes from accessing the memory of other user-space processes!

In DemoOS as we have looked at it in lectures so far (and also in WeensyOS at the start of Project 3), there is no isolation between processes. The only thing that prevents utter chaos is that the processes happen to use non-overlapping memory addresses; for example, p-alice starts at address 0x10'0000 and ends at 0x13'FFFF, while p-eve starts at 0x14'0000 and ends at 0x17'FFFF (and likewise with the first and second process on WeensyOS).

There are a couple of problems with this approach:

  1. If Eve successfully guesses an address within Alice's memory, she can read and modify Alice's data.
  2. If Alice's and Eve's processes ever accidentally use the same address, they will corrupt each other's memory; programmers need to carefully choose non-overlapping memory regions for their processes.
  3. The processes' memory regions are of fixed size, and a process that needs more than, say, the 0x40000 bytes of memory between the bottom and top address (= 256 KB) either cannot run or needs to carefully avoid any memory used by other processes.
  4. If we have many processes, there may not be enough memory to run them all, as we need to pre-reserve a fixed amount of memory for each process (in the examples, 0x40000 bytes = 256 KB).
So we need something safer and more flexible! Virtual memory is a concept that achieves both these goals.

Demo: Eve Attacking Alice's Memory

Two relatively simple attacks demonstrate the danger of giving Eve's process access to the memory of Alice's process. In the first, Eve might form a pointer into data stored, e.g., on Alice's stack and change that data, for example to make Alice print an attack message rather than her normal Hi, I'm Alice! message.

In the lecture demo as compiled, the address of Alice's message on her process's stack is 0x13'ff59. (You can obtain this by printing the address &msg from p-alice.cc, or by looking at the disassembly in obj/p-alice.asm). Eve may modify her program as follows to overwrite Alice's message:

--- p-eve.cc    2020-03-05 10:24:04.760050399 -0500
+++ p-eve.cc    2020-03-05 10:25:21.671907556 -0500
@@ -7,6 +7,8 @@
         if (i % 1024 == 0) {
             console_printf(0x0E00, "Hi, I'm Eve! #%d\n", i / 512);
         }
+        char* msg = (char*) 0x13ff59;
+        snprintf(msg, 15, "EVE ATTACK!");

         if (i % 2048 == 0) {
           char* syscall = (char*) 0x40ad6;

What's the syntax of the above listing?

This file is a unified diff, which is a format for expressing differences between text files (sometimes referred to as "patches"). It's the format in which the git diff command produces its output. Developers often use diffs to concisely show changes to code. The black lines starting with a space are context lines, which indicate where in the file the changes should occur; the green lines starting with a "+" sign indicate lines to be added; if there were lines starting with a "-" sign, they would indicate lines to be removed (typically shown in red). More about the diff format!

An even worse attack involves Eve writing to Alice's code in the static segment of her process. Recall that the code segment contains the machine instructions executed by Alice's process. If Eve finds a convenient place to sneakily insert new code, she can cause Alice to compute on her behalf, or worse, force Alice into an infinite loop, permanently disabling her process.

Remember that the two bytes 0xEB 0xFE correspond to an infinite loop in x86-64 machine code (a two-byte instruction encoding an unconditional, relative jump by -2 bytes). If Eve writes these bytes into the first two bytes of any instruction in the inner loop of Alice's code, Alice will enter an infinite loop the next time she executes the instructions at that address. For example, one such address is 0x10'0077, normally a mov instruction just after Alice's process returns from a system call (see obj/p-alice.asm). Eve might make the following change to her program to corrupt Alice's process:

--- p-eve.cc    2020-03-05 10:26:32.893535962 -0500
+++ p-eve.cc    2020-03-05 10:25:51.894988665 -0500
@@ -11,9 +11,9 @@
         snprintf(msg, 15, "EVE ATTACK!");

         if (i % 2048 == 0) {
+          char* alicecode = (char*) 0x100077;
+          alicecode[0] = 0xEB;
+          alicecode[1] = 0xFE;
-          char* syscall = (char*) 0x40ad6;
-          syscall[0] = 0xEB;
-          syscall[1] = 0xFE;

           console_printf(0x0D00, "MWAHAHAHAHAHAH EVE REIGNS SUPREME!\n");
         }
This replaces the prior attack on the kernel code (which got Eve's process killed, as we now protect the kernel memory) with an attack on Alice's code.
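
To see why the bytes 0xEB 0xFE form an infinite loop, it can help to compute the jump target by hand. The following is a minimal sketch, not part of the lecture code; the helper name short_jump_target is made up for illustration. It relies on the fact that a short jump's displacement is a signed 8-bit value relative to the address of the next instruction.

```cpp
#include <cassert>
#include <cstdint>

// Computes the target of an x86-64 short jump (opcode 0xEB followed by a
// signed 8-bit displacement) located at address `insn_addr`.
// The displacement is relative to the address of the NEXT instruction,
// which is insn_addr + 2 for this two-byte encoding.
// (short_jump_target is a hypothetical helper, not from the lecture code.)
constexpr uint64_t short_jump_target(uint64_t insn_addr, uint8_t displacement) {
    int8_t rel = static_cast<int8_t>(displacement);  // 0xFE becomes -2
    return insn_addr + 2 + static_cast<int64_t>(rel);
}

// 0xEB 0xFE placed at 0x100077 jumps to 0x100077 + 2 - 2 = 0x100077,
// i.e., back to itself, forever.
static_assert(short_jump_target(0x100077, 0xFE) == 0x100077,
              "0xEB 0xFE jumps to its own address");
```

Since the jump lands on its own first byte, the processor keeps re-executing the same instruction, no matter which address Eve writes the two bytes to.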

Back to Virtual Memory!

The basic idea behind virtual memory is to create, for each user-space process, the illusion that it runs alone on the computer and has access to the computer's full memory. In other words, we seek to give different processes different views of the actual memory.

Recall that the (physical) memory in DemoOS is roughly laid out as follows:

         0x0
            +--------------------------------------------------------------------+
null page ->|R                                                                   |
            +--------------------------------------------------------------------+
 0x40000 -->|KKKKKKKKKKKKKKKKKKKKKKKKKKKKKK                                     K| <-- kernel stack
(kernel mem)+--------------------------------------------------------------------+
            |                                    RRRRRRRRRRRRRRRRRRRRRCRRRRRRRRRR| console @ 0xB8000
            +--------------------------------------------------------------------+
            |RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|
            +--------------------------------------------------------------------+
0x100000 -->|Code|Data|Heap ...       Alice's process memory            ... Stack| <-- 0x13ffff
            +--------------------------------------------------------------------+
0x140000 -->|Code|Data|Heap ...         Eve's process memory            ... Stack| <-- 0x17ffff
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
                                                                                 0x1fffffff (MEMSIZE_PHYSICAL - 1)

What we'd like to achieve through virtual memory is that Alice's user-space process has the following view of this memory:

         0x0
            +--------------------------------------------------------------------+
null page ->| XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
 0x40000 -->| XXX NO ACCESS in userspace XXX                                     | <-- kernel stack
(kernel mem)+--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                       C          | console @ 0xB8000 (can access)
            +--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
0x100000 -->|Code|Data|Heap ...       Alice's process memory            ... Stack| <-- 0x13ffff
            +--------------------------------------------------------------------+
0x140000 -->| XXX NO ACCESS (Eve's memory) XXX                                   | <-- 0x17ffff
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
                                                                                 0x1fffffff (MEMSIZE_PHYSICAL - 1)

... while Eve's user-space process should see this view:

         0x0
            +--------------------------------------------------------------------+
null page ->| XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
 0x40000 -->| XXX NO ACCESS in userspace XXX                                     | <-- kernel stack
(kernel mem)+--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                       C          | console @ 0xB8000 (can access)
            +--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
0x100000 -->| XXX NO ACCESS (Alice's memory) XXX                                 | <-- 0x13ffff
            +--------------------------------------------------------------------+
0x140000 -->|Code|Data|Heap ...         Eve's process memory            ... Stack| <-- 0x17ffff
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
                                                                                 0x1fffffff (MEMSIZE_PHYSICAL - 1)

The memory labeled XXX NO ACCESS XXX here should behave as if it didn't exist: in particular, any access to a memory page in these regions should cause an exception (this is called a "page fault") that transfers control into the kernel.
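
The two views can be summarized as an access check. The sketch below is only a model of the pictures above, under several assumptions: the names Proc, user_can_access, and the constants are invented for illustration, the console is assumed to occupy exactly one page at 0xB8000, and everything outside a process's own region and the console is treated as inaccessible. A real kernel does not implement isolation with such a function; the hardware enforces it via page tables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not part of DemoOS or WeensyOS.
enum class Proc { Alice, Eve };

constexpr uint64_t kPageBytes = 4096;
constexpr uint64_t kConsoleAddr = 0xB8000;   // console page (from the diagram)
constexpr uint64_t kRegionBytes = 0x40000;   // 256 KB per process

// Returns true if the process's view includes `addr`. A false result
// models an access that would raise a page fault.
constexpr bool user_can_access(Proc p, uint64_t addr) {
    // Both views mark the console as accessible (assumed: one page).
    if (addr >= kConsoleAddr && addr < kConsoleAddr + kPageBytes) {
        return true;
    }
    // Otherwise, a process sees only its own 256 KB region.
    uint64_t start = (p == Proc::Alice) ? 0x100000 : 0x140000;
    return addr >= start && addr < start + kRegionBytes;
}

// Alice may touch her own stack message; Eve may not.
static_assert(user_can_access(Proc::Alice, 0x13ff59), "Alice's own memory");
static_assert(!user_can_access(Proc::Eve, 0x13ff59), "Eve faults here");
```

Under this model, both of Eve's attacks from the demo (writing to 0x13ff59 and to 0x100077) would end in a page fault rather than in corrupted memory.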

Note that neither Alice nor Eve needs to know here that they're not looking at the real picture of memory, but rather at a specific fiction created for their process. From the perspective of each process, it may as well be running on a computer with the physical memory laid out as shown in these views. (This is the power of virtualization: the notion of faking out an equivalent abstraction over some hardware without changing the interface.)

To achieve this, we need a layer of indirection between the addresses user-space processes use and the physical memory addresses that correspond to actual locations in RAM. We achieve this indirection by mapping virtual pages to physical pages through page tables.

Page Tables: Intro

Page tables are what let us actually convert a virtual address to a physical address. Each user-space process gets its own page table, which is used to perform that conversion whenever the process accesses memory.

Making page tables work efficiently requires some non-trivial data structure design. We'll work towards the actual design real computers use by considering a set of "strawman" designs that don't quite work.

Strawman 1: direct, byte-level lookup table.
Let's first consider what a very simple page table that maps virtual addresses to physical addresses might look like. Each entry in the page table would store a virtual address (8 bytes) and a physical address that it maps to (another 8 bytes), or NONE if there is no physical address for this virtual address and access should cause a fault into the kernel. Since memory on our computers is divided into individually-addressable bytes (recall the post-office boxes from lecture 1!), we'd need an entry in the table for each byte. (On a 64-bit computer, that's 2^64 entries in theory; but real 64-bit machines only support up to 2^48 actual addresses.)

For example, Alice's page table might look as follows (assuming we're running on a computer with a full 64-bit address space with 48 usable address bits):

+----------------------+-----------------------+
| Virtual address (8B) | Physical address (8B) |
+----------------------+-----------------------+ ---
| 0x0                  | NONE                  |  |
| ...                  | ...                   |  |
| 0x100000             | 0x100000              |  |
| 0x100001             | 0x100001              |  |
| ...                  | ...                   |  | 2^48 entries
| 0x140000             | NONE                  |  |
| 0x140001             | NONE                  |  |
| ...                  | ...                   |  |
| 0xFFFFFFFFFFFF       | NONE                  |  |
+----------------------+-----------------------+ ---

There is a rather obvious problem with this plan: if we needed 8 + 8 = 16 bytes of address translation for every byte of memory on the computer, and if we'd store a table with 2^48 entries, storing that table would use sixteen times as much memory as the computer actually has!
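
The arithmetic behind this claim can be checked at compile time. This is a small sketch with made-up constant names; it assumes 48 usable address bits and 16-byte entries, as in the text.

```cpp
#include <cassert>
#include <cstdint>

// Strawman 1: one 16-byte entry for every addressable byte.
// (Constant names are illustrative, not from the lecture code.)
constexpr uint64_t kUsableAddressBits = 48;
constexpr uint64_t kEntries = uint64_t{1} << kUsableAddressBits;  // 2^48
constexpr uint64_t kBytesPerEntry = 8 + 8;   // virtual + physical address
constexpr uint64_t kTableBytes = kEntries * kBytesPerEntry;       // 2^52

constexpr uint64_t kPiB = uint64_t{1} << 50;

// The table needs 16 bytes per byte of address space: 2^52 bytes = 4 PiB.
static_assert(kTableBytes == (uint64_t{1} << 52), "table is 2^52 bytes");
static_assert(kTableBytes / kPiB == 4, "that is 4 PiB");
static_assert(kTableBytes / kEntries == 16, "16x the addressable memory");
```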

Strawman 2: store only per-page entries.
Perhaps we can improve on this, and since memory is divided into pages, we might as well only store entries for each 4096-byte page. Specifically, we can use the lower 12 bits of the virtual address as an offset into the page (since we're only storing page-granularity mappings in the page table).

In other words, we slice the 64-bit virtual address as follows:

63         47                      11        0
+---------+-----------------------+-----------+
| UNUSED  |                       | offset    |
+---------+-----------------------+-----------+
          |- in page table, w/ bottom bits 0 -|

This means that instead of 2^48 entries, we would store "only" 2^36 entries (a page is 4096 = 2^12 bytes, so 2^48 / 2^12 = 2^36 entries).

Question: Why is the offset 12 bits?
Answer: We define the size of a page to be 4096 = 2^12 bytes. We want to be able to index into any byte of a destination physical page. So, we need 12 bits to represent every possible offset.
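
In code, splitting an address into its page part and its offset is a pair of bit operations. The following sketch uses hypothetical helper names (page_base, page_offset) and applies them to the address of Alice's message from the demo.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kOffsetBits = 12;
constexpr uint64_t kPageBytes = uint64_t{1} << kOffsetBits;   // 4096

// Address of the page containing `va` (bottom 12 bits cleared).
// This is the part that a per-page table would store or look up.
constexpr uint64_t page_base(uint64_t va) {
    return va & ~(kPageBytes - 1);
}

// Offset of `va` within its page (bottom 12 bits only).
constexpr uint64_t page_offset(uint64_t va) {
    return va & (kPageBytes - 1);
}

// Alice's message at 0x13ff59 lives at offset 0xf59 in the page at 0x13f000.
static_assert(page_base(0x13ff59) == 0x13f000, "page part");
static_assert(page_offset(0x13ff59) == 0xf59, "offset part");
```

Because the offset is copied unchanged from the virtual to the physical address, only the page part needs an entry in the table.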

With this scheme, Alice's page table would look as follows:

+----------------------+-----------------------+
| Virtual address (8B) | Physical address (8B) |
+----------------------+-----------------------+ ---
| 0x0                  | NONE                  |  |
| ...                  | ...                   |  |
| 0x100000             | 0x100000              |  |
| 0x101000             | 0x101000              |  |
| ...                  | ...                   |  | 2^36 entries
| 0x140000             | NONE                  |  |
| 0x141000             | NONE                  |  |
| ...                  | ...                   |  |
| 0xFFFFFFFFF000       | NONE                  |  |
+----------------------+-----------------------+ ---

Each entry still uses 16 (= 2^4) bytes of memory, so that's a total of 2^40 bytes for the page table. Unfortunately for us, that's still 1,024 GB, way more memory than our computers have!

Strawman 3: store only physical addresses.
Perhaps we can avoid storing the virtual address entirely, and thus save 8 bytes. To achieve this, we can turn the virtual address into an index into the table.

In other words, we slice the 64-bit virtual address as before and use the upper 36 bits as the index into the table:

63         47                      11        0
+---------+-----------------------+-----------+
| UNUSED  | index (upper 36 bits) | offset    |
+---------+-----------------------+-----------+

The table would now look like this:

                       +-----------------------+
                       | Physical address (8B) |
                       +-----------------------+ ---
                       | NONE                  |  |
  index from addr      | ...                   |  |
  (36 bits)            | 0x100000              |  |
   ------------------> | 0x101000              |  |
                       | ...                   |  | 2^36 entries
                       | NONE                  |  |
                       | NONE                  |  |
                       | ...                   |  |
                       | NONE                  |  |
                       +-----------------------+ ---

When we use the index to find the corresponding entry in the page table, we get a physical page address. As before, the offset from the virtual address tells us the offset into the physical page. But this scheme still uses 512 GB (8 bytes times 2^36 entries) of memory for the page table.
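
A lookup in this strawman design might be sketched as follows. This is a toy model under stated assumptions: the names translate, make_alice_table, and kNone are invented, the NONE marker is represented by an all-ones value, and the table covers only the first 512 pages (2 MB) of the address space instead of all 2^36 entries, with everything beyond treated as NONE.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kNone = ~uint64_t{0};   // hypothetical "NONE" marker
constexpr uint64_t kOffsetBits = 12;
constexpr uint64_t kIndexBits = 36;

// Translates `va` using a flat table of physical page addresses.
// Returns kNone where the real system would fault into the kernel.
uint64_t translate(const std::vector<uint64_t>& table, uint64_t va) {
    uint64_t index = (va >> kOffsetBits) & ((uint64_t{1} << kIndexBits) - 1);
    uint64_t offset = va & ((uint64_t{1} << kOffsetBits) - 1);
    // Toy simplification: pages beyond the end of the table map to NONE.
    if (index >= table.size() || table[index] == kNone) {
        return kNone;
    }
    return table[index] + offset;   // physical page address + offset
}

// Builds a table resembling Alice's: her 256 KB region maps to itself.
std::vector<uint64_t> make_alice_table() {
    std::vector<uint64_t> table(512, kNone);   // covers 512 pages = 2 MB
    for (uint64_t va = 0x100000; va < 0x140000; va += 4096) {
        table[va >> kOffsetBits] = va;
    }
    return table;
}
```

With this table, translate(table, 0x13ff59) yields 0x13ff59, while any address in Eve's region (such as 0x140000) yields kNone. The virtual address itself is never stored; its upper bits only select the slot.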

One observation to make here is that ideally we would store nothing for virtual addresses that map to NONE. A sparse data structure might help us achieve this! We could use something like a linked list (which would allow us to skip large, empty parts of the address space), but the O(N) worst-case access complexity would make such a plan slow. We need something that supports fast lookups, but still uses little space.

x86-64 Page Tables and Address Translation

The actual page table structure used in x86-64 computers combines the tricks from the strawman designs 2 and 3 above, and adds a clever tree structure into the mix.

We can think of a multi-level page table structure (x86-64 uses four levels; x86-32 used two) as a tree. Multiple levels reduce space needed because when we look up the physical page for a given virtual address, we may visit a branch of the tree that tells us there's actually no valid physical page for us to access. In that case, we just stop searching. So, multiple levels means we can have a sparse tree.

It's important to note that every process has its own page table. In WeensyOS, the kernel code contains a global array ptable whose slots contain the addresses of the top-level (L4) page table for each process. This way, the kernel can locate and modify the page tables for every process.

Each L4 page table contains 512 8-byte entries that can each either be empty or contain the address of an L3 page table. The page has 512 entries because dividing the page size of 4096 bytes by the entry size of 8 bytes yields 512.
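
The sketch below illustrates the layout of one page table page and how a virtual address selects an entry at each level. The struct and function names are hypothetical (WeensyOS has its own types). The 9-bit split follows from the 512 = 2^9 entries per table: four levels of 9 bits plus the 12-bit offset account for the 48 usable address bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kPageBytes = 4096;
constexpr size_t kEntriesPerTable = kPageBytes / sizeof(uint64_t);   // 512

// One page table page: 512 eight-byte entries, exactly one page in size.
// (PageTablePage is an illustrative name, not the WeensyOS type.)
struct PageTablePage {
    uint64_t entry[kEntriesPerTable];
};
static_assert(sizeof(PageTablePage) == kPageBytes, "fits exactly one page");

// Index into the table at `level` (4 = top level, 1 = lowest level).
// Each level consumes 9 bits of the address, above the 12 offset bits.
constexpr unsigned table_index(uint64_t va, int level) {
    return static_cast<unsigned>((va >> (12 + 9 * (level - 1))) & 0x1FF);
}

// Alice's first page (0x100000) uses entry 0 at L4, L3, and L2,
// and entry 256 of the L1 table.
static_assert(table_index(0x100000, 4) == 0, "L4 index");
static_assert(table_index(0x100000, 1) == 256, "L1 index");
```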

The picture below shows an example of how the page tables for Alice's and Eve's process might be structured. The L4, L3, and L2 page tables each contain entries that are either empty or contain the address of the page containing a lower-level page table.

Why does the kernel have its own page table?

Modern computer architectures, including the x86-64 architecture, accelerate virtual address translation via page tables in hardware. As a consequence, it's actually impossible to turn virtual memory off and work directly with physical addresses. Consequently, even the kernel needs to use page tables to translate virtual addresses! It may use an identity mapping to make virtual addresses of key resources (such as the console, or special I/O memory) identical to their physical addresses, but it does need to go through the translation.

OS kernels, like Linux's and the WeensyOS kernel, can actually operate with different active page tables. For example, when a process makes a system call, the kernel executes in process context, meaning that the active page tables are those of the calling user-space process. But in other contexts, such as when handling an interrupt or at bootup, the kernel is not running on behalf of a userspace process and uses the kernel (process 0) page tables.

This structure means that there is 1 L4 page table, up to 512 L3 page tables (each of 512 L4 PT entries can point to a single L3 PT page), up to 512^2 L2 page tables, and up to 512^3 L1 page tables. In practice, there are far fewer, as the picture shows. Using all 512^3 L1 page tables would allow addressing 512^3 × 512 = 512^4 ≈ 68 billion pages. 68 billion pages of 4096 bytes each would cover 256 TB of addressable memory; the page tables themselves would be 512 GB in size. Most programs only need a fraction of this space, and by leaving page table entries empty at high levels (e.g., L3 or L4), entire subtrees of the page table structure don't need to exist.
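
These numbers can be verified with compile-time arithmetic. The constant names below are illustrative. The last check adds the L4, L3, and L2 tables to the total, which comes to slightly more than the 512 GB taken up by the L1 tables alone.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kEntries = 512;      // entries per page table
constexpr uint64_t kPageBytes = 4096;
constexpr uint64_t kGiB = uint64_t{1} << 30;
constexpr uint64_t kTiB = uint64_t{1} << 40;

// Maximum number of page tables at each level.
constexpr uint64_t kMaxL4 = 1;
constexpr uint64_t kMaxL3 = kEntries;                         // 512
constexpr uint64_t kMaxL2 = kEntries * kEntries;              // 512^2
constexpr uint64_t kMaxL1 = kEntries * kEntries * kEntries;   // 512^3

// Every L1 entry maps one page: 512^4 = 2^36 pages in total.
constexpr uint64_t kMaxPages = kMaxL1 * kEntries;
static_assert(kMaxPages == 68719476736ULL, "about 68 billion pages");
static_assert(kMaxPages * kPageBytes / kTiB == 256, "256 TB addressable");

// The L1 tables alone occupy 512 GB ...
static_assert(kMaxL1 * kPageBytes / kGiB == 512, "512 GB of L1 tables");

// ... and all four levels together round down to 513 GB.
constexpr uint64_t kAllTables = kMaxL4 + kMaxL3 + kMaxL2 + kMaxL1;
static_assert(kAllTables * kPageBytes / kGiB == 513, "all levels combined");
```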

The L1 page table entries are special, as they supply the actual information to translate a virtual address into a physical one. Instead of storing another page table's address, the L1 page table entries (PTEs) contain both part of the physical address (PA) that a given virtual address (VA) translates into, and also the permission bits that control the access rights on the page (PTE_P, PTE_W, and PTE_U; as well as other bits we don't cover in detail in this course).
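
Putting the pieces together, a four-level lookup might be sketched as the user-space simulation below. This is a toy model, not the WeensyOS implementation and not what the hardware literally does: ToyMemory, map_page, translate, and kFault are invented names; "physical addresses" of page table pages are simulated as (index + 1) × 4096 into a vector; and only the low 12 bits of an entry are treated as flags (real entries also use some high bits). The values PTE_P = 1, PTE_W = 2, and PTE_U = 4 are the x86-64 present, writable, and user-accessible bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t PTE_P = 1, PTE_W = 2, PTE_U = 4;
constexpr uint64_t kPageBytes = 4096;
constexpr uint64_t kLowBits = kPageBytes - 1;   // flag bits / page offset
constexpr uint64_t kFault = ~uint64_t{0};       // models a page fault

struct PageTablePage { uint64_t entry[512] = {}; };

// Toy "physical memory" that holds only page table pages.
// tables[i] pretends to live at physical address (i + 1) * 4096.
struct ToyMemory {
    std::vector<PageTablePage> tables;
    uint64_t alloc_table() {
        tables.emplace_back();
        return tables.size() * kPageBytes;
    }
    PageTablePage& at(uint64_t pa) { return tables[pa / kPageBytes - 1]; }
};

constexpr unsigned table_index(uint64_t va, int level) {
    return static_cast<unsigned>((va >> (12 + 9 * (level - 1))) & 0x1FF);
}

// Maps the page containing `va` to physical page `pa` with permissions
// `perm`, creating L3/L2/L1 tables on demand (this is the sparse tree).
void map_page(ToyMemory& mem, uint64_t l4_pa, uint64_t va, uint64_t pa,
              uint64_t perm) {
    uint64_t table_pa = l4_pa;
    for (int level = 4; level > 1; --level) {
        unsigned i = table_index(va, level);
        if (!(mem.at(table_pa).entry[i] & PTE_P)) {
            uint64_t child = mem.alloc_table();
            mem.at(table_pa).entry[i] = child | PTE_P | PTE_W | PTE_U;
        }
        table_pa = mem.at(table_pa).entry[i] & ~kLowBits;
    }
    mem.at(table_pa).entry[table_index(va, 1)] = (pa & ~kLowBits) | perm;
}

// Walks L4 -> L1. `required` holds the permission bits the access needs
// (e.g., PTE_P | PTE_U for a user-space read). Returns kFault if any
// level lacks them; otherwise the physical address for `va`.
uint64_t translate(ToyMemory& mem, uint64_t l4_pa, uint64_t va,
                   uint64_t required) {
    uint64_t table_pa = l4_pa;
    for (int level = 4; level >= 1; --level) {
        uint64_t e = mem.at(table_pa).entry[table_index(va, level)];
        if ((e & required) != required) {
            return kFault;          // empty or insufficient permissions
        }
        if (level == 1) {
            return (e & ~kLowBits) | (va & kLowBits);
        }
        table_pa = e & ~kLowBits;   // descend to the next-level table
    }
    return kFault;
}
```

As a usage example, after calling alloc_table() for the L4 table and map_page(mem, l4, 0x13f000, 0x13f000, PTE_P | PTE_W | PTE_U), a user-space lookup of 0x13ff59 with required = PTE_P | PTE_U returns 0x13ff59, whereas a lookup of the unmapped address 0x140000 stops at the empty L1 entry and returns kFault. Mapping a kernel page without PTE_U makes user-space lookups of that page fail while lookups requiring only PTE_P succeed.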

Summary

Today, we looked at how to protect the memory of user-space processes from each other. Previously, we saw how page permissions keep a user-space program from writing over kernel memory, by changing the memory mappings such that kernel memory is not available to user-space processes. As Eve's attacks on Alice showed, that alone is not enough: processes also need to be isolated from one another. Virtual memory achieves this by giving each process its own view of memory, so that multiple programs can share the same physical memory safely.

Computers realize virtual memory using page tables, which are mapping tables that help translate virtual addresses (which user-space programs work with) to physical addresses (which refer to real memory addresses in the computer's DRAM chips). Page tables in the x86-64 architecture follow an ingenious four-level design that allows for chunking into page-sized units (so page tables can themselves be stored in physical memory pages) and for a high-level cutoff that avoids wasting space storing translations for virtual addresses that map to nothing (i.e., which have no corresponding physical page).