Lecture 18: Address Translation (continued), Process Creation

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Monday, April 8).

Page Table recap

We can think of a multi-level page table structure (x86-64 uses four levels; x86-32 used two) as a tree. Multiple levels reduce space needed because when we look up the physical page for a given virtual address, we may visit a branch of the tree that tells us there's actually no valid physical page for us to access. In that case, we just stop searching. So, multiple levels means we can have a sparse tree.

It's important to note that every process has its own page table. In WeensyOS, the kernel code contains a global array ptable whose slots contain the addresses of the top-level (L4) page table for each process. This way, the kernel can locate and modify the page tables for every process.

Each L4 page table contains 512 8-byte entries that can either be empty or contain the address of an L3 page table. The table has 512 entries because dividing the page size of 4096 bytes by the entry size of 8 bytes yields 512.

The picture below shows an example of how the page tables for Alice's and Eve's processes might be structured. The L4, L3, and L2 page tables each contain entries that are either empty or contain the address of the page containing a lower-level page table.

Why does the kernel have its own page table?

Modern computer architectures, including the x86-64 architecture, accelerate virtual address translation via page tables in hardware. As a consequence, it's actually impossible to turn virtual memory off and work directly with physical addresses, so even the kernel needs to use page tables to translate virtual addresses! It may use an identity mapping to make virtual addresses of key resources (such as the console, or special I/O memory) identical to their physical addresses, but it does need to go through the translation.

OS kernels, like Linux's and the WeensyOS kernel, can actually operate with different active page tables. For example, when a process makes a system call, the kernel executes in process context, meaning that the active page tables are those of the calling user-space process. But in other contexts, such as when handling an interrupt or at bootup, the kernel is not running on behalf of a userspace process and uses the kernel (process 0) page tables.

This structure means that there is 1 L4 page table, up to 512 L3 page tables (each of 512 L4 PT entries can point to a single L3 PT page), up to 512^2 L2 page tables, and up to 512^3 L1 page tables. In practice, there are far fewer, as the picture shows. Using all 512^3 L1 page tables would allow addressing 512^3 × 512 = 512^4 ≈ 68 billion pages. 68 billion pages of 4096 bytes each would cover 256 TB of addressable memory; the page tables themselves would be 512 GB in size. Most programs only need a fraction of this space, and by leaving page table entries empty at high levels (e.g., L3 or L4), entire subtrees of the page table structure don't need to exist.

The L1 page table entries are special, as they supply the actual information to translate a virtual address into a physical one. Instead of storing another page table's address, the L1 page table entries (PTEs) contain both part of the physical address (PA) that a given virtual address (VA) translates into, and also the permission bits that control the access rights on the page (PTE_P, PTE_W, and PTE_U; as well as other bits we don't cover in detail in this course).

The access permission bits are stored in the lowest 12 bits of the PTE, since those bits aren't needed as part of the physical address. Recall that the lowest 12 bits address a byte within the page, and that we use the offset (lowest 12 bits) from the virtual address directly; therefore, the lowest 12 bits of the page's physical address are always zero, making them available for metadata storage. (The top bit, i.e., bit 63, is also used for metadata: it represents the "execute disable" or NX/XD bit, which marks data pages as non-executable.)

Virtual Address Translation

The x86-64 architecture uses four levels of page table. This is reflected in the structure of a virtual address:

63         47     38     29     20     11        0
+---------+------+------+------+------+-----------+
| UNUSED  |  L4  |  L3  |  L2  |  L1  | offset    |
+---------+------+------+------+------+-----------+
          |9 bits|9 bits|9 bits|9 bits| 12 bits

An x86-64 virtual address has 64 bits, but only the lowest 48 bits are meaningful. We have 9 bits to index into each page table level, and 12 bits for the offset. (This means that the top 16 bits are left over and unused.)

Each page table "chunk" at each layer has 512 entries. Why 512? Because the chunk of the page table itself needs to fit into a page, which is 2^12 = 4096 bytes large. Each entry is an 8-byte address, so we can fit 2^9 = 512 entries into a page.

This is also why the indexes in the virtual address are each 9 bits long! We need 9 bits to choose one out of the 512 entries in each page table chunk.

Each L4 page table entry that is present holds the address of an L3 page table, and each L3 page table entry that is present holds the address of an L2 page table. Each L2 page table entry in turn holds the address of an L1 page table. Entries in the L1 page table actually hold physical addresses (as they are the bottom of the tree) and also store the access bits in the lower 12 bits, where the page address is always all zeroes.

There is only one L4 page table per process, which forms the top of the tree. But there are up to 512 L3 page tables per process, up to 512^2 L2 page tables, and up to 512^3 L1 page tables per process. In reality, there will be far fewer than this, since no process will actually be using the full virtual address space. Instead, there will be many large gaps in the virtual address space, and there will be no lower-level page tables for these address ranges.

Example

The picture below zooms in on the L1 page table used in translating VA 0x10'0001 to PA 0x8001. Note that the indexes into the L4, L3, and L2 page tables are all zero, since the upper bits of the VA are all zero. (The full 48-bit VA is 0x0000'0010'0001.) The offset bits (lowest 12 bits) correspond to hexadecimal value 0x001, and they get copied straight into the PA. The next nine bits (bits 12 to 20) are, in binary, 0b1'0000'0000 (hex: 0x100, decimal 256). They serve as the index into the L1 page table, where the entry at index 256 contains the value 0x8 (0x0'0000'0008 as full 36-bit value) in bits 12 to 47. This value gets copied into bits 12 to 47 of the PA, and combined with the offset of 0x001 results in the full PA of 0x0000'0000'8001.

Page tables are the fundamental abstraction for virtual memory on modern computers. While you don't need to remember the exact details of the x86-64 page table structure, you should understand why the structure is designed this way, and how it works – for example, you might get asked to design a page table structure for another architecture in the quiz!

Page Table Lookups (x86-64)

A successful lookup (finding a physical address from a virtual address) goes as follows in the case of x86-64 page tables:

  1. Use the address in the %cr3 register to find the L4 page table address
  2. Use the L4 index from the virtual address to get the L3 page table address
  3. Use the L3 page table address and the L3 index to get an L2 page table address
  4. Use the L2 page table address and the L2 index to get an L1 page table address
  5. Use the L1 page table address and the L1 index to get the destination physical page
  6. Use the destination physical page and the offset to get actual physical address within that destination physical page

Finally, one important detail of virtual address translation is that user-space processes don't need to switch into the kernel to translate a VA to a PA. If every memory access from user-space required a protected control transfer into the kernel, it would be horrendously slow! Instead, even though the process page tables are not writable or accessible from userspace, the CPU itself can access them when operating in user-space. This may sound strange, but it works because the CPU stores the physical address of the L4 page table in a special register, %cr3 on x86-64. This register is privileged and user-space code cannot change its value (trying to do so traps into the kernel). When dereferencing a virtual address, the CPU looks at the L4 page table at the address stored in %cr3 and then follows the translation chain directly (i.e., using logic built into the chip, rather than assembly instructions). This makes address translation much faster – however, it turns out that even this isn't fast enough, and the CPU has a special cache for address translations. This is called the Translation Lookaside Buffer (TLB), and it stores the physical addresses for the most recently translated virtual addresses.

Process Lifecycle

Processes are how we run programs on our computers, and our computers often use several processes to get things done. For example, simple terminal commands such as ls or grep each run in a new process that produces some output and then exits.

In WeensyOS, the kernel starts four processes at startup, but (at least until step 5 of Project 4), there is no way for a user-space process to start another user-space process. A realistic operating system clearly needs to be able to do so.

Process Creation

Many Unix-based operating systems – which include Linux, the BSD line of operating systems, and Mac OS X – use a system call named fork() for process creation. fork elicits controversy even after nearly 50 years of use, and it's not the only way to create processes (Windows, for example, has a different approach). But it is how millions of computers and devices do it!

The fork() system call

fork() has the effect of cloning a user-space process. For example, this program (fork1.cc) calls fork() ("forks"), prints a message, and exits:

#include "helpers.hh"

int main() {
    pid_t p1 = fork();
    assert(p1 >= 0);

    printf("Hello from pid %d\n", getpid());
}

How many times will the message be printed when we run it? It is printed twice:

$ ./fork1
Hello from pid 19244
Hello from pid 19245

This happens because the call to fork() enters the kernel, which clones the process, and then continues user-space execution in both clones. Both processes execute the rest of the program, and thus both execute the printf function call. Note that the processes have different process IDs, as evidenced by the fact that the getpid() system call returns different values.

The return value from fork() depends on whether it is returning into the parent or into the child – every successful call to fork() returns twice:

  1. In the parent process, the return value is the new process's PID.
  2. In the child process, the return value is 0.

A negative return value indicates that fork() failed and no child process was created.

A forked child process shares many of its parent process's resources, and consequently the OS kernel needs to copy various pieces of information from the parent. The information that needs copying includes:

You will need to implement these copies as part of your fork implementation in WeensyOS.

Remember that execution continues in the same program for both the parent and child process (although their execution can diverge). If the child forks again, it can create further processes (see fork2.cc, which ends up with a total of four processes).

Since the child process receives a full copy of the parent process's address space, any virtual address that was mapped and valid in the parent is also valid in the child process. However, the same virtual address is backed by a different physical address in the child. In other words, parent memory and child memory are entirely independent.

Let's do a quick exercise to remind us of what fork() does. Take a look at this program:

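#include "helpers.hh"

int main() {
    // Runs once, in the initial process, before any fork.
    printf("Hello from initial pid %d\n", getpid());

    pid_t p1 = fork();
    assert(p1 >= 0);

    // Runs in every process that exists after the first fork.
    pid_t p2 = fork();
    assert(p2 >= 0);

    // Runs in every process that exists after the second fork.
    printf("Hello from final pid %d\n", getpid());
}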

Question: How many lines of output would you expect to see when you run the program?

Summary

Today, we did a recap of address translation through page tables and learned about some critical hardware features to make address translation fast.

We also talked about how the fork() system call allows a user-space process to start another process by essentially cloning itself. The two processes, called "parent" and "child", continue executing from the same place in the code, and they start with identical memory mappings (though these mappings are backed by different physical memory pages, for the most part). But the processes can evolve independently after the fork() system call returns. In Project 4, you will implement handling of the fork() system call in the WeensyOS kernel!