
Lecture 12: Syscalls and Virtual Memory

» Lecture code
» Post-Lecture Quiz (due 6pm Monday, March 8).

S1: System Calls and File I/O

System Calls

When processes (that is, running programs!) want the OS kernel to do something on their behalf, their mechanism of choice is a system call. System calls are like function calls, but they also transition control into the kernel. We've seen some examples of system calls: read() and write() on Linux, and sys_yield() on WeensyOS. From the perspective of a user-space process, these system calls look just like function calls into the standard library, but there's a lot more going on!

System calls are an example of what is called protected control transfer (interrupts are another example). The idea is to transfer control of the processor to the OS kernel, and to do so safely and in an organized manner. Recall from last lecture that when the kernel runs, the processor is in privileged mode (indicated by the lowest two bits of the %cs register being 0 on x86-64). But when user-space programs run, the processor must be in unprivileged mode. A system call must therefore change the privilege level!

System calls follow a specific protocol when they're invoked. Some of this is similar to the protocol for function calls (the calling convention, from when we talked about the stack), and some of it is specific to system calls. It works as follows:

  1. The user-space process sets up the stack and registers according to the architecture's calling convention (e.g., first argument in %rdi, arguments beyond the sixth on the stack, etc.).
  2. The user-space process invokes the special syscall machine instruction. This instruction:
    • Sets the processor to privileged mode.
    • Saves the address of the machine instruction following the syscall instruction into a register (on x86-64, %rcx).
    • Triggers an interrupt to set the next instruction to execute to a known location in the kernel (on WeensyOS, this is a function called syscall_entry).
  3. The kernel saves the values in all registers (including the stack pointer, base pointer, general-purpose registers, flags, etc.) to a well-defined place in memory for this process, called the process descriptor. This information needs to be preserved so that the kernel can restore the state of affairs when it finally returns to the user-space process.
  4. The kernel goes off and does whatever work it needs to do (I/O, running other processes, etc.).
  5. Eventually, the kernel decides to return to the process that made the system call. It restores the register contents from the values saved in the process descriptor, and then invokes the sysret instruction, which:
    • Reduces the processor's privilege level to unprivileged (the lowest two bits of %cs are 3 on x86-64).
    • Sets the address of the next instruction to the instruction after the syscall instruction, based on the saved register value.
  6. We're back in user-space, and the process continues executing!
The transitions between user-space and the kernel happen in steps 2 and 5 of this sequence.
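To see that a call like write() is just a thin layer over this protocol, the sketch below issues the same system call in two ways. It assumes Linux with a C library that provides the generic syscall() function; raw_write and wrapped_write are hypothetical helper names chosen for illustration.

```cpp
#include <sys/syscall.h>  // SYS_write: the system call number for write (Linux)
#include <unistd.h>       // syscall(), write()
#include <cassert>
#include <cstddef>

// Hypothetical helper: issue the write system call by number, bypassing the
// write() wrapper. The library's syscall() function places the number and
// arguments where the kernel expects them and enters the kernel.
long raw_write(int fd, const void* buf, size_t count) {
    assert(buf != nullptr || count == 0);
    return syscall(SYS_write, fd, buf, count);
}

// The usual route: the library wrapper, which looks like a plain function call.
long wrapped_write(int fd, const void* buf, size_t count) {
    assert(buf != nullptr || count == 0);
    return write(fd, buf, count);
}
```

Both helpers end up in the same kernel code; calling raw_write(1, "hi\n", 3) should print to the terminal just like write(1, "hi\n", 3) does.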

System calls are not cheap. They require the processor to do significant extra work compared to normal function calls (e.g., saving registers, handling an interrupt, etc.). A system call also means that the user-space process probably loses some locality of reference, and thus may have more cache misses after the system call returns. In practice, a system call takes 1-2µs to handle. This may seem small, but compared to a DRAM access (60ns), it's quite expensive: roughly 15 to 30 times the cost of a memory access. Frequent system calls are therefore one major source of poor performance in programs. In Project 3, you will implement a set of tricks to avoid having to make frequent system calls!
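One common way to make fewer system calls is to batch small writes in user-space memory. The following is a minimal sketch of that idea, not the design Project 3 asks for: BufferedWriter, its 4096-byte capacity, and the syscalls counter are all assumptions made for illustration.

```cpp
#include <unistd.h>   // write()
#include <cassert>
#include <cstddef>

// Hypothetical user-space write buffer: collects bytes in memory and only
// calls write() when the buffer is full or flush() is requested.
struct BufferedWriter {
    static constexpr size_t kCapacity = 4096;  // assumed buffer size
    int fd;
    char buf[kCapacity];
    size_t used = 0;      // bytes currently waiting in buf
    size_t syscalls = 0;  // number of write() system calls issued so far

    explicit BufferedWriter(int fd_) : fd(fd_) { assert(fd_ >= 0); }

    // Write out everything buffered; returns false on error.
    bool flush() {
        size_t done = 0;
        while (done < used) {
            ssize_t n = write(fd, buf + done, used - done);
            ++syscalls;
            if (n <= 0) return false;
            done += static_cast<size_t>(n);
        }
        used = 0;
        return true;
    }

    // Append one byte, flushing first if the buffer is full.
    bool put(char c) {
        if (used == kCapacity && !flush()) return false;
        buf[used++] = c;
        return true;
    }
};
```

Writing 10,000 bytes one put() at a time and then calling flush() needs only a handful of write() calls, whereas calling write() once per byte would need 10,000.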

File Descriptors

Input and output (I/O) on a computer must generally happen through the OS kernel, so that it can mediate and ensure that only one process at a time uses the physical resources affected by the I/O (e.g., a hard disk, or your WiFi). This avoids chaos and helps with fair sharing of the computer's hardware. (There are some exceptions to this rule, notably memory-mapped I/O and recent fast datacenter networking, but most classic I/O goes through the kernel.)

When a user-space process makes I/O system calls like read() or write(), it needs to tell the kernel what file it wants to do I/O on. This requires the kernel and the user-space process to have a shared way of referring to a file. On UNIX-like operating systems (such as macOS and Linux), this is done using file descriptors.

File descriptors are identifiers that the kernel uses to keep track of open resources (such as files) used by user-space processes. User-space processes refer to these resources using integer file descriptor (FD) numbers; in the kernel, the FD numbers index into an FD table maintained for each process, which may contain extra information like the filename, the offset into the file for the next I/O operation, or the amount of data read/written. For example, a user-space process may use the number 3 to refer to a file descriptor that the kernel knows corresponds to /home/malte/cats.txt.

To get a file descriptor number allocated, a process calls the open() syscall. open() causes the OS kernel to do permission checks, and if they pass, to allocate an FD number from the set of unused numbers for this process. The kernel sets up its metadata, and then returns the FD number to user-space. The FD number for the first file you open is usually 3, the next one 4, etc.

Why is the first file descriptor number usually 3?

On UNIX-like operating systems such as macOS and Linux, there are some standard file descriptor numbers. FD 0 normally refers to stdin (input from the terminal), 1 refers to stdout (output to the terminal), and 2 refers to stderr (output to the terminal, for errors). You can close these standard FDs; if you then open other files, they will reuse FD numbers 0 through 2, but your program will no longer be able to interact with the terminal.
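The allocation rule can be observed directly: on POSIX systems, open() hands out the lowest FD number that is currently unused in the process. The sketch below checks this behavior; fd_numbers_are_reused is a hypothetical helper name, and it assumes a single-threaded program so that no other thread opens files in between.

```cpp
#include <fcntl.h>    // open(), O_RDONLY
#include <unistd.h>   // close()
#include <cassert>

// Hypothetical helper: opens `path` a few times and reports whether the FD
// numbers behave as described (increasing while files stay open, and reused
// once a file is closed).
bool fd_numbers_are_reused(const char* path) {
    assert(path != nullptr);
    int first = open(path, O_RDONLY);
    if (first < 0) return false;
    int second = open(path, O_RDONLY);  // `first` is taken, so this is larger
    if (second < 0) {
        close(first);
        return false;
    }
    close(first);                       // frees the number held by `first`
    int third = open(path, O_RDONLY);   // should get that number back
    bool ok = (second > first) && (third == first);
    close(second);
    if (third >= 0) close(third);
    return ok;
}
```

In a program started from a terminal with only FDs 0 through 2 open, the first open() inside this helper would typically return 3 and the second 4.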

Now that user-space has the FD number, it uses this number as a handle to pass into read() and write(). The full API for the read system call is: ssize_t read(int fd, void* buf, size_t count). The first argument indicates the FD to work with, the second is a pointer to the buffer (memory region) that the kernel is supposed to put the data read into, and the third is the number of bytes to read. read() returns the number of bytes actually read (or 0 if there are no more bytes in the file; or -1 if there was an error). write() has an analogous API, except the kernel reads from the buffer pointed to and copies the data out.
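Because read() reports how many bytes it actually read, which can be fewer than requested, callers often wrap it in a loop. Here is a minimal sketch of such a loop that handles all three kinds of return value; read_fully is a hypothetical helper name, not part of any library.

```cpp
#include <unistd.h>   // read()
#include <cassert>
#include <cerrno>     // errno, EINTR
#include <cstddef>

// Hypothetical helper: read up to `count` bytes from `fd` into `buf`,
// calling read() again after short reads. Returns the number of bytes read
// (less than `count` only if the end of the file was reached), or -1 on error.
ssize_t read_fully(int fd, void* buf, size_t count) {
    assert(buf != nullptr || count == 0);
    char* out = static_cast<char*>(buf);
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, out + total, count - total);
        if (n == 0) break;                 // no more bytes in the file
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal; retry
            return -1;                     // a real error
        }
        total += static_cast<size_t>(n);   // short read: keep going
    }
    return static_cast<ssize_t>(total);
}
```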

One important aspect that is not part of the API of read() or write() is the current I/O offset into the file (sometimes referred to as the "read-write head" in man pages). In other words, when a user-space process calls read(), it fetches data from whatever offset the kernel currently knows for this FD. If the offset is 24, and read() wants to read 10 bytes, the kernel copies bytes 24-33 into the user-space buffer provided as an argument to the system call, and then sets the kernel offset for the FD to 34.

A user-space process can influence the kernel's offset via the lseek() system call, but is generally expected to remember on its own where in the file the kernel's offset currently is. In Project 3, you'll have to maintain such metadata for your caching in user-space memory. In particular, when reading data into the cache or writing cached data into a file, you'll need to be mindful of the current offset that the I/O will happen at.
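The sketch below shows one way a program might mirror the kernel's offset in user space, and replays the example from above (offset 24, read 10 bytes, offset becomes 34) on a scratch file. TrackedFile and offsets_agree_after_read are hypothetical names, and the scratch file under /tmp is an assumption about the environment.

```cpp
#include <stdlib.h>   // mkstemp() (POSIX)
#include <unistd.h>   // read(), write(), lseek(), unlink(), close()
#include <cassert>
#include <cstddef>

// Hypothetical wrapper that keeps a user-space copy of the kernel's offset.
struct TrackedFile {
    int fd;
    off_t offset;  // our copy of the kernel's I/O offset for fd

    ssize_t read_some(void* buf, size_t count) {
        assert(fd >= 0);
        ssize_t n = read(fd, buf, count);
        if (n > 0) offset += n;  // the kernel advanced its offset by n too
        return n;
    }

    bool seek_to(off_t pos) {
        off_t r = lseek(fd, pos, SEEK_SET);
        if (r < 0) return false;
        offset = r;
        return true;
    }
};

// Seek to 24 in a 64-byte scratch file, read 10 bytes, and check that both
// our copy and the kernel's offset (queried via lseek) are now 34.
bool offsets_agree_after_read() {
    char path[] = "/tmp/offset-demo-XXXXXX";
    int fd = mkstemp(path);  // creates and opens a fresh scratch file
    if (fd < 0) return false;
    unlink(path);            // the file goes away once fd is closed
    char data[64] = {};
    bool ok = write(fd, data, sizeof data) == static_cast<ssize_t>(sizeof data);
    TrackedFile f{fd, 64};   // after writing 64 bytes, the offset is 64
    char buf[10];
    ok = ok && f.seek_to(24) && f.read_some(buf, sizeof buf) == 10;
    ok = ok && f.offset == 34 && lseek(fd, 0, SEEK_CUR) == 34;
    close(fd);
    return ok;
}
```

Keeping the copy up to date costs nothing, while asking the kernel with lseek(fd, 0, SEEK_CUR) is itself a system call.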

S2: Caching I/O Q&A

This lecture segment was replaced by an interactive Q&A session on Project 3.

S3: Virtual Memory

We previously used memory protection to isolate kernel memory from user-space processes, preventing attacks where user-space processes write to kernel memory. But this is not enough – we also need to prevent user-space processes from accessing the memory of other user-space processes!

In DemoOS as we have seen it in lectures so far (and also in WeensyOS at the start of Project 3), there is no isolation between processes. The only thing that prevents utter chaos is that the processes happen to use non-overlapping memory addresses; for example, p-alice starts at address 0x10'0000 and ends at 0x13'FFFF, while p-eve starts at 0x14'0000 and ends at 0x17'FFFF (and likewise with the first and second process on WeensyOS).

There are a couple of problems with this approach:

  1. If Eve successfully guesses an address within Alice's memory, she can read and modify Alice's data.
  2. If Alice's and Eve's processes ever accidentally use the same address, they will corrupt each other's memory; programmers need to carefully choose non-overlapping memory regions for their processes.
  3. The processes' memory regions are of fixed size, and a process that needs more than, say, the 0x40000 bytes of memory between the bottom and top address (= 256 KB) either cannot run or needs to carefully avoid any memory used by other processes.
  4. If we have many processes, there may not be enough memory to run them all as we need to pre-reserve a fixed amount of memory for each process (in the examples, 0x40000 bytes = 256 KB).
So we need something safer and more flexible! Virtual memory is a concept that achieves both these goals.
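To make the numbers in the list above concrete, here is a small sketch that computes the size of each process's region from the start and end addresses given earlier and checks that the two regions do not overlap. The constant and function names are made up for illustration.

```cpp
#include <cstdint>

// Region bounds from the DemoOS example (both ends inclusive).
constexpr uint64_t kAliceStart = 0x100000, kAliceEnd = 0x13FFFF;
constexpr uint64_t kEveStart = 0x140000, kEveEnd = 0x17FFFF;

// Number of bytes in the inclusive range [start, end].
constexpr uint64_t region_size(uint64_t start, uint64_t end) {
    return end - start + 1;
}

// True if the inclusive ranges [a_start, a_end] and [b_start, b_end] share
// at least one address.
constexpr bool regions_overlap(uint64_t a_start, uint64_t a_end,
                               uint64_t b_start, uint64_t b_end) {
    return a_start <= b_end && b_start <= a_end;
}

static_assert(region_size(kAliceStart, kAliceEnd) == 0x40000,
              "each region spans 0x40000 bytes");
static_assert(0x40000 == 256 * 1024, "0x40000 bytes is 256 KB");
static_assert(!regions_overlap(kAliceStart, kAliceEnd, kEveStart, kEveEnd),
              "Alice and Eve happen not to overlap");
```

Nothing enforces these checks at runtime, of course; the processes merely happen to have been given ranges for which they hold.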

Demo: Eve Attacking Alice's Memory

Two relatively simple attacks demonstrate the danger of giving Eve's process access to the memory of Alice's process. In the first, Eve might form a pointer into data stored, e.g., on Alice's stack and change that data, for example to make Alice print an attack message rather than her normal Hi, I'm Alice! message.

In the lecture demo as compiled, the address of Alice's message on her process's stack is 0x13'ff59. (You can obtain this by printing the address &msg from p-alice.cc, or by looking at the disassembly in obj/p-alice.asm). Eve may modify her program as follows to overwrite Alice's message:

--- p-eve.cc    2020-03-05 10:24:04.760050399 -0500
+++ p-eve.cc    2020-03-05 10:25:21.671907556 -0500
@@ -7,6 +7,8 @@
         if (i % 1024 == 0) {
             console_printf(0x0E00, "Hi, I'm Eve! #%d\n", i / 512);
         }
+        char* msg = (char*) 0x13ff59;
+        snprintf(msg, 15, "EVE ATTACK!");

         if (i % 2048 == 0) {
           char* syscall = (char*) 0x40ad6;

What's the syntax of the above listing?

This listing is a unified diff, which is a format for expressing differences between text files (sometimes referred to as "patches"). It's the format that the git diff command produces its output in. Developers often use diffs to concisely show changes to code. The lines starting with a space (shown in black) are context lines, which indicate where in the file the changes should occur; the lines starting with a "+" sign (shown in green) indicate lines to be added; if there were lines starting with a "-" sign, they would indicate lines to be removed (typically shown in red).

An even worse attack involves Eve writing to Alice's code in the static segment of her process. Recall that the code segment contains the machine instructions executed by Alice's process. If Eve finds a convenient place to sneakily insert new code, she can cause Alice to compute on her behalf, or worse, force Alice into an infinite loop, permanently disabling her process.

Remember that the two bytes 0xEB 0xFE correspond to an infinite loop in x86-64 machine code (a two-byte instruction encoding an unconditional, relative jump by -2 bytes). If Eve writes these bytes into the first two bytes of any instruction in the inner loop of Alice's code, Alice will enter an infinite loop the next time she executes the instructions at that address. For example, one such address is 0x10'0077, normally a mov instruction just after Alice's process returns from a system call (see obj/p-alice.asm). Eve might make the following change to her program to corrupt Alice's process:

--- p-eve.cc    2020-03-05 10:26:32.893535962 -0500
+++ p-eve.cc    2020-03-05 10:25:51.894988665 -0500
@@ -11,9 +11,9 @@
         snprintf(msg, 15, "EVE ATTACK!");

         if (i % 2048 == 0) {
+          char* alicecode = (char*) 0x100077;
+          alicecode[0] = 0xEB;
+          alicecode[1] = 0xFE;
-          char* syscall = (char*) 0x40ad6;
-          syscall[0] = 0xEB;
-          syscall[1] = 0xFE;

           console_printf(0x0D00, "MWAHAHAHAHAHAH EVE REIGNS SUPREME!\n");
         }
This replaces the prior attack on the kernel code (which got Eve's process killed, as we now protect the kernel memory) with an attack on Alice's code.
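Why do the bytes 0xEB 0xFE loop forever? 0xEB is the opcode of a short jump whose second byte is a signed 8-bit displacement, measured from the address of the next instruction. The sketch below works through that arithmetic; short_jump_target is a hypothetical helper written for this illustration.

```cpp
#include <cstdint>

// Target of a two-byte short jump (opcode 0xEB followed by the signed 8-bit
// displacement `rel8`) located at address `addr`. The displacement is
// relative to the next instruction, which starts at addr + 2.
constexpr uint64_t short_jump_target(uint64_t addr, uint8_t rel8) {
    return addr + 2 + static_cast<int64_t>(static_cast<int8_t>(rel8));
}

// 0xFE is -2 as a signed byte, so the jump lands on its own first byte.
static_assert(short_jump_target(0x100077, 0xFE) == 0x100077,
              "EB FE jumps back to itself");
// For contrast, a displacement of +5 skips five bytes past the jump.
static_assert(short_jump_target(0x100077, 0x05) == 0x10007E,
              "EB 05 jumps forward");
```

Since the jump targets itself, the processor executes the same two bytes over and over, which is exactly the infinite loop Eve plants in Alice's code.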

Back to Virtual Memory!

The basic idea behind virtual memory is to create, for each user-space process, the illusion that it runs alone on the computer and has access to the computer's full memory. In other words, we seek to give different processes different views of the actual memory.

Recall that the (physical) memory in DemoOS is roughly laid out as follows:

         0x0
            +--------------------------------------------------------------------+
null page ->|R                                                                   |
            +--------------------------------------------------------------------+
 0x40000 -->|KKKKKKKKKKKKKKKKKKKKKKKKKKKKKK                                     K| <-- kernel stack
(kernel mem)+--------------------------------------------------------------------+
            |                                    RRRRRRRRRRRRRRRRRRRRRCRRRRRRRRRR| console @ 0xB8000
            +--------------------------------------------------------------------+
            |RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|
            +--------------------------------------------------------------------+
0x100000 -->|Code|Data|Heap ...       Alice's process memory            ... Stack| <-- 0x13ffff
            +--------------------------------------------------------------------+
0x140000 -->|Code|Data|Heap ...         Eve's process memory            ... Stack| <-- 0x17ffff
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
                                                                                 0x1fffff (MEMSIZE_PHYSICAL - 1)

What we'd like to achieve through virtual memory is that Alice's user-space process has the following view of this memory:

         0x0
            +--------------------------------------------------------------------+
null page ->| XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
 0x40000 -->| XXX NO ACCESS in userspace XXX                                     | <-- kernel stack
(kernel mem)+--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                       C          | console @ 0xB8000 (can access)
            +--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
0x100000 -->|Code|Data|Heap ...       Alice's process memory            ... Stack| <-- 0x13ffff
            +--------------------------------------------------------------------+
0x140000 -->| XXX NO ACCESS (Eve's memory) XXX                                   | <-- 0x17ffff
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
                                                                                 0x1fffff (MEMSIZE_PHYSICAL - 1)

... while Eve's user-space process should see this view:

         0x0
            +--------------------------------------------------------------------+
null page ->| XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
 0x40000 -->| XXX NO ACCESS in userspace XXX                                     | <-- kernel stack
(kernel mem)+--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                       C          | console @ 0xB8000 (can access)
            +--------------------------------------------------------------------+
            | XXX NO ACCESS XXX                                                  |
            +--------------------------------------------------------------------+
0x100000 -->| XXX NO ACCESS (Alice's memory) XXX                                 | <-- 0x13ffff
            +--------------------------------------------------------------------+
0x140000 -->|Code|Data|Heap ...         Eve's process memory            ... Stack| <-- 0x17ffff
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
            |                                                                    |
            +--------------------------------------------------------------------+
                                                                                 0x1fffff (MEMSIZE_PHYSICAL - 1)

The memory labeled XXX NO ACCESS XXX here should behave as if it didn't exist: in particular, any access to a memory page in these regions should cause an exception (this is called a "page fault") that transfers control into the kernel.

Note that neither Alice nor Eve needs to know here that they're not looking at the real picture of memory, but rather at a specific fiction created for their process. From the perspective of each process, it may as well be running on a computer with the physical memory laid out as shown in these views. (This is the power of virtualization: the notion of faking out an equivalent abstraction over some hardware without changing the interface.)

To achieve this, we need a layer of indirection between the addresses user-space processes use and the physical memory addresses that correspond to actual locations in RAM. We achieve this indirection by mapping virtual pages to physical pages through page tables.
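As a rough intuition for this indirection, the toy model below gives each process its own flat table with one entry per 4096-byte page. This is deliberately not how x86-64 page tables are laid out (the real ones are multi-level and carry permission bits); ToyPageTable, make_alice_view, and the 2 MB memory size taken from the diagrams are assumptions for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint64_t kPageSize = 4096;      // bytes per page
constexpr uint64_t kMemSize = 0x200000;   // assumed: 2 MB, as in the diagrams
constexpr size_t kNumPages = kMemSize / kPageSize;

struct PageEntry {
    bool present;        // does this virtual page map to anything?
    uint64_t phys_page;  // physical page number it maps to
};

// Toy single-level page table; each process would get its own instance.
struct ToyPageTable {
    std::array<PageEntry, kNumPages> entries{};  // all pages start unmapped

    // Map the page-aligned virtual range [start, end) to the same physical
    // addresses (an identity mapping, matching the views drawn above).
    void map_identity(uint64_t start, uint64_t end) {
        assert(start % kPageSize == 0 && end % kPageSize == 0);
        assert(start <= end && end <= kMemSize);
        for (uint64_t a = start; a < end; a += kPageSize) {
            entries[a / kPageSize] = PageEntry{true, a / kPageSize};
        }
    }

    // Translate a virtual address. Returns false where real hardware would
    // raise a page fault; otherwise stores the physical address in *paddr.
    bool translate(uint64_t vaddr, uint64_t* paddr) const {
        if (vaddr >= kMemSize) return false;
        const PageEntry& e = entries[vaddr / kPageSize];
        if (!e.present) return false;
        *paddr = e.phys_page * kPageSize + vaddr % kPageSize;
        return true;
    }
};

// Alice's user-space view: only the console page and her own memory exist.
ToyPageTable make_alice_view() {
    ToyPageTable pt;
    pt.map_identity(0xB8000, 0xB9000);    // console
    pt.map_identity(0x100000, 0x140000);  // Alice's process memory
    return pt;
}
```

With a table like this for Alice, an address inside her own region translates normally, while any address in Eve's region or in kernel memory has no entry, so the access fails instead of reaching physical memory.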

Summary

Today, we discussed how system calls work (both on WeensyOS and on Linux), and learned about file descriptors and their role in I/O.

In the last part, we then saw that we need to protect user-space processes' memory from other processes. Virtual memory is the way to achieve this, as it gives each process its own view of the computer's memory.