Lecture 15: Privilege Separation, Memory Protection
🎥 Lecture video (Brown ID required)
🎥 2023 video (full audio) (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, March 20).
Userspace/Kernel Privilege Separation
Last time, we saw how we can prevent a process stuck in an infinite loop from taking over our computer by using timer interrupts to regularly return control to the kernel. This is part of the kernel's mission to safely share hardware resources of the computer between different user-space programs.
Shared Hardware Resources
Nearly all hardware resources in our computers are shared: for some, like memory, the kernel can split them into regions that different processes use, but for others, the sharing happens in the time domain. The kernel gives one process access to the resource, then after some time takes control back from that process and potentially puts another process in control of the resource. This notion of time-sliced resource use is called time-sharing. One of the most important time-shared resources of the computer is the processor itself: access to its arithmetic circuits, its registers, and the ability to run instructions is time-shared by different programs, which take turns in using the processor.
Other hardware resources that are time-shared include the keyboard, a printer, or specific memory bytes, which may well be used by different programs over time.
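To make time-sharing a little more concrete, here is a minimal simulation sketch of processes taking turns on a single processor. It is not how WeensyOS schedules processes: the names `round_robin_turns` and `demo_turns`, the slice length, and the workloads are all made up for illustration, and a real kernel regains control through timer interrupts rather than through a cooperative loop like this one.

```cpp
#include <cstddef>

// Simulate time-sharing of one CPU among `n` processes.
// remaining[i] is the number of time units process i still needs.
// On each turn, a process runs for at most `slice` units (slice must be
// positive); then control moves on to the next process with work left.
// Returns the total number of turns handed out.
constexpr int round_robin_turns(int* remaining, std::size_t n, int slice) {
    int turns = 0;
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (remaining[i] > 0) {
                // Run for a full slice, or less if the process finishes early.
                int run = remaining[i] < slice ? remaining[i] : slice;
                remaining[i] -= run;
                ++turns;
                work_left = true;
            }
        }
    }
    return turns;
}

// Three hypothetical processes needing 5, 2, and 7 time units, slice of 2.
constexpr int demo_turns() {
    int remaining[3] = {5, 2, 7};
    return round_robin_turns(remaining, 3, 2);
}

// 5 units take 3 turns, 2 units take 1 turn, 7 units take 4 turns.
static_assert(demo_turns() == 8, "3 + 1 + 4 turns in total");
```

No process in this sketch can hold the processor for longer than one slice at a time, which is exactly the property that timer interrupts give the kernel.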
Representing Privilege
How does the kernel receive its privilege, and how does it ensure that it maintains a higher privilege than user-space processes? Again, the interplay between hardware and the operating system plays a key role here.
In particular, the kernel is the first program to run when your computer starts up. When the computer starts ("boots"), it runs with full privilege – in other words, the first program to run can access all hardware as well as read and write any memory location. Therefore, the first program to run had better be the kernel! The kernel can then start other programs with restricted privileges.
One way in which x86-64 computers represent privilege is through a special register, the %cs register. The lowest two bits of this register store either the value 0 or 3. 0 indicates that the processor is running in privileged mode (in practice, this means it's running the kernel), and 3 indicates that it runs in unprivileged mode (typically, a user-space program). The processor's privilege mode controls, amongst other things, whether the current process can do things like turning off interrupts.
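As a minimal sketch, assuming that the privilege level lives in the lowest two bits of a saved %cs value, reading it out could look like the following. The helper names `privilege_level` and `from_user_space` and the example register values are hypothetical and not taken from WeensyOS.

```cpp
#include <cstdint>

// Privilege levels as described above: 0 = kernel, 3 = user space.
constexpr unsigned PRIV_KERNEL = 0;
constexpr unsigned PRIV_USER = 3;

// Extract the privilege level from a saved %cs value by masking
// off everything except the lowest two bits.
constexpr unsigned privilege_level(std::uint64_t cs) {
    return static_cast<unsigned>(cs & 3);
}

// True if the saved %cs value belongs to unprivileged code.
constexpr bool from_user_space(std::uint64_t cs) {
    return privilege_level(cs) != PRIV_KERNEL;
}

// Illustrative register values (assumptions, chosen only for their low bits).
static_assert(privilege_level(0x08) == PRIV_KERNEL, "0x08 ends in bits 00");
static_assert(privilege_level(0x1B) == PRIV_USER, "0x1B ends in bits 11");
static_assert(from_user_space(0x1B), "level 3 is user space");
```

Masking with 3 is the same test that a kernel can apply to the saved register state when it needs to know whether a fault came from user space.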
In WeensyOS, the kernel runs with interrupts turned off – a design that simplifies the kernel, for example because we don't need to worry about receiving an interrupt while the kernel is doing some I/O on behalf of some user-space process. But this simplicity comes at the cost of some performance, which is why widely-used kernels like Linux do support interrupts during most of kernel execution. However, user-space processes should never be able to disable interrupts!
The x86-64 instruction cli ("clear interrupt flag") has the effect of turning off the handling of maskable interrupts, which include the timer interrupt, in the processor. Consequently, if a process is able to run this instruction, it can ignore timer interrupts. Eve might use code like asm volatile ("cli"); in her process to try to achieve this, but assuming that the kernel correctly limited privileges when it started the user-space process, Eve's attempt to run this instruction will result in an exception (a "fault", and in more specific x86-64 terms, a general protection fault), which causes the kernel to take control.
What should the kernel do when Eve tries to execute a privileged instruction? There are several options, and it's down to the programmer writing the OS kernel to decide what policy they apply. For example, the kernel could simply ignore the illegal instruction and continue executing Eve's process from the next instruction. Or the kernel could seek to punish Eve for trying to break the rules and kill her process. In particular, adding the following code to the WeensyOS kernel's exception handler has the effect of killing Eve's process:
void exception(regstate* regs) {
    // [...]
    case INT_GP:
        if (regs->reg_cs & 3) {
            // Userspace fault, kill the process
            current->state = P_BROKEN;
            break;
        } else {
            goto unhandled;
        }
    // [...]
}
In the WeensyOS kernel, current is always a pointer to the process descriptor (a structure with information about the process) for the last user-space process that ran before control entered the kernel. Therefore, current->state refers to the state of the current process, and setting it to P_BROKEN marks the process as no longer viable, and prevents the kernel from running it again.
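The "kill the process" policy can be tried out in isolation with a minimal, self-contained sketch. The `proc` and `regstate` structures below are simplified stand-ins rather than the real WeensyOS definitions, and `P_RUNNABLE`, `handle_gp_fault`, and `state_after_fault` are hypothetical names introduced only for this example.

```cpp
#include <cstdint>

// Simplified stand-ins for kernel types (assumptions, not WeensyOS code).
enum procstate { P_RUNNABLE, P_BROKEN };
struct proc { procstate state; };
struct regstate { std::uint64_t reg_cs; };

// Policy sketch for a general protection fault:
// - fault raised by user-space code: mark the process broken, report handled;
// - fault raised by kernel code: report unhandled (a kernel bug).
constexpr bool handle_gp_fault(const regstate& regs, proc& current) {
    if (regs.reg_cs & 3) {
        current.state = P_BROKEN;
        return true;
    }
    return false;
}

// Helper: state of a freshly runnable process after a fault that was
// raised while %cs held the value `cs`.
constexpr procstate state_after_fault(std::uint64_t cs) {
    proc p{P_RUNNABLE};
    regstate regs{cs};
    handle_gp_fault(regs, p);
    return p.state;
}

static_assert(state_after_fault(3) == P_BROKEN, "user-space fault kills the process");
static_assert(state_after_fault(0) == P_RUNNABLE, "kernel fault leaves it untouched");
```

A different policy, such as skipping the offending instruction, would only change the body of `handle_gp_fault`; the check on the saved %cs value would stay the same.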
There are various other kinds of faults that can occur – for example, a division by zero in user-space can also trigger a processor fault that the kernel needs to handle. The approach for handling these is the same: the kernel exception handler needs to define what to do if the fault occurs, and it may choose to kill the process, or continue it, or take another action.
Memory Organization on WeensyOS
The default output of WeensyOS in your assignments will show an interactive view of how memory is organized at any point in time. (In lectures, we will use a modified version of WeensyOS, called DemoOS, that displays different output).
WeensyOS supports 2 MB of physical memory, ranging from addresses 0x000000 to 0x200000. The PHYSICAL MEMORY display shows how this memory is used by the kernel and different processes.
Each character in the display represents 4096 (2^12) bytes of memory. This unit is called a page, and we'll learn more about it in the next lecture. A page represented by a period (.) corresponds to unused memory. A page labeled R is reserved memory, which cannot be used because it serves specific purposes for interaction with hardware or because of historic reasons in the x86 architecture.
The kernel's memory starts at address 0x40000. This memory holds the kernel code as well as kernel data, and it appears as a pink K in the output. The memory from address 0x100000 onwards is where WeensyOS places user-space processes. In the picture, you can see four active processes: 1 starts at 0x100000 and ends at 0x13FFFF; 2 starts at 0x140000 and ends at 0x17FFFF; 3 starts at 0x180000 and ends at 0x1BFFFF; and 4 starts at 0x1C0000 and ends at 0x1FFFFF.
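The layout described above boils down to a little address arithmetic. The following sketch assumes exactly the fixed layout from the description (2 MB of memory, 4096-byte pages, four process regions of 0x40000 bytes each starting at 0x100000); the names `PROC_START`, `PROC_SIZE`, `page_number`, and `process_for_address` are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t PAGESIZE = 4096;             // 2^12 bytes per page
constexpr std::uint64_t MEMSIZE_PHYSICAL = 0x200000; // 2 MB of physical memory
constexpr std::uint64_t PROC_START = 0x100000;       // first process region
constexpr std::uint64_t PROC_SIZE = 0x40000;         // bytes per process region

// Index of the page (one character in the display) containing `addr`.
constexpr std::uint64_t page_number(std::uint64_t addr) {
    return addr / PAGESIZE;
}

// Number (1 to 4) of the process whose region contains `addr`,
// or 0 if `addr` lies outside process memory.
constexpr int process_for_address(std::uint64_t addr) {
    if (addr < PROC_START || addr >= MEMSIZE_PHYSICAL) {
        return 0;
    }
    return static_cast<int>((addr - PROC_START) / PROC_SIZE) + 1;
}

static_assert(MEMSIZE_PHYSICAL / PAGESIZE == 512, "the display covers 512 pages");
static_assert(process_for_address(0x13FFFF) == 1, "last byte of process 1");
static_assert(process_for_address(0x140000) == 2, "first byte of process 2");
static_assert(process_for_address(0x40000) == 0, "kernel memory is not process memory");
```

Each process region spans 0x40000 / 4096 = 64 pages, so each process accounts for 64 characters in the display.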
Finally, a special memory page at address 0xB8000 within the reserved memory contains the contents of the screen. WeensyOS does not have a graphical user interface, and the screen is referred to as the console. (This term derives from the fact that before personal computers, multiple users could connect to a shared computer through "terminals", but the main computer operator would sit at a "console" and manage the computer from there.) This page holds the complete content of the output: each character on the screen is represented by a two-byte (16-bit) value, whose upper byte indicates the color and whose lower byte indicates the character to display.
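As a rough sketch of this encoding, the following code builds the two-byte value for one character, treating each cell as a single 16-bit number with the color in its upper byte (this matches color arguments such as 0x0E00 used in the lecture code). The 80-column, 25-row console size is an assumption about the display, and `console_cell` and `cell_index` are hypothetical helper names. The sketch only computes values; it does not write to address 0xB8000.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed console dimensions (not stated in the text above).
constexpr std::size_t CONSOLE_COLUMNS = 80;
constexpr std::size_t CONSOLE_ROWS = 25;

// Combine a color (already shifted into the upper byte, e.g. 0x0E00)
// with a character code in the lower byte.
constexpr std::uint16_t console_cell(std::uint16_t color, char ch) {
    return static_cast<std::uint16_t>(color | static_cast<unsigned char>(ch));
}

// Index of the 16-bit cell for a given row and column.
constexpr std::size_t cell_index(std::size_t row, std::size_t col) {
    return row * CONSOLE_COLUMNS + col;
}

static_assert(console_cell(0x0E00, 'H') == 0x0E48, "'H' is 0x48 in ASCII");
static_assert(cell_index(1, 0) == 80, "second row starts after 80 cells");
static_assert(CONSOLE_COLUMNS * CONSOLE_ROWS * 2 <= 4096, "the console fits in one page");
```

Under these assumptions the console needs 80 × 25 × 2 = 4000 bytes, which is why a single 4096-byte page is enough to hold it.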
Memory Protection
Timer interrupts, restricted process privilege, and kernel fault/interrupt handlers help the kernel protect the computer from processes that attack the CPU time resource. But there are other important resources that it also needs to protect, most importantly the memory (RAM), which is a pool of fast storage that all processes (including the kernel) share.
Clearly, we need to protect the kernel's memory from user-space processes like Eve's process. If Eve's process were allowed to write to kernel memory, Eve could simply move her infinite loop into the kernel by modifying the kernel code memory. For example, Eve might replace two instruction bytes with the sequence 0xEB 0xFE, which in x86-64 machine code is a two-byte instruction that stands for an unconditional, relative jump by negative two bytes. In other words, this instruction will always jump back to itself – an infinite loop!
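To see why these two bytes form a loop, recall that the jump distance of this instruction is a signed 8-bit number that is added to the address of the next instruction, which sits two bytes after the start of the jump. The following sketch works through that arithmetic; `short_jump_target` and `is_self_loop` are hypothetical helper names, and the addresses are arbitrary examples.

```cpp
#include <cstdint>

constexpr std::uint8_t JMP_REL8 = 0xEB; // opcode of the two-byte relative jump

// Target of a two-byte jump located at `addr` with displacement byte `rel8`.
// The displacement is signed and relative to the next instruction (addr + 2).
constexpr std::uint64_t short_jump_target(std::uint64_t addr, std::uint8_t rel8) {
    // Sign-extend the 8-bit displacement: 0xFE becomes -2.
    std::int64_t disp = rel8 >= 0x80
        ? static_cast<std::int64_t>(rel8) - 0x100
        : static_cast<std::int64_t>(rel8);
    return addr + 2 + static_cast<std::uint64_t>(disp);
}

// True if the two bytes encode a jump that lands on itself.
constexpr bool is_self_loop(std::uint8_t opcode, std::uint8_t rel8) {
    return opcode == JMP_REL8 && short_jump_target(0x1000, rel8) == 0x1000;
}

static_assert(short_jump_target(0x1000, 0xFE) == 0x1000, "jumps back to itself");
static_assert(short_jump_target(0x1000, 0x05) == 0x1007, "forward jump by five bytes");
static_assert(is_self_loop(0xEB, 0xFE), "0xEB 0xFE is an infinite loop");
```

The result does not depend on where the instruction is placed, so the same two bytes create an infinite loop at any address.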
If Eve was able to inject such an infinite loop into the kernel code, she could clearly monopolize the CPU: within the WeensyOS kernel, timer interrupts are disabled, so our remedy from earlier examples no longer works (nothing could ever interrupt that loop). On an unmodified WeensyOS, this attack is indeed possible!
For example, Eve might choose to overwrite the first few bytes of the syscall_entry function in the WeensyOS kernel. This function is located at address 0x40ad6 (the specific address may differ if you compile the code), as shown by searching for syscall_entry in obj/kernel.asm (the kernel disassembly). If Eve successfully writes 0xEB 0xFE to 0x40ad6, the computer gets stuck. An attack might look like the following in Eve's user-space code:
#include "u-lib.hh"

void process_main() {
    unsigned i = 0;
    while (true) {
        ++i;
        if (i % 1024 == 0) {
            console_printf(0x0E00, "Hi, I'm Eve! #%d\n", i / 512);
        }
        if (i % 2048 == 0) {
            // form a pointer into kernel memory, specifically kernel code for `syscall_entry`
            char *syscall = (char *) 0x40ad6;
            // write infinite loop bytes there (0xEB 0xFE is a 2-byte instruction meaning an
            // unconditional, relative jump by negative two bytes in x86-64 machine code).
            syscall[0] = 0xEB;
            syscall[1] = 0xFE;
            console_printf(0x0E00, "EVE ATTACK!\n");
            while (true) {
            }
        }
        sys_yield();
    }
}
Note that Eve here forms a pointer into kernel code by simply casting the literal address into a char*. This is possible because Eve knows the kernel memory layout, and because she can directly access kernel memory despite being a user-space process.
Solution: Memory Mappings with Restricted Permissions
Fortunately, our computers' hardware provides a mechanism to restrict the access permissions for different regions of memory. These permissions are set on a per-page granularity: specifically, every memory page (recall, a page is 4 KB = 4096 bytes of memory) can have several access bits set:
- PTE_P: present. Indicates that the page is in use.
- PTE_W: writable. Indicates that the page can be read and written.
- PTE_U: user-space. Indicates that the page can be accessed by user-space processes (i.e., when the privilege level in %cs is 3).
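The three bits above combine into a simple decision for each memory access. The sketch below is a deliberately simplified model of that decision, not a description of everything the hardware checks; `access_allowed` is a hypothetical helper, and the numeric values 1, 2, and 4 are the positions these bits occupy in an x86-64 page table entry.

```cpp
// Permission bits (bit 0, bit 1, and bit 2 of an x86-64 page table entry).
constexpr int PTE_P = 1; // present
constexpr int PTE_W = 2; // writable
constexpr int PTE_U = 4; // accessible from user space

// Simplified model: may code with the given privilege perform this access
// on a page with permissions `perm`?
constexpr bool access_allowed(int perm, bool user_mode, bool is_write) {
    if (!(perm & PTE_P)) {
        return false; // page is not present: every access faults
    }
    if (user_mode && !(perm & PTE_U)) {
        return false; // user-space access to a kernel-only page
    }
    if (is_write && !(perm & PTE_W)) {
        return false; // write to a read-only page
    }
    return true;
}

static_assert(access_allowed(PTE_P | PTE_W | PTE_U, true, true), "user may write a user page");
static_assert(!access_allowed(PTE_P | PTE_W, true, false), "user may not read a kernel page");
static_assert(access_allowed(PTE_P | PTE_W, false, true), "kernel may write a kernel page");
```

Any access for which this model returns false corresponds to a fault that hands control to the kernel.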
Where are the permission bits stored?
The page permission bits are actually stored in a data structure called the page table, which we'll soon hear more about. The page table is a convenient place to store this information because, whenever a user-space program accesses memory, the processor has to look up the address being accessed in the page table anyway in order to perform virtual-to-physical address translation (which we'll also hear more about soon).
In the WeensyOS kernel, the memory permissions get set up in the kernel() function right after the kernel starts. In particular, this loop is where all the action is:
// (re-)initialize kernel page table:
// all of physical memory is accessible except `nullptr`
for (vmiter it(kernel_pagetable);
     it.va() < MEMSIZE_PHYSICAL;
     it += PAGESIZE) {
    if (it.va() != 0) {
        it.map(it.va(), PTE_P | PTE_W | PTE_U);
    } else {
        it.map(it.va(), 0);
    }
}
This loop uses C++ iterator syntax to iterate over the memory mappings stored in something called the "kernel pagetable". C++ iterators are a handy concept for iterating over data structures that do not support simple integer indices for access. The loop says that, for a new memory iterator it of type vmiter over the kernel_pagetable, we wish to iterate until the current address (it.va()) is MEMSIZE_PHYSICAL (0x20'0000 in WeensyOS), and that we want to jump in steps of PAGESIZE (4096 bytes). (Note that the vmiter implicitly starts iteration at address 0; an alternative syntax for the constructor of the vmiter class is vmiter it(kernel_pagetable, 0).)
The loop body does one of two things:
- If we're not dealing with the address 0x0 (the null pointer!), set all permissions on the page at that address.
- If we are dealing with the null pointer, clear all permissions so that any access to the page results in a fault into the kernel. (This is helpful to catch null pointer dereferences in your programs!)
Having PTE_U set on kernel memory pages is dangerous. Let's change this loop to make kernel memory unavailable to user-space processes, while keeping the user-space memory from PROC_START_ADDR upwards available to them:
// (re-)initialize kernel page table:
// only process memory and the console are accessible to userspace
for (vmiter it(kernel_pagetable);
     it.va() < MEMSIZE_PHYSICAL;
     it += PAGESIZE) {
    if (it.va() >= PROC_START_ADDR || it.va() == CONSOLE_ADDR) {
        // process memory, console (for memory-mapped I/O) accessible to userspace
        it.map(it.va(), PTE_P | PTE_W | PTE_U);
    } else if (it.va() != 0) {
        // kernel memory, NOT accessible to userspace
        it.map(it.va(), PTE_P | PTE_W);
    } else {
        // null page (for null pointer derefs) mapped with no permissions
        it.map(it.va(), 0);
    }
}
Note that we had to special-case CONSOLE_ADDR (0xB'8000) here: even though the console is within memory below PROC_START_ADDR, user-space processes need to be able to access it in order to run console_printf successfully.
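The policy of the revised loop can also be checked outside the kernel. The sketch below restates it as a plain function over addresses, without vmiter or a real page table; `perms_for_page` and `count_user_pages` are hypothetical names, and the value of PROC_START_ADDR is assumed to be 0x100000, the address where WeensyOS places user-space processes.

```cpp
#include <cstdint>

constexpr std::uint64_t PAGESIZE = 4096;
constexpr std::uint64_t MEMSIZE_PHYSICAL = 0x200000;
constexpr std::uint64_t PROC_START_ADDR = 0x100000; // assumed value
constexpr std::uint64_t CONSOLE_ADDR = 0xB8000;
constexpr int PTE_P = 1, PTE_W = 2, PTE_U = 4;

// Permissions that the revised loop assigns to the page containing `addr`.
constexpr int perms_for_page(std::uint64_t addr) {
    std::uint64_t va = addr - addr % PAGESIZE; // round down to the page start
    if (va >= PROC_START_ADDR || va == CONSOLE_ADDR) {
        return PTE_P | PTE_W | PTE_U; // process memory and console
    } else if (va != 0) {
        return PTE_P | PTE_W;         // kernel memory, no user access
    } else {
        return 0;                     // null page, no access at all
    }
}

// Count the pages that user-space processes may access, walking memory
// page by page in the same way as the loop over the page table.
constexpr int count_user_pages() {
    int n = 0;
    for (std::uint64_t va = 0; va < MEMSIZE_PHYSICAL; va += PAGESIZE) {
        if (perms_for_page(va) & PTE_U) {
            ++n;
        }
    }
    return n;
}

static_assert(!(perms_for_page(0x40ad6) & PTE_U), "kernel code is off limits to user space");
static_assert((perms_for_page(CONSOLE_ADDR) & PTE_U) != 0, "the console stays accessible");
static_assert(count_user_pages() == 257, "256 process pages plus the console page");
```

Of the 512 pages, 257 remain accessible to user space under these assumptions: the 256 pages from 0x100000 upwards and the single console page.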
With this change to the kernel, Eve can no longer access kernel memory. However, the kernel crashes with an error indicating an unexpected "page fault" when trying to access address 0x40ad6 at instruction 0x140045 (within Eve's code!). Next time, we'll fix this and also learn more about the notion of page tables!
Summary
Today, we developed a better understanding of how the OS kernel works together with the hardware to establish a notion of privilege: the kernel executes with the processor in privileged mode, while user-space processes do not. When user-space processes try to perform an operation that requires privilege, a fault occurs and puts the kernel back in control. Typically, the kernel then kills the offending process!
We looked at how hardware mechanisms and the OS kernel work together to make sure our computers run robustly even when programs misbehave. In particular, we saw that there are certain privileged operations that only the kernel can perform, and which cause faults that put the kernel in control of the CPU if user-space processes attempt to perform them. This allows the OS kernel to prevent any attacks that try to take over the CPU, and ensures that user-space processes run with reduced privileges.
We also started to investigate the notion of memory protection, rooted in the realization that RAM is a shared resource between all the programs running on the computer. It's important that kernel memory is protected from userspace processes, since they could otherwise just modify the kernel code to gain control of the computer!