CSCI 0300/1310: Fundamentals of Computer Systems

⚠️ This is not the current iteration of the course! Head here for the current offering.

Lecture 8: Integer Overflow, Assembly Language

» Lecture video (Brown ID required)
» Lecture code (Integer Overflow)
» Lecture code (Assembly) » Post-Lecture Quiz (due 6pm Monday, February 22).

Integer overflow

Arithmetic overflow on signed integers is undefined behavior! To demonstrate this, let's look at ubexplore.c. This program takes its first argument, converts it to an integer, and then adds 1 to it. It also calls a function called check_signed_increment, which uses an assertion to check that the result of adding 1 to x (the function's argument) is indeed greater than x. Intuitively, this should always be true from a mathematical standpoint. But in two's complement arithmetic, it's not always true: consider what happens if I pass 0x7fff'ffff (the largest positive signed int) to the program. Adding 1 to this value turns it into 0x8000'0000, which is the smallest negative number representable in a signed integer! So the assertion should fail in that case.

With compiler optimizations turned off, this is indeed what happens. But since undefined behavior allows the compiler to do whatever it wants, the optimizer decides to just remove the assertion in the optimized version of the code! This is perfectly legal, because C compilers assume that programmers never write code that triggers undefined behavior, and certainly that programmers never rely on a specific behavior of code that is undefined behavior (it's undefined, after all).

Perhaps confusingly, arithmetic overflow on unsigned numbers does not constitute undefined behavior. It still best avoided, of course :)

The good news is that there is a handy sanitizer tool that helps you detect undefined behavior such as arithmetic overflow on signed numbers. The tool is called UBSan, and you can add it to your program by passing the -fsanitize=undefined flag when you compile.

⚠️ We did not cover ubexplore2.c this year. Following material is for your education only; we won't test you on it. Feel free to skip ahead.

And just to mess with you and demonstrate that arithmetic overflow on signed integers produces confusing results not only with compiler optimizations enabled, let's look at ubexplore2.c. This program runs a for loop to print the numbers between its first and second argument. ./ubexplore2.opt 0 10 prints numbers from 0 to 10 inclusive, and ./ubexplore2.opt 0x7ffffff0 0x7fffffff prints 16 numbers from 2,147,483,632 to 2,147,483,647 (the largest positive signed 4-byte integer we can represent). But ./ubexplore2.noopt 0x7ffffff0 0x7fffffff prints a lot more and appears to loop infinitely! It turns out that although the optimized behavior is correct for mathematical addition (which doesn't have overflow), the unoptimized code is actually correct for computer arithmetic. When we look at the code carefully, we understand why: the loop increments i after the body executes, and 0x7fff'ffff overflows into 0x8000'0000 (= -1), so next time the loop condition is checked, -1 is indeed less than or equal to n2. But with optimizations enabled, the compiler increments i early and compares i + 1 < n2 rather than i <= n2 (a legal optimization if assuming that i + 1 > i always).

Assembly Language

We've now arrived at the end of our introduction to C programming. The rest of the course will mostly look at higher-level concepts built atop the understanding we have developed. But before the look at higher levels, we will briefly pull back the covers and see what happens at the level below C to make your programs run.

    | Web sites, Google, Facebook, AirBnB, etc.
--- |-------------------------------------------
    | Distributed systems     <-- block 4
 C  |-------------------------
 S  | Parallel programming    <-- block 3
    |-------------------------
 1  | C++ | Operating systems <-- block 2
 3  |-------------------------
 1  | C programming language  <-- we discussed this so far
    |-------------------------
    | Assembly language       <-- we will briefly cover this
--- |-------------------------------------------
    | Hardware (chips)

Now that you understand the C language and memory representations of data, you may wonder about the "magic" hexadecimal bytes that the compiler outputs to make your computer's processor do things like adding numbers. How does the compiler choose these bytes, and what bytes are valid?

Each computer architecture (such as x86-64, which most modern computers use and we're considering in this course) has an instruction set specified by the manufacturer. The instruction set, first and foremost, defines what sequences of bytes trigger specific behavior in the processor (e.g., adding numbers, comparing them for equality, or loading data from memory). But hexadecimal bytes are hard for humans to read, so the instruction set also comes with a human-readable assembly language that consists of short, mnemonic instructions that correspond directly to a byte encoding (i.e., each of these instructions corresponds to a specific, unique set of hexadecimal bytes).

Why are we covering this?

You will almost certainly never need to write code in assembly language yourself, but it is helpful to have at least some intuition of how to read it. Being able to read assembly helps you debug weird problems with your program, and can help you understand compiler optimizations betters. Some of the examples we look at will also illustrate to you the tricks used to make computers run our code efficiently.
If you're curious to learn more details about assembly, consider taking CSCI 0330, which explains it in more detail and with more hands-on exercises than we'll have time for!

From C code to instructions

Let's take a look at how your C program get turned into hexadecimal bytes that can run on your processor.

The compiler, which we've discussed a lot already, turns your C program into assembly language. Lots of cleverness and optimizations.

The assembler turns assembly language into a set of bytes in an executable. This a very direct translation.

Registers

Assembly instructions operate on registers, small pieces of very fast memory inside the processor. To process data stored in memory, the processor first needs to load it into registers; and once it has completed working on the data in a register, it needs to store it back to memory.

Registers are the fastest kind of memory available in the machine. x86-64 has 14 general-purpose registers and several special-purpose registers. The table below lists all basic registers, with special-purpose registers highlighted in yellow. You won't understand all columns yet, but you will soon and can then use this table as a reference (we won't ask you to memorize it in detail). You'll notice different naming conventions for subsets of the same register, a side effect of the long history of the x86 architecture (the first x86 processor, the 8086 was first released in 1978).

Full register name	32-bit (bits 0–31)	16-bit (bits 0–15)	8-bit low (bits 0–7)	8-bit high (bits 8–15)	Use in calling convention	Callee-saved?
General-purpose registers:
%rax	%eax	%ax	%al	%ah	Return value (accumulator)	No
%rbx	%ebx	%bx	%bl	%bh	–	Yes
%rcx	%ecx	%cx	%cl	%ch	4th function argument	No
%rdx	%edx	%dx	%dl	%dh	3rd function argument	No
%rsi	%esi	%si	%sil	–	2nd function argument	No
%rdi	%edi	%di	%dil	–	1st function argument	No
%r8	%r8d	%r8w	%r8b	–	5th function argument	No
%r9	%r9d	%r9w	%r9b	–	6th function argument	No
%r10	%r10d	%r10w	%r10b	–	–	No
%r11	%r11d	%r11w	%r11b	–	–	No
%r12	%r12d	%r12w	%r12b	–	–	Yes
%r13	%r13d	%r13w	%r13b	–	–	Yes
%r14	%r14d	%r14w	%r14b	–	–	Yes
%r15	%r15d	%r15w	%r15b	–	–	Yes
Special-purpose registers:
%rsp	%esp	%sp	%spl	–	Stack pointer	Yes
%rbp	%ebp	%bp	%bpl	–	Base pointer (general-purpose in some compiler modes)	Yes
%rip	%eip	%ip	–	–	Instruction pointer (Program counter; called $pc in GDB)	*
%rflags	%eflags	%flags	–	–	Flags and condition codes	No

Note that unlike primary memory (RAM) – which is what we think of when we discuss memory in a C/C++ program – registers have no addresses! There is no address value that, if cast to a pointer and dereferenced, would return the contents of the %rax register. Registers live in a separate world from the memory, and we need special instructions to move data to and from registers and memory.

Whenever you see %ZZZ in assembly code, this refers to a register named ZZZ. The x86-64 registers have confusing names because they evolved over time; each register also has multiple names that refer to different subsets of its bits. For example %rax, one of the general-purpose registers that is, by convention, used to pass return values from functions, is split into the following five names:

 63                              31              15      7       0
 +-------------------------------+-------------------------------+
 |                               |               |       |       |
 +---------------------------------------------------------------+

 |---------------------%rax (64 bits/8 bytes)--------------------|
                                 |-----%eax (32 bits/4 bytes)----|
                                                 |-%ax (16b/2B)--|
                                                 |--%ah--|--%al--| <-- 8 bits/1 byte each

Assembly instructions often have a suffix that indicates what input data size and register width they're operating on. For instance, a set of "move" instructions help load signed and unsigned 8-, 16-, and 32-bit quantities from memory into registers. movzbl, for example, moves an 8-bit quantity (a byte) into 32-bit register (a longword; e.g., %eax) with zero extension; movslq moves a 32-bit quantity (longword) into a 64-bit register (quadword; e.g., %rax) with sign extension.

What's up with long suddenly meaning 32-bits (4 bytes)?

Because of wonderful history of the x86 architecture, and to confuse you, a "long" in x86-64 hardware terms does not refer to the same things as a long integer type in C. Specifically, a x86-64 assembly long is 4 bytes, so it corresponds to a C int. The 8-byte long (or indeed any pointer type) in C uses "quad" instructions in x86-64 assembly, denoted by a q suffix.

Note that what looks like types (such as long, short, etc.) here merely refers to the register width used in the instruction. All actual types are removed from the program during compilation; there are no types in assembly (for examples, see asm06.s and asm07.s and their corresponding C source files in the lecture code).

Instructions

There are three basic kinds of assembly instructions. We'll see two today, and another next time.

Computation: These instructions computate on values, typically values stored in registers. Most have zero or one source operands and one source/destination operand, with the source operand coming first. For example, the instruction addq %rax, %rbx performs the computation %rbx := %rbx + %rax.
Data movement: These instructions move data between registers and memory – so they can move values from one register to another, from memory into a register, and from a register back to memory. Almost all move instructions have one source operand and one destination operand; the source operand comes first. For example, movq %rax, %rbx copies the contents of %rax into %rbx, so it performs the assignment %rbx = %rax.

Some instructions appear to combine computation and data movement. For example, given the C code int* pi; ... ++(*pi); the compiler might generate incl (%rax) rather than movl (%rax), %ebx; incl %ebx; movl %ebx, (%rax). However, the processor actually divides these complex instructions into tiny, simpler, invisible instructions called microcode, because the simpler instructions can be made to execute faster. The complex incl instruction actually runs in three phases: data movement, then computation, then data movement. This matters when we introduce parallelism.

Different assembly syntaxes

There are actually multiple ways of writing x86-64 assembly. We use the "AT&T syntax", which is distinguished from the "Intel syntax" by several features, but especially by the use of percent signs for registers. Sadly, and just to make things more confusing, the Intel syntax puts destination registers before source registers.

How to read an assembly file

Assembly files (and assembly layout in GDB, layout asm) can be confusing at first. The important tricks to reading them are the following:

Focus only on the instructions that you care about, and initially ignore anything else.
Work backwards from the return statement.

Here's an example assembly file (output from gcc -S testasm.c) to consider:

    .file "testasm.c"
    .text
    .globl  _Z1fiii
    .type   _Z1fiii, @function
_Z1fiii:
.LFB0:
    cmpl    %edx, %esi
    je  .L3
    movl    %esi, %eax
    ret
.L3:
    movl    %edi, %eax
    ret
.LFE0:
    .size   _Z1fiii, .-_Z1fiii
    .ident  "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
    .section        .note.GNU-stack,"",@progbits

There are many lines here that are effectively comments. All lines starting with a dot (e.g., .file or .ident) are of this kind: they constitute "directives", rather than instructions. Some directives tell the assembler what to do, but they're often unimportant to your understanding. All directive lines that end in a colon, however, are important: they constitute labels, which matter for control flow instructions (e.g., .L3:).

The actual instructions are on the indented lines between the labels. The processor will execute these top to bottom unless it's told to continue somewhere other than the next instruction by a control flow instruction (e.g., je). A good way to figure out the important parts of the instructions is often to work backwards from the ret instructions, which is the function's return point. Why do we work backwards? Going forward is more difficult because it requires understanding what the state of the processor's register is when we call the function (something not explicitly described in the file here). By working backwards from the return, we can often figure out what the function does without knowing its inputs.

Sometimes, we are also interested in looking at the assembly in an already-compiled object file – i.e., binary code after the translation from human-readable assembly language to bytes. This is called "disassembling" from executable instructions, and happens when we look at assembly using GDB, objdump -d, or objdump -S. This output looks different from compiler-generated assembly: in disassembled instructions, there are no intermediate labels or directives. This is because the labels and directives disappear during the process of generating executable instructions.

Summary

Today, we looked at how the computer operates at the level just below C code: it executes a sequence of assembly instructions, which are small operations that translate into operations of the processor's circuits. Assembly is hard to write, but it is useful to be able to read it somewhat intuitively.

We saw that in assembly, there are computation and data movement instructions, and that the compiler often produces somewhat unexpected instruction sequences to make things faster. This is part of why we use compilers: they are incredibly smart at distilling our programs down into the fastest possible sequence of instructions.

Next time, we will look at how function calls work and how the assembly instructions actually manage the automatic lifetime memory in the stack segment. After that, we will leave the low-level world of assembly and start moving up the systems stack (no pun in intended)!