Lecture 4: Memory Organization, Dynamic Memory Allocation, and Undefined Behavior

Pointers and Strings

Arrays and Structures

Lecture 4: Memory Organization, Dynamic Memory Allocation, and Undefined Behavior #

🎥 Lecture video (Brown ID required)

❓ Post-Lecture Quiz (due 11:59pm, Wednesday, February 5)

💻 Code examples:

More on Strings: the null terminator #

Last lecture, we started looking at strings in the program mexplore-string.c. This program defines a string with the type char[]. char is the name for a byte type in the C language; it refers to the fact that a byte is exactly sufficient to store one character according to the ASCII standard, a way of translating numbers into characters and vice versa. Computers can only store numbers, so all characters in a computer are actually "encoded" as numbers. For example, the uppercase letter "A" in ASCII corresponds to the number 65 (see man ascii for the translation table).

A string in C is simple a sequence of bytes, each represented as a char. So, to look at the string in memory, we need to change our call to hexdump() to print the right number of bytes. Our two strings contain 12 and 13 characters, so that's 12 and 13 bytes. And indeed, ./mexplore-string prints those those bytes if you pass 12 and 13 to hexdump(). But what if we pass a larger number? Turns out we just get whatever is in memory next to the string. For global_st in our example, that's a bunch of garbage. If we keep going long enough (e.g., 20,000 bytes), the program actually crashes with a Segmentation fault (SEGV). You can probably guess now what that error is about: the program tried to access memory outside of a valid segment, and the OS terminated it to keep things safe.

But how does a program know when a string ends? We may not always know the length at compile time, as we do in these examples. It turns out that there's a bonus NUL (0x0, sometimes written \0) character hanging around at the end of every string. The NUL byte is a terminator that has the special meaning "end-of-string". Every C string must have a NUL byte at its end. (Forgetting the NUL byte is a huge source of bugs in the real world and in your assignments. Watch out for it; this matters for Lab 1 and Project 1!) So, our strings are really 13 bytes long, not 12.

Exploring Data Representation In Memory #

We already understand how programs and their data are just bytes in memory, but now let's look in more detail at how data is represented in memory.

Why are we covering this?
We're now building an understanding of where the different parts of a C/C++ program (and, in fact, programs in other languages too!) are stored in memory. This will help you understand how a program obtains and manages memory, something that some programming languages (e.g., Java, OCaml, Pyret) do automatically behind your back, while others, and particularly systems programming languages like C and C++, force you as the programmer do some of this memory management. This gives you a great degree of control, and allows avoiding expensive hidden memory allocation and copying costs.

For this, we will use the program mexplore.c. This program declares and defines (recall the difference!) three variables, all of type char. What's the difference between these three char variables? Let's take a look. The first one, global_ch is defined in what we call global scope: it's at the top level of the file, not inside a function or inside the curly braces ({ }) that C and C++ use to delineate scopes in the program. This variable can be referred to from anywhere in the entire program.

The second variable, const_global_ch is also a global variable, but the const keyword indicates that it is constant and the compiler and OS should not allow modifications to it.

Finally, our third variable is inside function f(). It's called local_ch and is a local variable. It's valid only within the scope of f() and other parts of the program (such as main) cannot refer to it.

The hexdump() function that f() calls is defined in hexdump.c and imported via hexdump.h, a "header file". (In a future lecture, we'll talk about why header files exist.) hexdump(ADDR, N) has the effect of printing the contents of N bytes of memory at address ADDR. We're passing our character variables to it, but prefix the variable with an ampersand character, &. So: hexdump(&global_ch, 1) means "print 1 byte from a box located at the address of global_ch".

This is an important concept of the C language: you can always get the memory address at which an object is located. The term "object" here means something different from what it means in an object-oriented language like Java: rather than an instance of a class, an "object" according to the C standard is a set of bytes that contain a value. This can be code (a function) or data (a variable).

In other words, local and ptr in the snippet below refer to the same object, i.e., to the same bytes of memory:

void f() {
    int local = 1;
    int* ptr = &local;
}

Reacll that &local means "the address of local", and the value stored in the memory location where ptr is located is the 8 bytes that make up the address of local. The type of ptr is int*, which signifies that it is the address of an integer in memory (a short* would be the address of a short, a char* the address of a char, etc.).

You can invert the ampersand operator using the asterisk (*) operator: *ptr dereferences the pointer and turns the address back into a value. In other words, *&local is the same as plain, old local. Thee concepts are very, very important – you'll use them all the time!

Back to our mexplore.c program though. Let's look at what it prints when we run it (note that the specific addresses will be different on your computer).

$ ./mexplore
00601038  41                                                |A|
004009a4  42                                                |B|
7ffd4977e80f  43                                                |C|

On the left, we see the addresses of our three variables, printed in hexadecimal notation. Next, just to the right of that, we see the hexadecimal value of the data stored in the byte at each of these addresses. For example, hexadecimal 41 (often written 0x41 for clarity) is equal to ... 16*4 + 1 = 65! Not surprisingly, this equals to the ASCII character "A", which we see on the right.

But let me draw your attention to the addresses on the left. They vary a fair amount! The exact locations of variables in memory are decided by the compiler and OS together, but the general region where an object lives is determined by its lifetime. Think about how long each of our variables needs to stick around before the memory can be reused:

The global variables, global_ch and const_global_ch, need to be around for the entire runtime of the program, as the program could reference them anywhere in the code. This is called a static lifetime.
The local variable, local_ch, needs to stick around until it goes out of scope, which happens when the execution reaches the closing curly brace of f(). After it's out of scope, no code in the program can refer to the variable, so it is fine to reuse its memory. This is called an automatic lifetime. Local variables and function arguments have automatic lifetimes.

But what if a function needs to create an object whose size is not known at the start of the program (so it can't be global) and which also needs to survive beyond the end of the function? In this common situation, neither a static lifetime nor an automatic lifetime are appropriate.

Dynamically Allocated Objects #

What if you have a variable that you create inside a function, but you want to keep it around after the function returns? For example, what if I want to define a character inside f() in mexplore-with-dynamic.c and then print it from main()? I could make it a global variable, but when writing the program, I may not know exactly how many characters I'm going to need.

For this purpose, the C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.

To set aside memory, you use the malloc() standard library function (memory allocate). malloc() takes only one argument, which is the number of bytes you're asking for, and it returns the address of the newly allocated memory (i.e., the address where the OS has set aside memory boxes for this object). For example,

char* allocated_ch = malloc(1);

reserves 1 byte of memory and stores the address of that bytes memory "box" in the variable allocated_ch. The char* in brackets is a cast of the pointer returned to the type we expect (a pointer to a char); this is needed because malloc() itself does not have any idea what kind of object you're allocating, so we need to tell the compiler.

Why is there a (char*) in front of malloc()?

The char* in brackets is a cast of the pointer returned to the type we expect (a pointer to a char); this is needed because malloc() itself does not have any idea what kind of object you're allocating: it just asks the OS for some number of bytes and gives you the address of the first one. Its return type is void*, which means "a pointer to something unspecified". If we just leave it like that and try to assign a void* to variable of type char*, the compiler will print a warning!

To actually set the value inside a dynamic lifetime object, we have to dereference the pointer as usual. For example, in our character example in mexplore-with-dynamic.c, we dereference the pointer we got from malloc() and put our character ('D') into it by assigning a value to the dereferenced pointer:

void f() {
    [...]

    char* allocated_ch = malloc(1);

    *allocated_ch = 'D';
// |-------------|   ^
// deref'd pointer   | value we assign
//     = char        | (also a char)
}

Similarities and differences with Java

Calls to malloc() may look clunky, but they effectively do the same thing as the new keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides a new keyword that, under the hood, invokes malloc(). One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.

The big upside of dynamic-lifetime objects is that we can decide at runtime how big they need to be, and that they can outlive the function that creates them. Consider a string that takes the characters a user typed into the program – a quantity that's hard to predict correctly, and data that we certainly want to outlive the function that reads the input! The big downside of dynamic-lifetime objects is that it's the programmer's responsibility to free the memory allocated. You to this by calling the free() function with the address of the allocated boxes as an argument. For example, free(allocated_ch); will free the memory we asked malloc() to set aside for allocated_ch.

Incorrect use of dynamic lifetimes is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature.

Memory Segments #

Objects with different lifetimes are grouped into different regions in memory. The program code, global variables, and constant global variables are all stored in static segments, as these all have static lifetimes and known sizes at compile time.

Other objects are come and go, and therefore the memory regions that contain them grow and shrink. For example, as functions call each other, they create more and more local variables with automatic lifetimes; and as the program calls malloc() to reserve memory for objects with dynamic lifetimes, more memory is needed for these objects.

If we look at the addresses printed by mexplore, we see that the global variables stored in are at relatively low memory addresses (around 0x60'0000 hexadecimal and 0x40'0000 hexadecimal), while the local variable is stored at a high address, close to 0x7fff'ffff'ffff (about 2⁴⁷), and the dynamically allocated character is stored in between (albeit closer to the static segments).

There is a reason for this placement: it allows both types of segments to grow without risk of getting in each other's way. In particular, the automatic-lifetime segment grows downwards as more local variables come into scope, while the dynamic-lifetime segment grows upwards. If they ever meet, your computer runs out of memory.

The size of the dynamic segment changes as the program asks for memory and frees it up again. Moreover, the used memory in the dynamic segment is not necessarily contiguous. Consider a program that allocates four chars, c1 to c4 and then frees the third one: assuming the chars start at address 0x1a00050, the memory would appear as below.

0x1a000.. ... 50 ... 51 ... 52 ... 53         <- addresses
     ----+------+------+------+------+--
    ...  |  c1  |  c2  | FREE |  c4  |   ...  <- values
     ----+------+------+------+------+--

The gap between c2 and c4 arises because the memory of c3 has been freed by the program, but it has not been reused. In other words, the dynamic segment can have "holes" in it (this is called fragmentation).

Finally, these segments have names:

The part of the static-lifetime segment that holds program called is called the "text segment" (because it holds program text; though the "text" is in computer language); the parts for read-only globals ("rodata") and modifyable globals ("data" and "bss") also have names.
The automatic-lifetime segment is called the stack. It's named so because it grows and shrinks like a stack of paper: when you call a function, its automatic-lifetime objects are added to the left of the existing automatic ones, and when the function returns, the program gives their memory up again.
The dynamic-lifetime segment is called the heap, and it generally grows upwards in terms of memory addresses. But unlike the stack, it can have "holes": if the programmer destroyed an object that sits in between two others in memory, there is unused memory in between. (You can think of this as "air gaps" in a heap of things, maybe!).

The stack and heap terms are important, and you will keep seeing them!

Java Similarities Note

In Java, any object created with the new keyword is allocated with dynamic lifetime and lives on the heap. (Java puts only "primitive" types (int, double etc.) on the stack.) Indeed, Java under the hood uses malloc()! Yet, it seems like Java has automatic lifetime for all objects, as you never need to destroy them explicitly! This works because your program runs inside the Java virtual machine (JVM), which "magically" injects code that tracks whether each object is still reachable via an in-scope variable; if this reference count goes to zero, the JVM automatically calls free() to delete the object. But all this magic is not free in performance terms, as there is code to run to keep track of objects' reference counts. C and C++ instead opt to do nothing and give maximum control to the programmer, for better or worse.

To summarize, the table below lists the memory segments we've learned about, what data they contain, and roughly where they are in terms of memory addresses. (In the OS part of the course, it will turn out that these memory addresses are actually a lie and the OS playing clever tricks on us, but for now let's assume they are the actual memory addresses.)

Object declaration (C program text)	Lifetime	Segment	Example address range (runtime location in x86-64 Linux, non-position-independent)
Constant global	Static	Code (or Text)	0x40'0000 (≈1 × 2²²)
Global	Static	Data	0x60'0000 (≈1.5 × 2²²)
Local	Automatic	Stack	0x7fff'448d'0000 (≈2⁴⁷ = 2²⁵ × 2²²)
Anonymous, returned by `malloc` (C) / `new` (C++)	Dynamic	Heap	0x1a0'0000 (≈8 × 2²²)

Example: Strings of dynamic length #

How do we allocate memory for a string whose length we don't know at compile time? The answer is that we malloc() the right number of bytes in the dynamic lifetime segment. This how allocated_st gets the memory to back it. Could we just assign a string into allocated_st? No, because the type of allocated_st is char*, a pointer to a char (specifically, to the first character in the string).

char* allocated_st = malloc(100); // get 100 bytes of memory for string

*allocated_st = "C programming is cool"; // WRONG! assigns byte from address of static string into char
// ^               ^
// | char          | static string: type char[] (which you can treat as char*), at address in static segment

allocated_st = "C programming is cool";  // WRONG! assigns address of static string to pointer, forgets about 100 bytes allocated
// ^              ^
// | char*        | still a static string: type char[] (which you can treat as char*)

Uninitialized Memory #

malloc() asks the OS for some memory in the dynamic segment, and returns a pointer to the first byte of the newly-allocated memory. But what are the contents of that memory?

Let's look at mexplore-uninitialized to find out. This program does the following:

it twice allocates 100 bytes of dynamic lifetime memory;
it then writes two strings into this memory and prints them using hexdump();
it then frees the memory using free(), so it is now unallocated;
finally, the program asks for 100 bytes again, perhaps for another string, and prints the contents of that memory.

What do you expect the contents to be? Some reasonable expectations would be for the memory to contain all zeros, for it to be set to all ones (0xff) or some other special value, or for it to contain random data.

The truth is that all of the above can happen! The contents of uninitialized memory are undefined in the C language standard. Therefore, all sorts of behavior is acceptable: the compiler could generate computer code to write zeros over newly-allocated memory (some compilers do so in certain situations), or it could just return the address of the first byte of new memory from malloc() and leave the contents at whatever they were before.

What we see today is the latter behavior: the contents of the new memory are whatever they were before, and in our example, we see some fragments of the previous strings in there. This is both awesome and dangerous! It's awesome because our programs are super fast: the computer needs to do no work for new memory and just uses it as-is. But it's hugely dangerous because it's very easy to leak secrets via uninitialized memory. Imagine you stored your SSN and credit card details in the first string, freed it up using free(), and then some other part of the program, perhaps under the control of an evil person, asks for memory and gets these bytes. This kind of thing happens routinely in the real world, and it's a huge source of computer security problems.

How do you avoid leaking secrets through uninitialized memory?

To avoid accidentally leaking information from your program, you as the programmer have to overwrite the memory before you call free(). This is very common in applications that deal with cryptographic information and passwords: before they give up memory using free(), they write zeros all over that memory.

Summary #

Today, we explored variable lifetimes and the different memory segments of a program on your computer for different lifetimes of variables in C: the static segment (with static lifetime), the stack (with automatic lifetime), and the heap (with dynamic, programmer-managed liftime). We looked at examples of dynamic lifetime memory (heap-allocated chars and strings) and learned what the contents of newly allocated, uninitialized memory are (and can be!).