Lecture 4: Memory Organization, Dynamic Memory Allocation, and Undefined Behavior #
More on Strings: the null terminator #
Last lecture, we started looking at strings in the program
mexplore-string.c
. This program defines a string with the type
char[]
. char
is the name for a byte type in the C language; it
refers to the fact that a byte is exactly sufficient to store one
character according to the ASCII standard, a way of translating numbers
into characters and vice versa. Computers can only store numbers, so all
characters in a computer are actually "encoded" as numbers. For example,
the uppercase letter "A" in ASCII corresponds to the number 65 (see
man ascii
for the translation table).
A string in C is simple a sequence of bytes, each represented as a
char
. So, to look at the string in memory, we need to change our call
to hexdump()
to print the right number of bytes. Our two strings
contain 12 and 13 characters, so that's 12 and 13 bytes. And indeed,
./mexplore-string
prints those those bytes if you pass 12
and 13
to hexdump()
. But what if we pass a larger number? Turns out we just
get whatever is in memory next to the string. For global_st
in our
example, that's a bunch of garbage. If we keep going long enough (e.g.,
20,000 bytes), the program actually crashes with a Segmentation fault
(SEGV). You can probably guess now what that error is about: the program
tried to access memory outside of a valid segment, and the OS
terminated it to keep things safe.
But how does a program know when a string ends? We may not always know
the length at compile time, as we do in these examples. It turns out
that there's a bonus NUL (0x0
, sometimes written \0
) character
hanging around at the end of every string. The NUL byte is a
terminator that has the special meaning "end-of-string". Every C
string must have a NUL byte at its end. (Forgetting the NUL byte is a
huge source of bugs in the real world and in your assignments. Watch out
for it; this matters for Lab 1 and Project 1!) So, our strings are
really 13 bytes long, not 12.
Exploring Data Representation In Memory #
We already understand how programs and their data are just bytes in memory, but now let's look in more detail at how data is represented in memory.
Why are we covering this?
We're now building an understanding of where the different parts of a C/C++ program (and, in fact, programs in other languages too!) are stored in memory. This will help you understand how a program obtains and manages memory, something that some programming languages (e.g., Java, OCaml, Pyret) do automatically behind your back, while others, and particularly systems programming languages like C and C++, force you as the programmer do some of this memory management. This gives you a great degree of control, and allows avoiding expensive hidden memory allocation and copying costs.
For this, we will use the program mexplore.c
. This program declares
and defines (recall the difference!) three variables, all of type
char
. What's the difference between these three char
variables?
Let's take a look. The first one, global_ch
is defined in what we call
global scope: it's at the top level of the file, not inside a function
or inside the curly braces ({ }
) that C and C++ use to delineate
scopes in the program. This variable can be referred to from anywhere in
the entire program.
The second variable, const_global_ch
is also a global variable, but
the const
keyword indicates that it is constant and the compiler and
OS should not allow modifications to it.
Finally, our third variable is inside function f()
. It's called
local_ch
and is a local variable. It's valid only within the scope
of f()
and other parts of the program (such as main
) cannot refer to
it.
The hexdump()
function that f()
calls is defined in hexdump.c
and
imported via hexdump.h
, a "header file". (In a future lecture, we'll
talk about why header files exist.) hexdump(ADDR, N)
has the effect of
printing the contents of N
bytes of memory at address ADDR
. We're
passing our character variables to it, but prefix the variable with an
ampersand character, &
. So: hexdump(&global_ch, 1)
means "print 1
byte from a box located at the address of global_ch
".
This is an important concept of the C language: you can always get the memory address at which an object is located. The term "object" here means something different from what it means in an object-oriented language like Java: rather than an instance of a class, an "object" according to the C standard is a set of bytes that contain a value. This can be code (a function) or data (a variable).
In other words, local
and ptr
in the snippet below refer to the same
object, i.e., to the same bytes of memory:
void f() {
int local = 1;
int* ptr = &local;
}
Reacll that &local
means "the address of local
", and the value
stored in the memory location where ptr
is located is the 8 bytes that
make up the address of local
. The type of ptr
is int*
, which
signifies that it is the address of an integer in memory (a short*
would be the address of a short
, a char*
the address of a char
,
etc.).
You can invert the ampersand operator using the asterisk (*
) operator:
*ptr
dereferences the pointer and turns the address back into a
value. In other words, *&local
is the same as plain, old local
.
Thee concepts are very, very important – you'll use them all the
time!
Back to our mexplore.c
program though. Let's look at what it prints
when we run it (note that the specific addresses will be different on
your computer).
$ ./mexplore
00601038 41 |A|
004009a4 42 |B|
7ffd4977e80f 43 |C|
On the left, we see the addresses of our three variables, printed in
hexadecimal notation. Next, just to the right of that, we see the
hexadecimal value of the data stored in the byte at each of these
addresses. For example, hexadecimal 41 (often written 0x41
for
clarity) is equal to ... 16*4 + 1 = 65! Not surprisingly, this equals
to the ASCII character "A", which we see on the right.
But let me draw your attention to the addresses on the left. They vary a fair amount! The exact locations of variables in memory are decided by the compiler and OS together, but the general region where an object lives is determined by its lifetime. Think about how long each of our variables needs to stick around before the memory can be reused:
- The global variables,
global_ch
andconst_global_ch
, need to be around for the entire runtime of the program, as the program could reference them anywhere in the code. This is called a static lifetime. - The local variable,
local_ch
, needs to stick around until it goes out of scope, which happens when the execution reaches the closing curly brace off()
. After it's out of scope, no code in the program can refer to the variable, so it is fine to reuse its memory. This is called an automatic lifetime. Local variables and function arguments have automatic lifetimes.
But what if a function needs to create an object whose size is not known at the start of the program (so it can't be global) and which also needs to survive beyond the end of the function? In this common situation, neither a static lifetime nor an automatic lifetime are appropriate.
Dynamically Allocated Objects #
What if you have a variable that you create inside a function, but you
want to keep it around after the function returns? For example, what if
I want to define a character inside f()
in mexplore-with-dynamic.c
and then print it from main()
? I could make it a global variable, but
when writing the program, I may not know exactly how many characters I'm
going to need.
For this purpose, the C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.
To set aside memory, you use the malloc()
standard library function
(memory allocate). malloc()
takes only one argument,
which is the number of bytes you're asking for, and it returns the
address of the newly allocated memory (i.e., the address where the OS
has set aside memory boxes for this object). For example,
char* allocated_ch = malloc(1);
reserves 1 byte of memory and stores the address of that bytes memory
"box" in the variable allocated_ch
. The char*
in brackets is a
cast of the pointer returned to the type we expect (a pointer to a
char
); this is needed because malloc()
itself does not have any idea
what kind of object you're allocating, so we need to tell the compiler.
Why is there a
(char*)
in front ofmalloc()
?The
char*
in brackets is a cast of the pointer returned to the type we expect (a pointer to achar
); this is needed becausemalloc()
itself does not have any idea what kind of object you're allocating: it just asks the OS for some number of bytes and gives you the address of the first one. Its return type isvoid*
, which means "a pointer to something unspecified". If we just leave it like that and try to assign avoid*
to variable of typechar*
, the compiler will print a warning!
To actually set the value inside a dynamic lifetime object, we have to
dereference the pointer as usual. For example, in our character example
in mexplore-with-dynamic.c
, we dereference the pointer we got from
malloc()
and put our character ('D'
) into it by assigning a value to
the dereferenced pointer:
void f() {
[...]
char* allocated_ch = malloc(1);
*allocated_ch = 'D';
// |-------------| ^
// deref'd pointer | value we assign
// = char | (also a char)
}
Similarities and differences with Java
Calls to
malloc()
may look clunky, but they effectively do the same thing as thenew
keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides anew
keyword that, under the hood, invokesmalloc()
. One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.
The big upside of dynamic-lifetime objects is that we can decide at
runtime how big they need to be, and that they can outlive the function
that creates them. Consider a string that takes the characters a user
typed into the program – a quantity that's hard to predict correctly,
and data that we certainly want to outlive the function that reads the
input! The big downside of dynamic-lifetime objects is that it's the
programmer's responsibility to free the memory allocated. You to this by
calling the free()
function with the address of the allocated boxes as
an argument. For example, free(allocated_ch);
will free the memory we
asked malloc()
to set aside for allocated_ch
.
Incorrect use of dynamic lifetimes is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature.
Memory Segments #
Objects with different lifetimes are grouped into different regions in memory. The program code, global variables, and constant global variables are all stored in static segments, as these all have static lifetimes and known sizes at compile time.
Other objects are come and go, and therefore the memory regions that
contain them grow and shrink. For example, as functions call each other,
they create more and more local variables with automatic lifetimes; and
as the program calls malloc()
to reserve memory for objects with
dynamic lifetimes, more memory is needed for these objects.
If we look at the addresses printed by mexplore
, we see that the
global variables stored in are at relatively low memory addresses
(around 0x60'0000
hexadecimal and 0x40'0000
hexadecimal), while the
local variable is stored at a high address, close to 0x7fff'ffff'ffff
(about 247), and the dynamically allocated character is
stored in between (albeit closer to the static segments).
There is a reason for this placement: it allows both types of segments to grow without risk of getting in each other's way. In particular, the automatic-lifetime segment grows downwards as more local variables come into scope, while the dynamic-lifetime segment grows upwards. If they ever meet, your computer runs out of memory.
The size of the dynamic segment changes as the program asks for memory
and frees it up again. Moreover, the used memory in the dynamic segment
is not necessarily contiguous. Consider a program that allocates four
char
s, c1
to c4
and then frees the third one: assuming the char
s
start at address 0x1a00050
, the memory would appear as below.
0x1a000.. ... 50 ... 51 ... 52 ... 53 <- addresses
----+------+------+------+------+--
... | c1 | c2 | FREE | c4 | ... <- values
----+------+------+------+------+--
The gap between c2
and c4
arises because the memory of c3
has been
freed by the program, but it has not been reused. In other words, the
dynamic segment can have "holes" in it (this is called fragmentation).
Finally, these segments have names:
- The part of the static-lifetime segment that holds program called is called the "text segment" (because it holds program text; though the "text" is in computer language); the parts for read-only globals ("rodata") and modifyable globals ("data" and "bss") also have names.
- The automatic-lifetime segment is called the stack. It's named so because it grows and shrinks like a stack of paper: when you call a function, its automatic-lifetime objects are added to the left of the existing automatic ones, and when the function returns, the program gives their memory up again.
- The dynamic-lifetime segment is called the heap, and it generally grows upwards in terms of memory addresses. But unlike the stack, it can have "holes": if the programmer destroyed an object that sits in between two others in memory, there is unused memory in between. (You can think of this as "air gaps" in a heap of things, maybe!).
The stack and heap terms are important, and you will keep seeing them!
Java Similarities Note
In Java, any object created with the
new
keyword is allocated with dynamic lifetime and lives on the heap. (Java puts only "primitive" types (int
,double
etc.) on the stack.) Indeed, Java under the hood usesmalloc()
! Yet, it seems like Java has automatic lifetime for all objects, as you never need to destroy them explicitly! This works because your program runs inside the Java virtual machine (JVM), which "magically" injects code that tracks whether each object is still reachable via an in-scope variable; if this reference count goes to zero, the JVM automatically callsfree()
to delete the object. But all this magic is not free in performance terms, as there is code to run to keep track of objects' reference counts. C and C++ instead opt to do nothing and give maximum control to the programmer, for better or worse.
To summarize, the table below lists the memory segments we've learned about, what data they contain, and roughly where they are in terms of memory addresses. (In the OS part of the course, it will turn out that these memory addresses are actually a lie and the OS playing clever tricks on us, but for now let's assume they are the actual memory addresses.)
Object declaration |
Lifetime |
Segment |
Example address range |
---|---|---|---|
Constant global |
Static |
Code (or Text) |
0x40'0000 (≈1 × 222) |
Global |
Static |
Data |
0x60'0000 (≈1.5 × 222) |
Local |
Automatic |
Stack |
0x7fff'448d'0000 (≈247 = 225 × 222) |
Anonymous, returned by |
Dynamic |
Heap |
0x1a0'0000 (≈8 × 222) |
Example: Strings of dynamic length #
How do we allocate memory for a string whose length we don't know at
compile time? The answer is that we malloc()
the right number of bytes
in the dynamic lifetime segment. This how allocated_st
gets the memory
to back it. Could we just assign a string into allocated_st
? No,
because the type of allocated_st
is char*
, a pointer to a char
(specifically, to the first character in the string).
char* allocated_st = malloc(100); // get 100 bytes of memory for string
*allocated_st = "C programming is cool"; // WRONG! assigns byte from address of static string into char
// ^ ^
// | char | static string: type char[] (which you can treat as char*), at address in static segment
allocated_st = "C programming is cool"; // WRONG! assigns address of static string to pointer, forgets about 100 bytes allocated
// ^ ^
// | char* | still a static string: type char[] (which you can treat as char*)
Uninitialized Memory #
malloc()
asks the OS for some memory in the dynamic segment, and
returns a pointer to the first byte of the newly-allocated memory. But
what are the contents of that memory?
Let's look at mexplore-uninitialized
to find out. This program does
the following:
- it twice allocates 100 bytes of dynamic lifetime memory;
- it then writes two strings into this memory and prints them using
hexdump()
; - it then frees the memory using
free()
, so it is now unallocated; - finally, the program asks for 100 bytes again, perhaps for another string, and prints the contents of that memory.
What do you expect the contents to be? Some reasonable expectations
would be for the memory to contain all zeros, for it to be set to all
ones (0xff
) or some other special value, or for it to contain random
data.
The truth is that all of the above can happen! The contents of
uninitialized memory are undefined in the C language standard.
Therefore, all sorts of behavior is acceptable: the compiler could
generate computer code to write zeros over newly-allocated memory (some
compilers do so in certain situations), or it could just return the
address of the first byte of new memory from malloc()
and leave the
contents at whatever they were before.
What we see today is the latter behavior: the contents of the new memory
are whatever they were before, and in our example, we see some fragments
of the previous strings in there. This is both awesome and dangerous!
It's awesome because our programs are super fast: the computer needs to
do no work for new memory and just uses it as-is. But it's hugely
dangerous because it's very easy to leak secrets via uninitialized
memory. Imagine you stored your SSN and credit card details in the first
string, freed it up using free()
, and then some other part of the
program, perhaps under the control of an evil person, asks for memory
and gets these bytes. This kind of thing happens routinely in the real
world, and it's a huge source of computer security problems.
How do you avoid leaking secrets through uninitialized memory?
To avoid accidentally leaking information from your program, you as the programmer have to overwrite the memory before you call
free()
. This is very common in applications that deal with cryptographic information and passwords: before they give up memory usingfree()
, they write zeros all over that memory.
Summary #
Today, we explored variable lifetimes and the different memory segments of a program on your computer for different lifetimes of variables in C: the static segment (with static lifetime), the stack (with automatic lifetime), and the heap (with dynamic, programmer-managed liftime). We looked at examples of dynamic lifetime memory (heap-allocated chars and strings) and learned what the contents of newly allocated, uninitialized memory are (and can be!).