Lecture 4: Strings, dynamic memory allocation, and undefined behavior
» Lecture code
» Post-Lecture Quiz (due 6pm Wednesday, February 4).
Strings!
So far, we've looked at the memory for one character. Let's look at strings of characters (mexplore-string.c)
instead, since you'll need to work with strings for Lab 1 and Project 1.
A string in C is simply a sequence of bytes, each represented as a char. So, to look at the string in
memory, we need to change our call to hexdump() to print the right number of bytes. Our two strings contain
12 and 13 characters, so that's 12 and 13 bytes. And indeed, ./mexplore-string prints those bytes if
you pass 12 and 13 to hexdump(). But what if we pass a larger number? It turns out we
just get whatever is in memory next to the string. For global_st in our example, that's a bunch of garbage.
If we keep going long enough (e.g., 20,000 bytes), the program actually crashes with a Segmentation fault (SEGV).
You can probably guess now what that error is about: the program tried to access memory outside of a valid
segment, and the OS terminated it to keep things safe.
But how does a program know when a string ends? We may not always know the length at compile time, as we do in these
examples. It turns out that there's a bonus NUL (0x0, sometimes written \0) character hanging
around at the end of every string. The NUL byte is a terminator that has the special meaning
"end-of-string". Every C string must have a NUL byte at its end. (Forgetting the NUL byte is a huge
source of bugs in the real world and in your assignments. Watch out for it; this matters for Lab 1 and Project 1!) So,
our strings are really 13 and 14 bytes long, not 12 and 13.
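To make the role of the NUL byte concrete, here is a minimal sketch of how code can find the end of a string without being told its length. The helper names count_chars and bytes_needed are made up for this illustration and are not part of the lecture code.

```c
#include <stddef.h>

// Hypothetical helper (not from the lecture code): counts the characters in
// a string by walking forward one byte at a time until it finds the NUL byte.
size_t count_chars(const char* st) {
    size_t n = 0;
    while (st[n] != '\0') {   // st[n] is the byte n positions after the first
        n++;
    }
    return n;                 // the NUL byte itself is not counted
}

// Hypothetical helper: bytes needed to store the string in memory,
// i.e., all of its characters plus one byte for the NUL terminator.
size_t bytes_needed(const char* st) {
    return count_chars(st) + 1;
}
```

For a 12-character string, count_chars() returns 12 and bytes_needed() returns 13. If the NUL byte were missing, the loop would keep reading whatever happens to be in memory next to the string.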
How do we allocate memory for a string whose length we don't know at compile time? The answer is that we
malloc() the right number of bytes in the dynamic lifetime segment. This is how allocated_st
gets the memory to back it. Could we just assign a string into allocated_st? No, because the type of
allocated_st is char*, a pointer to a char (specifically, to the first character in the string).
char* allocated_st = malloc(100); // get 100 bytes of memory for string
*allocated_st = "C programming is cool"; // WRONG! assigns byte from address of static string into char
// left side (*allocated_st): type char
// right side: static string, type char[] (which you can treat as char*), at address in static segment
allocated_st = "C programming is cool"; // WRONG! assigns address of static string to pointer, forgets about 100 bytes allocated
// left side (allocated_st): type char*
// right side: still a static string, type char[] (which you can treat as char*)
So how do we correctly copy a string into another one? The answer is that we need to copy each individual byte of
the string, using a loop! (The example below uses a for loop, but we could also use a while loop; try thinking
about how that would work.)
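char* allocated_st = malloc(100); // get 100 bytes of memory for string
char* temp_st = "C programming is cool";
size_t len = strlen(temp_st); // number of characters, not counting the NUL byte (strlen() is declared in <string.h>)
for (size_t i = 0; i <= len; i++) { // <= so that the NUL byte at position len gets copied too
    *(allocated_st + i) = *(temp_st + i); // Correct, using for loop that assigns each byte separately
    // allocated_st is a char*, i is an integer, and *(allocated_st + i) is a char
}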
Note that we're doing math on addresses here: allocated_st + i adds i to the address stored in allocated_st.
In terms of our post office box metaphor, it directs us i mailboxes further to the right. This kind of math on
addresses is called pointer arithmetic. C and C++ are languages
that allow programmers to use this powerful, but dangerous, concept.
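As a small sketch of pointer arithmetic on its own, the hypothetical helper below (char_at is a made-up name, not from the lecture code) computes the address i boxes to the right of the start of a string and reads the char stored there.

```c
#include <stddef.h>

// Hypothetical helper: returns the character i positions after the first
// character of st. The caller must make sure i does not go past the NUL byte.
char char_at(const char* st, size_t i) {
    const char* p = st + i;   // pointer arithmetic: address of box number i
    return *p;                // dereference: read the char stored in that box
}
```

C also lets you write st[i], which means exactly the same thing as *(st + i). Nothing checks that i stays inside the string, which is what makes this concept dangerous.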
But writing a for loop every time we want to copy a string seems quite painful! Fortunately, we can rely
on loops that other people have written for us and use standard library functions to achieve the same thing:
char* allocated_st1 = malloc(100);
sprintf(allocated_st1, "C programming is cool"); // <== Set string, using standard library function
char* allocated_st2 = malloc(100);
char* temp_st = "C programming is still cool";
strcpy(allocated_st2, temp_st); // <== Copy string, using standard library function
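Neither sprintf() nor strcpy() knows how large the destination is, so both will happily write past the 100 bytes we allocated if the source string is longer. One possible way to guard against this, sketched below, is the standard snprintf() function, which takes the destination size as an argument. The wrapper name set_string is an assumption made for this example.

```c
#include <stdio.h>
#include <stddef.h>

// Hypothetical helper: copies src into dst without ever writing more than
// dst_size bytes (including the NUL byte). Returns 1 if the whole string
// fit and 0 if it had to be cut short or an error occurred.
int set_string(char* dst, size_t dst_size, const char* src) {
    // "%s" prints src as plain text, even if src itself contains '%'.
    // snprintf() returns the length the full string would have needed.
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With a 100-byte buffer from malloc(100), a call such as set_string(allocated_st1, 100, "C programming is cool") stores the string and returns 1. With a buffer that is too small, the string is truncated but still NUL-terminated.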
In Project 1, you will implement a function quite similar to strcpy! The ideas from this section will help
you a lot, so come back to read up on them if you're stuck.
Addendum on literal strings
There was some brief confusion over why I could not modify characters in a literal string. It turns out that our
compilers make literal strings (those strings written directly into the program code in between double quotes,
like "We <3 systems") read-only to prevent us from accidentally writing over adjacent bytes in
the static-lifetime segment of memory. This explains why I received a segmentation fault when trying to write to such a
literal string. If I instead allocate the string's memory in the dynamic-lifetime segment, the code works:
char* allocated_st = (char*)malloc(100);
sprintf(allocated_st, "We <3 systems");
// This works!
for (int i = 0; i < 13; i++) {
if (*(allocated_st + i) == '<') {
*(allocated_st + i) = 'E';
*(allocated_st + i + 1) = '>';
}
}
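Another way to get a writable copy, sketched here under the assumption that the string is short enough to live in a local variable, is to initialize a char array from the literal. The helper replace_char is a hypothetical name used only for this illustration.

```c
#include <stddef.h>

// Hypothetical helper: replaces every occurrence of `from` with `to` in a
// writable, NUL-terminated string and returns how many bytes it changed.
// It must not be called on a literal string, because that memory is read-only.
size_t replace_char(char* st, char from, char to) {
    size_t changed = 0;
    for (size_t i = 0; st[i] != '\0'; i++) {   // stop at the NUL terminator
        if (st[i] == from) {
            st[i] = to;
            changed++;
        }
    }
    return changed;
}
```

Declaring char local_st[] = "We <3 systems"; copies the 13 characters and the NUL byte into a fresh 14-byte array, so replace_char(local_st, '<', 'E') is fine. Declaring char* st = "We <3 systems"; only stores the address of the read-only literal, and writing through it is the mistake described above.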
Uninitialized Memory
malloc() asks the OS for some memory in the dynamic segment, and returns a pointer to the first byte
of the newly-allocated memory. But what are the contents of that memory?
Let's look at mexplore-uninitialized to find out. This program does the following:
- it twice allocates 100 bytes of dynamic lifetime memory;
- it then writes two strings into this memory and prints them using hexdump();
- it then frees the memory using free(), so it is now unallocated;
- finally, the program asks for 100 bytes again, perhaps for another string, and prints the contents of that memory.
We might expect the new memory to contain all zeros, all ones (0xff) or some other special value, or for it to contain random data.
The truth is that all of the above can happen! The contents of uninitialized memory are undefined
in the C language standard. Therefore, all sorts of behavior are acceptable: the compiler could generate computer
code to write zeros over newly-allocated memory (some compilers do so in certain situations), or it could just return the
address of the first byte of new memory from malloc() and leave the contents at whatever they were before.
What we see today is the latter behavior: the contents of the new memory are whatever they were before, and in our
example, we see some fragments of the previous strings in there. This is both awesome and dangerous! It's awesome because
our programs are super fast: the computer needs to do no work for new memory and just uses it as-is. But it's hugely
dangerous because it's very easy to leak secrets via uninitialized memory. Imagine you stored your SSN and credit card
details in the first string, freed it up using free(), and then some other part of the program, perhaps
under the control of an evil person, asks for memory and gets these bytes. This kind of thing happens routinely in the
real world, and it's a huge source of computer security problems.
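If a program needs newly allocated memory to start out with known contents, it has to arrange that itself. The sketch below shows two options from the standard library: calloc(), which returns zeroed memory, and malloc() followed by memset(). The wrapper names alloc_zeroed and alloc_filled are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: allocates `size` bytes that are all zero.
// Returns NULL if the allocation fails.
char* alloc_zeroed(size_t size) {
    return calloc(size, 1);   // `size` elements of 1 byte each, all set to 0
}

// Hypothetical helper: allocates `size` bytes and sets each one to `value`.
// Returns NULL if the allocation fails.
char* alloc_filled(size_t size, unsigned char value) {
    char* mem = malloc(size);
    if (mem != NULL) {
        memset(mem, value, size);   // overwrite whatever was there before
    }
    return mem;
}
```

Memory from alloc_zeroed() also happens to be a valid empty string, because its first byte is already the NUL terminator. The price is the extra work of writing every byte, which plain malloc() skips.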
How do you avoid leaking secrets through uninitialized memory?
To avoid accidentally leaking information from your program, you as the programmer have to overwrite the memory before you call free(). This is very common in applications that deal with cryptographic information and passwords: before they give up memory using free(), they write zeros all over that memory.
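A minimal sketch of this pattern follows. The names scrub and scrub_and_free are hypothetical. One caveat worth knowing: an optimizing compiler may remove an ordinary memset() placed right before free(), since the program never reads that memory again. Writing through a volatile pointer, as done here, is one commonly used way to discourage that; some platforms also offer dedicated functions for this purpose.

```c
#include <stdlib.h>

// Hypothetical helper: overwrites `size` bytes starting at `ptr` with zeros.
// The volatile qualifier tells the compiler that each write matters, so it
// should not skip them as "useless" stores.
void scrub(void* ptr, size_t size) {
    volatile unsigned char* p = ptr;
    for (size_t i = 0; i < size; i++) {
        p[i] = 0;
    }
}

// Hypothetical helper: scrubs a malloc()ed block and then frees it.
// The caller must pass the size, because free() does not report it.
void scrub_and_free(void* ptr, size_t size) {
    if (ptr == NULL) {
        return;
    }
    scrub(ptr, size);
    free(ptr);
}
```

A program holding a password in a 100-byte block from malloc(100) would call scrub_and_free(password, 100) instead of free(password), so a later allocation that reuses the same bytes sees only zeros.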
Summary
Today, we understood how C strings are represented as sequences of bytes terminated with a NUL byte, and how easy it is to accidentally go past their end. We also looked at another example of dynamic lifetime memory (heap-allocated strings) and learned what the contents of newly allocated, uninitialized memory are (and can be!).
Finally, we had a first taste of pointer arithmetic; we'll talk about this again next time!
You now know everything you'll need to complete Lab 1, and nearly everything you'll need for the strings portion of Project 1.