CSCI 0300/1310: Fundamentals of Computer Systems

⚠️ This is not the current iteration of the course! Head here for the current offering.

Lecture 4: Strings, dynamic memory allocation, and undefined behavior

» Lecture code
» Post-Lecture Quiz (due 6pm Wednesday, February 4).

Strings!

So far, we've looked at the memory for one character. Let's look at strings of characters (mexplore-string.c) instead, since you'll need to work with strings for Lab 1 and Project 1.

A string in C is simply a sequence of bytes, each represented as a char. So, to look at the string in memory, we need to change our call to hexdump() to print the right number of bytes. Our two strings contain 12 and 13 characters, so that's 12 and 13 bytes. And indeed, ./mexplore-string prints those bytes if you pass 12 and 13 to hexdump(). But what if we pass a larger number? Turns out we just get whatever is in memory next to the string. For global_st in our example, that's a bunch of garbage. If we keep going long enough (e.g., 20,000 bytes), the program actually crashes with a Segmentation fault (SEGV). You can probably guess now what that error is about: the program tried to access memory outside of a valid segment, and the OS terminated it to keep things safe.

But how does a program know when a string ends? We may not always know the length at compile time, as we do in these examples. It turns out that there's a bonus NUL (0x0, sometimes written \0) character hanging around at the end of every string. The NUL byte is a terminator that has the special meaning "end-of-string". Every C string must have a NUL byte at its end. (Forgetting the NUL byte is a huge source of bugs in the real world and in your assignments. Watch out for it; this matters for Lab 1 and Project 1!) So a string with 12 characters really takes up 13 bytes in memory, not 12.
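To make the terminator concrete, here is a minimal sketch of how a program can find the end of a string by walking forward until it hits the NUL byte. The helper names count_until_nul and bytes_needed are made up for this illustration and are not part of the lecture code; the standard library's strlen() does the same counting job.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not from the lecture code): counts the characters
// that come before the NUL terminator, much like the standard strlen().
size_t count_until_nul(const char* s) {
    assert(s != NULL);
    size_t n = 0;
    while (*(s + n) != '\0') {   // stop as soon as we see the terminator
        n++;
    }
    return n;
}

// Hypothetical helper: bytes needed to store the string in memory,
// which is one more than its character count because of the NUL byte.
size_t bytes_needed(const char* s) {
    return count_until_nul(s) + 1;
}
```

For example, count_until_nul("systems") counts 7 characters, while bytes_needed("systems") reports 8 bytes, which is the number you would need to allocate.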

How do we allocate memory for a string whose length we don't know at compile time? The answer is that we malloc() the right number of bytes in the dynamic lifetime segment. This is how allocated_st gets the memory to back it. Could we just assign a string into allocated_st? No, because the type of allocated_st is char*, a pointer to a char (specifically, to the first character in the string).

char* allocated_st = malloc(100); // get 100 bytes of memory for string

*allocated_st = "C programming is cool"; // WRONG! assigns byte from address of static string into char
// ^               ^
// | char          | static string: type char[] (which you can treat as char*), at address in static segment

allocated_st = "C programming is cool";  // WRONG! assigns address of static string to pointer, forgets about 100 bytes allocated
// ^              ^
// | char*        | still a static string: type char[] (which you can treat as char*)

So how do we correctly copy a string into another one? The answer is that we need to copy each individual byte of the string, using a loop! (The example below uses a for loop, but we could also use a while loop; try thinking about how that would work.)

char* allocated_st = malloc(100); // get 100 bytes of memory for string
char* temp_st = "C programming is cool";

for (int i = 0; i <= strlen(temp_st); i++) {   // "<=" so that the terminating NUL byte is copied too
  *(allocated_st + i) = *(temp_st + i);   // Correct, using for loop that assigns each byte separately
  //   ^           ^   |---------------|
  //   | char*     |int     *(char*)
}

Note that we're doing math on addresses here: allocated_st + i adds i to the address stored in allocated_st. In terms of our post office box metaphor, it directs us i mailboxes further to the right. This kind of math on addresses is called pointer arithmetic. C and C++ are languages that allow programmers to use this powerful, but dangerous, concept.
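As a small illustration of pointer arithmetic in both directions, the sketch below moves a pointer one mailbox at a time and then subtracts two pointers to find out how far it travelled. The function offset_of is a hypothetical helper written for this note, not something from the lecture code.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper for illustration (not part of the lecture code).
// Returns how many mailboxes to the right of s the first occurrence of c
// sits, or -1 if the string ends first. c is assumed not to be '\0'.
ptrdiff_t offset_of(const char* s, char c) {
    assert(s != NULL);
    const char* p = s;                 // p starts at the first character
    while (*p != '\0' && *p != c) {
        p = p + 1;                     // pointer arithmetic: one byte further
    }
    if (*p != c) {
        return -1;                     // reached the terminator without finding c
    }
    return p - s;                      // subtracting pointers gives the offset
}
```

With the string "We <3 systems", offset_of(s, '<') gives 3, so *(s + 3) is the '<' character. Because a char is one byte, adding 1 to a char* moves exactly one byte.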

But writing a for loop every time we want to copy a string seems quite painful! Fortunately, we can rely on loops that other people have written for us and use standard library functions to achieve the same thing:

char* allocated_st1 = malloc(100);
sprintf(allocated_st1, "C programming is cool"); // <== Set string, using standard library function

char* allocated_st2 = malloc(100);
char* temp_st = "C programming is still cool";
strcpy(allocated_st2, temp_st); // <== Copy string, using standard library function
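The snippets above allocate a fixed 100 bytes, which is more than these strings need and too little for longer ones. When the length is only known at runtime, one option is to measure the string first and allocate exactly enough room, including the NUL byte. This is a sketch under that assumption; copy_to_heap is a hypothetical name, not a function from the lecture code or the standard library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a heap-allocated copy of src, or NULL if
// the allocation fails. The caller is responsible for calling free().
char* copy_to_heap(const char* src) {
    assert(src != NULL);
    size_t len = strlen(src);              // characters, not counting the NUL
    char* dst = (char*)malloc(len + 1);    // +1 leaves room for the terminator
    if (dst == NULL) {
        return NULL;                       // allocation failed; nothing to copy into
    }
    strcpy(dst, src);                      // copies the characters and the NUL byte
    return dst;
}
```

A caller would write something like char* s = copy_to_heap(temp_st); and later free(s); once the string is no longer needed.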

In Project 1, you will implement a function quite similar to strcpy! The ideas from this section will help you a lot, so come back to read up on them if you're stuck.

Addendum on literal strings

There was some brief confusion over why I could not modify characters in a literal string. It turns out that our compilers make literal strings (those strings written directly into the program code in between double quotes, like "We <3 systems") read-only to prevent us from accidentally writing over adjacent bytes in the static-lifetime segment of memory. This explains why I received a segmentation fault when trying to write to such a literal string. If I instead allocate the string's memory in the dynamic-lifetime segment, the code works:

char* allocated_st = (char*)malloc(100);
sprintf(allocated_st, "We <3 systems");

// This works!
for (int i = 0; i < 13; i++) {
  if (*(allocated_st + i) == '<') {
    *(allocated_st + i) = 'E';
    *(allocated_st + i + 1) = '>';
  }
}
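The loop above hard-codes the length 13. A variant that stops at the NUL terminator instead works for any writable string, whether it lives in malloc()'d memory or in a local char array. The sketch below assumes we only want to rewrite the exact pair "<3"; swap_heart is a hypothetical helper name, not lecture code.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: replaces the first "<3" in a writable string with
// "E>". Returns 1 if a replacement happened and 0 otherwise.
// The string must NOT be a literal, since literals are read-only.
int swap_heart(char* s) {
    assert(s != NULL);
    for (size_t i = 0; *(s + i) != '\0'; i++) {
        // Reading s[i + 1] is safe here: s[i] is not the terminator,
        // so the next byte is at worst the NUL byte itself.
        if (*(s + i) == '<' && *(s + i + 1) == '3') {
            *(s + i) = 'E';
            *(s + i + 1) = '>';
            return 1;
        }
    }
    return 0;
}
```

Declaring char buf[] = "We <3 systems"; copies the literal's bytes into a writable array, so swap_heart(buf) can modify it; passing the literal itself would run into the same read-only problem described above.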

Uninitialized Memory

malloc() asks the OS for some memory in the dynamic segment, and returns a pointer to the first byte of the newly-allocated memory. But what are the contents of that memory?

Let's look at mexplore-uninitialized to find out. This program does the following:

it twice allocates 100 bytes of dynamic lifetime memory;
it then writes two strings into this memory and prints them using hexdump();
it then frees the memory using free(), so it is now unallocated;
finally, the program asks for 100 bytes again, perhaps for another string, and prints the contents of that memory.

What do you expect the contents to be? Some reasonable expectations would be for the memory to contain all zeros, for it to be set to all ones (0xff) or some other special value, or for it to contain random data.

The truth is that all of the above can happen! The C language standard does not define the contents of uninitialized memory. Therefore, all sorts of behaviors are acceptable: the compiler could generate code to write zeros over newly-allocated memory (some compilers do so in certain situations), or it could just return the address of the first byte of new memory from malloc() and leave the contents at whatever they were before.

What we see today is the latter behavior: the contents of the new memory are whatever they were before, and in our example, we see some fragments of the previous strings in there. This is both awesome and dangerous! It's awesome because our programs are super fast: the computer needs to do no work for new memory and just uses it as-is. But it's hugely dangerous because it's very easy to leak secrets via uninitialized memory. Imagine you stored your SSN and credit card details in the first string, freed it up using free(), and then some other part of the program, perhaps under the control of an evil person, asks for memory and gets these bytes. This kind of thing happens routinely in the real world, and it's a huge source of computer security problems.
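If a program needs newly allocated memory to start out in a known state, it has to ask for that explicitly. One standard way is calloc(), which returns memory with every byte set to zero, at the cost of the extra work of clearing it. The sketch below wraps it in two hypothetical helpers (alloc_zeroed and is_all_zero are names invented for this note).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical helper: allocates n bytes that are guaranteed to be zero.
// Returns NULL if the allocation fails; the caller must free() the result.
unsigned char* alloc_zeroed(size_t n) {
    return (unsigned char*)calloc(n, 1);   // n elements of 1 byte, all zero
}

// Hypothetical helper: returns 1 if the first n bytes at p are all zero.
int is_all_zero(const unsigned char* p, size_t n) {
    assert(p != NULL);
    for (size_t i = 0; i < n; i++) {
        if (*(p + i) != 0) {
            return 0;                      // found leftover (non-zero) data
        }
    }
    return 1;
}
```

Running is_all_zero() on memory from alloc_zeroed(100) always reports 1, whereas the same check on memory from malloc(100) may go either way, which is exactly the undefined situation described above.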

How do you avoid leaking secrets through uninitialized memory?

To avoid accidentally leaking information from your program, you as the programmer have to overwrite the memory before you call free(). This is very common in applications that deal with cryptographic information and passwords: before they give up memory using free(), they write zeros all over that memory.
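A minimal sketch of this overwrite-then-free pattern follows. One caveat worth knowing: a compiler is allowed to drop a plain memset() right before free() if it can see the memory is never read again, so this version writes through a volatile pointer to make the stores harder to optimize away. The names scrub and scrub_and_free are hypothetical, and real cryptographic code typically relies on a platform-provided secure-zeroing routine instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical helper: overwrites n bytes starting at p with zeros.
// Writing through a volatile pointer discourages the compiler from
// removing the stores as "dead" writes.
void scrub(void* p, size_t n) {
    assert(p != NULL);
    volatile unsigned char* bytes = (volatile unsigned char*)p;
    for (size_t i = 0; i < n; i++) {
        *(bytes + i) = 0;
    }
}

// Hypothetical helper: zeroes the memory and then gives it back.
// n must be the size that was originally requested from malloc().
void scrub_and_free(void* p, size_t n) {
    if (p == NULL) {
        return;            // nothing was allocated, nothing to do
    }
    scrub(p, n);
    free(p);
}
```

A program holding a password in a 100-byte buffer would call scrub_and_free(buffer, 100) instead of free(buffer), so the next caller of malloc() cannot find the old bytes there.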

Summary

Today, we understood how C strings are represented as sequences of bytes terminated with a NUL byte, and how easy it is to accidentally go past their end. We also looked at another example of dynamic lifetime memory (heap-allocated strings) and learned what the contents of newly allocated, uninitialized memory are (and can be!).

Finally, we had a first taster of pointer arithmetic; we'll talk about this again next time!

You now know everything you'll need to complete Lab 1, and nearly everything you'll need for the strings portion of Project 1.