CS 131/CSCI 1310: Fundamentals of Computer Systems

Lecture 3: Strings, Arrays, and Data Representation

» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 6pm on Monday, February 3rd).

Recap: Memory Segments

Why are we covering this?

Understanding the memory layout of your program and the role of different segments will hugely help in debugging larger C and C++ programs later in the course. You'll develop an intuition for how to work with addresses and how to look at memory to see what's contained in it. This is also essential preparation for the operating systems block, where you will see how the OS actually sets up the segments!

Last time, we talked about how data in C programs lives in different memory locations depending on what its lifetime is. The program code, global variables, and constant global variables are all stored in static segments, as these must be around for the whole time the program runs. They all have static lifetimes and known sizes at compile time.

If we look at the addresses printed by mexplore, we see that the global variables are stored at relatively low memory addresses (around 0x60'0000 hexadecimal and 0x40'0000 hexadecimal), while the local variable is stored at a high address, close to 0x7fff'ffff'ffff (about 2⁴⁷).

These segments have names:

The part of the static-lifetime segment that holds program code is called the "text segment" (because it holds program text, though the "text" is in computer language); the parts for read-only globals ("rodata") and modifiable globals ("data" and "bss") also have names.
The automatic-lifetime segment is called the stack. It's named so because it grows and shrinks like a stack of paper: when you call a function, its automatic-lifetime objects are added to the left of the existing automatic ones, and when the function returns, the program gives their memory up again. We'll see more of that today.

We'll investigate the stack some more in later lectures.

Dynamically Allocated Objects

What if you have a variable that you create inside a function, but you want to keep it around after the function returns? For example, what if I want to define a character inside f() in mexplore-with-dynamic.c and then print it from main()? I could make it a global variable, but when writing the program, I may not know exactly how many characters I'm going to need.

For this purpose, the C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.

To set aside memory, you use the malloc() standard library function (memory allocate). malloc() takes only one argument, which is the number of bytes you're asking for, and it returns the address of the newly allocated memory (i.e., the address where the OS has set aside memory boxes for this object). For example,

char* allocated_ch = malloc(1);

reserves 1 byte of memory and stores the address of that byte's memory "box" in the variable allocated_ch.

Interlude: Pointers and How To Use Them

malloc returns an address, and that address gets stored in a local variable. Local variables have their memory boxes in the stack segment, on the right of our memory diagram. We refer to memory locations that hold addresses as pointers, because you can think of them as arrows pointing to other memory boxes.

In terms of C types, a type followed by an asterisk corresponds to a pointer. For example, int* is a pointer to an integer. An int* itself occupies 8 bytes of memory (since it stores an address), and it points to the first byte of a 4-byte sequence of memory boxes that store an int.

To actually get to the value that a pointer points to, you use the * (asterisk) operator on the pointer:

void f() {
    int local = 1;
    int* ptr = &local;

    // prints 1
    printf("value of ptr: %d\n", *ptr);  // <== here, we "dereference" the pointer to get to the value of local

    // prints the address of local, twice
    printf("address of local: %p %p\n", (void*) &local, (void*) ptr); // <== "%p" is a printf format for printing pointers (it expects a void*)
}

For some people, it helps to think of the asterisk operator as "cancelling" out the asterisk on the type: i.e., *(int*) == int.

Some C programmers like to put the asterisk next to the type (int* ptr for int-pointer ptr), others put it with the variable name (int *ptr), because that way it's clear that the dereferenced version of ptr is an int. Use whatever notation makes sense for you!

In our character example in mexplore-with-dynamic.c, we dereference the pointer we got from malloc() and put our character ('D') into it by assigning a value to the dereferenced pointer:

void f() {
    [...]

    char* allocated_ch = malloc(1);

    *allocated_ch = 'D';
// |-------------|   ^
// deref'd pointer   | value we assign
//     = char        | (also a char)
}

Pointers and how to use them are very important in this course and in C/C++ programming in general. We'll keep coming back to these concepts. The key things to remember are: & takes the address of an object and makes a pointer; * dereferences a pointer and follows it to the value it refers to in memory. Types with an asterisk next to them are pointer types.

Back to Dynamic Allocation: Getting Rid of Dynamic Objects

The big upside of dynamic-lifetime objects is that we can decide at runtime how big they need to be (by asking malloc() for the right number of bytes), and that dynamic-lifetime objects can outlive the function that creates them. The big downside of dynamic-lifetime objects is that it's the programmer's responsibility to free the memory allocated. You do this by calling the free() function with the address of the first allocated byte (the pointer returned from malloc()) as an argument. For example, free(allocated_ch); will free the memory we asked malloc() to set aside for allocated_ch.

Incorrect use of dynamic-lifetime objects is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature. You will spend your share of time debugging these issues, but fortunately there are some neat tools that can help you. We'll learn about those in Lab 2 and future lectures.

Similarities and differences with Java

Calls to malloc() may look clunky, but they effectively do the same thing as the new keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides a new keyword that, in common implementations, obtains its memory by invoking malloc() under the hood. One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.

Dynamic lifetime objects are located in a memory segment called the heap. The size of this segment changes as the program asks for memory and frees it up again. Moreover, the used memory in the heap segment is not necessarily contiguous. Consider a program that allocates four chars, c1 to c4 and then frees the third one: assuming the chars start at address 0x1a00050, the memory would appear as below.

0x1a000.. ... 50 ... 51 ... 52 ... 53         <- addresses
     ----+------+------+------+------+--
    ...  |  c1  |  c2  | FREE |  c4  |   ...  <- values
     ----+------+------+------+------+--

The gap between c2 and c4 arises because the memory of c3 has been freed by the program, but it has not been reused. In other words, the heap can have "holes" in it (this is called fragmentation).

To summarize, the table below lists the memory segments we've learned about, what data they contain, and roughly where they are in terms of memory addresses. (In the OS part of the course, it will turn out that these memory addresses are actually a lie and the OS playing clever tricks on us, but for now let's assume they are the actual memory addresses.)

Object declaration (C program text)	Lifetime	Segment	Example address range (runtime location in x86-64 Linux, non-position-independent)
Constant global	Static	Code (or Text)	0x40'0000 (≈1 × 2²²)
Global	Static	Data	0x60'0000 (≈1.5 × 2²²)
Local	Automatic	Stack	0x7fff'448d'0000 (≈2⁴⁷ = 2²⁵ × 2²²)
Anonymous, returned by `malloc` (C) / `new` (C++)	Dynamic	Heap	0x1a0'0000 (≈6.5 × 2²²)

Strings!

So far, we've looked at the memory for one character. Let's look at strings of characters (mexplore-string.c) instead, since you'll need to work with strings for Lab 1 and Project 1.

A string in C is simply a sequence of bytes, each represented as a char. So, to look at the string in memory, we need to change our call to hexdump() to print the right number of bytes. Our two strings contain 12 and 13 characters, so that's 12 and 13 bytes. And indeed, ./mexplore-string prints those bytes if you pass 12 and 13 to hexdump(). But what if we pass a larger number? Turns out we just get whatever is in memory next to the string. For global_st in our example, that's a bunch of garbage. If we keep going long enough (e.g., 20,000 bytes), the program actually crashes with a Segmentation fault (SEGV). You can probably guess now what that error is about: the program tried to access memory outside of a valid segment, and the OS terminated it to keep things safe.

But how does a program know when a string ends? We may not always know the length at compile time, as we do in these examples. It turns out that there's a bonus NUL (0x0, sometimes written \0) character hanging around at the end of every string. The NUL byte is a terminator that has the special meaning "end-of-string". Every C string must have a NUL byte at its end. (Forgetting the NUL byte is a huge source of bugs in the real world and in your assignments. Watch out for it; this matters for Lab 1 and Project 1!) So, our strings are really 13 and 14 bytes long, not 12 and 13.

How do we allocate memory for a string whose length we don't know at compile time? The answer is that we malloc() the right number of bytes in the dynamic lifetime segment. This is how allocated_st gets the memory to back it. Could we just assign a string into allocated_st? No, because the type of allocated_st is char*, a pointer to a char (specifically, to the first character in the string).

char* allocated_st = malloc(100); // get 100 bytes of memory for string

*allocated_st = "C programming is cool"; // WRONG! assigns byte from address of static string into char
// ^               ^
// | char          | static string: type char[] (which you can treat as char*), at address in static segment

allocated_st = "C programming is cool";  // WRONG! assigns address of static string to pointer, forgets about 100 bytes allocated
// ^              ^
// | char*        | still a static string: type char[] (which you can treat as char*)

So how do we correctly copy a string into another one? The answer is that we need to copy each individual byte of the string, using a loop! (The example below uses a for loop, but we could also use a while loop; try thinking about how that would work.)

char* allocated_st = malloc(100); // get 100 bytes of memory for string
char* temp_st = "C programming is cool";

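// note the "<=": the byte at index strlen(temp_st) is the NUL terminator, and we must copy it too
for (int i = 0; i <= strlen(temp_st); i++) {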
  *(allocated_st + i) = *(temp_st + i);   // Correct, using for loop that assigns each byte separately
  //   ^           ^   |---------------|
  //   | char*     |int     *(char*)
}

Note that we're doing math on addresses here: allocated_st + i adds i to the address stored in allocated_st. In terms of our post office box metaphor, it directs us i mailboxes further to the right. This kind of math on addresses is called pointer arithmetic. C and C++ are languages that allow programmers to use this powerful, but dangerous, concept.

But writing a for loop every time we want to copy a string seems quite painful! Fortunately, we can rely on loops that other people have written for us and use standard library functions to achieve the same thing:

char* allocated_st1 = malloc(100);
sprintf(allocated_st1, "C programming is cool"); // <== Set string, using standard library function

char* allocated_st2 = malloc(100);
char* temp_st = "C programming is still cool";
strcpy(allocated_st2, temp_st); // <== Copy string, using standard library function

In Project 1, you will implement a function quite similar to strcpy! The ideas from this section will help you a lot, so come back to read up on them if you're stuck.

Uninitialized Memory

malloc() asks the OS for some memory in the dynamic segment, and returns a pointer to the first byte of the newly-allocated memory. But what are the contents of that memory?

Let's look at mexplore-uninitialized to find out. This program does the following:

it twice allocates 100 bytes of dynamic lifetime memory;
it then writes two strings into this memory and prints them using hexdump();
it then frees the memory using free(), so it is now unallocated;
finally, the program asks for 100 bytes again, perhaps for another string, and prints the contents of that memory.

What do you expect the contents to be? Some reasonable expectations would be for the memory to contain all zeros, for it to be set to all ones (0xff) or some other special value, or for it to contain random data.

The truth is that all of the above can happen! The contents of uninitialized memory are undefined in the C language standard. Therefore, all sorts of behaviors are acceptable: the compiler could generate computer code to write zeros over newly-allocated memory (some compilers do so in certain situations), or it could just return the address of the first byte of new memory from malloc() and leave the contents at whatever they were before.

What we see today is the latter behavior: the contents of the new memory are whatever they were before, and in our example, we see some fragments of the previous strings in there. This is both awesome and dangerous! It's awesome because our programs are super fast: the computer needs to do no work for new memory and just uses it as-is. But it's hugely dangerous because it's very easy to leak secrets via uninitialized memory. Imagine you stored your SSN and credit card details in the first string, freed it up using free(), and then some other part of the program, perhaps under the control of an evil person, asks for memory and gets these bytes. This kind of thing happens routinely in the real world, and it's a huge source of computer security problems.

How do you avoid leaking secrets through uninitialized memory?

To avoid accidentally leaking information from your program, you as the programmer have to overwrite the memory before you call free(). This is very common in applications that deal with cryptographic information and passwords: before they give up memory using free(), they write zeros all over that memory.

Summary

Today, we looked more at where program data lives in memory. We learned about dynamically-allocated memory, which is hugely important for real-world programs that need to keep objects around after the end of a function or create objects whose size is only known at runtime. We then understood how C strings are represented as sequences of bytes terminated with a NUL byte, and how easy it is to accidentally go past their end. We also learned what the contents of newly allocated, uninitialized memory are (and can be!). You're now in a good place to work on Lab 1 and Project 1, but we'll talk more about arrays and pointer arithmetic next time.