Lecture 3: Strings, Arrays, and Data Representation
» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 6pm on Monday, February 3rd).
Recap: Memory Segments
Why are we covering this?
Understanding the memory layout of your program and the role of different segments will hugely help in debugging larger C and C++ programs later in the course. You'll develop an intuition for how to work with addresses and how to look at memory to see what's contained in it. This is also essential preparation for the operating systems block, where you will see how the OS actually sets up the segments!
Last time, we talked about how data in C programs lives in different memory locations depending on what its lifetime is. The program code, global variables, and constant global variables are all stored in static segments, as these must be around for the whole time the program runs. They all have static lifetimes and known sizes at compile time.
If we look at the addresses printed by mexplore, we see that the global variables are stored at relatively low memory addresses (around 0x60'0000 and 0x40'0000 hexadecimal), while the local variable is stored at a high address, close to 0x7fff'ffff'ffff (about 2^47).
These segments have names:
- The part of the static-lifetime segment that holds program code is called the "text segment" (because it holds program text, though the "text" is in computer language); the parts for read-only globals ("rodata") and modifiable globals ("data" and "bss") also have names.
- The automatic-lifetime segment is called the stack. It's named so because it grows and shrinks like a stack of paper: when you call a function, its automatic-lifetime objects are added to the left of the existing automatic ones, and when the function returns, the program gives their memory up again. We'll see more of that today.
We'll investigate the stack some more in later lectures.
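As a small illustration of static versus automatic lifetime, here is a minimal sketch. The names calls_so_far, bump_global, bump_local, and print_addresses are made up for this example and are not part of the lecture code. The global keeps its value between calls, while the local gets a fresh memory box on the stack each time the function runs.

```c
#include <stdio.h>

int calls_so_far = 0;  // global: static lifetime, lives in the data segment

// Increments the global counter. The value survives between calls because
// the global exists for the whole run of the program.
int bump_global(void) {
    calls_so_far = calls_so_far + 1;
    return calls_so_far;
}

// Increments a local counter. The local is created on the stack at every
// call and given up again on return, so it always restarts at 0.
int bump_local(void) {
    int counter = 0;  // local: automatic lifetime, lives on the stack
    counter = counter + 1;
    return counter;
}

// Prints the addresses of the global and of a local so you can compare them.
// The exact addresses depend on your machine and can change between runs.
void print_addresses(void) {
    int local = 0;
    printf("global at %p, local at %p\n", (void*) &calls_so_far, (void*) &local);
}
```

Calling bump_global() twice returns 1 and then 2, whereas bump_local() returns 1 every time.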
Dynamically Allocated Objects
What if you have a variable that you create inside a function, but you want to keep it around after the function
returns? For example, what if I want to define a character inside f()
in
mexplore-with-dynamic.c
and then print it from main()
? I could make it a global variable, but
when writing the program, I may not know exactly how many characters I'm going to need.
For this purpose, the C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.
To set aside memory, you use the malloc()
standard library function (memory allocate).
malloc()
takes only one argument, which is the number of bytes you're asking for, and it returns the
address of the newly allocated memory (i.e., the address where the OS has set aside memory boxes for this object).
For example,
char* allocated_ch = malloc(1);
reserves 1 byte of memory and stores the address of that byte's memory "box" in the variable allocated_ch.
Interlude: Pointers and How To Use Them
malloc
returns an address, and that address gets stored in a local variable. Local variables have their
memory boxes in the stack segment, on the right of our memory diagram. We refer to memory locations that hold addresses
as pointers, because you can think of them as arrows pointing to other memory boxes.
In terms of C types, a type followed by an asterisk corresponds to a pointer. For example, int*
is a
pointer to an integer. An int*
itself occupies 8 bytes of memory (since it stores an address), and it
points to the first byte of a 4-byte sequence of memory boxes that store an int
.
To actually get to the value that a pointer points to, you use the *
(asterisk) operator on the pointer:
#include <stdio.h>

void f() {
    int local = 1;
    int* ptr = &local;
    // prints 1
    printf("value of ptr: %d\n", *ptr);  // <== here, we "dereference" the pointer to get to the value of local
    // prints the address of local, twice
    printf("address of local: %p %p\n", (void*) &local, (void*) ptr);  // <== "%p" is a printf format for printing pointers (it expects a void*)!
}
For some people, it helps to think of the asterisk operator as "cancelling" out the asterisk on the type:
i.e., *(int*) == int
.
Some C programmers like to put the asterisk next to the type (int* ptr
for int
-pointer
ptr
), others put it with the variable name (int *ptr
), because that way it's clear that the
dereferenced version of ptr
is an int
. Use whatever notation makes sense for you!
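To see & and * working together, consider the following sketch. The function swap_ints is a hypothetical helper written for this example, not something from the lecture code. It receives two addresses and uses dereferencing to change the ints stored in the caller's memory boxes.

```c
// Hypothetical helper: exchanges the two ints that a and b point to.
void swap_ints(int* a, int* b) {
    int tmp = *a;  // follow a to its memory box and read the int stored there
    *a = *b;       // copy the int that b points to into the box a points to
    *b = tmp;      // store the saved value in the box b points to
}
```

A caller with int x = 1 and int y = 2 would write swap_ints(&x, &y), passing the addresses of x and y; afterwards x holds 2 and y holds 1.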
In our character example in mexplore-with-dynamic.c
, we dereference the pointer we got from
malloc()
and put our character ('D'
) into it by assigning a value to the dereferenced pointer:
void f() {
[...]
char* allocated_ch = malloc(1);
*allocated_ch = 'D';
// |-------------| ^
// deref'd pointer | value we assign
// = char | (also a char)
}
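The original motivation was to create a character inside one function and still use it after that function returns. A minimal sketch of that idea follows; make_char is a hypothetical name chosen for illustration, and the actual code in mexplore-with-dynamic.c may be organized differently.

```c
#include <stdlib.h>

// Hypothetical helper: allocates one byte on the heap, stores c in it, and
// returns the address. Returns NULL if malloc() could not provide memory.
char* make_char(char c) {
    char* allocated_ch = malloc(1);
    if (allocated_ch == NULL) {
        return NULL;
    }
    *allocated_ch = c;    // write the character into the heap memory box
    return allocated_ch;  // the box stays valid after this function returns
}
```

The caller can dereference the returned pointer to read the character, and is responsible for eventually passing that pointer to free().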
Pointers and how to use them are very important in this course and in C/C++ programming in general. We'll keep
coming back to these concepts. The key things to remember are: &
takes the address of an object and
makes a pointer, *
dereferences a pointer and follows it to the value it refers to in memory. Types with
an asterisk next to them are pointer types.
Back to Dynamic Allocation: Getting Rid of Dynamic Objects
The big upside of dynamic-lifetime objects is that we can decide at runtime how big they need to be (by asking
malloc()
for the right number of bytes), and that dynamic-lifetime objects can outlive the function that
creates them. The big downside of dynamic-lifetime objects is that it's the programmer's responsibility to free the memory allocated. You do this by calling the free() function with the address of the first allocated byte (the pointer returned from malloc()) as an argument. For example, free(allocated_ch); will free the memory we asked malloc() to set aside for allocated_ch.
Incorrect use of dynamic-lifetime objects is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature. You will spend your share of time debugging these issues, but fortunately there are some neat tools that can help you. We'll learn about those in Lab 2 and future lectures.
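One common defensive habit, shown here only as a sketch, is to reset a pointer to NULL right after freeing it, so that an accidental second free() does nothing instead of corrupting the heap. The helper release below is a hypothetical name; it takes the address of the pointer variable (a char**) so that it can change the caller's pointer.

```c
#include <stdlib.h>

// Hypothetical helper: frees the memory that *ptr points to, then sets *ptr
// to NULL so the caller no longer holds a stale address.
void release(char** ptr) {
    free(*ptr);   // free(NULL) is defined to do nothing, so a repeat call is harmless
    *ptr = NULL;
}
```

With char* allocated_ch = malloc(1), a call to release(&allocated_ch) frees the byte and leaves allocated_ch equal to NULL. This does not help if other copies of the same address exist elsewhere in the program.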
Similarities and differences with Java
Calls to malloc() may look clunky, but they effectively do the same thing as the new keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides a new keyword that, in most implementations, invokes malloc() under the hood. One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.
Dynamic lifetime objects are located in a memory segment called the heap. The size of this segment changes
as the program asks for memory and frees it up again. Moreover, the used memory in the heap segment is not necessarily
contiguous. Consider a program that allocates four char
s, c1
to c4
and
then frees the third one: assuming the char
s start at address 0x1a00050
, the memory would appear
as below.
0x1a000..  ...   50  ... 51  ... 52  ... 53        <- addresses
           ----+------+------+------+------+--
           ... |  c1  |  c2  | FREE |  c4  | ...   <- values
           ----+------+------+------+------+--
The gap between c2
and c4
arises because the memory of c3
has been freed by
the program, but it has not been reused. In other words, the heap can have "holes" in it (this is called
fragmentation).
To summarize, the table below lists the memory segments we've learned about, what data they contain, and roughly where they are in terms of memory addresses. (In the OS part of the course, it will turn out that these memory addresses are actually a lie and the OS playing clever tricks on us, but for now let's assume they are the actual memory addresses.)
Object declaration | Lifetime | Segment | Example address range |
---|---|---|---|
Constant global | Static | Code (or Text) | 0x40'0000 (≈1 × 2^22) |
Global | Static | Data | 0x60'0000 (≈1.5 × 2^22) |
Local | Automatic | Stack | 0x7fff'448d'0000 (≈2^47 = 2^25 × 2^22) |
Anonymous, returned by malloc() | Dynamic | Heap | 0x1a0'0000 (≈6.5 × 2^22) |
Strings!
So far, we've looked at the memory for one character. Let's look at strings of characters
(mexplore-string.c
) instead, since you'll need to work with strings for Lab 1 and Project 1.
A string in C is simply a sequence of bytes, each represented as a char. So, to look at the string in memory, we need to change our call to hexdump() to print the right number of bytes. Our two strings contain 12 and 13 characters, so that's 12 and 13 bytes. And indeed, ./mexplore-string prints those bytes if
you pass 12
and 13
to hexdump()
. But what if we pass a larger number? Turns out we
just get whatever is in memory next to the string. For global_st
in our example, that's a bunch of garbage.
If we keep going long enough (e.g., 20,000 bytes), the program actually crashes with a Segmentation fault
(SEGV). You can probably guess now what that error is about: the program tried to access memory outside of a valid
segment, and the OS terminated it to keep things safe.
But how does a program know when a string ends? We may not always know the length at compile time, as we do in these
examples. It turns out that there's a bonus NUL (0x0
, sometimes written \0
) character hanging
around at the end of every string. The NUL byte is a terminator that has the special meaning
"end-of-string". Every C string must have a NUL byte at its end. (Forgetting the NUL byte is a huge
source of bugs in the real world and in your assignments. Watch out for it; this matters for Lab 1 and Project 1!) So,
our strings are really 13 and 14 bytes long, not 12 and 13.
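The following sketch shows how a program can find the end of a string by scanning for the NUL byte. The function count_chars is a hypothetical helper that mirrors what the standard strlen() does; it is not taken from the lecture code.

```c
#include <stddef.h>

// Hypothetical helper: counts the characters that come before the NUL
// terminator, which is the same result the standard strlen() returns.
size_t count_chars(const char* s) {
    size_t n = 0;
    while (s[n] != '\0') {  // stop at the end-of-string marker
        n++;
    }
    return n;
}
```

For the string literal "hello", count_chars returns 5, while sizeof("hello") is 6 because the literal's storage includes the terminating NUL byte.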
How do we allocate memory for a string whose length we don't know at compile time? The answer is that we
malloc()
the right number of bytes in the dynamic lifetime segment. This is how allocated_st
gets the memory to back it. Could we just assign a string into allocated_st
? No, because the type of
allocated_st
is char*
, a pointer to a char
(specifically, to the first
character in the string).
char* allocated_st = malloc(100);         // get 100 bytes of memory for string

*allocated_st = "C programming is cool";  // WRONG! assigns byte from address of static string into char
// ^            ^
// | char       | static string: type char[] (which you can treat as char*), at address in static segment

allocated_st = "C programming is cool";   // WRONG! assigns address of static string to pointer, forgets about 100 bytes allocated
// ^           ^
// | char*     | still a static string: type char[] (which you can treat as char*)
So how do we correctly copy a string into another one? The answer is that we need to copy each individual byte of
the string, using a loop! (The example below uses a for
loop, but we could also use a while
loop; try thinking about how that would work.)
char* allocated_st = malloc(100);    // get 100 bytes of memory for string
const char* temp_st = "C programming is cool";
size_t len = strlen(temp_st);        // number of characters, not counting the NUL terminator
for (size_t i = 0; i <= len; i++) {  // "<=" so that the NUL terminator is copied too
    *(allocated_st + i) = *(temp_st + i);  // Correct, using for loop that assigns each byte separately
    // ^             ^    |------------|
    // | char*       | size_t  *(char*)
}
Note that we're doing math on addresses here: allocated_st + i
adds i
to the
address stored in allocated_st
. In terms of our post office box metaphor, it directs us i
mailboxes further to the right. This kind of math on addresses is called pointer arithmetic. C and C++ are languages
that allow programmers to use this powerful, but dangerous, concept.
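Here is a short sketch of pointer arithmetic on a type other than char. The helpers sum_ints and int_step_in_bytes are hypothetical names used only for this example. For a char pointer, adding 1 moves one byte to the right; for an int pointer, adding 1 moves to the next int, which is sizeof(int) bytes further along.

```c
#include <stddef.h>

// Hypothetical helper: adds up n ints stored one after another in memory,
// starting at the address in first.
int sum_ints(const int* first, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += *(first + i);  // refers to the same memory box as first[i]
    }
    return total;
}

// Hypothetical helper: returns how many bytes separate p and p + 1 when p
// points to an int. Casting to char* lets us measure the distance in bytes.
size_t int_step_in_bytes(const int* p) {
    return (size_t) ((const char*) (p + 1) - (const char*) p);
}
```

Given int numbers[3] = {1, 2, 3}, sum_ints(numbers, 3) returns 6 and int_step_in_bytes(numbers) returns sizeof(int).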
But writing a for
loop every time we want to copy a string seems quite painful! Fortunately, we can rely
on loops that other people have written for us and use standard library functions to achieve the same thing:
char* allocated_st1 = malloc(100);
sprintf(allocated_st1, "C programming is cool"); // <== Set string, using standard library function
char* allocated_st2 = malloc(100);
char* temp_st = "C programming is still cool";
strcpy(allocated_st2, temp_st); // <== Copy string, using standard library function
In Project 1, you will implement a function quite similar to strcpy
! The ideas from this section will help
you a lot, so come back to read up on them if you're stuck.
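The examples above ask for a fixed 100 bytes, which is wasteful for short strings and too small for long ones. As a sketch of sizing the allocation at runtime, the hypothetical helper copy_string below asks malloc() for exactly as many bytes as the source string needs, including the NUL terminator, and then relies on the standard strcpy() to do the copying.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a heap-allocated copy of src, or NULL if
// malloc() fails. The caller must free() the result when done with it.
char* copy_string(const char* src) {
    size_t size = strlen(src) + 1;  // +1 leaves room for the NUL terminator
    char* copy = malloc(size);
    if (copy == NULL) {
        return NULL;
    }
    strcpy(copy, src);  // copies the characters and the NUL terminator
    return copy;
}
```

Forgetting the + 1 is a classic mistake: strcpy() would then write the NUL byte one position past the end of the allocated memory.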
Uninitialized Memory
malloc()
asks the OS for some memory in the dynamic segment, and returns a pointer to the first byte
of the newly-allocated memory. But what are the contents of that memory?
Let's look at mexplore-uninitialized
to find out. This program does the following:
- it twice allocates 100 bytes of dynamic lifetime memory;
- it then writes two strings into this memory and prints them using hexdump();
- it then frees the memory using free(), so it is now unallocated;
- finally, the program asks for 100 bytes again, perhaps for another string, and prints the contents of that memory.
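What do we expect this newly allocated memory to contain? We might expect it to be all zeros, all ones (0xff) or some other special value, or for it to contain random data.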
The truth is that all of the above can happen! The contents of uninitialized memory are undefined
in the C language standard. Therefore, all sorts of behavior is acceptable: the compiler could generate computer
code to write zeros over newly-allocated memory (some compilers do so in certain situations), or it could just return the
address of the first byte of new memory from malloc()
and leave the contents at whatever they were
before.
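If a program needs memory that is guaranteed to start out as zeros, the C standard library offers calloc(), which allocates memory and sets every byte to zero. The sketch below wraps it in a hypothetical helper called zeroed_buffer; the lecture code itself uses malloc().

```c
#include <stdlib.h>

// Hypothetical helper: returns n bytes of heap memory that are all zero, or
// NULL on failure. calloc(count, size) allocates count * size bytes and
// zeroes them before returning.
char* zeroed_buffer(size_t n) {
    return calloc(n, 1);
}
```

The zeroing costs some time, which is one reason malloc() does not do it by default.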
What we see today is the latter behavior: the contents of the new memory are whatever they were before, and in our
example, we see some fragments of the previous strings in there. This is both awesome and dangerous! It's awesome because
our programs are super fast: the computer needs to do no work for new memory and just uses it as-is. But it's hugely
dangerous because it's very easy to leak secrets via uninitialized memory. Imagine you stored your SSN and credit card
details in the first string, freed it up using free()
, and then some other part of the program, perhaps
under the control of an evil person, asks for memory and gets these bytes. This kind of thing happens routinely in the
real world, and it's a huge source of computer security problems.
How do you avoid leaking secrets through uninitialized memory?
To avoid accidentally leaking information from your program, you as the programmer have to overwrite the memory before you call free(). This is very common in applications that deal with cryptographic information and passwords: before they give up memory using free(), they write zeros all over that memory.
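A minimal sketch of this wipe-then-free pattern is shown below. The helpers wipe and wipe_and_free are hypothetical names for this example. Writing through a volatile pointer is one commonly used way to discourage the compiler from removing the zeroing as unnecessary work; production code often relies on platform-specific functions designed for this purpose instead.

```c
#include <stddef.h>
#include <stdlib.h>

// Hypothetical helper: overwrites n bytes starting at p with zeros. The
// volatile qualifier tells the compiler that each write matters, so it
// should not drop them even though the memory is about to be freed.
void wipe(void* p, size_t n) {
    volatile unsigned char* bytes = p;
    for (size_t i = 0; i < n; i++) {
        bytes[i] = 0;
    }
}

// Hypothetical helper: wipes a heap buffer of n bytes and then frees it.
// The caller has to remember n, because free() does not report the size.
void wipe_and_free(void* p, size_t n) {
    if (p != NULL) {
        wipe(p, n);
    }
    free(p);
}
```

A program holding a secret in a 100-byte heap buffer would call wipe_and_free(buffer, 100) instead of free(buffer).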
Summary
Today, we looked more at where program data lives in memory. We learned about dynamically-allocated memory, which is hugely important for real-world programs that need to keep objects around after the end of a function or create objects whose size is only known at runtime. We then understood how C strings are represented as sequences of bytes terminated with a NUL byte, and how easy it is to accidentally go past their end. We also learned what the contents of newly allocated, uninitialized memory are (and can be!). You're now in a good place to work on Lab 1 and Project 1, but we'll talk more about arrays and pointer arithmetic next time.