Lecture 3: Pointers and Data Representation

» Lecture code
» Post-Lecture Quiz (due 6pm on Monday, February 1st).

Pointers and How To Use Them

We previously discussed that memory boxes can store addresses of other memory boxes, and how an address occupies 8 bytes. C provides the ampersand operator (&) to get the address of any variable. In other words, if you have a variable local, writing &local gives you the address of the memory boxes that store local.

Where are you going to store the address value returned from &local? Well, if you want to hold on to it, you'll want to store it in a variable itself. Where does that variable live? It had better be in memory, too! In other words, the 8 bytes that make up the address of local will have a memory location of their own. We refer to such memory locations that hold addresses as pointers, because you can think of them as arrows pointing to other memory boxes.

In terms of C types, a type followed by an asterisk corresponds to a pointer. For example, int* is a pointer to an integer. An int* itself occupies 8 bytes of memory (since it stores an address), and it points to the first byte of a 4-byte sequence of memory boxes that store an int.

To actually get to the value that a pointer points to, you use the * (asterisk) operator on the pointer (see data1/ptr-intro.c in the lecture code):

#include <stdio.h>

void f() {
    int local = 1;
    int* ptr = &local;

    // prints 1
    printf("value ptr points to: %d\n", *ptr);  // <== here, we "dereference" the pointer to get to the value of local

    // prints the address of local, twice (%p expects a void*, hence the casts)
    printf("address of local: %p %p\n", (void*) &local, (void*) ptr); // <== "%p" is a printf format for printing pointers!
}

For some people, it helps to think of the asterisk operator as "cancelling" out the asterisk on the type: i.e., *(int*) == int.

Some C programmers like to put the asterisk next to the type (int* ptr for int-pointer ptr), others put it with the variable name (int *ptr), because that way it's clear that the dereferenced version of ptr is an int. Use whatever notation makes sense for you!

To change the value behind a pointer, you must dereference it. For example, the following program changes the value of integer local through the pointer ptr by assigning a value to the dereferenced pointer:

void f() {
    int local = 1;
    int* ptr = &local;

    *ptr           = 42;
// |-------------|   ^
// deref'd pointer   | value we assign
//     = int         | (also an int)

    printf("value of local is now: %d\n", local);  // <== prints 42
}

Pointers and how to use them are very important in this course and in C/C++ programming in general. We'll keep coming back to these concepts. The key things to remember are: & takes the address of an object and makes a pointer, while * dereferences a pointer and follows it to the value it refers to in memory. Types with an asterisk next to them are pointer types.

Exploring Data Representation In Memory

We already understand how programs are just bytes in memory, but now let's look in more detail at how data is represented in memory.

Why are we covering this?
We're now building an understanding of where the different parts of a C/C++ program (and, in fact, programs in other languages too!) are stored in memory. This will help you understand how a program obtains and manages memory, something that some programming languages (e.g., Java, OCaml, Pyret) do automatically behind your back, while others, and particularly systems programming languages like C and C++, force you as the programmer to do some of this memory management. This gives you a great degree of control, and allows avoiding expensive hidden memory allocation and copying costs.

For this, we will use the program mexplore.c. This program declares and defines (recall the difference!) three variables, all of type char. char is the name for a byte type in the C language; it refers to the fact that a byte is exactly sufficient to store one character according to the ASCII standard, a way of translating numbers into characters and vice versa. Computers can only store numbers, so all characters in a computer are actually "encoded" as numbers. For example, the uppercase letter "A" in ASCII corresponds to the number 65 (see man ascii for the translation table).

What's ASCII, and do we still use it today?

In the early days of computers, every computer had its own way of encoding letters as numbers. ASCII, the American Standard Code for Information Interchange, was defined in the 1960s to find a common way of representing text. Original ASCII uses 7 bits, and can therefore represent 128 distinct characters – enough for the alphabet, numbers, and some funky special characters (e.g., newlines (\n), the NUL character (\0), and the "bell" character that made typewriter bells go off).

But even the 256 characters that a full 8-bit byte can encode aren't sufficient to support languages that use non-Latin alphabets, and certainly not for advanced emoji. So, while all of today's computers still support ASCII, we've mostly moved on to a new standard, Unicode, which has room for about 1.1 million characters and, in its common UTF-8 encoding, uses one to four bytes per character. Fun fact: to be backwards compatible, UTF-8 is defined such that the original ASCII character encodings remain valid Unicode text.

What's the difference between these three char variables? Let's take a look. The first one, global_ch, is defined in what we call global scope: it's at the top level of the file, not inside a function or inside the curly braces ({ }) that C and C++ use to delineate scopes in the program. This variable can be referred to from anywhere in the entire program.

The second variable, const_global_ch, is also a global variable, but the const keyword indicates that it is constant and the compiler and OS should not allow modifications to it.

Finally, our third variable is inside function f(). It's called local_ch and is a local variable. It's valid only within the scope of f() and other parts of the program (such as main) cannot refer to it.

The hexdump() function that f() calls is defined in hexdump.c and imported via hexdump.h, a "header file". (In a future lecture, we'll talk about why header files exist.) hexdump(ADDR, N) has the effect of printing the contents of N bytes of memory at address ADDR. We're passing our character variables to it, but prefix the variable with an ampersand character, &. So: hexdump(&global_ch, 1) means "print 1 byte from a box located at the address of global_ch".

This is an important concept of the C language: you can always get the memory address at which an object is located. The term "object" here means something different from what it means in an object-oriented language like Java: rather than an instance of a class, an "object" according to the C standard is a set of bytes that contain a value. We will use the term loosely for both data (a variable) and code (a function), although strictly speaking the C standard reserves it for data.

In other words, local and *ptr in the snippet below refer to the same object, i.e., to the same bytes of memory:

void f() {
    int local = 1;
    int* ptr = &local;
}

Recall that &local means "the address of local", and the value stored in the memory location where ptr is located is the 8 bytes that make up the address of local. The type of ptr is int*, which signifies that it is the address of an integer in memory (a short* would be the address of a short, a char* the address of a char, etc.).

You can invert the ampersand operator using the asterisk (*) operator: *ptr dereferences the pointer and turns the address back into a value. In other words, *&local is the same as plain, old local. These concepts are very, very important – you'll use them all the time!

Back to our mexplore.c program though. Let's look at what it prints when we run it (note that the specific addresses will be different on your computer).

$ ./mexplore
00601038  41                                                |A|
004009a4  42                                                |B|
7ffd4977e80f  43                                                |C|

On the left, we see the addresses of our three variables, printed in hexadecimal notation. Next, just to the right of that, we see the hexadecimal value of the data stored in the byte at each of these addresses. For example, hexadecimal 41 (often written 0x41 for clarity) is equal to ... 16*4 + 1 = 65! Not surprisingly, this corresponds to the ASCII character "A", which we see on the right.

But let me draw your attention to the addresses on the left. They vary a fair amount! The exact locations of variables in memory are decided by the compiler and OS together, but the general region where an object lives is determined by its lifetime. Think about how long each of our variables needs to stick around before the memory can be reused:

  1. The global variables, global_ch and const_global_ch, need to be around for the entire runtime of the program, as the program could reference them anywhere in the code. This is called a static lifetime.
  2. The local variable, local_ch, needs to stick around until it goes out of scope, which happens when the execution reaches the closing curly brace of f(). After it's out of scope, no code in the program can refer to the variable, so it is fine to reuse its memory. This is called an automatic lifetime. Local variables and function arguments have automatic lifetimes.

But what if a function needs to create an object whose size is not known at the start of the program (so it can't be global) and which also needs to survive beyond the end of the function? In this common situation, neither a static lifetime nor an automatic lifetime is appropriate.

Dynamically Allocated Objects

What if you have a variable that you create inside a function, but you want to keep it around after the function returns? For example, what if I want to define a character inside f() in mexplore-with-dynamic.c and then print it from main()? I could make it a global variable, but when writing the program, I may not know exactly how many characters I'm going to need.

For this purpose, the C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.

To set aside memory, you use the malloc() standard library function (memory allocate). malloc() takes only one argument, which is the number of bytes you're asking for, and it returns the address of the newly allocated memory (i.e., the address where the OS has set aside memory boxes for this object). For example,

char* allocated_ch = (char*) malloc(1);

reserves 1 byte of memory and stores the address of that byte's memory "box" in the variable allocated_ch. The (char*) in parentheses is a cast of the pointer returned to the type we expect (a pointer to a char); malloc() itself does not have any idea what kind of object you're allocating, so it returns an untyped pointer (a void*). C++ requires this cast; C performs the conversion implicitly, so there the cast is optional.

To actually set the value inside a dynamic lifetime object, we have to dereference the pointer as usual. For example, in our character example in mexplore-with-dynamic.c, we dereference the pointer we got from malloc() and put our character ('D') into it by assigning a value to the dereferenced pointer:

void f() {
    [...]

    char* allocated_ch = (char*) malloc(1);

    *allocated_ch = 'D';
// |-------------|   ^
// deref'd pointer   | value we assign
//     = char        | (also a char)
}

Similarities and differences with Java

Calls to malloc() may look clunky, but they effectively do the same thing as the new keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides a new keyword that, under the hood, typically invokes malloc(). One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.

The big upside of dynamic-lifetime objects is that we can decide at runtime how big they need to be, and that they can outlive the function that creates them. Consider a string that takes the characters a user typed into the program – a quantity that's hard to predict correctly, and data that we certainly want to outlive the function that reads the input! The big downside of dynamic-lifetime objects is that it's the programmer's responsibility to free the memory allocated. You do this by calling the free() function with the address of the allocated boxes as an argument. For example, free(allocated_ch); will free the memory we asked malloc() to set aside for allocated_ch.

Incorrect use of dynamic lifetimes is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature.

Memory Segments

Objects with different lifetimes are grouped into different regions in memory. The program code, global variables, and constant global variables are all stored in static segments, as these all have static lifetimes and known sizes at compile time.

Other objects come and go, and therefore the memory regions that contain them grow and shrink. For example, as functions call each other, they create more and more local variables with automatic lifetimes; and as the program calls malloc() to reserve memory for objects with dynamic lifetimes, more memory is needed for these objects.

If we look at the addresses printed by mexplore, we see that the global variables are stored at relatively low memory addresses (around 0x60'0000 hexadecimal and 0x40'0000 hexadecimal), while the local variable is stored at a high address, close to 0x7fff'ffff'ffff (about 2^47), and the dynamically allocated character is stored in between (albeit closer to the static segments).

There is a reason for this placement: it allows both types of segments to grow without risk of getting in each other's way. In particular, the automatic-lifetime segment grows downwards as more local variables come into scope, while the dynamic-lifetime segment grows upwards. If they ever meet, your computer runs out of memory.

The size of the dynamic segment changes as the program asks for memory and frees it up again. Moreover, the used memory in the dynamic segment is not necessarily contiguous. Consider a program that allocates four chars, c1 to c4 and then frees the third one: assuming the chars start at address 0x1a00050, the memory would appear as below.

0x1a000.. ... 50 ... 51 ... 52 ... 53         <- addresses
     ----+------+------+------+------+--
    ...  |  c1  |  c2  | FREE |  c4  |   ...  <- values
     ----+------+------+------+------+--

The gap between c2 and c4 arises because the memory of c3 has been freed by the program, but it has not been reused. In other words, the dynamic segment can have "holes" in it (this is called fragmentation).

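Finally, these segments have names:

  1. The static segments are the code (or text) segment, which holds the program code and constant globals, and the data segment, which holds modifiable globals.
  2. The automatic-lifetime segment is called the stack.
  3. The dynamic-lifetime segment is called the heap.

The stack and heap terms are important, and you will keep seeing them!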

Java Similarities Note

In Java, any object created with the new keyword is allocated with dynamic lifetime and lives on the heap. (Java puts only "primitive" types (int, double etc.) on the stack.) Under the hood, the Java runtime sets aside heap memory for these objects much as malloc() does. Yet, it seems like Java has automatic lifetime for all objects, as you never need to destroy them explicitly! This works because your program runs inside the Java virtual machine (JVM), which "magically" tracks whether each object is still reachable via an in-scope variable; once an object is no longer reachable, the JVM's garbage collector automatically reclaims its memory. But all this magic is not free in performance terms, as there is code to run to work out which objects are still reachable. C and C++ instead opt to do nothing and give maximum control to the programmer, for better or worse.

To summarize, the table below lists the memory segments we've learned about, what data they contain, and roughly where they are in terms of memory addresses. (In the OS part of the course, it will turn out that these memory addresses are actually a lie and the OS is playing clever tricks on us, but for now let's assume they are the actual memory addresses.)

Object declaration                            | Lifetime  | Segment        | Example address range
(C program text)                              |           |                | (runtime location in x86-64 Linux, non-position-independent)
----------------------------------------------+-----------+----------------+---------------------------------------
Constant global                               | Static    | Code (or Text) | 0x40'0000 (≈1 × 2^22)
Global                                        | Static    | Data           | 0x60'0000 (≈1.5 × 2^22)
Local                                         | Automatic | Stack          | 0x7fff'448d'0000 (≈2^47 = 2^25 × 2^22)
Anonymous, returned by malloc (C) / new (C++) | Dynamic   | Heap           | 0x1a0'0000 (≈6.5 × 2^22)

Summary

Today, we looked at the notion of pointers (values storing addresses) in C. Then, we discussed more about where program data lives in memory. We learned about dynamically-allocated memory, which is hugely important for real-world programs that need to keep objects around after the end of a function or create objects whose size is only known at runtime. We then built an understanding of how a program's memory is split into different segments that contain objects with different lifetimes. You're now in a good place to work on Lab 1, but we'll talk more about arrays and pointer arithmetic next time.