Lecture 3: Pointers and Data Representation
» Lecture code
» Post-Lecture Quiz (due 6pm on Monday, February 1st).
Pointers and How To Use Them
We previously discussed that memory boxes can store addresses of other memory boxes, and how an address occupies 8 bytes. C provides the ampersand operator (&) to get the address of any variable. In other words, if you have a variable local, writing &local gives you the address of the memory boxes that store local.
Where are you going to store the address value returned from &local? Well, if you want to hold on to it, you'll want to store it in a variable itself. Where does that variable live? It had better be in memory, too! In other words, the 8 bytes that make up the address &local will have a memory location of their own.
We refer to such memory locations that hold addresses as pointers, because you can think of them as
arrows pointing to other memory boxes.
In terms of C types, a type followed by an asterisk corresponds to a pointer. For example, int* is a pointer to an integer. An int* itself occupies 8 bytes of memory (since it stores an address), and it points to the first byte of a 4-byte sequence of memory boxes that store an int.
To actually get to the value that a pointer points to, you use the * (asterisk) operator on the pointer (see data1/ptr-intro.c in the lecture code):
#include <stdio.h>

void f() {
    int local = 1;
    int* ptr = &local;
    // prints 1
    printf("value of ptr: %d\n", *ptr); // <== here, we "dereference" the pointer to get to the value of local
    // prints the address of local, twice ("%p" expects a void*, hence the casts)
    printf("address of local: %p %p\n", (void*) &local, (void*) ptr); // <== "%p" is a printf format for printing pointers!
}
For some people, it helps to think of the asterisk operator as "cancelling" out the asterisk on the type: i.e., *(int*) == int.
Some C programmers like to put the asterisk next to the type (int* ptr for int-pointer ptr), others put it with the variable name (int *ptr), because that way it's clear that the dereferenced version of ptr is an int. Use whatever notation makes sense for you!
To change the value behind a pointer, you must dereference it. For example, the following program changes the value of integer local through the pointer ptr by assigning a value to the dereferenced pointer:
void f() {
    int local = 1;
    int* ptr = &local;
    *ptr = 42;
    // ^     ^
    // |     value we assign (also an int)
    // deref'd pointer (= int)
    printf("value of local is now: %d\n", local); // <== prints 42
}
Pointers and how to use them are very important in this course and in C/C++ programming in general. We'll keep coming back to these concepts. The key things to remember are: & takes the address of an object and makes a pointer; * dereferences a pointer and follows it to the value it refers to in memory. Types with an asterisk next to them are pointer types.
Exploring Data Representation In Memory
We already understand how programs are just bytes in memory, but now let's look in more detail at how data is represented in memory.
Why are we covering this?
We're now building an understanding of where the different parts of a C/C++ program (and, in fact, programs in other languages too!) are stored in memory. This will help you understand how a program obtains and manages memory, something that some programming languages (e.g., Java, OCaml, Pyret) do automatically behind your back, while others, and particularly systems programming languages like C and C++, force you as the programmer to do some of this memory management. This gives you a great degree of control, and allows avoiding expensive hidden memory allocation and copying costs.
For this, we will use the program mexplore.c. This program declares and defines (recall the difference!) three variables, all of type char. char is the name for a byte type in the C language; the name refers to the fact that a byte is exactly sufficient to store one character according to the ASCII standard, a way of translating numbers into characters and vice versa. Computers can only store numbers, so all characters in a computer are actually "encoded" as numbers. For example, the uppercase letter "A" in ASCII corresponds to the number 65 (see man ascii for the translation table).
What's ASCII, and do we still use it today?
In the early days of computers, every computer had its own way of encoding letters as numbers. ASCII, the American Standard Code for Information Interchange, was defined in the 1960s to find a common way of representing text. Original ASCII uses 7 bits, and can therefore represent 128 distinct characters – enough for the alphabet, numbers, and some funky special characters (e.g., newlines (\n), the NUL character (\0), and a "bell" character that made typewriter bells go off).
But even the 256 characters that fit into a full 8-bit byte aren't sufficient to support languages that use non-Latin alphabets, and certainly not for advanced emoji. So, while all of today's computers still support ASCII, we've mostly moved on to a new standard, Unicode, which has room for about 1.1 million characters; its most common encoding, UTF-8, uses one to four bytes per character. Fun fact: to be backwards compatible, UTF-8 is defined such that the original ASCII character encodings remain valid Unicode text.
What's the difference between these three char variables? Let's take a look. The first one, global_ch, is defined in what we call global scope: it's at the top level of the file, not inside a function or inside the curly braces ({ }) that C and C++ use to delineate scopes in the program. This variable can be referred to from anywhere in the entire program.
The second variable, const_global_ch, is also a global variable, but the const keyword indicates that it is constant and the compiler and OS should not allow modifications to it.
Finally, our third variable is inside function f(). It's called local_ch and is a local variable. It's valid only within the scope of f(), and other parts of the program (such as main) cannot refer to it.
The hexdump() function that f() calls is defined in hexdump.c and imported via hexdump.h, a "header file". (In a future lecture, we'll talk about why header files exist.)
hexdump(ADDR, N) has the effect of printing the contents of N bytes of memory at address ADDR. We're passing our character variables to it, but prefix the variable with an ampersand character, &. So: hexdump(&global_ch, 1) means "print 1 byte from a box located at the address of global_ch".
This is an important concept of the C language: you can always get the memory address at which an object is located. The term "object" here means something different from what it means in an object-oriented language like Java: rather than an instance of a class, an "object" according to the C standard is a region of memory whose contents represent a value, such as a variable. (Functions live in memory and have addresses too, although the standard does not strictly call them objects.)
In other words, local and *ptr in the snippet below refer to the same object, i.e., to the same bytes of memory:
void f() {
    int local = 1;
    int* ptr = &local;
}
Recall that &local means "the address of local", and the value stored in the memory location where ptr is located is the 8 bytes that make up the address of local. The type of ptr is int*, which signifies that it is the address of an integer in memory (a short* would be the address of a short, a char* the address of a char, etc.).
You can invert the ampersand operator using the asterisk (*) operator: *ptr dereferences the pointer and turns the address back into a value. In other words, *&local is the same as plain, old local. These concepts are very, very important – you'll use them all the time!
Back to our mexplore.c program though. Let's look at what it prints when we run it (note that the specific addresses will be different on your computer).
$ ./mexplore
00601038 41 |A|
004009a4 42 |B|
7ffd4977e80f 43 |C|
On the left, we see the addresses of our three variables, printed in hexadecimal notation. Next, just to the right of that, we see the hexadecimal value of the data stored in the byte at each of these addresses. For example, hexadecimal 41 (often written 0x41 for clarity) is equal to ... 16*4 + 1 = 65! Not surprisingly, this corresponds to the ASCII character "A", which we see on the right.
But let me draw your attention to the addresses on the left. They vary a fair amount! The exact locations of variables in memory are decided by the compiler and OS together, but the general region where an object lives is determined by its lifetime. Think about how long each of our variables needs to stick around before the memory can be reused:
- The global variables, global_ch and const_global_ch, need to be around for the entire runtime of the program, as the program could reference them anywhere in the code. This is called a static lifetime.
- The local variable, local_ch, needs to stick around until it goes out of scope, which happens when the execution reaches the closing curly brace of f(). After it's out of scope, no code in the program can refer to the variable, so it is fine to reuse its memory. This is called an automatic lifetime. Local variables and function arguments have automatic lifetimes.
But what if a function needs to create an object whose size is not known at the start of the program (so it can't be global) and which also needs to survive beyond the end of the function? In this common situation, neither a static lifetime nor an automatic lifetime is appropriate.
Dynamically Allocated Objects
What if you have a variable that you create inside a function, but you want to keep it around after the function returns? For example, what if I want to define a character inside f() in mexplore-with-dynamic.c and then print it from main()? I could make it a global variable, but when writing the program, I may not know exactly how many characters I'm going to need.
For this purpose, the C language allows for a third kind of object lifetime: a dynamic lifetime. For an object with a dynamic lifetime, you as the programmer have to explicitly create and destroy the object – that is, you must set aside memory for the object and make sure it is released again when the object is no longer needed.
To set aside memory, you use the malloc() standard library function (memory allocate). malloc() takes only one argument, which is the number of bytes you're asking for, and it returns the address of the newly allocated memory (i.e., the address where the OS has set aside memory boxes for this object). For example,
char* allocated_ch = (char*) malloc(1);
reserves 1 byte of memory and stores the address of that byte's memory "box" in the variable allocated_ch. The (char*) in parentheses is a cast of the returned pointer to the type we expect (a pointer to a char); malloc() itself does not have any idea what kind of object you're allocating, so it returns an untyped address (a void*). C++ requires the cast to tell the compiler what type we mean; plain C converts a void* implicitly, so there the cast is optional.
To actually set the value inside a dynamic lifetime object, we have to dereference the pointer as usual. For example, in our character example in mexplore-with-dynamic.c, we dereference the pointer we got from malloc() and put our character ('D') into it by assigning a value to the dereferenced pointer:
void f() {
    [...]
    char* allocated_ch = malloc(1);
    *allocated_ch = 'D';
    // ^              ^
    // |              value we assign (also a char)
    // deref'd pointer (= char)
}
Similarities and differences with Java
Calls to malloc() may look clunky, but they effectively do the same thing as the new keyword in Java: setting aside memory for a new object. Indeed, C++ actually provides a new keyword that, under the hood, typically invokes malloc(). One big difference compared to Java, however, is that you're responsible for cleaning up and returning that memory. Java figures out automatically when an object with a dynamic lifetime is no longer needed, and frees the memory then (a process called "garbage collection"). C and C++ don't do so, but leave it to the programmer to decide when the time is right to return the memory.
The big upside of dynamic-lifetime objects is that we can decide at runtime how big they need to be, and that they can outlive the function that creates them. Consider a string that takes the characters a user typed into the program – a quantity that's hard to predict correctly, and data that we certainly want to outlive the function that reads the input! The big downside of dynamic-lifetime objects is that it's the programmer's responsibility to free the memory allocated. You do this by calling the free() function with the address of the allocated boxes as an argument. For example, free(allocated_ch); will free the memory we asked malloc() to set aside for allocated_ch.
Incorrect use of dynamic lifetimes is an immensely common source of problems, bugs, and security holes in C/C++ programs: serious problems like memory leaks, double free, use-after-free, etc. all arise from this language feature.
Memory Segments
Objects with different lifetimes are grouped into different regions in memory. The program code, global variables, and constant global variables are all stored in static segments, as these all have static lifetimes and known sizes at compile time.
Other objects come and go, and therefore the memory regions that contain them grow and shrink. For example, as functions call each other, they create more and more local variables with automatic lifetimes; and as the program calls malloc() to reserve memory for objects with dynamic lifetimes, more memory is needed for these objects.
If we look at the addresses printed by mexplore, we see that the global variables are stored at relatively low memory addresses (around 0x60'0000 hexadecimal and 0x40'0000 hexadecimal), while the local variable is stored at a high address, close to 0x7fff'ffff'ffff (about 2^47), and the dynamically allocated character is stored in between (albeit closer to the static segments).
There is a reason for this placement: it allows both types of segments to grow without risk of getting in each other's way. In particular, the automatic-lifetime segment grows downwards as more local variables come into scope, while the dynamic-lifetime segment grows upwards. If they ever meet, your computer runs out of memory.
The size of the dynamic segment changes as the program asks for memory and frees it up again. Moreover, the used memory in the dynamic segment is not necessarily contiguous. Consider a program that allocates four chars, c1 to c4, and then frees the third one: assuming the chars start at address 0x1a00050, the memory would appear as below.
0x1a000..   ... 50 ... 51 ... 52 ... 53       <- addresses
        ----+------+------+------+------+--
        ... |  c1  |  c2  | FREE |  c4  | ...  <- values
        ----+------+------+------+------+--
The gap between c2 and c4 arises because the memory of c3 has been freed by the program, but it has not been reused. In other words, the dynamic segment can have "holes" in it (this is called fragmentation).
Finally, these segments have names:
- The part of the static-lifetime segment that holds program code is called the "text segment" (because it holds program text; though the "text" is in computer language); the parts for read-only globals ("rodata") and modifiable globals ("data" and "bss") also have names.
- The automatic-lifetime segment is called the stack. It's named so because it grows and shrinks like a stack of paper: when you call a function, its automatic-lifetime objects are added next to the existing automatic ones (at lower addresses, since the stack grows downwards), and when the function returns, the program gives their memory up again.
- The dynamic-lifetime segment is called the heap, and it generally grows upwards in terms of memory addresses. But unlike the stack, it can have "holes": if the programmer destroyed an object that sits in between two others in memory, there is unused memory in between. (You can think of this as "air gaps" in a heap of things, maybe!).
Java Similarities Note
In Java, any object created with the new keyword is allocated with dynamic lifetime and lives on the heap. (Java puts only "primitive" types (int, double, etc.) on the stack.) Indeed, under the hood the Java runtime sets aside this memory much like malloc() does! Yet, it seems like Java has automatic lifetime for all objects, as you never need to destroy them explicitly! This works because your program runs inside the Java virtual machine (JVM), which "magically" tracks whether each object is still reachable via an in-scope variable; once no references to an object remain, the JVM automatically frees its memory. But all this magic is not free in performance terms, as there is code to run to keep track of which objects are still in use. C and C++ instead opt to do nothing and give maximum control to the programmer, for better or worse.
To summarize, the table below lists the memory segments we've learned about, what data they contain, and roughly where they are in terms of memory addresses. (In the OS part of the course, it will turn out that these memory addresses are actually a lie and the OS playing clever tricks on us, but for now let's assume they are the actual memory addresses.)
Object declaration | Lifetime | Segment | Example address range
---|---|---|---
Constant global | Static | Code (or Text) | 0x40'0000 (≈1 × 2^22)
Global | Static | Data | 0x60'0000 (≈1.5 × 2^22)
Local | Automatic | Stack | 0x7fff'448d'0000 (≈2^47 = 2^25 × 2^22)
Anonymous, returned by | Dynamic | Heap | 0x1a0'0000 (≈8 × 2^22)
Summary
Today, we looked at the notion of pointers (values storing addresses) in C. Then, we discussed more about where program data lives in memory. We learned about dynamically-allocated memory, which is hugely important for real-world programs that need to keep objects around after the end of a function or create objects whose size is only known at runtime. We then built an understanding of how a program's memory is split into different segments that contain objects with different lifetimes. You're now in a good place to work on Lab 1, but we'll talk more about arrays and pointer arithmetic next time.