Lecture 4: Arrays and Data Representation
» Lecture video (Brown ID required)
» Lecture code
» Post-Lecture Quiz (due 6pm Wednesday, February 5).
Addenda to Lecture 3
Here are addenda to two of the programs we looked at last time. First, we saw that the stack grew in unexpected ways in our mexplore.c program: a new local variable received an address higher than previous local variables' addresses. We guessed this happened because of a compiler optimization, and that is indeed the case. If you build the code in this lecture's folder in the lecture code repository, you will see two binaries: mexplore.noopt and mexplore.opt. The former is unoptimized (all compiler optimizations turned off) and the latter is built with compiler optimizations enabled.
If you run mexplore.noopt, you'll see that the addresses of our local variables on the stack are in the descending order you would expect: later variables have lower hexadecimal addresses. If you run mexplore.opt, by contrast, things are not quite in descending order. The reason is that the compiler applies two optimizations. First, it does something called inlining, where it takes a short function (like our f()) and, instead of bothering to call it, simply copies and pastes f()'s code directly into main(). Second, we were dealing with char objects that are 1 byte in size, and a property called alignment prevents the compiler from putting addresses (8 bytes) or other larger things (e.g., integers) at memory addresses that aren't multiples of these objects' sizes. The compiler therefore decided that it can put a few characters into bytes above the first character's address that would otherwise remain unused. (Don't worry about the specifics; we'll talk about alignment in more detail.)
The second confusion was over why I could not modify characters in a string. It turns out that our compilers make literal strings (those strings written directly into the program code in between double quotes, like "We <3 systems") read-only to prevent us from accidentally writing over adjacent bytes in the static-lifetime segment of memory. This explains why I received a segmentation fault when trying to write to such a literal string. If I instead allocate the string's memory in the dynamic-lifetime segment, the code works:
char* allocated_st = (char*)malloc(100);
sprintf(allocated_st, "We <3 systems");
// This works!
for (int i = 0; i < 13; i++) {
    if (*(allocated_st + i) == '<') {
        *(allocated_st + i) = 'E';
        *(allocated_st + i + 1) = '>';
    }
}
See the corrected example in last lecture's code for details.
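Another way to get a writable string, sketched here as an illustration rather than taken from the lecture code, is to initialize a char array from the literal. The array receives its own copy of the bytes, so writing to it is safe. The function names replace_heart and array_copy_is_writable are made up for this sketch.

```c
#include <string.h>

// Hypothetical helper: replaces the first "<3" in s with "E>" in place.
// Returns 1 if a replacement happened and 0 otherwise.
// s must point to writable memory (not to a string literal).
int replace_heart(char* s) {
    size_t len = strlen(s);
    for (size_t i = 0; i + 1 < len; i++) {
        if (s[i] == '<' && s[i + 1] == '3') {
            s[i] = 'E';
            s[i + 1] = '>';
            return 1;
        }
    }
    return 0;
}

// Hypothetical helper: copies the literal into a stack array, modifies the
// copy, and returns 1 if the copy now reads "We E> systems".
int array_copy_is_writable(void) {
    char buf[] = "We <3 systems";  // buf holds its own copy of the literal
    replace_heart(buf);
    return strcmp(buf, "We E> systems") == 0;
}
```

Passing a pointer to the literal itself to replace_heart would run into the same read-only memory problem described above.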
Arrays
Why are we covering this?
As programmers, we often want to deal with collections of data, such as a million integers to sort. But in C and C++, such collections ultimately turn into bytes in memory that the compiler lays out in specific ways and that we as programmers have to interpret correctly. Once you understand arrays in C and how they relate to pointers, you will understand how computers represent sequences of data. This becomes important in the vector part of Project 1, in Project 2, and in the OS part of the course.
We've danced around the concept of arrays in C for a while now. C's arrays are simple and lay out a set of equal-size elements consecutively in memory. In many ways, arrays and strings in C are very similar. In particular, you can always think of an array as a pointer to the start of its memory, above which point the elements of the array are laid out in sequence.
C allows you to declare, define, and initialize an array in one go, using curly bracket notation: int arr[] = { 1, 2, 3 }; declares and defines an array of three integers and immediately initializes its contents to the numbers 1 to 3.
How large is this array in memory? It contains three integers of four bytes each, so the array will use 12 bytes in memory. Note that the length of the array (3 elements) is not stored with it in memory! In general, it's up to you as the programmer to remember what an array's length is.
All elements of an array in C must have the same type and size in memory, and the length of an array is fixed. In other words, you cannot have an array of { int, char, int, long }, nor can you append elements to an array. This is similar to Java arrays, but a significant difference compared to Python or OCaml lists.
You can also declare an array in C without initializing it. To do so, you put the array's desired length into square brackets next to the variable name: int a[5] is an array of five integers. But what is the size of such an array in memory? You can calculate it manually: the array must be backed by sufficient memory to hold five integers (20 bytes, since each integer is 4 bytes long). But it turns out C also has a handy keyword to help you get the byte sizes of its types.
Finding object sizes with sizeof
The sizeof keyword returns the size in bytes (!) of its argument, which can either be a type or an object. Some examples:
- sizeof(int) is 4 bytes, because integers consist of four bytes;
- sizeof(char) is 1 byte;
- sizeof(int*) and sizeof(char*) are both 8 bytes, because all pointers on a 64-bit computer are 8 bytes in size;
- for int i = 5;, sizeof(i) is also 4 bytes, because i is an integer;
- for an array int arr[] = { 1, 2, 3 }, sizeof(arr) is 12 bytes;
- for a pointer to an array, such as int* p = &arr[0], sizeof(p) is 8 bytes, independent of the size of the array.
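The examples above can be checked with a small sketch like the following. It avoids hard-coding 4 and 8, since those values assume the 64-bit machines we use in this course; the function names are invented for this illustration.

```c
#include <stddef.h>

// Hypothetical helper: bytes used by a three-element int array,
// which is three times sizeof(int) (12 bytes when an int is 4 bytes).
size_t array_bytes(void) {
    int arr[] = { 1, 2, 3 };
    return sizeof(arr);
}

// Hypothetical helper: bytes used by a single int variable,
// which is the same as sizeof(int).
size_t int_variable_bytes(void) {
    int i = 5;
    return sizeof(i);
}
```

Both functions return values that the compiler already knows when it builds the program, because the types of arr and i are fixed in the source code.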
sizeof is at the same time great and a huge source of confusion. It is crucial to remember that sizeof only returns the byte size of the known compile-time type of its argument. Importantly, sizeof cannot return the length of an array, nor can it return the size of the memory allocation behind a pointer. If you call sizeof(ptr), where ptr is a char*, you will get 8 bytes (since the size of a pointer is 8 bytes), independent of whether that char* points to a much larger memory allocation (e.g., 100 bytes on the heap).
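Here is a minimal sketch of that pitfall. The function name sizeof_heap_pointer is hypothetical; the point is only that the result does not depend on how many bytes were allocated.

```c
#include <stdlib.h>

// Hypothetical helper: allocates nbytes on the heap and returns what sizeof
// reports for the pointer to that allocation.
size_t sizeof_heap_pointer(size_t nbytes) {
    char* ptr = (char*)malloc(nbytes);
    size_t result = sizeof(ptr);  // size of the pointer itself, not of the allocation
    free(ptr);                    // free(NULL) is allowed, so this is safe if malloc failed
    return result;
}
```

Calling sizeof_heap_pointer(1) and sizeof_heap_pointer(100) gives the same answer, namely sizeof(char*). If you need the allocation size later, you have to store it yourself.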
Why does sizeof work this way?
The dirty, but amazing, secret behind sizeof is that it results in no compiled code. Instead, the C compiler replaces any invocation of sizeof(type) or sizeof(expression) with the byte size of the argument known at compile time (so sizeof(int) just turns into a literal 4 in the program). Hence, sizeof cannot possibly determine runtime information like the size of a memory allocation; but it also requires zero processor operations and memory at runtime!
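One way to convince yourself that sizeof is resolved at compile time is to use it in places where C demands a constant. The sketch below is an illustration under that idea; the names INT_BYTES, scratch, int_bytes, and scratch_bytes are invented for it.

```c
#include <stddef.h>

// An enumerator must be a compile-time constant, and sizeof(int) qualifies.
enum { INT_BYTES = sizeof(int) };

// The length of a global array must also be a compile-time constant.
char scratch[4 * sizeof(int)];

// Hypothetical helper: returns the enumerator computed from sizeof(int).
int int_bytes(void) {
    return INT_BYTES;
}

// Hypothetical helper: returns the size of the global array in bytes.
size_t scratch_bytes(void) {
    return sizeof(scratch);
}
```

Neither definition would compile if sizeof had to run code to find its answer.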
Arrays are just pointers!
Recall our array int a[5] of five integers. This declaration will tell the compiler to set aside sufficient memory to hold five integers (20 bytes, since each integer is 4 bytes long).
What are the contents of the memory for a? Again, the memory is uninitialized, so it could be anything! To actually fill in your array, you use subscript notation with square brackets on the left-hand side of an assignment: a[0] = 1;, a[1] = 2; etc.
Now here's a curious, but super important detail. a[1] means that we're assigning into the second element of our array. Where is that element in memory, i.e., what does a[1] really mean in terms of the memory boxes that we write to?
Let's figure this out from first principles. a[0] is the first element, which starts at the first address in the array. That address is the same as a pointer to the first byte of a in memory. In fact, we can rewrite a[0] as *((int*)a + 0)! Let's tease this apart: (int*)a means that we want to treat a as a pointer to an integer in memory, and the + 0 part adds zero to it using pointer arithmetic. Then we dereference the resulting pointer, which gives us the value at a[0].
By this reasoning, what does a[1] translate to? If you guessed *((int*)a + 1), that's correct! But here's a snag: remember that integers are 4 bytes long. Pointer arithmetic always advances the address stored in the pointer in units of the size of the type it points to. In this case, this means that (int*)a + 1 adds 4 bytes to the address (so it refers to the same byte as (char*)a + 4). Putting it all together, a[1] = 2; turns out to really mean "write the integer 2 into the four memory boxes starting four boxes down from the address in a".
Here's an important takeaway: array subscript notation and pointer arithmetic are one and the same thing! In fact, the C compiler internally just turns your square-bracket subscript notation into pointer arithmetic.
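To see this equivalence in action, here is a small sketch that writes with subscript notation and reads back with pointer arithmetic. The function names are hypothetical and only serve this illustration.

```c
#include <stddef.h>

// Hypothetical helper: writes a[1] with subscript notation and reads the
// same element back through pointer arithmetic.
int read_second_via_pointer(void) {
    int a[5];
    a[0] = 1;
    a[1] = 2;
    return *(a + 1);  // same memory location as a[1]
}

// Hypothetical helper: number of bytes between a[0] and a[1]. Adding 1 to an
// int pointer moves the address forward by sizeof(int) bytes, not by 1 byte.
ptrdiff_t byte_step(void) {
    int a[5];
    return (char*)(a + 1) - (char*)a;
}
```

read_second_via_pointer() returns 2, and byte_step() returns sizeof(int), which is 4 on the machines we use in this course.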
More generally, the C language definition has a rule for collections of data: the first member rule, which says that the address of a collection is the same as the address of its first member. This rule applies to arrays, but we'll shortly see that it also applies to other structures.
There is a second rule for arrays, the array rule, which says that all elements ("members") of the array are laid out consecutively in memory.
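Both rules can be checked directly by comparing addresses. The following is a minimal sketch with an invented function name, not code from the lecture repository.

```c
#include <stddef.h>

// Hypothetical helper: returns 1 if both rules hold for a five-element int
// array, and 0 otherwise.
int array_rules_hold(void) {
    int a[5] = { 10, 20, 30, 40, 50 };

    // First member rule: the array starts at the address of its first element.
    if ((void*)&a != (void*)&a[0]) {
        return 0;
    }

    // Array rule: each element starts exactly sizeof(int) bytes after the
    // previous one, with no gaps in between.
    for (int i = 0; i + 1 < 5; i++) {
        if ((char*)&a[i + 1] - (char*)&a[i] != (ptrdiff_t)sizeof(int)) {
            return 0;
        }
    }
    return 1;
}
```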
Finally, you'll now see how strings and arrays are super similar: you can think of a string of n characters as an array of char elements, each one byte in length, with a bonus terminator element at the (n+1)th position in the array (whose memory allocation must therefore be n+1 bytes long, which is easy to forget!).
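The extra terminator byte shows up whenever you compare a string's length with the memory behind it. Here is a sketch under that idea; text_length, text_bytes, and copy_string are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: number of characters in "systems" (7),
// not counting the terminator.
size_t text_length(void) {
    char s[] = "systems";
    return strlen(s);
}

// Hypothetical helper: bytes in the array that holds "systems" (8),
// which includes the terminator.
size_t text_bytes(void) {
    char s[] = "systems";
    return sizeof(s);
}

// Hypothetical helper: copies src into newly allocated heap memory. The
// allocation is n + 1 bytes so that the terminator fits. The caller must
// free the result. Returns NULL if malloc fails.
char* copy_string(const char* src) {
    size_t n = strlen(src);
    char* dst = (char*)malloc(n + 1);
    if (dst == NULL) {
        return NULL;
    }
    memcpy(dst, src, n + 1);  // copies the n characters plus the terminator
    return dst;
}
```

Allocating only n bytes in copy_string would write the terminator one byte past the end of the allocation.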
Structures (struct): making new types
Why are we covering this?
C would be rather restricted if you could only use its primitive types (int, char, and so on). To let you define new data types of your own, the language supplies the idea of a structure (struct). Structures occur all the time in real-world C programs, and they're very important both to understanding how we can lay out data in specific patterns in memory in order to communicate with hardware (one big purpose of systems programming languages when used to write operating systems!), and to understanding how C++ classes work.
The C language uses the struct keyword to define new data structures: objects laid out in memory in a specific format. This is a very powerful part of the language, and you'll find yourself using structs a lot in your projects. In particular, you will encounter structs when you implement your vector in Project 1.
A structure declaration consists of the struct keyword, a name for this structure, and a specification of its members within curly braces:
struct x_t {
    int i1;
    int i2;
    int i3;
    char c1;
    char c2;
    char c3;
};
This defines a structure called x_t (the _t suffix is a convention and indicates that we're dealing with a user-defined type), each instance of which contains three integers and three characters.
You can create instances of the structure by declaring local, stack-allocated (automatic lifetime) variables, or by allocating sufficient dynamic memory and interpreting it as an instance of the structure in question.
#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free

int main() {
    // declares a new instance of x_t on the stack (automatic lifetime)
    struct x_t stack_allocated;
    stack_allocated.i1 = 1;
    stack_allocated.c1 = 'A';  // set c1, since c1 is the member printed below
    printf("stack-allocated structure at %p: i1 = %d (addr: %p), c1 = %c (addr: %p)\n",
           &stack_allocated, // need to take address, as stack_allocated is not a pointer
           stack_allocated.i1,
           &stack_allocated.i1,
           stack_allocated.c1,
           &stack_allocated.c1);
    // makes a new instance of x_t on the heap (dynamic lifetime)
    struct x_t* heap_allocated = (struct x_t*)malloc(sizeof(struct x_t));
    heap_allocated->i1 = 3;
    heap_allocated->c1 = 'X';
    printf("heap-allocated structure at %p: i1 = %d (addr: %p), c1 = %c (addr: %p)\n",
           heap_allocated, // already an address, so no & needed here
           heap_allocated->i1,
           &heap_allocated->i1,
           heap_allocated->c1,
           &heap_allocated->c1);
    free(heap_allocated);  // release the dynamic memory once we are done with it
    return 0;
}
By the first member rule, a pointer to a struct (like heap_allocated above) always points to the address of its first member.
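A quick way to check the first member rule for x_t is to compare the two addresses. The sketch below repeats the structure definition so that it stands on its own; the function name is invented for this illustration.

```c
#include <stddef.h>

struct x_t {
    int i1;
    int i2;
    int i3;
    char c1;
    char c2;
    char c3;
};

// Hypothetical helper: returns 1 if the address of an x_t instance equals
// the address of its first member i1, and 0 otherwise.
int struct_starts_at_first_member(void) {
    struct x_t s;
    return (void*)&s == (void*)&s.i1;
}
```

The standard offsetof macro from stddef.h expresses the same fact: offsetof(struct x_t, i1) is 0.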
Structures are great, and if you are familiar with object-oriented languages like Java, structures are C's closest equivalent to objects in these languages. You can use structures to build complex data structures. For example, the below structures together implement a linked list of integers:
struct list_node {
    int value;
    struct list_node* next;
};
struct list {
    struct list_node* head;
};
The idea is that each list_node contains a value (int value) and a pointer to the next node (i.e., the memory address of the next node, in next). The list structure just contains a pointer to the first node. What are the sizes of these structures? list is 8 bytes in size, because it only contains a pointer, and list_node is 12 bytes in size, as it contains a 4-byte int and an 8-byte pointer. (For reasons that we'll understand soon, sizeof(struct list_node) actually returns 16 bytes, however.)
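To make the linked list idea concrete, here is a minimal sketch of how these two structures could be used. The helper functions list_push_front, list_sum, and list_free are assumptions made for this illustration and are not part of the lecture code.

```c
#include <stdlib.h>

struct list_node {
    int value;
    struct list_node* next;
};

struct list {
    struct list_node* head;
};

// Hypothetical helper: allocates a node on the heap and makes it the new
// head of the list. Returns 1 on success and 0 if malloc fails.
int list_push_front(struct list* l, int value) {
    struct list_node* node = (struct list_node*)malloc(sizeof(struct list_node));
    if (node == NULL) {
        return 0;
    }
    node->value = value;
    node->next = l->head;  // the old head becomes the second node
    l->head = node;
    return 1;
}

// Hypothetical helper: follows the next pointers and adds up all values.
int list_sum(const struct list* l) {
    int sum = 0;
    for (const struct list_node* n = l->head; n != NULL; n = n->next) {
        sum += n->value;
    }
    return sum;
}

// Hypothetical helper: frees every node and leaves the list empty.
void list_free(struct list* l) {
    while (l->head != NULL) {
        struct list_node* next = l->head->next;
        free(l->head);
        l->head = next;
    }
}
```

Starting from struct list l = { NULL }; and pushing 1, 2, and 3 gives a list whose head holds 3 and whose sum is 6. Note that malloc uses sizeof(struct list_node), so each allocation automatically has the size the compiler actually chose for the structure.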
We will return to structs and their memory layout in the next lecture.
Summary
Today, we focused on how the C language represents collections of objects, and specifically looked at arrays and structs. We learned some handy rules about collections and their memory representation, which are summarized below:
- The first member rule says that the address of a collection is the same as the address of its first member.
- The array rule says that all members of an array are laid out consecutively in memory.
We also figured out pointer arithmetic in more detail and understood how it's related to array subscript syntax.
You now know everything you'll need to complete Lab 1, and nearly everything you'll need for Project 1.