Lecture 5: Arrays and Structures #
Arrays #
Why are we covering this?
As programmers, we often want to deal with collections of data, such as a million integers to sort. But in C and C++, such collections ultimately turn into bytes in memory that the compiler lays out in specific ways and that we as programmers have to interpret correctly. Once you understand arrays in C and how they relate to pointers, you will understand how computers represent sequences of data. This becomes important in the vector part of Project 1, in Project 2, and in the OS part of the course.
We've danced around the concept of arrays in C for a while now. C's arrays are simple and lay out a set of equal-size elements consecutively in memory. In many ways, arrays and strings in C are very similar. In particular, you can always think of an array as a pointer to the start of its memory, above which point the elements of the array are laid out in sequence.
C allows you to declare, define, and initialize an array in one go, using curly bracket notation: `int a[] = { 1, 2, 3 };` declares and defines an array of three integers and immediately initializes its contents to the numbers 1 to 3.
How large is this array in memory? It contains three integers of four bytes each, so the array will use 12 bytes in memory. Note that the length of the array (3 elements) is not stored with it in memory! In general, it's up to you as the programmer to remember what an array's length is.
All elements of an array in C must have the same type and size in memory, and the length of an array is fixed. In other words, you cannot have an array of `{ int, char, int, long }`, nor can you append elements to an array. This is similar to Java arrays, but a significant difference compared to Python or OCaml lists.
You can also declare an array in C without initializing it. To do so, you put the array's desired length into square brackets next to the variable name: `int a[5]` is an array of five integers. But what is the size of such an array in memory? You can calculate it manually: the array must be backed by sufficient memory to hold five integers (20 bytes, since each integer is 4 bytes long). It turns out C also has a handy keyword to help you get the byte sizes of its types; we'll talk about this next time.
Arrays are just pointers! #
Recall our array `int a[5]` of five integers. This declaration will tell the compiler to set aside sufficient memory to hold five integers (20 bytes, since each integer is 4 bytes long).
What are the contents of the memory for `a`? Again, the memory is uninitialized, so it could be anything! To actually fill in your array, you use subscript notation with square brackets on the left-hand side of an assignment: `a[0] = 1;`, `a[1] = 2;`, etc.
Now here's a curious, but super important detail. `a[1]` means that we're assigning into the second element of our array. Where is that element in memory, i.e., what does `a[1]` really mean in terms of the memory boxes that we write to?
Let's figure this out from first principles. `a[0]` is the first element, which starts at the first address in the array. That address is the same as a pointer to the first byte of `a` in memory. In fact, we can rewrite `a[0]` as `*((int*)a + 0)`! Let's tease this apart: `(int*)a` means that we want to treat `a` as a pointer to an integer in memory, and the `+ 0` part adds zero to it using pointer arithmetic. Then we dereference the resulting pointer, which gives us the value at `a[0]`.
By this reasoning, what does `a[1]` translate to? If you guessed `*((int*)a + 1)`, that's correct! But here's a snag: remember that integers are 4 bytes long. Pointer arithmetic always moves the address stored in the pointer in units of the size of the type it points to. In this case, this means that `(int*)a + 1` adds 4 bytes to the address (so it refers to the same address as `(char*)a + 4`). Putting it all together, `a[1] = 2;` turns out to really mean "write the integer 2 into the four memory boxes starting four boxes down from the address in `a`".
Here's an important takeaway: array subscript notation and pointer arithmetic are one and the same thing! In fact, the C compiler internally just turns your square-bracket subscript notation into pointer arithmetic.
More generally, the C language definition has a rule for collections of data: the first member rule, which says that the address of a collection is the same as the address of its first member. This rule applies to arrays, but we'll shortly see that it also applies to other structures.
There is a second rule for arrays, the array rule, which says that all elements ("members") of the array are laid out consecutively in memory.
Finally, you'll now see how strings and arrays are super similar: you can think of a string as an array of `char` elements, each one byte in length, with a bonus terminator element at the (n+1)th position in the array (whose memory allocation must be n+1 bytes long; easy to forget!).
Structures (`struct`): making new types #
Why are we covering this?
C would be rather restricted if you could only use its primitive types (`int`, `char`, and so on). To let you define new data types of your own, the language supplies the idea of a structure (`struct`). Structures occur all the time in real-world C programs, and they're very important both to understanding how we can lay out data in specific patterns in memory in order to communicate with hardware (one big purpose of systems programming languages when used to write operating systems!), and to understanding how C++ classes work.
The C language uses the `struct` keyword to define new data structures: objects laid out in memory in a specific format. This is a very powerful part of the language, and you'll find yourself using structs a lot in your projects. In particular, you will encounter structs when you implement your vector in Project 1.
A structure declaration consists of the `struct` keyword, a name for this structure, and a specification of its members within curly braces:
struct x_t {
    int i1;
    int i2;
    int i3;
    char c1;
    char c2;
    char c3;
};
This defines a structure called `x_t` (the `_t` suffix is a convention and indicates that we're dealing with a user-defined type), each instance of which contains three integers and three characters.
⚠️ We did not cover the following in Lecture 5 this year. We will talk about it in Lecture 6, but we're providing the material here in case you want to read ahead.
There are several ways to obtain memory for an instance of your struct in C: using a global, static-lifetime struct, stack-allocating a local struct with automatic lifetime, or heap-allocating dynamic-lifetime memory to hold the struct. The example below, based on `mexplore-struct.c`, shows how stack- and heap-allocated structs work.
#include <stdio.h>
#include <stdlib.h>

// assumes struct x_t is defined as shown above

int main() {
    // declares a new instance of x_t on the stack (automatic lifetime)
    struct x_t stack_allocated;
    stack_allocated.i1 = 1;
    stack_allocated.c1 = 'A';  // set c1, since that is the member printed below
    printf("stack-allocated structure at %p: i1 = %d (addr: %p), c1 = %c (addr: %p)\n",
           (void*)&stack_allocated,  // need to take address, as stack_allocated is not a pointer
           stack_allocated.i1,
           (void*)&stack_allocated.i1,
           stack_allocated.c1,
           (void*)&stack_allocated.c1);

    // makes a new instance of x_t on the heap (dynamic lifetime)
    struct x_t* heap_allocated = (struct x_t*)malloc(sizeof(struct x_t));
    heap_allocated->i1 = 3;
    heap_allocated->c1 = 'X';
    printf("heap-allocated structure at %p: i1 = %d (addr: %p), c1 = %c (addr: %p)\n",
           (void*)heap_allocated,  // already an address, so no & needed here
           heap_allocated->i1,
           (void*)&heap_allocated->i1,
           heap_allocated->c1,
           (void*)&heap_allocated->c1);

    free(heap_allocated);  // release the heap memory once we are done with it
    return 0;
}
Observe that we access struct members in two different ways: when the struct is a value (e.g., a stack-allocated struct), we access member `i1` as `stack_allocated.i1`, using a dot to separate variable and member name. (This is the same syntax that you'd use to access members of Java objects.) But if we're dealing with a pointer to a struct (such as the pointer returned from `malloc()` for our heap-allocated struct), we use `->` to separate variable and member name. The arrow syntax (`->`) implicitly dereferences the pointer and then accesses the member. In other words, `heap_allocated->i1` is identical to `(*heap_allocated).i1`.
Different ways to initialize a struct
Like arrays, structures also have an initializer list syntax that makes it easy for you to set the values of their members when creating a struct. For example, you could write `struct x_t my_x = { 1, 2, 3, 'A', 'B', 'C' };`, or even only partially initialize the struct via `struct x_t my_x2 = { .i2 = 42, .c3 = 'X' };`. Members that you leave out of an initializer list are set to zero. If you declare a struct without any initializer, though, the values of its members depend on where the memory comes from (static segment data is initialized to zeros; other segments are not), so it's generally best to treat such memory as uninitialized and set all members.
Sick of writing `struct x_t` all the time?
⚠️ We did not cover this in Lecture 5; we will talk about `typedef` in the next lecture.
Normally, you always need to put the `struct` keyword in front of your new struct type whenever you use it. But this gets tedious, and the C language provides the helpful keyword `typedef` to save you some work. You can use `typedef` with a struct definition like this:
typedef struct {
    int i1;
    int i2;
    int i3;
    char c1;
    char c2;
    char c3;
} x_t;
... and henceforth you just write `x_t` to refer to your struct type.
A pointer to a struct (like `heap_allocated` above) always points to the address of its first member.
Summary #
Today, we started exploring how the C language represents collections of objects, and specifically looked at arrays and structs. We also figured out pointer arithmetic in more detail and understood how it's related to array subscript syntax.