Lecture 6: Structures, typedef, and Signed Integers #
Structures #
There are several ways to obtain memory for an instance of your struct
in C: using a global, static-lifetime struct, stack-allocating a local
struct with automatic lifetime, or heap-allocating dynamic-lifetime
memory to hold the struct. The example below, based on
mexplore-struct.c, shows how stack- and heap-allocated structs work.
#include <stdio.h>
#include <stdlib.h>

// (struct x_t, with members i1, i2, i3, c1, c2, and c3, is defined elsewhere)
int main() {
    // declares a new instance of x_t on the stack (automatic lifetime)
    struct x_t stack_allocated;
    stack_allocated.i1 = 1;
    stack_allocated.c1 = 'A';
    printf("stack-allocated structure at %p: i1 = %d (addr: %p), c1 = %c (addr: %p)\n",
           &stack_allocated, // need to take address, as stack_allocated is not a pointer
           stack_allocated.i1,
           &stack_allocated.i1,
           stack_allocated.c1,
           &stack_allocated.c1);

    // makes a new instance of x_t on the heap (dynamic lifetime)
    struct x_t* heap_allocated = (struct x_t*)malloc(sizeof(struct x_t));
    heap_allocated->i1 = 3;
    heap_allocated->c1 = 'X';
    printf("heap-allocated structure at %p: i1 = %d (addr: %p), c1 = %c (addr: %p)\n",
           heap_allocated, // already an address, so no & needed here
           heap_allocated->i1,
           &heap_allocated->i1,
           heap_allocated->c1,
           &heap_allocated->c1);

    free(heap_allocated); // release the heap memory once we are done with it
}
Observe that we access struct members in two different ways: when the
struct is a value (e.g., a stack-allocated struct), we access member i1
as stack_allocated.i1, using a dot to separate variable and member name.
(This is the same syntax that you'd use to access members of Java
objects.) But if we're dealing with a pointer to a struct (such as the
pointer returned from malloc() for our heap-allocated struct), we use ->
to separate variable and member name. The arrow syntax (->) implicitly
dereferences the pointer and then accesses the member. In other words,
heap_allocated->i1 is identical to (*heap_allocated).i1.
Different ways to initialize a struct
Like arrays, structures also have an initializer list syntax that makes it easy for you to set the values of their members when creating a struct. For example, you could write
struct x_t my_x = { 1, 2, 3, 'A', 'B', 'C' };
or even only partially initialize the struct via
struct x_t my_x2 = { .i2 = 42, .c3 = 'X' };
Members that you leave out of an initializer list are set to zero. If you declare a struct without any initializer, however, the values of its members depend on where the memory comes from (static segment data is initialized to zeros; other segments are not), so it's best to treat such memory as uninitialized and set all members.
A pointer to a struct (like heap_allocated above) always holds the
address of the struct's first member, because C never places padding
before the first member of a struct.
Recap: Finding object sizes with sizeof #
The sizeof keyword returns the size in bytes (!) of its argument,
which can either be a type or an object. Some examples:
- sizeof(int) is 4 bytes, because integers consist of four bytes;
- sizeof(char) is 1 byte;
- sizeof(int*) and sizeof(char*) are both 8 bytes, because all pointers on a 64-bit computer are 8 bytes in size;
- for int i = 5;, sizeof(i) is also 4 bytes, because i is an integer;
- for an array int arr[] = { 1, 2, 3 }, sizeof(arr) is 12 bytes;
- for a pointer to an array, such as int* p = &arr[0], sizeof(p) is 8 bytes, independent of the size of the array;
- for a struct ints_and_chars defined as above, sizeof(struct ints_and_chars) is the size of the struct in memory, which is greater than or equal to (for reasons we will see in future lectures, not necessarily always equal) the sum of the sizes of the members.
sizeof is at the same time great and a huge source of confusion. It is
crucial to remember that sizeof only returns the byte size of the
known compile-time type of its argument. Importantly, sizeof cannot
return the length of an array that you only have a pointer to, nor can
it return the size of the memory allocation behind a pointer. If you
evaluate sizeof(ptr), where ptr is a char*, you will get 8 bytes (since
the size of a pointer is 8 bytes), independent of whether that char*
points to a much larger memory allocation (e.g., 100 bytes on the heap).
Why does sizeof work this way?
The dirty, but amazing, secret behind sizeof is that it results in no compiled code. Instead, the C compiler replaces any invocation of sizeof(type) or sizeof(expression) with the byte size of the argument known at compile time (so sizeof(int) just turns into a literal 4 in the program). Hence, sizeof cannot possibly determine runtime information like the size of a memory allocation; but it also requires zero processor operations and memory at runtime!
Type aliases: typedef #
If you're sick of writing code like struct ints_and_chars all the time
when you use structs, the typedef keyword, which defines a type alias,
is for you.
Normally, you always need to put the struct keyword in front of your
new struct type whenever you use it. But this gets tedious, and the C
language provides the helpful keyword typedef to save you some work.
You can use typedef with a struct definition like this:
typedef struct {
    int i1;
    int i2;
    int i3;
    char c1;
    char c2;
    char c3;
} x_t;
... and henceforth you just write x_t to refer to your struct type.
Linked List Example #
Now let's build a useful data structure! We'll look at a linked list of
integers here (linked-list.c). This actually consists of two
structures: one to represent the list as a whole (list_t) and one to
represent nodes in the list (node_t). The list_t structure contains
a pointer to the first node of the list, and (in this simple
implementation) nothing else. The node_t structure contains the node's
value (an int) and a pointer to the next node_t in the list.
typedef struct node {
    int value;
    struct node* next;
} node_t;

typedef struct list {
    node_t* head;
} list_t;
Why does the next pointer in node_t have type struct node*, not node_t*?
C compilers do not allow recursively-defined type definitions. In particular, you cannot use the type you're defining via typedef within its own definition. You can, however, use a struct pointer within the structure's definition. Think of it this way: struct node is already a known type for the compiler when the pointer occurs in the definition, but node_t isn't yet, as its definition only ends with the semicolon.
Note that you can only nest a pointer to a struct in its own definition, not an instance of the struct itself. Try to think of why that must be the case, remembering that C types must have fixed memory sizes at compile time!
What's the size of our two structs involved here? list is 8 bytes in
size, because it only contains a pointer, and you might expect node to
be 12 bytes in size, as it contains a 4-byte int and an 8-byte pointer.
(For reasons that we'll understand soon, sizeof(struct node) actually
returns 16 bytes, however.)
Signed number representation #
Why are we covering this?
Debugging computer systems often requires you to look at memory dumps and understand what the contents of memory mean. Signed numbers have a non-obvious representation (they will appear as very large hexadecimal values), and learning how the computer interprets hexadecimal bytes as negative numbers will help you understand better what is in memory and whether that data is what you expect. Moreover, arithmetic on signed numbers can trigger undefined behavior in non-intuitive ways; this demonstrates an instance of undefined behavior unrelated to memory access!
Recall from last time that our computers use a little endian number
representation. This makes reading the values of pointers and integers
from memory dumps (like those produced by our hexdump() function) more
difficult, but it is how things work.
Using positional notation on bytes allows us to represent unsigned numbers very well: the higher the byte's position in the number, the greater its value. You may have wondered how we can represent negative, signed numbers in this system, however. The answer is a representation called two's complement, which is what the x86-64 architecture (and most other architectures) uses.
Two's complement strikes most people as weird when they first encounter
it, but there is an intuition for it. The best way to think about it is
that adding 1 to -1 should produce 0. The representation of 1 in a
4-byte integer is 0x0000'0001 (N.B.: for clarity for humans, I'm using
big endian notation here; on the machine, this will be laid out as
0x0100'0000). What number, when added to this representation, yields 0?
The answer is 0xffff'ffff, the largest unsigned integer representable
in 4 bytes. If we add 1 to it, we flip each hexadecimal digit from f to
0 and carry a one, which flips the next digit in turn. At the end, we
have:
    0x0000'0001
  + 0xffff'ffff
  --------------
  0x1'0000'0000 == 0x0000'0000 (mod 2^32)
The computer simply throws away the carried 1 at the top, since it's
outside the 4-byte width of the integer, and we end up with zero, since
all arithmetic on fixed-size integers is modulo their size (here,
16^8 = 2^32). You can see this in action in signed-int.c.
More generally, in two's complement arithmetic, we always have -x + x
= 0, so a negative number added to its positive complement yields
zero. The principle that makes this possible is that -x corresponds to
positive x, with all bits flipped (written ~x) and 1 added. In
other words, -x = ~x + 1.
Signed numbers split their range in half, with half representing
negative numbers and the other half representing 0 and positive numbers.
For example, a signed char can represent numbers -128 to 127 inclusive
(the positive range is one smaller because the non-negative half also
includes 0). The most significant bit acts as a sign bit, so all signed
numbers whose top bit is set to 1 are negative. Consequently, the
largest positive value of a signed char is 0x7f (binary 0111'1111), and
the largest-magnitude negative value is 0x80 (binary 1000'0000),
representing -128. The number -1 corresponds to 0xff (binary
1111'1111), so that adding 1 to it yields zero (modulo 2^8).
Two's complement representation has some nice properties for building hardware: for example, the processor can use the same circuits for addition and subtraction of signed and unsigned numbers. On the downside, however, signed arithmetic in C has a nasty property: arithmetic overflow on signed numbers is undefined behavior.
Summary #
Today, we explored how to define custom data structures in C, and how
they are represented in memory. We also looked at how typedef allows C
programmers to define type aliases.
We also learned more about how computers represent integers, and in particular about how they represent negative numbers in a binary encoding called two's complement. Next time, we'll learn that certain arithmetic operations on numbers can invoke the dreaded undefined behavior, and the confusing effects this can have, before we talk about the memory layout of structures and general rules about collections.