### Lecture 6: sizeof, typedef, and Signed Integers

🎥 Lecture video (Brown ID required)

💻 Lecture code

❓ Post-Lecture Quiz (due 11:59pm, Wednesday, February 15).

#### The `sizeof` and `typedef` keywords

##### Finding object sizes with `sizeof`

The `sizeof` keyword returns the size **in bytes** (!) of its argument, which can either be a type or an
object. Some examples:

- `sizeof(int)` is 4 bytes, because integers consist of four bytes;
- `sizeof(char)` is 1 byte;
- `sizeof(int*)` and `sizeof(char*)` are both 8 bytes, because all pointers on a 64-bit computer are 8 bytes in size;
- for `int i = 5;`, `sizeof(i)` is also 4 bytes, because `i` is an integer;
- for an array `int arr[] = { 1, 2, 3 }`, `sizeof(arr)` is 12 bytes;
- for a pointer to an array, such as `int* p = &arr[0]`, `sizeof(p)` is 8 bytes, independent of the size of the array.

`sizeof` is at the same time great and a huge source of confusion. It is *crucial* to remember that
`sizeof` only returns the byte size of the known compile-time type of its argument. Importantly,
**`sizeof` cannot return the length of an array, nor can it return the size of the memory allocation
behind a pointer**. If you call `sizeof(ptr)`, where `ptr` is a `char*`, you will get
8 bytes (since the size of a pointer is 8 bytes), independent of whether that `char*` points to a much larger
memory allocation (e.g., 100 bytes on the heap).

Why does `sizeof` work this way?

The dirty, but amazing, secret behind `sizeof` is that it results in no compiled code at all. Instead, the C compiler replaces any invocation of `sizeof(type)` or `sizeof(expression)` with the byte size of the argument known at compile time (so `sizeof(int)` just turns into a literal `4` in the program). Hence, `sizeof` cannot possibly determine runtime information like the size of a memory allocation; but it also requires zero processor operations and memory at runtime!

##### Type aliases: `typedef`

If you're sick of writing code like `struct x_t` all the time when you use structs, the `typedef` keyword, which defines a type alias, is for you.

Normally, you always need to put the `struct` keyword in front of your new struct type whenever you use it. But this gets tedious, and the C language provides the helpful keyword `typedef` to save you some work. You can use `typedef` with a struct definition like this:

```
typedef struct {
    int i1;
    int i2;
    int i3;
    char c1;
    char c2;
    char c3;
} x_t;
```

... and henceforth you just write `x_t` to refer to your struct type.

#### Linked List Example

Now let's build a useful data structure! We'll look at a linked list of integers here (`linked-list.c`).
This actually consists of two structures: one to represent the list as a whole (`list_t`) and one to
represent nodes in the list (`node_t`). The `list_t` structure contains a pointer to the
first node of the list, and (in this simple implementation) nothing else. The `node_t` structure
contains the node's value (an `int`) and a pointer to the next `node_t` in memory.

```
typedef struct node {
    int value;
    struct node* next;
} node_t;

typedef struct list {
    node_t* head;
} list_t;
```

Why does the `next` pointer in `node_t` have type `struct node*`, not `node_t*`?

C compilers do not allow recursively-defined type definitions. In particular, you cannot use the type you're defining via `typedef` within its own definition. You can, however, use a `struct` pointer within the structure's definition. Think of it this way: `struct node` is already a known object for the compiler when the pointer occurs in the definition, but `node_t` isn't yet, as its definition only ends with the semicolon.

Note that you can only nest a *pointer* to a struct in its own definition, not an instance of the struct itself. Try to think of why that must be the case, remembering that C types must have fixed memory sizes at compile time!

What's the size of our two structs involved here? `list` is 8 bytes in size, because it only
contains a pointer, and `node` is 12 bytes in size, as it contains a 4-byte `int` and an
8-byte pointer. (For reasons that we'll understand soon, `sizeof(node_t)` actually returns 16
bytes, however.)

#### Signed number representation

Why are we covering this?

Debugging computer systems often requires you to look at memory dumps and understand what the contents of memory mean. Signed numbers have a non-obvious representation (they will appear as very large hexadecimal values), and learning how the computer interprets hexadecimal bytes as negative numbers will help you understand better what is in memory and whether that data is what you expect. Moreover, arithmetic on signed numbers can trigger undefined behavior in non-intuitive ways; this demonstrates an instance of undefined behavior unrelated to memory access!

Recall from last time that our computers use a *little endian* number representation. This makes reading the
values of pointers and integers from memory dumps (like those produced by our `hexdump()` function) more
difficult, but it is how things work.

Using position notation on bytes allows us to represent unsigned numbers very well: the higher the byte's position in
the number, the greater its value. You may have wondered how we can represent negative, signed numbers in this system,
however. The answer is a representation called **two's complement**, which is what the x86-64 architecture (and most
other architectures) use.

Two's complement strikes most people as weird when they first encounter it, but there is an intuition for it. The
best way to think about it is that *adding 1 to -1 should produce 0*. The representation of 1 in a 4-byte integer
is `0x0000'0001` (N.B.: for clarity for humans, I'm using big endian notation here; on the machine, the bytes will
be laid out in memory as `01 00 00 00`). What number, when added to this representation, yields 0?

The answer is `0xffff'ffff`, the largest representable integer in 4 bytes. If we add 1 to it, we flip each
hex digit from `f` to `0` and carry a one, which flips the next digit in turn. At the end, we have:

```
  0x0000'0001
+ 0xffff'ffff
-------------
0x1'0000'0000 == 0x0000'0000 (mod 2^32)
```

The computer simply throws away the carried 1 at the top, since it's outside the 4-byte width of the integer, and we end up with zero, since all arithmetic on fixed-size integers is modulo their size (here, 16^8 = 2^32). You can see this in action in `signed-int.c`.
More generally, in two's complement arithmetic, we always have **-x + x = 0**, so a negative number added to its
positive complement yields zero. The principle that makes this possible is that `-x` corresponds to positive
`x`, *with all bits flipped (written ~x) and 1 added*. In other words, **-x = ~x + 1**.

Signed numbers split their range in half, with half representing negative numbers and the other half representing 0
and positive numbers. For example, a signed `char` can represent numbers -128 to 127 inclusive (the positive
range is one smaller because it also includes 0). The most significant bit acts as a *sign bit*, so all signed
numbers whose top bit is set to 1 are negative. Consequently, the largest positive value of a signed `char`
is `0x7f` (binary 0111'1111), and the largest-magnitude negative value is `0x80` (binary
1000'0000), representing -128. The number -1 corresponds to `0xff` (binary 1111'1111), so that adding 1 to it
yields zero (modulo 2^8).

Two's complement representation has some nice properties for building hardware: for example, the processor can use the same circuits for addition and subtraction of signed and unsigned numbers. On the downside, however, two's complement representation also has a nasty property: arithmetic overflow on signed numbers is undefined behavior.

#### Summary

Today, we explored the `sizeof()` operator in C, which allows programs to determine the size of types (including custom ones, such as structs you define) at compile time.
We also looked at how `typedef` allows C programmers to define type aliases.

We also learned more about how computers represent integers, and in particular how they represent negative numbers in a binary encoding called two's complement. Next time, we'll learn that certain arithmetic operations on numbers can invoke the dreaded undefined behavior, and the confusing effects this can have, before we talk about the memory layout of structures and general rules about collections.