OCaml and S-Expressions

Last time we wrote a compiler for a very small language: the integers. Next time, we’ll grow our language a bit. Today, though, we’re going to step back and learn some OCaml. We’ll also talk about how to represent our language’s syntax.

OCaml basics

Here’s the compiler we wrote last time:

let compile (program: string): string =
  String.concat "\n" 
  [ "global _entry"; 
  "_entry:"; 
  Printf.sprintf "\tmov rax, %s" program;
  "\tret" ]

This program illustrates several OCaml features (see the lecture capture for details):

  • Functions
  • Types
  • Lists

S-Expressions

We want to eventually extend our language with support for many new features, such as conditionals and functions. We will talk in great depth about the semantics of those features: what a conditional expression means. We’ll also need to figure out what the syntax corresponding to those features is. For instance, conditionals look like this in Javascript:

if (x) {
  return y;
} else {
  return z;
}

And like this in Python:

if x:
  return y
else:
  return z

We will talk more about syntax much later in the course. For now, our language is going to support a very simple syntax: S-expressions.

S-expressions (short for “symbolic expressions”) were introduced by John McCarthy in his paper on the LISP programming language. They were originally intended as an internal representation that programmers wouldn’t manipulate directly, but LISP programmers liked writing them and the intended language syntax (M-Expressions) was never implemented. Here are some S-expressions:

(if x y z)

(* (+ 2 6 4) 3)

(define (f x) (if (= x 0) 1 (* x (f (- x 1)))))

S-expressions are a simple, convenient way to represent structured expressions. Because of their simple structure, they are very easy to represent and manipulate inside a compiler or an interpreter.

Look at the expressions above. What components are there? Ignoring everything we might know about what the expressions mean and focusing on how they look, we can see three different kinds of expression:

Symbols
such as f and if
Numbers
such as 4 and 0
Lists
sequences of other S-Expressions wrapper in parentheses, like (+ 2 6 4) or (* (+ 2 6 4) 3)

S-Expressions in OCaml

How should we represent S-expressions in OCaml? One option would be to use strings. So we could have

let e1 = "(if x y z)"
let e2 = "(* (+ 2 6 4) 3)"

This representation is convenient to read in: programmers write programs as sequences of characters in files. It is, however, very difficult to work with. Imagine extending our compiler to support unary operations on numbers (for instance, a function that increments a number). Doing so using this string representation would be very difficult. We’d have to see if the first character of the string is a (, then look at the next characters to find the operation, then find the argument, and so on. Moreover, we’d need to repeat this effort for every operation we want to support.

A better representation for S-expressions will use the structure we talked about above. We can add a new type to OCaml to encode this structure, as follows:

type s_exp =
  Sym of string | Num of int | Lst of s_exp list

Sym, Num, and Lst are all constructors: ways of building s-expressions. So what this type definition is saying is that an s-expression can be a symbol or a number or a list.

The “of” after each constructor indicates that that constructor is associated with some data of a particular type. For instance, the symbol if represented as an s_exp looks like Sym "if"; 4000000000000 looks like Num 4000000000000.

Lst is defined recursively: it takes as its argument a list of other s-expressions. These could be build using the Sym or Num constructors or using more Lst constructors. So for instance, `(if x y z)` looks like

let e1 = Lst [Sym "if"; Sym "x"; Sym "y"; Sym "z"]

and (* (+ 2 6 4) 3) looks like

let e2 = Lst [Sym "*"; Lst [Sym "+"; Num 2; Num 6; Num 4]; Num 3]

We can see that compared to the string representation, which is “flat”, this representation includes the nested structure of s-expressions. We can access this structure using pattern matching:

let which_constructor (e: s_exp) : string =
  match e with
  | Sym x -> "a symbol"
  | Num n -> "a number"
  | Lst l -> "a list"

This looks like a repeated “if”. The real power of pattern matching is destructuring: we can use names defined on the left-hand side of each pattern.

let rec total (e: s_exp) : int =
  match e with
  | Sym _ -> 0
  | Num n -> n
  | Lst l -> List.fold_left ( + ) 0 (List.map total l)

For the Lst constructor, it can sometimes be useful to destructure the list as well:

let rec has_if (e: s_exp) : bool =
  match e with
  | Sym _ -> false
  | Num _ -> false
  | Lst (Sym "if" :: _) -> true
  | Lst l -> List.exists has_if l

S-expressions in our compiler

The course staff has written a small library for S-expressions; HW1, out today, is designed to get you some practice working with this library. The library defines the same s_exp type listed above; it also defines functions to convert strings to s-expressions.

If our OCaml project is set up correctly, we can use the library like this:

open S_exp

let e1 = parse "(* (+ 2 6 4) 3)"

parse takes in a string and returns the corresponding s-expression. Later in the course, we will see how this function works.

The compiler we have developed can be modified to use s-expressions:

open S_exp

let compile (program: s_exp): string =
  match program with
  | Num n ->
    String.concat "\n" 
    [ "global _entry"; 
    "_entry:"; 
    Printf.sprintf "\tmov rax, %d" n;
    "\tret" ]

We only support numbers, so we haven’t included the other cases. Our OCaml editor will complain about this program, saying something about “non-exhaustive pattern matching”; this means that there are some s-expressions that we haven’t matched. We can “fix” this error like this:

open S_exp

exception BadExpression of s_exp

let compile (program: s_exp): string =
  match program with
  | Num n ->
    String.concat "\n" 
    [ "global _entry"; 
    "_entry:"; 
    Printf.sprintf "\tmov rax, %d" n;
    "\tret" ]
  | e -> raise (BadExpression e)

Now we will get a somewhat-useful error if we try to compile something that isn’t a number.