2 Everything (We Will Say) About Parsing
23 + 5 - 6 |
Parsing is a large, complex problem that is far from solved due to the difficulties of ambiguity. For instance, an alternate parse tree for the above input expression might put subtraction at the top and addition below it. We might also want to consider whether this addition operation is commutative and hence whether the order of arguments can be switched. Everything only gets much, much worse when we get to full-fledged programming languages (to say nothing of natural languages).
2.1 A Lightweight, Built-In First Half of a Parser
(+ 23 (- 5 6)) |
2.2 A Convenient Shortcut
As you know you need to test your programs extensively, which is hard
to do when you must manually type terms in over and over again.
Fortunately, as you might expect, the parenthetical syntax is
integrated deeply into Racket through the mechanism of
quotation. That is, ’<expr>—
2.3 Types for Parsing
> (read) |
- s-expression |
[type in (+ 23 (- 5 6))] |
'(+ 23 (- 5 6)) |
In the typed language, an s-expression is treated distinctly from the
other types, such as numbers and lists. Underneath, an s-expression
is a large recursive datatype that consists of all the base printable
values—
> '+ |
- symbol |
'+ |
> (define l '(+ 1 2)) |
> l |
- s-expression |
'(+ 1 2) |
> (first l) |
. typecheck failed: (listof '_a) vs s-expression in: |
first |
(quote (+ 1 2)) |
l |
first |
> (define f (first (s-exp->list l))) |
> f |
- s-expression |
'+ |
> (symbol->string f) |
. typecheck failed: symbol vs s-expression in: |
symbol->string |
f |
symbol->string |
f |
first |
(first (s-exp->list l)) |
s-exp->list |
> (symbol->string (s-exp->symbol f)) |
- string |
"+" |
Fortunately we will use s-expressions only in our parser, and our goal is to get away from parsing as quickly as possible! Indeed, if anything this should be inducement to get away even quicker.
2.4 Completing the Parser
In principle, we can think of read as a complete parser. However, its output is generic: it represents the token structure without offering any comment on its intent. We would instead prefer to have a representation that tells us something about the intended meaning of the terms in our language, just as we wrote at the very beginning: “representing addition”, “represents a number”, and so on.
(define-type ArithC [numC (n : number)] [plusC (l : ArithC) (r : ArithC)] [multC (l : ArithC) (r : ArithC)])
(define (parse [s : s-expression]) : ArithC (cond [(s-exp-number? s) (numC (s-exp->number s))] [(s-exp-list? s) (let ([sl (s-exp->list s)]) (case (s-exp->symbol (first sl)) [(+) (plusC (parse (second sl)) (parse (third sl)))] [(*) (multC (parse (second sl)) (parse (third sl)))] [else (error 'parse "invalid list input")]))] [else (error 'parse "invalid input")]))
> (parse '(+ (* 1 2) (+ 2 3))) |
- ArithC |
(plusC |
(multC (numC 1) (numC 2)) |
(plusC (numC 2) (numC 3))) |
Congratulations! You have just completed your first representation of a program. From now on we can focus entirely on programs represented as recursive trees, ignoring the vagaries of surface syntax and how to get them into the tree form. We’re finally ready to start studying programming languages!
What happens if you forget to quote the argument to the parser? Why?
2.5 Coda
Racket’s syntax, which it inherits from Scheme and Lisp, is controversial. Observe, however, something deeply valuable that we get from it. While parsing traditional languages can be very complex, parsing this syntax is virtually trivial. Given a sequence of tokens corresponding to the input, it is absolutely straightforward to turn parenthesized sequences into s-expressions; it is equally straightforward (as we see above) to turn s-expressions into proper syntax trees. I like to call such two-level languages bicameral, in loose analogy to government legislative houses: the lower-level does rudimentary well-formedness checking, while the upper-level does deeper validity checking. (We haven’t done any of the latter yet, but we will [REF].)
The virtues of this syntax are thus manifold. The amount of code it requires is small, and can easily be embedded in many contexts. By integrating the syntax into the language, it becomes easy for programs to manipulate representations of programs (as we will see more of in [REF]). It’s therefore no surprise that even though many Lisp-based syntaxes have had wildly different semantics, they all share this syntactic legacy.
Of course, we could just use XML instead. That would be much better. Or JSON. Because that wouldn’t be anything like an s-expression at all.