3 Post-Shrubbery Parsing
Generally “parsing” programs means converting a character stream to an AST that can be executed. However, given that Rhombus has a bicameral approach to syntax, we will define parse to mean specifically a second stage of program elaboration.
For the first stage, we use the terms read or reading to refer to the process of converting a character sequence into an abstract shrubbery form. (This terminology is inherited from Lisp and Scheme.)
fun (x): x+x
⇒
(multi (group fun (parens (group x)) (block (group x (op +) x))))
A Rhombus module (or a module in any language implemented using shrubbery notation) is fully read before it is processed further, so shrubbery-level syntax errors are always reported first. This is similar to the way that programs in some languages are fully tokenized before they are parsed further.
For the second stage, we use parse or parsing to refer to the process of converting an abstract shrubbery into a core language AST.
(multi (group fun (parens (group x)) (block (group x (op +) x))))
⇒
(λx . plus x x)
Conceptually, the core Rhombus AST (which is the same as the core Racket AST) is the call-by-value λ-calculus with a handful of extensions, such as conditionals and quoted values.
Alternatively, everything documented as part of #lang rhombus could be considered the core language, ignoring the fact that much of that language is implemented internally as an expansion into other parts and into an even simpler language. This alternative view is less tidy theoretically, because it means that “core Rhombus” is a very big language, but it’s a productive view for our purposes.
From this productive perspective, (multi (group fun (parens (group x)) (block (group x (op +) x)))) is fully parsed, and we could just as well render the representation of that program as
fun (x): x+x
meaning the core function form with a core addition form in its body.
3.1 Patterns and Templates
Before we get to parsing expressions, where we’ll have to worry about infix operators, let’s first consider parsing definitions. For example, let’s suppose that instead of the separate class forms written in interp1.rhm, we’d like to write
datatype Expr
As the example illustrates, our new datatype form will expect an identifier, like Expr, followed by an ‹alts›, which is a sequence of | blocks. Each | block has a single ‹group› containing an identifier (Id, Plus, etc.) followed by a parenthesized sequence of field specifications. We don’t need to do anything with the fields except propagate them to class, so let’s treat each field specification as a generic ‹term› sequence.
In short, a valid use of datatype will match this syntax object pattern:
'datatype $name
 | $variant($field ..., ...)
 | ...'
The pattern has many ellipses:
The ... immediately after $field means that field stands for a repetition of ‹term›s, instead of a single ‹term›.
There’s another ... after the , to mean repetitions of $field ..., which implies that field stands for a comma-separated repetition of repetitions, i.e., a repetition of depth 2.
The ... after | means a |-separated repetition of $variant($field ..., ...), so field actually stands for a repetition of depth 3! Meanwhile, variant is also a repetition, of depth 1. For any match, the count associated with the variant repetition will be the same as the count for the outermost repetition of field.
The repetitions variant and field will not record the , or | separators or (…) that were part of the match (or, more properly, the corresponding abstract structure of a shrubbery form that was read previously). The repetitions only record the matching portions of a syntax object inside those pieces.
If we have a match for that pattern, then we can generate the desired result using the following syntax object template:
'class $name (): nonfinal
 class $variant($field ..., ...):
   extends $name
 ...'
The $variant($field ..., ...) part of the template reconstructs the same fragment that it matched in the pattern. The final ... is in its own group, which means that the preceding group should be replicated, and thus a class form is generated for each variant.
The matching and repetition rules for ellipses are sophisticated in order to facilitate writing and using clean and expressive patterns like that one. In particular, take a careful look at the variable name as it is used near the end of the template. It stands for a single matching ‹term› from the pattern, but it is inside .... That’s allowed: as long as a template ... follows something that tells it how many times to repeat (here, variant and field), a single-match variable such as name is simply copied into each repetition.
Let’s check that the pattern and template work as intended:
> def input:
    'datatype Expr
     | Id(name :: Symbol)
     | Fun(arg :: Symbol, body :: Expr)
     | Call(fun :: Expr, arg :: Expr)'
> // same idea in pattern-matching definition form via `def`:
  def 'datatype $name
       | $variant($field ..., ...)
       | ...':
    input
> // which lets us pull out the pieces to have a look at them
  name
'Expr'
> [variant, ...]
['Id', 'Fun', 'Call']
> [[[field, ...], ...], ...]
[
  [['name', '::', 'Symbol']],
  [['arg', '::', 'Symbol'], ['body', '::', 'Expr']],
  [['fun', '::', 'Expr'], ['arg', '::', 'Expr']]
]
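As a smaller illustration of the same ellipsis rules, the following sketch uses a made-up input, point(a b, c d e), that is not part of the tutorial's interpreter. The names input2, tag, and x are hypothetical. It assumes the same def pattern-matching form used in the interaction above, and it shows a depth-2 repetition alongside a single-match variable that is reused under a template's ....

```rhombus
#lang rhombus

// Hypothetical input: a tag applied to comma-separated groups of terms
def input2: 'point(a b, c d e)'

// tag matches one term; x is a repetition of depth 2
def '$tag($x ..., ...)': input2

// Two levels of ... are needed to use x in an expression;
// the result is a list of lists of single-term syntax objects
[[x, ...], ...]

// tag has depth 0, but it may appear under the final ...
// because x tells the template how many times to repeat;
// the result has two groups: point(a b) and point(c d e)
'$tag($x ...)
 ...'
```

The separators and parentheses of the input are not stored in x; only the terms a, b, c, d, and e are.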
3.2 Shorthands and Syntax Classes
The three-level repetition of field in our datatype example is somewhat tedious. The repetition of multiple fields for a variant is important, and the repetition for multiple variants is also important, but the innermost repetition is just because a field description like name :: Symbol has multiple ‹term›s.
To reduce tedious ellipses, such as in the innermost repetition for field, an escape that is alone in its ‹group› in a pattern can be matched to an entire ‹group›. Similarly, a template allows a whole group to be spliced in place of an escape.
By relying on the whole-group convention, our pattern can be a little simpler:
> def 'datatype $name
       | $variant($field, ...)
       | ...':
    input
> [[field, ...], ...]
[
  ['name :: Symbol'],
  ['arg :: Symbol', 'body :: Expr'],
  ['fun :: Expr', 'arg :: Expr']
]
> 'class $name (): nonfinal
   class $variant($field, ...):
     extends $name
   ...'
'class Expr (): nonfinal
 class Id (name :: Symbol): extends Expr
 class Fun (arg :: Symbol, body :: Expr): extends Expr
 class Call (fun :: Expr, arg :: Expr): extends Expr'
Note how the [[field, ...], ...] interaction shows that each field in the repetition is a ‹group› syntax object that contains multiple space-separated ‹term›s.
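The whole-group convention can also be seen in isolation. This is a minimal sketch on a made-up parenthesized input; the name item is hypothetical. Because $item is alone in its ‹group›, each match is an entire comma-separated ‹group›, and the same escape splices a whole ‹group› back into a template.

```rhombus
#lang rhombus

// $item is alone in its group, so it matches a whole group
def '($item, ...)': '(a b, c d e)'

// A list of two syntax objects, one per group: a b and c d e
[item, ...]

// In a template, each whole group is spliced back in place
'[$item, ...]'
```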
Sometimes these shorthands are not exactly what you want, or sometimes you want to be more strict about what is allowed at some point to match a pattern variable. In those cases, you can annotate the pattern variable with a syntax class, which is written by putting the pattern variable in parentheses, then adding :: followed by the syntax class.
Returning to our example, the $name and $variant escapes in our example will not match multiple ‹term›s, because they are not alone in their respective ‹group›s within the pattern. If we wanted to insist that $field is matched to a single ‹term› despite being alone in its ‹group›, then we could annotate the escape with the Term syntax class by using $(field :: Term). Similarly, we could use $(field :: Group) to be explicit about the possibility of matching a multi-‹term› ‹group›. Along the same lines, we’d like to insist that name and variant match an identifier (as opposed to a number, string, or other ‹term›) by using the Identifier syntax class.
fun my_expand(input):
  // match via pattern
  match input
  | 'datatype $(name :: Identifier)
     | $(variant :: Identifier)($field, ...)
     | ...':
      // result via template
      'class $name (): nonfinal
       class $variant($field, ...):
         extends $name
       ...'
> my_expand('datatype Expr
'class Expr (): nonfinal
class Id (name :: Symbol): extends Expr
class Call (fun :: Expr, arg :: Expr): extends Expr'
> my_expand('datatype Expr
| 1()
| 2()')
match: expected identifier
You can define your own syntax classes to support new syntactic categories that might be important for your own constructs, but the built-in set of syntax classes is sufficient for our purposes in the tutorial.
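To see how the built-in Identifier, Term, and Group syntax classes narrow or widen a match, here is a minimal sketch. The function describe and its result strings are hypothetical and not part of the tutorial's code. It relies on match trying its clauses in order.

```rhombus
#lang rhombus

// Hypothetical helper: classify a syntax object by the first
// syntax class that accepts it
fun describe(stx):
  match stx
  | '$(id :: Identifier)': "identifier"
  | '$(t :: Term)': "some other single term"
  | '$(g :: Group)': "a multi-term group"

describe('x')      // matches the Identifier clause
describe('7')      // not an identifier, but still a single term
describe('x + 7')  // three terms, so only Group matches
```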
3.3 Parsing-Time Expressions
We can match a syntax object in a function and produce a replacement like my_expand in the previous example, but that’s a run-time operation. It’s too late to affect the way the surrounding program is parsed and its definitions are discovered. To implement a definition macro, we need to perform the same matching and substitution at parsing time instead of run time.
To write parsing-time functions and expressions, we must import the rhombus/meta module with import rhombus/meta open, or we can use the language rhombus/and_meta after #lang instead of rhombus. The rhombus/meta module provides meta, which shifts the evaluation time of its body from run time to parsing time.
println("running")
meta:
  println("parsing")
If you run this module in DrRacket, then “parsing” is likely to print twice, because DrRacket’s strategy for debugging programs involves compiling them twice. If you turn off debugging via the Choose Language... item in the Language menu, then “parsing” prints only once in DrRacket.
This example module prints “running” when it is run, but it has to be parsed before it can be run, and so “parsing” prints first.
If we put my_expand from the previous example into a meta block, then it can be called at parse time, but the function’s behavior is only half of the story. There still needs to be a connection between my_expand and uses of datatype in the run-time part of the module. We need to specifically use a macro-defining form to hook into the parsing process and create the connection.
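A small sketch of a parsing-time function may make the gap concrete. The helper shout is hypothetical, and the sketch assumes that definitions are allowed inside a meta block, as the paragraph above suggests for my_expand. The function can be called from parsing-time code, but nothing yet connects it to any form used in the run-time part of the module.

```rhombus
#lang rhombus/and_meta

meta:
  // Hypothetical helper that exists only at parsing time
  fun shout(s): s ++ "!"
  // Runs while the module is parsed
  println(shout("parsing"))

// Runs when the module is run; shout is not bound here
println("running")
```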
3.4 Definition Macros
The defn.macro form combines meta, match for a syntax object pattern, and a hook into the parsing process for run-time definitions. That combination creates a macro that applies to definition contexts.
In the following example, defn.macro creates a parsing-time function and registers a connection to the name datatype, because datatype is the head of the pattern after defn.macro. When the parser later encounters datatype in a definition position, it matches the datatype use to the pattern, and it evaluates the body of the defn.macro form to get a replacement set of definitions.
defn.macro 'datatype $(name :: Identifier)
            | $(variant :: Identifier)($field, ...)
            | ...':
  // result via template
  'class $name (): nonfinal
   class $variant($field, ...):
     extends $name
   ...'
datatype Expr
match e
| Id(name):
env[name]
| Fun(arg, body):
| Call(fun, arg):
interp(fun, env)(interp(arg, env))
In the defn.macro form that defines datatype, the template '…' expression in the body of the definition is a parsing-time expression. The code inside the '…' represents a run-time definition, since it will be spliced in place of a use of datatype in a run-time definition position.
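The same mechanism works for much smaller definition forms. The following sketch defines a hypothetical def_both form, which is not part of the tutorial's code, that binds two names to the value of one right-hand side. It assumes the Identifier and Group syntax classes described earlier. The template is evaluated at parsing time, while the two def forms it produces are run-time definitions.

```rhombus
#lang rhombus/and_meta

// Hypothetical definition macro: def_both is the head of the pattern,
// so the parser invokes this macro for uses of def_both
defn.macro 'def_both $(a :: Identifier) $(b :: Identifier): $(rhs :: Group)':
  // The template builds two run-time definitions; rhs is copied
  // into both, so it is evaluated once for each name
  'def $a: $rhs
   def $b: $rhs'

def_both x y: 1 + 2

x + y  // both names are bound to 3
```

Because a and b are annotated with Identifier, a use such as def_both 1 2: 3 is rejected by def_both itself rather than by def.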
3.5 Exercise
Start with the program interp_defn_macro.rhm. If you replace Id in datatype Expr with 7, then you get a nice error message from datatype. But if you replace name :: Symbol with 7 :: Symbol, then you get an error with poor reporting, because it comes from class.
Refine the field part of the datatype macro’s pattern to insist that a field is an identifier, followed by ::, and then another identifier. That way, any other shape will trigger an error from datatype instead of class.