
3 Post-Shrubbery Parsing

Generally, “parsing” a program means converting a character stream to an AST that can be executed. However, because Rhombus has a bicameral approach to syntax, we will define parse to mean specifically the second stage of program elaboration.

  • For the first stage, we use the terms read or reading to refer to the process of converting a character sequence into an abstract shrubbery form. (This terminology is inherited from Lisp and Scheme.)

     

    For example, the character sequence

    fun (x): x+x

    is read as the abstract shrubbery form

    (multi
     (group fun
            (parens (group x))
            (block (group x (op +) x))))

    A Rhombus module—or any language implemented using shrubbery notation—is fully read before it is processed further, so shrubbery-level syntax errors are always reported first. This is similar to the way that programs in some languages are fully tokenized before they are parsed further.

  • For the second stage, we use parse or parsing to refer to the process of converting an abstract shrubbery into a core language AST.

     

    For example, the abstract shrubbery form

    (multi
     (group fun
            (parens (group x))
            (block (group x (op +) x))))

    is parsed into the core language AST

    (λx . plus x x)

    Conceptually, the core Rhombus AST is the call-by-value λ-calculus with a handful of extensions, such as conditionals and quoted values. It is the same as the core Racket AST.

    Alternatively, everything documented as part of #lang rhombus could be considered the core language, ignoring the fact that much of that language is implemented internally as an expansion into other parts and into an even simpler language. This alternative view is less tidy theoretically, because it means that “core Rhombus” is a very big language, but it’s a productive view for our purposes.

    From this productive perspective,

     

    (multi
     (group fun
            (parens (group x))
            (block (group x (op +) x))))

    is fully parsed, and we could just as well render the representation of that program as

     

    'fun (x): x + x'

    meaning the core function form with a core addition form in its body.

3.1 Patterns and Templates

Before we get to parsing expressions, where we’ll have to worry about infix operators, let’s first consider parsing definitions. For example, let’s suppose that instead of the separate class forms written in interp1.rhm, we’d like to write

datatype Expr
| Id(name :: Symbol)
| Plus(left :: Expr, right :: Expr)
| Equals(left :: Expr, right :: Expr)
| Let(name :: Symbol, rhs :: Expr, body :: Expr)
| Fun(arg :: Symbol, body :: Expr)
| Call(fun :: Expr, arg :: Expr)
| Literal(val :: Any)

As the example illustrates, our new datatype form will expect an identifier, like Expr, followed by an alts, which is a sequence of | blocks. Each | block has a single group containing an identifier (Id, Plus, etc.) followed by a parenthesized sequence of field specifications. We don’t need to do anything with the fields except propagate them to class, so let’s treat each field specification as a generic term sequence.

In short, a valid use of datatype will match this syntax object pattern:

'datatype $name
 | $variant($field ..., ...)
 | ...'

The pattern has many ellipses:

  • The ... immediately after $field means that field stands for a repetition of terms, instead of a single term.

  • There’s another ... after the , to mean repetitions of $field ..., which implies that field stands for a comma-separated repetition of repetitions, i.e., a repetition of depth 2.

  • The ... after | means a |-separated repetition of $variant($field ..., ...), so field actually stands for a repetition of depth 3! Meanwhile, variant is also a repetition, of depth 1. For any match, the count associated with the variant repetition will be the same as the count of the outermost repetition of field.

The repetitions variant and field will not record the , or | separators or () that were part of the match (or, more properly, the corresponding abstract structure of a shrubbery form that was read previously). The repetitions only record the matching portions of a syntax object inside those pieces.

If we have a match for that pattern, then we can generate the desired result using the following syntax object template:

'class $name():
   nonfinal
 class $variant($field ..., ...):
   extends $name
 ...'

The $variant($field ..., ...) part of the template reconstructs the same fragment that it matched in the pattern. The final ... is in its own group, which means that the preceding group should be replicated, and thus a class form is generated for each variant.

The matching and repetition rules for ellipses are sophisticated in order to facilitate writing and using clean and expressive patterns like that one. In particular, take a careful look at the variable name as it is used near the end of the template. It stands for a single matching term from the pattern, but it is inside .... That’s allowed: as long as a template ... follows something that tells it how many times to repeat—which would be variant or field in this case—then a non-repetition component is copied as many times as needed.
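As a smaller illustration of that copying rule, the following sketch matches a call-shaped syntax object and rebuilds it with a single-term variable repeated alongside each element of a repetition. The names tag, arg, and rebuilt and the input f(1, 2, 3) are made up for this example; they are not part of the datatype macro.

```rhombus
#lang rhombus

// `tag` matches one term (depth 0); `arg` is a repetition (depth 1)
def '$tag($arg, ...)' = 'f(1, 2, 3)'

// The `...` takes its count from `arg`, and `tag` is copied each time,
// giving the terms f(1), f(2), f(3) inside one bracketed form
// (the spacing of the printed result may differ from the source)
def rebuilt = '[$tag($arg), ...]'
rebuilt
```

Without arg before the ..., the template would have nothing to tell it how many copies of tag to make.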

Let’s check that the pattern and template work as intended:

def input:
  'datatype Expr
   | Id(name :: Symbol)
   | Fun(arg :: Symbol, body :: Expr)
   | Call(fun :: Expr, arg :: Expr)'

> match input
  | 'datatype $name
     | $variant($field ..., ...)
     | ...':
      "matches the definition of " ++ to_string(name)
"matches the definition of Expr"

> // same idea in pattern-matching definition form via `def`:
  def 'datatype $name
       | $variant($field ..., ...)
       | ...':
    input

> // which lets us pull out the pieces to have a look at them
  name
'Expr'

> [variant, ...]
['Id', 'Fun', 'Call']

> [[[field, ...], ...], ...]
[
  [['name', '::', 'Symbol']],
  [['arg', '::', 'Symbol'], ['body', '::', 'Expr']],
  [['fun', '::', 'Expr'], ['arg', '::', 'Expr']]
]

> // and even use them in a syntax object to build the classes
  'class $name():
     nonfinal
   class $variant($field ..., ...):
     extends $name
   ...'
'class Expr (): nonfinal
 class Id (name :: Symbol): extends Expr
 class Fun (arg :: Symbol, body :: Expr): extends Expr
 class Call (fun :: Expr, arg :: Expr): extends Expr'
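To see the depth bookkeeping on a smaller input, here is a sketch with a made-up pattern in which x is nested under two ellipses. The input and the names x and nested are assumptions for illustration only.

```rhombus
#lang rhombus

// `x` sits under two ellipses, so it is a repetition of depth 2
def '[($x, ...), ...]' = '[(1, 2), (3, 4, 5)]'

// One `...` per level is needed to use `x`. The brackets written here
// build new lists; the parentheses and commas of the input were not
// recorded in the repetition.
def nested = [[x, ...], ...]

nested.length()     // 2: one entry per parenthesized form
nested[1].length()  // 3: one entry per element of the second form
```

The outer count comes from the number of parenthesized forms and the inner counts from their contents, which mirrors how variant and field line up in the datatype pattern.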

3.2 Shorthands and Syntax Classes

The three-level repetition of field in our datatype example is somewhat tedious. The repetition of multiple fields for a variant is important, and the repetition for multiple variants is also important, but the innermost repetition exists only because a field description like name :: Symbol has multiple terms.

To reduce tedious ellipses, such as in the innermost repetition for field, an escape that is alone in its group in a pattern can be matched to an entire group. Similarly, a template allows a whole group to be spliced in place of an escape.

By relying on the whole-group convention, our pattern can be a little simpler:

> def 'datatype $name
       | $variant($field, ...) // dropped one layer of ...
       | ...':
    input

> [[field, ...], ...]
[
  ['name :: Symbol'],
  ['arg :: Symbol', 'body :: Expr'],
  ['fun :: Expr', 'arg :: Expr']
]

> 'class $name():
     nonfinal
   class $variant($field, ...):
     extends $name
   ...'
'class Expr (): nonfinal
 class Id (name :: Symbol): extends Expr
 class Fun (arg :: Symbol, body :: Expr): extends Expr
 class Call (fun :: Expr, arg :: Expr): extends Expr'

Note how the [[field, ...], ...] interaction shows that each field in the repetition is a group syntax object that contains multiple space-separated terms.

Sometimes these shorthands are not exactly what you want, or you want to be stricter about what a pattern variable is allowed to match. In those cases, you can annotate the pattern variable with a syntax class, which is written by putting the pattern variable in parentheses after $ and adding :: followed by the syntax class, as in $(name :: Identifier).

Returning to our example, the $name and $variant escapes will not match multiple terms, because they are not alone in their respective groups within the pattern. If we wanted to insist that $field is matched to a single term despite being alone in its group, then we could annotate the escape with the Term syntax class by using $(field :: Term). Similarly, we could use $(field :: Group) to be explicit about the possibility of matching a multi-term group. Along the same lines, we’d like to insist that name and variant match an identifier (as opposed to a number, string, or other term) by using the Identifier syntax class.

fun my_expand(input):
  // match via pattern
  match input
  | 'datatype $(name :: Identifier)
     | $(variant :: Identifier)($field, ...)
     | ...':
      // result via template
      'class $name():
         nonfinal
       class $variant($field, ...):
         extends $name
       ...'

> my_expand('datatype Expr
             | Id(name :: Symbol)
             | Call(fun :: Expr, arg :: Expr)')
'class Expr (): nonfinal
 class Id (name :: Symbol): extends Expr
 class Call (fun :: Expr, arg :: Expr): extends Expr'

> my_expand('datatype Expr
             | 1()
             | 2()')
match: expected identifier
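The Term and Group annotations mentioned above can be compared side by side. This is a minimal sketch; the function describe and its result strings are hypothetical names chosen for the example.

```rhombus
#lang rhombus

fun describe(stx):
  match stx
  // `Term` insists on exactly one term, even though the escape
  // is alone in its group
  | '($(t :: Term))': "single term"
  // `Group` accepts a whole group, including a multi-term one
  | '($(g :: Group))': "group of several terms"
  | ~else: "something else"

describe('(1)')      // "single term"
describe('(1 + 2)')  // "group of several terms"
describe('(1, 2)')   // "something else", since there are two groups
```

The order of the cases matters here: a one-term group would also satisfy the Group case, so the Term case is tried first.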

You can define your own syntax classes to support new syntactic categories that might be important for your own constructs, but the built-in set of syntax classes is sufficient for our purposes in the tutorial.
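For a sense of what a user-defined syntax class looks like, here is a small sketch that assumes the syntax_class form from rhombus. The class name Arithmetic and the function left_operand are hypothetical and are not used elsewhere in this tutorial.

```rhombus
#lang rhombus

// Hypothetical class: matches `term + term` or `term - term`
syntax_class Arithmetic
| '$x + $y'
| '$x - $y'

fun left_operand(stx):
  match stx
  // Pattern variables of the class are reached as fields of `e`
  | '$(e :: Arithmetic)': e.x
  | ~else: #false

left_operand('1 + 2')  // the syntax object for 1
left_operand('8 * 9')  // #false, because `*` fits neither alternative
```

Each alternative of the class is itself a syntax pattern, so a class is a way to name a set of patterns and reuse it wherever an annotation is allowed.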

3.3 Parsing-Time Expressions

We can match a syntax object in a function and produce a replacement like my_expand in the previous example, but that’s a run-time operation. It’s too late to affect the way the surrounding program is parsed and its definitions are discovered. To implement a definition macro, we need to perform the same matching and substitution at parsing time instead of run time.

To write parsing-time functions and expressions, we must import the rhombus/meta module with import rhombus/meta open, or we can use the language rhombus/and_meta after #lang instead of rhombus. The rhombus/meta module provides meta, which shifts the evaluation time of its body from run time to parsing time—a.k.a., expand time or compile time, since those are all the same time relative to run time.

#lang rhombus/and_meta

println("running")

meta:
  println("parsing")

This example module prints “running” when it is run, but it has to be parsed before it can be run, and so “parsing” prints first, even though it appears later in the module. If you run this module in DrRacket, then “parsing” is likely to print twice, because DrRacket’s strategy for debugging programs involves compiling them twice. If you turn off debugging via the Choose Language... item in the Language menu, then “parsing” prints only once in DrRacket.

If we put my_expand from the previous example into a meta block, then it can be called at parse time, but the function’s behavior is only half of the story. There still needs to be a connection between my_expand and uses of datatype in the run-time part of the module. We need to specifically use a macro-defining form to hook into the parsing process and create the connection.

3.4 Definition Macros

The defn.macro form combines meta, match for a syntax object pattern, and a hook into the parsing process for run-time definitions. That combination creates a macro that applies to definition contexts.

In the following example, defn.macro creates a parsing-time function and registers a connection to the name datatype, because datatype is the head of the pattern after defn.macro. When the parser later encounters datatype in a definition position, it matches the datatype use to the pattern, and it evaluates the body of the defn.macro form to get a replacement set of definitions.

#lang rhombus/and_meta

defn.macro 'datatype $(name :: Identifier)
            | $(variant :: Identifier)($field, ...)
            | ...':
  // result via template
  'class $name():
     nonfinal
   class $variant($field, ...):
     extends $name
   ...'

datatype Expr
| Id(name :: Symbol)
| Fun(arg :: Symbol, body :: Expr)
| Call(fun :: Expr, arg :: Expr)

fun interp(e :: Expr, env :: Map):
  match e
  | Id(name):
      env[name]
  | Fun(arg, body):
      fun (arg_val): interp(body, env ++ { arg: arg_val })
  | Call(fun, arg):
      interp(fun, env)(interp(arg, env))

In the defn.macro form that defines datatype, the template '' expression in the body of the definition is a parsing-time expression. The code inside the '' represents a run-time definition, since it will be spliced in place of a use of datatype in a run-time definition position.
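The same split between a parsing-time body and a run-time template shows up in a much smaller macro. The sketch below is only an illustration; def_both is a hypothetical macro name, and x and y are arbitrary.

```rhombus
#lang rhombus/and_meta

// `def_both` is a hypothetical macro used only for illustration
defn.macro 'def_both $(a :: Identifier) $(b :: Identifier) = $rhs ...':
  // Parsing-time body: this template is the run-time code that
  // replaces each use of `def_both`
  'def $a = $rhs ...
   def $b = $rhs ...'

// Run-time definition position: parsed as two `def` forms
def_both x y = 1 + 2

println(x + y)  // prints 6
```

As with datatype, the macro returns more than one definition, and the names supplied at the use site are the ones that end up defined.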

3.5 Exercise

Start with the program interp_defn_macro.rhm. If you replace Id in datatype Expr with 7, then you get a nice error message from datatype. But if you replace name :: Symbol with 7 :: Symbol, then you get an error with poor reporting, because it comes from class.

Refine the field part of the datatype macro’s pattern to insist that a field is an identifier, followed by ::, and then another identifier. That way, any other shape will trigger an error from datatype instead of class.