Rambles around computer science

Diverting trains of thought, wasting precious time

Fri, 13 Mar 2009

ANTLR recipes

I've been fiddling with ANTLR recently, trying to create a parser for my Cake language. Unfortunately I found it trickier than I was hoping, and this forced me to take a step back and create some simpler grammars so I could get the hang of how to realise some common grammatical features in ANTLR. Specifically, I looked at four features: recursive prefixing (i.e. a recursive unary prefix operator, so right-recursive), recursive suffixing (the same but suffixing, left-recursive), right-associative binary operators and left-associative binary operators.

The appeal of ANTLR has been its ability to build ASTs in special-purpose syntax (agnostic towards the host language) rather than relying on semantic actions. Unfortunately it took me a while to get the hang of these tree construction features. The basics are here, but here's some helpful extra snippets and emphasis that took me a while to discover.

If you want to build a tree at all, you must use either “^” or rewrite rules. Otherwise you will just get a list of tokens (i.e. a flat one-node structure) masquerading as an AST. This perhaps makes sense when considering that ANTLR and similar tools are most essentially language recognisers, and only tree-builders by extension.
I didn't have much success with using empty-RHS rules to terminate recursion. If you need to terminate a list, use the singleton case for termination. This does mean you have to use a syntactic predicate to prefer the longer (recursive) case over the shorter (termination) case. To illustrate, here's the first recipe, which does a prefix sequence. (It's probably possible to get away without this, but I don't know how! Answers on a postcard please.)
```
prefixSequence 	: (IDENT prefixSequence)=> IDENT^ prefixSequence
                | IDENT^ ;
				
$ python recipesParser.py --input='first second third fourth' --rule=prefixSequence
(first (second (third fourth)))
```
In antlr's print-out of an AST, as generated (as above) by command-line invocation of your generated parser (among other means), the first element of a bracketed sequence is implicitly the head. So don't worry if you were expecting brackets following the head (you see something like (first (second (third fourth))) but expected fourth to be bracketed too) -- the stuff to the right of the head token is implicitly underneath it in the tree.
A note on symmetry: even though you might think that prefixes and suffixes are complementary and symmetric ideas, or similarly that left-associative and right-associative operators are such, to a parser generator like ANTLR they are not at all similar. This is because ANTLR only parses input in one direction: from left to right. So, don't be surprised if your rules for these apparently complementary cases look entirely different. Of course, this is also the explanation for why left-recursion is bad but right-recursion is okay.
The left-recursive recipes---suffixSequence and laBinaryOpExpr---are the simplest superficially, but also the trickiest to understand in tree construction terms.
```
suffixSequence	: IDENT^ ( IDENT^ )*
                ;

laPrimitiveExpr	: IDENT^
                | '('! laBinaryOpExpr^ ')'!
                ;

laBinaryOpExpr	: laPrimitiveExpr^ ('*'^ laPrimitiveExpr )*
                ;

$ python recipesParser.py --input='first second third fourth' --rule=suffixSequence
(fourth (third (second first)))
$ python recipesParser.py --input='first * second * third * fourth' --rule=laBinaryOpExpr
(* (* (* first second) third) fourth)
```
There are at least four things to note here.
1. There is no explicit recursion. Iteration has completely replaced it. Normally iteration is simpler than recursion, but since trees are inherently recursive data structures, iteration makes it a bit harder to understand what trees are going to be created.
2. In the case of laBinaryOpExpr, antlr warns us that multiple alternatives could match input such as “*”, so it disabled “option 2” -- this means the implicit option between iterating the “*” (looking for a longer “*”-separated list) and not iterating the “*” (ending the list). It seems to be our good fortune that this is what we want, i.e. that it ignored the option to end the list early if it could be continued.
3. We have used “^” inside a subrule, and antlr has somehow “done the right thing” in terms of tree construction.
4. This is even more surprising given that we used “^” in the top-level part of the rule as well as the subrule. Notice how the subrule's “*” ends up being the ultimate root (i.e. the whole tree is rooted by a “*” node) even though one might expect the top-level rule's “^” to have precedence. If you step through the building of the tree, one iteration of the “*” at a time, you'll notice that each “^” always becomes the root of the tree when it is applied, and since a rightmore “^” will get applied later, the rightmore instance of the operator ends up highest in the tree, which is what we want (i.e. that rightmore operator applications will be evaluated later, only after deeper leftmore applications, in a conventional depth-first evaluation walk of the tree).
If you suffer from RewriteEmptyStreamException, it might be because you forgot to annotate some more depended-upon rules (e.g. primitive expressions, idents, etc.) with “^” or rewrite notation. This means they're not building a tree, therefore leaving their callers with no tree pieces to put together.
If you use (as you should) “^” or tree-rewriting in any part of a set of rules, you should use them in all parts of the rule set. Otherwise tokens won't even appear flat -- they will get thrown out and will not appear at all in the tree. You can mix and match “^” and rewriting between rules, but not within the same rule. For example, if you forget the “^” after IDENT below, you will get a tree with no identifiers in it.
```
raPrimitiveExpr : IDENT^
                | '('! raBinaryOpExpr^ ')'!
                ;
```
In rewrite rules, you can nest brackets to indicate more than one tree node, but you need to repeat the “^” before each opening bracket. Since ANTLR's branch nodes are labelled with tokens, every “^( ... )” expression should begin with a token. You can invent dummy tokens if you need to -- put them in a “tokens { ... }” block at the start of your grammar. The final example illustrates nested tree constructors (but not the dummy tokens).
```
raBinaryOpExpr : (raPrimitiveExpr '->')=> raPrimitiveExpr '->' raBinaryOpExpr 
                   -> ^('->' raPrimitiveExpr ^( raBinaryOpExpr ) )
               | raPrimitiveExpr^
               ;

$ python recipesParser.py --input='first -> second -> third -> fourth' --rule=raBinaryOpExpr
(-> first (-> second (-> third fourth)))
```

I think that's all for now. I still have to get a good handle on operator precedence, which may or may not spawn another post.

[/devel] permanent link contact

validate this page