Rambles around computer science
Diverting trains of thought, wasting precious time
Fri, 13 Mar 2009
ANTLR recipes
I've been fiddling with ANTLR recently, trying to create a
parser for my Cake language. Unfortunately I found it trickier than I was hoping, and this forced me
to take a step back and create some simpler grammars so I could get the hang of how to realise some
common grammatical features in ANTLR. Specifically, I looked at four features: recursive prefixing
(i.e. a recursive unary prefix operator, so right-recursive), recursive suffixing (the same but
suffixing, left-recursive), right-associative binary operators and left-associative binary
operators.
The appeal of ANTLR has been its ability to build ASTs in special-purpose syntax (agnostic
towards the host language) rather than relying on semantic actions. Unfortunately it took me a while
to get the hang of these tree construction features. The basics are here, but here's some helpful
extra snippets and emphasis that took me a while to discover.
- If you want to build a tree at all, you must use either “^” or
rewrite rules. Otherwise you will just get a list of tokens (i.e. a flat one-node structure)
masquerading as an AST. This perhaps makes sense when considering that ANTLR and similar tools are
most essentially language recognisers, and only tree-builders by extension.
- I didn't have much success with using empty-RHS rules to terminate recursion. If you need to
terminate a list, use the singleton case for termination. This does mean you have to use a syntactic
predicate to prefer the longer (recursive) case over the shorter (termination) case. To illustrate,
here's the first recipe, which does a prefix sequence. (It's probably possible to get away without
this, but I don't know how! Answers on a postcard please.)
prefixSequence : (IDENT prefixSequence)=> IDENT^ prefixSequence
| IDENT^ ;
$ python recipesParser.py --input='first second third fourth' --rule=prefixSequence
(first (second (third fourth)))
- In antlr's print-out of an AST, as generated (as above) by command-line invocation of your
generated parser (among other means), the first element of a bracketed sequence is implicitly the
head. So don't worry if you were expecting brackets following the head (you see something like
(first (second (third fourth))) but expected fourth to be bracketed too) -- the
stuff to the right of the head token is implicitly underneath it in the tree.
- A note on symmetry: even though you might think that prefixes and suffixes are complementary and
symmetric ideas, or similarly that left-associative and right-associative operators are such, to a
parser generator like ANTLR they are not at all similar. This is because ANTLR only parses input in
one direction: from left to right. So, don't be surprised if your rules for these apparently
complementary cases look entirely different. Of course, this is also the explanation for why
left-recursion is bad but right-recursion is okay.
- The left-recursive recipes---suffixSequence and laBinaryOpExpr---are the
simplest superficially, but also the trickiest to understand in tree construction terms.
suffixSequence : IDENT^ ( IDENT^ )*
;
laPrimitiveExpr : IDENT^
| '('! laBinaryOpExpr^ ')'!
;
laBinaryOpExpr : laPrimitiveExpr^ ('*'^ laPrimitiveExpr )*
;
$ python recipesParser.py --input='first second third fourth' --rule=suffixSequence
(fourth (third (second first)))
$ python recipesParser.py --input='first * second * third * fourth' --rule=laBinaryOpExpr
(* (* (* first second) third) fourth)
There are at least four things to note here.
- There is no explicit recursion. Iteration has completely replaced it. Normally iteration is
simpler than recursion, but since trees are inherently recursive data structures, iteration makes it
a bit harder to understand what trees are going to be created.
- In the case of laBinaryOpExpr, antlr warns us that multiple alternatives could match
input such as “*”, so it disabled “option 2” -- this means the
implicit option between iterating the “*” (looking for a longer
“*”-separated list) and not iterating the “*” (ending the
list). It seems to be our good fortune that this is what we want, i.e. that it ignored the option to
end the list early if it could be continued.
- We have used “^” inside a subrule, and antlr has somehow “done the
right thing” in terms of tree construction.
- This is even more surprising given that we used “^” in the top-level part
of the rule as well as the subrule. Notice how the subrule's “*” ends up being
the ultimate root (i.e. the whole tree is rooted by a “*” node) even though one
might expect the top-level rule's “^” to have precedence. If you step through
the building of the tree, one iteration of the “*” at a time, you'll notice
that each “^” always becomes the root of the tree when it is applied, and since
a rightmore “^” will get applied later, the rightmore instance of the operator
ends up highest in the tree, which is what we want (i.e. that rightmore operator applications will
be evaluated later, only after deeper leftmore applications, in a conventional depth-first
evaluation walk of the tree).
- If you suffer from RewriteEmptyStreamException, it might be because you forgot to annotate some
more depended-upon rules (e.g. primitive expressions, idents, etc.) with “^” or
rewrite notation. This means they're not building a tree, therefore leaving their callers with no
tree pieces to put together.
- If you use (as you should) “^” or tree-rewriting in any part of a
set of rules, you should use them in all parts of the rule set. Otherwise tokens won't
even appear flat -- they will get thrown out and will not appear at all in the tree. You can mix and
match “^” and rewriting between rules, but not within the same rule. For
example, if you forget the “^” after IDENT below, you will get a tree
with no identifiers in it.
raPrimitiveExpr : IDENT^
| '('! raBinaryOpExpr^ ')'!
;
- In rewrite rules, you can nest brackets to indicate more than one tree node, but you
need to repeat the “^” before each opening bracket. Since ANTLR's branch nodes
are labelled with tokens, every “^( ... )” expression should begin with a
token. You can invent dummy tokens if you need to -- put them in a “tokens { ...
}” block at the start of your grammar. The final example illustrates nested tree
constructors (but not the dummy tokens).
raBinaryOpExpr : (raPrimitiveExpr '->')=> raPrimitiveExpr '->' raBinaryOpExpr
-> ^('->' raPrimitiveExpr ^( raBinaryOpExpr ) )
| raPrimitiveExpr^
;
$ python recipesParser.py --input='first -> second -> third -> fourth' --rule=raBinaryOpExpr
(-> first (-> second (-> third fourth)))
I think that's all for now. I still have to get a good handle on operator precedence, which may
or may not spawn another post.
[/devel]
permanent link
contact
validate this page