We would like to have a parser for our terms, so that we don't have to write them in terms of type constructors.
term     → name ( termlist ) | name | ( term ) | numeral
         | - term | term + term | term * term
termlist → term , termlist | term
Here we have a grammar for terms, defined by a set of production rules.
The task of parsing, in general, is to reverse this, i.e., to find a sequence of productions that could generate a given string.
Unfortunately our grammar is ambiguous, since certain strings can be produced in several ways, e.g.
term → term + term → term + term * term
and
term → term * term → term + term * term
These correspond to different `parse trees'. Effectively, we are free to interpret x + y * z either as x + (y * z) or (x + y) * z.
We can encode operator precedences by introducing extra categories, e.g.
atom     → name ( termlist ) | name | numeral | ( term ) | - atom
mulexp   → atom * mulexp | atom
term     → mulexp + term | mulexp
termlist → term , termlist | term
Now it's unambiguous. Multiplication has higher precedence and both infixes associate to the right.
A recursive descent parser is a top-down parser, implemented as a series of mutually recursive functions, one for each syntactic category (term, mulexp, etc.).
The mutually recursive structure mirrors that of the grammar.
This makes them quite easy and natural to write—especially in ML, where recursion is the principal control mechanism.
For example, the procedure for parsing terms, say term, will, on encountering a - symbol, make a recursive call to itself to parse the subterm; on encountering a name followed by an opening parenthesis, it will make a recursive call to termlist. This in turn will make at least one recursive call to term, and so on.
We assume that a parser accepts a list of input characters or tokens of arbitrary type.
It returns the result of parsing, which has some other arbitrary type, and also the list of input objects not yet processed. Therefore the type of a parser is:
α list → β * α list
For example, when given the input characters (x + y) * z the function atom will process the characters (x + y) and leave the remaining characters * z. It might return a parse tree for the processed expression using our earlier recursive type, and hence we would have:
atom "(x + y) * z" = (Fn("+",[Var "x", Var "y"]),"* z")
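This type can be written down in ML itself as a type abbreviation. The following is a small sketch; the name parser and the example parser item are ours, not part of the notes:

```sml
(* A parser consumes a list of input items of type 'a and returns a
   result of type 'b together with the unconsumed input. *)
type ('a, 'b) parser = 'a list -> 'b * 'a list

(* For instance, a trivial parser that consumes exactly one item: *)
exception Noparse
fun item []        = raise Noparse
  | item (x :: xs) = (x, xs)

val _ = item : (int, int) parser   (* one possible instance of the type *)
```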
In ML, we can define a series of combinators for plugging parsers together and creating new parsers from existing ones.
By giving some of them infix status, we can make the ML parser program look quite similar in structure to the original grammar.
First we declare an exception to be used where parsing fails:
exception Noparse
p1 ++ p2 applies p1 first and then applies p2 to the remaining tokens; many keeps applying the same parser as long as possible.
p >> f works like p but then applies f to the result of the parse.
p1 || p2 tries p1 first, and if that fails, tries p2.
These are infix operators, in decreasing order of precedence:
infixr 5 ++
infixr 3 >>
infixr 0 ||
fun (p1 ++ p2) input =
      let val (x, rest) = p1 input
          val (y, rest') = p2 rest
      in ((x, y), rest') end

fun (p >> f) input =
      let val (x, rest) = p input
      in (f x, rest) end

fun (p1 || p2) input =
      p1 input
      handle Noparse => p2 input

fun many p input =
      let val (x, rest) = p input
          val (xs, rest') = many p rest
      in (x :: xs, rest') end
      handle Noparse => ([], input)
We will also use the following general functions:
fun fst (x, _) = x
fun snd (_, x) = x
We need a few primitive parsers to get us started.
fun some p [] = raise Noparse
  | some p (x :: xs) = if p x then (x, xs) else raise Noparse

fun a tok = some (fn item => item = tok)

fun finished [] = (0, [])
  | finished _ = raise Noparse
The first two accept something satisfying p, and something equal to tok, respectively. The last one makes sure there is no unprocessed input.
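For example (with the definitions just given repeated so the snippet stands alone; tokens here are single-character strings, matching the lexer that follows):

```sml
exception Noparse
fun some p []        = raise Noparse
  | some p (x :: xs) = if p x then (x, xs) else raise Noparse
fun a tok = some (fn item => item = tok)
fun finished [] = (0, [])
  | finished _  = raise Noparse

val _ = a "x" ["x", "+", "y"]        (* = ("x", ["+", "y"]) *)
val _ = finished ([] : string list)  (* = (0, []) *)
```

Applying a "x" to input that does not begin with "x" raises Noparse, which is exactly the signal || uses to try its second alternative.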
First we want to do lexical analysis, i.e., split the input characters into tokens. This can also be done using our combinators, together with a few character discrimination functions. First we declare the type of tokens:
datatype token = Name of string | Num of string | Other of string
We want the lexer to accept a string and produce a list of tokens, ignoring spaces, e.g.
- lex "sin(x + y) * cos(2 * x + y)";
> val it =
    [Name "sin", Other "(", Name "x", Other "+", Name "y", Other ")",
     Other "*", Name "cos", Other "(", Num "2", Other "*", Name "x",
     Other "+", Name "y", Other ")"] : token list
val lex =
  let fun several p = many (some p)
      fun lower s = "a" <= s andalso s <= "z"
      fun upper s = "A" <= s andalso s <= "Z"
      fun letter s = lower s orelse upper s
      fun alpha s = letter s orelse s = "_" orelse s = "'"
      fun digit s = "0" <= s andalso s <= "9"
      fun alphanum s = alpha s orelse digit s
      fun space s = s = " " orelse s = "\n" orelse s = "\t"
      fun collect (x, xs) = x ^ (foldr op^ "" xs)
      val rawname    = some alpha ++ several alphanum >> (Name o collect)
      val rawnumeral = some digit ++ several digit >> (Num o collect)
      val rawother   = some (fn _ => true) >> Other
      val token  = (rawname || rawnumeral || rawother) ++ several space >> fst
      val tokens = (several space ++ many token) >> snd
      val alltokens = (tokens ++ finished) >> fst
  in fst o alltokens o map str o explode end
In order to parse terms, we start with some basic parsers for single tokens of a particular kind:
fun name (Name s :: rest) = (s, rest)
  | name _ = raise Noparse

fun numeral (Num s :: rest) = (s, rest)
  | numeral _ = raise Noparse

fun other (Other s :: rest) = (s, rest)
  | other _ = raise Noparse
Now we can define a parser for terms, in a form very similar to the original grammar. The main difference is that each production rule has associated with it some sort of special action to take as a result of parsing.
fun atom input =
      (name ++ a (Other "(") ++ termlist ++ a (Other ")")
         >> (fn (f, (_, (a, _))) => Fn(f, a))
       || name >> Var
       || numeral >> Const
       || a (Other "(") ++ term ++ a (Other ")") >> (fst o snd)
       || a (Other "-") ++ atom >> (fn (_, a) => Fn("-", [a]))) input
and mulexp input =
      (atom ++ a (Other "*") ++ mulexp >> (fn (a, (_, m)) => Fn("*", [a, m]))
       || atom) input
and term input =
      (mulexp ++ a (Other "+") ++ term >> (fn (a, (_, m)) => Fn("+", [a, m]))
       || mulexp) input
and termlist input =
      (term ++ a (Other ",") ++ termlist >> (fn (x, (_, xs)) => x :: xs)
       || term >> (fn x => [x])) input
Let us package everything up as a single parsing function:
val parser = fst o (term ++ finished >> fst) o lex
To see it in action, we try it with and without the printer (from the previous lecture) installed:
- parser "sin(x + y) * cos(2 * x + y)";
> val it =
    Fn("*", [Fn("sin", [Fn("+", [Var "x", Var "y"])]),
             Fn("cos", [Fn("+", [Fn("*", [Const "2", Var "x"]),
                                 Var "y"])])]) : term
- installPP print_term;
> val it = () : unit
- parser "sin(x + y) * cos(2 * x + y)";
> val it = 'sin(x + y) * cos(2 * x + y)' : term
We can let ML construct the `fixed-up' grammar from our list of infixes:
fun binop opr parser input =
      let val (result as (atom, rest)) = parser input
      in if rest <> [] andalso hd rest = Other opr then
           let val (atom', rest') = binop opr parser (tl rest)
           in (Fn(opr, [atom, atom']), rest') end
         else result
      end

fun findmin (x :: xs) =
      foldl (fn (x1 as (_, pr1), x2 as (_, pr2)) =>
               if pr1 <= pr2 then x1 else x2) x xs

fun delete elt (x :: xs) =
      if x = elt then xs else x :: (delete elt xs)

fun precedence lst parser input =
      if null lst then parser input
      else
        let val opp = findmin lst
            val lst' = delete opp lst
        in binop (fst opp) (precedence lst' parser) input end
Now the main parser is simpler and more general.
fun atom input =
      (name ++ a (Other "(") ++ termlist ++ a (Other ")")
         >> (fn (f, (_, (a, _))) => Fn(f, a))
       || name >> Var
       || numeral >> Const
       || a (Other "(") ++ term ++ a (Other ")") >> (fst o snd)
       || a (Other "-") ++ atom >> (fn (_, a) => Fn("-", [a]))) input
and term input = precedence infixes atom input
and termlist input =
      (term ++ a (Other ",") ++ termlist >> (fn (x, (_, xs)) => x :: xs)
       || term >> (fn x => [x])) input
This will construct the precedence parser using the list of infixes present at the time we define the parser. Now the basic grammar is simpler.
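The list infixes itself is not defined in these notes; presumably it pairs each operator name with a numeric precedence. A plausible placeholder, consistent with findmin handling the loosest-binding operator first:

```sml
(* Hypothetical example of the infix list assumed by
   `precedence infixes atom`.  precedence peels off the operator with
   the *lowest* number first, so it ends up outermost; i.e. lower
   numbers bind more loosely (+ below *, as in the earlier grammar). *)
val infixes = [("+", 10), ("*", 20)]
```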
Some productions for the same syntactic category have a common prefix. Note that our production rules for term have this property:
term → name ( termlist ) | name | …
We carefully put the longer production first in our actual implementation, otherwise success in reading a name would cause the abandonment of attempts to read a parenthesized list of arguments.
However, this backtracking can lead to our processing the initial name twice.
This is not very serious here, but it could be in termlist.
We can easily replace:
fun …
and termlist input =
      (term ++ a (Other ",") ++ termlist >> (fn (x, (_, xs)) => x :: xs)
       || term >> (fn x => [x])) input
with
fun …
and termlist input =
      (term ++ many (a (Other ",") ++ term >> snd) >> op::) input
This gives another improvement to the parser, which is now more efficient and slightly simpler.
The final result is:
fun atom input =
      (name ++ a (Other "(") ++ termlist ++ a (Other ")")
         >> (fn (f, (_, (a, _))) => Fn(f, a))
       || name >> Var
       || numeral >> Const
       || a (Other "(") ++ term ++ a (Other ")") >> (fst o snd)
       || a (Other "-") ++ atom >> (fn (_, a) => Fn("-", [a]))) input
and term input = precedence infixes atom input
and termlist input =
      (term ++ many (a (Other ",") ++ term >> snd) >> op::) input
With care, this parsing method can be used effectively. It is a good illustration of the power of higher-order functions.
The code of such a parser is highly structured and closely parallels the grammar, and is therefore easy to modify.
However, it is not as efficient as LR parsers. ML-Yacc is capable of generating good LR parsers automatically.
Recursive descent also has trouble with left recursion. For example, if we had wanted to make the addition operator left-associative in our earlier grammar, we could have used:
term → term + mulexp | mulexp
The naive transcription into ML would loop indefinitely. However, we can often replace such constructs with explicit repetitions.
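For instance, a left-associative sum can be parsed by reading one operand and then a repetition of + operand, folding the results to the left. A self-contained sketch of the idea (the combinators are repeated, and mulexp is simplified to a single-token "atom", so this runs standalone):

```sml
exception Noparse
datatype term = Const of string | Fn of string * term list

infixr 5 ++
infixr 3 >>

fun (p1 ++ p2) input =
      let val (x, rest) = p1 input
          val (y, rest') = p2 rest
      in ((x, y), rest') end

fun (p >> f) input =
      let val (x, rest) = p input in (f x, rest) end

fun many p input =
      let val (x, rest) = p input
          val (xs, rest') = many p rest
      in (x :: xs, rest') end
      handle Noparse => ([], input)

fun some p [] = raise Noparse
  | some p (x :: xs) = if p x then (x, xs) else raise Noparse

fun a tok = some (fn item => item = tok)
fun snd (_, y) = y

(* Stand-in for mulexp: any single non-"+" token is a constant. *)
val atom = some (fn t => t <> "+") >> Const

(* Left-associative sum: atom (+ atom)*, folded to the left, so that
   x + y + z parses as (x + y) + z rather than x + (y + z). *)
val term =
  atom ++ many (a "+" ++ atom >> snd)
    >> (fn (x, xs) => foldl (fn (y, acc) => Fn("+", [acc, y])) x xs)
```

Here term consumes the same strings as the left-recursive production, but the repetition is explicit, so there is no infinite regress; the foldl rebuilds the left-leaning tree that the left-recursive grammar would have produced.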