Introduction to Functional Programming

Russ Ross

Computer Laboratory
University of Cambridge
Lent Term 2005


Lecture 12

Grammar for terms

We would like to have a parser for our terms, so that we don't have to write them in terms of type constructors.

term     → name ( termlist )
         | name
         | ( term )
         | numeral
         | - term
         | term + term
         | term * term
termlist → term , termlist
         | term

Here we have a grammar for terms, defined by a set of production rules.

Ambiguity

The task of parsing, in general, is to reverse this, i.e., to find a sequence of productions that could generate a given string.

Unfortunately our grammar is ambiguous, since certain strings can be produced in several ways, e.g.

term → term + term
     → term + term * term

and

term → term * term
     → term + term * term

These correspond to different `parse trees'. Effectively, we are free to interpret x + y * z either as x + (y * z) or (x + y) * z.
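
The two readings can be written down directly as values of the term type. The sketch below assumes that the datatype from the earlier lectures has the constructors Var, Const and Fn used later in these notes. The function eval is a hypothetical helper, added only to show that the two trees really denote different things.

```sml
(* Assumed shape of the term type; it matches the constructors
   used elsewhere in these notes. *)
datatype term = Var of string
              | Const of string
              | Fn of string * term list

(* The two parse trees for the string 2 + 3 * 4. *)
val tree1 = Fn("+", [Const "2", Fn("*", [Const "3", Const "4"])])  (* 2 + (3 * 4) *)
val tree2 = Fn("*", [Fn("+", [Const "2", Const "3"]), Const "4"])  (* (2 + 3) * 4 *)

(* Hypothetical evaluator for numeric terms built from + and * only. *)
fun eval (Const s) = valOf (Int.fromString s)
  | eval (Fn("+", [a, b])) = eval a + eval b
  | eval (Fn("*", [a, b])) = eval a * eval b
  | eval _ = raise Fail "eval: unsupported term"

val v1 = eval tree1   (* 14 *)
val v2 = eval tree2   (* 20 *)
```

The same string therefore yields two different values, depending on which parse tree is chosen.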

Encoding precedences

We can encode operator precedences by introducing extra categories, e.g.

atom     → name ( termlist )
         | name
         | numeral
         | ( term )
         | - atom
mulexp   → atom * mulexp
         | atom
term     → mulexp + term
         | mulexp
termlist → term , termlist
         | term

Now it's unambiguous. Multiplication has higher precedence and both infixes associate to the right.

Recursive descent

A recursive descent parser is a top-down parser, implemented as a series of mutually recursive functions, one for each syntactic category (term, mulexp, etc.).

The mutually recursive structure mirrors that of the grammar.

This makes such parsers quite easy and natural to write, especially in ML, where recursion is the principal control mechanism.

For example, the procedure for parsing terms, say term, will on encountering a - symbol make a recursive call to itself to parse the subterm, and on encountering a name followed by an opening parenthesis will make a recursive call to termlist. This in turn will make at least one recursive call to term, and so on.
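
To make the idea concrete before any combinators are introduced, here is a hand-written sketch for a cut-down version of the grammar. Its simplifications are assumptions made for brevity: the input is a list of characters, numerals are single digits, there are no names or function applications, and each function returns the integer value of what it read rather than a parse tree. There is one function per syntactic category, and each returns its result together with the unread input.

```sml
exception Noparse

(* atom -> digit | ( term ) *)
fun atom (#"(" :: rest) =
      (case term rest of
           (v, #")" :: rest') => (v, rest')
         | _ => raise Noparse)
  | atom (c :: rest) =
      if Char.isDigit c then (ord c - ord #"0", rest)
      else raise Noparse
  | atom [] = raise Noparse

(* mulexp -> atom * mulexp | atom *)
and mulexp input =
      (case atom input of
           (v, #"*" :: rest) =>
             let val (w, rest') = mulexp rest
             in (v * w, rest') end
         | result => result)

(* term -> mulexp + term | mulexp *)
and term input =
      (case mulexp input of
           (v, #"+" :: rest) =>
             let val (w, rest') = term rest
             in (v + w, rest') end
         | result => result)

val (value, leftover) = term (explode "2*(3+4)+1")   (* value = 15, leftover = [] *)
```

The calls follow the grammar: term calls mulexp, mulexp calls atom, and atom calls term again when it meets an opening parenthesis.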

Parsers in ML

We assume that a parser accepts a list of input characters or tokens of arbitrary type.

It returns the result of parsing, which has some other arbitrary type, and also the list of input objects not yet processed. Therefore the type of a parser is:

α list → β * α list

For example, when given the input characters (x + y) * z the function atom will process the characters (x + y) and leave the remaining characters * z. It might return a parse tree for the processed expression using our earlier recursive type, and hence we would have:

atom "(x + y) * z" =
  (Fn("+",[Var "x", Var "y"]),"* z")

Parser combinators

In ML, we can define a series of combinators for plugging parsers together and creating new parsers from existing ones.

By giving some of them infix status, we can make the ML parser program look quite similar in structure to the original grammar.

First we declare an exception to be used where parsing fails:

exception Noparse

p1 ++ p2 applies p1 first and then applies p2 to the remaining tokens; many keeps applying the same parser as long as possible.

p >> f works like p but then applies f to the result of the parse.

p1 || p2 tries p1 first, and if that fails, tries p2.

The three binary combinators are declared as infix operators, listed in decreasing order of precedence:

infixr 5 ++
infixr 3 >>
infixr 0 ||

Definitions of the combinators

fun (p1 ++ p2) input =
  let val (x, rest)  = p1 input
      val (y, rest') = p2 rest
  in ((x, y), rest') end

fun (p >> f) input =
  let val (x, rest) = p input
  in (f x, rest) end

fun (p1 || p2) input =
  p1 input
  handle Noparse => p2 input

fun many p input =
  let val (x, rest) = p input
      val (xs, rest') = many p rest
  in (x::xs, rest') end
  handle Noparse => ([], input)

We will also use the following general functions:

fun fst (x, _) = x
fun snd (_, x) = x

Atomic parsers

We need a few primitive parsers to get us started.

fun some p [] = raise Noparse
  | some p (x::xs) = if p x then (x, xs)
                     else raise Noparse

fun a tok = some (fn item => item = tok)

fun finished [] = (0, [])
  | finished _ = raise Noparse

The first two accept something satisfying p, and something equal to tok, respectively. The last one makes sure there is no unprocessed input.
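
As a quick check of how the pieces behave, the following sketch repeats the definitions above and applies them to lists of characters. The test inputs and the names r1 to r4 are illustrative only.

```sml
exception Noparse

infixr 5 ++
infixr 3 >>
infixr 0 ||

fun (p1 ++ p2) input =
  let val (x, rest)  = p1 input
      val (y, rest') = p2 rest
  in ((x, y), rest') end

fun (p >> f) input =
  let val (x, rest) = p input
  in (f x, rest) end

fun (p1 || p2) input =
  p1 input
  handle Noparse => p2 input

fun many p input =
  let val (x, rest) = p input
      val (xs, rest') = many p rest
  in (x::xs, rest') end
  handle Noparse => ([], input)

fun some p [] = raise Noparse
  | some p (x::xs) = if p x then (x, xs)
                     else raise Noparse

fun a tok = some (fn item => item = tok)

(* Sequencing: read #"a" and then #"b". *)
val r1 = (a #"a" ++ a #"b") (explode "abc")   (* ((#"a", #"b"), [#"c"]) *)

(* Repetition: read as many #"a" characters as possible. *)
val r2 = many (a #"a") (explode "aab")        (* ([#"a", #"a"], [#"b"]) *)

(* Alternation: the first parser fails, so the second is tried. *)
val r3 = (a #"x" || a #"y") (explode "yz")    (* (#"y", [#"z"]) *)

(* Post-processing: turn a digit character into its integer value. *)
val r4 = (some Char.isDigit >> (fn c => ord c - ord #"0")) (explode "7!")
                                              (* (7, [#"!"]) *)
```

Note that many never raises Noparse: when its argument parser fails at once, it returns the empty list and leaves the input untouched.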

Lexical analysis

First we want to do lexical analysis, i.e., split the input characters into tokens. This can also be done using our combinators, together with a few character discrimination functions. First we declare the type of tokens:

datatype token = Name of string
               | Num of string
               | Other of string

We want the lexer to accept a string and produce a list of tokens, ignoring spaces, e.g.

- lex "sin(x + y) * cos(2 * x + y)";
> val it =
    [Name "sin", Other "(", Name "x", Other "+",
     Name "y", Other ")", Other "*", Name "cos",
     Other "(", Num "2", Other "*", Name "x",
     Other "+", Name "y", Other ")"] : token list

Definition of the lexer

val lex = let
  fun several p = many (some p)

  fun lower s = "a" <= s andalso s <= "z"
  fun upper s = "A" <= s andalso s <= "Z"
  fun letter s = lower s orelse upper s
  fun alpha s = letter s orelse s = "_" orelse s = "'"
  fun digit s = "0" <= s andalso s <= "9"
  fun alphanum s = alpha s orelse digit s
  fun space s = s = " " orelse s = "\n" orelse s = "\t"

  fun collect (x,xs) = x ^ (foldr op^ "" xs)

  val rawname =    some alpha ++ several alphanum
                   >> (Name o collect)
  val rawnumeral = some digit ++ several digit
                   >> (Num o collect)
  val rawother =   some (fn _ => true)
                   >> Other

  val token =      (rawname || rawnumeral || rawother) ++
                   several space >> fst
  val tokens =     (several space ++ many token) >> snd
  val alltokens =  (tokens ++ finished) >> fst

in fst o alltokens o map str o explode end
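
It may help to look at one of the raw token parsers in isolation. The sketch below repeats just enough of the definitions to run rawname on its own. The helper letter is written slightly more compactly than in the lexer, and the input "x1+y" is an arbitrary example.

```sml
exception Noparse

infixr 5 ++
infixr 3 >>

fun (p1 ++ p2) input =
  let val (x, rest)  = p1 input
      val (y, rest') = p2 rest
  in ((x, y), rest') end

fun (p >> f) input =
  let val (x, rest) = p input
  in (f x, rest) end

fun many p input =
  let val (x, rest) = p input
      val (xs, rest') = many p rest
  in (x::xs, rest') end
  handle Noparse => ([], input)

fun some p [] = raise Noparse
  | some p (x::xs) = if p x then (x, xs)
                     else raise Noparse

datatype token = Name of string
               | Num of string
               | Other of string

fun several p = many (some p)

fun letter s = ("a" <= s andalso s <= "z") orelse
               ("A" <= s andalso s <= "Z")
fun alpha s = letter s orelse s = "_" orelse s = "'"
fun digit s = "0" <= s andalso s <= "9"
fun alphanum s = alpha s orelse digit s

(* Glue the first character and the list of following ones together. *)
fun collect (x, xs) = x ^ (foldr op^ "" xs)

val rawname = some alpha ++ several alphanum
              >> (Name o collect)

(* The lexer works on single-character strings, not on chars. *)
val chars = map str (explode "x1+y")     (* ["x", "1", "+", "y"] *)
val (tok, rest) = rawname chars          (* tok = Name "x1", rest = ["+", "y"] *)
```

The parser stops at the first character that is not alphanumeric and hands the remainder on to whatever parser comes next.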

Parsing terms

In order to parse terms, we start with some basic parsers for single tokens of a particular kind:

fun name (Name s :: rest) = (s,rest)
  | name _ = raise Noparse

fun numeral (Num s :: rest) = (s,rest)
  | numeral _ = raise Noparse

fun other (Other s :: rest) = (s,rest)
  | other _ = raise Noparse

Now we can define a parser for terms, in a form very similar to the original grammar. The main difference is that each production rule has associated with it some sort of special action to take as a result of parsing.

The term parser (take 1)

fun atom input =
    (name ++
     a (Other "(") ++ termlist ++ a (Other ")")
         >> (fn (f,(_,(a,_))) => Fn(f,a))
  || name
         >> Var
  || numeral
         >> Const
  || a (Other "(") ++ term ++ a (Other ")")
         >> (fst o snd)
  || a (Other "-") ++ atom
         >> (fn (_,a) => Fn("-",[a]))) input

and mulexp input =
    (atom ++ a (Other "*") ++ mulexp
         >> (fn (a,(_,m)) => Fn("*",[a,m]))
  || atom) input

and term input =
    (mulexp ++ a (Other "+") ++ term
         >> (fn (a,(_,m)) => Fn("+",[a,m]))
  || mulexp) input

and termlist input =
    (term ++ a (Other ",") ++ termlist
         >> (fn (x,(_,xs)) => x::xs)
  || term
         >> (fn x => [x])) input

Examples

Let us package everything up as a single parsing function:

val parser =
  fst o (term ++ finished >> fst) o lex

To see it in action, we try with and without the printer (from the previous lecture) installed:

- parser "sin(x + y) * cos(2 * x + y)";
> val it =
    Fn("*",
       [Fn("sin", [Fn("+", [Var "x", Var "y"])]),
        Fn("cos", [Fn("+", [Fn("*",
              [Const "2", Var "x"]), Var "y"])])])
    : term

- installPP print_term;
> val it = () : unit

- parser "sin(x + y) * cos(2 * x + y)";
> val it = 'sin(x + y) * cos(2 * x + y)' : term
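
The precedence and associativity built into the grammar can also be checked on small inputs. The following sketch is a trimmed-down version of the parser, not the full one: function application and unary minus are left out, the term datatype is assumed to have the shape shown, and token lists are supplied by hand so that the lexer is not needed. The name parseTokens is hypothetical.

```sml
exception Noparse

infixr 5 ++
infixr 3 >>
infixr 0 ||

fun (p1 ++ p2) input =
  let val (x, rest)  = p1 input
      val (y, rest') = p2 rest
  in ((x, y), rest') end

fun (p >> f) input =
  let val (x, rest) = p input
  in (f x, rest) end

fun (p1 || p2) input =
  p1 input
  handle Noparse => p2 input

fun some p [] = raise Noparse
  | some p (x::xs) = if p x then (x, xs)
                     else raise Noparse

fun a tok = some (fn item => item = tok)

fun finished [] = (0, [])
  | finished _ = raise Noparse

fun fst (x, _) = x
fun snd (_, x) = x

datatype token = Name of string | Num of string | Other of string

(* Assumed shape of the term type from the earlier lectures. *)
datatype term = Var of string | Const of string | Fn of string * term list

fun name (Name s :: rest) = (s, rest)
  | name _ = raise Noparse

fun numeral (Num s :: rest) = (s, rest)
  | numeral _ = raise Noparse

(* Trimmed-down grammar: names, numerals, parentheses, * and +. *)
fun atom input =
    (name >> Var
  || numeral >> Const
  || a (Other "(") ++ term ++ a (Other ")")
         >> (fst o snd)) input

and mulexp input =
    (atom ++ a (Other "*") ++ mulexp
         >> (fn (x,(_,m)) => Fn("*",[x,m]))
  || atom) input

and term input =
    (mulexp ++ a (Other "+") ++ term
         >> (fn (x,(_,t)) => Fn("+",[x,t]))
  || mulexp) input

fun parseTokens toks = fst ((term ++ finished >> fst) toks)

(* x + y + z groups to the right: x + (y + z). *)
val t1 = parseTokens [Name "x", Other "+", Name "y", Other "+", Name "z"]

(* x * y + z: multiplication binds tighter: (x * y) + z. *)
val t2 = parseTokens [Name "x", Other "*", Name "y", Other "+", Name "z"]
```

Here t1 is Fn("+",[Var "x", Fn("+",[Var "y", Var "z"])]) and t2 is Fn("+",[Fn("*",[Var "x", Var "y"]), Var "z"]).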

Automating precedence parsing

We can let ML construct the `fixed-up' grammar from our list of infixes:

fun binop opr parser input =
  let val (result as (atom, rest)) = parser input
  in if rest <> [] andalso hd rest = Other opr
     then let val (atom', rest') = binop opr parser (tl rest)
          in (Fn(opr, [atom, atom']), rest') end
     else result end

fun findmin (x::xs) =
  foldl (fn (x1 as (_,pr1), x2 as (_,pr2)) =>
           if pr1 <= pr2 then x1 else x2)
        x xs

fun delete elt (x::xs) =
  if x = elt then xs else x :: (delete elt xs)

fun precedence lst parser input =
  if null lst then parser input
  else let val opp = findmin lst
           val lst' = delete opp lst
       in binop (fst opp) (precedence lst' parser) input
       end
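
A small experiment shows what precedence builds. The list infixes below is an assumption: from the code, it must hold pairs of an operator name and a number, and since findmin picks the smallest number for the outermost binop, larger numbers bind more tightly. The atom parser is a hypothetical minimal one that accepts only names and numerals, and the term datatype is assumed to have the shape shown.

```sml
exception Noparse

datatype token = Name of string | Num of string | Other of string

(* Assumed shape of the term type from the earlier lectures. *)
datatype term = Var of string | Const of string | Fn of string * term list

fun fst (x, _) = x

(* The four functions below are copied from the definitions above. *)
fun binop opr parser input =
  let val (result as (atom, rest)) = parser input
  in if rest <> [] andalso hd rest = Other opr
     then let val (atom', rest') = binop opr parser (tl rest)
          in (Fn(opr, [atom, atom']), rest') end
     else result end

fun findmin (x::xs) =
  foldl (fn (x1 as (_,pr1), x2 as (_,pr2)) =>
           if pr1 <= pr2 then x1 else x2)
        x xs

fun delete elt (x::xs) =
  if x = elt then xs else x :: (delete elt xs)

fun precedence lst parser input =
  if null lst then parser input
  else let val opp = findmin lst
           val lst' = delete opp lst
       in binop (fst opp) (precedence lst' parser) input
       end

(* Assumed infix table: operator name paired with its precedence. *)
val infixes = [("+", 10), ("*", 20)]

(* Hypothetical minimal atom parser: names and numerals only. *)
fun atom (Name s :: rest) = (Var s, rest)
  | atom (Num s :: rest) = (Const s, rest)
  | atom _ = raise Noparse

(* x + y * z *)
val (t1, rest1) =
  precedence infixes atom
    [Name "x", Other "+", Name "y", Other "*", Name "z"]

(* x * y + z *)
val (t2, rest2) =
  precedence infixes atom
    [Name "x", Other "*", Name "y", Other "+", Name "z"]
```

With this table, precedence infixes atom behaves like binop "+" (binop "*" atom), so t1 is Fn("+",[Var "x", Fn("*",[Var "y", Var "z"])]) and t2 is Fn("+",[Fn("*",[Var "x", Var "y"]), Var "z"]). A compiler may warn that findmin and delete do not handle the empty list; they are only ever called here on non-empty lists that contain the element sought.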

The term parser (take 2)

Now the main parser is simpler and more general.

fun atom input =
    (name ++
     a (Other "(") ++ termlist ++ a (Other ")")
         >> (fn (f,(_,(a,_))) => Fn(f,a))
  || name
         >> Var
  || numeral
         >> Const
  || a (Other "(") ++ term ++ a (Other ")")
         >> (fst o snd)
  || a (Other "-") ++ atom
         >> (fn (_,a) => Fn("-",[a]))) input

and term input = precedence infixes atom input

and termlist input =
    (term ++ a (Other ",") ++ termlist
         >> (fn (x,(_,xs)) => x::xs)
  || term
         >> (fn x => [x])) input

This will construct the precedence parser using the list infixes, a list of (operator, precedence) pairs, as it stands at the time we define the parser. The grammar itself now needs only the atom and termlist categories.

Backtracking and reprocessing

Some productions for the same syntactic category have a common prefix. Note that our production rules for term have this property:

term → name ( termlist )
     | name
     | …

We carefully put the longer production first in our actual implementation; otherwise success in reading a name would cause the abandonment of attempts to read a parenthesized list of arguments.

However, this backtracking can lead to our processing the initial name twice.

This is not very serious here, but it could be in termlist.

An improved treatment

We can easily replace:

fun …
and termlist input =
    (term ++ a (Other ",") ++ termlist
         >> (fn (x,(_,xs)) => x::xs)
  || term
         >> (fn x => [x])) input

with

fun …
and termlist input =
    (term ++ many (a (Other ",") ++ term >> snd)
         >> op::) input

This gives another improvement to the parser, which is now more efficient and slightly simpler.
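
The revised termlist can be tried on its own. In the sketch below the full term parser is replaced by a hypothetical stand-in that accepts a single name, and the term datatype is assumed to have the shape shown. Everything else repeats the earlier definitions.

```sml
exception Noparse

infixr 5 ++
infixr 3 >>

fun (p1 ++ p2) input =
  let val (x, rest)  = p1 input
      val (y, rest') = p2 rest
  in ((x, y), rest') end

fun (p >> f) input =
  let val (x, rest) = p input
  in (f x, rest) end

fun many p input =
  let val (x, rest) = p input
      val (xs, rest') = many p rest
  in (x::xs, rest') end
  handle Noparse => ([], input)

fun some p [] = raise Noparse
  | some p (x::xs) = if p x then (x, xs)
                     else raise Noparse

fun a tok = some (fn item => item = tok)

fun snd (_, x) = x

datatype token = Name of string | Num of string | Other of string

(* Assumed shape of the term type from the earlier lectures. *)
datatype term = Var of string | Const of string | Fn of string * term list

(* Hypothetical stand-in for the full term parser: a term is just a name. *)
fun term (Name s :: rest) = (Var s, rest)
  | term _ = raise Noparse

(* One term, then any number of ", term" pairs with the comma dropped. *)
fun termlist input =
    (term ++ many (a (Other ",") ++ term >> snd)
         >> op::) input

(* x , y , z ) *)
val (args, rest) =
  termlist [Name "x", Other ",", Name "y", Other ",", Name "z", Other ")"]
```

Here args is [Var "x", Var "y", Var "z"] and rest is [Other ")"]. Each term is read exactly once, whereas the earlier version reads the last term of the list twice: once in the failed attempt to find a following comma, and once more in the fallback alternative.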

The term parser (take 3)

The final result is:

fun atom input =
    (name ++
     a (Other "(") ++ termlist ++ a (Other ")")
         >> (fn (f,(_,(a,_))) => Fn(f,a))
  || name
         >> Var
  || numeral
         >> Const
  || a (Other "(") ++ term ++ a (Other ")")
         >> (fst o snd)
  || a (Other "-") ++ atom
         >> (fn (_,a) => Fn("-",[a]))) input

and term input = precedence infixes atom input

and termlist input
    = (term ++ many (a (Other ",") ++ term >> snd)
         >> op::) input

General remarks

With care, this parsing method can be used effectively. It is a good illustration of the power of higher-order functions.

The code of such a parser is highly structured and similar to the grammar, and is therefore easy to modify.

However, it is not as efficient as LR parsers. ML-Yacc is capable of generating good LR parsers automatically.

Recursive descent also has trouble with left recursion. For example, if we had wanted to make the addition operator left-associative in our earlier grammar, we could have used:

term → term + mulexp
     | mulexp

The naive transcription into ML would loop indefinitely. However, we can often replace such constructs with explicit repetitions.
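
One possible way of doing this, sketched below, is to read a first mulexp, then use many to collect any number of "+ mulexp" pairs, and finally fold the results together from the left. For brevity, mulexp here is a hypothetical stand-in that accepts a single name, and the term datatype is assumed to have the shape shown.

```sml
exception Noparse

infixr 5 ++
infixr 3 >>

fun (p1 ++ p2) input =
  let val (x, rest)  = p1 input
      val (y, rest') = p2 rest
  in ((x, y), rest') end

fun (p >> f) input =
  let val (x, rest) = p input
  in (f x, rest) end

fun many p input =
  let val (x, rest) = p input
      val (xs, rest') = many p rest
  in (x::xs, rest') end
  handle Noparse => ([], input)

fun some p [] = raise Noparse
  | some p (x::xs) = if p x then (x, xs)
                     else raise Noparse

fun a tok = some (fn item => item = tok)

fun snd (_, x) = x

datatype token = Name of string | Num of string | Other of string

(* Assumed shape of the term type from the earlier lectures. *)
datatype term = Var of string | Const of string | Fn of string * term list

(* Hypothetical stand-in for mulexp: a single name. *)
fun mulexp (Name s :: rest) = (Var s, rest)
  | mulexp _ = raise Noparse

(* term -> mulexp followed by zero or more "+ mulexp",
   combined so that + associates to the left. *)
fun term input =
    (mulexp ++ many (a (Other "+") ++ mulexp >> snd)
         >> (fn (x, xs) =>
               foldl (fn (y, acc) => Fn("+", [acc, y])) x xs)) input

(* x + y + z *)
val (t, rest) =
  term [Name "x", Other "+", Name "y", Other "+", Name "z"]
```

The result t is Fn("+",[Fn("+",[Var "x", Var "y"]), Var "z"]), i.e. (x + y) + z, and no function ever calls itself without first consuming input.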