Benefits of OCaml

Parsing

Many applications, most notably compiler writing, require streams of characters to be converted into data structures such as trees. This conversion is known as parsing. As a language designed for compiler writing, OCaml provides tools to help with the parsing of character data.

OCaml provides two different forms of parsing:

  • Recursive-descent parsing using camlp4's syntax extension for stream parsing
  • Grammar-based parsing using ocamlyacc

Recursive descent parsers are easily written using OCaml's stream parsing syntax. In this case, a parser typically consists of either a simple lexer generated by the Genlex module or a custom lexer which may be written using stream parsing. The token stream resulting from the lexer is then decomposed into grammatical units by a set of mutually recursive functions.

Grammar-based parsing requires a grammar written in Backus-Naur Form (BNF) with associated actions written in OCaml. The ocamlyacc compiler can then be used to compile this grammar into OCaml source code implementing a parser for the grammar.

Recursive descent parsers tend to be faster but can be more complicated. Consequently, recursive descent parsers are generally only used for simple tasks and grammar based parsers are used to parse more sophisticated languages.

These techniques are best introduced by way of a simple calculator example. Our examples section contains a walkthrough of a complete interpreter for a purely functional programming language.

Recursive descent parsing

A calculator can be implemented in only a page of OCaml code using stream-based recursive descent parsing. The OCaml syntax extension for stream parsing is provided by camlp4, which must be loaded before the rest of the program:

# #load "camlp4o.cma";;
        Camlp4 Parsing version 3.09.2

#

Now we're ready to start writing the calculator. We begin by defining a lexer:

# open Genlex;;
# let keywords = ["("; ")"; "+"; "-"; "*"];;
val keywords : string list = ["("; ")"; "+"; "-"; "*"]
# let lex stream =
    let rec aux = parser
      | [< 'Int n when n<0; t=aux >] -> [< 'Kwd "-"; 'Int (-n); t >]
      | [< 'h; t=aux >] -> [< 'h; t >]
      | [< >] -> [< >] in
  aux(make_lexer keywords stream);;
val lex : char Stream.t -> Genlex.token Stream.t = <fun>

Note that we are careful to replace negative integers in the input stream with a minus sign followed by a positive integer.

Next, we define a type to represent abstract syntax trees:

# type expr = Num of int | Add of expr * expr | Mul of expr * expr;;
type expr = Num of int | Add of expr * expr | Mul of expr * expr

The recursive descent parser consists of three mutually-recursive functions:

# let rec parse_atom = parser
    | [< 'Int n >] -> Num n
    | [< 'Kwd "("; e=parse_expr; 'Kwd ")" >] -> e

  and parse_factor = parser
    | [< e1=parse_atom; stream >] ->
        (parser
         | [< 'Kwd "*"; e2=parse_factor >] -> Mul(e1, e2)
         | [< >] -> e1) stream

  and parse_expr = parser
    | [< e1=parse_factor; stream >] ->
        (parser
         | [< 'Kwd "+"; e2=parse_expr >] -> Add(e1, e2)
         | [< 'Kwd "-"; e2=parse_expr >] ->
             Add(e1, Mul(Num(-1), e2))
         | [< >] -> e1) stream;;
val parse_atom : Genlex.token Stream.t -> expr = <fun>
val parse_factor : Genlex.token Stream.t -> expr = <fun>
val parse_expr : Genlex.token Stream.t -> expr = <fun>

That is all that is required to parse simple arithmetic expressions. We can test it by lexing and parsing a string to get the abstract syntax tree representing the expression:

# parse_expr(lex(Stream.of_string "1+2*(3+4)"));;
- : expr = Add (Num 1, Mul (Num 2, Add (Num 3, Num 4)))

The same program can be written slightly more verbosely but more extensibly using ocamlyacc.

Grammar-based parsing

A grammar-based parser is split into an ocamllex lexer,

Lexing

We begin by describing a lexer in ocamllex syntax in the file "calcLexer.mll":

rule lex = parse
  | [' ' '\n' '\t'] { lex lexbuf }
  | ['0'-'9']+ as s { INT(int_of_string s) }
  | '+'             { PLUS }
  | '-'             { MINUS }
  | '*'             { TIMES }
  | '('             { OPEN }
  | ')'             { CLOSE }
  | eof             { EOF }

This lexer ignores whitespace, converts sequences of one or more digits into INT tokens and converts symbols into other tokens.

The tokens are actually defined in the parser (i.e. the lexer depends upon the parser).

Parser

Yacc grammars are defined in Backus Naur form (BNF) and work by spotting patterns in an incoming stream of tokens. A simple ocamlyacc grammar for a calculator might be:

%{ type expr = Int of int | Add of expr * expr | Mul of expr * expr %}

%token <int> INT
%token PLUS MINUS TIMES OPEN CLOSE EOF

%start expr1
%type <expr> expr1

%left PLUS MINUS
%left TIMES DIVIDE

%%

expr:
| INT             { Int $1 }
| OPEN expr CLOSE { $2 }
| expr PLUS expr  { Add($1, $3) }
| expr MINUS expr { Add($1, Mul(Int (-1), $3)) }
| expr TIMES expr { Mul($1, $3) };

expr1:
| expr EOF { $1 };

The parser contains a preamble that defines the expr type, following by declarations of the tokens in the input stream, the start point of the parser, the return type of that parser, operator associativities (left or right) and precedences (from low to high) and, finally, the grammar itself. Note that the grammar-based parser uses precedence declarations to obviate the need for separate atom, factor and term parsers.

We can test this lexer and parser from the top-level by compiling them to OCaml source code:

$ ocamllex calcLexer.mll
9 states, 267 transitions, table size 1122 bytes
$ ocamlyacc calcParser.mly

and then loading them into a running top-level:

# #use "calcParser.ml";;
...
# #use "calcLexer.ml";;
...

Lexing is performed by a function called lex (defined by rule lex = parse ... in "calcLexer.mll") and parsing a single expression is performed by expr1:

# expr1 lex (Lexing.from_string "1+2*(3+4)");;
- : expr = Add (Int 1, Mul (Int 2, Add (Int 3, Int 4)))

The ability to write efficient lexers and parsers succinctly in OCaml makes it ideally suited to interpreter and compiler writing.