🧩

Parser Combinators

File: parser.ml

higher-order functions closures monadic composition recursive descent operator precedence algebraic data types polymorphism

Overview

A parser combinator library lets you build complex parsers by snapping together small, reusable parser functions β€” like LEGO bricks. Each parser is a first-class function that can be passed around, combined, and composed.

This is a quintessential functional programming pattern. The library includes primitives (single characters, strings, digits), combinators (sequencing, choice, repetition), and three complete example parsers: arithmetic expressions, integer lists, and key-value pairs.

Core Design

A parser is simply a function that takes an input string and position, and returns either a parsed value with the new position, or an error:

type 'a result =
  | Ok of 'a * int        (* value + new position *)
  | Error of string * int  (* message + error position *)

type 'a parser = string -> int -> 'a result

Primitive Parsers

(* Match a character by predicate *)
let satisfy pred desc = fun input pos ->
  if pos >= String.length input then Error "end of input"
  else if pred input.[pos] then Ok (input.[pos], pos + 1)
  else Error "no match"

(* Built on satisfy: *)
let digit  = satisfy (fun c -> c >= '0' && c <= '9') "digit"
let letter = satisfy (fun c -> (*...*)) "letter"
let char_  = fun expected -> satisfy (fun c -> c = expected)

Key Combinators

Bind (>>=) β€” Monadic Sequencing

Run parser A, then feed its result to a function that produces parser B. This is the fundamental way to sequence parsers:

let bind p f = fun input pos ->
  match p input pos with
  | Error _ as e -> e
  | Ok (a, pos') -> (f a) input pos'

(* Example: parse two letters into a string *)
let two_letters =
  letter >>= fun a ->
  letter >>= fun b ->
  return_ (String.init 2 (fun i -> if i = 0 then a else b))

Choice (<|>) β€” Try Alternatives

let bool_parser =
  (string_ "true"  *> return_ true)
  <|>
  (string_ "false" *> return_ false)

Repetition β€” many, sep_by, between

(* Parse a comma-separated list of integers in brackets *)
let int_list =
  between (char_ '[') (char_ ']')
    (sep_by integer (char_ ','))
(* "[1, 2, 3]" β†’ [1; 2; 3] *)

chainl1 / chainr1 β€” Operator Precedence

Parse left-associative or right-associative infix operators. This is the key to correct arithmetic precedence:

(* Left-assoc: 10 - 3 - 2 = (10-3)-2 = 5 *)
let expr = chainl1 term add_op

(* Right-assoc: 2^3^2 = 2^(3^2) = 512 *)
let power = chainr1 atom pow_op

Arithmetic Expression Parser

A complete calculator built entirely from combinators. Handles +, -, *, /, ^, and parentheses with correct precedence:

type expr =
  | Num of int
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr  | ...

(* Grammar with precedence layers: *)
let expr  = chainl1 term  add_op   (* lowest: + - *)
let term  = chainl1 power mul_op   (* middle: * / *)
let power = chainr1 atom  pow_op   (* highest: ^  *)
let atom  = integer <|> parens    (* base: num or (expr) *)

Output

=== Parser Combinators ===
Building parsers from small, composable pieces

--- Primitive parsers ---
digit on "42":    OK: 4
letter on "abc":  OK: a
char 'x' on "xyz": OK: x
string "hi" on "hi there": OK: "hi"

--- Number parsing ---
integer "42": OK: 42
integer "-7": OK: -7
integer "0": OK: 0
integer "12345": OK: 12345
integer "abc": Error: expected digit at position 0

--- Combinators: many, sep_by, between ---
int list "[1, 2, 3]": OK: [1; 2; 3]
int list "[]": OK: []
int list "[42]": OK: [42]

--- Key-value parser ---
kv: OK: name = "Alice"
kv list: OK: {name = "Alice", age = "30", city = "NYC"}

--- Arithmetic expression parser ---
Supports: + - * / ^ () with correct precedence

  42                              AST: 42                                  = 42
  2 + 3                           AST: (2 + 3)                             = 5
  2 + 3 * 4                       AST: (2 + (3 * 4))                       = 14
  (2 + 3) * 4                     AST: ((2 + 3) * 4)                       = 20
  2 ^ 3 ^ 2                       AST: (2 ^ (3 ^ 2))                       = 512
  10 - 3 - 2                      AST: ((10 - 3) - 2)                      = 5
  100 / 5 / 4                     AST: ((100 / 5) / 4)                     = 5
  (1 + 2) * (3 + 4)               AST: ((1 + 2) * (3 + 4))                 = 21
  2 ^ 10                          AST: (2 ^ 10)                             = 1024
  ((3 + 5) * 2) - (10 / 2)        AST: (((3 + 5) * 2) - (10 / 2))          = 11

--- Quick calculator ---
  calc "1+2+3" = 6
  calc "2*3+4" = 10
  calc "2*(3+4)" = 14
  calc "2^8" = 256
  calc "-5 + 10" = 5

Combinator Reference

CombinatorOperatorDescription
bind>>=Sequence two parsers (monadic bind)
map<$>Transform a parser's result
β€”<*>Applicative apply
β€”<|>Choice: try first, then second
β€”*>Sequence, keep right result
β€”<*Sequence, keep left result
manyβ€”Zero or more repetitions
many1β€”One or more repetitions
sep_byβ€”Delimited lists
betweenβ€”Parse between delimiters
chainl1β€”Left-associative operators
chainr1β€”Right-associative operators
optionalβ€”Maybe parse (returns option)
try_β€”Backtracking on failure

Concepts Introduced

  • Higher-order functions: Parsers are functions that take and return functions. Combinators are functions that take parser functions and produce new parser functions. Functions all the way down.
  • Monadic composition: The bind (>>=) operator sequences parsers while threading state (position). This is the same pattern as Haskell's IO monad, Result monad, etc.
  • Closures: Each parser captures its configuration (expected character, predicate) in a closure β€” a function bundled with its environment.
  • Recursive descent: The expression parser demonstrates mutual recursion β€” expr calls term, which calls atom, which can call back to expr via parentheses.
  • Operator precedence: Layered grammar rules (expr β†’ term β†’ power β†’ atom) naturally encode precedence without any special algorithm.
  • Polymorphism: The same combinators work with any value type β€” 'a parser can produce chars, ints, strings, AST nodes, or anything else.