š Earley Parser
Chart parser for any context-free grammar ā handles ambiguity and left-recursion
Overview
The Earley algorithm (Jay Earley, 1970) can parse any context-free grammar ā including ambiguous and left-recursive ones ā with O(n³) worst-case, O(n²) for unambiguous grammars, and O(n) for many practical grammars. Unlike recursive descent or combinator parsers, it never needs grammar rewriting to eliminate left recursion.
This implementation builds a parse chart of Earley items (dotted rules with origin positions) and supports shared packed parse forests (SPPF) for representing all parse trees of ambiguous input.
Concepts Demonstrated
- Chart parsing ā dynamic programming over substrings via item sets
- Earley items ā
(rule, dot_position, origin)triples - Three core operations ā Predict, Scan, Complete
- Nullable detection ā fixed-point iteration for ε-producing nonterminals
- SPPF ā compact representation of exponentially many parse trees
- Grammar DSL ā OCaml operators for readable grammar definitions
The Algorithm
(* Define a grammar for arithmetic expressions *)
let g = grammar "Expr" [
rule "Expr" [nt "Expr"; t '+'; nt "Term"]; (* left-recursive! *)
rule "Expr" [nt "Term"];
rule "Term" [nt "Term"; t '*'; nt "Factor"];
rule "Term" [nt "Factor"];
rule "Factor" [t '('; nt "Expr"; t ')'];
rule "Factor" [t_pred is_digit];
]
(* Parse "3+4*5" ā works despite left recursion *)
let chart = parse g "3+4*5"
(* Three operations per chart position:
PREDICT: if dot before nonterminal, add rules for that nonterminal
SCAN: if dot before terminal matching input, advance dot
COMPLETE: if dot at end, advance all items that predicted this rule *)
Earley vs Other Parsers
| Parser | Grammar Class | Left Recursion | Ambiguity |
|---|---|---|---|
| Earley | All CFGs | ā Handled | ā All trees via SPPF |
| Combinators | LL(ā) / PEG-like | ā Infinite loop | ā First match only |
| Recursive descent | LL(k) | ā Must rewrite | ā Not supported |
| LALR (yacc) | LALR(1) | ā Handled | ā Shift-reduce conflicts |
When to Use Earley
- Natural language processing ā NL grammars are inherently ambiguous
- Grammar prototyping ā write grammars naturally, no refactoring needed
- Language workbenches ā parse user-defined grammars at runtime
- Education ā clearly demonstrates CFG parsing theory
Related Modules
- Parser Combinators ā composable parsers via higher-order functions
- Regex Engine ā regular expression matching