๐ Regular Expression Engine
Thompson's NFA construction with guaranteed linear-time matching
Overview
A complete regular expression engine built from scratch. Uses Thompson's NFA construction to guarantee O(nยทm) matching time (n = input length, m = pattern size) โ no pathological backtracking like PCRE-based engines.
This is a production-quality approach: the same algorithm used by grep,
RE2, and Go's regexp package.
Concepts Demonstrated
- Algebraic data types โ regex AST and NFA state representation
- Recursive descent parsing โ regex syntax โ AST with operator precedence
- Thompson's construction โ AST โ NFA with ฮต-transitions
- NFA simulation โ set-based epsilon-closure (no backtracking)
- Module design โ clean API surface with internal implementation
Supported Syntax
| Pattern | Matches |
|---|---|
. | Any character (except newline) |
\d \w \s | Digit, word char, whitespace |
\D \W \S | Negated character classes |
[abc] | Character class |
[a-z] | Character range |
[^abc] | Negated character class |
* + ? | Zero+, one+, optional |
| | Alternation |
(...) | Grouping |
^ $ | Start/end anchors |
Key Functions
| Function | Description |
|---|---|
compile | Parse pattern string โ compiled regex |
matches | Full-string match test |
find | Find first match (returns position + text) |
find_all | Find all non-overlapping matches |
replace | Replace first match with string |
replace_all | Replace all matches |
split | Split string on pattern |
Architecture: Thompson's NFA
(* 1. Parse: string โ AST *)
type regex = Char of char_matcher | Seq of regex * regex
| Alt of regex * regex | Star of regex | ...
(* 2. Compile: AST โ NFA (Thompson's construction) *)
(* Each regex node becomes a small NFA fragment *)
(* Fragments are connected via ฮต-transitions *)
(* 3. Simulate: NFA ร string โ bool *)
(* Track set of active states; advance on each character *)
(* No backtracking โ always O(n ร m) *)
This three-phase architecture separates concerns cleanly: parsing validates syntax, compilation builds the automaton, and simulation runs it efficiently.
Usage
let r = compile "ab+c" in
matches r "abbc" (* โ true *)
matches r "ac" (* โ false *)
find r "xxxabbcyyy" (* โ Some (3, "abbc") *)
find_all r "abc abbc" (* โ ["abc"; "abbc"] *)
replace r "abc" "X" (* โ "X" *)
split r "xabcyabbcz" (* โ ["x"; "y"; "z"] *)
Why No Backtracking?
Many regex engines (Perl, Python, Java) use backtracking, which can take
exponential time on pathological patterns like (a+)+b
against "aaaaaa...". Thompson's NFA avoids this entirely by tracking
all possible states simultaneously โ if any state reaches an accept state,
the match succeeds.