๐Ÿ” Regular Expression Engine

Thompson's NFA construction with guaranteed linear-time matching

Overview

A complete regular expression engine built from scratch. Uses Thompson's NFA construction to guarantee O(nยทm) matching time (n = input length, m = pattern size) โ€” no pathological backtracking like PCRE-based engines.

This is a production-quality approach: the same algorithm used by grep, RE2, and Go's regexp package.

Concepts Demonstrated

Supported Syntax

PatternMatches
.Any character (except newline)
\d \w \sDigit, word char, whitespace
\D \W \SNegated character classes
[abc]Character class
[a-z]Character range
[^abc]Negated character class
* + ?Zero+, one+, optional
|Alternation
(...)Grouping
^ $Start/end anchors

Key Functions

FunctionDescription
compileParse pattern string โ†’ compiled regex
matchesFull-string match test
findFind first match (returns position + text)
find_allFind all non-overlapping matches
replaceReplace first match with string
replace_allReplace all matches
splitSplit string on pattern

Architecture: Thompson's NFA

(* 1. Parse: string โ†’ AST *)
type regex = Char of char_matcher | Seq of regex * regex
           | Alt of regex * regex | Star of regex | ...

(* 2. Compile: AST โ†’ NFA (Thompson's construction) *)
(* Each regex node becomes a small NFA fragment *)
(* Fragments are connected via ฮต-transitions *)

(* 3. Simulate: NFA ร— string โ†’ bool *)
(* Track set of active states; advance on each character *)
(* No backtracking โ€” always O(n ร— m) *)

This three-phase architecture separates concerns cleanly: parsing validates syntax, compilation builds the automaton, and simulation runs it efficiently.

Usage

let r = compile "ab+c" in
matches r "abbc"          (* โ†’ true  *)
matches r "ac"            (* โ†’ false *)
find r "xxxabbcyyy"       (* โ†’ Some (3, "abbc") *)
find_all r "abc abbc"     (* โ†’ ["abc"; "abbc"] *)
replace r "abc" "X"       (* โ†’ "X" *)
split r "xabcyabbcz"      (* โ†’ ["x"; "y"; "z"] *)

Why No Backtracking?

Many regex engines (Perl, Python, Java) use backtracking, which can take exponential time on pathological patterns like (a+)+b against "aaaaaa...". Thompson's NFA avoids this entirely by tracking all possible states simultaneously โ€” if any state reaches an accept state, the match succeeds.