# 📦 Huffman Coding

*Data compression with variable-length prefix codes*

## Overview
Implements Huffman coding, an optimal prefix-free compression algorithm. Builds a binary tree from character frequencies, assigning shorter codes to more frequent characters. Includes encoding, decoding, code table display, and compression statistics.
## Concepts Demonstrated
- Algebraic data types — binary tree with Leaf and Node constructors
- Priority queues — greedy construction using min-frequency merging
- Recursive tree traversal — code generation via path accumulation
- Bit manipulation — packing variable-length codes into bytes
- Hash tables — character frequency counting and code lookup
- Compression analysis — ratio calculation, bits-per-character stats
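The bit-manipulation bullet above can be sketched concretely. This is a minimal, hypothetical `pack_bits` (the name and signature are illustrative, not the project's API): it packs a string of `'0'`/`'1'` characters into bytes, most significant bit first, padding the final byte with zeros and returning the true bit count so the padding can be ignored when unpacking.

```ocaml
(* Pack a bit string ("1001...") into bytes, MSB first.
   Returns (bytes, bit_count); trailing bits of the last byte are zero padding. *)
let pack_bits bits =
  let n = String.length bits in
  let out = Bytes.make ((n + 7) / 8) '\000' in
  String.iteri
    (fun i b ->
      if b = '1' then begin
        let byte = i / 8 and off = 7 - (i mod 8) in
        Bytes.set out byte
          (Char.chr (Char.code (Bytes.get out byte) lor (1 lsl off)))
      end)
    bits;
  (out, n)
```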
## Key Functions
| Function | Description |
|---|---|
| `build_tree` | Construct Huffman tree from character frequencies |
| `generate_codes` | Extract prefix codes from tree |
| `encode` | Compress string → bit sequence |
| `decode` | Decompress bit sequence → original string |
| `compress` | Full compression pipeline (string → compressed data) |
| `decompress` | Full decompression pipeline |
| `compression_ratio` | Calculate space savings |
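As a back-of-the-envelope sketch of the ratio calculation (the project's actual `compression_ratio` signature may differ), space savings can be computed by comparing the encoded bit count against 8 bits per input character:

```ocaml
(* Fraction of space saved vs. 8-bit encoding:
   e.g. 10 encoded bits for a 5-char input -> 1 - 10/40 = 0.75 *)
let compression_ratio original_len encoded_bits =
  1.0 -. float_of_int encoded_bits /. float_of_int (8 * original_len)
```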
## How It Works
```ocaml
(* Huffman tree *)
type tree = Leaf of char * int | Node of tree * tree * int

(* Algorithm:
   1. Count character frequencies
   2. Create a leaf node per character
   3. Repeatedly merge the two lowest-frequency nodes
   4. Result: an optimal binary tree
   5. Left edge = 0, right edge = 1 → code for each char *)
```
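The steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: frequencies are counted with a hash table, and a re-sorted list stands in for a real priority queue (fine for small alphabets, O(n² log n) in general).

```ocaml
type tree = Leaf of char * int | Node of tree * tree * int

(* Frequency stored at every node, leaf or internal. *)
let freq = function Leaf (_, f) | Node (_, _, f) -> f

(* Step 1: count character frequencies with a hash table. *)
let frequencies s =
  let tbl = Hashtbl.create 16 in
  String.iter
    (fun c ->
      let n = try Hashtbl.find tbl c with Not_found -> 0 in
      Hashtbl.replace tbl c (n + 1))
    s;
  Hashtbl.fold (fun c n acc -> (c, n) :: acc) tbl []

(* Steps 2-4: one leaf per character, then greedily merge the
   two lowest-frequency nodes until one root remains. *)
let build_tree s =
  let leaves = List.map (fun (c, n) -> Leaf (c, n)) (frequencies s) in
  let rec merge = function
    | [] -> failwith "empty input"
    | [ t ] -> t
    | ts ->
        (match List.sort (fun a b -> compare (freq a) (freq b)) ts with
         | a :: b :: rest -> merge (Node (a, b, freq a + freq b) :: rest)
         | _ -> assert false)
  in
  merge leaves
```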
```ocaml
(* Example: "hello" *)
(* 'h':1  'e':1  'l':2  'o':1 *)
(* Codes: 'l'→"0"  'h'→"100"  'e'→"101"  'o'→"11" *)
(* Encoded: 100 101 0 0 11 = "1001010011" (10 bits vs 40 in 8-bit ASCII) *)
```
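The example codes can be reproduced with a small path-accumulation sketch. The tree below is built by hand to match the example (left = 0, right = 1), and `generate_codes`/`encode` here are illustrative, not necessarily the project's signatures:

```ocaml
type tree = Leaf of char * int | Node of tree * tree * int

(* Step 5: accumulate the 0/1 path from root to each leaf. *)
let rec generate_codes path = function
  | Leaf (c, _) -> [ (c, path) ]
  | Node (l, r, _) ->
      generate_codes (path ^ "0") l @ generate_codes (path ^ "1") r

(* Tree for "hello": 'l' on the left; (h e) and 'o' on the right. *)
let example =
  Node (Leaf ('l', 2),
        Node (Node (Leaf ('h', 1), Leaf ('e', 1), 2), Leaf ('o', 1), 3),
        5)

(* Encode: concatenate the code of each character. *)
let encode codes s =
  String.to_seq s
  |> Seq.map (fun c -> List.assoc c codes)
  |> List.of_seq
  |> String.concat ""
```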
## Optimality
Huffman coding produces the optimal prefix-free code for a given character frequency distribution. The greedy algorithm (always merging the two least-frequent nodes) provably minimizes the expected code length. The prefix-free property means no code is a prefix of another, so decoding is unambiguous.
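Because no code is a prefix of another, decoding needs no delimiters: walk 0/1 edges from the root, emit a character at each leaf, and restart at the root. A sketch (reusing the hand-built "hello" tree; assumes at least two distinct characters, since a single-leaf tree has zero-length codes):

```ocaml
type tree = Leaf of char * int | Node of tree * tree * int

(* Same tree as the "hello" example: 'l'→"0" 'h'→"100" 'e'→"101" 'o'→"11". *)
let example =
  Node (Leaf ('l', 2),
        Node (Node (Leaf ('h', 1), Leaf ('e', 1), 2), Leaf ('o', 1), 3),
        5)

(* Walk the tree bit by bit; each leaf emits a character and
   resets the walk to the root. *)
let decode root bits =
  let buf = Buffer.create 16 in
  let rec walk node i =
    match node with
    | Leaf (c, _) ->
        Buffer.add_char buf c;
        if i < String.length bits then walk root i
    | Node (l, r, _) ->
        if i >= String.length bits then invalid_arg "truncated bits"
        else walk (if bits.[i] = '0' then l else r) (i + 1)
  in
  if String.length bits > 0 then walk root 0;
  Buffer.contents buf
```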