# 📦 Huffman Coding

*Data compression with variable-length prefix codes*

## Overview
Implements Huffman coding, an optimal prefix-free compression algorithm. Builds a binary tree from character frequencies, assigning shorter codes to more frequent characters. Includes encoding, decoding, code table display, and compression statistics.
## Concepts Demonstrated
- Algebraic data types — binary tree with Leaf and Node constructors
- Priority queues — greedy construction using min-frequency merging
- Recursive tree traversal — code generation via path accumulation
- Bit manipulation — packing variable-length codes into bytes
- Hash tables — character frequency counting and code lookup
- Compression analysis — ratio calculation, bits-per-character stats
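The bit-manipulation bullet above can be sketched concretely. This is a minimal, hypothetical `pack_bits` (the name and signature are illustrative, not the project's API): it packs a string of `'0'`/`'1'` characters into bytes, most significant bit first, padding the final byte with zeros and returning the true bit count so the padding can be ignored when unpacking.

```ocaml
(* Pack a bit string ("1001...") into bytes, MSB first.
   Returns (bytes, bit_count); trailing bits of the last byte are zero padding. *)
let pack_bits bits =
  let n = String.length bits in
  let out = Bytes.make ((n + 7) / 8) '\000' in
  String.iteri
    (fun i b ->
      if b = '1' then begin
        let byte = i / 8 and off = 7 - (i mod 8) in
        Bytes.set out byte
          (Char.chr (Char.code (Bytes.get out byte) lor (1 lsl off)))
      end)
    bits;
  (out, n)
```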
## Key Functions
| Function | Description |
|---|---|
| `build_tree` | Construct Huffman tree from character frequencies |
| `generate_codes` | Extract prefix codes from tree |
| `encode` | Compress string → bit sequence |
| `decode` | Decompress bit sequence → original string |
| `compress` | Full compression pipeline (string → compressed data) |
| `decompress` | Full decompression pipeline |
| `compression_ratio` | Calculate space savings |
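As a back-of-the-envelope sketch of the ratio calculation (the project's actual `compression_ratio` signature may differ), space savings can be computed by comparing the encoded bit count against 8 bits per input character:

```ocaml
(* Fraction of space saved vs. 8-bit encoding:
   e.g. 10 encoded bits for a 5-char input -> 1 - 10/40 = 0.75 *)
let compression_ratio original_len encoded_bits =
  1.0 -. float_of_int encoded_bits /. float_of_int (8 * original_len)
```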
## How It Works
```ocaml
(* Huffman tree *)
type tree = Leaf of char * int | Node of tree * tree * int

(* Algorithm:
   1. Count character frequencies
   2. Create a leaf node per character
   3. Repeatedly merge the two lowest-frequency nodes
   4. Result: an optimal binary tree
   5. Left edge = 0, right edge = 1 → code for each char *)
```
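The steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: frequencies are counted with a hash table, and a re-sorted list stands in for a real priority queue (fine for small alphabets, O(n² log n) in general).

```ocaml
type tree = Leaf of char * int | Node of tree * tree * int

(* Frequency stored at every node, leaf or internal. *)
let freq = function Leaf (_, f) | Node (_, _, f) -> f

(* Step 1: count character frequencies with a hash table. *)
let frequencies s =
  let tbl = Hashtbl.create 16 in
  String.iter
    (fun c ->
      let n = try Hashtbl.find tbl c with Not_found -> 0 in
      Hashtbl.replace tbl c (n + 1))
    s;
  Hashtbl.fold (fun c n acc -> (c, n) :: acc) tbl []

(* Steps 2-4: one leaf per character, then greedily merge the
   two lowest-frequency nodes until one root remains. *)
let build_tree s =
  let leaves = List.map (fun (c, n) -> Leaf (c, n)) (frequencies s) in
  let rec merge = function
    | [] -> failwith "empty input"
    | [ t ] -> t
    | ts ->
        (match List.sort (fun a b -> compare (freq a) (freq b)) ts with
         | a :: b :: rest -> merge (Node (a, b, freq a + freq b) :: rest)
         | _ -> assert false)
  in
  merge leaves
```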
```ocaml
(* Example: "hello" *)
(* 'h':1  'e':1  'l':2  'o':1 *)
(* Codes: 'l'→"0"  'h'→"100"  'e'→"101"  'o'→"11" *)
(* Encoded: 100 101 0 0 11 = "1001010011" (10 bits vs 40 in 8-bit ASCII) *)
```
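The example codes can be reproduced with a small path-accumulation sketch. The tree below is built by hand to match the example (left = 0, right = 1), and `generate_codes`/`encode` here are illustrative, not necessarily the project's signatures:

```ocaml
type tree = Leaf of char * int | Node of tree * tree * int

(* Step 5: accumulate the 0/1 path from root to each leaf. *)
let rec generate_codes path = function
  | Leaf (c, _) -> [ (c, path) ]
  | Node (l, r, _) ->
      generate_codes (path ^ "0") l @ generate_codes (path ^ "1") r

(* Tree for "hello": 'l' on the left; (h e) and 'o' on the right. *)
let example =
  Node (Leaf ('l', 2),
        Node (Node (Leaf ('h', 1), Leaf ('e', 1), 2), Leaf ('o', 1), 3),
        5)

(* Encode: concatenate the code of each character. *)
let encode codes s =
  String.to_seq s
  |> Seq.map (fun c -> List.assoc c codes)
  |> List.of_seq
  |> String.concat ""
```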
## Optimality
Huffman coding produces the optimal prefix-free code for a given character frequency distribution. The greedy algorithm (always merging the two least-frequent nodes) provably minimizes the expected code length. The prefix-free property means no code is a prefix of another, so decoding is unambiguous.
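Because no code is a prefix of another, decoding needs no delimiters: walk 0/1 edges from the root, emit a character at each leaf, and restart at the root. A sketch (reusing the hand-built "hello" tree; assumes at least two distinct characters, since a single-leaf tree has zero-length codes):

```ocaml
type tree = Leaf of char * int | Node of tree * tree * int

(* Same tree as the "hello" example: 'l'→"0" 'h'→"100" 'e'→"101" 'o'→"11". *)
let example =
  Node (Leaf ('l', 2),
        Node (Node (Leaf ('h', 1), Leaf ('e', 1), 2), Leaf ('o', 1), 3),
        5)

(* Walk the tree bit by bit; each leaf emits a character and
   resets the walk to the root. *)
let decode root bits =
  let buf = Buffer.create 16 in
  let rec walk node i =
    match node with
    | Leaf (c, _) ->
        Buffer.add_char buf c;
        if i < String.length bits then walk root i
    | Node (l, r, _) ->
        if i >= String.length bits then invalid_arg "truncated bits"
        else walk (if bits.[i] = '0' then l else r) (i + 1)
  in
  if String.length bits > 0 then walk root 0;
  Buffer.contents buf
```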