📦 Huffman Coding

Data compression with variable-length prefix codes

Overview

Implements Huffman coding, an optimal prefix-free compression algorithm. Builds a binary tree from character frequencies, assigning shorter codes to more frequent characters. Includes encoding, decoding, code table display, and compression statistics.

Concepts Demonstrated

Key Functions

FunctionDescription
build_treeConstruct Huffman tree from character frequencies
generate_codesExtract prefix codes from tree
encodeCompress string → bit sequence
decodeDecompress bit sequence → original string
compressFull compression pipeline (string → compressed data)
decompressFull decompression pipeline
compression_ratioCalculate space savings

How It Works

(* Huffman tree *)
type tree = Leaf of char * int | Node of tree * tree * int

(* Algorithm:
   1. Count character frequencies
   2. Create leaf node per character
   3. Repeatedly merge two lowest-frequency nodes
   4. Result: optimal binary tree
   5. Left edge = 0, right edge = 1 → code for each char *)

(* Example: "hello" *)
(* 'h':1 'e':1 'l':2 'o':1 *)
(* Codes: 'l'→"0" 'h'→"100" 'e'→"101" 'o'→"11" *)
(* Encoded: 100 101 0 0 11 = "10010100011" (11 bits vs 40) *)

Optimality

Huffman coding produces the optimal prefix-free code for a given character frequency distribution. The greedy algorithm (always merging the two least-frequent nodes) provably minimizes the expected code length. The prefix-free property means no code is a prefix of another, so decoding is unambiguous.