🌸 Bloom Filter
Probabilistic set membership with no false negatives
Overview
A space-efficient probabilistic data structure that tests whether an element is a member of a set. May return false positives ("maybe in set") but never false negatives ("definitely not in set"). Used in databases, caches, and network protocols to avoid expensive lookups.
Concepts Demonstrated
- Probabilistic data structures — trading certainty for space
- Hash functions — multiple independent hashes via double hashing
- Bit arrays — compact boolean storage
- False positive analysis — mathematical bounds on error rate
- Optimal sizing — computing ideal filter size and hash count
How It Works
(* 1. Create: bit array of size m, k hash functions *)
(* 2. Add: set bits at h₁(x), h₂(x), ..., hₖ(x) *)
(* 3. Query: check if ALL k bits are set *)
(* - All set → "probably in set" (false positive possible) *)
(* - Any unset → "definitely not in set" (guaranteed) *)
let bf = create ~expected_items:1000 ~false_positive_rate:0.01
let bf = add "hello" bf
let _ = mem "hello" bf (* → true (definitely added) *)
let _ = mem "world" bf (* → false (definitely not, or rare FP) *)
Key Functions
| Function | Description |
|---|---|
create | Create filter with target capacity and FP rate |
add | Insert an element |
mem | Test membership (may return false positive) |
false_positive_rate | Current estimated FP probability |
size_bytes | Memory usage of the filter |
union | Merge two filters (bitwise OR) |
intersection | Intersect two filters (bitwise AND) |
When to Use
- Database queries — skip disk reads for keys that definitely don't exist
- Web crawlers — avoid re-visiting URLs
- Spell checkers — fast "not a word" rejection
- Network routing — content-addressable caches