Implementation Details¶

This page describes the key implementation decisions and internal architecture of VoronoiMap.

Module Structure¶

The entire algorithm is implemented in a single module (vormap.py) for simplicity. The module contains:

vormap.py
├── Global state (bounds, caches)
├── Data loading (load_data, compute_bounds)
├── Nearest neighbor (get_NN, Oracle)
├── Geometry helpers (mid_point, perp_dir, collinear, eudist, polygon_area)
├── Voronoi construction (find_area, find_a1, new_dir, bin_search)
├── Line intersection (isect, isect_B, find_CXY, find_BXY)
├── Estimation (get_sum)
└── CLI (main)

Caching Architecture¶

VoronoiMap uses a multi-level caching strategy to minimize redundant computation:

Data Cache¶

_data_cache = {}  # filename → list of (lng, lat) tuples

load_data() reads a file once and caches the parsed point list. Subsequent calls return the same list object (identity-preserving).

KDTree Cache¶

_kdtree_cache = {}      # filename → KDTree
_kdtree_by_id = {}      # id(data_list) → KDTree

When SciPy is available, a KDTree is built at load time. The _kdtree_by_id dict provides O(1) tree lookup in get_NN() using id(data) as the key, avoiding the older O(n) identity scan pattern.

Binary Search for Boundaries¶

The core of Voronoi region discovery is binary search along rays:

def bin_search(data, x1, y1, x2, y2, dlng, dlat):
    """Find the boundary between two Voronoi regions."""
    for _ in range(BIN_SEARCH_MAX_ITER):  # max 100 iterations
        if eudist(x1, y1, x2, y2) <= BIN_PREC:  # 1e-6 threshold
            break
        xm = (x1 + x2) / 2
        ym = (y1 + y2) / 2
        lg, lt = get_NN(data, xm, ym)
        # If midpoint's NN is the same as the target, narrow toward boundary
        # Otherwise, narrow toward the target

The iteration limit (100) far exceeds what's needed for float64 precision (52-bit mantissa ≈ 52 iterations), providing a safety margin.

Boundary Intersection¶

isect_B() computes where a line from a point with a given slope intersects the search space boundary rectangle. It handles four cases:

Vertical line (slope = inf): Intersects north and south edges
Horizontal line (slope = 0): Intersects west and east edges
General case: Checks all four edges and returns the two that are hit

Robustness Measures¶

Collinearity Detection¶

The collinear() function uses cross-product with relative scaling:

cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
scale = max(len1 * len2, 1e-12)
return abs(cross) / scale < eps

This handles both very large and very small coordinate ranges, unlike the previous slope-comparison approach.

Parallel Line Detection¶

isect() uses relative tolerance for near-parallel lines:

if abs(m1 - m2) < 1e-10 * max(abs(m1), abs(m2), 1.0):
    return -1, -1  # parallel, no intersection

This prevents wildly large intersection coordinates from floating-point near-misses.

Degenerate Region Handling¶

Zero-area regions: Excluded from the get_sum() average using a clean valid-estimates list
Excessive vertices: MAX_VERTICES = 50 limit prevents infinite polygon tracing
Division by zero: Guards in find_CXY() and find_BXY() when boundary endpoints coincide

Security Hardening¶

Path Traversal Protection¶

# Reject absolute paths
if os.path.isabs(filename):
    raise ValueError(...)

# Verify resolved path stays inside data/
data_dir = os.path.abspath("data")
resolved = os.path.abspath(os.path.join("data", filename))
if not resolved.startswith(data_dir + os.sep):
    raise ValueError(...)

Input Sanitization¶

Non-numeric lines are skipped (not crashed on)
NaN and Inf coordinates are rejected via math.isfinite()
Empty files raise ValueError

Performance Characteristics¶

With SciPy (Recommended)¶

Dataset Size	NN Query	Full Estimation (k=10)
10 points	~1μs	~50ms
100 points	~2μs	~200ms
1,000 points	~3μs	~500ms
10,000 points	~4μs	~1s

Without SciPy¶

Dataset Size	NN Query	Full Estimation (k=10)
10 points	~5μs	~60ms
100 points	~50μs	~600ms
1,000 points	~500μs	~6s
10,000 points	~5ms	~60s

The KDTree provides the biggest speedup for large datasets where each NN query would otherwise be O(n).