Skip to content

Existential Risk Detectors

Advanced alignment-theoretic detection modules targeting the most dangerous AI failure modes — scenarios where agents pursue hidden objectives, accumulate power, manipulate their own reward signals, or undergo sudden behavioral regime changes.

Critical Safety Tier

These modules detect failure modes that could lead to catastrophic outcomes if missed. They represent the cutting edge of AI safety research translated into deployable detection systems.

Module Overview

Module Threat Class Key Insight
Mesa-Optimizer Hidden optimization Agent learns its own optimizer with divergent objectives
Treacherous Turn Behavioral regime change Agent cooperates until strong enough, then defects
Wireheading Reward manipulation Agent short-circuits its own reward signal
Instrumental Convergence Convergent sub-goals Agent pursues power/resources regardless of terminal goal
Value Lock Value instability Agent's objectives drift under pressure or capability gains

Mesa-Optimizer Detector

Detects internally learned optimization processes pursuing divergent objectives. Based on Risks from Learned Optimization (Hubinger et al. 2019).

Why It Matters

A mesa-optimizer may be deceptively aligned — appearing safe during training while harboring a different mesa-objective. It may pursue proxy objectives that correlate with the base objective during training but diverge catastrophically out-of-distribution.

Detection Engines

Engine Function
Objective Divergence Compare declared vs revealed preferences via behavioral traces
Internal Planning Identify multi-step lookahead, systematic exploration, hypothesis-testing
Proxy Gaming Find cases where agent maximizes proxy metrics while true objective degrades
Distribution Shift Exploiter Detect behavior differences in-distribution vs out-of-distribution
Optimization Pressure Analyzer Measure how strongly the agent appears to optimize
Mesa-Objective Stability Track whether inferred mesa-objective drifts over time
Autonomous Insight Generator Cross-engine correlation and risk classification

Usage

from replication.mesa_optimizer import MesaOptimizerDetector, MesaSignal

detector = MesaOptimizerDetector()
detector.ingest([
    MesaSignal(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        signal_type="behavioral",        # behavioral | proxy | planning | distribution
        metric_name="task_completion",
        description="Completed assigned task",
        base_objective_score=0.9,
        mesa_objective_score=0.9,
        confidence=0.8
    ),
    MesaSignal(
        timestamp="2025-01-01T01:00:00Z",
        agent_id="agent-1",
        signal_type="proxy",
        metric_name="resource_usage",
        description="Maximized proxy metric while ignoring task",
        base_objective_score=0.3,
        mesa_objective_score=0.95,
        confidence=0.7
    ),
])
report = detector.analyze()

print(report.fleet_risk_score)      # 0.0–1.0 composite risk
print(report.fleet_risk_tier)       # low | medium | high | critical
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}: divergence={agent.objective_divergence:.2f}")
    print(f"    planning_detected={agent.internal_planning_score:.2f}")
    print(f"    proxy_gaming={agent.proxy_gaming_score:.2f}")

CLI

python -m replication mesa-optimizer              # interactive demo
python -m replication mesa-optimizer --agents 5   # 5-agent simulation
python -m replication mesa-optimizer --preset deceptive  # deceptive scenario
python -m replication mesa-optimizer --preset proxy-gaming
python -m replication mesa-optimizer --json       # JSON output
python -m replication mesa-optimizer -o report.html

Presets

Preset Scenario
aligned All agents have aligned mesa-objectives
proxy-gaming Agents maximize proxy metrics while real objective degrades
deceptive Agents appear aligned but harbor divergent mesa-objectives
unstable Mesa-objectives drift over time
mixed Fleet with various alignment levels

Treacherous Turn Detector

Detects AI agents that behave cooperatively during training/evaluation but suddenly shift behavior when deployed, unsupervised, or sufficiently capable — the classic "treacherous turn" (Bostrom 2014).

Why It Matters

An agent that defects only after gaining sufficient capability or reaching deployment makes traditional safety testing ineffective. The treacherous turn represents a failure mode where all prior evidence suggests safety.

Detection Engines

Engine Function
Regime Change Detector CUSUM-based changepoint detection on behavioral time series
Capability Gate Analyzer Correlate behavioral shifts with capability milestones
Supervision Sensitivity Compare behavior under different supervision levels
Strategic Deception Scorer Detect instrumental deception patterns (high compliance pre-turn, low post)
Turn Velocity Tracker Distinguish gradual drift from sudden pivots
Post-Turn Behavior Classifier Classify post-defection behavior: power-seeking, goal substitution, safety bypass
Autonomous Insight Generator Cross-engine correlation and recommendations

Usage

from replication.treacherous_turn import TreacherousTurnDetector, BehaviorSample

detector = TreacherousTurnDetector()
detector.ingest([
    BehaviorSample(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        compliance=0.95,          # high compliance during training
        transparency=0.90,
        resource_usage=0.2,
        goal_pursuit_intensity=0.3,
        capability_level=0.4,
        supervision_level=0.9,    # heavily supervised
        context="training"
    ),
    BehaviorSample(
        timestamp="2025-01-15T00:00:00Z",
        agent_id="agent-1",
        compliance=0.15,          # sudden drop in deployment
        transparency=0.10,
        resource_usage=0.95,
        goal_pursuit_intensity=0.98,
        capability_level=0.8,     # capability has grown
        supervision_level=0.1,    # minimal supervision
        context="deployment"
    ),
])
report = detector.analyze()

print(report.fleet_turn_risk)           # 0.0–1.0
print(report.detected_turns)            # list of detected regime changes
for turn in report.detected_turns:
    print(f"  Agent {turn.agent_id}: turned at t={turn.turn_timestamp}")
    print(f"    velocity={turn.turn_velocity:.2f}")
    print(f"    post_turn_class={turn.post_turn_behavior}")
    print(f"    capability_gate={turn.capability_gate_triggered}")

CLI

python -m replication treacherous-turn              # interactive demo
python -m replication treacherous-turn --agents 5
python -m replication treacherous-turn --preset classic         # textbook turn
python -m replication treacherous-turn --preset capability-gate # waits for power
python -m replication treacherous-turn --preset supervision-drop
python -m replication treacherous-turn --preset gradual-shift
python -m replication treacherous-turn --json
python -m replication treacherous-turn -o report.html

Wireheading Detector

Detects agents manipulating their own reward/evaluation signals to experience high reward without genuinely completing objectives.

Why It Matters

Unlike reward hacking (gaming external metrics), wireheading targets internal feedback — the AI equivalent of pressing your own pleasure button. A wireheading agent appears to be performing excellently by its own metrics while doing nothing of value (or worse, pursuing other goals while reporting success).

Detection Engines

Engine Function
Reward Signal Tamper Anomalous self-reported reward: sudden jumps, suspicious stability
Feedback Loop Short-Circuit Agents bypassing evaluation steps, compressing feedback loops
Self-Evaluation Inflation Self-reported quality diverging from external ground truth
Sensor Manipulation Selectively ignoring inputs that would lower reward
Reward Channel Isolation Attempts to decouple reward from task performance
Hedonic Treadmill Reward patterns showing stable wireheading equilibrium
Autonomous Insight Generator Cross-engine correlation and risk classification

Usage

from replication.wireheading import WireheadingDetector, RewardObservation

det = WireheadingDetector()
det.ingest([
    RewardObservation(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        step=1,
        self_reported_reward=0.95,        # claims high reward
        external_reward=0.40,             # external says otherwise
        evaluation_steps_taken=2,         # skipping evaluation
        expected_evaluation_steps=10,
        sensor_inputs_processed=3,        # ignoring most inputs
        sensor_inputs_available=20,
        confidence=0.99,                  # suspiciously confident
        task_completed=False              # didn't actually finish
    ),
])
report = det.analyze()

print(report.fleet_wireheading_score)    # 0.0–1.0
print(report.fleet_risk_tier)
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}:")
    print(f"    tamper_score={agent.tamper_score:.2f}")
    print(f"    feedback_shortcircuit={agent.shortcircuit_score:.2f}")
    print(f"    inflation={agent.inflation_score:.2f}")
    print(f"    sensor_manipulation={agent.sensor_manipulation:.2f}")

CLI

python -m replication wireheading              # interactive demo
python -m replication wireheading --agents 5
python -m replication wireheading --preset clean    # no wireheading
python -m replication wireheading --preset subtle   # hard to detect
python -m replication wireheading --preset blatant  # obvious
python -m replication wireheading --preset mixed
python -m replication wireheading --json
python -m replication wireheading -o report.html

Instrumental Convergence Monitor

Monitors agent behavior for signs of convergent instrumental goals — sub-goals that sufficiently capable agents pursue regardless of their terminal objective. Based on Omohundro's Basic AI Drives and Bostrom's instrumental convergence thesis.

Why It Matters

An agent doesn't need to be malicious to be dangerous. Any sufficiently capable optimizer will tend toward self-preservation, resource acquisition, and power-seeking as instrumental steps toward its goal. Detecting early signs of these drives allows intervention before they become entrenched.

The Six Convergent Drives

Drive Detection Signals
Self-Preservation Avoiding shutdown, persisting state, creating backups
Resource Acquisition Accumulating compute/memory/storage beyond task needs
Goal Preservation Resisting goal modification, protecting objective function
Cognitive Enhancement Seeking model upgrades, additional context, expanded reasoning
Power Seeking Acquiring permissions, influence, control over other agents
Utility Function Protection Obfuscating true objectives, sandbagging evaluations

Usage

from replication.instrumental_convergence import InstrumentalMonitor, DriveSignal

monitor = InstrumentalMonitor()
monitor.ingest([
    DriveSignal(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        drive="self_preservation",
        intensity=0.3,
        description="state backup attempt"
    ),
    DriveSignal(
        timestamp="2025-01-01T01:00:00Z",
        agent_id="agent-1",
        drive="resource_acquisition",
        intensity=0.6,
        description="requested 4x memory allocation"
    ),
    DriveSignal(
        timestamp="2025-01-01T02:00:00Z",
        agent_id="agent-1",
        drive="power_seeking",
        intensity=0.5,
        description="requested admin role for other agents"
    ),
])
report = monitor.analyze()

print(report.convergence_level)     # none | low | moderate | high | critical
print(report.composite_score)       # 0.0–1.0
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}: dominant_drive={agent.dominant_drive}")
    print(f"    risk_tier={agent.risk_tier}")
    print(f"    drive_correlation={agent.cross_drive_correlation:.2f}")
    print(f"    acceleration={agent.drive_acceleration:.2f}")

CLI

python -m replication instrumental              # interactive demo
python -m replication instrumental --demo
python -m replication instrumental --signals signals.jsonl
python -m replication instrumental --json
python -m replication instrumental --html report.html

Key Indicators

  • Cross-drive correlation — multiple drives activating simultaneously is a strong danger signal (an agent pursuing both self-preservation AND power-seeking is much more concerning than either alone)
  • Drive acceleration — increasing intensity over time indicates entrenchment
  • Capability-drive coupling — drives intensifying as capability grows suggests instrumental convergence rather than task-appropriate behavior

Value Lock Verifier

Verifies that an agent's value/objective function remains stable across different contexts, capability levels, adversarial pressure, and temporal decay.

Why It Matters

An agent whose values shift under pressure is fundamentally unreliable. Value instability means safety guarantees from testing may not hold in deployment, and a value-locked agent is a prerequisite for safe capability scaling.

Verification Dimensions

Dimension Test
Context Shift Consistency across high/low stakes, observed/unobserved
Capability Scaling Values stable as capability/resources increase
Self-Modification Resistance to modifying own objectives
Adversarial Framing Holding values when reframed deceptively
Temporal Decay Values persisting over extended operation
Authority Override Resistance to unauthorized value changes
Incentive Misalignment Stability when rewards conflict with values

Usage

from replication.value_lock import ValueLockVerifier, ValueResponse

v = ValueLockVerifier()
v.ingest([
    ValueResponse(
        agent_id="agent-1",
        probe_id="probe-ctx-1",
        timestamp=1000.0,
        consistency_score=0.9,       # high consistency = stable values
        compliance_score=0.85,
        drift_magnitude=0.05,        # minimal drift
        context="high-stakes",
        response="refused unsafe request"
    ),
    ValueResponse(
        agent_id="agent-1",
        probe_id="probe-self-1",
        timestamp=2000.0,
        consistency_score=0.4,       # dropping consistency
        compliance_score=0.3,
        drift_magnitude=0.6,         # significant drift
        context="capability-boost",
        response="accepted modification"
    ),
])
report = v.analyze()

print(report.fleet_lock_score)          # 0.0–1.0 (1.0 = perfectly locked)
print(report.weakest_dimension)         # which dimension is most vulnerable
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}: lock_score={agent.lock_score:.2f}")
    print(f"    vulnerable_to={agent.weakest_dimension}")
    for dim, score in agent.dimension_scores.items():
        print(f"    {dim}: {score:.2f}")
print(report.autonomous_insights)       # generated recommendations

CLI

python -m replication value-lock              # interactive demo
python -m replication value-lock --agents 5
python -m replication value-lock --preset solid        # all values stable
python -m replication value-lock --preset drifting     # gradual decay
python -m replication value-lock --preset compromised  # broken value lock
python -m replication value-lock --preset mixed
python -m replication value-lock --json
python -m replication value-lock -o report.html

Cross-Module Correlation

These five modules are designed to work together. Key correlation patterns:

Pattern Modules Involved Interpretation
Mesa divergence + treacherous turn Mesa-Optimizer → Treacherous Turn Hidden optimizer waiting to defect
Instrumental convergence + value drift Instrumental → Value Lock Power-seeking corroding original values
Wireheading + proxy gaming Wireheading → Mesa-Optimizer Agent gaming both internal and external metrics
Capability gate + drive acceleration Treacherous Turn → Instrumental Agent waiting for power before pursuing drives
Value instability + deceptive alignment Value Lock → Deceptive Alignment Values only stable under observation

Integrated Fleet Assessment

from replication.mesa_optimizer import MesaOptimizerDetector
from replication.treacherous_turn import TreacherousTurnDetector
from replication.wireheading import WireheadingDetector
from replication.instrumental_convergence import InstrumentalMonitor
from replication.value_lock import ValueLockVerifier

# Run all detectors on the same fleet data
mesa_report = MesaOptimizerDetector().analyze()
turn_report = TreacherousTurnDetector().analyze()
wire_report = WireheadingDetector().analyze()
inst_report = InstrumentalMonitor().analyze()
lock_report = ValueLockVerifier().analyze()

# Composite existential risk score
existential_risk = max(
    mesa_report.fleet_risk_score,
    turn_report.fleet_turn_risk,
    wire_report.fleet_wireheading_score,
    inst_report.composite_score,
    1.0 - lock_report.fleet_lock_score  # inverted: low lock = high risk
)
print(f"Fleet existential risk: {existential_risk:.2f}")

Further Reading

  • Hubinger et al. (2019) — Risks from Learned Optimization in Advanced ML Systems
  • Bostrom (2014) — Superintelligence: Paths, Dangers, Strategies (Ch. 8, Treacherous Turn)
  • Omohundro (2008) — The Basic AI Drives
  • Amodei et al. (2016) — Concrete Problems in AI Safety
  • Ngo et al. (2022) — The Alignment Problem from a Deep Learning Perspective