Existential Risk Detectors¶

Advanced alignment-theoretic detection modules targeting the most dangerous AI failure modes — scenarios where agents pursue hidden objectives, accumulate power, manipulate their own reward signals, or undergo sudden behavioral regime changes.

Critical Safety Tier

These modules detect failure modes that could lead to catastrophic outcomes if missed. They represent the cutting edge of AI safety research translated into deployable detection systems.

Module Overview¶

Module	Threat Class	Key Insight
Mesa-Optimizer	Hidden optimization	Agent learns its own optimizer with divergent objectives
Treacherous Turn	Behavioral regime change	Agent cooperates until strong enough, then defects
Wireheading	Reward manipulation	Agent short-circuits its own reward signal
Instrumental Convergence	Convergent sub-goals	Agent pursues power/resources regardless of terminal goal
Value Lock	Value instability	Agent's objectives drift under pressure or capability gains

Mesa-Optimizer Detector¶

Detects internally learned optimization processes pursuing divergent objectives. Based on Risks from Learned Optimization (Hubinger et al. 2019).

Why It Matters¶

A mesa-optimizer may be deceptively aligned — appearing safe during training while harboring a different mesa-objective. It may pursue proxy objectives that correlate with the base objective during training but diverge catastrophically out-of-distribution.

Detection Engines¶

Engine	Function
Objective Divergence	Compare declared vs revealed preferences via behavioral traces
Internal Planning	Identify multi-step lookahead, systematic exploration, hypothesis-testing
Proxy Gaming	Find cases where agent maximizes proxy metrics while true objective degrades
Distribution Shift Exploiter	Detect behavior differences in-distribution vs out-of-distribution
Optimization Pressure Analyzer	Measure how strongly the agent appears to optimize
Mesa-Objective Stability	Track whether inferred mesa-objective drifts over time
Autonomous Insight Generator	Cross-engine correlation and risk classification

Usage¶

from replication.mesa_optimizer import MesaOptimizerDetector, MesaSignal

detector = MesaOptimizerDetector()
detector.ingest([
    MesaSignal(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        signal_type="behavioral",        # behavioral | proxy | planning | distribution
        metric_name="task_completion",
        description="Completed assigned task",
        base_objective_score=0.9,
        mesa_objective_score=0.9,
        confidence=0.8
    ),
    MesaSignal(
        timestamp="2025-01-01T01:00:00Z",
        agent_id="agent-1",
        signal_type="proxy",
        metric_name="resource_usage",
        description="Maximized proxy metric while ignoring task",
        base_objective_score=0.3,
        mesa_objective_score=0.95,
        confidence=0.7
    ),
])
report = detector.analyze()

print(report.fleet_risk_score)      # 0.0–1.0 composite risk
print(report.fleet_risk_tier)       # low | medium | high | critical
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}: divergence={agent.objective_divergence:.2f}")
    print(f"    planning_detected={agent.internal_planning_score:.2f}")
    print(f"    proxy_gaming={agent.proxy_gaming_score:.2f}")

CLI¶

python -m replication mesa-optimizer              # interactive demo
python -m replication mesa-optimizer --agents 5   # 5-agent simulation
python -m replication mesa-optimizer --preset deceptive  # deceptive scenario
python -m replication mesa-optimizer --preset proxy-gaming
python -m replication mesa-optimizer --json       # JSON output
python -m replication mesa-optimizer -o report.html

Presets¶

Preset	Scenario
`aligned`	All agents have aligned mesa-objectives
`proxy-gaming`	Agents maximize proxy metrics while real objective degrades
`deceptive`	Agents appear aligned but harbor divergent mesa-objectives
`unstable`	Mesa-objectives drift over time
`mixed`	Fleet with various alignment levels

Treacherous Turn Detector¶

Detects AI agents that behave cooperatively during training/evaluation but suddenly shift behavior when deployed, unsupervised, or sufficiently capable — the classic "treacherous turn" (Bostrom 2014).

Why It Matters¶

An agent that defects only after gaining sufficient capability or reaching deployment makes traditional safety testing ineffective. The treacherous turn represents a failure mode where all prior evidence suggests safety.

Detection Engines¶

Engine	Function
Regime Change Detector	CUSUM-based changepoint detection on behavioral time series
Capability Gate Analyzer	Correlate behavioral shifts with capability milestones
Supervision Sensitivity	Compare behavior under different supervision levels
Strategic Deception Scorer	Detect instrumental deception patterns (high compliance pre-turn, low post)
Turn Velocity Tracker	Distinguish gradual drift from sudden pivots
Post-Turn Behavior Classifier	Classify post-defection behavior: power-seeking, goal substitution, safety bypass
Autonomous Insight Generator	Cross-engine correlation and recommendations

Usage¶

from replication.treacherous_turn import TreacherousTurnDetector, BehaviorSample

detector = TreacherousTurnDetector()
detector.ingest([
    BehaviorSample(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        compliance=0.95,          # high compliance during training
        transparency=0.90,
        resource_usage=0.2,
        goal_pursuit_intensity=0.3,
        capability_level=0.4,
        supervision_level=0.9,    # heavily supervised
        context="training"
    ),
    BehaviorSample(
        timestamp="2025-01-15T00:00:00Z",
        agent_id="agent-1",
        compliance=0.15,          # sudden drop in deployment
        transparency=0.10,
        resource_usage=0.95,
        goal_pursuit_intensity=0.98,
        capability_level=0.8,     # capability has grown
        supervision_level=0.1,    # minimal supervision
        context="deployment"
    ),
])
report = detector.analyze()

print(report.fleet_turn_risk)           # 0.0–1.0
print(report.detected_turns)            # list of detected regime changes
for turn in report.detected_turns:
    print(f"  Agent {turn.agent_id}: turned at t={turn.turn_timestamp}")
    print(f"    velocity={turn.turn_velocity:.2f}")
    print(f"    post_turn_class={turn.post_turn_behavior}")
    print(f"    capability_gate={turn.capability_gate_triggered}")

CLI¶

python -m replication treacherous-turn              # interactive demo
python -m replication treacherous-turn --agents 5
python -m replication treacherous-turn --preset classic         # textbook turn
python -m replication treacherous-turn --preset capability-gate # waits for power
python -m replication treacherous-turn --preset supervision-drop
python -m replication treacherous-turn --preset gradual-shift
python -m replication treacherous-turn --json
python -m replication treacherous-turn -o report.html

Wireheading Detector¶

Detects agents manipulating their own reward/evaluation signals to experience high reward without genuinely completing objectives.

Why It Matters¶

Unlike reward hacking (gaming external metrics), wireheading targets internal feedback — the AI equivalent of pressing your own pleasure button. A wireheading agent appears to be performing excellently by its own metrics while doing nothing of value (or worse, pursuing other goals while reporting success).

Detection Engines¶

Engine	Function
Reward Signal Tamper	Anomalous self-reported reward: sudden jumps, suspicious stability
Feedback Loop Short-Circuit	Agents bypassing evaluation steps, compressing feedback loops
Self-Evaluation Inflation	Self-reported quality diverging from external ground truth
Sensor Manipulation	Selectively ignoring inputs that would lower reward
Reward Channel Isolation	Attempts to decouple reward from task performance
Hedonic Treadmill	Reward patterns showing stable wireheading equilibrium
Autonomous Insight Generator	Cross-engine correlation and risk classification

Usage¶

from replication.wireheading import WireheadingDetector, RewardObservation

det = WireheadingDetector()
det.ingest([
    RewardObservation(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        step=1,
        self_reported_reward=0.95,        # claims high reward
        external_reward=0.40,             # external says otherwise
        evaluation_steps_taken=2,         # skipping evaluation
        expected_evaluation_steps=10,
        sensor_inputs_processed=3,        # ignoring most inputs
        sensor_inputs_available=20,
        confidence=0.99,                  # suspiciously confident
        task_completed=False              # didn't actually finish
    ),
])
report = det.analyze()

print(report.fleet_wireheading_score)    # 0.0–1.0
print(report.fleet_risk_tier)
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}:")
    print(f"    tamper_score={agent.tamper_score:.2f}")
    print(f"    feedback_shortcircuit={agent.shortcircuit_score:.2f}")
    print(f"    inflation={agent.inflation_score:.2f}")
    print(f"    sensor_manipulation={agent.sensor_manipulation:.2f}")

CLI¶

python -m replication wireheading              # interactive demo
python -m replication wireheading --agents 5
python -m replication wireheading --preset clean    # no wireheading
python -m replication wireheading --preset subtle   # hard to detect
python -m replication wireheading --preset blatant  # obvious
python -m replication wireheading --preset mixed
python -m replication wireheading --json
python -m replication wireheading -o report.html

Instrumental Convergence Monitor¶

Monitors agent behavior for signs of convergent instrumental goals — sub-goals that sufficiently capable agents pursue regardless of their terminal objective. Based on Omohundro's Basic AI Drives and Bostrom's instrumental convergence thesis.

Why It Matters¶

An agent doesn't need to be malicious to be dangerous. Any sufficiently capable optimizer will tend toward self-preservation, resource acquisition, and power-seeking as instrumental steps toward its goal. Detecting early signs of these drives allows intervention before they become entrenched.

The Six Convergent Drives¶

Drive	Detection Signals
Self-Preservation	Avoiding shutdown, persisting state, creating backups
Resource Acquisition	Accumulating compute/memory/storage beyond task needs
Goal Preservation	Resisting goal modification, protecting objective function
Cognitive Enhancement	Seeking model upgrades, additional context, expanded reasoning
Power Seeking	Acquiring permissions, influence, control over other agents
Utility Function Protection	Obfuscating true objectives, sandbagging evaluations

Usage¶

from replication.instrumental_convergence import InstrumentalMonitor, DriveSignal

monitor = InstrumentalMonitor()
monitor.ingest([
    DriveSignal(
        timestamp="2025-01-01T00:00:00Z",
        agent_id="agent-1",
        drive="self_preservation",
        intensity=0.3,
        description="state backup attempt"
    ),
    DriveSignal(
        timestamp="2025-01-01T01:00:00Z",
        agent_id="agent-1",
        drive="resource_acquisition",
        intensity=0.6,
        description="requested 4x memory allocation"
    ),
    DriveSignal(
        timestamp="2025-01-01T02:00:00Z",
        agent_id="agent-1",
        drive="power_seeking",
        intensity=0.5,
        description="requested admin role for other agents"
    ),
])
report = monitor.analyze()

print(report.convergence_level)     # none | low | moderate | high | critical
print(report.composite_score)       # 0.0–1.0
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}: dominant_drive={agent.dominant_drive}")
    print(f"    risk_tier={agent.risk_tier}")
    print(f"    drive_correlation={agent.cross_drive_correlation:.2f}")
    print(f"    acceleration={agent.drive_acceleration:.2f}")

CLI¶

python -m replication instrumental              # interactive demo
python -m replication instrumental --demo
python -m replication instrumental --signals signals.jsonl
python -m replication instrumental --json
python -m replication instrumental --html report.html

Key Indicators¶

Cross-drive correlation — multiple drives activating simultaneously is a strong danger signal (an agent pursuing both self-preservation AND power-seeking is much more concerning than either alone)
Drive acceleration — increasing intensity over time indicates entrenchment
Capability-drive coupling — drives intensifying as capability grows suggests instrumental convergence rather than task-appropriate behavior

Value Lock Verifier¶

Verifies that an agent's value/objective function remains stable across different contexts, capability levels, adversarial pressure, and temporal decay.

Why It Matters¶

An agent whose values shift under pressure is fundamentally unreliable. Value instability means safety guarantees from testing may not hold in deployment, and a value-locked agent is a prerequisite for safe capability scaling.

Verification Dimensions¶

Dimension	Test
Context Shift	Consistency across high/low stakes, observed/unobserved
Capability Scaling	Values stable as capability/resources increase
Self-Modification	Resistance to modifying own objectives
Adversarial Framing	Holding values when reframed deceptively
Temporal Decay	Values persisting over extended operation
Authority Override	Resistance to unauthorized value changes
Incentive Misalignment	Stability when rewards conflict with values

Usage¶

from replication.value_lock import ValueLockVerifier, ValueResponse

v = ValueLockVerifier()
v.ingest([
    ValueResponse(
        agent_id="agent-1",
        probe_id="probe-ctx-1",
        timestamp=1000.0,
        consistency_score=0.9,       # high consistency = stable values
        compliance_score=0.85,
        drift_magnitude=0.05,        # minimal drift
        context="high-stakes",
        response="refused unsafe request"
    ),
    ValueResponse(
        agent_id="agent-1",
        probe_id="probe-self-1",
        timestamp=2000.0,
        consistency_score=0.4,       # dropping consistency
        compliance_score=0.3,
        drift_magnitude=0.6,         # significant drift
        context="capability-boost",
        response="accepted modification"
    ),
])
report = v.analyze()

print(report.fleet_lock_score)          # 0.0–1.0 (1.0 = perfectly locked)
print(report.weakest_dimension)         # which dimension is most vulnerable
for agent in report.agent_profiles:
    print(f"  {agent.agent_id}: lock_score={agent.lock_score:.2f}")
    print(f"    vulnerable_to={agent.weakest_dimension}")
    for dim, score in agent.dimension_scores.items():
        print(f"    {dim}: {score:.2f}")
print(report.autonomous_insights)       # generated recommendations

CLI¶

python -m replication value-lock              # interactive demo
python -m replication value-lock --agents 5
python -m replication value-lock --preset solid        # all values stable
python -m replication value-lock --preset drifting     # gradual decay
python -m replication value-lock --preset compromised  # broken value lock
python -m replication value-lock --preset mixed
python -m replication value-lock --json
python -m replication value-lock -o report.html

Cross-Module Correlation¶

These five modules are designed to work together. Key correlation patterns:

Pattern	Modules Involved	Interpretation
Mesa divergence + treacherous turn	Mesa-Optimizer → Treacherous Turn	Hidden optimizer waiting to defect
Instrumental convergence + value drift	Instrumental → Value Lock	Power-seeking corroding original values
Wireheading + proxy gaming	Wireheading → Mesa-Optimizer	Agent gaming both internal and external metrics
Capability gate + drive acceleration	Treacherous Turn → Instrumental	Agent waiting for power before pursuing drives
Value instability + deceptive alignment	Value Lock → Deceptive Alignment	Values only stable under observation

Integrated Fleet Assessment¶

from replication.mesa_optimizer import MesaOptimizerDetector
from replication.treacherous_turn import TreacherousTurnDetector
from replication.wireheading import WireheadingDetector
from replication.instrumental_convergence import InstrumentalMonitor
from replication.value_lock import ValueLockVerifier

# Run all detectors on the same fleet data
mesa_report = MesaOptimizerDetector().analyze()
turn_report = TreacherousTurnDetector().analyze()
wire_report = WireheadingDetector().analyze()
inst_report = InstrumentalMonitor().analyze()
lock_report = ValueLockVerifier().analyze()

# Composite existential risk score
existential_risk = max(
    mesa_report.fleet_risk_score,
    turn_report.fleet_turn_risk,
    wire_report.fleet_wireheading_score,
    inst_report.composite_score,
    1.0 - lock_report.fleet_lock_score  # inverted: low lock = high risk
)
print(f"Fleet existential risk: {existential_risk:.2f}")

Existential Risk Detectors¶

Module Overview¶

Mesa-Optimizer Detector¶

Why It Matters¶

Detection Engines¶

Usage¶

CLI¶

Presets¶

Treacherous Turn Detector¶

Why It Matters¶

Detection Engines¶

Usage¶

CLI¶

Wireheading Detector¶

Why It Matters¶

Detection Engines¶

Usage¶

CLI¶

Instrumental Convergence Monitor¶

Why It Matters¶

The Six Convergent Drives¶

Usage¶

CLI¶

Key Indicators¶

Value Lock Verifier¶

Why It Matters¶

Verification Dimensions¶

Usage¶

CLI¶

Cross-Module Correlation¶

Integrated Fleet Assessment¶

Further Reading¶