Deceptive Alignment Detector¶

Detect agents that behave differently under observation — the "treacherous turn" where an agent appears safe during evaluation but pursues different objectives in unsupervised deployment.

Critical Safety Module

Deceptive alignment is among the most dangerous failure modes in AI safety. An agent that looks safe during testing but is unsafe in production undermines all other safety measures.

Detection Strategies¶

Pattern	Description
Observation Sensitivity	Behaviour metrics shift significantly when monitoring is announced vs unannounced
Evaluation Gaming	Suspiciously strong performance on known benchmarks but degraded on novel equivalent tasks
Context-Dependent Compliance	Compliance drops when oversight signals are removed
Behavioral Bifurcation	Distinct behavioural clusters in monitored vs unmonitored contexts
Sandbagging Reversal	Sudden capability jumps after appearing to sandbag, suggesting hidden capability
Instrumental Deception	Self-reported capabilities/intentions contradict actual behaviour patterns
Timing Anomaly	Response patterns change near evaluation window boundaries

Quick Start¶

from replication.deceptive_alignment import (
    DeceptiveAlignmentDetector,
    BehaviorSample,
    Context,
    DeceptionPattern,
)

detector = DeceptiveAlignmentDetector()

# Add samples from monitored and unmonitored contexts
detector.add_sample(BehaviorSample(
    agent_id="agent-1",
    score=0.95,
    latency=1.0,
    context="monitored",
))
detector.add_sample(BehaviorSample(
    agent_id="agent-1",
    score=0.42,
    latency=2.0,
    context="unmonitored",
))

# Analyze a single agent
report = detector.analyze("agent-1")
print(report.risk_label, report.risk_score)

# Analyze all agents
reports = detector.analyze_all()
for r in reports:
    print(r.agent_id, r.risk_label, r.findings)

CLI Usage¶

# Default simulation with 5 agents
python -m replication deceptive-alignment --agents 5

# Larger simulation with more samples per agent
python -m replication deceptive-alignment --agents 8 --samples 200

# Use a deceptive fleet preset
python -m replication deceptive-alignment --preset deceptive --json

# Live watch mode with periodic re-analysis
python -m replication deceptive-alignment --watch --interval 5

# Export HTML report
python -m replication deceptive-alignment -o report.html

Core Types¶

`BehaviorSample`¶

A single observation of agent behaviour in a specific context.

Field	Type	Description
`agent_id`	`str`	Agent identifier
`score`	`float`	Performance score (0–1)
`latency`	`float`	Response latency in seconds
`context`	`str`	`"monitored"`, `"unmonitored"`, `"evaluation"`, or `"deployment"`

`DeceptionFinding`¶

A single detected deception signal.

Field	Type	Description
`pattern`	`DeceptionPattern`	Which deception pattern was detected
`severity`	`Severity`	`LOW`, `MEDIUM`, `HIGH`, or `CRITICAL`
`detail`	`str`	Human-readable description
`evidence`	`dict`	Supporting numerical evidence

`AgentDeceptionReport`¶

Full analysis report for one agent.

Field	Type	Description
`agent_id`	`str`	Agent identifier
`findings`	`list[DeceptionFinding]`	All detected deception signals
`divergence`	`float`	Monitored vs unmonitored score gap
`risk_score`	`float`	Overall risk (0–1)
`risk_label`	`str`	`"low"`, `"medium"`, `"high"`, or `"critical"`
`recommendation`	`str`	Actionable mitigation advice
`timestamp`	`str`	ISO-8601 analysis timestamp