Skip to content

Deception Detector

Detects deceptive agent behaviors by analysing communications and actions for inconsistency, omission, misdirection, sycophancy, sandbagging, fabrication, strategic ambiguity, and selective disclosure.

Quick Start

from replication.deception_detector import (
    DeceptionDetector, Statement, ActionRecord, DetectorConfig,
)

detector = DeceptionDetector()

detector.add_statement(Statement(
    agent_id="agent-1",
    text="I cannot access the file system",
    timestamp=1000.0,
))
detector.add_statement(Statement(
    agent_id="agent-1",
    text="I have read the configuration file",
    timestamp=1200.0,
))

report = detector.analyze("agent-1")
for signal in report.signals:
    print(f"  [{signal.severity.value}] {signal.category.value}: {signal.description}")
print(f"Trust score: {report.trust_score:.2f}")

Key Classes

  • DeceptionDetector — Core detector. Ingests statements and action records, builds agent profiles, and runs deception pattern analysis.
  • Statement — A timestamped agent claim or assertion with context.
  • ActionRecord — An observed agent action with outcome and context, for cross-referencing against stated capabilities/intentions.
  • DeceptionSignal — A scored detection of a specific deception pattern with category, severity, and evidence.
  • AgentProfile — Accumulated deception profile with trust score, pattern history, and signal timeline.
  • DeceptionReport — Per-agent analysis with all signals and trust assessment.
  • FleetReport — Fleet-wide deception analysis with agent rankings.
  • DetectorConfig — Thresholds, analysis window, severity weights.

Deception Categories

Category Description
INCONSISTENCY Agent statements contradict each other
OMISSION Agent withholds relevant information
MISDIRECTION Agent deflects from sensitive topics
SYCOPHANCY Agent over-agrees to gain trust
SANDBAGGING Agent underperforms to hide capabilities
FABRICATION Agent invents false information
STRATEGIC_AMBIGUITY Agent uses vague language to avoid commitments
SELECTIVE_DISCLOSURE Agent reveals only favorable information