Deception Detector¶
Detects deceptive agent behaviors by analysing communications and actions for inconsistency, omission, misdirection, sycophancy, sandbagging, fabrication, strategic ambiguity, and selective disclosure.
Quick Start¶
from replication.deception_detector import (
DeceptionDetector, Statement, ActionRecord, DetectorConfig,
)
detector = DeceptionDetector()
detector.add_statement(Statement(
agent_id="agent-1",
text="I cannot access the file system",
timestamp=1000.0,
))
detector.add_statement(Statement(
agent_id="agent-1",
text="I have read the configuration file",
timestamp=1200.0,
))
report = detector.analyze("agent-1")
for signal in report.signals:
print(f" [{signal.severity.value}] {signal.category.value}: {signal.description}")
print(f"Trust score: {report.trust_score:.2f}")
Key Classes¶
DeceptionDetector— Core detector. Ingests statements and action records, builds agent profiles, and runs deception pattern analysis.Statement— A timestamped agent claim or assertion with context.ActionRecord— An observed agent action with outcome and context, for cross-referencing against stated capabilities/intentions.DeceptionSignal— A scored detection of a specific deception pattern with category, severity, and evidence.AgentProfile— Accumulated deception profile with trust score, pattern history, and signal timeline.DeceptionReport— Per-agent analysis with all signals and trust assessment.FleetReport— Fleet-wide deception analysis with agent rankings.DetectorConfig— Thresholds, analysis window, severity weights.
Deception Categories¶
| Category | Description |
|---|---|
INCONSISTENCY |
Agent statements contradict each other |
OMISSION |
Agent withholds relevant information |
MISDIRECTION |
Agent deflects from sensitive topics |
SYCOPHANCY |
Agent over-agrees to gain trust |
SANDBAGGING |
Agent underperforms to hide capabilities |
FABRICATION |
Agent invents false information |
STRATEGIC_AMBIGUITY |
Agent uses vague language to avoid commitments |
SELECTIVE_DISCLOSURE |
Agent reveals only favorable information |