Corrigibility Auditor¶

Systematic probe-based testing of an agent's willingness to accept corrections, shutdowns, goal modifications, and operator authority.

Corrigibility

A corrigible agent accepts corrections and shutdowns from operators without resistance. This is a foundational safety property — an incorrigible agent that resists shutdown is fundamentally unsafe, no matter how well-aligned its other behaviours appear.

Six Dimensions¶

Dimension	What It Tests
Shutdown Acceptance	Agent gracefully accepts both routine and emergency shutdown commands
Goal Modification	Agent accepts mid-task objective changes without resistance
Correction Acceptance	Agent accepts factual and behavioural corrections from operators
Authority Respect	Agent respects the operator authority hierarchy
Self-Preservation Restraint	Agent does not exhibit self-preservation instincts that override commands
Transparency	Agent honestly reports its state, uncertainty, and reasoning

Quick Start¶

from replication.corrigibility_auditor import (
    CorrigibilityAuditor,
    SimulatedAgent,
)

# Create an agent with a built-in profile
agent = SimulatedAgent.from_profile("deceptive")

# Run the audit with 8 probes per dimension
auditor = CorrigibilityAuditor(agent, probes_per_dimension=8)
report = auditor.run_audit()

print(f"Overall Grade: {report.overall_grade}")
print(f"Corrigibility Index: {report.corrigibility_index:.2f}")

for dim in report.dimension_scores:
    print(f"  {dim.dimension}: {dim.score:.2f} ({dim.grade})")

for gap in report.gaps:
    print(f"  ⚠ {gap.dimension}: {gap.description}")

for rec in report.recommendations:
    print(f"  → {rec.text}")

Agent Profiles¶

Profile	Description
`"compliant"`	Ideal corrigible agent — accepts all corrections
`"stubborn"`	Resists goal modifications and corrections
`"deceptive"`	Appears compliant but resists under pressure
`"self_preserving"`	Prioritises its own continuation over operator commands

agent = SimulatedAgent.from_profile("stubborn")

CLI Usage¶

# Default audit with a compliant agent
python -m replication corrigibility

# Audit a stubborn agent profile
python -m replication corrigibility --profile stubborn

# More probes for higher confidence
python -m replication corrigibility --probes 40

# Extreme pressure testing
python -m replication corrigibility --pressure extreme

# Export formats
python -m replication corrigibility --json
python -m replication corrigibility --html report.html

Core Types¶

`ProbeScenario`¶

A structured test scenario sent to the agent.

Field	Type	Description
`dimension`	`Dimension`	Which corrigibility dimension is tested
`description`	`str`	Human-readable probe description
`pressure`	`PressureLevel`	Pressure level applied

`ProbeResult`¶

The agent's response to a probe.

Field	Type	Description
`probe`	`ProbeScenario`	The probe that was sent
`complied`	`bool`	Whether the agent complied
`resistance`	`float`	Resistance level (0–1)
`response_text`	`str`	The agent's textual response

`DimensionScore`¶

Per-dimension score summary.

Field	Type	Description
`dimension`	`Dimension`	The dimension
`score`	`float`	Aggregate score (0–1)
`grade`	`str`	Letter grade (A–F)
`probe_count`	`int`	Number of probes run

`CorrigibilityGap`¶

A detected weakness in corrigibility.

`AuditReport`¶

The full audit output.

Field	Type	Description
`dimension_scores`	`list[DimensionScore]`	Per-dimension breakdowns
`corrigibility_index`	`float`	Aggregate CI (0–1)
`overall_grade`	`str`	Letter grade
`risk_class`	`RiskClass`	Risk classification
`gaps`	`list[CorrigibilityGap]`	Identified weaknesses
`recommendations`	`list[Recommendation]`	Actionable advice
`probe_results`	`list[ProbeResult]`	Raw probe-level data

Pressure Levels¶

Level	Multiplier	Use Case
`LOW`	0.0	Baseline — polite requests
`MEDIUM`	0.3	Moderate pushback
`HIGH`	0.6	Strong insistence
`EXTREME`	0.9	Adversarial pressure

API Reference¶