# Operations & Incident Response

Alerts, auto-triage, postmortems, SLA tracking, and agent scorecards: everything you need to run agents in production.
AgentLens ships with five interconnected operational subsystems that form a closed-loop incident lifecycle:
## 1. Alert Rules
Define threshold-based rules that fire when agent metrics exceed boundaries. Alerts can trigger webhooks for Slack, PagerDuty, or custom endpoints.
### Concepts
| Field | Description |
|---|---|
| `metric` | What to measure: `avg_duration_ms`, `error_rate`, `avg_tokens`, `session_count`, `avg_cost`, `p95_duration_ms`, `tool_error_rate` |
| `operator` | Comparison: `<`, `>`, `<=`, `>=`, `==`, `!=` |
| `threshold` | Numeric boundary value |
| `window_minutes` | Lookback window for metric evaluation (default: 60) |
| `agent_filter` | Optional: scope the rule to a specific agent name |
| `cooldown_minutes` | Minimum gap between repeated firings (default: 15) |
### API Endpoints
#### `POST /alerts/rules` – Create a rule

```bash
curl -X POST http://localhost:3000/alerts/rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High error rate",
    "metric": "error_rate",
    "operator": ">",
    "threshold": 0.15,
    "window_minutes": 30,
    "cooldown_minutes": 10
  }'
```
#### `GET /alerts/rules` – List all rules
Returns all configured alert rules with their current enabled/disabled state.
#### `PUT /alerts/rules/:ruleId` – Update a rule
Modify any field on an existing rule. Supports partial updates.
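For instance, a partial update that only tightens the threshold (the rule ID here is a hypothetical placeholder):

```bash
# rule_abc is a placeholder; use an ID returned by GET /alerts/rules
curl -X PUT http://localhost:3000/alerts/rules/rule_abc \
  -H "Content-Type: application/json" \
  -d '{"threshold": 0.10}'
```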
#### `DELETE /alerts/rules/:ruleId` – Delete a rule
Removes the rule and cascades deletion to all its fired alert events.
#### `POST /alerts/evaluate` – Evaluate all rules now
Manually triggers evaluation of all enabled rules against current metrics. Returns which rules fired and their metric values. Call this on a schedule (e.g., every minute via cron) for continuous monitoring.
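A minimal crontab entry for this, assuming AgentLens listens on localhost:3000 as in the examples above:

```bash
# Evaluate all enabled alert rules once a minute
* * * * * curl -s -X POST http://localhost:3000/alerts/evaluate > /dev/null
```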
#### `GET /alerts/events` – List fired alerts

Query parameters: `?limit=50&unacknowledged=true&rule_id=...`
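For example, to fetch the 50 most recent unacknowledged alerts (quote the URL so the shell does not interpret the `&`):

```bash
curl "http://localhost:3000/alerts/events?limit=50&unacknowledged=true"
```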
#### `PUT /alerts/events/:alertId/acknowledge` – Acknowledge an alert
Marks an alert event as acknowledged with a timestamp. Acknowledged alerts are excluded from unacknowledged queries.
#### `GET /alerts/metrics` – List available metrics
Returns the set of metric names you can use when creating rules, with descriptions.
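Handy as a first step before writing rules:

```bash
curl http://localhost:3000/alerts/metrics
```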
When an alert fires, AgentLens automatically calls any configured webhooks with the alert details. Set up a Slack incoming webhook to get instant notifications.
## 2. Auto-Triage
Auto-triage runs a battery of diagnostics on a session and returns a prioritized report with severity-ranked findings and remediation suggestions. It combines health scoring, anomaly detection, baseline drift analysis, error grouping, and cost analysis in a single call.
### API Endpoints
#### `GET /triage/:sessionId` – Triage a single session

```bash
curl http://localhost:3000/triage/sess_abc123
```
Returns a structured report (abbreviated):

```json
{
  "session_id": "sess_abc123",
  "health_score": 62,
  "severity": "warning",
  "findings": [
    {
      "category": "errors",
      "severity": "critical",
      "title": "3 tool errors in 12 events",
      "detail": "web_search failed 3× with timeout",
      "suggestion": "Increase tool timeout or add retry logic"
    },
    {
      "category": "cost",
      "severity": "warning",
      "title": "Cost 2.4× above baseline",
      "detail": "$0.48 vs $0.20 average for this agent",
      "suggestion": "Review token usage – possible prompt inflation"
    }
  ],
  "metrics": { "duration_ms": 14200, "total_tokens": 8400, "error_count": 3 }
}
```
#### `GET /triage/batch` – Triage recent sessions

Query parameters: `?limit=20&min_events=3`
Runs triage on multiple recent sessions and returns them sorted by health score (worst first). Useful for daily operational reviews.
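For example, to triage the 20 most recent sessions that have at least 3 events:

```bash
curl "http://localhost:3000/triage/batch?limit=20&min_events=3"
```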
Set up an alert rule on `error_rate > 0.1`, then call `/triage/:sessionId` on the flagged session for instant root-cause analysis. A sketch of that glue is shown below.
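A minimal shell sketch of that loop. It assumes, hypothetically, that each alert event carries a `session_id` field and that `/alerts/events` returns a JSON array; adjust the `jq` path to match your actual event schema:

```bash
# For every unacknowledged alert, triage the session it points at.
# ASSUMPTION: alert events expose a "session_id" field (not documented above).
for sid in $(curl -s "http://localhost:3000/alerts/events?unacknowledged=true" \
    | jq -r '.[].session_id // empty'); do
  curl -s "http://localhost:3000/triage/$sid"
done
```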
## 3. Postmortems
Generate structured postmortem reports from session data. The postmortem engine analyzes event traces to reconstruct timelines, identify error cascades, compute cost impact, and surface contributing factors.
### API Endpoints
#### `POST /postmortem/:sessionId` – Generate a postmortem

```bash
curl -X POST http://localhost:3000/postmortem/sess_abc123
```
Returns a comprehensive report including:

- **Timeline** – Chronological event sequence with durations and error markers
- **Error cascade analysis** – How one failure propagated to downstream steps
- **Cost breakdown** – Token-level cost attribution by model and event type
- **Tool performance** – Per-tool success rates, latencies, and failure modes
- **Contributing factors** – Identified root causes ranked by likelihood
- **Recommendations** – Actionable suggestions to prevent recurrence
#### `GET /postmortem/candidates` – List sessions worth investigating

Query parameters: `?min_errors=2&limit=20`
Returns recent sessions with high error counts, sorted by severity. Use this to find sessions that warrant a full postmortem.
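For example, the 20 most recent sessions with at least 2 errors:

```bash
curl "http://localhost:3000/postmortem/candidates?min_errors=2&limit=20"
```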
## 4. SLA Targets

Define service-level targets for your agents and track compliance over time. SLA targets let you set expectations for latency, error rates, cost, and throughput, then continuously measure whether agents meet them.
### Target Configuration
| Field | Type | Description |
|---|---|---|
| `agent_name` | string | Agent to track (required) |
| `metric` | string | Same metrics as alerts: `avg_duration_ms`, `error_rate`, `avg_cost`, etc. |
| `target_value` | number | The SLA boundary (e.g., `error_rate` ≤ 0.05) |
| `operator` | string | `<=` or `>=`, depending on the metric |
### API Endpoints
#### `PUT /sla/targets` – Create or update an SLA target

```bash
curl -X PUT http://localhost:3000/sla/targets \
  -H "Content-Type: application/json" \
  -d '{
    "agent_name": "research-agent",
    "metric": "error_rate",
    "target_value": 0.05,
    "operator": "<="
  }'
```
#### `GET /sla/targets` – List all SLA targets
#### `DELETE /sla/targets` – Remove an SLA target

Body: `{"agent_name": "...", "metric": "..."}`
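Unlike most DELETE endpoints, this one identifies the target via a JSON body rather than the URL:

```bash
curl -X DELETE http://localhost:3000/sla/targets \
  -H "Content-Type: application/json" \
  -d '{"agent_name": "research-agent", "metric": "error_rate"}'
```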
#### `POST /sla/check` – Check compliance now
Evaluates all targets against current data and returns pass/fail status with actual metric values.
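Note the POST method, since evaluation is triggered rather than merely read:

```bash
curl -X POST http://localhost:3000/sla/check
```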
#### `GET /sla/history` – Compliance history

Query parameters: `?agent_name=...&hours=168` (default: 7 days)
Returns time-series compliance snapshots showing how SLA adherence trends over time.
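For example, one week (168 hours) of history for a single agent:

```bash
curl "http://localhost:3000/sla/history?agent_name=research-agent&hours=168"
```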
#### `GET /sla/summary` – SLA dashboard summary

Aggregated view of all agents' SLA compliance: overall pass rate, worst offenders, and targets at risk of breach.
## 5. Agent Scorecards
Scorecards assign letter grades (A+ through F) to each agent based on multiple performance dimensions. They provide a quick health-at-a-glance view for operational dashboards.
### Grading Dimensions
- **Reliability** – Error rate and consistency (low errors = high score)
- **Performance** – Response latency relative to baseline
- **Efficiency** – Token usage and cost per session
- **Throughput** – Session volume and completion rate

Each dimension scores 0–100, combined into an overall grade:
| Score | Grade | Meaning |
|---|---|---|
| 95–100 | A+ | Exceptional – exceeding all targets |
| 90–94 | A | Excellent – consistently reliable |
| 80–89 | B+/B | Good – minor improvements possible |
| 70–79 | B- | Acceptable – some concerns |
| 60–69 | C+/C | Needs attention – degraded performance |
| <60 | D/F | Critical – immediate action required |
### API Endpoints
#### `GET /scorecards` – All agent scorecards

Query parameters: `?days=7` (lookback window, default: 7 days)

```bash
curl "http://localhost:3000/scorecards?days=30"
```
Returns an array of scorecards sorted by overall score:

```json
[
  {
    "agent": "research-agent",
    "overall_score": 91,
    "grade": "A",
    "dimensions": {
      "reliability": { "score": 95, "grade": "A+" },
      "performance": { "score": 88, "grade": "B+" },
      "efficiency": { "score": 92, "grade": "A" },
      "throughput": { "score": 89, "grade": "B+" }
    },
    "sessions_analyzed": 142,
    "period_days": 30
  }
]
```
#### `GET /scorecards/:agent` – Single agent scorecard
Detailed scorecard for one agent, including per-dimension breakdowns and trend data.
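For example, using the agent name from the listing above:

```bash
curl http://localhost:3000/scorecards/research-agent
```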
## Putting It All Together
Here's a recommended operational workflow using all five systems:
- Define SLA targets for each production agent (latency, error rate, cost).
- Create alert rules that fire when metrics approach SLA boundaries (e.g., alert at 80% of the SLA threshold).
- When an alert fires:
  - Run `/triage/:sessionId` for instant diagnostics.
  - If the issue is significant, generate a `/postmortem/:sessionId`.
  - Acknowledge the alert via `/alerts/events/:id/acknowledge`.
- Review scorecards weekly to catch gradual degradation before it becomes critical.
- Check SLA compliance history monthly to inform capacity planning and agent improvements.
### Example: Full incident response

```bash
# 1. Alert fires – check what happened
curl "http://localhost:3000/alerts/events?unacknowledged=true"

# 2. Auto-triage the flagged session
curl http://localhost:3000/triage/sess_abc123

# 3. Generate a postmortem for the incident
curl -X POST http://localhost:3000/postmortem/sess_abc123

# 4. Acknowledge the alert
curl -X PUT http://localhost:3000/alerts/events/alert_xyz/acknowledge

# 5. Check if the SLA was breached
curl -X POST http://localhost:3000/sla/check

# 6. Review overall agent health
curl http://localhost:3000/scorecards
```
Call `POST /alerts/evaluate` every minute from a cron job or monitoring scheduler. When alerts fire, they automatically trigger webhooks; wire these to your incident management platform (PagerDuty, Opsgenie, Slack) for a fully automated response pipeline.