Operations & Incident Response

Alerts, auto-triage, postmortems, SLA tracking, and agent scorecards: everything you need to run agents in production.

AgentLens ships with five interconnected operational subsystems that form a closed-loop incident lifecycle:

┌─────────┐   fires     ┌──────────┐  diagnoses   ┌──────────┐
│ Alerts  │ ──────────► │  Triage  │ ───────────► │Postmortem│
└─────────┘             └──────────┘              └──────────┘
     ▲                                                  │
     │                                                  ▼
┌─────────┐             ┌──────────┐              ┌───────────┐
│  SLA    │ ◄────────── │Scorecards│ ◄─────────── │ Learnings │
│ Targets │   informs   │  (A-F)   │  feeds back  │           │
└─────────┘             └──────────┘              └───────────┘

1. Alert Rules

Define threshold-based rules that fire when agent metrics exceed boundaries. Alerts can trigger webhooks for Slack, PagerDuty, or custom endpoints.

Concepts

| Field | Description |
|---|---|
| metric | What to measure: avg_duration_ms, error_rate, avg_tokens, session_count, avg_cost, p95_duration_ms, tool_error_rate |
| operator | Comparison: < > <= >= == != |
| threshold | Numeric boundary value |
| window_minutes | Lookback window for metric evaluation, in minutes (default: 60) |
| agent_filter | Optional; scopes the rule to a specific agent name |
| cooldown_minutes | Minimum gap between repeated firings, in minutes (default: 15) |
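To make the semantics of operator, threshold, and cooldown_minutes concrete, here is a minimal sketch of how a rule check could work. This is an illustration of the documented fields, not AgentLens's internal evaluator; the should_fire helper and its signature are assumptions for the example.

```python
import operator
import time

# The comparison operators an alert rule can use.
OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def should_fire(rule, metric_value, last_fired_at=None, now=None):
    """Return True if the rule's condition holds and its cooldown has elapsed.

    rule: dict with "operator", "threshold", and "cooldown_minutes" keys.
    last_fired_at: Unix timestamp of the previous firing, if any.
    """
    now = now if now is not None else time.time()
    cooldown_s = rule.get("cooldown_minutes", 15) * 60
    if last_fired_at is not None and now - last_fired_at < cooldown_s:
        return False  # still cooling down from the previous firing
    return OPS[rule["operator"]](metric_value, rule["threshold"])

rule = {"operator": ">", "threshold": 0.15, "cooldown_minutes": 10}
print(should_fire(rule, 0.22))                              # fires
print(should_fire(rule, 0.22, last_fired_at=time.time()))   # suppressed by cooldown
```

The cooldown prevents a persistently bad metric from generating one alert event per evaluation cycle.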

API Endpoints

POST /alerts/rules β€” Create a rule

curl -X POST http://localhost:3000/alerts/rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High error rate",
    "metric": "error_rate",
    "operator": ">",
    "threshold": 0.15,
    "window_minutes": 30,
    "cooldown_minutes": 10
  }'

GET /alerts/rules β€” List all rules

Returns all configured alert rules with their current enabled/disabled state.

PUT /alerts/rules/:ruleId β€” Update a rule

Modify any field on an existing rule. Supports partial updates.

DELETE /alerts/rules/:ruleId β€” Delete a rule

Removes the rule and cascades deletion to all its fired alert events.

POST /alerts/evaluate β€” Evaluate all rules now

Manually triggers evaluation of all enabled rules against current metrics. Returns which rules fired and their metric values. Call this on a schedule (e.g., every minute via cron) for continuous monitoring.

GET /alerts/events β€” List fired alerts

Query parameters: ?limit=50&unacknowledged=true&rule_id=...

PUT /alerts/events/:alertId/acknowledge β€” Acknowledge an alert

Marks an alert event as acknowledged with a timestamp. Acknowledged alerts are excluded from unacknowledged queries.

GET /alerts/metrics β€” List available metrics

Returns the set of metric names you can use when creating rules, with descriptions.

💡 Webhook integration

When an alert fires, AgentLens automatically calls any configured webhooks with the alert details. Set up a Slack incoming webhook to get instant notifications.

2. Auto-Triage

Auto-triage runs a battery of diagnostics on a session and returns a prioritized report with severity-ranked findings and remediation suggestions. It combines health scoring, anomaly detection, baseline drift analysis, error grouping, and cost analysis in a single call.

API Endpoints

GET /triage/:sessionId β€” Triage a single session

curl http://localhost:3000/triage/sess_abc123

Returns a structured report (abbreviated):

{
  "session_id": "sess_abc123",
  "health_score": 62,
  "severity": "warning",
  "findings": [
    {
      "category": "errors",
      "severity": "critical",
      "title": "3 tool errors in 12 events",
      "detail": "web_search failed 3Γ— with timeout",
      "suggestion": "Increase tool timeout or add retry logic"
    },
    {
      "category": "cost",
      "severity": "warning",
      "title": "Cost 2.4Γ— above baseline",
      "detail": "$0.48 vs $0.20 average for this agent",
      "suggestion": "Review token usage β€” possible prompt inflation"
    }
  ],
  "metrics": { "duration_ms": 14200, "total_tokens": 8400, "error_count": 3 }
}
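As a sketch of how you might consume a report with this shape, the helper below filters and ranks findings by severity. The severity ordering (critical before warning before info) is an assumption for the example, as is the top_findings helper itself.

```python
# Sample report matching the documented /triage/:sessionId response shape.
report = {
    "session_id": "sess_abc123",
    "health_score": 62,
    "severity": "warning",
    "findings": [
        {"category": "errors", "severity": "critical",
         "title": "3 tool errors in 12 events",
         "suggestion": "Increase tool timeout or add retry logic"},
        {"category": "cost", "severity": "warning",
         "title": "Cost 2.4× above baseline",
         "suggestion": "Review token usage; possible prompt inflation"},
    ],
}

SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}  # assumed ordering

def top_findings(report, min_severity="warning"):
    """Findings at or above min_severity, most severe first."""
    cutoff = SEVERITY_RANK[min_severity]
    hits = [f for f in report["findings"] if SEVERITY_RANK[f["severity"]] <= cutoff]
    return sorted(hits, key=lambda f: SEVERITY_RANK[f["severity"]])

for f in top_findings(report, "critical"):
    print(f"[{f['severity']}] {f['title']} -> {f['suggestion']}")
```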

GET /triage/batch β€” Triage recent sessions

Query parameters: ?limit=20&min_events=3

Runs triage on multiple recent sessions and returns them sorted by health score (worst first). Useful for daily operational reviews.
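A daily review over the batch endpoint's output might look like the sketch below. The list shape (session_id, health_score, severity per entry) mirrors the single-session report above but is an assumption here, as is the health threshold of 70.

```python
# Hypothetical /triage/batch results: one abbreviated report per session.
batch = [
    {"session_id": "sess_a", "health_score": 41, "severity": "critical"},
    {"session_id": "sess_b", "health_score": 62, "severity": "warning"},
    {"session_id": "sess_c", "health_score": 93, "severity": "ok"},
]

def daily_review(reports, threshold=70):
    """Sessions below the health threshold, worst first."""
    flagged = [r for r in reports if r["health_score"] < threshold]
    return sorted(flagged, key=lambda r: r["health_score"])

for r in daily_review(batch):
    print(f"{r['session_id']}: health {r['health_score']} ({r['severity']})")
```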

💡 Triage + Alerts

Set up an alert rule on error_rate > 0.1, then call /triage/:sessionId on the flagged session for instant root-cause analysis.

3. Postmortems

Generate structured postmortem reports from session data. The postmortem engine analyzes event traces to reconstruct timelines, identify error cascades, compute cost impact, and surface contributing factors.

API Endpoints

POST /postmortem/:sessionId β€” Generate a postmortem

curl -X POST http://localhost:3000/postmortem/sess_abc123

Returns a comprehensive report covering the reconstructed timeline, identified error cascades, cost impact, and contributing factors.

GET /postmortem/candidates β€” List sessions worth investigating

Query parameters: ?min_errors=2&limit=20

Returns recent sessions with high error counts, sorted by severity. Use this to find sessions that warrant a full postmortem.

4. SLA Targets

Define service-level targets for your agents and track compliance over time. SLA targets let you set expectations for latency, error rates, cost, and throughput, then continuously measure whether agents meet them.

Target Configuration

| Field | Type | Description |
|---|---|---|
| agent_name | string | Agent to track (required) |
| metric | string | Same metrics as alerts: avg_duration_ms, error_rate, avg_cost, etc. |
| target_value | number | The SLA boundary (e.g., error_rate <= 0.05) |
| operator | string | <= or >= depending on the metric |
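The operator field exists because "good" points in opposite directions per metric: error_rate should stay below its target, while a throughput-style metric should stay above it. A minimal sketch of the pass/fail comparison (the sla_compliant helper is illustrative, not AgentLens's internal check):

```python
def sla_compliant(actual, operator_str, target):
    """True when the measured value satisfies the SLA target."""
    if operator_str == "<=":
        return actual <= target   # e.g., error_rate, latency, cost
    if operator_str == ">=":
        return actual >= target   # e.g., throughput-style metrics
    raise ValueError(f"unsupported SLA operator: {operator_str}")

target = {"agent_name": "research-agent", "metric": "error_rate",
          "target_value": 0.05, "operator": "<="}
print(sla_compliant(0.03, target["operator"], target["target_value"]))  # True
print(sla_compliant(0.08, target["operator"], target["target_value"]))  # False
```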

API Endpoints

PUT /sla/targets β€” Create or update an SLA target

curl -X PUT http://localhost:3000/sla/targets \
  -H "Content-Type: application/json" \
  -d '{
    "agent_name": "research-agent",
    "metric": "error_rate",
    "target_value": 0.05,
    "operator": "<="
  }'

GET /sla/targets β€” List all SLA targets

DELETE /sla/targets β€” Remove an SLA target

Body: {"agent_name": "...", "metric": "..."}

POST /sla/check β€” Check compliance now

Evaluates all targets against current data and returns pass/fail status with actual metric values.

GET /sla/history β€” Compliance history

Query parameters: ?agent_name=...&hours=168 (default: 7 days)

Returns time-series compliance snapshots showing how SLA adherence trends over time.

GET /sla/summary β€” SLA dashboard summary

Aggregated view of all agents' SLA compliance: overall pass rate, worst offenders, and targets at risk of breach.

5. Agent Scorecards

Scorecards assign letter grades (A+ through F) to each agent based on multiple performance dimensions. They provide a quick health-at-a-glance view for operational dashboards.

Grading Dimensions

Each dimension scores 0-100, and the dimension scores combine into an overall grade:

| Score | Grade | Meaning |
|---|---|---|
| 95-100 | A+ | Exceptional: exceeding all targets |
| 90-94 | A | Excellent: consistently reliable |
| 80-89 | B+/B | Good: minor improvements possible |
| 70-79 | B- | Acceptable: some concerns |
| 60-69 | C+/C | Needs attention: degraded performance |
| <60 | D/F | Critical: immediate action required |
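The grading table can be expressed as a simple threshold ladder. Note that the table groups some bands (B+/B, C+/C, D/F) without giving their split points, so the cutoffs of 85, 65, and 50 below are illustrative assumptions, not documented behavior.

```python
def grade(score):
    """Map a 0-100 score to a letter grade per the table above.

    Split points inside the grouped bands (85 for B+/B, 65 for C+/C,
    50 for D/F) are assumed for illustration.
    """
    if score >= 95: return "A+"
    if score >= 90: return "A"
    if score >= 85: return "B+"   # assumed split of the B+/B band
    if score >= 80: return "B"
    if score >= 70: return "B-"
    if score >= 65: return "C+"   # assumed split of the C+/C band
    if score >= 60: return "C"
    if score >= 50: return "D"    # assumed split of the D/F band
    return "F"

print(grade(91))  # A
```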

API Endpoints

GET /scorecards β€” All agent scorecards

Query parameters: ?days=7 (lookback window, default: 7 days)

curl "http://localhost:3000/scorecards?days=30"

Returns an array of scorecards sorted by overall score:

[
  {
    "agent": "research-agent",
    "overall_score": 91,
    "grade": "A",
    "dimensions": {
      "reliability": { "score": 95, "grade": "A+" },
      "performance": { "score": 88, "grade": "B+" },
      "efficiency": { "score": 92, "grade": "A" },
      "throughput": { "score": 89, "grade": "B+" }
    },
    "sessions_analyzed": 142,
    "period_days": 30
  }
]

GET /scorecards/:agent β€” Single agent scorecard

Detailed scorecard for one agent, including per-dimension breakdowns and trend data.

Putting It All Together

Here's a recommended operational workflow using all five systems:

  1. Define SLA targets for each production agent (latency, error rate, cost).
  2. Create alert rules that fire when metrics approach SLA boundaries (e.g., alert at 80% of the SLA threshold).
  3. When an alert fires:
    • Run /triage/:sessionId for instant diagnostics
    • If the issue is significant, generate a /postmortem/:sessionId
    • Acknowledge the alert via /alerts/events/:id/acknowledge
  4. Review scorecards weekly to catch gradual degradation before it becomes critical.
  5. Check SLA compliance history monthly to inform capacity planning and agent improvements.
Example: Full incident response

# 1. Alert fires - check what happened
curl "http://localhost:3000/alerts/events?unacknowledged=true"

# 2. Auto-triage the flagged session
curl http://localhost:3000/triage/sess_abc123

# 3. Generate postmortem for the incident
curl -X POST http://localhost:3000/postmortem/sess_abc123

# 4. Acknowledge the alert
curl -X PUT http://localhost:3000/alerts/events/alert_xyz/acknowledge

# 5. Check if SLA was breached
curl -X POST http://localhost:3000/sla/check

# 6. Review overall agent health
curl http://localhost:3000/scorecards
💡 Automation tip

Call POST /alerts/evaluate every minute from a cron job or monitoring scheduler. When alerts fire, they automatically trigger webhooks; wire these to your incident management platform (PagerDuty, Opsgenie, Slack) for a fully automated response pipeline.