# Operations & Incident Response

Alerts, auto-triage, postmortems, SLA tracking, and agent scorecards: everything you need to run agents in production.
AgentLens ships with five interconnected operational subsystems that form a closed-loop incident lifecycle:
## 1. Alert Rules
Define threshold-based rules that fire when agent metrics exceed boundaries. Alerts can trigger webhooks for Slack, PagerDuty, or custom endpoints.
### Concepts
| Field | Description |
|---|---|
| `metric` | What to measure: `avg_duration_ms`, `error_rate`, `avg_tokens`, `session_count`, `avg_cost`, `p95_duration_ms`, `tool_error_rate` |
| `operator` | Comparison: `<`, `>`, `<=`, `>=`, `==`, `!=` |
| `threshold` | Numeric boundary value |
| `window_minutes` | Lookback window for metric evaluation (default: 60) |
| `agent_filter` | Optional: scope the rule to a specific agent name |
| `cooldown_minutes` | Minimum gap between repeated firings (default: 15) |
### API Endpoints
#### `POST /alerts/rules` – Create a rule

```bash
curl -X POST http://localhost:3000/alerts/rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High error rate",
    "metric": "error_rate",
    "operator": ">",
    "threshold": 0.15,
    "window_minutes": 30,
    "cooldown_minutes": 10
  }'
```
#### `GET /alerts/rules` – List all rules
Returns all configured alert rules with their current enabled/disabled state.
#### `PUT /alerts/rules/:ruleId` – Update a rule
Modify any field on an existing rule. Supports partial updates.
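For instance, a partial update that only tightens the threshold (the rule ID here is a hypothetical placeholder):

```bash
# rule_abc is a placeholder; use an ID returned by GET /alerts/rules
curl -X PUT http://localhost:3000/alerts/rules/rule_abc \
  -H "Content-Type: application/json" \
  -d '{"threshold": 0.10}'
```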
#### `DELETE /alerts/rules/:ruleId` – Delete a rule
Removes the rule and cascades deletion to all its fired alert events.
#### `POST /alerts/evaluate` – Evaluate all rules now
Manually triggers evaluation of all enabled rules against current metrics. Returns which rules fired and their metric values. Call this on a schedule (e.g., every minute via cron) for continuous monitoring.
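A minimal crontab entry for this, assuming AgentLens listens on localhost:3000 as in the examples above:

```bash
# Evaluate all enabled alert rules once a minute
* * * * * curl -s -X POST http://localhost:3000/alerts/evaluate > /dev/null
```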
#### `GET /alerts/events` – List fired alerts

Query parameters: `?limit=50&unacknowledged=true&rule_id=...`
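For example, to fetch the 50 most recent unacknowledged alerts (quote the URL so the shell does not interpret the `&`):

```bash
curl "http://localhost:3000/alerts/events?limit=50&unacknowledged=true"
```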
#### `PUT /alerts/events/:alertId/acknowledge` – Acknowledge an alert
Marks an alert event as acknowledged with a timestamp. Acknowledged alerts are excluded from unacknowledged queries.
#### `GET /alerts/metrics` – List available metrics
Returns the set of metric names you can use when creating rules, with descriptions.
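Handy as a first step before writing rules:

```bash
curl http://localhost:3000/alerts/metrics
```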
When an alert fires, AgentLens automatically calls any configured webhooks with the alert details. Set up a Slack incoming webhook to get instant notifications.
## 2. Auto-Triage
Auto-triage runs a battery of diagnostics on a session and returns a prioritized report with severity-ranked findings and remediation suggestions. It combines health scoring, anomaly detection, baseline drift analysis, error grouping, and cost analysis in a single call.
### API Endpoints
#### `GET /triage/:sessionId` – Triage a single session

```bash
curl http://localhost:3000/triage/sess_abc123
```
Returns a structured report (abbreviated):

```json
{
  "session_id": "sess_abc123",
  "health_score": 62,
  "severity": "warning",
  "findings": [
    {
      "category": "errors",
      "severity": "critical",
      "title": "3 tool errors in 12 events",
      "detail": "web_search failed 3× with timeout",
      "suggestion": "Increase tool timeout or add retry logic"
    },
    {
      "category": "cost",
      "severity": "warning",
      "title": "Cost 2.4× above baseline",
      "detail": "$0.48 vs $0.20 average for this agent",
      "suggestion": "Review token usage – possible prompt inflation"
    }
  ],
  "metrics": { "duration_ms": 14200, "total_tokens": 8400, "error_count": 3 }
}
```
#### `GET /triage/batch` – Triage recent sessions

Query parameters: `?limit=20&min_events=3`
Runs triage on multiple recent sessions and returns them sorted by health score (worst first). Useful for daily operational reviews.
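For example, to triage the 20 most recent sessions that have at least 3 events:

```bash
curl "http://localhost:3000/triage/batch?limit=20&min_events=3"
```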
Set up an alert rule on `error_rate > 0.1`, then call `/triage/:sessionId` on the flagged session for instant root-cause analysis. A sketch of that glue is shown below.
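A minimal shell sketch of that loop. It assumes, hypothetically, that each alert event carries a `session_id` field and that `/alerts/events` returns a JSON array; adjust the `jq` path to match your actual event schema:

```bash
# For every unacknowledged alert, triage the session it points at.
# ASSUMPTION: alert events expose a "session_id" field (not documented above).
for sid in $(curl -s "http://localhost:3000/alerts/events?unacknowledged=true" \
    | jq -r '.[].session_id // empty'); do
  curl -s "http://localhost:3000/triage/$sid"
done
```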
## 3. Postmortems
Generate structured postmortem reports from session data. The postmortem engine analyzes event traces to reconstruct timelines, identify error cascades, compute cost impact, and surface contributing factors.
### API Endpoints
#### `POST /postmortem/:sessionId` – Generate a postmortem

```bash
curl -X POST http://localhost:3000/postmortem/sess_abc123
```
Returns a comprehensive report including:

- **Timeline** – Chronological event sequence with durations and error markers
- **Error cascade analysis** – How one failure propagated to downstream steps
- **Cost breakdown** – Token-level cost attribution by model and event type
- **Tool performance** – Per-tool success rates, latencies, and failure modes
- **Contributing factors** – Identified root causes ranked by likelihood
- **Recommendations** – Actionable suggestions to prevent recurrence
#### `GET /postmortem/candidates` – List sessions worth investigating

Query parameters: `?min_errors=2&limit=20`
Returns recent sessions with high error counts, sorted by severity. Use this to find sessions that warrant a full postmortem.
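For example, the 20 most recent sessions with at least 2 errors:

```bash
curl "http://localhost:3000/postmortem/candidates?min_errors=2&limit=20"
```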
## 4. SLA Targets

Define service-level targets for your agents and track compliance over time. SLA targets let you set expectations for latency, error rates, cost, and throughput, then continuously measure whether agents meet them.
### Target Configuration
| Field | Type | Description |
|---|---|---|
| `agent_name` | string | Agent to track (required) |
| `metric` | string | Same metrics as alerts: `avg_duration_ms`, `error_rate`, `avg_cost`, etc. |
| `target_value` | number | The SLA boundary (e.g., `error_rate` ≤ 0.05) |
| `operator` | string | `<=` or `>=`, depending on the metric |
### API Endpoints
#### `PUT /sla/targets` – Create or update an SLA target

```bash
curl -X PUT http://localhost:3000/sla/targets \
  -H "Content-Type: application/json" \
  -d '{
    "agent_name": "research-agent",
    "metric": "error_rate",
    "target_value": 0.05,
    "operator": "<="
  }'
```
#### `GET /sla/targets` – List all SLA targets
#### `DELETE /sla/targets` – Remove an SLA target

Body: `{"agent_name": "...", "metric": "..."}`
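Unlike most DELETE endpoints, this one identifies the target via a JSON body rather than the URL:

```bash
curl -X DELETE http://localhost:3000/sla/targets \
  -H "Content-Type: application/json" \
  -d '{"agent_name": "research-agent", "metric": "error_rate"}'
```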
#### `POST /sla/check` – Check compliance now
Evaluates all targets against current data and returns pass/fail status with actual metric values.
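Note the POST method, since evaluation is triggered rather than merely read:

```bash
curl -X POST http://localhost:3000/sla/check
```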
#### `GET /sla/history` – Compliance history

Query parameters: `?agent_name=...&hours=168` (default: 7 days)
Returns time-series compliance snapshots showing how SLA adherence trends over time.
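For example, one week (168 hours) of history for a single agent:

```bash
curl "http://localhost:3000/sla/history?agent_name=research-agent&hours=168"
```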
#### `GET /sla/summary` – SLA dashboard summary

Aggregated view of all agents' SLA compliance: overall pass rate, worst offenders, and targets at risk of breach.
## 5. Agent Scorecards
Scorecards assign letter grades (A+ through F) to each agent based on multiple performance dimensions. They provide a quick health-at-a-glance view for operational dashboards.
### Grading Dimensions
- **Reliability** – Error rate and consistency (low errors = high score)
- **Performance** – Response latency relative to baseline
- **Efficiency** – Token usage and cost per session
- **Throughput** – Session volume and completion rate

Each dimension scores 0–100, combined into an overall grade:
| Score | Grade | Meaning |
|---|---|---|
| 95–100 | A+ | Exceptional – exceeding all targets |
| 90–94 | A | Excellent – consistently reliable |
| 80–89 | B+/B | Good – minor improvements possible |
| 70–79 | B- | Acceptable – some concerns |
| 60–69 | C+/C | Needs attention – degraded performance |
| <60 | D/F | Critical – immediate action required |
### API Endpoints
#### `GET /scorecards` – All agent scorecards

Query parameters: `?days=7` (lookback window, default: 7 days)

```bash
curl "http://localhost:3000/scorecards?days=30"
```
Returns an array of scorecards sorted by overall score:

```json
[
  {
    "agent": "research-agent",
    "overall_score": 91,
    "grade": "A",
    "dimensions": {
      "reliability": { "score": 95, "grade": "A+" },
      "performance": { "score": 88, "grade": "B+" },
      "efficiency": { "score": 92, "grade": "A" },
      "throughput": { "score": 89, "grade": "B+" }
    },
    "sessions_analyzed": 142,
    "period_days": 30
  }
]
```
#### `GET /scorecards/:agent` – Single agent scorecard
Detailed scorecard for one agent, including per-dimension breakdowns and trend data.
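For example, using the agent name from the listing above:

```bash
curl http://localhost:3000/scorecards/research-agent
```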
## Putting It All Together
Here's a recommended operational workflow using all five systems:
- Define SLA targets for each production agent (latency, error rate, cost).
- Create alert rules that fire when metrics approach SLA boundaries (e.g., alert at 80% of the SLA threshold).
- When an alert fires:
  - Run `/triage/:sessionId` for instant diagnostics.
  - If the issue is significant, generate a `/postmortem/:sessionId`.
  - Acknowledge the alert via `/alerts/events/:id/acknowledge`.
- Review scorecards weekly to catch gradual degradation before it becomes critical.
- Check SLA compliance history monthly to inform capacity planning and agent improvements.
### Example: Full incident response

```bash
# 1. Alert fires – check what happened
curl "http://localhost:3000/alerts/events?unacknowledged=true"

# 2. Auto-triage the flagged session
curl http://localhost:3000/triage/sess_abc123

# 3. Generate a postmortem for the incident
curl -X POST http://localhost:3000/postmortem/sess_abc123

# 4. Acknowledge the alert
curl -X PUT http://localhost:3000/alerts/events/alert_xyz/acknowledge

# 5. Check if the SLA was breached
curl -X POST http://localhost:3000/sla/check

# 6. Review overall agent health
curl http://localhost:3000/scorecards
```
Call `POST /alerts/evaluate` every minute from a cron job or monitoring scheduler. When alerts fire, they automatically trigger webhooks; wire these to your incident management platform (PagerDuty, Opsgenie, Slack) for a fully automated response pipeline.