# Monitoring & Analysis
Health scoring, anomaly detection, SLA evaluation, budgets, postmortem generation, and latency profiling for production AI agents.
## Health Scoring

Score agent sessions on six dimensions with letter grades (A–F) and actionable recommendations. Use this for at-a-glance session quality assessment.

```python
from agentlens import HealthScorer, HealthThresholds

# Default thresholds
scorer = HealthScorer()

# Score a list of events
report = scorer.score(events, session_id="session-123")
print(report.render())

# Score a Session object directly
report = scorer.score_session(session)

# Inspect results
print(f"Grade: {report.grade.value}")    # A, B, C, D, F
print(f"Score: {report.overall_score}")  # 0-100
for metric in report.metrics:
    print(f"  {metric.name}: {metric.score}/100 ({metric.grade.value})")

# Actionable recommendations
for rec in report.recommendations:
    print(f"  → {rec}")
```
### Scored Dimensions
| Metric | What It Measures | Weight |
|---|---|---|
| Error Rate | Fraction of events that are errors | High |
| Avg Latency | Mean response time across events | Medium |
| P95 Latency | 95th percentile latency | Medium |
| Tool Success | Ratio of successful tool calls | Medium |
| Token Efficiency | Tokens used relative to event count | Low |
| Event Volume | Whether event count is in a healthy range | Low |
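As a rough sketch of how the weights above can combine into a single 0–100 score (the numeric weight values and the formula here are illustrative assumptions, not AgentLens internals):

```python
# Illustrative weights: High=3, Medium=2, Low=1 (assumed values).
WEIGHTS = {
    "error_rate": 3.0,        # High
    "avg_latency": 2.0,       # Medium
    "p95_latency": 2.0,       # Medium
    "tool_success": 2.0,      # Medium
    "token_efficiency": 1.0,  # Low
    "event_volume": 1.0,      # Low
}

def overall_score(metric_scores: dict) -> float:
    """Weighted average of per-metric scores (each 0-100)."""
    total_weight = sum(WEIGHTS[name] for name in metric_scores)
    weighted = sum(score * WEIGHTS[name] for name, score in metric_scores.items())
    return weighted / total_weight

print(overall_score({name: 100.0 for name in WEIGHTS}))  # 100.0
```

With this weighting, a poor error rate drags the overall score down roughly three times as hard as a poor token-efficiency score.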
### Custom Thresholds

```python
# Tighter thresholds for production
thresholds = HealthThresholds(
    error_rate_warn=0.05,      # Warn at 5% errors
    error_rate_critical=0.15,  # Critical at 15%
    latency_warn_ms=2000,      # Warn at 2s avg latency
    latency_critical_ms=8000,  # Critical at 8s
)
scorer = HealthScorer(thresholds=thresholds)
```
### Grade Scale
| Grade | Score | Interpretation |
|---|---|---|
| A | 90–100 | Excellent — agent is performing optimally |
| B | 80–89 | Good — minor areas for improvement |
| C | 70–79 | Acceptable — several metrics need attention |
| D | 60–69 | Poor — significant issues present |
| F | 0–59 | Failing — immediate action required |
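The grade boundaries above translate into a simple lookup; a minimal sketch of the same scale:

```python
def grade_for(score: float) -> str:
    """Map a 0-100 health score to a letter grade per the table above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

print(grade_for(84))  # B
```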
## Anomaly Detection

Statistical anomaly detection using z-scores against learned baselines. Detects latency spikes, token anomalies, and error bursts in real time.

```python
from agentlens import AnomalyDetector, AnomalyDetectorConfig

config = AnomalyDetectorConfig(
    z_score_threshold=2.5,   # σ threshold for anomaly
    min_baseline_events=10,  # Min events before detection starts
    latency_weight=1.0,
    token_weight=1.0,
    error_weight=1.5,        # Weight errors more heavily
)
detector = AnomalyDetector(config=config)

# Train on historical (known-good) events
detector.train(historical_events)

# Analyze new events
report = detector.analyze(new_events)
print(f"Anomalies: {report.anomaly_count}")
print(f"Max severity: {report.max_severity}")
print(report.summary())

# Group anomalies
by_kind = report.by_kind()  # dict[AnomalyKind, list]
critical = report.critical_count()
warnings = report.warning_count()
```
### Anomaly Kinds

| Kind | Description | Typical Cause |
|---|---|---|
| LATENCY_SPIKE | Response time significantly above baseline | API throttling, model overload |
| TOKEN_SPIKE | Token usage significantly above baseline | Prompt injection, unexpected input |
| ERROR_BURST | Error rate significantly above baseline | API outage, bad deployment |
| LATENCY_DROP | Unusually fast responses | Short-circuiting, cached responses |
| TOKEN_DROP | Unusually low token usage | Truncated responses, model errors |
### Severity Levels

| Severity | z-score Range | Action |
|---|---|---|
| LOW | 2.0–2.5σ | Monitor, log for trending |
| MEDIUM | 2.5–3.0σ | Investigate within 1 hour |
| HIGH | 3.0–4.0σ | Investigate immediately |
| CRITICAL | > 4.0σ | Page on-call, potential incident |
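To see how these bands work in practice, here is a minimal z-score calculation with the table's thresholds. The statistics are standard; the exact internals of AnomalyDetector may differ:

```python
from statistics import mean, stdev

def z_score(value, baseline):
    """How many standard deviations `value` sits from the baseline mean."""
    return (value - mean(baseline)) / stdev(baseline)

def severity(z):
    """Map |z| to the severity bands in the table above (None = not anomalous)."""
    z = abs(z)
    if z > 4.0:
        return "CRITICAL"
    if z > 3.0:
        return "HIGH"
    if z > 2.5:
        return "MEDIUM"
    if z > 2.0:
        return "LOW"
    return None

# A latency of 135ms against a ~100ms baseline is nearly 5σ out:
print(severity(z_score(135, [90, 100, 110, 100, 100])))  # CRITICAL
```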
## SLA Evaluation

Evaluate agent sessions against Service Level Objectives (SLOs). Use the built-in production and development policies, or define your own.

```python
from agentlens import SLAEvaluator, SLObjective, SLAPolicy

policy = SLAPolicy(
    name="production-sla",
    objectives=[
        SLObjective.latency_p95(target_ms=3000, slo_percent=99.0),
        SLObjective.error_rate(target_rate=0.01, slo_percent=99.5),
        SLObjective.token_budget(target_per_session=10000, slo_percent=95.0),
        SLObjective.tool_success_rate(target_rate=0.95),
        SLObjective.throughput(min_events=5, slo_percent=95.0),
    ],
)

# Or use built-in policies
from agentlens import production_policy, development_policy
policy = production_policy()

evaluator = SLAEvaluator()
report = evaluator.evaluate(sessions, policy)
print(report.render())

for result in report.results:
    print(f"  {result.objective.name}: {result.compliance_rate:.1%} "
          f"(target: {result.objective.slo_percent}%) — {result.status.value}")
```
### Objective Types

| Kind | Factory Method | What It Measures |
|---|---|---|
| LATENCY_P95 | SLObjective.latency_p95() | 95th percentile latency per session |
| LATENCY_AVG | SLObjective.latency_avg() | Average latency per session |
| ERROR_RATE | SLObjective.error_rate() | Fraction of error events |
| TOKEN_BUDGET | SLObjective.token_budget() | Mean tokens per session |
| TOOL_SUCCESS | SLObjective.tool_success_rate() | Tool call success ratio |
| THROUGHPUT | SLObjective.throughput() | Minimum events per session |
### Compliance Statuses

| Status | Meaning |
|---|---|
| MET | Objective fully satisfied |
| AT_RISK | Within 5% of threshold |
| BREACHED | Below target SLO percentage |
| NO_DATA | Insufficient data to evaluate |
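One plausible reading of the AT_RISK band, sketched on a single measured metric versus its target. The 5% margin and the direction handling are assumptions, not the evaluator's documented logic:

```python
def compliance_status(measured, target, higher_is_worse=True, at_risk_margin=0.05):
    """Classify a metric against its target per the table above.

    AT_RISK is interpreted here as: passing, but within `at_risk_margin`
    (5%) of the target -- e.g. a measured p95 of 2900ms against a 3000ms cap.
    """
    if higher_is_worse:
        if measured > target:
            return "BREACHED"
        if measured > target * (1 - at_risk_margin):
            return "AT_RISK"
    else:
        if measured < target:
            return "BREACHED"
        if measured < target * (1 + at_risk_margin):
            return "AT_RISK"
    return "MET"

print(compliance_status(2900, 3000))  # AT_RISK
```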
## Token Budgets

Enforce per-session or per-agent token and cost budgets with threshold callbacks, preventing runaway costs from misbehaving agents.

```python
from agentlens import BudgetTracker, BudgetStatus

tracker = BudgetTracker()

# Create a budget with token and cost caps
budget = tracker.create_budget(
    budget_id="agent-session-1",
    max_tokens=50000,
    max_cost=2.50,
    warn_threshold=0.8,  # Fire callback at 80%
)

# Threshold notifications
def on_threshold(budget, status):
    print(f"Budget '{budget.budget_id}' → {status.value}")

tracker.on_threshold(on_threshold)

# Record token usage
tracker.record(
    budget_id="agent-session-1",
    tokens_in=1200,
    tokens_out=800,
    model="gpt-4",
)

# Check status
print(f"Used: {budget.utilization:.0%}")
print(f"Remaining: {budget.remaining_tokens}")
print(f"Status: {budget.status.value}")  # OK, WARNING, EXCEEDED

# Link budgets to sessions
tracker.record_for_session(
    session_id="sess-abc",
    tokens_in=500, tokens_out=200, model="gpt-4",
)
report = tracker.report_for_session("sess-abc")
```
### Budget Statuses

| Status | Meaning | Behavior |
|---|---|---|
| OK | Under warning threshold | Normal operation |
| WARNING | Approaching limit (≥ warn_threshold) | Callback fires |
| EXCEEDED | Over budget | Raises BudgetExceededError when enforce=True |
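The status transitions can be sketched from utilization alone. `BudgetExceededError` here is a stand-in class mirroring the error named above; the real tracker's logic may differ:

```python
class BudgetExceededError(RuntimeError):
    """Stand-in for the library's budget-exhaustion error."""

def budget_status(used_tokens, max_tokens, warn_threshold=0.8, enforce=False):
    """Classify a budget per the table above from its token utilization."""
    utilization = used_tokens / max_tokens
    if utilization > 1.0:
        if enforce:
            raise BudgetExceededError(f"{used_tokens}/{max_tokens} tokens used")
        return "EXCEEDED"
    if utilization >= warn_threshold:
        return "WARNING"
    return "OK"

print(budget_status(45000, 50000))  # WARNING
```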
## Latency Profiling

Profile agent execution with step-by-step timing, percentile stats, and slow-step detection to pinpoint performance bottlenecks in multi-step agent workflows.

```python
from agentlens import LatencyProfiler

profiler = LatencyProfiler()

# Start a profiling session
session = profiler.start_session("agent-workflow")

# Record steps (manually or from events)
session.record_step("prompt-build", duration_ms=45)
session.record_step("llm-call", duration_ms=2300)
session.record_step("tool-execution", duration_ms=150)
session.record_step("response-format", duration_ms=30)

# Get the session report
report = session.report()
print(f"Total duration: {report.total_duration_ms}ms")
print(f"Slowest step: {report.slowest_step.name} ({report.slowest_step.duration_ms}ms)")

# Percentile stats across multiple sessions
profiler.add_session(session)
stats = profiler.percentiles("llm-call")
print(f"P50: {stats.p50}ms, P95: {stats.p95}ms, P99: {stats.p99}ms")

# Slow step alerts
alerts = profiler.slow_step_alerts(threshold_ms=2000)
for alert in alerts:
    print(f"  ⚠ {alert.step_name}: {alert.duration_ms}ms (threshold: {alert.threshold_ms}ms)")
```
## Postmortem Generation

Automatically generate structured incident postmortem reports from session data. Identifies root causes, assesses impact, and suggests remediations.

```python
from agentlens import PostmortemGenerator, PostmortemConfig

config = PostmortemConfig(
    error_rate_threshold=0.1,   # Incident if > 10% errors
    latency_threshold_ms=5000,  # Incident if avg > 5s
    include_timeline=True,
    include_remediation=True,
)
generator = PostmortemGenerator(config=config)

# Generate from a session's events
report = generator.generate(
    events=session_events,
    session_id="sess-incident-42",
    title="Agent latency spike at 14:30 UTC",
)

# Inspect the report
print(f"Severity: {report.severity.value}")
for cause in report.root_causes:
    print(f"Root cause: {cause.description}")
    print(f"  Confidence: {cause.confidence}")
    print(f"  Evidence: {cause.evidence}")
for fix in report.remediations:
    print(f"Fix: {fix.description} ({fix.category.value})")
    print(f"  Priority: {fix.priority}")

# Impact assessment
print(f"Affected events: {report.impact.affected_events}")
print(f"Error rate: {report.impact.error_rate:.1%}")
print(f"Duration: {report.impact.duration_ms}ms")

# Lessons learned
for lesson in report.lessons_learned:
    print(f"Lesson: {lesson.description}")

# Export as Markdown
md = report.render_markdown()

# Export as dict (for storage / dashboards)
data = report.to_dict()
```
### Remediation Categories

| Category | Examples |
|---|---|
| CONFIGURATION | Adjust timeouts, rate limits, retry policies |
| CODE_CHANGE | Fix bugs, add error handling, optimize prompts |
| INFRASTRUCTURE | Scale resources, add caching, improve monitoring |
| PROCESS | Add runbooks, update on-call procedures |
| EXTERNAL | Upstream API issues, provider outages |
## Prompt Version Tracking

Track prompt versions, compare diffs, and correlate prompt changes with quality outcomes. Essential for prompt engineering workflows.

```python
from agentlens import PromptVersionTracker, Outcome

tracker = PromptVersionTracker()

# Register prompt versions
v1 = tracker.add_version(
    prompt_id="summarizer",
    version="1.0",
    template="Summarize the following text: {text}",
    metadata={"author": "alice", "model": "gpt-4"},
)
v2 = tracker.add_version(
    prompt_id="summarizer",
    version="1.1",
    template="Provide a concise summary (max 3 sentences) of: {text}",
    metadata={"author": "bob", "model": "gpt-4"},
)

# Record outcomes for each version
tracker.record_outcome("summarizer", "1.0", Outcome.SUCCESS, latency_ms=1200)
tracker.record_outcome("summarizer", "1.0", Outcome.SUCCESS, latency_ms=1100)
tracker.record_outcome("summarizer", "1.1", Outcome.SUCCESS, latency_ms=900)
tracker.record_outcome("summarizer", "1.1", Outcome.FAILURE, latency_ms=2500)

# Compare versions
diff = tracker.diff("summarizer", "1.0", "1.1")
print(f"Diff kind: {diff.kind.value}")  # MODIFIED, ADDED, REMOVED
print(f"Changes: {diff.changes}")

# Generate a report
report = tracker.report("summarizer")
print(report.render())
for stat in report.version_stats:
    print(f"  v{stat.version}: {stat.success_rate:.0%} success, "
          f"avg latency {stat.avg_latency_ms:.0f}ms")
```
## Cost Optimization

Analyze token usage patterns and recommend cheaper models for simpler tasks. Includes complexity analysis to route requests to the right model tier.

```python
from agentlens import CostOptimizer, ComplexityAnalyzer, ComplexityLevel

# Analyze prompt complexity
analyzer = ComplexityAnalyzer()
assessment = analyzer.assess("What is the capital of France?")
print(f"Complexity: {assessment.level.value}")  # LOW, MEDIUM, HIGH, EXPERT

# Get optimization recommendations
optimizer = CostOptimizer()
report = optimizer.analyze(events)
for rec in report.recommendations:
    print(f"  {rec.description}")
    print(f"    Current model: {rec.current_model}")
    print(f"    Suggested: {rec.suggested_model}")
    print(f"    Savings: ${rec.estimated_savings:.2f}/month")
    print(f"    Confidence: {rec.confidence.value}")

# Migration plan
for step in report.migration_steps:
    print(f"Step {step.order}: {step.description}")
print(f"Total potential savings: ${report.total_savings:.2f}/month")
```
### Model Tiers

| Tier | Use Case | Example Models |
|---|---|---|
| ECONOMY | Simple lookups, classification | GPT-3.5-turbo, Claude Haiku |
| STANDARD | General tasks, summarization | GPT-4o-mini, Claude Sonnet |
| PREMIUM | Complex reasoning, code gen | GPT-4, Claude Opus |
| ENTERPRISE | Critical, high-stakes tasks | GPT-4-turbo, Claude 3.5 Sonnet |
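A minimal routing sketch mapping complexity levels to the tiers above. The one-to-one mapping is an assumption inferred from the two tables, not CostOptimizer's actual policy:

```python
# Assumed complexity → tier mapping (illustrative only).
TIER_FOR_COMPLEXITY = {
    "LOW": "ECONOMY",
    "MEDIUM": "STANDARD",
    "HIGH": "PREMIUM",
    "EXPERT": "ENTERPRISE",
}

def route(complexity_level):
    """Pick a model tier for an assessed complexity level."""
    return TIER_FOR_COMPLEXITY[complexity_level]

print(route("LOW"))  # ECONOMY
```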
## Capacity Planning

Forecast resource needs and identify bottlenecks before they hit production. Analyze workload trends and generate scaling recommendations.

```python
from agentlens import CapacityPlanner, WorkloadSample, ResourceKind
from datetime import datetime

planner = CapacityPlanner()

# Feed workload samples
planner.add_samples([
    WorkloadSample(
        timestamp=datetime(2025, 1, d),
        requests=1000 * d,
        tokens=50000 * d,
        cost=2.5 * d,
        latency_p95_ms=800 + d * 10,
    )
    for d in range(1, 31)
])

# Generate capacity report
report = planner.report()

# Workload projections (7, 30, 90 day)
for proj in report.projections:
    print(f"  {proj.horizon_days}d: {proj.projected_requests:.0f} req/day, "
          f"${proj.projected_cost:.2f}/day")

# Bottleneck detection
for bottleneck in report.bottlenecks:
    print(f"  ⚠ {bottleneck.resource.value}: {bottleneck.severity.value}")
    print(f"    Current: {bottleneck.current_utilization:.0%}")
    print(f"    Projected peak: {bottleneck.projected_peak:.0%}")

# Scaling recommendations
for rec in report.scaling_recommendations:
    print(f"  {rec.action.value}: {rec.description}")
    print(f"    When: {rec.trigger_condition}")
```
## Building a Monitoring Pipeline

Combine modules into a comprehensive monitoring pipeline. Here's a production-ready example that runs health checks, anomaly detection, SLA evaluation, and budget tracking together (the `send_to_slack` and `alert` calls are placeholders for your own notification hooks):

```python
from agentlens import (
    HealthScorer, AnomalyDetector, AnomalyDetectorConfig,
    SLAEvaluator, BudgetTracker, PostmortemGenerator,
    production_policy,
)

# ── Initialize monitors ──────────────────────────────────────
scorer = HealthScorer()
detector = AnomalyDetector(AnomalyDetectorConfig(z_score_threshold=2.5))
evaluator = SLAEvaluator()
tracker = BudgetTracker()
postmortem_gen = PostmortemGenerator()
sla_policy = production_policy()

# ── Train anomaly detector on baseline data ──────────────────
detector.train(baseline_events)

# ── Per-session monitoring loop ──────────────────────────────
def monitor_session(session, events):
    # 1. Health check
    health = scorer.score(events, session_id=session["id"])
    print(f"Health: {health.grade.value} ({health.overall_score}/100)")

    # 2. Anomaly scan
    anomalies = detector.analyze(events)
    if anomalies.has_anomalies:
        print(f"⚠ {anomalies.anomaly_count} anomalies detected")
        if anomalies.critical_count() > 0:
            # Auto-generate postmortem for critical anomalies
            report = postmortem_gen.generate(
                events=events,
                session_id=session["id"],
                title=f"Critical anomaly in {session['agent_name']}",
            )
            send_to_slack(report.render_markdown())

    # 3. SLA evaluation
    sla = evaluator.evaluate([session], sla_policy)
    for result in sla.results:
        if result.status.value == "BREACHED":
            alert(f"SLA breached: {result.objective.name}")

    # 4. Budget tracking
    tracker.record_for_session(
        session_id=session["id"],
        tokens_in=sum(e.get("tokens_in", 0) for e in events),
        tokens_out=sum(e.get("tokens_out", 0) for e in events),
        model=events[0].get("model", "unknown"),
    )
```
Start with `HealthScorer` for a quick overview, add `AnomalyDetector` when you have baseline data, and layer in `SLAEvaluator` and `BudgetTracker` as you define SLOs and cost targets. Use `PostmortemGenerator` to automate incident documentation.