Monitoring & Analysis

Health scoring, anomaly detection, SLA evaluation, budgets, latency profiling, postmortem generation, prompt version tracking, cost optimization, and capacity planning for production AI agents.

Overview: AgentLens provides nine monitoring modules that work together to give you production-grade observability over AI agent behavior. Each module operates on the same event and session data, so you can combine them into a comprehensive monitoring pipeline.

Health Scoring

Score agent sessions on 6 dimensions with letter grades (A–F) and actionable recommendations. Use this for at-a-glance session quality assessment.

```python
from agentlens import HealthScorer, HealthThresholds

# Default thresholds
scorer = HealthScorer()

# Score a list of events
report = scorer.score(events, session_id="session-123")
print(report.render())

# Score a Session object directly
report = scorer.score_session(session)

# Inspect results
print(f"Grade: {report.grade.value}")    # A, B, C, D, F
print(f"Score: {report.overall_score}")  # 0-100
for metric in report.metrics:
    print(f"  {metric.name}: {metric.score}/100 ({metric.grade.value})")

# Actionable recommendations
for rec in report.recommendations:
    print(f"  → {rec}")
```

Scored Dimensions

| Metric | What It Measures | Weight |
| --- | --- | --- |
| Error Rate | Fraction of events that are errors | High |
| Avg Latency | Mean response time across events | Medium |
| P95 Latency | 95th percentile latency | Medium |
| Tool Success | Ratio of successful tool calls | Medium |
| Token Efficiency | Tokens used relative to event count | Low |
| Event Volume | Whether event count is in a healthy range | Low |
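
As a rough sketch of how weighted dimensions could roll up into an overall score: the weight values below simply mirror the High/Medium/Low column as 3/2/1 and are illustrative assumptions, not HealthScorer's actual internals.

```python
# Illustrative weights mirroring the High/Medium/Low column above;
# the real HealthScorer weighting may differ.
WEIGHTS = {
    "error_rate": 3.0,        # High
    "avg_latency": 2.0,       # Medium
    "p95_latency": 2.0,       # Medium
    "tool_success": 2.0,      # Medium
    "token_efficiency": 1.0,  # Low
    "event_volume": 1.0,      # Low
}

def overall_score(metric_scores: dict) -> float:
    """Weighted average of per-metric scores, each on a 0-100 scale."""
    total = sum(WEIGHTS[name] for name in metric_scores)
    return sum(s * WEIGHTS[name] for name, s in metric_scores.items()) / total
```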

Custom Thresholds

```python
# Tighter thresholds for production
thresholds = HealthThresholds(
    error_rate_warn=0.05,        # Warn at 5% errors
    error_rate_critical=0.15,    # Critical at 15%
    latency_warn_ms=2000,        # Warn at 2s avg latency
    latency_critical_ms=8000,    # Critical at 8s
)
scorer = HealthScorer(thresholds=thresholds)
```

Grade Scale

| Grade | Score | Interpretation |
| --- | --- | --- |
| A | 90–100 | Excellent — agent is performing optimally |
| B | 80–89 | Good — minor areas for improvement |
| C | 70–79 | Acceptable — several metrics need attention |
| D | 60–69 | Poor — significant issues present |
| F | 0–59 | Failing — immediate action required |
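
The scale above maps directly to a threshold check; a minimal sketch:

```python
def grade(score: float) -> str:
    """Map a 0-100 health score to a letter grade per the scale above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```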

Anomaly Detection

Statistical anomaly detection using z-scores against learned baselines. Detects latency spikes, token anomalies, and error bursts in real time.

```python
from agentlens import AnomalyDetector, AnomalyDetectorConfig

config = AnomalyDetectorConfig(
    z_score_threshold=2.5,         # σ threshold for anomaly
    min_baseline_events=10,        # Min events before detection starts
    latency_weight=1.0,
    token_weight=1.0,
    error_weight=1.5,              # Weight errors more heavily
)

detector = AnomalyDetector(config=config)

# Train on historical (known-good) events
detector.train(historical_events)

# Analyze new events
report = detector.analyze(new_events)

print(f"Anomalies: {report.anomaly_count}")
print(f"Max severity: {report.max_severity}")
print(report.summary())

# Group anomalies
by_kind = report.by_kind()          # dict[AnomalyKind, list]
critical = report.critical_count()
warnings = report.warning_count()
```

Anomaly Kinds

| Kind | Description | Typical Cause |
| --- | --- | --- |
| LATENCY_SPIKE | Response time significantly above baseline | API throttling, model overload |
| TOKEN_SPIKE | Token usage significantly above baseline | Prompt injection, unexpected input |
| ERROR_BURST | Error rate significantly above baseline | API outage, bad deployment |
| LATENCY_DROP | Unusually fast responses | Short-circuiting, cached responses |
| TOKEN_DROP | Unusually low token usage | Truncated responses, model errors |

Severity Levels

| Severity | z-score Range | Action |
| --- | --- | --- |
| LOW | 2.0–2.5σ | Monitor, log for trending |
| MEDIUM | 2.5–3.0σ | Investigate within 1 hour |
| HIGH | 3.0–4.0σ | Investigate immediately |
| CRITICAL | > 4.0σ | Page on-call, potential incident |
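
Z-score detection itself is simple to sketch: measure how far a new value sits from the baseline mean in units of standard deviation, then bucket it per the table above. This is a standalone illustration, not AnomalyDetector's internals.

```python
from statistics import mean, stdev

def z_score(value: float, baseline: list) -> float:
    """Standard deviations between `value` and the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return 0.0 if sigma == 0 else (value - mu) / sigma

def severity(z: float):
    """Bucket |z| into the severity levels from the table above."""
    z = abs(z)
    if z > 4.0:
        return "CRITICAL"
    if z > 3.0:
        return "HIGH"
    if z > 2.5:
        return "MEDIUM"
    if z > 2.0:
        return "LOW"
    return None  # within normal variation
```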

SLA Evaluation

Evaluate agent sessions against Service Level Objectives (SLOs). Built-in production and development policies, or define your own.

```python
from agentlens import SLAEvaluator, SLObjective, SLAPolicy

policy = SLAPolicy(
    name="production-sla",
    objectives=[
        SLObjective.latency_p95(target_ms=3000, slo_percent=99.0),
        SLObjective.error_rate(target_rate=0.01, slo_percent=99.5),
        SLObjective.token_budget(target_per_session=10000, slo_percent=95.0),
        SLObjective.tool_success_rate(target_rate=0.95),
        SLObjective.throughput(min_events=5, slo_percent=95.0),
    ],
)

# Or use built-in policies
from agentlens import production_policy, development_policy
policy = production_policy()

evaluator = SLAEvaluator()
report = evaluator.evaluate(sessions, policy)
print(report.render())

for result in report.results:
    print(f"  {result.objective.name}: {result.compliance_rate:.1%} "
          f"(target: {result.objective.slo_percent}%) — {result.status.value}")
```

Objective Types

| Kind | Factory Method | What It Measures |
| --- | --- | --- |
| LATENCY_P95 | SLObjective.latency_p95() | 95th percentile latency per session |
| LATENCY_AVG | SLObjective.latency_avg() | Average latency per session |
| ERROR_RATE | SLObjective.error_rate() | Fraction of error events |
| TOKEN_BUDGET | SLObjective.token_budget() | Mean tokens per session |
| TOOL_SUCCESS | SLObjective.tool_success_rate() | Tool call success ratio |
| THROUGHPUT | SLObjective.throughput() | Minimum events per session |

Compliance Statuses

| Status | Meaning |
| --- | --- |
| MET | Objective fully satisfied |
| AT_RISK | Within 5% of threshold |
| BREACHED | Below target SLO percentage |
| NO_DATA | Insufficient data to evaluate |
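
A sketch of how a status could be derived from a measured compliance rate. The width of the AT_RISK band is an assumption (the table only says "within 5%"; here it is modeled as a configurable margin in percentage points above the target):

```python
def compliance_status(compliance_rate: float, slo_percent: float,
                      sample_count: int, margin_points: float = 0.5) -> str:
    """Classify an SLO result per the statuses above.

    margin_points is an assumed 'at risk' band above the target,
    in percentage points — not necessarily SLAEvaluator's definition.
    """
    if sample_count == 0:
        return "NO_DATA"
    target = slo_percent / 100.0
    if compliance_rate < target:
        return "BREACHED"
    if compliance_rate < target + margin_points / 100.0:
        return "AT_RISK"
    return "MET"
```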

Token Budgets

Enforce per-session or per-agent token and cost budgets with threshold callbacks. Prevents runaway costs from misbehaving agents.

```python
from agentlens import BudgetTracker, BudgetStatus

tracker = BudgetTracker()

# Create a budget with token and cost caps
budget = tracker.create_budget(
    budget_id="agent-session-1",
    max_tokens=50000,
    max_cost=2.50,
    warn_threshold=0.8,   # Fire callback at 80%
)

# Threshold notifications
def on_threshold(budget, status):
    print(f"Budget '{budget.budget_id}' → {status.value}")

tracker.on_threshold(on_threshold)

# Record token usage
tracker.record(
    budget_id="agent-session-1",
    tokens_in=1200,
    tokens_out=800,
    model="gpt-4",
)

# Check status
print(f"Used: {budget.utilization:.0%}")
print(f"Remaining: {budget.remaining_tokens}")
print(f"Status: {budget.status.value}")  # OK, WARNING, EXCEEDED

# Link budgets to sessions
tracker.record_for_session(
    session_id="sess-abc",
    tokens_in=500, tokens_out=200, model="gpt-4",
)
report = tracker.report_for_session("sess-abc")
```

Budget Statuses

| Status | Meaning | Behavior |
| --- | --- | --- |
| OK | Under warning threshold | Normal operation |
| WARNING | Approaching limit (≥ warn_threshold) | Callback fires |
| EXCEEDED | Over budget | Raises BudgetExceededError when enforce=True |
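
The transitions amount to a utilization check; a minimal sketch using the warn_threshold=0.8 default from the example above (illustrative, not BudgetTracker's code):

```python
def budget_status(used_tokens: int, max_tokens: int,
                  warn_threshold: float = 0.8) -> str:
    """Classify token utilization against the statuses above."""
    utilization = used_tokens / max_tokens
    if utilization >= 1.0:
        return "EXCEEDED"
    if utilization >= warn_threshold:
        return "WARNING"
    return "OK"
```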

Latency Profiling

Profile agent execution with step-by-step timing, percentile stats, and slow-step detection. Identifies performance bottlenecks in multi-step agent workflows.

```python
from agentlens import LatencyProfiler

profiler = LatencyProfiler()

# Start a profiling session
session = profiler.start_session("agent-workflow")

# Record steps (manually or from events)
session.record_step("prompt-build", duration_ms=45)
session.record_step("llm-call", duration_ms=2300)
session.record_step("tool-execution", duration_ms=150)
session.record_step("response-format", duration_ms=30)

# Get the session report
report = session.report()
print(f"Total duration: {report.total_duration_ms}ms")
print(f"Slowest step: {report.slowest_step.name} ({report.slowest_step.duration_ms}ms)")

# Percentile stats across multiple sessions
profiler.add_session(session)
stats = profiler.percentiles("llm-call")
print(f"P50: {stats.p50}ms, P95: {stats.p95}ms, P99: {stats.p99}ms")

# Slow step alerts
alerts = profiler.slow_step_alerts(threshold_ms=2000)
for alert in alerts:
    print(f"  ⚠ {alert.step_name}: {alert.duration_ms}ms (threshold: {alert.threshold_ms}ms)")
```
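
Percentile stats like p50/p95/p99 can be computed several ways; a minimal nearest-rank sketch (one common convention — the profiler's internal interpolation may differ):

```python
import math

def percentile(samples: list, p: float):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of the data is at or below it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```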

Postmortem Generation

Automatically generate structured incident postmortem reports from session data. Identifies root causes, assesses impact, and suggests remediations.

```python
from agentlens import PostmortemGenerator, PostmortemConfig

config = PostmortemConfig(
    error_rate_threshold=0.1,       # Incident if > 10% errors
    latency_threshold_ms=5000,      # Incident if avg > 5s
    include_timeline=True,
    include_remediation=True,
)

generator = PostmortemGenerator(config=config)

# Generate from a session's events
report = generator.generate(
    events=session_events,
    session_id="sess-incident-42",
    title="Agent latency spike at 14:30 UTC",
)

# Inspect the report
print(f"Severity: {report.severity.value}")

for cause in report.root_causes:
    print(f"Root cause: {cause.description}")
    print(f"  Confidence: {cause.confidence}")
    print(f"  Evidence: {cause.evidence}")

for fix in report.remediations:
    print(f"Fix: {fix.description} ({fix.category.value})")
    print(f"  Priority: {fix.priority}")

# Impact assessment
print(f"Affected events: {report.impact.affected_events}")
print(f"Error rate: {report.impact.error_rate:.1%}")
print(f"Duration: {report.impact.duration_ms}ms")

# Lessons learned
for lesson in report.lessons_learned:
    print(f"Lesson: {lesson.description}")

# Export as Markdown
md = report.render_markdown()

# Export as dict (for storage / dashboards)
data = report.to_dict()
```

Remediation Categories

| Category | Examples |
| --- | --- |
| CONFIGURATION | Adjust timeouts, rate limits, retry policies |
| CODE_CHANGE | Fix bugs, add error handling, optimize prompts |
| INFRASTRUCTURE | Scale resources, add caching, improve monitoring |
| PROCESS | Add runbooks, update on-call procedures |
| EXTERNAL | Upstream API issues, provider outages |

Prompt Version Tracking

Track prompt versions, compare diffs, and correlate prompt changes with quality outcomes. Essential for prompt engineering workflows.

```python
from agentlens import PromptVersionTracker, Outcome

tracker = PromptVersionTracker()

# Register prompt versions
v1 = tracker.add_version(
    prompt_id="summarizer",
    version="1.0",
    template="Summarize the following text: {text}",
    metadata={"author": "alice", "model": "gpt-4"},
)

v2 = tracker.add_version(
    prompt_id="summarizer",
    version="1.1",
    template="Provide a concise summary (max 3 sentences) of: {text}",
    metadata={"author": "bob", "model": "gpt-4"},
)

# Record outcomes for each version
tracker.record_outcome("summarizer", "1.0", Outcome.SUCCESS, latency_ms=1200)
tracker.record_outcome("summarizer", "1.0", Outcome.SUCCESS, latency_ms=1100)
tracker.record_outcome("summarizer", "1.1", Outcome.SUCCESS, latency_ms=900)
tracker.record_outcome("summarizer", "1.1", Outcome.FAILURE, latency_ms=2500)

# Compare versions
diff = tracker.diff("summarizer", "1.0", "1.1")
print(f"Diff kind: {diff.kind.value}")  # MODIFIED, ADDED, REMOVED
print(f"Changes: {diff.changes}")

# Generate a report
report = tracker.report("summarizer")
print(report.render())
for stat in report.version_stats:
    print(f"  v{stat.version}: {stat.success_rate:.0%} success, "
          f"avg latency {stat.avg_latency_ms:.0f}ms")
```

Cost Optimization

Analyze token usage patterns and recommend cheaper models for simpler tasks. Includes complexity analysis to route requests to the right model tier.

```python
from agentlens import CostOptimizer, ComplexityAnalyzer, ComplexityLevel

# Analyze prompt complexity
analyzer = ComplexityAnalyzer()
assessment = analyzer.assess("What is the capital of France?")
print(f"Complexity: {assessment.level.value}")  # LOW, MEDIUM, HIGH, EXPERT

# Get optimization recommendations
optimizer = CostOptimizer()
report = optimizer.analyze(events)

for rec in report.recommendations:
    print(f"  {rec.description}")
    print(f"  Current model: {rec.current_model}")
    print(f"  Suggested: {rec.suggested_model}")
    print(f"  Savings: ${rec.estimated_savings:.2f}/month")
    print(f"  Confidence: {rec.confidence.value}")

# Migration plan
for step in report.migration_steps:
    print(f"Step {step.order}: {step.description}")

print(f"Total potential savings: ${report.total_savings:.2f}/month")
```

Model Tiers

| Tier | Use Case | Example Models |
| --- | --- | --- |
| ECONOMY | Simple lookups, classification | GPT-3.5-turbo, Claude Haiku |
| STANDARD | General tasks, summarization | GPT-4o-mini, Claude Sonnet |
| PREMIUM | Complex reasoning, code gen | GPT-4, Claude Opus |
| ENTERPRISE | Critical, high-stakes tasks | GPT-4-turbo, Claude 3.5 Sonnet |
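
Tier routing then reduces to mapping a complexity level onto the table above. The toy length/keyword heuristic below is purely illustrative — ComplexityAnalyzer's actual logic is not documented here:

```python
TIER_FOR_LEVEL = {"LOW": "ECONOMY", "MEDIUM": "STANDARD",
                  "HIGH": "PREMIUM", "EXPERT": "ENTERPRISE"}

def assess_level(prompt: str) -> str:
    """Toy complexity heuristic: keyword check, then prompt length."""
    words = len(prompt.split())
    if any(kw in prompt.lower() for kw in ("prove", "derive", "refactor")):
        return "EXPERT"
    if words > 200:
        return "HIGH"
    if words > 50:
        return "MEDIUM"
    return "LOW"

def route(prompt: str) -> str:
    """Pick a model tier for a prompt per the table above."""
    return TIER_FOR_LEVEL[assess_level(prompt)]
```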

Capacity Planning

Forecast resource needs and identify bottlenecks before they hit production. Analyze workload trends and generate scaling recommendations.

```python
from agentlens import CapacityPlanner, WorkloadSample, ResourceKind
from datetime import datetime

planner = CapacityPlanner()

# Feed workload samples
planner.add_samples([
    WorkloadSample(
        timestamp=datetime(2025, 1, d),
        requests=1000 * d,
        tokens=50000 * d,
        cost=2.5 * d,
        latency_p95_ms=800 + d * 10,
    )
    for d in range(1, 31)
])

# Generate capacity report
report = planner.report()

# Workload projections (7, 30, 90 day)
for proj in report.projections:
    print(f"  {proj.horizon_days}d: {proj.projected_requests:.0f} req/day, "
          f"${proj.projected_cost:.2f}/day")

# Bottleneck detection
for bottleneck in report.bottlenecks:
    print(f"  ⚠ {bottleneck.resource.value}: {bottleneck.severity.value}")
    print(f"    Current: {bottleneck.current_utilization:.0%}")
    print(f"    Projected peak: {bottleneck.projected_peak:.0%}")

# Scaling recommendations
for rec in report.scaling_recommendations:
    print(f"  {rec.action.value}: {rec.description}")
    print(f"  When: {rec.trigger_condition}")
```
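
Projections like these boil down to trend extrapolation. A minimal sketch using a least-squares linear fit over daily samples (the planner's actual forecasting model is not specified, so treat this as an assumption):

```python
def project(daily_values: list, horizon_days: int) -> float:
    """Fit a least-squares line to daily samples and extrapolate it
    horizon_days past the last sample."""
    n = len(daily_values)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + horizon_days)
```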

Building a Monitoring Pipeline

Combine modules into a comprehensive monitoring pipeline. The example below runs health checks, anomaly detection, SLA evaluation, and budget tracking together; send_to_slack and alert stand in for your own notification hooks, and baseline_events is your historical known-good data:

```python
from agentlens import (
    HealthScorer, AnomalyDetector, AnomalyDetectorConfig,
    SLAEvaluator, BudgetTracker, PostmortemGenerator,
    production_policy,
)

# ── Initialize monitors ──────────────────────────────────────
scorer = HealthScorer()
detector = AnomalyDetector(AnomalyDetectorConfig(z_score_threshold=2.5))
evaluator = SLAEvaluator()
tracker = BudgetTracker()
postmortem_gen = PostmortemGenerator()

sla_policy = production_policy()

# ── Train anomaly detector on baseline data ──────────────────
detector.train(baseline_events)

# ── Per-session monitoring loop ──────────────────────────────
def monitor_session(session, events):
    # 1. Health check
    health = scorer.score(events, session_id=session["id"])
    print(f"Health: {health.grade.value} ({health.overall_score}/100)")

    # 2. Anomaly scan
    anomalies = detector.analyze(events)
    if anomalies.has_anomalies:
        print(f"⚠ {anomalies.anomaly_count} anomalies detected")
        if anomalies.critical_count() > 0:
            # Auto-generate postmortem for critical anomalies
            report = postmortem_gen.generate(
                events=events,
                session_id=session["id"],
                title=f"Critical anomaly in {session['agent_name']}",
            )
            send_to_slack(report.render_markdown())

    # 3. SLA evaluation
    sla = evaluator.evaluate([session], sla_policy)
    for result in sla.results:
        if result.status.value == "BREACHED":
            alert(f"SLA breached: {result.objective.name}")

    # 4. Budget tracking
    tracker.record_for_session(
        session_id=session["id"],
        tokens_in=sum(e.get("tokens_in", 0) for e in events),
        tokens_out=sum(e.get("tokens_out", 0) for e in events),
        model=events[0].get("model", "unknown"),
    )
```

Tip: Start with HealthScorer for a quick overview, add AnomalyDetector when you have baseline data, and layer in SLAEvaluator and BudgetTracker as you define SLOs and cost targets. Use PostmortemGenerator to automate incident documentation.