Monitoring & Analysis

Health scoring, anomaly detection, SLA evaluation, budgets, latency profiling, postmortem generation, prompt version tracking, cost optimization, and capacity planning for production AI agents.

Overview: AgentLens provides nine monitoring modules that work together to give you production-grade observability over AI agent behavior. Each module operates on the same event and session data, so you can combine them into a comprehensive monitoring pipeline.

Health Scoring

Score agent sessions on 6 dimensions with letter grades (A–F) and actionable recommendations. Use this for at-a-glance session quality assessment.

```python
from agentlens import HealthScorer, HealthThresholds

# Default thresholds
scorer = HealthScorer()

# Score a list of events
report = scorer.score(events, session_id="session-123")
print(report.render())

# Score a Session object directly
report = scorer.score_session(session)

# Inspect results
print(f"Grade: {report.grade.value}")    # A, B, C, D, F
print(f"Score: {report.overall_score}")  # 0-100
for metric in report.metrics:
    print(f"  {metric.name}: {metric.score}/100 ({metric.grade.value})")

# Actionable recommendations
for rec in report.recommendations:
    print(f"  → {rec}")
```

Scored Dimensions

| Metric | What It Measures | Weight |
| --- | --- | --- |
| Error Rate | Fraction of events that are errors | High |
| Avg Latency | Mean response time across events | Medium |
| P95 Latency | 95th percentile latency | Medium |
| Tool Success | Ratio of successful tool calls | Medium |
| Token Efficiency | Tokens used relative to event count | Low |
| Event Volume | Whether event count is in a healthy range | Low |
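
As a rough sketch of how weighted dimensions could roll up into an overall score: the weight values below simply mirror the High/Medium/Low column as 3/2/1 and are illustrative assumptions, not HealthScorer's actual internals.

```python
# Illustrative weights mirroring the High/Medium/Low column above;
# the real HealthScorer weighting may differ.
WEIGHTS = {
    "error_rate": 3.0,        # High
    "avg_latency": 2.0,       # Medium
    "p95_latency": 2.0,       # Medium
    "tool_success": 2.0,      # Medium
    "token_efficiency": 1.0,  # Low
    "event_volume": 1.0,      # Low
}

def overall_score(metric_scores: dict) -> float:
    """Weighted average of per-metric scores, each on a 0-100 scale."""
    total = sum(WEIGHTS[name] for name in metric_scores)
    return sum(s * WEIGHTS[name] for name, s in metric_scores.items()) / total
```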

Custom Thresholds

```python
# Tighter thresholds for production
thresholds = HealthThresholds(
    error_rate_warn=0.05,        # Warn at 5% errors
    error_rate_critical=0.15,    # Critical at 15%
    latency_warn_ms=2000,        # Warn at 2s avg latency
    latency_critical_ms=8000,    # Critical at 8s
)
scorer = HealthScorer(thresholds=thresholds)
```

Grade Scale

| Grade | Score | Interpretation |
| --- | --- | --- |
| A | 90–100 | Excellent — agent is performing optimally |
| B | 80–89 | Good — minor areas for improvement |
| C | 70–79 | Acceptable — several metrics need attention |
| D | 60–69 | Poor — significant issues present |
| F | 0–59 | Failing — immediate action required |
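
The scale above maps directly to a threshold check; a minimal sketch:

```python
def grade(score: float) -> str:
    """Map a 0-100 health score to a letter grade per the scale above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```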

Anomaly Detection

Statistical anomaly detection using z-scores against learned baselines. Detects latency spikes, token anomalies, and error bursts in real time.

```python
from agentlens import AnomalyDetector, AnomalyDetectorConfig

config = AnomalyDetectorConfig(
    z_score_threshold=2.5,         # σ threshold for anomaly
    min_baseline_events=10,        # Min events before detection starts
    latency_weight=1.0,
    token_weight=1.0,
    error_weight=1.5,              # Weight errors more heavily
)

detector = AnomalyDetector(config=config)

# Train on historical (known-good) events
detector.train(historical_events)

# Analyze new events
report = detector.analyze(new_events)

print(f"Anomalies: {report.anomaly_count}")
print(f"Max severity: {report.max_severity}")
print(report.summary())

# Group anomalies
by_kind = report.by_kind()          # dict[AnomalyKind, list]
critical = report.critical_count()
warnings = report.warning_count()
```

Anomaly Kinds

| Kind | Description | Typical Cause |
| --- | --- | --- |
| LATENCY_SPIKE | Response time significantly above baseline | API throttling, model overload |
| TOKEN_SPIKE | Token usage significantly above baseline | Prompt injection, unexpected input |
| ERROR_BURST | Error rate significantly above baseline | API outage, bad deployment |
| LATENCY_DROP | Unusually fast responses | Short-circuiting, cached responses |
| TOKEN_DROP | Unusually low token usage | Truncated responses, model errors |

Severity Levels

| Severity | z-score Range | Action |
| --- | --- | --- |
| LOW | 2.0–2.5σ | Monitor, log for trending |
| MEDIUM | 2.5–3.0σ | Investigate within 1 hour |
| HIGH | 3.0–4.0σ | Investigate immediately |
| CRITICAL | > 4.0σ | Page on-call, potential incident |
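
Z-score detection itself is simple to sketch: measure how far a new value sits from the baseline mean in units of standard deviation, then bucket it per the table above. This is a standalone illustration, not AnomalyDetector's internals.

```python
from statistics import mean, stdev

def z_score(value: float, baseline: list) -> float:
    """Standard deviations between `value` and the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return 0.0 if sigma == 0 else (value - mu) / sigma

def severity(z: float):
    """Bucket |z| into the severity levels from the table above."""
    z = abs(z)
    if z > 4.0:
        return "CRITICAL"
    if z > 3.0:
        return "HIGH"
    if z > 2.5:
        return "MEDIUM"
    if z > 2.0:
        return "LOW"
    return None  # within normal variation
```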

SLA Evaluation

Evaluate agent sessions against Service Level Objectives (SLOs). Built-in production and development policies, or define your own.

```python
from agentlens import SLAEvaluator, SLObjective, SLAPolicy

policy = SLAPolicy(
    name="production-sla",
    objectives=[
        SLObjective.latency_p95(target_ms=3000, slo_percent=99.0),
        SLObjective.error_rate(target_rate=0.01, slo_percent=99.5),
        SLObjective.token_budget(target_per_session=10000, slo_percent=95.0),
        SLObjective.tool_success_rate(target_rate=0.95),
        SLObjective.throughput(min_events=5, slo_percent=95.0),
    ],
)

# Or use built-in policies
from agentlens import production_policy, development_policy
policy = production_policy()

evaluator = SLAEvaluator()
report = evaluator.evaluate(sessions, policy)
print(report.render())

for result in report.results:
    print(f"  {result.objective.name}: {result.compliance_rate:.1%} "
          f"(target: {result.objective.slo_percent}%) — {result.status.value}")
```

Objective Types

| Kind | Factory Method | What It Measures |
| --- | --- | --- |
| LATENCY_P95 | SLObjective.latency_p95() | 95th percentile latency per session |
| LATENCY_AVG | SLObjective.latency_avg() | Average latency per session |
| ERROR_RATE | SLObjective.error_rate() | Fraction of error events |
| TOKEN_BUDGET | SLObjective.token_budget() | Mean tokens per session |
| TOOL_SUCCESS | SLObjective.tool_success_rate() | Tool call success ratio |
| THROUGHPUT | SLObjective.throughput() | Minimum events per session |

Compliance Statuses

| Status | Meaning |
| --- | --- |
| MET | Objective fully satisfied |
| AT_RISK | Within 5% of threshold |
| BREACHED | Below target SLO percentage |
| NO_DATA | Insufficient data to evaluate |
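
A sketch of how a status could be derived from a measured compliance rate. The width of the AT_RISK band is an assumption (the table only says "within 5%"; here it is modeled as a configurable margin in percentage points above the target):

```python
def compliance_status(compliance_rate: float, slo_percent: float,
                      sample_count: int, margin_points: float = 0.5) -> str:
    """Classify an SLO result per the statuses above.

    margin_points is an assumed 'at risk' band above the target,
    in percentage points — not necessarily SLAEvaluator's definition.
    """
    if sample_count == 0:
        return "NO_DATA"
    target = slo_percent / 100.0
    if compliance_rate < target:
        return "BREACHED"
    if compliance_rate < target + margin_points / 100.0:
        return "AT_RISK"
    return "MET"
```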

Token Budgets

Enforce per-session or per-agent token and cost budgets with threshold callbacks. Prevents runaway costs from misbehaving agents.

```python
from agentlens import BudgetTracker, BudgetStatus

tracker = BudgetTracker()

# Create a budget with token and cost caps
budget = tracker.create_budget(
    budget_id="agent-session-1",
    max_tokens=50000,
    max_cost=2.50,
    warn_threshold=0.8,   # Fire callback at 80%
)

# Threshold notifications
def on_threshold(budget, status):
    print(f"Budget '{budget.budget_id}' → {status.value}")

tracker.on_threshold(on_threshold)

# Record token usage
tracker.record(
    budget_id="agent-session-1",
    tokens_in=1200,
    tokens_out=800,
    model="gpt-4",
)

# Check status
print(f"Used: {budget.utilization:.0%}")
print(f"Remaining: {budget.remaining_tokens}")
print(f"Status: {budget.status.value}")  # OK, WARNING, EXCEEDED

# Link budgets to sessions
tracker.record_for_session(
    session_id="sess-abc",
    tokens_in=500, tokens_out=200, model="gpt-4",
)
report = tracker.report_for_session("sess-abc")
```

Budget Statuses

| Status | Meaning | Behavior |
| --- | --- | --- |
| OK | Under warning threshold | Normal operation |
| WARNING | Approaching limit (≥ warn_threshold) | Callback fires |
| EXCEEDED | Over budget | Raises BudgetExceededError when enforce=True |
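
The transitions amount to a utilization check; a minimal sketch using the warn_threshold=0.8 default from the example above (illustrative, not BudgetTracker's code):

```python
def budget_status(used_tokens: int, max_tokens: int,
                  warn_threshold: float = 0.8) -> str:
    """Classify token utilization against the statuses above."""
    utilization = used_tokens / max_tokens
    if utilization >= 1.0:
        return "EXCEEDED"
    if utilization >= warn_threshold:
        return "WARNING"
    return "OK"
```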

Latency Profiling

Profile agent execution with step-by-step timing, percentile stats, and slow-step detection. Identifies performance bottlenecks in multi-step agent workflows.

```python
from agentlens import LatencyProfiler

profiler = LatencyProfiler()

# Start a profiling session
session = profiler.start_session("agent-workflow")

# Record steps (manually or from events)
session.record_step("prompt-build", duration_ms=45)
session.record_step("llm-call", duration_ms=2300)
session.record_step("tool-execution", duration_ms=150)
session.record_step("response-format", duration_ms=30)

# Get the session report
report = session.report()
print(f"Total duration: {report.total_duration_ms}ms")
print(f"Slowest step: {report.slowest_step.name} ({report.slowest_step.duration_ms}ms)")

# Percentile stats across multiple sessions
profiler.add_session(session)
stats = profiler.percentiles("llm-call")
print(f"P50: {stats.p50}ms, P95: {stats.p95}ms, P99: {stats.p99}ms")

# Slow step alerts
alerts = profiler.slow_step_alerts(threshold_ms=2000)
for alert in alerts:
    print(f"  ⚠ {alert.step_name}: {alert.duration_ms}ms (threshold: {alert.threshold_ms}ms)")
```
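
Percentile stats like p50/p95/p99 can be computed several ways; a minimal nearest-rank sketch (one common convention — the profiler's internal interpolation may differ):

```python
import math

def percentile(samples: list, p: float):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of the data is at or below it."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```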

Postmortem Generation

Automatically generate structured incident postmortem reports from session data. Identifies root causes, assesses impact, and suggests remediations.

```python
from agentlens import PostmortemGenerator, PostmortemConfig

config = PostmortemConfig(
    error_rate_threshold=0.1,       # Incident if > 10% errors
    latency_threshold_ms=5000,      # Incident if avg > 5s
    include_timeline=True,
    include_remediation=True,
)

generator = PostmortemGenerator(config=config)

# Generate from a session's events
report = generator.generate(
    events=session_events,
    session_id="sess-incident-42",
    title="Agent latency spike at 14:30 UTC",
)

# Inspect the report
print(f"Severity: {report.severity.value}")

for cause in report.root_causes:
    print(f"Root cause: {cause.description}")
    print(f"  Confidence: {cause.confidence}")
    print(f"  Evidence: {cause.evidence}")

for fix in report.remediations:
    print(f"Fix: {fix.description} ({fix.category.value})")
    print(f"  Priority: {fix.priority}")

# Impact assessment
print(f"Affected events: {report.impact.affected_events}")
print(f"Error rate: {report.impact.error_rate:.1%}")
print(f"Duration: {report.impact.duration_ms}ms")

# Lessons learned
for lesson in report.lessons_learned:
    print(f"Lesson: {lesson.description}")

# Export as Markdown
md = report.render_markdown()

# Export as dict (for storage / dashboards)
data = report.to_dict()
```

Remediation Categories

| Category | Examples |
| --- | --- |
| CONFIGURATION | Adjust timeouts, rate limits, retry policies |
| CODE_CHANGE | Fix bugs, add error handling, optimize prompts |
| INFRASTRUCTURE | Scale resources, add caching, improve monitoring |
| PROCESS | Add runbooks, update on-call procedures |
| EXTERNAL | Upstream API issues, provider outages |

Prompt Version Tracking

Track prompt versions, compare diffs, and correlate prompt changes with quality outcomes. Essential for prompt engineering workflows.

```python
from agentlens import PromptVersionTracker, Outcome

tracker = PromptVersionTracker()

# Register prompt versions
v1 = tracker.add_version(
    prompt_id="summarizer",
    version="1.0",
    template="Summarize the following text: {text}",
    metadata={"author": "alice", "model": "gpt-4"},
)

v2 = tracker.add_version(
    prompt_id="summarizer",
    version="1.1",
    template="Provide a concise summary (max 3 sentences) of: {text}",
    metadata={"author": "bob", "model": "gpt-4"},
)

# Record outcomes for each version
tracker.record_outcome("summarizer", "1.0", Outcome.SUCCESS, latency_ms=1200)
tracker.record_outcome("summarizer", "1.0", Outcome.SUCCESS, latency_ms=1100)
tracker.record_outcome("summarizer", "1.1", Outcome.SUCCESS, latency_ms=900)
tracker.record_outcome("summarizer", "1.1", Outcome.FAILURE, latency_ms=2500)

# Compare versions
diff = tracker.diff("summarizer", "1.0", "1.1")
print(f"Diff kind: {diff.kind.value}")  # MODIFIED, ADDED, REMOVED
print(f"Changes: {diff.changes}")

# Generate a report
report = tracker.report("summarizer")
print(report.render())
for stat in report.version_stats:
    print(f"  v{stat.version}: {stat.success_rate:.0%} success, "
          f"avg latency {stat.avg_latency_ms:.0f}ms")
```

Cost Optimization

Analyze token usage patterns and recommend cheaper models for simpler tasks. Includes complexity analysis to route requests to the right model tier.

```python
from agentlens import CostOptimizer, ComplexityAnalyzer, ComplexityLevel

# Analyze prompt complexity
analyzer = ComplexityAnalyzer()
assessment = analyzer.assess("What is the capital of France?")
print(f"Complexity: {assessment.level.value}")  # LOW, MEDIUM, HIGH, EXPERT

# Get optimization recommendations
optimizer = CostOptimizer()
report = optimizer.analyze(events)

for rec in report.recommendations:
    print(f"  {rec.description}")
    print(f"  Current model: {rec.current_model}")
    print(f"  Suggested: {rec.suggested_model}")
    print(f"  Savings: ${rec.estimated_savings:.2f}/month")
    print(f"  Confidence: {rec.confidence.value}")

# Migration plan
for step in report.migration_steps:
    print(f"Step {step.order}: {step.description}")

print(f"Total potential savings: ${report.total_savings:.2f}/month")
```

Model Tiers

| Tier | Use Case | Example Models |
| --- | --- | --- |
| ECONOMY | Simple lookups, classification | GPT-3.5-turbo, Claude Haiku |
| STANDARD | General tasks, summarization | GPT-4o-mini, Claude Sonnet |
| PREMIUM | Complex reasoning, code gen | GPT-4, Claude Opus |
| ENTERPRISE | Critical, high-stakes tasks | GPT-4-turbo, Claude 3.5 Sonnet |
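
Tier routing then reduces to mapping a complexity level onto the table above. The toy length/keyword heuristic below is purely illustrative — ComplexityAnalyzer's actual logic is not documented here:

```python
TIER_FOR_LEVEL = {"LOW": "ECONOMY", "MEDIUM": "STANDARD",
                  "HIGH": "PREMIUM", "EXPERT": "ENTERPRISE"}

def assess_level(prompt: str) -> str:
    """Toy complexity heuristic: keyword check, then prompt length."""
    words = len(prompt.split())
    if any(kw in prompt.lower() for kw in ("prove", "derive", "refactor")):
        return "EXPERT"
    if words > 200:
        return "HIGH"
    if words > 50:
        return "MEDIUM"
    return "LOW"

def route(prompt: str) -> str:
    """Pick a model tier for a prompt per the table above."""
    return TIER_FOR_LEVEL[assess_level(prompt)]
```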

Capacity Planning

Forecast resource needs and identify bottlenecks before they hit production. Analyze workload trends and generate scaling recommendations.

```python
from agentlens import CapacityPlanner, WorkloadSample, ResourceKind
from datetime import datetime

planner = CapacityPlanner()

# Feed workload samples
planner.add_samples([
    WorkloadSample(
        timestamp=datetime(2025, 1, d),
        requests=1000 * d,
        tokens=50000 * d,
        cost=2.5 * d,
        latency_p95_ms=800 + d * 10,
    )
    for d in range(1, 31)
])

# Generate capacity report
report = planner.report()

# Workload projections (7, 30, 90 day)
for proj in report.projections:
    print(f"  {proj.horizon_days}d: {proj.projected_requests:.0f} req/day, "
          f"${proj.projected_cost:.2f}/day")

# Bottleneck detection
for bottleneck in report.bottlenecks:
    print(f"  ⚠ {bottleneck.resource.value}: {bottleneck.severity.value}")
    print(f"    Current: {bottleneck.current_utilization:.0%}")
    print(f"    Projected peak: {bottleneck.projected_peak:.0%}")

# Scaling recommendations
for rec in report.scaling_recommendations:
    print(f"  {rec.action.value}: {rec.description}")
    print(f"  When: {rec.trigger_condition}")
```
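
Projections like these boil down to trend extrapolation. A minimal sketch using a least-squares linear fit over daily samples (the planner's actual forecasting model is not specified, so treat this as an assumption):

```python
def project(daily_values: list, horizon_days: int) -> float:
    """Fit a least-squares line to daily samples and extrapolate it
    horizon_days past the last sample."""
    n = len(daily_values)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + horizon_days)
```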

Building a Monitoring Pipeline

Combine modules into a comprehensive monitoring pipeline. The example below runs health checks, anomaly detection, SLA evaluation, and budget tracking together; send_to_slack and alert stand in for your own notification hooks, and baseline_events is your historical known-good data:

```python
from agentlens import (
    HealthScorer, AnomalyDetector, AnomalyDetectorConfig,
    SLAEvaluator, BudgetTracker, PostmortemGenerator,
    production_policy,
)

# ── Initialize monitors ──────────────────────────────────────
scorer = HealthScorer()
detector = AnomalyDetector(AnomalyDetectorConfig(z_score_threshold=2.5))
evaluator = SLAEvaluator()
tracker = BudgetTracker()
postmortem_gen = PostmortemGenerator()

sla_policy = production_policy()

# ── Train anomaly detector on baseline data ──────────────────
detector.train(baseline_events)

# ── Per-session monitoring loop ──────────────────────────────
def monitor_session(session, events):
    # 1. Health check
    health = scorer.score(events, session_id=session["id"])
    print(f"Health: {health.grade.value} ({health.overall_score}/100)")

    # 2. Anomaly scan
    anomalies = detector.analyze(events)
    if anomalies.has_anomalies:
        print(f"⚠ {anomalies.anomaly_count} anomalies detected")
        if anomalies.critical_count() > 0:
            # Auto-generate postmortem for critical anomalies
            report = postmortem_gen.generate(
                events=events,
                session_id=session["id"],
                title=f"Critical anomaly in {session['agent_name']}",
            )
            send_to_slack(report.render_markdown())

    # 3. SLA evaluation
    sla = evaluator.evaluate([session], sla_policy)
    for result in sla.results:
        if result.status.value == "BREACHED":
            alert(f"SLA breached: {result.objective.name}")

    # 4. Budget tracking
    tracker.record_for_session(
        session_id=session["id"],
        tokens_in=sum(e.get("tokens_in", 0) for e in events),
        tokens_out=sum(e.get("tokens_out", 0) for e in events),
        model=events[0].get("model", "unknown"),
    )
```

Tip: Start with HealthScorer for a quick overview, add AnomalyDetector when you have baseline data, and layer in SLAEvaluator and BudgetTracker as you define SLOs and cost targets. Use PostmortemGenerator to automate incident documentation.