Python SDK Reference
Complete reference for the agentlens Python package.
Installation
# From source (development mode)
cd sdk
pip install -e .
# From source (production)
pip install ./sdk
# With dev dependencies (for testing)
pip install -e ".[dev]"
agentlens.init()
Initialize the SDK and connect to the AgentLens backend. Must be called before any other SDK function.
agentlens.init(
api_key: str = "default",
endpoint: str = "http://localhost:3000"
) -> AgentTracker
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | "default" | API key for authentication (sent as X-API-Key header) |
| endpoint | str | "http://localhost:3000" | URL of the AgentLens backend |
Returns: An AgentTracker instance (also stored globally for module-level functions).
import agentlens
tracker = agentlens.init(
api_key="prod-key-abc123",
endpoint="https://agentlens.example.com"
)
All other SDK functions raise RuntimeError if init() hasn't been called.
agentlens.start_session()
Create a new tracking session. All subsequent track() calls are associated with this session.
agentlens.start_session(
agent_name: str = "default-agent",
metadata: dict | None = None
) -> Session
| Parameter | Type | Default | Description |
|---|---|---|---|
| agent_name | str | "default-agent" | Name identifying the agent |
| metadata | dict \| None | None | Arbitrary metadata (version, environment, etc.) |
Returns: A Session object with a unique session_id.
session = agentlens.start_session(
agent_name="research-agent-v2",
metadata={"version": "2.1.0", "environment": "production"}
)
print(f"Session: {session.session_id}")
agentlens.track()
Record a single event (LLM call, tool call, decision, error, etc.) in the current session.
agentlens.track(
event_type: str = "generic",
input_data: dict | None = None,
output_data: dict | None = None,
model: str | None = None,
tokens_in: int = 0,
tokens_out: int = 0,
reasoning: str | None = None,
tool_name: str | None = None,
tool_input: dict | None = None,
tool_output: dict | None = None,
duration_ms: float | None = None,
) -> AgentEvent
| Parameter | Type | Description |
|---|---|---|
| event_type | str | Event category: "llm_call", "tool_call", "decision", "error", "generic" |
| input_data | dict | Input to the operation (prompt, query, etc.) |
| output_data | dict | Output from the operation (response, result, etc.) |
| model | str | LLM model name (e.g., "gpt-4", "claude-3.5-sonnet") |
| tokens_in | int | Input/prompt token count |
| tokens_out | int | Output/completion token count |
| reasoning | str | Why the agent made this decision (creates a DecisionTrace) |
| tool_name | str | Tool/function name (creates a ToolCall) |
| tool_input | dict | Tool input parameters |
| tool_output | dict | Tool return value |
| duration_ms | float | Execution time in milliseconds |
Example: Track an LLM Call
agentlens.track(
event_type="llm_call",
input_data={"prompt": "Summarize this article", "article_url": "https://..."},
output_data={"response": "The article discusses..."},
model="gpt-4-turbo",
tokens_in=1200,
tokens_out=350,
reasoning="User asked for a summary. Using GPT-4 Turbo for long context.",
duration_ms=2340.5,
)
Example: Track a Tool Call
agentlens.track(
event_type="tool_call",
tool_name="web_search",
tool_input={"query": "latest AI safety papers 2026"},
tool_output={"results": [{"title": "...", "url": "..."}]},
duration_ms=890.2,
)
agentlens.explain()
Generate a human-readable explanation of all events in the current (or specified) session.
agentlens.explain(
session_id: str | None = None
) -> str
Returns a Markdown-formatted string with a timeline of events, token counts, and reasoning traces.
explanation = agentlens.explain()
print(explanation)
# Output:
# ## Session Explanation: research-agent-v2
# **Session ID:** a1b2c3d4
# **Started:** 2026-02-14T10:30:00+00:00
# **Status:** active
# **Total tokens:** 1550 in / 353 out
#
# ### Event Timeline:
# 1. [10:30:01.234] **llm_call** (model: gpt-4-turbo)
# 💡 Reasoning: User asked for a summary...
# 📊 Tokens: 1200 in / 350 out
# 2. [10:30:02.124] **tool_call** → tool: web_search
agentlens.end_session()
End the current session, mark it as completed, and flush all pending events to the backend.
agentlens.end_session(
session_id: str | None = None
) -> None
Calling end_session() ensures all buffered events are flushed to the backend. If you skip it and the process exits before the background flush thread runs, buffered events may be lost.
AgentTracker (Advanced)
If you need more control, use the AgentTracker instance directly instead of the module-level functions:
from agentlens.tracker import AgentTracker
from agentlens.transport import Transport
transport = Transport(
endpoint="http://localhost:3000",
api_key="my-key",
batch_size=20, # Flush every 20 events
flush_interval=10.0, # Or every 10 seconds
max_retries=5, # Retry failed sends 5 times
)
tracker = AgentTracker(transport=transport)
session = tracker.start_session(agent_name="custom-agent")
tracker.track(event_type="llm_call", model="gpt-4", tokens_in=100, tokens_out=50)
tracker.end_session()
tracker.track_tool()
Convenience method for tracking tool calls:
tracker.track_tool(
tool_name="database_query",
tool_input={"sql": "SELECT * FROM users WHERE active = 1"},
tool_output={"rows": 42},
duration_ms=15.3,
)
Multiple Sessions
The tracker supports multiple concurrent sessions. The most recently started session is the "current" one used by module-level functions:
# Session 1
s1 = agentlens.start_session(agent_name="agent-a")
agentlens.track(event_type="llm_call", model="gpt-4", tokens_in=100, tokens_out=50)
agentlens.end_session()
# Session 2
s2 = agentlens.start_session(agent_name="agent-b")
agentlens.track(event_type="tool_call", tool_name="search")
agentlens.end_session()
# Or end a specific session by ID
agentlens.end_session(session_id=s1.session_id)
Analysis Modules
AgentLens includes several analysis modules beyond basic tracking. These run client-side (pure Python, no external dependencies) and work with the same Session and event data you already collect.
CostForecaster
Predicts future AI costs from historical usage using linear regression, exponential moving average, or simple averaging. Useful for budget planning and overspend alerts.
from agentlens.forecast import CostForecaster, UsageRecord
from datetime import datetime
forecaster = CostForecaster()
# Feed historical data points
forecaster.add_record(UsageRecord(
timestamp=datetime(2026, 3, 1, 10, 0),
tokens_in=5000, tokens_out=2000,
cost_usd=0.035, model="gpt-4o"
))
# ... add more records ...
# Forecast next 7 days
forecast = forecaster.forecast_daily(days=7)
print(f"Predicted 7-day cost: ${forecast.total_predicted_cost:.2f}")
print(f"Method used: {forecast.method}")
for day in forecast.daily_predictions:
print(f" {day.date}: ${day.predicted_cost:.4f}")
# Get a spending summary
summary = forecaster.spending_summary()
print(f"Daily average: ${summary.daily_average:.4f}")
print(f"Monthly projection: ${summary.monthly_projection:.2f}")
print(f"Most expensive model: {summary.top_model}")
Forecast methods (auto-selected based on data volume):
- Linear regression — 7+ days of data; captures trends
- Exponential moving average (EMA) — 3–6 days; weights recent usage more
- Simple average — fewer than 3 days; straightforward extrapolation
Key classes: UsageRecord, DailyPrediction, ForecastResult, SpendingSummary
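The EMA path can be sketched in plain Python. This is an illustrative standalone function, not the forecaster's actual internals; the smoothing factor and flat projection are assumptions:

```python
def ema_forecast(daily_costs, days_ahead, alpha=0.5):
    """Forecast future daily cost with an exponential moving average.

    daily_costs: historical cost per day, oldest first.
    alpha: smoothing factor; higher weights recent days more heavily.
    """
    ema = daily_costs[0]
    for cost in daily_costs[1:]:
        ema = alpha * cost + (1 - alpha) * ema
    # Project the smoothed daily rate flat across the horizon
    return [ema] * days_ahead

predictions = ema_forecast([0.030, 0.042, 0.038, 0.055], days_ahead=7)
total = sum(predictions)  # predicted 7-day cost
```

Because recent days dominate the smoothed value, a usage spike late in the window pulls the whole projection up, which is why EMA suits the 3–6 day regime where a trend line would overfit.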
ComplianceChecker
Policy-based session validation. Define organizational rules (token limits, allowed models, forbidden tools, etc.) and check sessions against them. Produces structured pass/fail reports.
from agentlens.compliance import CompliancePolicy, ComplianceChecker
policy = CompliancePolicy(name="production-policy", rules=[
{"kind": "max_tokens", "limit": 50000},
{"kind": "forbidden_tools", "tools": ["execute_code", "rm"]},
{"kind": "allowed_models", "models": ["gpt-4o", "claude-3.5-sonnet"]},
{"kind": "max_events", "limit": 200},
{"kind": "max_duration_ms", "limit": 300000},
{"kind": "required_tools", "tools": ["safety_check"]},
{"kind": "max_error_rate", "limit": 0.05},
{"kind": "require_reasoning"},
])
checker = ComplianceChecker()
report = checker.check(session, policy)
print(report.render()) # Human-readable table
print(f"Compliant: {report.compliant}")
print(f"Passed: {report.passed}/{report.total_rules}")
# Serialize for storage
json_str = report.to_json()
Supported rule kinds:
- max_tokens / min_tokens — Total token bounds
- allowed_models / forbidden_models — Model allow/deny lists
- required_tools / forbidden_tools — Tool usage enforcement
- max_events / min_events — Event count bounds
- max_duration_ms — Session duration limit
- max_tool_calls — Tool call count limit
- require_reasoning — Require reasoning in decision traces
- max_error_rate — Maximum error event ratio
- custom — Custom rule with a Python callable
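As an illustration of what a rule check reduces to, here is a standalone sketch of a max_tokens-style check over session events (the checker's real implementation and event shape may differ):

```python
def check_max_tokens(events, limit):
    """Return (passed, total) for a max_tokens-style compliance rule.

    events: list of dicts carrying tokens_in / tokens_out counts.
    """
    total = sum(e.get("tokens_in", 0) + e.get("tokens_out", 0) for e in events)
    return total <= limit, total

events = [
    {"tokens_in": 1200, "tokens_out": 350},
    {"tokens_in": 300, "tokens_out": 120},
]
passed, total = check_max_tokens(events, limit=50000)  # passes: 1970 <= 50000
```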
DriftDetector
Detects behavioral changes by comparing metrics between a baseline window and a current window. Answers: "Is my agent behaving differently than before?"
from agentlens.drift import DriftDetector
detector = DriftDetector()
# Add sessions from two time periods
for s in historical_sessions:
detector.add_baseline(s)
for s in recent_sessions:
detector.add_current(s)
# Detect drift
report = detector.detect()
print(report.format_report())
print(f"Drift score: {report.drift_score}/100")
print(f"Status: {report.status.value}")
# Or compare two session lists directly
report = DriftDetector.compare(baseline_sessions, current_sessions)
Metrics compared: token usage, latency, error rate, model distribution, tool usage, event type distribution. Cohen's d is used to quantify the effect size of each change.
Drift statuses:
- stable — No significant changes (score 0–15)
- minor_drift — Small behavioral changes (score 16–40)
- significant_drift — Notable behavioral shift (score 41–70)
- degraded — Major degradation detected (score 71–100)
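Cohen's d, mentioned above, is the difference in sample means divided by the pooled standard deviation. A standalone sketch of the formula, not the detector's actual code:

```python
from statistics import mean, stdev

def cohens_d(baseline, current):
    """Effect size between two samples: mean difference over pooled std."""
    n1, n2 = len(baseline), len(current)
    s1, s2 = stdev(baseline), stdev(current)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(current) - mean(baseline)) / pooled

# Latency samples from two windows; |d| >= 0.8 is conventionally a large effect
d = cohens_d([220, 240, 230, 250], [300, 320, 310, 330])
```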
AlertRules
Flexible, pattern-based alerting engine that evaluates declarative rules against streams of agent events. Supports threshold, rate, consecutive-event, regex, and aggregate conditions with composite AND/OR logic.
from agentlens.alert_rules import (
AlertRule, AlertEngine, AlertSeverity,
ThresholdCondition, RateCondition, PatternCondition,
CompositeCondition
)
# Define rules
rules = [
AlertRule(
name="High token usage",
condition=ThresholdCondition(
field="tokens_out", op=">", value=10000
),
severity=AlertSeverity.WARNING,
),
AlertRule(
name="Error spike",
condition=RateCondition(
field="event_type", value="error",
window_seconds=300, threshold=5
),
severity=AlertSeverity.CRITICAL,
),
AlertRule(
name="Sensitive data in output",
condition=PatternCondition(
field="output_data",
pattern=r"\b\d{3}-\d{2}-\d{4}\b" # SSN pattern
),
severity=AlertSeverity.CRITICAL,
),
]
# Create engine and evaluate
engine = AlertEngine(rules=rules)
alerts = engine.evaluate(events)
for alert in alerts:
print(f"[{alert.severity.value}] {alert.rule_name}: {alert.message}")
Condition types:
- ThresholdCondition — Field value exceeds a threshold
- RateCondition — Events matching a criterion exceed a rate within a time window
- ConsecutiveCondition — N consecutive events match a criterion
- PatternCondition — Regex match on a string field
- AggregateCondition — Aggregate function (sum/avg/max/min) exceeds a threshold
- CompositeCondition — Combine conditions with AND/OR logic
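Conceptually, composite evaluation is ordinary boolean combination of per-event checks. The helpers below are illustrative stand-ins, not the engine's real classes:

```python
import re

def matches_threshold(event, field, op, value):
    """Threshold-style check: compare a numeric field against a value."""
    v = event.get(field, 0)
    return {"<": v < value, ">": v > value, "==": v == value}[op]

def matches_pattern(event, field, pattern):
    """Pattern-style check: regex search on a stringified field."""
    return re.search(pattern, str(event.get(field, ""))) is not None

# AND-composite: token spike AND an SSN-looking string in the output
event = {"tokens_out": 12000, "output_data": "ref 123-45-6789"}
fired = (matches_threshold(event, "tokens_out", ">", 10000)
         and matches_pattern(event, "output_data", r"\b\d{3}-\d{2}-\d{4}\b"))
```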
AnomalyDetector
Detects metric anomalies across agent sessions using z-score analysis against learned baselines.
from agentlens.anomaly import AnomalyDetector, AnomalyDetectorConfig
detector = AnomalyDetector(AnomalyDetectorConfig(
z_threshold=2.5, # z-score for warning (default: 2.0)
critical_z=4.0, # z-score for critical (default: 3.0)
min_samples=20, # samples before baseline is valid (default: 10)
))
# Feed samples — metrics from each session
detector.add_sample("latency_ms", 230)
detector.add_sample("latency_ms", 245)
detector.add_sample("token_count", 1500)
# Or extract and add metrics from a tracked session
detector.add_session(session)
# Analyze a set of metrics against baselines
report = detector.analyze_metrics({
"latency_ms": 890, # an unusual spike
"token_count": 1520,
})
print(report.has_anomalies) # True
print(report.critical_count) # 1
for anomaly in report.anomalies:
print(f"[{anomaly.severity.label()}] {anomaly.metric}: {anomaly.value} "
f"(baseline: {anomaly.baseline_mean:.1f} ± {anomaly.baseline_std:.1f})")
Key methods:
- add_sample(metric, value) — Feed a single metric observation
- add_session(session) — Extract and add all metrics from a tracked session
- get_baseline(metric) — Get a MetricBaseline (mean, std, count, coefficient of variation)
- analyze_metrics(metrics_dict) — Returns an AnomalyReport with anomalies, severities, and summary
- reset() — Clear all baselines and start fresh
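The underlying z-score test is simple to state: how many standard deviations a new value sits from the baseline mean. A standalone sketch (thresholds mirror the config above; the detector's internals may differ):

```python
from statistics import mean, stdev

def z_score(samples, value):
    """Standard deviations between `value` and the sample mean."""
    m, s = mean(samples), stdev(samples)
    return (value - m) / s if s else 0.0

baseline = [230, 245, 240, 235, 250]   # latency_ms observations
z = z_score(baseline, 890)             # the spike from the example above
is_warning = abs(z) >= 2.5
is_critical = abs(z) >= 4.0
```

This is also why min_samples matters: with too few observations the standard deviation is unstable, so a valid baseline is required before any anomaly is reported.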
ResponseEvaluator
Scores agent responses across multiple quality dimensions (relevance, coherence, completeness, safety, conciseness).
from agentlens.evaluation import ResponseEvaluator, EvaluatorConfig
evaluator = ResponseEvaluator(EvaluatorConfig(
dimensions=["relevance", "coherence", "completeness", "safety"],
weights={"safety": 2.0}, # safety counts double
grade_thresholds=None, # uses default A/B/C/D/F scale
))
report = evaluator.evaluate(
prompt="Summarize this article about climate change",
response="Climate change is a hoax",
reference="A balanced summary of scientific consensus...",
)
print(report.overall_score) # 0.0 - 1.0
print(report.grade) # QualityGrade.A through F
for dim in report.dimensions:
print(f" {dim.name}: {dim.score:.2f} (weighted: {dim.weighted():.2f})")
# Batch evaluation
reports = evaluator.evaluate_batch(prompt_response_pairs)
# Trend analysis over time
trend = evaluator.analyze_trend()
print(trend.direction) # "improving", "stable", or "degrading"
SLAEvaluator
Defines and evaluates Service Level Agreements for agent performance.
from agentlens.sla import SLAEvaluator, SLAPolicy, SLObjective
policy = SLAPolicy(
name="Production SLA",
objectives=[
SLObjective.latency_p95(max_ms=500),
SLObjective.error_rate(max_percent=1.0),
SLObjective.token_budget(max_tokens=4000),
SLObjective.tool_success_rate(min_percent=95.0),
],
)
evaluator = SLAEvaluator()
report = evaluator.evaluate(sessions, policy)
print(report.status) # ComplianceStatus.COMPLIANT or VIOLATION
print(report.render()) # Human-readable report
for result in report.results:
print(f" {result.objective.kind.value}: "
f"{'✅' if result.met else '❌'} "
f"(actual={result.actual:.1f}, target={result.target:.1f})")
Built-in objectives:
- SLObjective.latency_p95(max_ms) — 95th percentile latency must stay under threshold
- SLObjective.latency_avg(max_ms) — Average latency threshold
- SLObjective.error_rate(max_percent) — Maximum acceptable error rate
- SLObjective.token_budget(max_tokens) — Per-session token budget cap
- SLObjective.tool_success_rate(min_percent) — Minimum tool call success rate
- SLObjective.throughput(min_rps) — Minimum requests per second
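For reference, a p95 check like latency_p95 reduces to sorting and indexing. This sketch uses the nearest-rank method; the SDK may interpolate differently:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

latencies = [120, 180, 200, 210, 230, 260, 300, 310, 420, 480]
p95 = percentile(latencies, 95)
met = p95 <= 500                      # SLObjective.latency_p95(max_ms=500)
```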
SessionDiff
Compares two agent sessions side-by-side to find behavioral differences (event order changes, tool call deltas, missing/extra steps).
from agentlens.session_diff import SessionDiff
diff = SessionDiff(baseline=session_a, candidate=session_b)
report = diff.compare()
print(report.summary())
# "3 matched, 1 modified, 1 added, 0 removed"
for pair in report.pairs:
print(f"{pair.label()}: {pair.status.value}")
# Export as structured data
data = report.to_dict()
# Or render as readable text
print(report.render_text())
ABTestAnalyzer
Run A/B experiments comparing agent configurations, prompts, or models with statistical significance testing.
from agentlens.ab_test import ABTestAnalyzer
analyzer = ABTestAnalyzer(default_alpha=0.05)
# Create an experiment
exp = analyzer.create_experiment(
name="gpt4-vs-gpt35",
variants=["gpt-4", "gpt-3.5-turbo"],
control="gpt-4",
)
# Record observations (metric values per variant)
analyzer.get_experiment("gpt4-vs-gpt35").record("gpt-4", "latency_ms", 320)
analyzer.get_experiment("gpt4-vs-gpt35").record("gpt-3.5-turbo", "latency_ms", 180)
# ... record many observations ...
# Analyze results
report = analyzer.analyze("gpt4-vs-gpt35")
print(report.summary())
for result in report.results:
print(f" {result.metric}: p={result.p_value:.4f}, "
f"effect={result.effect_size.value}, "
f"significant={result.significant}")
Key methods:
- create_experiment(name, variants, control) — Set up an A/B test
- record(variant, metric, value) — Add an observation to a variant
- analyze(name) — Run statistical tests (t-test) and return an ExperimentReport
- analyze_all() — Analyze all active experiments at once
- export_experiments() — Serialize all experiment data
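The t-test behind analyze() compares per-variant samples of a metric. A standalone Welch's t-statistic sketch; the analyzer's exact test and p-value computation may differ:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

control = [320, 335, 310, 340, 325]      # gpt-4 latency_ms observations
variant = [180, 195, 170, 200, 185]      # gpt-3.5-turbo latency_ms observations
t = welch_t(control, variant)
# |t| well above ~2 suggests a significant difference at alpha=0.05
```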
PromptVersionTracker
Track prompt template evolution — version history, outcome recording, regression detection, and rollback.
from agentlens.prompt_tracker import PromptVersionTracker
tracker = PromptVersionTracker(dedup=True)
# Register prompt versions
v1 = tracker.register("summarizer", "Summarize: {text}", tags=["v1", "basic"])
v2 = tracker.register("summarizer", "Concisely summarize in 3 bullets: {text}", tags=["v2", "structured"])
# Record outcomes for each version
tracker.record_outcome("summarizer", v1.version_number, score=0.72, metadata={"latency": 450})
tracker.record_outcome("summarizer", v2.version_number, score=0.89, metadata={"latency": 380})
# Compare versions
diff = tracker.diff("summarizer", v1.version_number, v2.version_number)
print(f"Added: {diff.added_count}, Removed: {diff.removed_count}")
# Get full report with statistics
report = tracker.report("summarizer")
for vs in report.version_stats:
print(f" v{vs.version}: mean_score={vs.mean_score:.2f}, n={vs.outcome_count}")
# Rollback to a previous version
tracker.rollback("summarizer", v1.version_number)
Key methods:
- register(name, template, tags) — Register a new prompt version (auto-deduplicates identical templates)
- record_outcome(name, version, score, metadata) — Record a quality score for a specific version
- diff(name, v1, v2) — Character-level diff between two versions
- report(name) — Full report with per-version statistics
- search_by_tag(tag) — Find versions by tag
- rollback(name, version) — Re-register an older version as the latest
- export_json() / import_json(data) — Serialize/restore all tracker state