Python SDK Reference

Complete reference for the agentlens Python package.

Installation

# From source (development mode)
cd sdk
pip install -e .

# From source (production)
pip install ./sdk

# With dev dependencies (for testing)
pip install -e ".[dev]"

agentlens.init()

Initialize the SDK and connect to the AgentLens backend. Must be called before any other SDK function.

agentlens.init(
    api_key: str = "default",
    endpoint: str = "http://localhost:3000"
) -> AgentTracker
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `api_key` | `str` | `"default"` | API key for authentication (sent as the `X-API-Key` header) |
| `endpoint` | `str` | `"http://localhost:3000"` | URL of the AgentLens backend |

Returns: An AgentTracker instance (also stored globally for module-level functions).

import agentlens

tracker = agentlens.init(
    api_key="prod-key-abc123",
    endpoint="https://agentlens.example.com"
)
⚠️ Call init() first

All other SDK functions raise RuntimeError if init() hasn't been called.

agentlens.start_session()

Create a new tracking session. All subsequent track() calls are associated with this session.

agentlens.start_session(
    agent_name: str = "default-agent",
    metadata: dict | None = None
) -> Session
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `agent_name` | `str` | `"default-agent"` | Name identifying the agent |
| `metadata` | `dict \| None` | `None` | Arbitrary metadata (version, environment, etc.) |

Returns: A Session object with a unique session_id.

session = agentlens.start_session(
    agent_name="research-agent-v2",
    metadata={"version": "2.1.0", "environment": "production"}
)
print(f"Session: {session.session_id}")

agentlens.track()

Record a single event (LLM call, tool call, decision, error, etc.) in the current session.

agentlens.track(
    event_type: str = "generic",
    input_data: dict | None = None,
    output_data: dict | None = None,
    model: str | None = None,
    tokens_in: int = 0,
    tokens_out: int = 0,
    reasoning: str | None = None,
    tool_name: str | None = None,
    tool_input: dict | None = None,
    tool_output: dict | None = None,
    duration_ms: float | None = None,
) -> AgentEvent
| Parameter | Type | Description |
| --- | --- | --- |
| `event_type` | `str` | Event category: `"llm_call"`, `"tool_call"`, `"decision"`, `"error"`, `"generic"` |
| `input_data` | `dict` | Input to the operation (prompt, query, etc.) |
| `output_data` | `dict` | Output from the operation (response, result, etc.) |
| `model` | `str` | LLM model name (e.g., `"gpt-4"`, `"claude-3.5-sonnet"`) |
| `tokens_in` | `int` | Input/prompt token count |
| `tokens_out` | `int` | Output/completion token count |
| `reasoning` | `str` | Why the agent made this decision (creates a DecisionTrace) |
| `tool_name` | `str` | Tool/function name (creates a ToolCall) |
| `tool_input` | `dict` | Tool input parameters |
| `tool_output` | `dict` | Tool return value |
| `duration_ms` | `float` | Execution time in milliseconds |

Example: Track an LLM Call

agentlens.track(
    event_type="llm_call",
    input_data={"prompt": "Summarize this article", "article_url": "https://..."},
    output_data={"response": "The article discusses..."},
    model="gpt-4-turbo",
    tokens_in=1200,
    tokens_out=350,
    reasoning="User asked for a summary. Using GPT-4 Turbo for long context.",
    duration_ms=2340.5,
)

Example: Track a Tool Call

agentlens.track(
    event_type="tool_call",
    tool_name="web_search",
    tool_input={"query": "latest AI safety papers 2026"},
    tool_output={"results": [{"title": "...", "url": "..."}]},
    duration_ms=890.2,
)

agentlens.explain()

Generate a human-readable explanation of all events in the current (or specified) session.

agentlens.explain(
    session_id: str | None = None
) -> str

Returns a Markdown-formatted string with a timeline of events, token counts, and reasoning traces.

explanation = agentlens.explain()
print(explanation)

# Output:
# ## Session Explanation: research-agent-v2
# **Session ID:** a1b2c3d4
# **Started:** 2026-02-14T10:30:00+00:00
# **Status:** active
# **Total tokens:** 1550 in / 353 out
#
# ### Event Timeline:
# 1. [10:30:01.234] **llm_call** (model: gpt-4-turbo)
#    💡 Reasoning: User asked for a summary...
#    📊 Tokens: 1200 in / 350 out
# 2. [10:30:02.124] **tool_call** → tool: web_search

agentlens.end_session()

End the current session, mark it as completed, and flush all pending events to the backend.

agentlens.end_session(
    session_id: str | None = None
) -> None
💡 Always end sessions

Calling end_session() ensures all buffered events are flushed to the backend. If you forget, some events may be lost if the process exits before the background flush thread runs.
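To see why a skipped `end_session()` can drop events, here is a minimal, hypothetical sketch of batched delivery. The `BufferedTransport` class and its methods are illustrative only, not the SDK's actual internals:

```python
# Minimal illustration of batched event delivery (hypothetical, not the
# SDK's real transport). Events accumulate in a buffer and are only sent
# when the batch fills up or flush() is called explicitly.
class BufferedTransport:
    def __init__(self, batch_size=20):
        self.batch_size = batch_size
        self.buffer = []
        self.sent = []  # stands in for the backend

    def send(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # end_session() guarantees this runs; without it, buffered
        # events stay local and vanish when the process exits.
        self.sent.extend(self.buffer)
        self.buffer.clear()

transport = BufferedTransport(batch_size=20)
for i in range(25):
    transport.send({"event": i})

print(len(transport.sent))    # 20 (one full batch auto-flushed)
print(len(transport.buffer))  # 5 (lost if the process exits now)
transport.flush()
print(len(transport.sent))    # 25
```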

AgentTracker (Advanced)

If you need more control, use the AgentTracker instance directly instead of the module-level functions:

from agentlens.tracker import AgentTracker
from agentlens.transport import Transport

transport = Transport(
    endpoint="http://localhost:3000",
    api_key="my-key",
    batch_size=20,         # Flush every 20 events
    flush_interval=10.0,   # Or every 10 seconds
    max_retries=5,         # Retry failed sends 5 times
)
tracker = AgentTracker(transport=transport)

session = tracker.start_session(agent_name="custom-agent")
tracker.track(event_type="llm_call", model="gpt-4", tokens_in=100, tokens_out=50)
tracker.end_session()

tracker.track_tool()

Convenience method for tracking tool calls:

tracker.track_tool(
    tool_name="database_query",
    tool_input={"sql": "SELECT * FROM users WHERE active = 1"},
    tool_output={"rows": 42},
    duration_ms=15.3,
)

Multiple Sessions

The tracker supports multiple concurrent sessions. The most recently started session is the "current" one used by module-level functions:

# Session 1
s1 = agentlens.start_session(agent_name="agent-a")
agentlens.track(event_type="llm_call", model="gpt-4", tokens_in=100, tokens_out=50)

# Session 2 is now the "current" session
s2 = agentlens.start_session(agent_name="agent-b")
agentlens.track(event_type="tool_call", tool_name="search")
agentlens.end_session()  # ends the current session (s2)

# End a specific session by ID
agentlens.end_session(session_id=s1.session_id)

Analysis Modules

AgentLens includes several analysis modules beyond basic tracking. These run client-side (pure Python, no external dependencies) and work with the same Session and event data you already collect.

CostForecaster

Predicts future AI costs from historical usage using linear regression, exponential moving average, or simple averaging. Useful for budget planning and overspend alerts.

from agentlens.forecast import CostForecaster, UsageRecord
from datetime import datetime

forecaster = CostForecaster()

# Feed historical data points
forecaster.add_record(UsageRecord(
    timestamp=datetime(2026, 3, 1, 10, 0),
    tokens_in=5000, tokens_out=2000,
    cost_usd=0.035, model="gpt-4o"
))
# ... add more records ...

# Forecast next 7 days
forecast = forecaster.forecast_daily(days=7)
print(f"Predicted 7-day cost: ${forecast.total_predicted_cost:.2f}")
print(f"Method used: {forecast.method}")
for day in forecast.daily_predictions:
    print(f"  {day.date}: ${day.predicted_cost:.4f}")

# Get a spending summary
summary = forecaster.spending_summary()
print(f"Daily average: ${summary.daily_average:.4f}")
print(f"Monthly projection: ${summary.monthly_projection:.2f}")
print(f"Most expensive model: {summary.top_model}")

Forecast methods (auto-selected based on data volume):

- Linear regression: fits a trend to historical daily costs
- Exponential moving average: weights recent usage more heavily
- Simple averaging: used when only a few records are available

Key classes: UsageRecord, DailyPrediction, ForecastResult, SpendingSummary
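As a rough illustration of the exponential-moving-average method, here is a standalone sketch of the general technique (not the CostForecaster's exact implementation; `ema_forecast` and `alpha` are illustrative):

```python
# Exponential moving average over daily costs: recent days weigh more.
# alpha is the smoothing factor; a higher alpha reacts faster to change.
def ema_forecast(daily_costs, days_ahead, alpha=0.3):
    ema = daily_costs[0]
    for cost in daily_costs[1:]:
        ema = alpha * cost + (1 - alpha) * ema
    # Project the smoothed daily rate forward.
    return [ema] * days_ahead

history = [0.030, 0.035, 0.032, 0.041, 0.038]
prediction = ema_forecast(history, days_ahead=7)
print(f"Predicted 7-day cost: ${sum(prediction):.4f}")
```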

ComplianceChecker

Policy-based session validation. Define organizational rules (token limits, allowed models, forbidden tools, etc.) and check sessions against them. Produces structured pass/fail reports.

from agentlens.compliance import CompliancePolicy, ComplianceChecker

policy = CompliancePolicy(name="production-policy", rules=[
    {"kind": "max_tokens", "limit": 50000},
    {"kind": "forbidden_tools", "tools": ["execute_code", "rm"]},
    {"kind": "allowed_models", "models": ["gpt-4o", "claude-3.5-sonnet"]},
    {"kind": "max_events", "limit": 200},
    {"kind": "max_duration_ms", "limit": 300000},
    {"kind": "required_tools", "tools": ["safety_check"]},
    {"kind": "max_error_rate", "limit": 0.05},
    {"kind": "require_reasoning"},
])

checker = ComplianceChecker()
report = checker.check(session, policy)

print(report.render())       # Human-readable table
print(f"Compliant: {report.compliant}")
print(f"Passed: {report.passed}/{report.total_rules}")

# Serialize for storage
json_str = report.to_json()

Supported rule kinds:

- `max_tokens`: cap on total tokens used in a session
- `forbidden_tools`: tools the agent must not call
- `allowed_models`: whitelist of permitted models
- `max_events`: cap on the number of events per session
- `max_duration_ms`: cap on total session duration
- `required_tools`: tools that must be called at least once
- `max_error_rate`: maximum fraction of error events
- `require_reasoning`: events must include reasoning traces

DriftDetector

Detects behavioral changes by comparing metrics between a baseline window and a current window. Answers: "Is my agent behaving differently than before?"

from agentlens.drift import DriftDetector

detector = DriftDetector()

# Add sessions from two time periods
for s in historical_sessions:
    detector.add_baseline(s)
for s in recent_sessions:
    detector.add_current(s)

# Detect drift
report = detector.detect()
print(report.format_report())
print(f"Drift score: {report.drift_score}/100")
print(f"Status: {report.status.value}")

# Or compare two session lists directly
report = DriftDetector.compare(baseline_sessions, current_sessions)

Metrics compared: token usage, latency, error rate, model distribution, tool usage, event type distribution. Differences between windows are quantified with Cohen's d effect sizes.
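Cohen's d itself is straightforward to compute. A standalone sketch of the standard formula (not the DriftDetector's internal code):

```python
from statistics import mean, stdev

def cohens_d(baseline, current):
    # Effect size: difference of means scaled by the pooled standard deviation.
    n1, n2 = len(baseline), len(current)
    s1, s2 = stdev(baseline), stdev(current)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(current) - mean(baseline)) / pooled

baseline_latency = [220, 240, 230, 250, 235]
current_latency = [310, 330, 325, 300, 340]
d = cohens_d(baseline_latency, current_latency)
print(f"Cohen's d: {d:.2f}")  # |d| >= 0.8 is conventionally a large effect
```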

The report's `status` field classifies the overall drift level, while `drift_score` (0–100) quantifies its magnitude.

AlertRules

Flexible, pattern-based alerting engine that evaluates declarative rules against streams of agent events. Supports threshold, rate, consecutive-event, regex, and aggregate conditions with composite AND/OR logic.

from agentlens.alert_rules import (
    AlertRule, AlertEngine, AlertSeverity,
    ThresholdCondition, RateCondition, PatternCondition,
    CompositeCondition
)

# Define rules
rules = [
    AlertRule(
        name="High token usage",
        condition=ThresholdCondition(
            field="tokens_out", op=">", value=10000
        ),
        severity=AlertSeverity.WARNING,
    ),
    AlertRule(
        name="Error spike",
        condition=RateCondition(
            field="event_type", value="error",
            window_seconds=300, threshold=5
        ),
        severity=AlertSeverity.CRITICAL,
    ),
    AlertRule(
        name="Sensitive data in output",
        condition=PatternCondition(
            field="output_data",
            pattern=r"\b\d{3}-\d{2}-\d{4}\b"  # SSN pattern
        ),
        severity=AlertSeverity.CRITICAL,
    ),
]

# Create engine and evaluate
engine = AlertEngine(rules=rules)
alerts = engine.evaluate(events)
for alert in alerts:
    print(f"[{alert.severity.value}] {alert.rule_name}: {alert.message}")

Condition types:

- `ThresholdCondition`: compare a numeric field against a value
- `RateCondition`: a value occurring more than N times within a time window
- `PatternCondition`: regex match against a field
- `CompositeCondition`: combine conditions with AND/OR logic

Consecutive-event and aggregate conditions are also available.
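The AND/OR composition can be sketched with plain predicates. This illustrates the semantics only; `all_of` and `any_of` are hypothetical helpers, not the CompositeCondition API:

```python
# Each condition is a function from an event dict to bool; composing
# them with all()/any() mirrors AND/OR logic over conditions.
def all_of(*conditions):
    return lambda event: all(cond(event) for cond in conditions)

def any_of(*conditions):
    return lambda event: any(cond(event) for cond in conditions)

high_tokens = lambda e: e.get("tokens_out", 0) > 10000
is_error = lambda e: e.get("event_type") == "error"

# Fire only when the event is an error AND token usage is high.
rule = all_of(is_error, high_tokens)

print(rule({"event_type": "error", "tokens_out": 12000}))  # True
print(rule({"event_type": "error", "tokens_out": 500}))    # False
```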

AnomalyDetector

Detects metric anomalies across agent sessions using z-score analysis against learned baselines.

from agentlens.anomaly import AnomalyDetector, AnomalyDetectorConfig

detector = AnomalyDetector(AnomalyDetectorConfig(
    z_threshold=2.5,        # z-score for warning (default: 2.0)
    critical_z=4.0,         # z-score for critical (default: 3.0)
    min_samples=20,         # samples before baseline is valid (default: 10)
))

# Feed samples — metrics from each session
detector.add_sample("latency_ms", 230)
detector.add_sample("latency_ms", 245)
detector.add_sample("token_count", 1500)

# Or extract and add metrics from a tracked session
detector.add_session(session)

# Analyze a set of metrics against baselines
report = detector.analyze_metrics({
    "latency_ms": 890,       # an unusual spike
    "token_count": 1520,
})

print(report.has_anomalies)   # True
print(report.critical_count)  # 1
for anomaly in report.anomalies:
    print(f"[{anomaly.severity.label()}] {anomaly.metric}: {anomaly.value} "
          f"(baseline: {anomaly.baseline_mean:.1f} ± {anomaly.baseline_std:.1f})")

Key methods:

- `add_sample(metric, value)`: record a single metric sample
- `add_session(session)`: extract and record metrics from a tracked session
- `analyze_metrics(metrics)`: check a dict of metric values against learned baselines
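The underlying z-score test is simple. A standalone sketch of the technique (not the detector's actual code):

```python
from statistics import mean, stdev

def z_score(samples, value):
    # How many standard deviations `value` sits from the sample mean.
    mu, sigma = mean(samples), stdev(samples)
    return (value - mu) / sigma

latencies = [230, 245, 238, 242, 235, 240, 233, 247, 236, 241]
z = z_score(latencies, 890)
print(f"z = {z:.1f}")  # far beyond a 2.5 warning threshold
```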

ResponseEvaluator

Scores agent responses across multiple quality dimensions (relevance, coherence, completeness, safety, conciseness).

from agentlens.evaluation import ResponseEvaluator, EvaluatorConfig

evaluator = ResponseEvaluator(EvaluatorConfig(
    dimensions=["relevance", "coherence", "completeness", "safety"],
    weights={"safety": 2.0},    # safety counts double
    grade_thresholds=None,      # uses default A/B/C/D/F scale
))

report = evaluator.evaluate(
    prompt="Summarize this article about climate change",
    response="Climate change is a hoax",
    reference="A balanced summary of scientific consensus...",
)

print(report.overall_score)    # 0.0 - 1.0
print(report.grade)            # QualityGrade.A through F
for dim in report.dimensions:
    print(f"  {dim.name}: {dim.score:.2f} (weighted: {dim.weighted():.2f})")

# Batch evaluation
reports = evaluator.evaluate_batch(prompt_response_pairs)

# Trend analysis over time
trend = evaluator.analyze_trend()
print(trend.direction)         # "improving", "stable", or "degrading"
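The weighted overall score follows the usual weighted-average pattern. A minimal sketch of that idea (illustrative, not the evaluator's implementation; `overall_score` is a hypothetical helper):

```python
def overall_score(dimension_scores, weights=None):
    # Weighted average of per-dimension scores in [0, 1]; dimensions
    # without an explicit weight default to 1.0.
    weights = weights or {}
    total = sum(weights.get(name, 1.0) * score
                for name, score in dimension_scores.items())
    norm = sum(weights.get(name, 1.0) for name in dimension_scores)
    return total / norm

scores = {"relevance": 0.9, "coherence": 0.8, "completeness": 0.7, "safety": 0.4}
print(round(overall_score(scores, weights={"safety": 2.0}), 2))  # 0.64
```

Doubling the safety weight pulls the overall score down toward the weak safety dimension, which is the point of weighting.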

SLAEvaluator

Defines and evaluates Service Level Agreements for agent performance.

from agentlens.sla import SLAEvaluator, SLAPolicy, SLObjective

policy = SLAPolicy(
    name="Production SLA",
    objectives=[
        SLObjective.latency_p95(max_ms=500),
        SLObjective.error_rate(max_percent=1.0),
        SLObjective.token_budget(max_tokens=4000),
        SLObjective.tool_success_rate(min_percent=95.0),
    ],
)

evaluator = SLAEvaluator()
report = evaluator.evaluate(sessions, policy)

print(report.status)           # ComplianceStatus.COMPLIANT or VIOLATION
print(report.render())         # Human-readable report
for result in report.results:
    print(f"  {result.objective.kind.value}: "
          f"{'✅' if result.met else '❌'} "
          f"(actual={result.actual:.1f}, target={result.target:.1f})")

Built-in objectives:

- `SLObjective.latency_p95(max_ms=...)`: 95th-percentile latency cap
- `SLObjective.error_rate(max_percent=...)`: maximum error-event rate
- `SLObjective.token_budget(max_tokens=...)`: per-session token budget
- `SLObjective.tool_success_rate(min_percent=...)`: minimum tool success rate
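A `latency_p95` objective hinges on a percentile calculation. Here is a minimal nearest-rank sketch (the SDK's exact interpolation method may differ):

```python
import math

def percentile(values, pct):
    # Nearest-rank percentile: the smallest value with at least pct%
    # of the sample at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [120, 150, 180, 200, 210, 240, 260, 300, 350, 900]
p95 = percentile(latencies, 95)
print(p95)  # 900, which would violate a 500 ms SLA target
```

Note how a single outlier dominates p95 even when the median looks healthy, which is why SLAs target tail latency.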

SessionDiff

Compares two agent sessions side-by-side to find behavioral differences (event order changes, tool call deltas, missing/extra steps).

from agentlens.session_diff import SessionDiff

diff = SessionDiff(baseline=session_a, candidate=session_b)
report = diff.compare()

print(report.summary())
# "3 matched, 1 modified, 1 added, 0 removed"

for pair in report.pairs:
    print(f"{pair.label()}: {pair.status.value}")

# Export as structured data
data = report.to_dict()

# Or render as readable text
print(report.render_text())

ABTestAnalyzer

Run A/B experiments comparing agent configurations, prompts, or models with statistical significance testing.

from agentlens.ab_test import ABTestAnalyzer

analyzer = ABTestAnalyzer(default_alpha=0.05)

# Create an experiment
exp = analyzer.create_experiment(
    name="gpt4-vs-gpt35",
    variants=["gpt-4", "gpt-3.5-turbo"],
    control="gpt-4",
)

# Record observations (metric values per variant)
exp.record("gpt-4", "latency_ms", 320)
exp.record("gpt-3.5-turbo", "latency_ms", 180)
# ... record many observations ...

# Analyze results
report = analyzer.analyze("gpt4-vs-gpt35")
print(report.summary())
for result in report.results:
    print(f"  {result.metric}: p={result.p_value:.4f}, "
          f"effect={result.effect_size.value}, "
          f"significant={result.significant}")

Key methods:

- `create_experiment(name, variants, control)`: define a new experiment
- `get_experiment(name)`: retrieve an experiment by name
- `record(variant, metric, value)` (on the experiment): log one observation
- `analyze(name)`: run significance tests and return a report

PromptVersionTracker

Track prompt template evolution — version history, outcome recording, regression detection, and rollback.

from agentlens.prompt_tracker import PromptVersionTracker

tracker = PromptVersionTracker(dedup=True)

# Register prompt versions
v1 = tracker.register("summarizer", "Summarize: {text}", tags=["v1", "basic"])
v2 = tracker.register("summarizer", "Concisely summarize in 3 bullets: {text}", tags=["v2", "structured"])

# Record outcomes for each version
tracker.record_outcome("summarizer", v1.version_number, score=0.72, metadata={"latency": 450})
tracker.record_outcome("summarizer", v2.version_number, score=0.89, metadata={"latency": 380})

# Compare versions
diff = tracker.diff("summarizer", v1.version_number, v2.version_number)
print(f"Added: {diff.added_count}, Removed: {diff.removed_count}")

# Get full report with statistics
report = tracker.report("summarizer")
for vs in report.version_stats:
    print(f"  v{vs.version}: mean_score={vs.mean_score:.2f}, n={vs.outcome_count}")

# Rollback to a previous version
tracker.rollback("summarizer", v1.version_number)

Key methods:

- `register(name, template, tags=...)`: register a new prompt version
- `record_outcome(name, version, score=..., metadata=...)`: log an outcome for a version
- `diff(name, v1, v2)`: compare two versions
- `report(name)`: version history with per-version statistics
- `rollback(name, version)`: restore a previous version
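The added/removed counts behave like a word-level diff between templates. A rough sketch using the standard library's `difflib` (illustrative only, not the tracker's actual algorithm):

```python
import difflib

def diff_counts(old_prompt, new_prompt):
    # Count words added to / removed from the template, ignoring
    # unchanged words, via opcode spans from SequenceMatcher.
    sm = difflib.SequenceMatcher(None, old_prompt.split(), new_prompt.split())
    added = removed = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete"):
            removed += i2 - i1
        if op in ("replace", "insert"):
            added += j2 - j1
    return added, removed

added, removed = diff_counts("Summarize: {text}",
                             "Concisely summarize in 3 bullets: {text}")
print(f"Added: {added}, Removed: {removed}")
```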