Python SDK Reference
Complete reference for the agentlens Python package.
Installation
# From source (development mode)
cd sdk
pip install -e .
# From source (production)
pip install ./sdk
# With dev dependencies (for testing)
pip install -e ".[dev]"
agentlens.init()
Initialize the SDK and connect to the AgentLens backend. Must be called before any other SDK function.
agentlens.init(
api_key: str = "default",
endpoint: str = "http://localhost:3000"
) -> AgentTracker
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | "default" | API key for authentication (sent as X-API-Key header) |
| endpoint | str | "http://localhost:3000" | URL of the AgentLens backend |
Returns: An AgentTracker instance (also stored globally for module-level functions).
import agentlens
tracker = agentlens.init(
api_key="prod-key-abc123",
endpoint="https://agentlens.example.com"
)
All other SDK functions raise RuntimeError if init() hasn't been called.
agentlens.start_session()
Create a new tracking session. All subsequent track() calls are associated with this session.
agentlens.start_session(
agent_name: str = "default-agent",
metadata: dict | None = None
) -> Session
| Parameter | Type | Default | Description |
|---|---|---|---|
| agent_name | str | "default-agent" | Name identifying the agent |
| metadata | dict \| None | None | Arbitrary metadata (version, environment, etc.) |
Returns: A Session object with a unique session_id.
session = agentlens.start_session(
agent_name="research-agent-v2",
metadata={"version": "2.1.0", "environment": "production"}
)
print(f"Session: {session.session_id}")
agentlens.track()
Record a single event (LLM call, tool call, decision, error, etc.) in the current session.
agentlens.track(
event_type: str = "generic",
input_data: dict | None = None,
output_data: dict | None = None,
model: str | None = None,
tokens_in: int = 0,
tokens_out: int = 0,
reasoning: str | None = None,
tool_name: str | None = None,
tool_input: dict | None = None,
tool_output: dict | None = None,
duration_ms: float | None = None,
) -> AgentEvent
| Parameter | Type | Description |
|---|---|---|
| event_type | str | Event category: "llm_call", "tool_call", "decision", "error", "generic" |
| input_data | dict | Input to the operation (prompt, query, etc.) |
| output_data | dict | Output from the operation (response, result, etc.) |
| model | str | LLM model name (e.g., "gpt-4", "claude-3.5-sonnet") |
| tokens_in | int | Input/prompt token count |
| tokens_out | int | Output/completion token count |
| reasoning | str | Why the agent made this decision (creates a DecisionTrace) |
| tool_name | str | Tool/function name (creates a ToolCall) |
| tool_input | dict | Tool input parameters |
| tool_output | dict | Tool return value |
| duration_ms | float | Execution time in milliseconds |
Example: Track an LLM Call
agentlens.track(
event_type="llm_call",
input_data={"prompt": "Summarize this article", "article_url": "https://..."},
output_data={"response": "The article discusses..."},
model="gpt-4-turbo",
tokens_in=1200,
tokens_out=350,
reasoning="User asked for a summary. Using GPT-4 Turbo for long context.",
duration_ms=2340.5,
)
Example: Track a Tool Call
agentlens.track(
event_type="tool_call",
tool_name="web_search",
tool_input={"query": "latest AI safety papers 2026"},
tool_output={"results": [{"title": "...", "url": "..."}]},
duration_ms=890.2,
)
agentlens.explain()
Generate a human-readable explanation of all events in the current (or specified) session.
agentlens.explain(
session_id: str | None = None
) -> str
Returns a Markdown-formatted string with a timeline of events, token counts, and reasoning traces.
explanation = agentlens.explain()
print(explanation)
# Output:
# ## Session Explanation: research-agent-v2
# **Session ID:** a1b2c3d4
# **Started:** 2026-02-14T10:30:00+00:00
# **Status:** active
# **Total tokens:** 1550 in / 353 out
#
# ### Event Timeline:
# 1. [10:30:01.234] **llm_call** (model: gpt-4-turbo)
# 💡 Reasoning: User asked for a summary...
# 📊 Tokens: 1200 in / 350 out
# 2. [10:30:02.124] **tool_call** → tool: web_search
agentlens.end_session()
End the current session, mark it as completed, and flush all pending events to the backend.
agentlens.end_session(
session_id: str | None = None
) -> None
Calling end_session() ensures all buffered events are flushed to the backend. If you skip it and the process exits before the background flush thread runs, buffered events may be lost.
AgentTracker (Advanced)
If you need more control, use the AgentTracker instance directly instead of the module-level functions:
from agentlens.tracker import AgentTracker
from agentlens.transport import Transport
transport = Transport(
endpoint="http://localhost:3000",
api_key="my-key",
batch_size=20, # Flush every 20 events
flush_interval=10.0, # Or every 10 seconds
max_retries=5, # Retry failed sends 5 times
)
tracker = AgentTracker(transport=transport)
session = tracker.start_session(agent_name="custom-agent")
tracker.track(event_type="llm_call", model="gpt-4", tokens_in=100, tokens_out=50)
tracker.end_session()
tracker.track_tool()
Convenience method for tracking tool calls:
tracker.track_tool(
tool_name="database_query",
tool_input={"sql": "SELECT * FROM users WHERE active = 1"},
tool_output={"rows": 42},
duration_ms=15.3,
)
Multiple Sessions
The tracker supports multiple concurrent sessions. The most recently started session is the "current" one used by module-level functions:
# Session 1
s1 = agentlens.start_session(agent_name="agent-a")
agentlens.track(event_type="llm_call", model="gpt-4", tokens_in=100, tokens_out=50)
agentlens.end_session()
# Session 2
s2 = agentlens.start_session(agent_name="agent-b")
agentlens.track(event_type="tool_call", tool_name="search")
agentlens.end_session()
# Or end a specific session by ID
agentlens.end_session(session_id=s1.session_id)
Analysis Modules
AgentLens includes several analysis modules beyond basic tracking. These run client-side (pure Python, no external dependencies) and work with the same Session and event data you already collect.
CostForecaster
Predicts future AI costs from historical usage using linear regression, exponential moving average, or simple averaging. Useful for budget planning and overspend alerts.
from agentlens.forecast import CostForecaster, UsageRecord
from datetime import datetime
forecaster = CostForecaster()
# Feed historical data points
forecaster.add_record(UsageRecord(
timestamp=datetime(2026, 3, 1, 10, 0),
tokens_in=5000, tokens_out=2000,
cost_usd=0.035, model="gpt-4o"
))
# ... add more records ...
# Forecast next 7 days
forecast = forecaster.forecast_daily(days=7)
print(f"Predicted 7-day cost: ${forecast.total_predicted_cost:.2f}")
print(f"Method used: {forecast.method}")
for day in forecast.daily_predictions:
print(f" {day.date}: ${day.predicted_cost:.4f}")
# Get a spending summary
summary = forecaster.spending_summary()
print(f"Daily average: ${summary.daily_average:.4f}")
print(f"Monthly projection: ${summary.monthly_projection:.2f}")
print(f"Most expensive model: {summary.top_model}")
Forecast methods (auto-selected based on data volume):
- Linear regression — 7+ days of data; captures trends
- Exponential moving average (EMA) — 3–6 days; weights recent usage more
- Simple average — fewer than 3 days; straightforward extrapolation
Key classes: UsageRecord, DailyPrediction, ForecastResult, SpendingSummary
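The EMA path can be sketched in plain Python. This is an illustrative standalone function, not the forecaster's actual internals; the smoothing factor and flat projection are assumptions:

```python
def ema_forecast(daily_costs, days_ahead, alpha=0.5):
    """Forecast future daily cost with an exponential moving average.

    daily_costs: historical cost per day, oldest first.
    alpha: smoothing factor; higher weights recent days more heavily.
    """
    ema = daily_costs[0]
    for cost in daily_costs[1:]:
        ema = alpha * cost + (1 - alpha) * ema
    # Project the smoothed daily rate flat across the horizon
    return [ema] * days_ahead

predictions = ema_forecast([0.030, 0.042, 0.038, 0.055], days_ahead=7)
total = sum(predictions)  # predicted 7-day cost
```

Because recent days dominate the smoothed value, a usage spike late in the window pulls the whole projection up, which is why EMA suits the 3–6 day regime where a trend line would overfit.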
ComplianceChecker
Policy-based session validation. Define organizational rules (token limits, allowed models, forbidden tools, etc.) and check sessions against them. Produces structured pass/fail reports.
from agentlens.compliance import CompliancePolicy, ComplianceChecker
policy = CompliancePolicy(name="production-policy", rules=[
{"kind": "max_tokens", "limit": 50000},
{"kind": "forbidden_tools", "tools": ["execute_code", "rm"]},
{"kind": "allowed_models", "models": ["gpt-4o", "claude-3.5-sonnet"]},
{"kind": "max_events", "limit": 200},
{"kind": "max_duration_ms", "limit": 300000},
{"kind": "required_tools", "tools": ["safety_check"]},
{"kind": "max_error_rate", "limit": 0.05},
{"kind": "require_reasoning"},
])
checker = ComplianceChecker()
report = checker.check(session, policy)
print(report.render()) # Human-readable table
print(f"Compliant: {report.compliant}")
print(f"Passed: {report.passed}/{report.total_rules}")
# Serialize for storage
json_str = report.to_json()
Supported rule kinds:
- max_tokens / min_tokens — Total token bounds
- allowed_models / forbidden_models — Model allow/deny lists
- required_tools / forbidden_tools — Tool usage enforcement
- max_events / min_events — Event count bounds
- max_duration_ms — Session duration limit
- max_tool_calls — Tool call count limit
- require_reasoning — Require reasoning in decision traces
- max_error_rate — Maximum error event ratio
- custom — Custom rule with a Python callable
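As an illustration of what a rule check reduces to, here is a standalone sketch of a max_tokens-style check over session events (the checker's real implementation and event shape may differ):

```python
def check_max_tokens(events, limit):
    """Return (passed, total) for a max_tokens-style compliance rule.

    events: list of dicts carrying tokens_in / tokens_out counts.
    """
    total = sum(e.get("tokens_in", 0) + e.get("tokens_out", 0) for e in events)
    return total <= limit, total

events = [
    {"tokens_in": 1200, "tokens_out": 350},
    {"tokens_in": 300, "tokens_out": 120},
]
passed, total = check_max_tokens(events, limit=50000)  # passes: 1970 <= 50000
```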
DriftDetector
Detects behavioral changes by comparing metrics between a baseline window and a current window. Answers: "Is my agent behaving differently than before?"
from agentlens.drift import DriftDetector
detector = DriftDetector()
# Add sessions from two time periods
for s in historical_sessions:
detector.add_baseline(s)
for s in recent_sessions:
detector.add_current(s)
# Detect drift
report = detector.detect()
print(report.format_report())
print(f"Drift score: {report.drift_score}/100")
print(f"Status: {report.status.value}")
# Or compare two session lists directly
report = DriftDetector.compare(baseline_sessions, current_sessions)
Metrics compared: token usage, latency, error rate, model distribution, tool usage, event type distribution. Cohen's d is used to quantify the effect size of each change.
Drift statuses:
- stable — No significant changes (score 0–15)
- minor_drift — Small behavioral changes (score 16–40)
- significant_drift — Notable behavioral shift (score 41–70)
- degraded — Major degradation detected (score 71–100)
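Cohen's d, mentioned above, is the difference in sample means divided by the pooled standard deviation. A standalone sketch of the formula, not the detector's actual code:

```python
from statistics import mean, stdev

def cohens_d(baseline, current):
    """Effect size between two samples: mean difference over pooled std."""
    n1, n2 = len(baseline), len(current)
    s1, s2 = stdev(baseline), stdev(current)
    pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(current) - mean(baseline)) / pooled

# Latency samples from two windows; |d| >= 0.8 is conventionally a large effect
d = cohens_d([220, 240, 230, 250], [300, 320, 310, 330])
```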
AlertRules
Flexible, pattern-based alerting engine that evaluates declarative rules against streams of agent events. Supports threshold, rate, consecutive-event, regex, and aggregate conditions with composite AND/OR logic.
from agentlens.alert_rules import (
AlertRule, AlertEngine, AlertSeverity,
ThresholdCondition, RateCondition, PatternCondition,
CompositeCondition
)
# Define rules
rules = [
AlertRule(
name="High token usage",
condition=ThresholdCondition(
field="tokens_out", op=">", value=10000
),
severity=AlertSeverity.WARNING,
),
AlertRule(
name="Error spike",
condition=RateCondition(
field="event_type", value="error",
window_seconds=300, threshold=5
),
severity=AlertSeverity.CRITICAL,
),
AlertRule(
name="Sensitive data in output",
condition=PatternCondition(
field="output_data",
pattern=r"\b\d{3}-\d{2}-\d{4}\b" # SSN pattern
),
severity=AlertSeverity.CRITICAL,
),
]
# Create engine and evaluate
engine = AlertEngine(rules=rules)
alerts = engine.evaluate(events)
for alert in alerts:
print(f"[{alert.severity.value}] {alert.rule_name}: {alert.message}")
Condition types:
- ThresholdCondition — Field value exceeds a threshold
- RateCondition — Events matching a criterion exceed a rate within a time window
- ConsecutiveCondition — N consecutive events match a criterion
- PatternCondition — Regex match on a string field
- AggregateCondition — Aggregate function (sum/avg/max/min) exceeds a threshold
- CompositeCondition — Combine conditions with AND/OR logic
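Conceptually, composite evaluation is ordinary boolean combination of per-event checks. The helpers below are illustrative stand-ins, not the engine's real classes:

```python
import re

def matches_threshold(event, field, op, value):
    """Threshold-style check: compare a numeric field against a value."""
    v = event.get(field, 0)
    return {"<": v < value, ">": v > value, "==": v == value}[op]

def matches_pattern(event, field, pattern):
    """Pattern-style check: regex search on a stringified field."""
    return re.search(pattern, str(event.get(field, ""))) is not None

# AND-composite: token spike AND an SSN-looking string in the output
event = {"tokens_out": 12000, "output_data": "ref 123-45-6789"}
fired = (matches_threshold(event, "tokens_out", ">", 10000)
         and matches_pattern(event, "output_data", r"\b\d{3}-\d{2}-\d{4}\b"))
```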
AnomalyDetector
Detects metric anomalies across agent sessions using z-score analysis against learned baselines.
from agentlens.anomaly import AnomalyDetector, AnomalyDetectorConfig
detector = AnomalyDetector(AnomalyDetectorConfig(
z_threshold=2.5, # z-score for warning (default: 2.0)
critical_z=4.0, # z-score for critical (default: 3.0)
min_samples=20, # samples before baseline is valid (default: 10)
))
# Feed samples — metrics from each session
detector.add_sample("latency_ms", 230)
detector.add_sample("latency_ms", 245)
detector.add_sample("token_count", 1500)
# Or extract and add metrics from a tracked session
detector.add_session(session)
# Analyze a set of metrics against baselines
report = detector.analyze_metrics({
"latency_ms": 890, # an unusual spike
"token_count": 1520,
})
print(report.has_anomalies) # True
print(report.critical_count) # 1
for anomaly in report.anomalies:
print(f"[{anomaly.severity.label()}] {anomaly.metric}: {anomaly.value} "
f"(baseline: {anomaly.baseline_mean:.1f} ± {anomaly.baseline_std:.1f})")
Key methods:
- add_sample(metric, value) — Feed a single metric observation
- add_session(session) — Extract and add all metrics from a tracked session
- get_baseline(metric) — Get a MetricBaseline (mean, std, count, coefficient of variation)
- analyze_metrics(metrics_dict) — Returns an AnomalyReport with anomalies, severities, and summary
- reset() — Clear all baselines and start fresh
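The underlying z-score test is simple to state: how many standard deviations a new value sits from the baseline mean. A standalone sketch (thresholds mirror the config above; the detector's internals may differ):

```python
from statistics import mean, stdev

def z_score(samples, value):
    """Standard deviations between `value` and the sample mean."""
    m, s = mean(samples), stdev(samples)
    return (value - m) / s if s else 0.0

baseline = [230, 245, 240, 235, 250]   # latency_ms observations
z = z_score(baseline, 890)             # the spike from the example above
is_warning = abs(z) >= 2.5
is_critical = abs(z) >= 4.0
```

This is also why min_samples matters: with too few observations the standard deviation is unstable, so a valid baseline is required before any anomaly is reported.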
ResponseEvaluator
Scores agent responses across multiple quality dimensions (relevance, coherence, completeness, safety, conciseness).
from agentlens.evaluation import ResponseEvaluator, EvaluatorConfig
evaluator = ResponseEvaluator(EvaluatorConfig(
dimensions=["relevance", "coherence", "completeness", "safety"],
weights={"safety": 2.0}, # safety counts double
grade_thresholds=None, # uses default A/B/C/D/F scale
))
report = evaluator.evaluate(
prompt="Summarize this article about climate change",
response="Climate change is a hoax",
reference="A balanced summary of scientific consensus...",
)
print(report.overall_score) # 0.0 - 1.0
print(report.grade) # QualityGrade.A through F
for dim in report.dimensions:
print(f" {dim.name}: {dim.score:.2f} (weighted: {dim.weighted():.2f})")
# Batch evaluation
reports = evaluator.evaluate_batch(prompt_response_pairs)
# Trend analysis over time
trend = evaluator.analyze_trend()
print(trend.direction) # "improving", "stable", or "degrading"
SLAEvaluator
Defines and evaluates Service Level Agreements for agent performance.
from agentlens.sla import SLAEvaluator, SLAPolicy, SLObjective
policy = SLAPolicy(
name="Production SLA",
objectives=[
SLObjective.latency_p95(max_ms=500),
SLObjective.error_rate(max_percent=1.0),
SLObjective.token_budget(max_tokens=4000),
SLObjective.tool_success_rate(min_percent=95.0),
],
)
evaluator = SLAEvaluator()
report = evaluator.evaluate(sessions, policy)
print(report.status) # ComplianceStatus.COMPLIANT or VIOLATION
print(report.render()) # Human-readable report
for result in report.results:
print(f" {result.objective.kind.value}: "
f"{'✅' if result.met else '❌'} "
f"(actual={result.actual:.1f}, target={result.target:.1f})")
Built-in objectives:
- SLObjective.latency_p95(max_ms) — 95th percentile latency must stay under threshold
- SLObjective.latency_avg(max_ms) — Average latency threshold
- SLObjective.error_rate(max_percent) — Maximum acceptable error rate
- SLObjective.token_budget(max_tokens) — Per-session token budget cap
- SLObjective.tool_success_rate(min_percent) — Minimum tool call success rate
- SLObjective.throughput(min_rps) — Minimum requests per second
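For reference, a p95 check like latency_p95 reduces to sorting and indexing. This sketch uses the nearest-rank method; the SDK may interpolate differently:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

latencies = [120, 180, 200, 210, 230, 260, 300, 310, 420, 480]
p95 = percentile(latencies, 95)
met = p95 <= 500                      # SLObjective.latency_p95(max_ms=500)
```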
SessionDiff
Compares two agent sessions side-by-side to find behavioral differences (event order changes, tool call deltas, missing/extra steps).
from agentlens.session_diff import SessionDiff
diff = SessionDiff(baseline=session_a, candidate=session_b)
report = diff.compare()
print(report.summary())
# "3 matched, 1 modified, 1 added, 0 removed"
for pair in report.pairs:
print(f"{pair.label()}: {pair.status.value}")
# Export as structured data
data = report.to_dict()
# Or render as readable text
print(report.render_text())
ABTestAnalyzer
Run A/B experiments comparing agent configurations, prompts, or models with statistical significance testing.
from agentlens.ab_test import ABTestAnalyzer
analyzer = ABTestAnalyzer(default_alpha=0.05)
# Create an experiment
exp = analyzer.create_experiment(
name="gpt4-vs-gpt35",
variants=["gpt-4", "gpt-3.5-turbo"],
control="gpt-4",
)
# Record observations (metric values per variant)
analyzer.get_experiment("gpt4-vs-gpt35").record("gpt-4", "latency_ms", 320)
analyzer.get_experiment("gpt4-vs-gpt35").record("gpt-3.5-turbo", "latency_ms", 180)
# ... record many observations ...
# Analyze results
report = analyzer.analyze("gpt4-vs-gpt35")
print(report.summary())
for result in report.results:
print(f" {result.metric}: p={result.p_value:.4f}, "
f"effect={result.effect_size.value}, "
f"significant={result.significant}")
Key methods:
- create_experiment(name, variants, control) — Set up an A/B test
- record(variant, metric, value) — Add an observation to a variant
- analyze(name) — Run statistical tests (t-test) and return an ExperimentReport
- analyze_all() — Analyze all active experiments at once
- export_experiments() — Serialize all experiment data
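The t-test behind analyze() compares per-variant samples of a metric. A standalone Welch's t-statistic sketch; the analyzer's exact test and p-value computation may differ:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

control = [320, 335, 310, 340, 325]      # gpt-4 latency_ms observations
variant = [180, 195, 170, 200, 185]      # gpt-3.5-turbo latency_ms observations
t = welch_t(control, variant)
# |t| well above ~2 suggests a significant difference at alpha=0.05
```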
PromptVersionTracker
Track prompt template evolution — version history, outcome recording, regression detection, and rollback.
from agentlens.prompt_tracker import PromptVersionTracker
tracker = PromptVersionTracker(dedup=True)
# Register prompt versions
v1 = tracker.register("summarizer", "Summarize: {text}", tags=["v1", "basic"])
v2 = tracker.register("summarizer", "Concisely summarize in 3 bullets: {text}", tags=["v2", "structured"])
# Record outcomes for each version
tracker.record_outcome("summarizer", v1.version_number, score=0.72, metadata={"latency": 450})
tracker.record_outcome("summarizer", v2.version_number, score=0.89, metadata={"latency": 380})
# Compare versions
diff = tracker.diff("summarizer", v1.version_number, v2.version_number)
print(f"Added: {diff.added_count}, Removed: {diff.removed_count}")
# Get full report with statistics
report = tracker.report("summarizer")
for vs in report.version_stats:
print(f" v{vs.version}: mean_score={vs.mean_score:.2f}, n={vs.outcome_count}")
# Rollback to a previous version
tracker.rollback("summarizer", v1.version_number)
Key methods:
- register(name, template, tags) — Register a new prompt version (auto-deduplicates identical templates)
- record_outcome(name, version, score, metadata) — Record a quality score for a specific version
- diff(name, v1, v2) — Character-level diff between two versions
- report(name) — Full report with per-version statistics
- search_by_tag(tag) — Find versions by tag
- rollback(name, version) — Re-register an older version as the latest
- export_json() / import_json(data) — Serialize/restore all tracker state