Analytics
Know what your agents are doing and what it costs. Strathon exposes three read endpoints for operator analytics: the same APIs the dashboard, Grafana integrations, and enterprise tooling consume.
Trace list
GET /v1/traces?limit=50&agent_name=my-bot&intervention_state=blocked
Authorization: Bearer stra_…
Lists traces for the caller's project, newest first. Supports:
- start_after / start_before: nanosecond Unix timestamp or ISO 8601
- agent_name: exact match
- intervention_state: exact match (e.g. blocked, steered)
- cursor: opaque string from next_cursor in the previous response
- limit: 1–1000, default 50
Each trace in the response includes trace_id, start/end timestamps, agent_name, workflow_name, total_cost_usd, span_count, and intervention_state.
Requires traces:read scope.
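A minimal sketch of cursor pagination over this endpoint, using only the Python standard library. The host in BASE_URL, the helper names, and the "traces" key holding the result list are assumptions made for illustration; only the query parameters and next_cursor come from the description above.

```python
import json
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://strathon.example.com"  # hypothetical host, replace with your receiver


def trace_list_url(base_url: str, **filters) -> str:
    """Build a /v1/traces URL, dropping filters whose value is None."""
    query = {k: v for k, v in filters.items() if v is not None}
    return f"{base_url}/v1/traces?{urlencode(query)}"


def http_fetch(token: str) -> Callable[[str], dict]:
    """Return a function that GETs a URL with the bearer token and decodes JSON."""
    def fetch(url: str) -> dict:
        req = Request(url, headers={"Authorization": f"Bearer {token}"})
        with urlopen(req) as resp:
            return json.load(resp)
    return fetch


def iter_traces(fetch_page: Callable[[str], dict],
                base_url: str = BASE_URL, **filters) -> Iterator[dict]:
    """Yield every matching trace, following next_cursor until it is absent."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(trace_list_url(base_url, cursor=cursor, **filters))
        # ASSUMPTION: the trace objects are returned under a "traces" key.
        yield from page.get("traces", [])
        cursor = page.get("next_cursor")
        if not cursor:
            return
```

Passing the fetch function in keeps the paging logic testable without a network; in practice it would be called as iter_traces(http_fetch(token), agent_name="my-bot", intervention_state="blocked").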
Trace tree
GET /v1/traces/{trace_id}/tree
Authorization: Bearer stra_…
Reconstructs the full span hierarchy for a single trace. Returns:
- trace: trace-level metadata (timestamps, agent, cost, span count)
- root: the root span node with nested children arrays
- span_count: total spans in the trace
Each span node carries name, kind, start/end time, duration_ms, cost,
tokens, model, agent, tool, intervention_state, and a children array
of child span nodes.
If the trace has multiple root spans (spans with no parent_span_id), root is
an array of span nodes instead of a single object.
Requires traces:read scope.
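Because root can be either a single object or an array, client code usually normalizes it before walking the tree. The sketch below shows one way to do that with an iterative depth-first traversal; the function names are hypothetical, and the field names (root, children, name, duration_ms) follow the description above.

```python
from typing import Iterator, List, Tuple


def iter_spans(tree: dict) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, span) for every span node, parents before children."""
    root = tree.get("root")
    if root is None:
        return
    # root is one object, or an array when the trace has several root spans.
    roots = root if isinstance(root, list) else [root]
    stack = [(0, node) for node in reversed(roots)]
    while stack:
        depth, node = stack.pop()
        yield depth, node
        # Push children in reverse so they are popped in their original order.
        for child in reversed(node.get("children") or []):
            stack.append((depth + 1, child))


def render(tree: dict) -> List[str]:
    """Return one indented line per span, e.g. for printing a trace outline."""
    return [
        f"{'  ' * depth}{span.get('name')} ({span.get('duration_ms')} ms)"
        for depth, span in iter_spans(tree)
    ]
```

An explicit stack avoids recursion limits on deeply nested traces, and counting the yielded nodes gives a quick consistency check against span_count.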
Span aggregation
GET /v1/spans/aggregate?group_by=request_model&time_bucket=1d
Authorization: Bearer stra_…
Groups spans by a dimension and returns aggregate metrics:
- span_count: number of spans in the group
- total_cost_usd: sum of cost_usd
- total_input_tokens: sum of input_tokens
- total_output_tokens: sum of output_tokens
Parameters:
- group_by: one of agent_name, tool_name, operation_name, request_model, provider_name, kind, status_code, intervention_state
- time_bucket: optional; one of 1h, 6h, 1d, 7d, 30d. Adds a bucket field (nanosecond timestamp of the bucket start)
- start_after / start_before: time range filter
- limit: 1–1000, default 100
Requires traces:read scope.
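The sketch below builds an aggregation query with client-side validation of the documented parameter values, and converts the nanosecond bucket value into a UTC datetime. The helper names are hypothetical; the allowed values and limits are the ones listed above.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

GROUP_BY = {
    "agent_name", "tool_name", "operation_name", "request_model",
    "provider_name", "kind", "status_code", "intervention_state",
}
TIME_BUCKETS = {"1h", "6h", "1d", "7d", "30d"}


def aggregate_query(group_by: str, time_bucket: Optional[str] = None,
                    limit: int = 100) -> str:
    """Return the path and query string for GET /v1/spans/aggregate."""
    if group_by not in GROUP_BY:
        raise ValueError(f"unsupported group_by: {group_by!r}")
    if time_bucket is not None and time_bucket not in TIME_BUCKETS:
        raise ValueError(f"unsupported time_bucket: {time_bucket!r}")
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    params = {"group_by": group_by}
    if time_bucket is not None:
        params["time_bucket"] = time_bucket
    params["limit"] = limit
    return "/v1/spans/aggregate?" + urlencode(params)


def bucket_start(bucket_ns: int) -> datetime:
    """Convert a nanosecond Unix timestamp to a timezone-aware UTC datetime."""
    seconds, nanos = divmod(bucket_ns, 1_000_000_000)
    # datetime resolves to microseconds, so sub-microsecond digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )
```

Splitting the timestamp with divmod before converting avoids the float rounding that dividing a 19-digit nanosecond value by 1e9 would introduce.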
Behavioral drift detection (Vigil)
Vigil watches each agent's behavior against its own learned baseline and fires an alert when the behavior shifts. It runs as a background task in the receiver (60-second tick) and requires no configuration to start; it calibrates itself from production traffic.
Four signals are tracked per agent, computed over a trailing 5-minute window:
| Signal | Definition | Alert severity |
|---|---|---|
| deny_rate | Fraction of spans blocked or denied by policy | high |
| error_rate | Fraction of spans with ERROR status | high |
| tool_call_rate | Spans per minute | medium |
| cost_rate | USD per minute | medium |
How it works
EWMA (exponentially weighted moving average) establishes the baseline; CUSUM (cumulative sum) detects sustained shifts away from it. Together they catch both a sudden spike and a slow drift that a single static threshold would miss.
Each agent×signal baseline calibrates independently: Vigil accumulates one observation per tick in which the agent was active, and starts alerting only after 100 observations (configurable). Until then it learns silently; a new agent never alerts on its first day of normal behavior. After an alert fires, the CUSUM accumulators reset so one sustained shift produces one alert, not a storm.
Baselines are held in receiver memory and rebuild from live traffic after a restart (the calibration window starts over).
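To make the mechanics concrete, here is a minimal sketch of an EWMA baseline paired with a two-sided CUSUM for one agent×signal pair. It is not Vigil's implementation: the class name, the standardization of deviations by an exponentially weighted variance, and the two-sided form are assumptions. The default parameters mirror the documented configuration defaults.

```python
import math
from dataclasses import dataclass


@dataclass
class DriftDetector:
    """Illustrative EWMA + CUSUM detector for one agent and one signal."""
    alpha: float = 0.3            # EWMA smoothing factor
    threshold: float = 5.0        # CUSUM level that fires an alert
    drift: float = 0.5            # CUSUM slack absorbed as noise
    min_observations: int = 100   # silent calibration period
    mean: float = 0.0
    var: float = 0.0
    count: int = 0
    cusum_hi: float = 0.0
    cusum_lo: float = 0.0

    def observe(self, value: float) -> bool:
        """Feed one per-tick observation; return True if an alert fires."""
        self.count += 1
        if self.count == 1:
            self.mean = value     # seed the baseline with the first value
            return False
        deviation = value - self.mean
        std = math.sqrt(self.var)
        fired = False
        # Only score observations once the calibration period is over.
        if self.count > self.min_observations and std > 0:
            z = deviation / std   # ASSUMPTION: deviations are standardized
            self.cusum_hi = max(0.0, self.cusum_hi + z - self.drift)
            self.cusum_lo = max(0.0, self.cusum_lo - z - self.drift)
            if max(self.cusum_hi, self.cusum_lo) > self.threshold:
                fired = True
                # Reset so one sustained shift yields one alert.
                self.cusum_hi = self.cusum_lo = 0.0
        # Update the exponentially weighted mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return fired
```

In this sketch a deny_rate that jumps from about 0.03 to 0.42 after calibration fires once on the first shifted observation; the accumulators then reset, and the baseline adapts toward the new level quickly enough that no further alert follows.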
Alerts
A drift alert is dispatched through the notification system as a
behavioral_drift event, over the same channels (Slack, Discord, webhook) used
for policy and approval events. The payload carries the agent, the signal,
the current value, the baseline EWMA, a severity, and a human-readable
message:
{
  "type": "behavioral_drift",
  "agent_name": "support-agent",
  "metric": "deny_rate",
  "current_value": 0.42,
  "baseline_ewma": 0.03,
  "severity": "high",
  "message": "Agent 'support-agent' deny_rate drifted significantly (current: 0.4200, baseline: 0.0300)"
}
A rising deny_rate on a previously quiet agent is the classic signature of
prompt injection or goal hijack: the agent starts attempting calls your
policies exist to stop. Drift detection surfaces the pattern; the policies
themselves remain the enforcement.
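A webhook consumer might parse the payload along these lines. This is a sketch under stated assumptions: the DriftAlert class, the ratio helper, and the rule that only high-severity alerts page someone are hypothetical; the field names are taken from the example payload above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriftAlert:
    agent_name: str
    metric: str
    current_value: float
    baseline_ewma: float
    severity: str
    message: str

    @property
    def ratio(self) -> float:
        """How many times the baseline the current value is."""
        if self.baseline_ewma == 0:
            return float("inf")
        return self.current_value / self.baseline_ewma


def parse_drift_alert(body: str) -> Optional[DriftAlert]:
    """Return a DriftAlert for behavioral_drift events, None for anything else."""
    event = json.loads(body)
    if event.get("type") != "behavioral_drift":
        return None
    return DriftAlert(
        agent_name=event["agent_name"],
        metric=event["metric"],
        current_value=float(event["current_value"]),
        baseline_ewma=float(event["baseline_ewma"]),
        severity=event["severity"],
        message=event["message"],
    )


def should_page(alert: DriftAlert) -> bool:
    """Hypothetical routing rule: page on high severity, log the rest."""
    return alert.severity == "high"
```

Returning None for other event types lets the same handler sit on a webhook that also receives policy and approval events.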
Configuration
| Env var | Default | Meaning |
|---|---|---|
| STRATHON_VIGIL_MIN_SPANS | 100 | Observations before a baseline calibrates and can alert |
| STRATHON_VIGIL_EWMA_ALPHA | 0.3 | EWMA smoothing factor (higher = baseline adapts faster) |
| STRATHON_VIGIL_CUSUM_THRESHOLD | 5.0 | CUSUM breach level that fires an alert |
| STRATHON_VIGIL_CUSUM_DRIFT | 0.5 | CUSUM slack: drift smaller than this is absorbed as noise |
Related
- Span search: the raw data these analytics aggregate
- Budgets: turn cost visibility into enforced caps
- Metrics: Prometheus counters for the receiver itself