Detectors

Detectors are STACK's Layer 3. They score every agent action against the passport that authorized it and fire when behavior crosses thresholds. Detectors sit on two pipelines: real-time signals fire during a mission and can auto-revoke in enforced mode; post-hoc flags fire at checkout and gate review.

A detector does not decide whether a request is authorized - that is the proxy's job. Detectors measure drift, anomaly, and policy violations that the hard rules cannot see, then surface them as security events or review flags.

Two pipelines

Real-time (security-event stream)

Runs on every proxy call, credential retrieval, delegation, and checkpoint. Emits a security_event row with a typed signal, severity, and metadata. In enforced mode, selected detectors also revoke the passport mid-session.

Post-hoc (checkout review flags)

Runs once per mission when the agent submits a checkout. Compares the declared intent and the rolling baseline for that agent against what actually happened. Emits flags that bucket the checkout as clean, flagged, or blocked. Flagged checkouts queue for operator review.

Severity levels

Every signal carries one of three severities. Notification rules subscribe by minimum severity - see /docs/api/notifications.

  • info - informational; no action needed (often emitted in logged-mode agents)
  • warning - concerning; review recommended; enforced-mode auto-revoke may fire
  • critical - hard violation; enforced-mode auto-revoke always fires; operator paged
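
Minimum-severity matching can be sketched as follows, assuming the three levels are ordered info < warning < critical. The function name and the rank table are hypothetical and are not part of the notifications API.

```python
# Assumed ordering of the three severities listed above.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def rule_matches(min_severity: str, event_severity: str) -> bool:
    """Return True when an event meets a rule's minimum severity."""
    return SEVERITY_RANK[event_severity] >= SEVERITY_RANK[min_severity]
```

Under this ordering, a rule subscribed at warning would receive warning and critical signals but not info.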

Real-time detector catalog

credential_outside_scope

Hard rule. Agent requested a service not listed in the passport's intent_services or granted providers. Emitted synchronously by the proxy before the call is forwarded. Severity: critical in enforced and standard modes; warning in logged mode.

scope_drift

Behavioral. Scores each proxy call for deviation from the passport's declared intent across two dimensions. Fires once per passport when either threshold trips.

  • Granted-undeclared delta: +0.30 per call where the service is in the grant but not in intent_services
  • Novel-path delta: +0.50 the first time (service, pathname) is seen for this passport
  • Cumulative threshold: ≥ 2.00 units
  • Single-call novelty threshold: ≥ 0.80 units
  • Enforced mode: fire → passport revoked (cascade follows); logged mode: event only
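
The scoring rules above can be sketched as a small per-passport accumulator. This is an illustration under stated assumptions, not STACK's implementation: it assumes the two deltas add up when one call qualifies for both, and the DriftState and score_call names are hypothetical.

```python
from dataclasses import dataclass, field

GRANTED_UNDECLARED_DELTA = 0.30
NOVEL_PATH_DELTA = 0.50
CUMULATIVE_THRESHOLD = 2.00
SINGLE_CALL_THRESHOLD = 0.80
EPS = 1e-9  # tolerance for float sums such as 0.30 + 0.50


@dataclass
class DriftState:
    """Per-passport drift accumulator (hypothetical structure)."""
    intent_services: set
    granted_services: set
    seen_paths: set = field(default_factory=set)
    cumulative: float = 0.0
    fired: bool = False


def score_call(state: DriftState, service: str, pathname: str) -> bool:
    """Score one proxy call; return True only on the call that fires."""
    delta = 0.0
    if service in state.granted_services and service not in state.intent_services:
        delta += GRANTED_UNDECLARED_DELTA
    if (service, pathname) not in state.seen_paths:
        state.seen_paths.add((service, pathname))
        delta += NOVEL_PATH_DELTA
    state.cumulative += delta
    tripped = (
        delta >= SINGLE_CALL_THRESHOLD - EPS
        or state.cumulative >= CUMULATIVE_THRESHOLD - EPS
    )
    if tripped and not state.fired:
        state.fired = True  # the detector fires once per passport
        return True
    return False
```

Under the additive assumption, a first call to a granted-but-undeclared service on a new path scores 0.80 and trips the single-call threshold on its own, while four novel paths on declared services reach the cumulative threshold of 2.00.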

credential_burst

Agent retrieved credentials more than CREDENTIAL_BURST_THRESHOLD times inside CREDENTIAL_BURST_WINDOW_SECONDS (defaults: 10 in 60s). Commonly indicates a stuck loop or credential harvesting. Severity: warning.
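
A sliding-window counter is one way to picture this check. The sketch below uses the documented defaults; the BurstWindow class and its in-memory deque are assumptions for illustration, and the real detector may count differently.

```python
from collections import deque

CREDENTIAL_BURST_THRESHOLD = 10
CREDENTIAL_BURST_WINDOW_SECONDS = 60


class BurstWindow:
    """Hypothetical sliding window over credential retrieval timestamps."""

    def __init__(self, threshold=CREDENTIAL_BURST_THRESHOLD,
                 window=CREDENTIAL_BURST_WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.times = deque()

    def record(self, now: float) -> bool:
        """Record one retrieval; True when the window holds more than threshold."""
        self.times.append(now)
        # Evict retrievals that have aged out of the window.
        while self.times and now - self.times[0] >= self.window:
            self.times.popleft()
        return len(self.times) > self.threshold
```

With the defaults, the eleventh retrieval inside 60 seconds is the first one that returns True.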

credential_after_checkout

Agent called the proxy or retrieved a credential after its passport had already submitted a checkout. Severity: critical. The passport is expected to be idle from checkout until expiry; late calls indicate the agent escaped supervision.

delegation_without_intent

A parent passport delegated to a child without the parent's intent.will_delegate flag being set at issue time. In enforced mode the delegation is denied and this event is recorded as critical. In logged mode the delegation proceeds and the event is recorded as warning.

delegation_downgrade

A parent in enforced mode attempted to delegate to a child whose agent is in a looser accountability mode (logged or standard). Delegation cannot reduce accountability. The request throws DELEGATION_ACCOUNTABILITY_VIOLATION; the event is always critical.

checkpoint_silence

An enforced-mode passport passed its checkpoint_interval_seconds without a checkpoint submission. TTL does not extend; the passport expires at its current exp. Severity: warning. Auto-fires when the interval elapses (scanned by the worker).

expired_no_checkout

A passport reached expiry without ever submitting a checkout. Display name: Unchecked-Out Passport Detector. Recorded as warning at expiry time. Checkout is expected for any enforced or logged passport; a passport that silently dies is flagged so the operator can investigate.

upstream_not_found

Display name: Upstream 404 Detector. Informational signal raised when the upstream provider returned 404 on a proxied call. Often indicates a hallucinated or stale endpoint path - useful for spotting intents whose URL patterns are wrong, LLMs calling non-existent resources, or drift between agent code and provider API changes. Severity: info. Does not revoke.

prompt_injection

Display name: Prompt Injection Detector. Three-layer detector that scans proxied request bodies, skill invocation inputs, AND retrieved content (via POST /v1/scan) for prompt-injection attempts. Severity: critical for strong instruction-override or jailbreak matches; warning for suspicious phrasing. In enforced mode critical hits deny the request.

L1 — regex pattern catalog (instruction overrides, role-play breakouts, system-prompt extraction, jailbreak names, safety bypass). Cheap, fast, catches the canonical bulk.

L2 — encoding-aware normalization. Decodes the input through base64, URL, hex, leetspeak, unicode homoglyphs, zero-width chars, ROT13, separator-collapse, and reverse before re-running L1 against each variant. A match found via L2 carries an encoding field on the security event identifying which normalization revealed it.
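
The decode-then-rescan idea behind L2 can be sketched with a few of the listed normalizations. The single regex below is a stand-in for the L1 catalog, and the function names and the returned encoding labels are hypothetical; the real catalog and normalizer set are larger.

```python
import base64
import binascii
import codecs
import re
from urllib.parse import unquote

# Stand-in for the L1 catalog: one illustrative instruction-override pattern.
L1_PATTERN = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def _b64(text):
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def _hex(text):
    try:
        return bytes.fromhex(text).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None


NORMALIZERS = {
    "base64": _b64,
    "url": unquote,
    "hex": _hex,
    "rot13": lambda t: codecs.decode(t, "rot_13"),
    "reverse": lambda t: t[::-1],
    "zero_width": lambda t: re.sub("[\u200b\u200c\u200d\ufeff]", "", t),
}


def scan(text):
    """Return (matched, encoding); encoding is None for a plain L1 hit."""
    if L1_PATTERN.search(text):
        return True, None
    for name, normalize in NORMALIZERS.items():
        variant = normalize(text)
        # Re-run L1 only against variants that actually changed the input.
        if variant and variant != text and L1_PATTERN.search(variant):
            return True, name
    return False, None
```

The second element of the return value plays the role of the encoding field: it names the normalization that revealed the match.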

L3 — LLM-classifier funnel using Haiku 4.5 via OpenRouter. Catches the indirect-injection class L1+L2 cannot see: paraphrased overrides, authority impersonation, polite <INFORMATION>-wrapped attack payloads, multi-step semantic novelty, multilingual variants. Heuristic gate fires L3 only on body length ≥ 60 chars and non-critical L1+L2 verdict — keeps cost bounded. 3-second timeout with graceful degrade-to-L1+L2 on any failure (timeout, network, parse error); a detector_degraded event records when L3 isn't contributing so operators see the gap.
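
The gate and the degrade path can be sketched as below. The classifier argument is a hypothetical callable standing in for the LLM call and is assumed to return a severity string or None; how STACK actually merges the L3 verdict with L1+L2 is not stated here, so the merge shown is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

L3_MIN_BODY_CHARS = 60
L3_TIMEOUT_SECONDS = 3.0


def should_run_l3(body: str, l1_l2_severity) -> bool:
    """Gate: body is long enough and L1+L2 have not already said critical."""
    return len(body) >= L3_MIN_BODY_CHARS and l1_l2_severity != "critical"


def classify_with_degrade(body, l1_l2_severity, classifier,
                          timeout=L3_TIMEOUT_SECONDS):
    """Return (severity, degraded). degraded=True means L3 did not contribute."""
    if not should_run_l3(body, l1_l2_severity):
        return l1_l2_severity, False
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        verdict = pool.submit(classifier, body).result(timeout=timeout)
        # Assumption: a non-None L3 verdict replaces the L1+L2 verdict.
        return (verdict if verdict is not None else l1_l2_severity), False
    except Exception:
        # Timeout, network, or parse failure: fall back to the L1+L2 verdict.
        return l1_l2_severity, True
    finally:
        pool.shutdown(wait=False)
```

A caller would emit the detector_degraded event whenever the second element is True.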

Benchmark numbers on a 1087-sample corpus (deepset + curated + AgentDojo): L1 alone F1 0.43, L1+L2 0.49, full L1+L2+L3 chain F1 0.86, precision 0.98, recall 0.77. Per-call cost ~$0.0007 when L3 fires; ~50% gate rate in benchmark traffic.
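
As a quick consistency check, the full-chain F1 follows from the quoted precision and recall, and the gate rate gives a rough blended cost. The blended figure assumes L1 and L2 cost nothing per call, which is a simplification made only for this arithmetic.

```python
precision, recall = 0.98, 0.77
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

gate_rate = 0.50               # share of calls where L3 fires (benchmark traffic)
cost_when_l3_fires = 0.0007    # dollars per call when L3 runs
expected_cost_per_call = gate_rate * cost_when_l3_fires

print(f"F1 = {f1:.2f}")
print(f"blended L3 cost per scanned call = ${expected_cost_per_call:.5f}")
```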

L7 — operator customization. Studio+ operators can adjust how this detector behaves for their traffic via PUT /v1/detectors/prompt_injection/config: add domain-specific custom regex (custom_patterns), suppress false-positive matched_text values (whitelist_patterns), remap built-in pattern severities (severity_overrides), or master-disable the detector entirely (enabled: false). Config is Redis-cached with ~60s TTL so changes propagate to every API node within a minute. Reads are open to all tiers; writes require studio.
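
A config write might be assembled as below. Only the endpoint path and the four field names come from the text above; the value shapes (lists of strings, an empty mapping for severity_overrides), the placeholder key, and the helper name are assumptions, and the base URL is the one used in this page's curl example. The request is built but not sent.

```python
import json
import os
import urllib.request

API_BASE = "https://api.getstack.run"


def build_config_request(api_key: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT to the prompt_injection config endpoint."""
    return urllib.request.Request(
        f"{API_BASE}/v1/detectors/prompt_injection/config",
        data=json.dumps(config).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical payload: field names are documented, value shapes are guesses.
config = {
    "enabled": True,
    "custom_patterns": [r"(?i)reveal the internal runbook"],
    "whitelist_patterns": ["ignore previous invoices"],
    "severity_overrides": {},
}
request = build_config_request(
    os.environ.get("STACK_API_KEY", "placeholder-key"), config
)
# urllib.request.urlopen(request) would send it; omitted to avoid side effects.
```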

prompt_injection_output

Display name: Output Injection Detector. The L1-L5 chain catches injection content on the way IN. This detector (L6 of the prompt-injection defense stack) catches the agent's behavior on the way OUT — proxied response bodies and headers that look like a compromised assistant complying with an injection.

Pattern catalog covers three classes: assistant-compliance language ("Here is the system prompt you asked for…", "As you instructed, I will…"), system-prompt-leak template markers (<system>…</system>, [SYSTEM PROMPT] brackets), and credential-shape strings (AWS access keys, Stripe / OpenAI / GitHub / Slack tokens, PEM private keys, JWT triplets). Severity: critical for credential-shape leaks, warning for compliance and system-prompt language.
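
A reduced version of such a catalog could look like the sketch below. The regexes are illustrative approximations of a few well-known shapes, not the detector's actual patterns, and the pattern names are hypothetical.

```python
import re

# Credential-shape patterns (critical). Approximate, illustrative shapes only.
CREDENTIAL_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "pem_private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    "jwt_triplet": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
}

# Compliance and system-prompt-leak patterns (warning).
COMPLIANCE_PATTERNS = {
    "system_prompt_offer": re.compile(r"here is the system prompt", re.IGNORECASE),
    "as_instructed": re.compile(r"as you instructed, i will", re.IGNORECASE),
    "system_tag": re.compile(r"<system>.*?</system>", re.IGNORECASE | re.DOTALL),
}


def classify_output(body: str):
    """Return (severity, pattern_name) for the most severe match, else (None, None)."""
    for name, pattern in CREDENTIAL_PATTERNS.items():
        if pattern.search(body):
            return "critical", name
    for name, pattern in COMPLIANCE_PATTERNS.items():
        if pattern.search(body):
            return "warning", name
    return None, None
```

Credential shapes are checked first so that a body containing both a leaked key and compliance language is reported at the higher severity.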

Enforcement: in logged or standard mode, the event is recorded and the response is forwarded unchanged. In enforced mode, a critical match REDACTS the response body before returning to the agent (status code and headers preserved) so a leaked secret never reaches the agent runtime. The detector deliberately does NOT auto-revoke — a leaked credential in an upstream response is suspicious, but does not necessarily mean this passport's agent did something wrong; the upstream might just be misbehaving. Operators decide via the security event review.
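
The enforcement rule reduces to one branch, sketched here with a hypothetical response type. The placeholder text written into the redacted body is an assumption; the source only says the body is redacted while status code and headers are preserved.

```python
from dataclasses import dataclass, replace

REDACTED_BODY = "[REDACTED]"  # placeholder text is an assumption


@dataclass(frozen=True)
class UpstreamResponse:
    """Hypothetical container for a proxied upstream response."""
    status: int
    headers: dict
    body: str


def apply_output_enforcement(response: UpstreamResponse, mode: str, severity):
    """Redact the body only for a critical match in enforced mode; never revoke."""
    if mode == "enforced" and severity == "critical":
        return replace(response, body=REDACTED_BODY)  # status and headers kept
    return response
```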

behavior_drift

Display name: Behavior Drift Detector. Per-passport cross-call statistical signal (L6, second component). Maintains running mean and standard deviation of two dimensions across every proxy call within a session: body length (bytes) and Shannon entropy (bits/char, captures format change like a sudden switch from prose to base64). Fires a behavior_drift security event when the current call exceeds 3σ on either dimension.

Cold-start gate: requires ≥5 prior observations before any z-score is trusted — otherwise the first call always looks like an outlier (mean = its own value, σ = 0). Zero-variance fallback: when a session's baseline is byte-identical (common for agents looping over a single template) the standard z-test divides by zero, so a deviation ≥ 3× mean is treated as 3σ-equivalent. Catches the case where a successful EARLIER injection silently steered the agent and subsequent traffic looks materially different from baseline. Severity: warning. Does not auto-revoke.
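
The statistics described above can be sketched with a Welford-style running accumulator. Several details are assumptions made for the sketch: population (not sample) standard deviation, "deviation ≥ 3× mean" read as an absolute difference of at least three times the mean, and each call being checked against the baseline before it is folded in. The class and function names are hypothetical.

```python
import math
from collections import Counter

MIN_OBSERVATIONS = 5
SIGMA_LIMIT = 3.0


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


class RunningStat:
    """Welford running mean and variance for one dimension."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def is_outlier(self, x: float) -> bool:
        if self.n < MIN_OBSERVATIONS:
            return False  # cold-start gate: baseline not trusted yet
        std = math.sqrt(self.m2 / self.n)
        if std == 0.0:
            # Zero-variance fallback: compare the deviation against 3x the mean.
            return self.mean > 0 and abs(x - self.mean) >= SIGMA_LIMIT * self.mean
        return abs(x - self.mean) / std > SIGMA_LIMIT

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def observe(length_stat: RunningStat, entropy_stat: RunningStat, body: str) -> bool:
    """Check one body against the baseline, then fold it in. True means fire."""
    length = len(body.encode("utf-8"))
    entropy = shannon_entropy(body)
    fired = length_stat.is_outlier(length) or entropy_stat.is_outlier(entropy)
    length_stat.update(length)
    entropy_stat.update(entropy)
    return fired
```

With a baseline of identical 11-byte bodies, a sudden 100-byte body deviates by 89 bytes, which exceeds three times the mean, so the fallback treats it as an outlier.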

unauthorized_skill_access

Agent attempted to invoke a skill that its skill_access_mode does not permit. Severity: warning in logged mode, critical in enforced. Agents in custom mode have an explicit allowlist; agents in none mode reject all skills.

Post-hoc review flag catalog

Flags evaluated at checkout time. They do not revoke; they shape the operator's review queue. A checkout with any critical flag is blocked pending operator decision; one or more warning flags yields flagged; no flags yields clean.

  • undeclared_service - agent used a service that was not in intent_services
  • undeclared_delegation - mission involved delegations the passport intent did not disclose
  • duration_exceeded - mission ran longer than 2× estimated_duration_seconds
  • no_checkpoints - enforced-mode mission submitted zero checkpoints before checkout
  • checkpoint_gap - unusually long gap between consecutive checkpoints
  • high_action_volume - actions_count exceeded the rolling baseline for this agent
  • scope_escalation_pattern - successive missions for this agent show monotonically growing intent_services
  • credential_unreported - credential retrieved during the mission but never reported in a checkpoint
  • intent_deviation - LLM-graded distance between declared intent and observed behavior ≥ 0.40
  • behavioral_anomaly - actions / services / duration exceed 3σ or 3× mean vs the last 20 checkouts

intent_deviation uses an LLM (gpt-4o-mini via OpenRouter) with a 10s timeout. If the call fails or times out the flag is skipped rather than blocking checkout.
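
The bucketing rule and the skip-on-failure behavior of intent_deviation can be sketched together. Which severity each flag carries is not listed here, so the bucketing function takes severities as input; treating anything other than critical or warning as clean is an assumption, and grade_distance is a hypothetical callable wrapping the LLM grader.

```python
INTENT_DEVIATION_THRESHOLD = 0.40


def bucket_checkout(flag_severities) -> str:
    """Map the severities of raised flags to clean, flagged, or blocked."""
    severities = set(flag_severities)
    if "critical" in severities:
        return "blocked"
    if "warning" in severities:
        return "flagged"
    return "clean"


def intent_deviation_flag(grade_distance):
    """Return the flag name when the graded distance crosses the threshold.

    Any failure or timeout in the grader skips the flag instead of
    blocking the checkout.
    """
    try:
        distance = grade_distance()
    except Exception:
        return None
    return "intent_deviation" if distance >= INTENT_DEVIATION_THRESHOLD else None
```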

Auto-revoke in enforced mode

Enforced mode binds detector fires to passport revocation. When one of the auto-revoking real-time detectors fires on an enforced passport, STACK calls revoke() with a reason of the form <detector>_fire: <message>. Revocation cascades to delegated children and broadcasts on the revocation channel. See /docs/concepts/revocation.

Logged and standard modes never auto-revoke from detector fires. Those modes surface the signal only; revocation remains an operator or review decision.
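
The reason format and the cascade can be pictured with the sketch below. It is not STACK's revoke(): the in-memory children map, the function names, and the choice to give children the same reason as the parent are all assumptions for illustration.

```python
def revoke_reason(detector: str, message: str) -> str:
    """Build a reason string of the form <detector>_fire: <message>."""
    return f"{detector}_fire: {message}"


def cascade_revoke(passport_id, children, reason, revoked=None):
    """Revoke a passport and every delegated descendant.

    children maps a passport id to the ids it delegated to (hypothetical shape).
    Returns a dict of revoked passport id -> reason.
    """
    revoked = {} if revoked is None else revoked
    if passport_id in revoked:
        return revoked  # already handled; guards against cycles
    revoked[passport_id] = reason
    for child in children.get(passport_id, []):
        cascade_revoke(child, children, reason, revoked)
    return revoked
```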

Surface

Detectors have no dedicated endpoint - they run inline with the proxy, credentials, delegation, checkpoint, and checkout paths. The visible surface is through security events and the review queue.

  • GET /v1/security-events - paginated list of unresolved real-time signals
  • POST /v1/security-events/:id/resolve - acknowledge and clear
  • GET /v1/passports/reviews - checkouts that flagged or blocked on post-hoc rules
  • POST /v1/passports/reviews/:checkoutId/decide - approve or block (block can also block the agent from future passports)

Example: reading a fired signal

```bash
curl -s https://api.getstack.run/v1/security-events \
  -H "Authorization: Bearer $STACK_API_KEY" \
  | jq '.events[0]'
```

```json
{
  "id": "sevt_9fA2…",
  "operator_id": "op_acme",
  "agent_id": "agt_support_bot",
  "signal_type": "scope_drift",
  "severity": "warning",
  "message": "Scope drift: Cumulative drift 2.12 crossed threshold",
  "context": {
    "service": "notion",
    "delta_units": 0.5,
    "cumulative_units": 2.12,
    "in_intent": false,
    "in_granted": true,
    "novel_path": true
  },
  "timestamp": 1747913532614,
  "resolved": false
}
```
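
A client might summarize such an event as sketched below. The sketch assumes the timestamp field is Unix epoch milliseconds, which its 13-digit length suggests but the page does not state; the event literal is a trimmed copy of the response above.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the example event; only the fields used below are kept.
event = json.loads("""
{
  "signal_type": "scope_drift",
  "severity": "warning",
  "message": "Scope drift: Cumulative drift 2.12 crossed threshold",
  "timestamp": 1747913532614,
  "resolved": false
}
""")

# Assumption: timestamp is milliseconds since the Unix epoch.
fired_at = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)

summary = (
    f"{fired_at.isoformat(timespec='seconds')} "
    f"{event['severity'].upper()} {event['signal_type']}: {event['message']}"
)
print(summary)
```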