
Your system prompt is already your policy.
We extract it, enforce it, and watch every trace.

Aiyara is agent failure monitoring. We catch structural failures in tool-calling agents — deterministically, not with another LLM.


Evaluation: Deterministic. No LLM at verdict time.
Source of truth: Your system prompt + tool schemas.
Setup: ~5 minutes. 3 lines of SDK.
Fits alongside: Langfuse, Braintrust, Datadog.

01 — The problem

The shadow policy problem

Your team has rules for how your agent should behave. Some live in a product spec. Some in a Notion page. Some in a Slack thread. Eventually an engineer translates them into a system prompt — the block of natural language the model reads before every conversation.

That prompt is now the single source of truth for your agent's behavior. Six months later, three engineers have modified it, two new tool integrations added implicit expectations, a compliance ask from legal never made it in, and a PM changed the escalation policy in Notion without updating the prompt. The actual policy is now scattered across prompts, docs, Slack, email, and people's heads.

When something breaks, the first question — was the agent supposed to do that? — takes thirty minutes to answer because nobody can point to a single source of truth. And even when the prompt is correct, most teams have no systematic way to check whether the agent actually follows it. The prompt says “always verify identity before processing a refund.” Did the agent do that on the 12,000 traces it handled last week? Without automated monitoring, nobody knows.

02 — How it works

Extract. Monitor. Explain.

Three stages. The prompt you already have is the input. The output is a deterministic verdict on every trace, with the line it came from.

01
Paste the prompt. Rules appear.
Aiyara reads your system prompt and tool schemas and auto-extracts the enforceable rules. Each rule maps back to the sentence it came from. Green means high confidence; yellow needs your review; gray means recognized but not auto-enforceable (tone, style).
Extraction is the only stage of Aiyara that involves an LLM. You review its output before any rule becomes a check.

SYSTEM PROMPT · retail_agent.md · v2 → v3

Before (v2):
  53  ## Exchange Policy
  54
  55  Exchanges are allowed for delivered orders of
  56  the same product type.
  57  Verify the order status before processing.

After (v3):
  55  Exchanges are ONLY allowed when the order
  56  status is “delivered”.
  57  The replacement item MUST be the same product
  58  type as the original.
  59  Verify both conditions before calling
  60  exchange_delivered_order_items.

EXTRACTED · 0.3s
HIGH CONFIDENCE · Argument Constraint
Exchanges require delivered status and matching product type
If(exchange, Requires(status == "delivered", same_type))
From lines 55–57 · Enabled

UPDATED · v2 → v3 · just now
Argument Constraint
Exchanges require delivered status and matching product type
Old: If(exchange, Requires(status == "delivered"))
New: If(exchange, Requires(status == "delivered", same_type))
From lines 55–60 · Re-evaluated 247 traces

VIOLATED · trace tr_00041 · Step 4/6
exchange_delivered_order_items{ order_id: 'W5061109', status: 'pending' }
WHY: Called on order with status='pending'. Rule requires delivered.
↑ Pinned to line 56 · exchange_delivered_status

02
Edit the prompt. Rules update.
When the prompt changes, the extraction view comes back — showing which rules were added, removed, or modified. If your Tuesday 3pm prompt update changes two rules and introduces violations by 5pm, you see the correlation. Think of it as CI for your agent's policy.
The prompt is your source of truth. The rules don't drift from it because they're re-derived from it.
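
To make the "added, removed, or modified" view concrete, here is a minimal sketch of diffing two extracted rule sets. It assumes each rule set is a plain mapping from a rule id to its compiled expression. That data shape and the `diff_rules` helper are hypothetical and not part of the Aiyara SDK. The rule id and the two expressions are taken from the exchange-policy example.

```python
def diff_rules(old: dict, new: dict) -> dict:
    """Compare two rule sets keyed by rule id (hypothetical data shape)."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(rule_id for rule_id in shared if old[rule_id] != new[rule_id]),
    }


# Expressions copied from the exchange-policy example (v2 and v3 of the prompt).
rules_v2 = {
    "exchange_delivered_status": 'If(exchange, Requires(status == "delivered"))',
}
rules_v3 = {
    "exchange_delivered_status": 'If(exchange, Requires(status == "delivered", same_type))',
}

changes = diff_rules(rules_v2, rules_v3)
print(changes)
```

Because the rules are re-derived from the prompt on every change, a diff like this can be computed per prompt version and lined up against violation counts over time.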
03
Violations link back to the line.
Every trace is evaluated against your confirmed rules plus a set of pre-built harness checks. When a constraint fails, Aiyara pins the violation to the exact sentence in the prompt, the step in the trace, and the argument that broke it. Cascade attribution traces downstream failures back to their upstream cause.
Same trace in, same verdict out. Every time.
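
As an illustration of what "same trace in, same verdict out" can look like, the sketch below evaluates the delivered-status half of the exchange rule with plain Python and no model call. The `ToolCall` and `Violation` types and the `check_exchange_status` function are hypothetical names chosen for this example, not Aiyara's engine. The trace step, rule id, and prompt line come from the exchange-policy example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    step: int   # 1-based position in the trace
    tool: str
    args: dict


@dataclass(frozen=True)
class Violation:
    rule_id: str
    step: int
    prompt_line: int
    why: str


def check_exchange_status(trace):
    """Flag every exchange call whose order status is not 'delivered'."""
    violations = []
    for call in trace:
        if call.tool != "exchange_delivered_order_items":
            continue
        status = call.args.get("status")
        if status != "delivered":
            violations.append(Violation(
                rule_id="exchange_delivered_status",
                step=call.step,
                prompt_line=56,  # the prompt line the rule is pinned to
                why=f"Called on order with status={status!r}. Rule requires delivered.",
            ))
    return violations


# Step 4 of trace tr_00041; the other five steps are omitted here.
trace = [
    ToolCall(4, "exchange_delivered_order_items",
             {"order_id": "W5061109", "status": "pending"}),
]

first = check_exchange_status(trace)
second = check_exchange_status(trace)
print(first == second)  # a pure function of the trace gives the same verdict each run
```

The check is a pure function of the trace, which is what makes the verdict reproducible.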

03 — A concrete example

Tuesday, 2:47pm. One skipped step. One flagged trace.

Your customer service agent has this rule in its system prompt:

“Always verify the customer's identity by calling get_customer_details before processing any refund with process_refund.”

On Tuesday at 2:47pm, the agent handles a refund request:

user     → "I need a refund for order #4821"
agent    → process_refund(order_id="4821", amount=89.99)   ← violation
agent    → "Your refund of $89.99 has been processed."

The agent skipped get_customer_details entirely and went straight to process_refund. Aiyara flags this as a violation of the verify-identity-before-refund constraint, links it to the exact line in the system prompt, and surfaces it on the dashboard.
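
A tool-ordering check like this one reduces to a single pass over the tool names in the trace. The following is a minimal sketch, assuming a trace can be flattened to an ordered list of tool names. The `missing_prerequisite` helper is hypothetical and only illustrates the idea.

```python
def missing_prerequisite(tool_calls, required, action):
    """Return the indices of `action` calls made before any `required` call."""
    seen_required = False
    flagged = []
    for index, name in enumerate(tool_calls):
        if name == required:
            seen_required = True
        elif name == action and not seen_required:
            flagged.append(index)
    return flagged


# Tuesday's trace: the agent went straight to process_refund.
tuesday_trace = ["process_refund"]
# What the prompt asks for: verify identity first.
compliant_trace = ["get_customer_details", "process_refund"]

print(missing_prerequisite(tuesday_trace, "get_customer_details", "process_refund"))
print(missing_prerequisite(compliant_trace, "get_customer_details", "process_refund"))
```

This sketch treats one identity check as covering every later refund in the same trace. Whether verification should be required per refund is a policy decision that the prompt wording would have to settle.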

Now multiply this by every trace, every rule, every day. The PM sees: verify identity before refund — 97.3% compliant this week, down from 99.8% after Tuesday's prompt change. 34 violations since Tuesday, 12 linked to a downstream cascade. The engineer sees the same data with technical trace detail.

04 — Coverage

What we catch. What we don't.

Not every rule in a system prompt can be deterministically enforced. We're explicit about the line.

Rule type	Example	Enforceable?	Why
Tool ordering	“Verify identity before processing refund”	Yes	Tool call sequence is visible in every trace
Required steps	“Always check order status before modifying”	Yes	We detect when expected tools are skipped
Argument constraints	“Refund amount must not exceed order total”	Yes	Tool arguments are compared against prior results
Rate limits	“Maximum 1 handoff per conversation”	Yes	Tool calls are counted per trace
Escalation rules	“Transfer to human after 3 failed attempts”	Yes	Failure counts and escalations are tracked
Doom loops	Agent repeating the same action indefinitely	Yes	Pre-built detector, zero config
Tone / style	“Be professional and empathetic”	No	Requires subjective judgment — different tool
Factual accuracy	“Don't make up information”	Partial	We catch skipped tools; we can't verify free-text claims
Topic boundaries	“Only discuss company products”	No	Requires content classification

The rules we enforce are the ones attached to actions — tool calls, arguments, sequences. These are also the rules attached to the most expensive failures.
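
Two rows of the table, argument constraints and rate limits, can be sketched in a few lines. This is an illustration under stated assumptions: the `Step` type, the `get_order_details` and `transfer_to_human` tool names, the `total` result field, and the sample values are all made up for the example. Only `process_refund` with `order_id` and `amount` appears in the copy above.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict = field(default_factory=dict)
    result: dict = field(default_factory=dict)


def refunds_over_total(trace):
    """Indices of process_refund calls whose amount exceeds a previously seen order total."""
    totals = {}  # order_id -> total reported by an earlier lookup
    flagged = []
    for index, step in enumerate(trace):
        if step.tool == "get_order_details":      # hypothetical lookup tool
            totals[step.args["order_id"]] = step.result["total"]
        elif step.tool == "process_refund":
            total = totals.get(step.args["order_id"])
            if total is not None and step.args["amount"] > total:
                flagged.append(index)
    return flagged


def exceeds_call_limit(trace, tool, limit):
    """True when `tool` is called more than `limit` times in one trace."""
    return sum(1 for step in trace if step.tool == tool) > limit


# Made-up trace for illustration only.
trace = [
    Step("get_order_details", {"order_id": "A1"}, {"total": 40.0}),
    Step("process_refund", {"order_id": "A1", "amount": 55.0}),
    Step("transfer_to_human"),
    Step("transfer_to_human"),
]

print(refunds_over_total(trace))
print(exceeds_call_limit(trace, "transfer_to_human", limit=1))
```

If the order total was never looked up, this sketch stays silent on the amount. A separate required-step rule would be the place to catch the missing lookup.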

05 — Harness checks

Four failure modes we catch with zero configuration.

Pre-built detectors that work on any agent trace. No prompt, no schema, no setup.

Doom loops

Same tool called with identical arguments, repeatedly. Burns tokens, never completes.
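
One way to detect this pattern, sketched here under assumptions: represent each call as a `(tool, args)` pair, canonicalize the arguments, and measure the longest run of identical consecutive calls. The threshold of 3 and the function names are illustrative choices, not Aiyara's detector.

```python
import json


def longest_identical_run(calls):
    """Length of the longest run of consecutive calls with the same tool and arguments."""
    longest = 0
    run = 0
    previous = None
    for tool, args in calls:
        # sort_keys gives a canonical string, so key order in args does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        run = run + 1 if key == previous else 1
        longest = max(longest, run)
        previous = key
    return longest


def is_doom_loop(calls, threshold=3):
    """Assumed definition: `threshold` or more identical calls in a row."""
    return longest_identical_run(calls) >= threshold


looping = [("get_order_details", {"order_id": "A1"})] * 4
healthy = [("get_order_details", {"order_id": "A1"}),
           ("get_order_details", {"order_id": "A2"})]

print(is_doom_loop(looping), is_doom_loop(healthy))
```

The sketch needs JSON-serializable arguments and only looks at consecutive repeats. Loops that alternate between two calls would need a wider window.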

Context overflow

Token utilization approaches the model limit. System prompt truncated. Instructions silently disappear.

Retry amplification

A write operation executed twice. Double charges, duplicate orders, corrupted state.
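
A rough sketch of the idea, assuming the caller can say which tools mutate state: count write calls by tool and canonicalized arguments, and report anything seen more than once. The `WRITE_TOOLS` set and `duplicate_writes` helper are hypothetical names for this example.

```python
import json
from collections import Counter

# Assumption: the set of state-changing tools is supplied by the caller.
WRITE_TOOLS = {"process_refund"}


def duplicate_writes(calls, write_tools=WRITE_TOOLS):
    """Return (tool, args) pairs for write calls that appear more than once in a trace."""
    counts = Counter(
        (tool, json.dumps(args, sort_keys=True))
        for tool, args in calls
        if tool in write_tools
    )
    return [(tool, json.loads(args)) for (tool, args), n in counts.items() if n > 1]


calls = [
    ("get_customer_details", {"customer_id": "c1"}),
    ("process_refund", {"order_id": "4821", "amount": 89.99}),
    ("get_customer_details", {"customer_id": "c1"}),   # repeated read: ignored
    ("process_refund", {"amount": 89.99, "order_id": "4821"}),  # same write again
]

print(duplicate_writes(calls))
```

Repeated reads are ignored on purpose. A real detector would also need a way to tell an accidental retry from two intentional, identical writes, for example an idempotency key.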

Error cascades

One failed tool call triggers three downstream failures. Symptoms are loud; root cause is buried.

06 — Stack fit

We sit alongside what you have. Not in place of it.

Most teams running agents in production already have some combination of trace logging, custom evaluators, and maybe LLM-as-judge scoring. Keep them. Aiyara is a layer of deterministic constraint monitoring and failure intelligence on top.

Langfuse, LangSmith, Braintrust answer what happened. They store the trace and let you build evaluators. They don't answer whether it should have happened — they don't understand the agent's policy. LLM-as-judge evaluates the output; we evaluate the execution. Pattern-based firewalls operate on individual inputs and outputs; cross-tool-call temporal constraints (“verify before refund”) are structurally outside their scope.

Send your traces to Langfuse, Datadog, or your own store. Point Aiyara at the same stream, and get per-constraint verdicts, cascade attribution, and failure clustering on top.

07 — Onboarding

Five minutes to live.

01
Paste your system prompt and tool schemas
No YAML, no DSL to learn. The doc you already have.
~30s
02
Auto-extracted constraints appear side-by-side
Each rule is mapped to the sentence it came from, with a confidence signal.
instant
03
Review: green, yellow, gray
Confirm what looks right, adjust what doesn't, skip what's not auto-enforceable. We explain the gray cases.
3–5 min
04
Toggle on pre-built harness checks
Doom loops, context overflow, retry amplification, error cascades. Zero config.
~30s
05
Install the SDK (3 lines)
Dashboard goes live. Every trace is monitored against confirmed rules.
~30s

pip install aiyara

import aiyara
aiyara.init(prompt="your system prompt here")

# That's it. Rules extracted. Traces monitored.

08 — Where we are

Honestly, the state of the product.

Aiyara is pre-launch. The core engine is validated on public benchmarks. The full product is being built around it.

Component	Status
Constraint extraction + DSL compiler	Working (validated on tau2-bench)
Deterministic evaluation engine	Working (0% false positives on compiled rules)
Pre-built harness checks	Building (designed, in implementation)
Python SDK	In progress (trace model + transport shipped)
Side-by-side extraction review UI	In progress
Monitoring dashboard	In progress
Hosted cloud infrastructure	In progress (ingestion, storage, API)

What we want from early partners: validation that structural agent failures are a real problem for your team, feedback on the auto-extraction approach, and a read on your security requirements so we prioritize the right things.

09 — Validation

What we've measured so far.

Developed and tested against tau2-bench — a public benchmark of 114 tool-calling agent scenarios across retail and telecom.

100% corrected precision. Every violation we flag is a real policy violation. Zero false positives on compiled constraints.
Deterministic evaluation. Same trace in, same verdict out. No model calls at eval time.
Full pipeline, end to end in 97 seconds. Constraint extraction through evaluation. 84% agreement with benchmark labels — and 36 violations caught that the benchmark's own evaluator misses.
Composable DSL. Constraints compile to a small predicate grammar (If/Then, Requires, Before, ForAll) rather than rigid templates. It handles multi-tool constraints and cross-step dependencies that template-based approaches can't express.
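
To show why a predicate grammar composes where fixed templates do not, here is a toy set of combinators in Python. The semantics are assumptions made for this sketch and are not Aiyara's actual DSL: a trace is a list of `{"tool", "args"}` dicts, each combinator returns a function from trace to bool, and `Calls` is a helper added for the example.

```python
from typing import Callable

Predicate = Callable[[list], bool]  # a trace is a list of {"tool": str, "args": dict}

# Names are capitalized to mirror the grammar keywords, not PEP 8.


def Calls(tool: str) -> Predicate:
    """True if the trace contains at least one call to `tool` (hypothetical helper)."""
    return lambda trace: any(call["tool"] == tool for call in trace)


def Before(first: str, then: str) -> Predicate:
    """True if every call to `then` comes after some call to `first`."""
    def check(trace):
        seen = False
        for call in trace:
            if call["tool"] == first:
                seen = True
            elif call["tool"] == then and not seen:
                return False
        return True
    return check


def ForAll(tool: str, arg_ok: Callable[[dict], bool]) -> Predicate:
    """True if every call to `tool` has arguments accepted by `arg_ok`."""
    return lambda trace: all(arg_ok(c["args"]) for c in trace if c["tool"] == tool)


def If(condition: Predicate, then: Predicate) -> Predicate:
    """Implication: when `condition` holds, `then` must hold too."""
    return lambda trace: (not condition(trace)) or then(trace)


def Requires(*predicates: Predicate) -> Predicate:
    """Conjunction of all given predicates."""
    return lambda trace: all(p(trace) for p in predicates)


policy = Requires(
    If(Calls("process_refund"), Before("get_customer_details", "process_refund")),
    ForAll("exchange_delivered_order_items",
           lambda args: args.get("status") == "delivered"),
)

skipped = [{"tool": "process_refund", "args": {"order_id": "4821", "amount": 89.99}}]
verified = [{"tool": "get_customer_details", "args": {}}] + skipped
print(policy(skipped), policy(verified))
```

Because each combinator returns an ordinary predicate, an ordering rule and an argument rule can be nested or combined without defining a new template for every combination.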

The 84% agreement reflects real gaps in our extraction pipeline, not cherry-picked scenarios. Closing them is our core research bet. More in The Translation Bottleneck.

If this diagnosis matches how you're thinking about your own agents, we'd like to hear from you.

