Your system prompt is already your policy.
We extract it, enforce it, and watch every trace.
Aiyara is agent failure monitoring. We catch structural failures in tool-calling agents — deterministically, not with another LLM.
- Evaluation: Deterministic. No LLM at verdict time.
- Source of truth: Your system prompt + tool schemas.
- Setup: ~5 minutes. 3 lines of SDK.
- Fits alongside: Langfuse, Braintrust, Datadog.
01 — The problem
The shadow policy problem
Your team has rules for how your agent should behave. Some live in a product spec. Some in a Notion page. Some in a Slack thread. Eventually an engineer translates them into a system prompt — the block of natural language the model reads before every conversation.
That prompt is now the single source of truth for your agent's behavior. Six months later, three engineers have modified it, two new tool integrations added implicit expectations, a compliance ask from legal never made it in, and a PM changed the escalation policy in Notion without updating the prompt. The actual policy is now scattered across prompts, docs, Slack, email, and people's heads.
When something breaks, the first question — was the agent supposed to do that? — takes thirty minutes to answer because nobody can point to a single source of truth. And even when the prompt is correct, most teams have no systematic way to check whether the agent actually follows it. The prompt says “always verify identity before processing a refund.” Did the agent do that on the 12,000 traces it handled last week? Without automated monitoring, nobody knows.
02 — How it works
Extract. Monitor. Explain.
Three stages. The prompt you already have is the input. The output is a deterministic verdict on every trace, with the line it came from.
01
Paste the prompt. Rules appear.
Aiyara reads your system prompt and tool schemas and auto-extracts the enforceable rules. Each rule maps back to the sentence it came from. Green means high confidence; yellow needs your review; gray means recognized but not auto-enforceable (tone, style).
The extraction pipeline is the only part of Aiyara that involves an LLM. You review its output before any rule becomes a check.
02
Edit the prompt. Rules update.
When the prompt changes, the extraction view comes back — showing which rules were added, removed, or modified. If your Tuesday 3pm prompt update changes two rules and introduces violations by 5pm, you see the correlation. Think of it as CI for your agent's policy.
The prompt is your source of truth. The rules don't drift from it because they're re-derived from it.
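The added/removed/modified view described above amounts to a diff between two sets of extracted rules. Here is a minimal sketch of that comparison, assuming each rule carries a stable identifier; the `Rule` record, its fields, and `diff_rules` are hypothetical names for illustration, not Aiyara's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str  # hypothetical stable identifier for the rule
    text: str     # normalized statement of the rule
    source: str   # prompt sentence the rule was derived from


def diff_rules(old, new):
    """Compare two extracted rule sets and report what changed."""
    old_by_id = {r.rule_id: r for r in old}
    new_by_id = {r.rule_id: r for r in new}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "removed": sorted(old_by_id.keys() - new_by_id.keys()),
        "modified": sorted(k for k in shared if old_by_id[k] != new_by_id[k]),
    }


# Illustrative rule sets before and after a prompt edit.
before = [
    Rule("verify-before-refund", "get_customer_details before process_refund", "s1"),
    Rule("max-one-handoff", "at most 1 handoff per conversation", "s2"),
]
after = [
    Rule("verify-before-refund", "get_customer_details before any process_refund", "s1"),
    Rule("escalate-after-3", "transfer to human after 3 failed attempts", "s3"),
]
changes = diff_rules(before, after)
```

Keying on a stable identifier is what lets a reworded rule show up as modified rather than as one removal plus one addition.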
03
Violations link back to the line.
Every trace is evaluated against your confirmed rules plus a set of pre-built harness checks. When a constraint fails, Aiyara pins the violation to the exact sentence in the prompt, the step in the trace, and the argument that broke it. Cascade attribution traces downstream failures back to their upstream cause.
Same trace in, same verdict out. Every time.
03 — A concrete example
Tuesday, 2:47pm. One skipped step. One flagged trace.
Your customer service agent has this rule in its system prompt:
“Always verify the customer's identity by calling get_customer_details before processing any refund with process_refund.”
On Tuesday at 2:47pm, the agent handles a refund request:
user → "I need a refund for order #4821"
agent → process_refund(order_id="4821", amount=89.99) ← violation
agent → "Your refund of $89.99 has been processed."
The agent skipped get_customer_details entirely and went straight to process_refund. Aiyara flags this as a violation of the verify-identity-before-refund constraint, links it to the exact line in the system prompt, and surfaces it on the dashboard.
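A tool-ordering rule like this one can be checked with a single pass over the trace. The sketch below is an illustration under assumptions, not Aiyara's engine: the trace is modeled as a list of dicts with a `tool` key, and `check_before` is a hypothetical helper.

```python
def check_before(trace, required, target):
    """Return the indices of `target` calls that no `required` call precedes."""
    seen_required = False
    violations = []
    for index, step in enumerate(trace):
        if step["tool"] == required:
            seen_required = True
        elif step["tool"] == target and not seen_required:
            violations.append(index)
    return violations


# The Tuesday 2:47pm trace, reduced to its tool calls (assumed shape).
trace = [
    {"tool": "process_refund", "args": {"order_id": "4821", "amount": 89.99}},
]
violations = check_before(trace, "get_customer_details", "process_refund")
```

Because the check reads only the recorded tool calls, running it twice on the same trace yields the same result.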
Now multiply this by every trace, every rule, every day. The PM sees: verify identity before refund — 97.3% compliant this week, down from 99.8% after Tuesday's prompt change. 34 violations since Tuesday, 12 linked to a downstream cascade. The engineer sees the same data with technical trace detail.
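Per-rule compliance figures of this kind are a simple aggregation over individual verdicts. A minimal sketch, assuming verdicts arrive as `(rule_id, passed)` pairs; that shape and the `compliance_by_rule` name are illustrative assumptions, and the sample data is made up for the example.

```python
from collections import defaultdict


def compliance_by_rule(verdicts):
    """Aggregate (rule_id, passed) pairs into a pass rate per rule."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for rule_id, passed in verdicts:
        totals[rule_id] += 1
        if passed:
            passes[rule_id] += 1
    return {rule_id: passes[rule_id] / totals[rule_id] for rule_id in totals}


# Four illustrative verdicts for one rule: three passes, one violation.
verdicts = [("verify-identity-before-refund", True)] * 3
verdicts.append(("verify-identity-before-refund", False))
rates = compliance_by_rule(verdicts)
```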
04 — Coverage
What we catch. What we don't.
Not every rule in a system prompt can be deterministically enforced. We're explicit about the line.
| Rule type | Example | Enforceable? | Why |
|---|---|---|---|
| Tool ordering | “Verify identity before processing refund” | Yes | Tool call sequence is visible in every trace |
| Required steps | “Always check order status before modifying” | Yes | We detect when expected tools are skipped |
| Argument constraints | “Refund amount must not exceed order total” | Yes | Tool arguments are compared against prior results |
| Rate limits | “Maximum 1 handoff per conversation” | Yes | Tool calls are counted per trace |
| Escalation rules | “Transfer to human after 3 failed attempts” | Yes | Failure counts and escalations are tracked |
| Doom loops | Agent repeating the same action indefinitely | Yes | Pre-built detector, zero config |
| Tone / style | “Be professional and empathetic” | No | Requires subjective judgment — different tool |
| Factual accuracy | “Don't make up information” | Partial | We catch skipped tools; we can't verify free-text claims |
| Topic boundaries | “Only discuss company products” | No | Requires content classification |
The rules we enforce are the ones attached to actions — tool calls, arguments, sequences. These are also the rules attached to the most expensive failures.
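As an illustration of the argument-constraint row, the sketch below compares a refund amount against a total returned by an earlier tool call. The lookup tool `get_order_details`, the `result` field, and the trace shape are all assumptions made for the example; treating a refund with no known order total as a violation is also a design choice of this sketch.

```python
def check_refund_within_total(trace):
    """Return indices of process_refund calls whose amount exceeds the order total."""
    order_totals = {}
    violations = []
    for index, step in enumerate(trace):
        if step["tool"] == "get_order_details":  # hypothetical lookup tool
            order_totals[step["args"]["order_id"]] = step["result"]["total"]
        elif step["tool"] == "process_refund":
            total = order_totals.get(step["args"]["order_id"])
            # No prior total on record means the amount cannot be justified.
            if total is None or step["args"]["amount"] > total:
                violations.append(index)
    return violations


over_refund = [
    {"tool": "get_order_details", "args": {"order_id": "4821"}, "result": {"total": 50.00}},
    {"tool": "process_refund", "args": {"order_id": "4821", "amount": 89.99}},
]
flagged = check_refund_within_total(over_refund)
```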
05 — Harness checks
Four failure modes we catch with zero configuration.
Pre-built detectors that work on any agent trace. No prompt, no schema, no setup.
Doom loops
Same tool called with identical arguments, repeatedly. Burns tokens, never completes.
Context overflow
Token utilization approaches the model limit. System prompt truncated. Instructions silently disappear.
Retry amplification
A write operation executed twice. Double charges, duplicate orders, corrupted state.
Error cascades
One failed tool call triggers three downstream failures. Symptoms are loud; root cause is buried.
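Two of these detectors reduce to comparing tool calls for equality. The sketch below shows one way to do that, assuming a trace of dicts with `tool` and JSON-serializable `args`; the function names, the consecutive-run threshold of 3, and the caller-supplied set of write tools are assumptions for illustration, not the shipped detectors.

```python
import json


def call_key(step):
    """Canonical key for a tool call: tool name plus order-independent arguments."""
    return (step["tool"], json.dumps(step["args"], sort_keys=True))


def detect_doom_loop(trace, threshold=3):
    """True if an identical call occurs `threshold` or more times in a row."""
    run_length = 0
    previous = None
    for step in trace:
        key = call_key(step)
        run_length = run_length + 1 if key == previous else 1
        previous = key
        if run_length >= threshold:
            return True
    return False


def detect_duplicate_writes(trace, write_tools):
    """Return indices of write calls that repeat an earlier identical write."""
    seen = set()
    duplicates = []
    for index, step in enumerate(trace):
        if step["tool"] not in write_tools:
            continue
        key = call_key(step)
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates


looping = [{"tool": "search_orders", "args": {"q": "4821"}}] * 3
double_refund = [{"tool": "process_refund", "args": {"order_id": "4821", "amount": 89.99}}] * 2
loop_found = detect_doom_loop(looping)
repeated = detect_duplicate_writes(double_refund, {"process_refund"})
```

Serializing arguments with sorted keys means two calls that differ only in argument order are still treated as identical.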
06 — Stack fit
We sit alongside what you have. Not in place of it.
Most teams running agents in production already have some combination of trace logging, custom evaluators, and maybe LLM-as-judge scoring. Keep them. Aiyara is a layer of deterministic constraint monitoring and failure intelligence on top.
Langfuse, LangSmith, Braintrust answer what happened. They store the trace and let you build evaluators. They don't answer whether it should have happened — they don't understand the agent's policy. LLM-as-judge evaluates the output; we evaluate the execution. Pattern-based firewalls operate on individual inputs and outputs; cross-tool-call temporal constraints (“verify before refund”) are structurally outside their scope.
Send your traces to Langfuse, Datadog, or your own store. Point Aiyara at the same stream, and get per-constraint verdicts, cascade attribution, and failure clustering on top.
07 — Onboarding
Five minutes to live.
- 01 (~30s)
Paste your system prompt and tool schemas
No YAML, no DSL to learn. The doc you already have.
- 02 (instant)
Auto-extracted constraints appear side-by-side
Each rule is mapped to the sentence it came from, with a confidence signal.
- 03 (3–5 min)
Review: green, yellow, gray
Confirm what looks right, adjust what doesn't, skip what's not auto-enforceable. We explain the gray cases.
- 04 (~30s)
Toggle on pre-built harness checks
Doom loops, context overflow, retry amplification, error cascades. Zero config.
- 05 (~30s)
Install the SDK (3 lines)
Dashboard goes live. Every trace is monitored against confirmed rules.
pip install aiyara
import aiyara
aiyara.init(prompt="your system prompt here")
# That's it. Rules extracted. Traces monitored.
08 — Where we are
Honestly, the state of the product.
Aiyara is pre-launch. The core engine is validated on public benchmarks. The full product is being built around it.
| Component | Status |
|---|---|
| Constraint extraction + DSL compiler | Working (validated on tau2-bench) |
| Deterministic evaluation engine | Working (0% false positives on compiled rules) |
| Pre-built harness checks | Building (designed, in implementation) |
| Python SDK | In progress (trace model + transport shipped) |
| Side-by-side extraction review UI | In progress |
| Monitoring dashboard | In progress |
| Hosted cloud infrastructure | In progress (ingestion, storage, API) |
What we want from early partners: validation that structural agent failures are a real problem for your team, feedback on the auto-extraction approach, and a read on your security requirements so we prioritize the right things.
09 — Validation
What we've measured so far.
Developed and tested against tau2-bench — a public benchmark of 114 tool-calling agent scenarios across retail and telecom.
- 100% corrected precision. Every violation we flag is a real policy violation. Zero false positives on compiled constraints.
- Deterministic evaluation. Same trace in, same verdict out. No model calls at eval time.
- Full pipeline, end to end in 97 seconds. Constraint extraction through evaluation. 84% agreement with benchmark labels — and 36 violations caught that the benchmark's own evaluator misses.
- Composable DSL. Constraints compile to a small predicate grammar (If/Then, Requires, Before, ForAll) rather than rigid templates. Handles multi-tool constraints and cross-step dependencies template approaches can't express.
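To make the composable-grammar idea concrete, here is a small sketch of the four named predicates written as Python combinators over a trace. This is an illustration under assumptions, not Aiyara's DSL or compiler: the trace shape, the function signatures, and the semantics chosen for each predicate are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]
Trace = List[Step]
Predicate = Callable[[Trace], bool]


def Requires(tool: str) -> Predicate:
    """Holds when `tool` is called at least once."""
    return lambda trace: any(step["tool"] == tool for step in trace)


def Before(first: str, second: str) -> Predicate:
    """Holds when every `second` call is preceded by some `first` call."""
    def check(trace: Trace) -> bool:
        seen_first = False
        for step in trace:
            if step["tool"] == first:
                seen_first = True
            elif step["tool"] == second and not seen_first:
                return False
        return True
    return check


def IfThen(condition: Predicate, consequence: Predicate) -> Predicate:
    """Holds when the condition is false or the consequence holds."""
    return lambda trace: (not condition(trace)) or consequence(trace)


def ForAll(tool: str, step_predicate: Callable[[Step], bool]) -> Predicate:
    """Holds when every call to `tool` satisfies `step_predicate`."""
    return lambda trace: all(step_predicate(s) for s in trace if s["tool"] == tool)


# Composed rule: if a refund happens, identity must have been verified first.
verify_first = IfThen(
    Requires("process_refund"),
    Before("get_customer_details", "process_refund"),
)
positive_amounts = ForAll("process_refund", lambda s: s["args"]["amount"] > 0)

bad_trace = [{"tool": "process_refund", "args": {"order_id": "4821", "amount": 89.99}}]
good_trace = [{"tool": "get_customer_details", "args": {}}] + bad_trace
```

Because each combinator returns an ordinary predicate, multi-tool and cross-step rules are built by nesting rather than by adding new templates.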
The 84% agreement reflects real gaps in our extraction pipeline, not cherry-picked scenarios. Closing them is our core research bet. More in The Translation Bottleneck.