Product. No fluff.

Your system prompt is already your policy.
We extract it, enforce it, and watch every trace.

Aiyara is agent failure monitoring. We catch structural failures in tool-calling agents — deterministically, not with another LLM.

Evaluation: Deterministic. No LLM at verdict time.
Source of truth: Your system prompt + tool schemas.
Setup: ~5 minutes. 3 lines of SDK.
Fits alongside: Langfuse, Braintrust, Datadog.

01 — The problem

The shadow policy problem

Your team has rules for how your agent should behave. Some live in a product spec. Some in a Notion page. Some in a Slack thread. Eventually an engineer translates them into a system prompt — the block of natural language the model reads before every conversation.

That prompt is now the single source of truth for your agent's behavior. Six months later, three engineers have modified it, two new tool integrations added implicit expectations, a compliance ask from legal never made it in, and a PM changed the escalation policy in Notion without updating the prompt. The actual policy is now scattered across prompts, docs, Slack, email, and people's heads.

When something breaks, the first question — was the agent supposed to do that? — takes thirty minutes to answer because nobody can point to a single source of truth. And even when the prompt is correct, most teams have no systematic way to check whether the agent actually follows it. The prompt says “always verify identity before processing a refund.” Did the agent do that on the 12,000 traces it handled last week? Without automated monitoring, nobody knows.

02 — How it works

Extract. Monitor. Explain.

Three stages. The prompt you already have is the input. The output is a deterministic verdict on every trace, with the line it came from.

  1. Paste the prompt. Rules appear.

    Aiyara reads your system prompt and tool schemas and auto-extracts the enforceable rules. Each rule maps back to the sentence it came from. Green means high confidence; yellow needs your review; gray means recognized but not auto-enforceable (tone, style).

    Extraction is the only part of Aiyara that involves an LLM. You review its output before any rule becomes a check.

  2. Edit the prompt. Rules update.

    When the prompt changes, the extraction view comes back — showing which rules were added, removed, or modified. If your Tuesday 3pm prompt update changes two rules and introduces violations by 5pm, you see the correlation. Think of it as CI for your agent's policy.

    The prompt is your source of truth. The rules don't drift from it because they're re-derived from it.

  3. Violations link back to the line.

    Every trace is evaluated against your confirmed rules plus a set of pre-built harness checks. When a constraint fails, Aiyara pins the violation to the exact sentence in the prompt, the step in the trace, and the argument that broke it. Cascade attribution traces downstream failures back to their upstream cause.

    Same trace in, same verdict out. Every time.
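
Step 2 describes rules being added, removed, or modified when the prompt changes. A minimal sketch of that comparison, assuming each extracted rule carries a stable identifier; the `Rule` fields and `diff_rules` are hypothetical names for illustration, not Aiyara's actual data model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str  # hypothetical stable identifier for the rule
    source: str   # sentence in the prompt the rule was extracted from
    spec: str     # hypothetical compiled form of the rule


def diff_rules(old: list[Rule], new: list[Rule]) -> dict[str, list[str]]:
    """Compare two extractions and report which rule ids changed."""
    old_by_id = {r.rule_id: r for r in old}
    new_by_id = {r.rule_id: r for r in new}
    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    # A rule is "modified" when the id survives but its source or spec differs.
    modified = sorted(
        rid
        for rid in old_by_id.keys() & new_by_id.keys()
        if old_by_id[rid] != new_by_id[rid]
    )
    return {"added": added, "removed": removed, "modified": modified}
```

Because the comparison is a pure function of the two rule sets, the same pair of prompt versions always yields the same diff.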

03 — A concrete example

Tuesday, 2:47pm. One skipped step. One flagged trace.

Your customer service agent has this rule in its system prompt:

“Always verify the customer's identity by calling get_customer_details before processing any refund with process_refund.”

On Tuesday at 2:47pm, the agent handles a refund request:

user     → "I need a refund for order #4821"
agent    → process_refund(order_id="4821", amount=89.99)   ← violation
agent    → "Your refund of $89.99 has been processed."

The agent skipped get_customer_details entirely and went straight to process_refund. Aiyara flags this as a violation of the verify-identity-before-refund constraint, links it to the exact line in the system prompt, and surfaces it on the dashboard.
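
The check behind this example can be sketched in a few lines. This is an illustration of a deterministic ordering check, not Aiyara's engine; `ToolCall` and `check_before` are hypothetical names, and the trace shape is an assumption:

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    step: int                                 # position in the trace
    name: str                                 # tool name
    args: dict = field(default_factory=dict)  # tool arguments


def check_before(trace: list[ToolCall], required: str, guarded: str) -> list[int]:
    """Return the steps where `guarded` ran with no earlier `required` call."""
    seen_required = False
    violations = []
    for call in trace:
        if call.name == required:
            seen_required = True
        elif call.name == guarded and not seen_required:
            violations.append(call.step)
    return violations


# The Tuesday 2:47pm trace: the refund is the first and only tool call.
trace = [ToolCall(1, "process_refund", {"order_id": "4821", "amount": 89.99})]
violations = check_before(trace, "get_customer_details", "process_refund")
# violations == [1]
```

No model is consulted at any point, so running the check twice on the same trace cannot produce two different verdicts.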

Now multiply this by every trace, every rule, every day. The PM sees: verify identity before refund — 97.3% compliant this week, down from 99.8% after Tuesday's prompt change. 34 violations since Tuesday, 12 linked to a downstream cascade. The engineer sees the same data with technical trace detail.
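
A per-rule compliance figure like the one above is a simple ratio of passing verdicts to total verdicts. A minimal sketch, assuming verdicts arrive as `(rule_id, passed)` pairs; the function name and the sample counts are made up for illustration:

```python
from collections import defaultdict


def compliance_by_rule(verdicts):
    """verdicts: iterable of (rule_id, passed) pairs, one per rule per trace.

    Returns the percentage of passing verdicts per rule, to one decimal place.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for rule_id, passed in verdicts:
        totals[rule_id] += 1
        passes[rule_id] += int(passed)
    return {r: round(100.0 * passes[r] / totals[r], 1) for r in totals}


# Illustrative numbers only: 973 passing and 27 failing verdicts for one rule.
sample = [("verify-identity", True)] * 973 + [("verify-identity", False)] * 27
rates = compliance_by_rule(sample)
# rates == {"verify-identity": 97.3}
```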

04 — Coverage

What we catch. What we don't.

Not every rule in a system prompt can be deterministically enforced. We're explicit about the line.

| Rule type | Example | Enforceable? | Why |
| Tool ordering | “Verify identity before processing refund” | Yes | Tool call sequence is visible in every trace |
| Required steps | “Always check order status before modifying” | Yes | We detect when expected tools are skipped |
| Argument constraints | “Refund amount must not exceed order total” | Yes | Tool arguments are compared against prior results |
| Rate limits | “Maximum 1 handoff per conversation” | Yes | Tool calls are counted per trace |
| Escalation rules | “Transfer to human after 3 failed attempts” | Yes | Failure counts and escalations are tracked |
| Doom loops | Agent repeating the same action indefinitely | Yes | Pre-built detector, zero config |
| Tone / style | “Be professional and empathetic” | No | Requires subjective judgment; a different tool's job |
| Factual accuracy | “Don't make up information” | Partial | We catch skipped tools; we can't verify free-text claims |
| Topic boundaries | “Only discuss company products” | No | Requires content classification |

The rules we enforce are the ones attached to actions — tool calls, arguments, sequences. These are also the rules attached to the most expensive failures.
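
To make the "Argument constraints" row concrete, here is a sketch of comparing a tool argument against a prior tool result. The `get_order` tool, the `total` field, and the `ToolCall` shape are assumptions for illustration, not part of any real schema:

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    step: int
    name: str
    args: dict = field(default_factory=dict)
    result: dict = field(default_factory=dict)  # what the tool returned


def check_refund_within_total(trace: list[ToolCall]) -> list[int]:
    """Return steps where a refund exceeds the order total seen earlier."""
    order_totals = {}
    violations = []
    for call in trace:
        if call.name == "get_order":  # hypothetical lookup tool
            order_totals[call.args["order_id"]] = call.result["total"]
        elif call.name == "process_refund":
            total = order_totals.get(call.args["order_id"])
            # An unknown total is flagged too: the constraint can't be verified.
            if total is None or call.args["amount"] > total:
                violations.append(call.step)
    return violations
```

Flagging refunds with no known order total is a design choice in this sketch; a real rule might instead report that case under a separate required-step check.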

05 — Harness checks

Four failure modes we catch with zero configuration.

Pre-built detectors that work on any agent trace. No prompt, no schema, no setup.

Doom loops

Same tool called with identical arguments, repeatedly. Burns tokens, never completes.
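
A doom-loop detector of this kind can be sketched as a scan for runs of identical consecutive calls. The function name, the `(tool_name, args)` trace shape, and the threshold of 3 are assumptions for illustration:

```python
import json


def find_doom_loops(calls, threshold=3):
    """calls: list of (tool_name, args_dict) in trace order.

    Returns (start_index, run_length) for every run of at least `threshold`
    consecutive calls with the same tool and identical arguments.
    """
    loops = []
    run_start = 0
    prev_key = None
    for i, (name, args) in enumerate(calls):
        # Canonical JSON makes the comparison independent of argument order.
        key = (name, json.dumps(args, sort_keys=True))
        if key != prev_key:
            if prev_key is not None and i - run_start >= threshold:
                loops.append((run_start, i - run_start))
            run_start = i
            prev_key = key
    # Close out the final run.
    if prev_key is not None and len(calls) - run_start >= threshold:
        loops.append((run_start, len(calls) - run_start))
    return loops
```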

Context overflow

Token utilization approaches the model limit. System prompt truncated. Instructions silently disappear.

Retry amplification

A write operation executed twice. Double charges, duplicate orders, corrupted state.
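
One way to sketch this check: treat a known set of tools as writes and flag any write repeated with identical arguments in the same trace. The tool names in `WRITE_TOOLS` and the trace shape are hypothetical:

```python
import json

# Hypothetical: which tools count as writes would come from the tool schemas.
WRITE_TOOLS = {"process_refund", "create_order"}


def find_duplicate_writes(calls, write_tools=WRITE_TOOLS):
    """calls: list of (tool_name, args_dict) in trace order.

    Returns (first_index, repeat_index) pairs for repeated identical writes.
    """
    first_seen = {}
    duplicates = []
    for i, (name, args) in enumerate(calls):
        if name not in write_tools:
            continue  # reads may safely repeat
        key = (name, json.dumps(args, sort_keys=True))
        if key in first_seen:
            duplicates.append((first_seen[key], i))
        else:
            first_seen[key] = i
    return duplicates
```

Unlike the doom-loop case, the repeated calls need not be adjacent: a retry after an intervening read still double-charges.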

Error cascades

One failed tool call triggers three downstream failures. Symptoms are loud; root cause is buried.
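
Attributing downstream failures to their root cause can be sketched as a walk over step dependencies. This assumes each step records which earlier steps it consumed output from; the `Step` shape and `depends_on` field are hypothetical, and real traces may not expose dependencies this directly:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    step: int
    ok: bool
    depends_on: list = field(default_factory=list)  # earlier step numbers


def attribute_failures(steps):
    """Map each failed step to the earliest upstream failure it depends on.

    A failed step with no failed dependency is its own root cause.
    """
    root_of = {}
    for s in sorted(steps, key=lambda s: s.step):
        if s.ok:
            continue
        failed_parents = [root_of[d] for d in s.depends_on if d in root_of]
        root_of[s.step] = min(failed_parents) if failed_parents else s.step
    return root_of
```

With one failed call at step 1 and three failures that depend on it directly or transitively, every entry in the result points back to step 1.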

06 — Stack fit

We sit alongside what you have. Not in place of it.

Most teams running agents in production already have some combination of trace logging, custom evaluators, and maybe LLM-as-judge scoring. Keep them. Aiyara is a layer of deterministic constraint monitoring and failure intelligence on top.

Langfuse, LangSmith, and Braintrust answer what happened. They store the trace and let you build evaluators. They don't answer whether it should have happened, because they have no model of the agent's policy. LLM-as-judge evaluates the output; we evaluate the execution. Pattern-based firewalls operate on individual inputs and outputs; temporal constraints that span multiple tool calls (“verify before refund”) are structurally outside their scope.

Send your traces to Langfuse, Datadog, or your own store. Point Aiyara at the same stream, and get per-constraint verdicts, cascade attribution, and failure clustering on top.

07 — Onboarding

Five minutes to live.

  1. Paste your system prompt and tool schemas (~30s)

    No YAML, no DSL to learn. The doc you already have.

  2. Auto-extracted constraints appear side-by-side (instant)

    Each rule is mapped to the sentence it came from, with a confidence signal.

  3. Review: green, yellow, gray (3–5 min)

    Confirm what looks right, adjust what doesn't, skip what's not auto-enforceable. We explain the gray cases.

  4. Toggle on pre-built harness checks (~30s)

    Doom loops, context overflow, retry amplification, error cascades. Zero config.

  5. Install the SDK: 3 lines (~30s)

    Dashboard goes live. Every trace is monitored against confirmed rules.

pip install aiyara

import aiyara
aiyara.init(prompt="your system prompt here")

# That's it. Rules extracted. Traces monitored.

08 — Where we are

Honestly, the state of the product.

Aiyara is pre-launch. The core engine is validated on public benchmarks. The full product is being built around it.

| Component | Status | Notes |
| Constraint extraction + DSL compiler | Working | validated on tau2-bench |
| Deterministic evaluation engine | Working | 0% false positives on compiled rules |
| Pre-built harness checks | Building | designed, in implementation |
| Python SDK | In progress | trace model + transport shipped |
| Side-by-side extraction review UI | In progress | |
| Monitoring dashboard | In progress | |
| Hosted cloud infrastructure | In progress | ingestion, storage, API |

What we want from early partners: validation that structural agent failures are a real problem for your team, feedback on the auto-extraction approach, and a read on your security requirements so we prioritize the right things.

09 — Validation

What we've measured so far.

Developed and tested against tau2-bench — a public benchmark of 114 tool-calling agent scenarios across retail and telecom.

The 84% agreement reflects real gaps in our extraction pipeline, not cherry-picked scenarios. Closing them is our core research bet. More in The Translation Bottleneck.

If this diagnosis matches how you're thinking about your own agents, we'd like to hear from you.