Models violate their own instructions up to 80% of the time.We catch every violation — deterministically.
The Problem
42–80%
of the time, models violate their own instructions under conversational pressure
IBM CRAFT, EMNLP 2025
44%
of agent failures are architectural — not capability. Better models won't fix them.
MAST, NeurIPS 2025
11%
detection accuracy of frontier LLMs reviewing agent traces. The industry standard is a coin flip with a broken coin.
TRAIL, Patronus AI 2025
Policy Intelligence
Paste a prompt, get enforceable rules. Edit the prompt, rules update. When a trace breaks a rule, we show you the exact line to fix.
Structural Detection
Schema validators catch shape.
LLM-as-judge attempts quality.
Neither catches sequence.
We do, on every trace.
Monitoring
Same request passes at 2pm and fails at 3pm. We evaluate every trace against your rules and point to the instruction producing the drift.
Outcome-Linked Severity
Every violation scored by the failures it caused — escalations, reverts, refunds, whatever your team tracks. The noisiest rule rarely hurts you most.
Get Started
No scorers. No criteria files. Paste your system prompt and Aiyara extracts the rules automatically.
pip install aiyara
import aiyara
aiyara.init(
api_key="ak_…",
prompt="…your system prompt…",
auto_instrument=True,
)
# Rules extracted. Every trace evaluated. Prompt diffs flagged.