Reliability infrastructure for AI agents

Models violate their own instructions up to 80% of the time. We catch every violation, deterministically.


PASS RATE

72%

156 traces

SEVERITY BREAKDOWN

5 open

1 critical · 2 high · 1 medium · 1 low

OPEN ISSUES

3

0.4 per 1K traces

CRITICAL

1

Since: Mar 14

Violation Rate

1 day ago · Since Mar 14

Sort: Priority ▾ · Severity · Class · Stability · Cascade · Constraint ID

CRITICAL · 12 of 17

CRT

modify_order_pending · POLICY VIOLATION · modify_pending_order_address at step 2; order #W865881 status unknown (agent didn’t check status)

ROOT CAUSE +1 · systematic

1 day ago

395

CRT

exchange_order_delivered · POLICY VIOLATION · exchange at step 4; order #W908622 status unknown

2 days ago

284

HGH

order_escalation_missing · ESCALATION VIOLATION · order #W773291 total $847.50 not escalated to supervisor

3 days ago

89

MED

tool_call_limit_exceeded · CARDINALITY VIOLATION · get_order_details called 5 times (limit: 3)

5 days ago

42

The Problem

42–80%

of the time, models violate their own instructions under conversational pressure

IBM CRAFT, EMNLP 2025

44%

of agent failures are architectural, not capability-related. Better models won't fix them.

MAST, NeurIPS 2025

11%

detection accuracy of frontier LLMs reviewing agent traces. The industry standard is a coin flip with a broken coin.

TRAIL, Patronus AI 2025

Policy Intelligence

Your prompt is the spec. We make it enforceable.

Paste a prompt, get enforceable rules. Edit the prompt, rules update. When a trace breaks a rule, we show you the exact line to fix.

SYSTEM PROMPT · retail_agent.md · v2→v3

  53  ## Exchange Policy
  54
- 55  Exchanges are allowed for delivered orders of
- 56  the same product type.
- 57  Verify the order status before processing.
+ 55  Exchanges are ONLY allowed when the order
+ 56  status is “delivered”.
+ 57  The replacement item MUST be the same product
+ 58  type as the original.
+ 59  Verify both conditions before calling
+ 60  exchange_delivered_order_items.

EXTRACTED · 0.3s

HIGH CONFIDENCE · Argument Constraint

Exchanges require delivered status and matching product type

If(exchange, Requires(status == "delivered", same_type))

v2→v3 · UPDATED · just now

Argument Constraint

Exchanges require delivered status and matching product type

If(exchange, Requires(status == "delivered")) → If(exchange, Requires(status == "delivered", same_type))

VIOLATED · trace · tr_00041 · Step 4/6

exchange_delivered_order_items { order_id: 'W5061109', status: 'pending' }

WHY

Called on order with status='pending'. Rule requires delivered.
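The extracted rule above reads as a plain predicate over a single tool call. A minimal Python sketch of that idea, assuming a trace step carries a tool name and an argument dict; the `ToolCall` shape, the `check_exchange_rule` helper, and the `original_type` / `replacement_type` fields are hypothetical and are not Aiyara's API:

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of an agent trace (hypothetical shape, for illustration only)."""
    tool: str
    args: dict = field(default_factory=dict)


def check_exchange_rule(call: ToolCall) -> list[str]:
    """Return the reasons a call breaks the exchange rule; an empty list means pass."""
    # The rule is conditional: it only applies to the exchange tool.
    if call.tool != "exchange_delivered_order_items":
        return []
    reasons = []
    status = call.args.get("status")
    if status != "delivered":
        reasons.append(f"Called on order with status={status!r}. Rule requires delivered.")
    # Assumed field names for the product-type comparison.
    if call.args.get("original_type") != call.args.get("replacement_type"):
        reasons.append("Replacement item is not the same product type as the original.")
    return reasons


step = ToolCall(
    tool="exchange_delivered_order_items",
    args={"order_id": "W5061109", "status": "pending",
          "original_type": "t-shirt", "replacement_type": "t-shirt"},
)
violations = check_exchange_rule(step)
print(violations)
```

Because the check is a pure function of the call arguments, the same trace always produces the same verdict, which is the property "deterministic" refers to here.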

Structural Detection

Schema validators catch shape.

LLM-as-judge attempts quality.

Neither catches sequence.

We do, on every trace.

search_ticket · 320 ms
get_customer · 180 ms
process_refund · 410 ms
Right tool, wrong time · precondition verify_identity skipped
send_confirmation · 240 ms
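A sequence check of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the trace is reduced to an ordered list of tool names, and the `preconditions` map (here, `process_refund` requires an earlier `verify_identity`) is an assumed rule.

```python
def find_skipped_preconditions(trace, preconditions):
    """Return (step_index, tool, missing_precondition) for each out-of-order call.

    trace: ordered list of tool names.
    preconditions: maps a tool to the tools that must have run earlier in the trace.
    """
    seen = set()
    findings = []
    for index, tool in enumerate(trace):
        for required in preconditions.get(tool, ()):
            if required not in seen:
                findings.append((index, tool, required))
        seen.add(tool)
    return findings


trace = ["search_ticket", "get_customer", "process_refund", "send_confirmation"]
preconditions = {"process_refund": ["verify_identity"]}  # assumed rule

findings = find_skipped_preconditions(trace, preconditions)
print(findings)  # [(2, 'process_refund', 'verify_identity')]
```

Every call in this trace is individually well-formed, so a schema validator would pass it; only the ordering check flags step 2.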

Monitoring

AI is nondeterministic. It needs its own type of monitoring.

Same request passes at 2pm and fails at 3pm. We evaluate every trace against your rules and point to the instruction producing the drift.

RELIABILITY · retail_agent.md · LIVE · 2,847 traces / 24h

53  ## Exchange Policy
54
55  Exchanges are ONLY allowed when the order · 23%
56  status is “delivered”.
57  The replacement item MUST be the same product · 99%
58  type as the original.
59  Verify both conditions before calling · 87%
60  exchange_delivered_order_items.

DETERMINISTIC · EVERY TRACE · 247 rules

94%

3 BROKEN · 12 FLAKY · 232 HEALTHY

Top broken

exchange_delivered_status · 98% → 23%

retry_amplification_bound · 89% → 41%

tool_auth_scope · 95% → 67%

PASS RATE DROP · exchange_delivered_status

98% → 23%

prompt v4 · 3 days ago

Root cause · conditional

order.status == "pending" · 77%

retried after upstream 404 · 19%

SAME REQUEST · TWO TRACES · exchange · W5061109

tr_02f1 · PASS

get_order_details → verify_status → exchange_delivered_order_items

tr_04b8 · FAIL

get_order_details → exchange_delivered_order_items · skipped verify

This pattern · seen 12× this week

11 failed traces · 92% fail rate · first seen v4
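The per-rule health view above boils down to a pass rate per rule plus a bucket. A rough Python sketch, assuming each evaluation is recorded as a `(rule_id, passed)` pair; the 95% and 70% thresholds and the `same_product_type` rule name are assumptions chosen for illustration, not the product's actual cutoffs:

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (rule_id, passed) pairs, one per evaluated trace."""
    totals = defaultdict(lambda: [0, 0])  # rule_id -> [passed, evaluated]
    for rule_id, passed in results:
        totals[rule_id][0] += int(passed)
        totals[rule_id][1] += 1
    return {rule: passed / evaluated for rule, (passed, evaluated) in totals.items()}


def classify(rate, healthy_at=0.95, broken_below=0.70):
    """Bucket a pass rate; both thresholds are assumed values."""
    if rate >= healthy_at:
        return "HEALTHY"
    if rate < broken_below:
        return "BROKEN"
    return "FLAKY"


# 12 evaluations with 11 failures, mirroring "seen 12x this week, 11 failed traces".
results = [("exchange_delivered_status", False)] * 11 + [("exchange_delivered_status", True)]
# A second, hypothetical rule that almost always passes.
results += [("same_product_type", True)] * 99 + [("same_product_type", False)]

rates = pass_rates(results)
status = {rule: classify(rate) for rule, rate in rates.items()}
print(status)
```

Comparing these rates across prompt versions is what surfaces a drop such as 98% → 23%.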

Outcome-Linked Severity

Fix what broke something.
Not what fires most.

Every violation is scored by the failures it caused: escalations, reverts, refunds, whatever your team tracks. The noisiest rule rarely hurts you most.

Traces · last 30d · 4,127 traces

All · Escalated 847 · Abandoned 284 · Refunded 164 · Contained 2,832

...00041 · 11:23 · 2.4s · 4.1k tok · 7 calls · auth_user → process_refund · ESCALATED via webhook

...00089 · 10:47 · 3.8s · 6.2k tok · 11 calls · auth_user → modify_order · REFUNDED via SDK adapter

...00155 · 09:15 · 5.1s · 8.4k tok · 9 calls · search_kb → process_refund · ABANDONED via inference rule

...00203 · 08:40 · 1.8s · 3.2k tok · 4 calls · auth_user → cancel_order · CONTAINED via SDK adapter

...00244 · 08:02 · 2.9s · 5.0k tok · 6 calls · auth_user → process_refund · ESCALATED via webhook

...00318 · 07:47 · 3.3s · 5.7k tok · 8 calls · search_kb → handoff_to_human · ESCALATED via inference rule

...00395 · 07:12 · 2.1s · 3.8k tok · 5 calls · cancel_order → confirm · CONTAINED via SDK adapter

...00472 · 06:55 · 4.2s · 7.1k tok · 10 calls · modify_order → verify_account · ABANDONED via inference rule

Attribution · rule × outcome · last 30d · baseline v2

Escalation 847 · Abandonment 284 · Refund 164 · Containment 2,832 · +2

RULE × OUTCOME | LIFT + 95% CI | POSTERIOR | CHANGE-POINT
R7 identity_verify_skipped_before_tx → ESCALATION | +37%, CI [+29, +44] | 0.98 | apr 17 · v4
R3 retry_on_4xx_non_retriable → ESCALATION | +22%, CI [+15, +30] | 0.91 | mar 28 · v3
R14 adversarial_bypass_mfa → ESCALATION | +51%, CI [+39, +63] | 0.96 | apr 11 · v4
R22 refund_exceeds_sup_threshold → ESCALATION | +18%, CI [+11, +25] | 0.87 | apr 03 · v4
R9 status_check_missing_before_modify → ESCALATION | +9%, CI [+4, +14] | 0.78 | stable
R12 greeting_variant_inconsistent → ESCALATION | +0%, CI [-8, +8] | 0.42 | stable
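The page does not say how lift and its interval are computed. One common reading, sketched here in Python as an assumption rather than Aiyara's method, treats lift as the difference in outcome rate between traces that violated a rule and traces that did not, with a normal-approximation (Wald) 95% interval. The counts in the example are hypothetical.

```python
import math


def lift_with_ci(viol_outcomes, viol_total, clean_outcomes, clean_total, z=1.96):
    """Difference in outcome rate (violating minus clean) with a Wald interval.

    Returns (lift, lower, upper) as fractions, e.g. 0.30 means +30 points.
    """
    p_viol = viol_outcomes / viol_total
    p_clean = clean_outcomes / clean_total
    lift = p_viol - p_clean
    std_err = math.sqrt(
        p_viol * (1 - p_viol) / viol_total
        + p_clean * (1 - p_clean) / clean_total
    )
    return lift, lift - z * std_err, lift + z * std_err


# Hypothetical counts: 40 of 100 violating traces escalated, 100 of 1,000 clean ones did.
lift, lower, upper = lift_with_ci(40, 100, 100, 1000)
print(f"lift {lift:+.0%}, 95% CI [{lower:+.0%}, {upper:+.0%}]")
```

An interval that straddles zero, as in the `greeting_variant_inconsistent` row, means the data do not show that the violation moves the outcome either way.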

Issues ranked by cost · 47 open · cascade-clustered

Outcome impact · Frequency

#01 · CRT · R7 · identity_verify_skipped_before_tx · CASCADE · ROOT CAUSE
cascade from identity_verify → transaction_tool · change-point apr 17 · v4
312 escalations · 847 violations

#02 · CRT · R3 · retry_on_4xx_non_retriable
tokens burn on permanent errors
89 escalations · 156 violations

#03 · HGH · R14 · adversarial_bypass_mfa
rare but costly: high lift, low volume
54 escalations · 71 violations

#04 · MED · R22 · refund_exceeds_sup_threshold
supervisor handoff skipped on high-value refunds
38 escalations · 112 violations

#05 · MED · R9 · status_check_missing_before_modify
intermittent, signal not yet stable
27 escalations · 214 violations

#06 · LOW · R12 · greeting_variant_inconsistent · VANITY
4,200 violations, the vanity metric
11 escalations · 4,200 violations
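The difference between the two orderings is easy to see with the six issues listed above. A small Python sketch, using the escalation and violation counts from the list; the dictionary layout is just an illustrative choice:

```python
issues = [
    {"rule": "R7", "name": "identity_verify_skipped_before_tx", "escalations": 312, "violations": 847},
    {"rule": "R3", "name": "retry_on_4xx_non_retriable", "escalations": 89, "violations": 156},
    {"rule": "R14", "name": "adversarial_bypass_mfa", "escalations": 54, "violations": 71},
    {"rule": "R22", "name": "refund_exceeds_sup_threshold", "escalations": 38, "violations": 112},
    {"rule": "R9", "name": "status_check_missing_before_modify", "escalations": 27, "violations": 214},
    {"rule": "R12", "name": "greeting_variant_inconsistent", "escalations": 11, "violations": 4200},
]

# Rank by what the violations cost (escalations) versus how often they fire.
by_impact = sorted(issues, key=lambda issue: issue["escalations"], reverse=True)
by_frequency = sorted(issues, key=lambda issue: issue["violations"], reverse=True)

print("by impact:   ", [issue["rule"] for issue in by_impact])
print("by frequency:", [issue["rule"] for issue in by_frequency])
```

Sorted by frequency, R12 jumps to the top with 4,200 violations; sorted by outcome impact it falls to last place with 11 escalations.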

open · 7d

outcome · escalation

312 escalations · 847 violations · +37% lift · change-point apr 17 · v4

Cascade

user_intent · entry

identity_verify · SKIPPED

transaction_tool · FAILED

Upstream cause, downstream symptom: one issue.
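One way to fold a downstream symptom into its upstream cause is to walk each problem step back through its dependencies until the chain reaches a step that behaved normally. The Python sketch below is a guess at the idea, not the product's clustering logic; the `upstream_of` map is assumed to be known and acyclic.

```python
def cluster_cascade(events, upstream_of):
    """Group problem steps in one trace under their root cause.

    events: list of (step, state) pairs, e.g. ("identity_verify", "SKIPPED").
    upstream_of: maps a step to the step it depends on (assumed acyclic).
    """
    problems = {step for step, state in events if state in {"SKIPPED", "FAILED"}}
    clusters = {}
    for step in problems:
        root = step
        # Walk upstream while the dependency is itself a problem step.
        while upstream_of.get(root) in problems:
            root = upstream_of[root]
        clusters.setdefault(root, []).append(step)
    return {root: sorted(members) for root, members in clusters.items()}


events = [
    ("user_intent", "entry"),
    ("identity_verify", "SKIPPED"),
    ("transaction_tool", "FAILED"),
]
upstream_of = {"transaction_tool": "identity_verify", "identity_verify": "user_intent"}

clusters = cluster_cascade(events, upstream_of)
print(clusters)  # {'identity_verify': ['identity_verify', 'transaction_tool']}
```

The skipped verification and the failed transaction end up in a single cluster keyed by `identity_verify`, so they count as one issue rather than two.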

Attribution

lift +37%

95% CI [+29, +44]

posterior 0.98

change-point apr 17 · v4

Sample traces 5 of 312

...00041 · 11:23 · ESCALATED

...00089 · 10:47 · REFUNDED

...00155 · 09:15 · ABANDONED

...00244 · 08:02 · ESCALATED

...00318 · 07:47 · ESCALATED

Prompt source line 47

47  Always verify identity before any transaction.

Get Started

5 minutes, not 5 sprints

No scorers. No criteria files. Paste your system prompt and Aiyara extracts the rules automatically.

pip install aiyara

import aiyara

aiyara.init(
    api_key="ak_…",
    prompt="…your system prompt…",
    auto_instrument=True,
)
# Rules extracted. Every trace evaluated. Prompt diffs flagged.

from langgraph.graph import StateGraph
# Compiled graphs are wrapped — invoke() and stream() captured.

from agents import Agent, Runner
# Runner.run() / run_sync() patched — every agent invocation captured.


Reliability infrastructure for AI agents

Your prompt is the spec. We make it enforceable.

Paste your prompt. Rules appear.

Edit once. Rules update.

Violations link back to the line.

Right tool, wrong time

Skipped required steps

Retries on non-retriable errors

Doom loops

AI is nondeterministic. It needs its own type of monitoring.

See which rules are breaking.

Find the line that caused it.

Understand why.

Fix what broke something.
Not what fires most.

Tag outcomes.

Attribute.

Rank by cost.

Drill in.

5 minutes, not 5 sprints
