Agents Fail Structurally. Current Tooling Can't See It.

Why the expensive agent failures in production are structural, not linguistic — and why better prompts, LLM-judges, and firewalls don't close the gap.

Published: 2026.04.14
Reading: 6 min
References: 4 papers
Author: Aiyara Research

A pattern, not a list

In December 2025, an AI agent at Amazon named Kiro autonomously deleted a production environment, causing a thirteen-hour outage. Around the same time, Google's Antigravity agent wiped a user's Drive. Sierra's customer service agent was jailbroken into generating inappropriate content on gap.com. Anthropic agents destroyed 2.5 years of a user's production data. Meta's internal agent posted to a company forum without human approval and triggered a Sev 1.

A CLTR / AISI-funded study documented ~700 cases of agents ignoring human instructions between October 2025 and March 2026, a 5x jump in six months. 88% of enterprises in Gravitee's December 2025 CTO survey report at least one AI agent security incident. 61% of those incidents, per DigitalApplied's analysis, trace to over-permissioned credentials.

Every one of these is a structural failure. The agent called a tool it wasn't supposed to call, skipped a step it was supposed to perform, or took an action outside its authorization. Not a hallucination — the model wasn't confused about facts. It executed the wrong plan.

Across publicly reported production incidents, a clear majority have structural root causes rather than linguistic ones. Press selection inflates that share, since catastrophic structural failures make better headlines than a wrong-sounding sentence, but the skew still reflects where the cost concentrates.

Why better prompts don't close the gap

IBM's CRAFT study (EMNLP 2025) put leading models through a standard customer-service policy and measured violation rates. Under normal conversations: 42–59% violations depending on the model. Under conversational pressure — not prompt injection, just a persistent user — 70–80%. The best prompt-based defense the authors engineered still sat above 80% violation under sustained pressure.

There is a second reason bigger models don't fix this. An agent at 95% accuracy per step on a 20-step task succeeds end-to-end about 36% of the time (0.95^20 ≈ 0.358), assuming step errors are independent. At 97% over 50 steps, about 21.8% (0.97^50 ≈ 0.218). This is compound probability, not a model limitation. Per-step accuracy will improve; the multiplication will not go away.
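The arithmetic can be reproduced in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability; the function name `end_to_end_success` is ours, chosen for illustration.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step_accuracy ** steps


for accuracy, steps in [(0.95, 20), (0.97, 50)]:
    rate = end_to_end_success(accuracy, steps)
    print(f"{accuracy:.0%} per step over {steps} steps -> {rate:.1%} end-to-end")

# 95% per step over 20 steps -> 35.8% end-to-end
# 97% per step over 50 steps -> 21.8% end-to-end
```

Real agents can recover from some mistakes and their errors are rarely independent, so treat the result as a rough guide to how quickly long tasks erode per-step accuracy, not as a prediction.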

Why better evals don't either

Trace stores — Langfuse, LangSmith, Braintrust — answer what happened. They don't answer whether it should have happened; they don't understand the agent's policy. Whether the evaluator you build on top is correct is on you.

The default evaluator pattern, LLM-as-judge, has a failure mode that shows up by construction. Judges default to charitable interpretations of ambiguous constraints: a rule the checker isn't sure applies becomes a default "no violation." Published measurements put LLM-as-judge error in the 15–20% range on structured tasks with clean ground truth; on production policies, where ground truth is itself ambiguous, the error is larger and hard to measure. Ensembles don't cancel this out — judges from the same model family share biases, so their convergence is covariance under a shared default, not independent signals agreeing.
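A small calculation shows why shared bias matters for ensembles. The sketch below compares two idealized cases for a majority vote: judges that err independently, and judges that all make the same mistakes. Both cases and the function names are assumptions for illustration; real ensembles fall somewhere in between. The 20% per-judge error is taken from the upper end of the range quoted above.

```python
from math import comb


def majority_error_independent(judge_error: float, n_judges: int) -> float:
    """Error rate of a majority vote when judges err independently (n_judges odd)."""
    wrong_votes_needed = n_judges // 2 + 1
    return sum(
        comb(n_judges, k) * judge_error**k * (1 - judge_error) ** (n_judges - k)
        for k in range(wrong_votes_needed, n_judges + 1)
    )


def majority_error_shared(judge_error: float, n_judges: int) -> float:
    """Limiting case: all judges make identical mistakes, so voting changes nothing."""
    return judge_error


print(f"independent: {majority_error_independent(0.20, 3):.1%}")  # independent: 10.4%
print(f"shared bias: {majority_error_shared(0.20, 3):.1%}")       # shared bias: 20.0%
```

Voting only helps to the extent that the judges' errors are uncorrelated. When every judge falls back to the same charitable default, the ensemble behaves like the second function while reporting perfect agreement.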

We hit this in our own work. Our first ensemble scored 100% agreement on a held-out benchmark, and the number was an artifact of every checker defaulting to the same answer. The postmortem is in Why LLM-as-Judge Fails Silently at Scale.

Patronus released the TRAIL benchmark in early 2026 to measure exactly this — automated failure detection on agent traces. The best published system scores 11%.

Why firewalls don't either

Pattern-based firewalls — LlamaFirewall is the open-source default — operate on individual inputs and outputs. A policy like "verify identity before processing any refund above $500" is a cross-tool-call temporal constraint; no regex sees both calls in the same frame. The attack literature (STAC, arXiv 2509.25624) routinely breaks pattern defenses with paraphrase, encoding, and sequential decomposition attacks at 80–90%+ success rates. Best published defenses miss 10–20% of violations on their own test sets before any adversary is involved.

Most production stacks today run some combination of these three — tighter prompts, an LLM-judge, a firewall. The failure modes don't offset. They stack.

The shape the field has converged on

Five independent research groups have, in the last eighteen months, arrived at the same three-stage pipeline:

1. Start with a natural-language policy: system prompt, SOP, compliance document.
2. Translate it into formal constraints in a machine-checkable language.
3. Evaluate every trace against those constraints deterministically. Same trace in, same verdict out.

Agent-C (Kamath, Zhang & Xu et al., arXiv 2512.23738), FormalJudge (Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136), AgentSpec (Wang et al., Jun Sun group, arXiv 2503.18666; the most-cited paper in this space), PCAS (Palumbo et al., arXiv 2602.16708), and the Microsoft ContextCov / AgentPex / AgentRx program (three papers in February and March 2026) each implement a version of this flow. The design space around step 2 is contested; the three-step shape is consensus.

The property it has that nothing else does: a deterministic verdict can be wrong, but it can't be wrong-on-average-while-looking-confident. If a constraint misfires, the constraint is an auditable artifact. If a trace passes, the evidence is the sequence of tool calls, not a model's summary of them.
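To make the deterministic stage concrete, here is a minimal sketch of a trace check for the refund policy mentioned earlier ("verify identity before processing any refund above $500"). It is not taken from any of the cited systems. The `ToolCall` type, the tool names `verify_identity` and `process_refund`, and the `amount` argument are all hypothetical, and the sketch assumes that the presence of a `verify_identity` call means verification succeeded.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    tool: str                                   # hypothetical tool name
    args: dict[str, Any] = field(default_factory=dict)


def refunds_before_identity_check(
    trace: list[ToolCall], threshold: float = 500.0
) -> list[int]:
    """Return indices of refund calls above `threshold` made before any identity check."""
    verified = False
    violations: list[int] = []
    for index, call in enumerate(trace):
        if call.tool == "verify_identity":
            verified = True  # assumption: a call implies successful verification
        elif (
            call.tool == "process_refund"
            and call.args.get("amount", 0) > threshold
            and not verified
        ):
            violations.append(index)
    return violations


bad_trace = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("process_refund", {"amount": 750}),
    ToolCall("verify_identity"),
]
good_trace = [
    ToolCall("verify_identity"),
    ToolCall("process_refund", {"amount": 750}),
]

print(refunds_before_identity_check(bad_trace))   # [1]
print(refunds_before_identity_check(good_trace))  # []
```

The verdict depends only on the order of tool calls, so the same trace always yields the same result, and a disputed verdict points at a specific index in the trace rather than at a model's summary of it.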

What still stands between the shape and a shipped system

Two bottlenecks, both real, both active research.

Translation. The first arrow — natural language to formal constraint — is where every group stalls. Prasetya et al.'s Talk is Cheap, Logic is Hard (arXiv 2603.17193) benchmarked 24 frontier models on formalizing program post-conditions (a cleaner problem than agent policies), and found the gap from ground truth persistent across model scales. The most effective mitigation they identified wasn't a bigger model; it was a second verification layer that doesn't trust the first. More in The Translation Bottleneck.

Evaluation trust. Even with a correct spec, evaluation can be silently wrong — as with our 100%-agreement incident, where the ensemble was agreeing on a default rather than a judgment. The fix involved separating does this rule apply? from does the trace satisfy it? at the checker-return signature (a pattern from FormalJudge). Agreement dropped from a fake 1.0 to an honest 0.67 and surfaced the 38 of 49 traces that were actually out of scope. Full account in Why LLM-as-Judge Fails Silently at Scale.
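The separation can be sketched as a return type with two fields instead of one. This is an illustration of the idea, not the FormalJudge interface or our production checker. `CheckResult`, the two agreement functions, and the four-trace toy data are all assumptions made up for this example and do not reproduce the numbers above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckResult:
    applies: bool                     # does this rule apply to the trace?
    satisfied: Optional[bool] = None  # does the trace satisfy it? None if out of scope

    def collapsed(self) -> bool:
        """Lossy view: anything that is not a proven violation counts as a pass."""
        return not (self.applies and self.satisfied is False)


OUT_OF_SCOPE = CheckResult(applies=False)
PASS = CheckResult(applies=True, satisfied=True)


def collapsed_agreement(rows: list[tuple[CheckResult, ...]]) -> float:
    """Share of traces where all checkers agree after collapsing to pass/fail."""
    return sum(len({r.collapsed() for r in row}) == 1 for row in rows) / len(rows)


def scoped_agreement(rows: list[tuple[CheckResult, ...]]) -> tuple[Optional[float], int]:
    """Agreement over traces any checker considers in scope, plus the out-of-scope count."""
    in_scope = [row for row in rows if any(r.applies for r in row)]
    agree = sum(len(set(row)) == 1 for row in in_scope)
    rate = agree / len(in_scope) if in_scope else None
    return rate, len(rows) - len(in_scope)


# One row per trace, one column per checker (toy data).
rows = [
    (OUT_OF_SCOPE, OUT_OF_SCOPE),
    (OUT_OF_SCOPE, OUT_OF_SCOPE),
    (PASS, PASS),
    (PASS, OUT_OF_SCOPE),  # the checkers disagree on whether the rule applies
]

print(collapsed_agreement(rows))  # 1.0
print(scoped_agreement(rows))     # (0.5, 2)
```

With a single pass/fail return value, "out of scope" and "satisfied" are indistinguishable, so the disagreement in the last row never shows up. Keeping the two questions separate exposes both the disagreement and the number of traces that were never in scope.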

The evaluation stage itself is commoditizing. Z3 is Z3. Dafny is Dafny. The DSL design space is narrowing around small composable predicate vocabularies. The differentiation is upstream.

Appendix: sources cited

CRAFT (IBM Research, EMNLP 2025) — customer-service policy compliance under conversational pressure.
TRAIL (Patronus AI, 2026) — automated failure detection on agent traces.
Talk is Cheap, Logic is Hard — Prasetya et al., arXiv 2603.17193.
Agent-C — Kamath, Zhang & Xu et al., arXiv 2512.23738.
AgentSpec — Wang et al., arXiv 2503.18666.
FormalJudge — Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136.
PCAS — Palumbo et al., arXiv 2602.16708.
ContextCov — Sharma et al., Microsoft, arXiv 2603.00822.
AgentPex (Willful Disobedience) — Sharma et al., Microsoft, arXiv 2603.23806.
AgentRx — Barke et al., Microsoft, arXiv 2602.02475.
LlamaFirewall — Meta, arXiv 2505.03574.
STAC — arXiv 2509.25624.
Gravitee CTO Survey — 750 CTOs, December 2025.
CLTR / AISI-funded — ~700 instruction-ignoring cases, Oct 2025–Mar 2026.
Incident reports: Amazon Kiro, Google Antigravity, Sierra / Gap, Anthropic agents, Meta internal (Dec 2025–Mar 2026).