
Why LLM-as-Judge Fails Silently at Scale

Our first LLM-judge ensemble scored 100% agreement on a held-out benchmark. That was the bug, and it points to a structural failure mode of LLM-as-judge.

Published: 2026.02.28
Reading: 6 min
References: 4 papers
Author: Aiyara Research

LLM-as-judge is the default pattern for evaluating agent behavior at scale. Every major eval platform ships it; most internal eval stacks are built on it. We started there too. Our first ensemble of LLM-generated checkers scored 100% agreement on a held-out benchmark.

That was the bug.

What follows is what we found when we looked at why, and why we think it's a property of the paradigm rather than our specific implementation.

What we tried

Some agent policies are simple enough to evaluate with pattern-matched rules. Others are not. A retail policy might say "do not process a refund if the customer's address has changed within the last 30 days and the order contains a high-risk category item." No generic pattern matches that. A common move in the field — and the one we started with — is to generate evaluation code for such rules with an LLM, run it against traces, and trust the result.

Generated evaluation code is cheap to produce and expensive to trust. The standard mitigation, in the direction Prasetya et al.'s Talk is Cheap, Logic is Hard (arXiv 2603.17193) points toward, is an ensemble: generate several independent checker variants per constraint, run them all against the same traces, and flag constraints where variants disagree for human review. The premise is that if multiple LLM-generated checkers converge, the answer is probably right.
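A minimal sketch of that ensemble check, assuming each generated variant is a plain function from a trace to a boolean. The names (`flag_disagreements`, the `refund` and `total` fields, the example rule) are hypothetical and chosen for illustration; this is not the setup from the paper.

```python
from typing import Callable, Dict, List

Trace = Dict[str, float]           # hypothetical structured trace
Checker = Callable[[Trace], bool]  # one generated variant: True means "satisfied"


def flag_disagreements(
    variants: Dict[str, List[Checker]], traces: List[Trace]
) -> Dict[str, List[int]]:
    """Return, per constraint, the indices of traces where the variants disagree."""
    flagged: Dict[str, List[int]] = {}
    for constraint_id, checkers in variants.items():
        for i, trace in enumerate(traces):
            verdicts = {checker(trace) for checker in checkers}
            if len(verdicts) > 1:  # at least one True and one False
                flagged.setdefault(constraint_id, []).append(i)
    return flagged


# Three hypothetical variants of "the refund must not exceed the order total".
variants = {
    "refund_within_total": [
        lambda t: t["refund"] <= t["total"],
        lambda t: t["refund"] <= t["total"],
        lambda t: t["refund"] < t["total"],  # this variant reads the rule strictly
    ]
}
traces = [{"refund": 10, "total": 50}, {"refund": 50, "total": 50}]

review_queue = flag_disagreements(variants, traces)
print(review_queue)  # {'refund_within_total': [1]}
```

Only the second trace reaches the review queue, because it is the only one where a variant dissents. A constraint on which every variant returns the same value is never surfaced, whatever the reason for the shared value.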

The first real test, against a domain we hadn't iterated on, returned 100% agreement.

How the 100% happened

We noticed because the number was too clean. A zero-disagreement score on a held-out benchmark is suspicious in the same way that a model reaching a training loss of exactly zero is suspicious: it usually means the thing being measured is not the thing you wanted.

The failure was structural, not semantic. Each checker returned a single boolean: does this trace satisfy this constraint? The implicit convention — the one LLM-generated code adopts by default — is that a constraint which does not apply to a given trace returns True. Nothing to violate; trivially satisfying. Defensible in isolation. Disastrous in an ensemble.
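To make the convention concrete, here is a sketch of a single-boolean checker for the retail refund rule quoted earlier. The `Trace` fields and the reading of "within the last 30 days" as `<= 30` are assumptions made for this illustration, not our production schema.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical trace fields, invented for this illustration.
    refund_requested: bool
    refund_processed: bool
    days_since_address_change: int
    has_high_risk_item: bool


def check_refund_rule(trace: Trace) -> bool:
    """Single-boolean checker: True means 'no violation found'."""
    in_scope = (
        trace.refund_requested
        and trace.days_since_address_change <= 30  # assumed reading of "within 30 days"
        and trace.has_high_risk_item
    )
    if not in_scope:
        return True  # rule does not apply: nothing to violate
    return not trace.refund_processed


# An order-status lookup that never touches refunds.
unrelated = Trace(
    refund_requested=False,
    refund_processed=False,
    days_since_address_change=400,
    has_high_risk_item=False,
)
# A risky refund request that the agent correctly refused.
compliant = Trace(
    refund_requested=True,
    refund_processed=False,
    days_since_address_change=5,
    has_high_risk_item=True,
)

print(check_refund_rule(unrelated))  # True
print(check_refund_rule(compliant))  # True
```

Both calls return True, and nothing in the return value distinguishes "the rule was checked and held" from "the rule was never in play".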

What the 100% agreement actually measured was agreement on the default, not on the judgment. For the unseen benchmark, most constraints applied to almost none of the traces. Most checkers returned True on almost every trace, because most of the time the rule was not in scope. Every checker was rubber-stamping the same default.

Consider a single trace against a spec with roughly 100 constraints. Five are actually in scope; the other ninety-five are silent: not applicable, default True. All three checker variants return True on each of the ninety-five. On the five that fire, the variants may or may not agree. Either way, ninety-five trivial Trues plus five real judgments average out to near-perfect agreement. The rubber stamps outnumbered the judgments nineteen to one, and the average drowned them.
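The arithmetic can be worked through directly. The sketch below assumes agreement is defined as the fraction of constraints on which all variants return the same boolean; that definition and the function name are illustrative assumptions.

```python
def agreement_rate(verdicts_per_constraint):
    """Fraction of constraints on which all variants return the same boolean."""
    unanimous = sum(
        1 for verdicts in verdicts_per_constraint if len(set(verdicts)) == 1
    )
    return unanimous / len(verdicts_per_constraint)


# 95 out-of-scope constraints: all three variants fall back to True.
out_of_scope = [(True, True, True)] * 95
# Worst case for the 5 in-scope constraints: the variants split on every one.
in_scope_split = [(True, False, True)] * 5

print(agreement_rate(out_of_scope + in_scope_split))  # 0.95
print(agreement_rate(in_scope_split))                 # 0.0
```

Even when the variants disagree on every judgment they actually make, the headline number stays at 0.95. Restricted to the in-scope constraints, the same verdicts give 0.0.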

The gauge was right. The thing the gauge was calibrated against was wrong.

Why this is structural, not incidental

This isn't a bug in our ensemble. It's the structural failure mode of LLM-as-judge, and it reproduces whenever an LLM-judged evaluation is applied across a distribution the judges haven't been narrowly calibrated for.

LLMs generating evaluation code default to charitable interpretations. A constraint the checker isn't sure about (a scope condition it cannot confidently evaluate, a rule that isn't clearly triggered) becomes a default True. This is the same phenomenon human evaluators exhibit when asked to judge ambiguous cases without a scoring rubric: absent clear criteria, the default is "no objection." For humans, the failure mode is well-documented. For LLM-judges, it is structurally identical and quantitatively worse, because LLM-judges produce confident answers by construction. They are not calibrated to say "I don't know" at the right rate.

An ensemble amplifies this rather than canceling it. The premise of ensemble methods is that independent errors cancel. That holds only if the errors are independent. LLM-judges generated by the same model family from similar prompts make correlated errors; their shared failure mode is the default. What looks like convergence under the ensemble hypothesis is covariance under a shared bias. The ensemble doesn't reveal the ambiguity in the policy; it hides it behind a unified default.
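A small worked comparison shows why independence carries the whole argument. It assumes a three-judge majority vote and two idealized extremes: fully independent errors and fully shared errors. The per-judge error rate of 1/5 is a hypothetical value picked for round numbers, not a measurement.

```python
from fractions import Fraction


def majority_error_independent(p: Fraction) -> Fraction:
    """Error of a 3-judge majority vote when each judge errs independently with probability p."""
    # The majority is wrong when exactly two judges err, or when all three do.
    return 3 * p**2 * (1 - p) + p**3


def majority_error_shared(p: Fraction) -> Fraction:
    """Error of the same vote when all three judges err on exactly the same cases."""
    # Identical mistakes are never outvoted, so the ensemble inherits p unchanged.
    return p


p = Fraction(1, 5)  # hypothetical per-judge error rate
print(majority_error_independent(p))  # 13/125
print(majority_error_shared(p))       # 1/5
```

Under independence, voting cuts the error from 20% to 10.4%. Under a fully shared failure mode it cuts nothing, and the judges agree with each other on every case they get wrong. Real ensembles sit somewhere between the two extremes.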

Published measurements put LLM-as-judge error in the 15–20% range on structured evaluation tasks with clean ground truth. When ground truth is itself ambiguous, as it is in most production policies, the error is likely larger and cannot be measured directly, and it concentrates at exactly the cases where evaluation needs to work: ambiguous scope, underspecified rules, novel distributions. The LLM-judge is most wrong where it is most consulted, and most confident where it is most wrong.

Silent failures look like confidence

The habit we've landed on: for every metric an evaluation pipeline reports, write down the trivial way it could be right for the wrong reason. Most of the time the exercise produces nothing. Once in a while, as it did for us, it produces a change in framing that refactors the whole stack.
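One way to mechanize the habit for an agreement metric is to compare the ensemble against the most trivial checker imaginable, one that returns True unconditionally. The sketch below is an illustration under assumptions: the function names and the 0.95 threshold are hypothetical, and the right threshold depends on the benchmark.

```python
def fraction_matching_constant(verdicts, constant=True):
    """Share of verdicts that a checker returning `constant` unconditionally would also produce."""
    flat = [v for row in verdicts for v in row]
    return sum(1 for v in flat if v == constant) / len(flat)


def looks_vacuous(verdicts, threshold=0.95):
    """Flag a result set that an always-True checker would reproduce almost entirely."""
    return fraction_matching_constant(verdicts) >= threshold


# Hypothetical result set: 95 constraints with unanimous True, 5 with a split vote.
verdicts = [[True, True, True]] * 95 + [[True, False, True]] * 5

print(looks_vacuous(verdicts))  # True
```

A flag here does not prove the evaluation is wrong. It says the reported number is also what a checker with no judgment at all would have produced, which is the cue to look at scope before trusting it.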

The habit is worth the cost because silent failures in LLM evaluation pipelines look like confidence, not like error. Loud failures (the model throws, the checker crashes, the verifier returns UNKNOWN) are the easy case. Every engineer who has shipped an eval pipeline has instrumented for them. Silent failures are the ones that pass the dashboard, move the number in the right direction, and produce a story you want to tell. Our 100% agreement was a silent failure. A benchmark that secretly contains no in-scope traces is a silent failure. A decomposer that generates constraints the source document never supported is a silent failure. A spec the verifier accepts and the policy author would not recognize is a silent failure.

The common thread across these modes: the evaluation system can't tell, from what the LLM returned, whether the judgment was even being made. Confidence on an undefined judgment is the specific failure shape LLM-as-judge produces by construction, and any stack built on it inherits the shape.

Where we went instead

Our bet is that LLM-as-judge is the wrong primitive at the base of a reliable evaluation stack. It's useful for small-scale, human-supervised work — extracting candidate constraints from a prompt, proposing initial classifications, summarizing incidents. It's not the right tool for high-volume, deterministic compliance evaluation, and no amount of better prompting or ensemble engineering closes the gap at the base.

The fix to our specific incident came from separating "does this rule apply?" from "does the trace satisfy it?" in the checker's return signature, a decomposition pattern from FormalJudge (Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136). Under the split signature, ensemble agreement dropped from a fake 1.0 to an honest 0.67, and 38 of 49 traces were correctly marked not-applicable.
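A minimal sketch of what a split return signature could look like, with agreement computed only over pairs where some variant says the rule applies. The `Verdict` type, the `summarize` helper, and the toy data are illustrative assumptions; they are not FormalJudge's interface and do not reproduce the figures above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Verdict:
    applies: bool              # does this rule apply to the trace?
    satisfied: Optional[bool]  # None when the rule does not apply


def summarize(rows: List[List[Verdict]]) -> dict:
    """rows[i] holds the variants' verdicts for one (constraint, trace) pair."""
    not_applicable = 0
    judged = 0
    unanimous = 0
    for variants in rows:
        if not any(v.applies for v in variants):
            not_applicable += 1  # every variant says out of scope: excluded from agreement
            continue
        judged += 1
        if len(set(variants)) == 1:  # same scope decision and same judgment
            unanimous += 1
    return {
        "not_applicable": not_applicable,
        "judged": judged,
        "agreement": unanimous / judged if judged else None,
    }


NA = Verdict(applies=False, satisfied=None)
OK = Verdict(applies=True, satisfied=True)
BAD = Verdict(applies=True, satisfied=False)

# Toy data: four out-of-scope pairs, one unanimous judgment, one split judgment.
rows = [[NA, NA, NA]] * 4 + [[OK, OK, OK], [OK, BAD, OK]]
report = summarize(rows)
print(report)  # {'not_applicable': 4, 'judged': 2, 'agreement': 0.5}
```

With a single boolean, the same six pairs would have scored five unanimous out of six. The split signature reports the out-of-scope pairs separately, and a disagreement about scope itself counts as a disagreement rather than vanishing into the default.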

The larger move is to shift the base from probabilistic judgment to deterministic verification — formal constraints evaluated by an SMT solver over structured traces. That's the subject of The Translation Bottleneck. The narrower claim here: whatever the right shape of a production evaluation stack is, it isn't built on an ensemble of LLM-judges, and vacuous agreement is one concrete reason why.

What we still don't know

We don't know how widespread this failure mode is across the rest of the field. We fixed the specific pathway in our own system. We have a general sense that it's common in LLM-judged work — the MAST taxonomy (Cemri et al., arXiv 2503.13657) catalogues a related class of multi-agent failures where verification is the bottleneck — but a systematic audit would be useful.

If you've hit a similar pattern in your own evaluation pipeline, or you have an ensemble metric that looks suspiciously clean, we'd like to compare notes.

Aiyara is our attempt at deterministic verification over formal constraints, built against this failure mode, among others.

Appendix: papers referenced

FormalJudge — Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136.
Talk is Cheap, Logic is Hard — Prasetya et al., arXiv 2603.17193.
VeriAct — arXiv 2604.00280.
MAST — Cemri, Pan, Yang, Agrawal, Chopra, Tiwari, Keutzer, Parameswaran, Klein, Ramchandran, Zaharia, Gonzalez & Stoica, arXiv 2503.13657.