research03 / 05

The Translation Bottleneck

Five research groups have converged on the same pipeline for agent compliance. Every one of them stalls at the first arrow — natural language to formal constraint.

Published: 2026.03.25
Reading: 9 min
References: 17 papers
Author: Aiyara Research

Five independent research groups have converged on the same three-stage pipeline for agent compliance: natural-language policy → formal constraints → deterministic evaluation. (Why this shape, if it's new.) The third stage is commoditizing — Z3 has been production-quality for fifteen years, and the DSL design space is narrowing. Most published work is on the second stage. The first arrow — translating a natural-language policy into a set of machine-checkable constraints — is where every group stalls.

Every group admits it, in print or in practice. This post is about what "bottleneck" means concretely, what's been tried, and what the field hasn't yet learned to measure.

Why the first arrow is hard

If the constraint is correct, the verdict is trustworthy. "If the constraint is correct" is doing enormous work in that sentence.

Consider a single policy sentence lifted from a retail SOP: "Always verify the customer's identity by calling get_customer_details before processing a refund."

A decomposer has to make at least five decisions before it can emit a formal constraint:

Scope. Does "always" mean every trace, every refund event, or every conversation?
Applicability. Does this rule apply if the refund is below some dollar threshold? What if the customer was already verified earlier in the session?
Operator binding. Is get_customer_details the only acceptable identity-verification tool, or one of several? (In most real tool catalogs, it's one of several.)
Temporal semantics. "Before" means the call must happen earlier in the same trace — but must it be immediately before, or with any arbitrary intervening calls?
Negation structure. What does a violation look like — no verification call anywhere, a verification call after the refund, or a verification call on a different customer?

Each is an interpretation choice a human policy author made implicitly and a decomposer has to make explicitly. The source document — the one humans wrote for other humans — almost never pins them down. A refund SOP contains dozens of sentences like this, each with its own hidden parameters.

This isn't new in its most general form. NL-to-formal translation has been studied in hardware verification (nl2spec, Cosler et al. 2023, 110 citations), smart-city requirements (CitySpec, 2022), software contracts (nl2postcond, 2024), and policy compilation (ContextCov, 2026). What's new is that agent policies are operational — they control tool calls with real-world side effects, on indefinite time horizons, in enough volume that manual spec authoring is infeasible. The automation has to work.

What the cleanest measurement shows

Prasetya et al.'s Talk is Cheap, Logic is Hard (arXiv 2603.17193) benchmarked 24 large language models on formalizing program post-conditions — arguably the easy end of NL-to-formal, because the target language (predicate logic over program variables) is well-scoped and the source (a function signature plus a docstring) is tight. Their headline:

No LLM achieves perfect accuracy, and the gap between generated formalizations and ground-truth formalizations is persistent across model scales.

The most effective mitigation they identify isn't a bigger model. It's auto-generated test suites that catch incorrect formalizations after the fact. The fix for "the LLM produced an incorrect spec" is a second verification layer that doesn't trust the spec.

VeriAct (arXiv 2604.00280) reports a complementary finding: verifier-accepted specifications are frequently incorrect. The verifier says "yes, this trace satisfies this constraint," but the constraint was never a faithful translation of the source policy. The system is self-consistent and wrong.

Lahiri's Intent Formalization (arXiv 2603.17150, single-authored position paper from Microsoft Research) explicitly names "semi-automated metrics for specification quality assessment" as the grand open challenge. The verifier works. The translation into the verifier's input language doesn't, reliably, and we don't yet know how to tell when it has failed.

Two pieces of our own data fit into this picture. When we ran our best decomposer against an unseen SOP-Bench domain, template coverage on customer service fell to ~42% while on patient intake it held 93% — per-domain variance larger than within-domain variance, consistent with the "source doc quality varies wildly" reading. Separately, our first ensemble of generated checkers scored 100% agreement on a held-out benchmark; the number was an artifact of every checker defaulting to the same answer, not real agreement. (Full postmortem.)

The failure modes of NL-to-formal pipelines tend to be silent and structural, not loud and semantic.

The three classes of mitigation the field has tried

1. Constrain the output space

If you fix the target DSL tightly enough, a lot of generation errors become syntactically impossible. Agent-C's contribution is instructive: five composable predicates (Before, After, Seq, Forall, Exists) inside a constrained-decoding pipeline (SynCode / CRANE, arXiv 2403.01632) so every emitted token is guaranteed to fit the grammar. On tau2-bench the system achieves 100% conformance at generation time — a striking number, offset by the fact that Agent-C operates at generation time (it needs token-level model access) and not at evaluation time (post-hoc trace checking is a different modality).

ToolGate (Liu et al., arXiv 2601.04688) makes a related move — per-tool Hoare contracts {P} tool {Q} with a typed symbolic state space. Small formalization surface, but per-step contracts can't express cross-step temporal properties.

The lesson: constraining the output language is a good idea, probably mandatory in production, but it doesn't by itself tell you whether the constraint you emitted is a faithful translation of the source policy. It tells you the constraint is well-formed.

2. Verify the translation, not just the trace

This is the direction Talk is Cheap points toward: after the decomposer emits a candidate constraint, run an additional process that tries to falsify it before admitting it into the verifier.

Program-proof co-evolution (Assured Automatic Programming, arXiv 2410.18494; Draft-and-Prune, arXiv 2603.17233) is one instance: for each policy sentence, generate a (DSL rule, Python check, positive and negative example traces) tuple, iterate until mutual consistency, and back-translate the final rule to English for human review. The back-translation step is the practical hinge — it turns an opaque DSL artifact into something a non-specialist can correct.

The AWS Neurosymbolic NL-Formalization paper (arXiv 2511.09008) reports >99% soundness via redundant formalization cross-checking — running the decomposer multiple times with different prompts and taking the intersection. Same move we ended up making. It trades yield for soundness in a predictable way: the more decomposers you require to agree, the fewer constraints you compile, but the ones you do compile are more trustworthy.

Worth flagging: all "verify the translation" techniques assume the source document is unambiguous. When it's not — and operational policies very frequently aren't — cross-checking only narrows the space of internally consistent readings, not the space of correct ones.

3. Put a human in the specific right place

There's a long line of work on human-in-the-loop NL-to-formal translation: nl2spec for hardware LTL, CitySpec for smart-city requirements, Policy-as-Prompt (Kholkar & Ahuja, arXiv 2509.23994) for policy trees in the agent setting. The design choice that separates these systems is where they insert the human.

Passive correction — the human reviews everything the system produced, catches errors manually — scales poorly. Guided review — the system tells the human which parts it's unsure about and lets them skip the rest — scales better, but only if the "unsure about" signal is trustworthy. This is precisely where Lahiri's position paper locates the open problem: we don't have deterministic metrics for formalization quality that a guided-review interface could key off.

A partial inventory of the deterministic signals we compute during decomposition: inter-decomposer agreement, tool-schema grounding (every referenced tool name and argument must exist), structural text analysis (ambiguity markers, implicit negations, scope anchors), SMT contradiction detection (pairwise consistency over emitted constraints), tool surface coverage (which parts of the agent's action space have any constraint at all), and example-backed validation (every rule must admit at least one positive and one negative trace). None is novel in isolation. What's under-explored — and what Intent Formalization calls out explicitly — is bundling them into a quality signal that behaves like a compiler warning: specific, localized, and deterministic.

What isn't yet measured

Reading the survey back, one thing stands out: nobody has published a benchmark for formalization accuracy itself.

Existing benchmarks measure downstream outcomes. tau2-bench measures whether an agent respects an authoritative policy. Agent-C measures generation-time conformance. FormalJudge measures agreement with human labels on trajectory-level safety. AgentPex measures per-metric scores on behavioral compliance. All of these conflate "did the decomposer produce a correct spec?" with "does the evaluator find the right answer using that spec?" When the final verdict is right, we can't tell which stage deserves credit; when it's wrong, we can't tell which stage to fix.

A formalization-accuracy benchmark would need to be structured differently. Minimally:

A corpus of real operational policy documents — SOPs, system prompts, compliance specs — with known ambiguities. Not sanitized tau-bench-style policies.
Expert-authored gold-standard formal constraints for each document, with disagreements between experts preserved as metadata rather than averaged away.
An evaluation protocol that scores a candidate decomposition on precision (did every emitted constraint correspond to a real rule?), recall (did every real rule surface as at least one constraint?), and grounding (do constraints reference real tools with real arguments?) — separately from any downstream trace-evaluation metric.
Measurement of silent failure modes: vacuous constraints (trivially satisfied by every trace), unreachable constraints (no realistic trace could trigger them), self-conflicting constraints.

We don't have this benchmark. Neither does anyone we've found. It's the piece of shared infrastructure the field needs most and is least incentivized to build — publishing a benchmark that makes your own pipeline look worse than the last paper is a hard sell.

Where this leaves us

The three-stage pipeline is going to be the dominant shape of agent oversight tooling for the next several years. The evaluation stage is commoditizing. The differentiation is moving upstream, into how much trust you can place in the emitted formal spec.

Two claims that survive the current literature:

The bottleneck is upstream, and it's specifically the absence of a formalization-quality signal a human can act on locally. Everything else — constrained decoding, multi-decomposer ensembles, back-translation — is in service of producing that signal.
Aggregate accuracy numbers on the whole pipeline aren't informative about the upstream stage. A 98.6% detection rate on a benchmark is compatible with a decomposer that works perfectly on that benchmark and fails silently elsewhere, and with a decomposer that fails in ways that happen not to affect the benchmark's particular trace distribution. Without per-stage evaluation, we can't distinguish these cases.

The research agenda we'd most like to see the field converge on, in priority order: (a) a policy-formalization accuracy benchmark, structured as above; (b) a set of standard deterministic signals for formalization quality, measured on that benchmark; (c) empirical studies of which signals, in which combinations, actually change the quality of the emitted spec under a guided-review workflow.

A formalization-quality layer that works like compiler warnings for your policy is what we're building toward. Early access.

Appendix: papers cited

Agent-C — Kamath, Zhang & Xu et al., arXiv 2512.23738.
FormalJudge — Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136.
AgentSpec — Wang et al., arXiv 2503.18666.
PCAS — Palumbo et al., arXiv 2602.16708.
ContextCov — Sharma et al., arXiv 2603.00822.
AgentPex (Willful Disobedience) — Sharma et al., arXiv 2603.23806.
Intent Formalization — Lahiri, arXiv 2603.17150 (single-authored).
Talk is Cheap, Logic is Hard — Prasetya et al., arXiv 2603.17193.
VeriAct — arXiv 2604.00280.
nl2spec — Cosler et al., 2023.
nl2postcond — Endres et al., 2024.
ToolGate — Liu et al., arXiv 2601.04688.
Neurosymbolic NL Formalization — AWS, arXiv 2511.09008.
Policy-as-Prompt — Kholkar & Ahuja, arXiv 2509.23994.
Assured Automatic Programming — arXiv 2410.18494.
Draft-and-Prune — arXiv 2603.17233.
SynCode — arXiv 2403.01632.