research05 / 05

A Field Map of Agent Policy Compliance

A reference map of the agent policy compliance landscape — seven families of approach, five research programs worth following, six gaps still open.

Published: 2026.03.10
Reading: 13 min
References: 22 papers
Author: Aiyara Research

This is a reference post: a companion to the more opinionated pieces in the series. If you want the existential framing, start with Agents Are Failing Structurally in Production. If you want the technical depth, The Translation Bottleneck and Why LLM-as-Judge Fails Silently at Scale. What follows is the map: who's working on what, where the seven families of approach sit, and which gaps in the field aren't yet filled.

Why this post exists

Twelve months ago, "does my agent follow its policy?" was an exotic research question answered mostly by the team that built the agent. Today it has thirty-plus named systems, at least five well-funded research programs, and publication velocity in the high tens per quarter.

This isn't a competitive analysis. We have our own system in this space and name it where relevant, but the point is the shape of the field, not our position in it. We've avoided grading or ranking — a system we describe as "probabilistic" isn't worse than one we describe as "deterministic"; they're solving different problems.

Baseline observations

A few things worth stating at the top because they frame everything below.

The field is pre-paradigmatic. There is no shared baseline, no standard benchmark everyone targets, and no canonical definition of what "agent policy compliance" even means. (tau2-bench and AgentDojo are the closest to shared reference points; neither is universal.) New systems regularly introduce their own evaluation metrics, benchmarks, and threat models. Comparing across systems requires re-deriving a common frame from the methods sections.

Attack research is outpacing defense research. Across the papers surveyed, red-team methods routinely achieve 80–90%+ success rates. The best published defenses miss between 10% and 20% of violations on their own test sets. This gap is not news inside the field — it's discussed in STAC (arXiv 2509.25624), ShieldAgent (2503.22738), and AgentDojo — but it does not yet show up in product marketing.

Industry has entered. The taxonomy below treats industry and academic efforts in the same frame because they're now in the same frame. Meta shipped LlamaFirewall. Google published VeriGuard. Microsoft Research has a deliberate three-year program running. Anthropic, Amazon AGI, and Accenture all have relevant papers. This wouldn't have been true twelve months ago.

The pipeline shape has converged; the pieces have not. Multiple independent groups (Agent-C, FormalJudge, AgentSpec, PCAS, ContextCov / AgentPex) have landed on the same three-stage architecture: natural language policy → formal constraints → deterministic trace evaluation. Within that shape, the choices of DSL, solver, and evaluation strategy are actively contested. We've written separately about why the first arrow is the bottleneck.

A taxonomy by approach

Seven families, differentiated by when they intervene and what guarantees they can make. Systems frequently straddle two families; we place them by primary mode of operation.

1. Decode-time constrained generation

Representative systems: Agent-C (Kamath, Zhang & Xu et al., arXiv 2512.23738), SynCode (2403.01632), CRANE, Outlines (production library).

Mechanism: Intervene during token generation. A formal grammar or FOL spec is compiled into a constraint, and the model's next-token distribution is masked to permit only tokens that can lead to valid completions. Equivalent to structured decoding with a semantic filter.

Guarantees: Deterministic (0% false positive by construction, within the grammar). Agent-C reports 100% conformance on tau2-bench.

Caveats: Requires token-level access to the model — not usable with black-box API models unless the provider exposes structured-output primitives. Prevents violations at generation time; doesn't produce a post-hoc trace that a compliance team could audit separately. Agent-C's published repository has been empty (no commits) for four-plus months at time of writing, which is a live question about whether the research program is continuing or pivoting.

Best read for an initial mental model: Agent-C. The 5-predicate DSL (Before / After / Seq / Forall / Exists) is a small enough surface to internalize in an hour.

2. Runtime enforcement DSLs

Representative systems: AgentSpec (Wang et al., Jun Sun group, arXiv 2503.18666; 28 citations, the most-cited single paper in this space), Guardrails-as-Infrastructure, VeriSafe Agent, Agentproof.

Mechanism: Express policy as DSL rules with trigger / condition / action semantics. A runtime evaluates every agent action against active rules; violating actions are blocked or remediated before they reach the tool call.

Guarantees: Deterministic within the DSL's expressiveness. Typical reported prevention rates are 90%+ on code agents and 100% on autonomous-vehicle compliance scenarios.

Caveats: Policies are hand-authored or LLM-generated without validation. No NL-to-formal translation step is part of the system. The DSL's expressiveness is the ceiling — complex temporal properties often fall outside it. Runtime enforcement also creates a hard choice when a rule fires: block (which may strand the agent) or log (which may let the violation through).

Why it's load-bearing: the first published DSL for agent policy was in this family (AgentSpec), and the design choices in later DSLs are downstream of the decisions made here. Jun Sun's group at SMU/NUS is the center of gravity.

3. Probabilistic runtime enforcement

Representative systems: Pro2Guard (Jun Sun, arXiv 2508.00500), ShieldAgent (Bo Li, UIUC, 2503.22738; 52 citations, ICML 2025), ABC (behavioral contracts).

Mechanism: Model the agent's behavior as a DTMC or probabilistic automaton; use model-checking techniques to estimate the probability of violation and gate actions when that probability crosses a threshold.

Guarantees: Statistical rather than deterministic. Typical reported detection rates around 90% with a non-zero false-positive rate.

Caveats: The probability model requires enough trace data to estimate transition weights, which is a bootstrapping problem for new domains. The statistical framing makes it hard to explain a specific violation to a regulator — "this trace had a 23% predicted violation probability" is not an answer that satisfies a compliance audit.

Why it coexists with the deterministic branch: probabilistic methods gracefully handle partial specifications. If you can't write the rule precisely, you can still put a number on it. The deterministic-vs-probabilistic split is the live philosophical argument in this part of the field.

4. LLM-based guardrails

Representative systems: ToolSafe, AIR (Jun Sun, arXiv 2602.11749), DynaGuard, Policy-as-Prompt (Kholkar & Ahuja, 2509.23994).

Mechanism: A second LLM (the "judge" or "guard") reviews proposed agent actions or completed traces and flags policy violations. The guard may itself be prompted with policy text, fine-tuned on labeled traces, or trained on synthetic data.

Guarantees: None formal. Reported accuracy varies widely; typical numbers are in the 80–90% range on the guard's training distribution.

Caveats: LLM-as-judge has a well-documented literature of failure modes — position bias, verbosity bias, self-preference bias. A judge that is on average 90% accurate may be systematically wrong on a specific violation class without anyone noticing. FormalJudge (Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136) reports 16.6% improvement over LLM-as-judge baselines specifically by replacing the judge with a formal verifier. We've written separately about why ensembles of LLM-judges don't fix this.

Why it's dominant despite the caveats: it's the cheapest approach to build. No DSL, no solver, no formal semantics. If you have an LLM and a policy document, you have a system in an afternoon. It's also the approach most easily adapted to new domains.

5. Pattern-based firewalls

Representative systems: LlamaFirewall (Meta, arXiv 2505.03574, 46 citations, open source), AEGIS, PromptGuard2, CodeShield.

Mechanism: Regex, classifier, or embedding-distance-based filters over agent inputs and outputs. Typically bundled as middleware between the agent and the outside world.

Guarantees: Pattern-completeness — if your pattern set covers the threat, you block it. No formal reasoning over sequences of actions.

Caveats: Attack research routinely shows pattern-based defenses are brittle to paraphrase attacks, encoding tricks, and sequential decomposition (STAC). Meta's open-sourcing of LlamaFirewall is probably the most significant industry event of the past year in this space, but the architecture itself is what existed before — just shipped and distributed at scale.

Why it matters anyway: LlamaFirewall is the fastest-growing de-facto standard in production. Research-grade techniques that can't operate alongside or inside it will have an adoption ceiling.

6. Testing, fuzzing, and red-teaming frameworks

Representative systems: ToolEmu, TAI3, ASTRAL, VeriGrey (arXiv 2603.17639), RGASD-family methods, AgentDojo (benchmark).

Mechanism: Generate adversarial test cases for agents. Some use AFL-style coverage-guided fuzzing (VeriGrey). Some use LLM-driven scenario synthesis. Some use combinatorial test design over tool-call sequences.

Guarantees: N/A — these are testing tools, not enforcement tools. Their output is a set of failing traces, not a runtime guarantee.

Caveats: Testing infrastructure is complementary to enforcement infrastructure, not a substitute for it. A testing framework shows you that a violation can happen; it doesn't show you that one did happen in production. The live open question in this family is what a "coverage" metric should mean over a policy space — a separate post on that.

Why we pay attention: the quality of a compliance pipeline is bounded by the quality of its test corpus, and the best test corpora for this problem come out of papers in this family.

7. Post-hoc trace evaluation and failure diagnosis

Representative systems: FormalJudge (Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136), AgentPex / Willful Disobedience (Sharma et al., Microsoft, 2603.23806), ContextCov (2603.00822), AgentRx (Microsoft, 2602.02475), Near-Miss (Amazon AGI, 2603.29665), AgentTrace (2603.14688), our own system.

Mechanism: Evaluate a completed agent trace against a set of formal constraints after the fact. Diagnose which constraints fired, where, and why. Some systems (AgentRx, AgentTrace) add causal analysis over the sequence of failures; some (FormalJudge, our own) use Dafny/Z3 for the core verdict; some (Microsoft's ContextCov / AgentPex pipeline) use LLM-mediated evaluation end-to-end.

Guarantees: Deterministic in the deterministic-verdict subfamily (FormalJudge, ours); LLM-dependent in the rest.

Caveats: Post-hoc means you cannot block a violation that already happened. You can only detect, classify, and cascade-diagnose. That trade is the explicit reason this family exists separately from runtime enforcement: diagnosis quality is higher when the full trace is in hand than when a decision must be made one action at a time.

Why this family is active right now: Microsoft's three-paper sequence (ContextCov, AgentPex, AgentRx) published across a single month in early 2026 is the clearest statement of a full-stack compliance pipeline any industry lab has made. The single program worth monitoring for the state of the art is the Sharma / Barke / Zorn research program at Microsoft.

A summary table

Family	Timing	Guarantees	Representative	Best for
Decode-time constrained generation	Pre-action	Deterministic	Agent-C	Grammar-bounded outputs, open models
Runtime enforcement DSL	Pre-action	Deterministic	AgentSpec	Explicit allow/block decisions
Probabilistic runtime enforcement	Pre-action	Statistical	Pro2Guard, ShieldAgent	Partial specs, ambiguous rules
LLM-based guardrails	Pre/post	None formal	AIR, DynaGuard	Fast start, ambiguous domains
Pattern-based firewalls	Pre/post	Pattern coverage	LlamaFirewall	Production middleware
Testing / fuzzing	Pre-deploy	N/A (detective)	VeriGrey, AgentDojo	Test corpus generation
Post-hoc trace evaluation	Post-action	Deterministic or LLM-based	FormalJudge, AgentPex	Audit, diagnosis, cascade analysis

Groups worth following

Five research programs account for a disproportionate share of the weight in this field, in rough order of how much attention we pay:

Microsoft Research (Sharma, Barke, Zorn). The most coherent end-to-end program in the field. SPML (2024) established a DSL for prompt defense; PromptPex (2025) extracted input/output specs from prompts; ContextCov (March 2026) scaled to repository-level instruction files with 46,000 enforcement checks across 723 repos; AgentPex / Willful Disobedience (March 2026) applied the extraction to trace-level compliance; AgentRx (February 2026) added failure diagnosis with a 9-category taxonomy. Three papers in a single month is an unusual cadence. The program covers the full three-stage pipeline and is the closest thing to a product vision from an industry lab.

UIUC FOCAL Lab (Singh, Banerjee, Suresh). The strongest formal-methods group working on agents. Eight papers in twelve months. Clear arc from constrained-generation infrastructure (SynCode, CRANE, DINGO) to agent safety (Agent-C, BEAVER) to red-teaming (TRAP, ToolCert) to self-evolving agents (SEVerA). Agent-C's 5-predicate DSL has been quietly influential. The Agent-C GitHub repository has been empty (no commits) for four-plus months, and there is no visible product engagement, so the program's commercial trajectory is unclear.

Jun Sun group (SMU/NUS). The most cited researcher in this space. AgentSpec (28 citations, ICSE 2026) is the foundational DSL paper; Pro2Guard, AIR, ClawWorm, SafeClaw-R extend runtime enforcement and incident response. Jun Sun has a deep formal-methods background (neural network verification) and publishes at high velocity. The trajectory looks purely academic — no commercialization activity is visible — but it's a strong signal for what techniques will be published next.

Amazon AGI (Rabinovich, Anaby-Tavor) and AWS (Neurosymbolic NL Formalization team). Near-Miss (arXiv 2603.29665) is a striking recent failure-detection paper: agents that pass the task outcome but skip a required compliance check, 8–17% of trajectories across tau2-bench. Separately, the AWS Neurosymbolic NL-Formalization paper (2511.09008, 27 authors) reports >99% soundness via redundant cross-checking — independent validation of the multi-decomposer intersection approach. If these two teams converge inside Amazon, it would produce a direct competitor to the Microsoft program.

Bo Li group (UIUC) — ShieldAgent. ICML 2025 publication, 52 citations — the highest-visibility single paper in the field. Probabilistic rule circuits, web-agent-specific. Architecturally different from the formal branch; we follow the program for how the probabilistic family evolves rather than as a direct methodological influence.

Honorable mentions. Somesh Jha (UW-Madison) — PCAS (arXiv 2602.16708), Datalog-derived policy language. Shuvendu Lahiri (Microsoft) — Intent Formalization (2603.17150), a position paper that explicitly names the specification-quality bottleneck as the open problem. Ion Stoica and Matei Zaharia (Berkeley) — Specifications: The Missing Link (2412.05299), institutional validation of the spec-first paradigm from prominent systems researchers.

Where the gaps still are

Six things we cannot find published work on, after the review. These are the open spots on the map, in the sense that they are either not addressed or addressed only partially.

1. Specification quality signals as a product surface. Lahiri's Intent Formalization position paper explicitly calls this out. Multiple groups compute related signals internally (grounding validation, inter-decomposer agreement) but nobody bundles them as a deliberate onboarding artifact for engineers writing specs. Nothing in the literature resembles "compiler warnings for your policy document."

2. Deterministic causal failure diagnosis at the verdict level. AgentRx (Microsoft) does LLM-based failure localization. AgentTrace does causal graph reconstruction from raw logs. Near-Miss detects that a latent failure happened. Nobody, in work we've seen, uses a formal dependency graph over constraints to produce a deterministic root-cause verdict tied to specific trace evidence. We do this in our own system, and we're aware we may have missed a paper.

3. A shared benchmark for formalization accuracy. Every benchmark we know of in this field measures end-to-end pipeline behavior — did the agent + guard combination produce the right outcome? No published benchmark isolates the NL-to-formal stage and asks whether the emitted spec is a faithful translation of the source policy, independent of downstream trace evaluation. (Why this matters.)

4. Coverage over constraint spaces. Nothing standard. Specification coverage (Bartocci), property-pattern coverage (Cabrera Castillos), and combinatorial coverage (Jin, Yang) all exist in neighboring literatures that don't cite each other. The unification is the research opportunity.

5. Handling of policy ambiguity and conflict at the source-document level. ConRAP (arXiv 2403.08053) does ambiguity detection for contracts (F2 = 0.87) using LLM-based methods, in a different domain. No agent-focused system we know of surfaces source-document ambiguities deterministically, produces structured clarification questions, or detects implicit-negation structure as an onboarding step.

6. What happens in the presence of conflicting policies. Real compliance documents contradict themselves regularly. Several DSL papers mention this in passing but none, to our reading, publishes a systematic approach to conflict detection across the full extracted policy, let alone resolution semantics. Z3 pairwise consistency checks handle some of this; the full problem remains open.

What a reasonable reader should take away

Three summary claims:

The three-stage pipeline (NL → formal → deterministic eval) is the direction the field is going. Within it, the first arrow is the bottleneck and the third stage is commoditizing.
The probabilistic and deterministic branches are solving different problems, not competing for the same niche. The right comparison is not "which is better" but "which is appropriate for this policy and this risk profile."
The industry side of the field (Microsoft, Meta, Google, Amazon) entered in a serious way in the past twelve months. The academic field as it existed in early 2025 is now one part of a larger ecosystem with different incentives and different time horizons. Expect more movement from the industry side than from the academic side over the next twelve months.

If you're building in this space and your work fits cleanly into one of the seven families above, you're probably in a contested area. If it fits in one of the six gaps, you're probably in an uncontested one. Both are reasonable places to be.

Appendix: papers cited, with arXiv IDs where available

Agent-C — Kamath, Zhang & Xu et al., arXiv 2512.23738.
AgentSpec — Wang et al., arXiv 2503.18666.
Pro2Guard — Jun Sun et al., arXiv 2508.00500.
ShieldAgent — Bo Li et al., arXiv 2503.22738.
AIR — Jun Sun et al., arXiv 2602.11749.
LlamaFirewall — Meta, arXiv 2505.03574.
VeriGuard — Google, arXiv 2510.05156.
ContextCov — Sharma et al., Microsoft, arXiv 2603.00822.
AgentPex (Willful Disobedience) — Sharma et al., Microsoft, arXiv 2603.23806.
AgentRx — Barke et al., Microsoft, arXiv 2602.02475.
Near-Miss — Rabinovich et al., Amazon, arXiv 2603.29665.
AgentTrace — arXiv 2603.14688.
FormalJudge — Zhou, Lou, Sheng, Fu & Yang, arXiv 2602.11136.
Policy-as-Prompt — Kholkar & Ahuja, arXiv 2509.23994.
PCAS — Palumbo et al., arXiv 2602.16708.
Intent Formalization — Lahiri, arXiv 2603.17150 (single-authored).
Specifications: The Missing Link — Stoica & Zaharia et al., arXiv 2412.05299.
Neurosymbolic NL Formalization — Amazon/AWS, arXiv 2511.09008.
SynCode — arXiv 2403.01632.
STAC — arXiv 2509.25624.
VeriGrey — Zhang et al., arXiv 2603.17639.
ConRAP — arXiv 2403.08053.