A Proposal for Coverage Over Constraint Spaces
No agent-policy coverage framework is currently published. A first sketch — four dimensions from adjacent fields, plus one that isn't named anywhere.
This is a proposal, not a description. No agent-policy coverage framework is currently published; no shared baseline exists. What follows is a first sketch: four dimensions drawn from adjacent fields, plus one piece we can't find named in the prior literature. (Why this matters, if new: the translation bottleneck.)
The question that doesn't have an answer yet
If you run ten thousand traces through a policy evaluator and every constraint reports "no violation," two things could be true:
- Your agent genuinely complies with the policy, or
- Your test suite isn't exercising the parts of the policy that would have shown violations.
Distinguishing those two is what coverage is for. In software engineering, we have forty years of vocabulary for it — line coverage, branch coverage, MC/DC, mutation score. In formal verification, specification coverage, property-pattern coverage, LTL automaton traversal. In combinatorial testing, t-wise coverage of parameter interactions.
For the specific question "did my test suite exercise the policy constraints of an AI agent well enough?", none of those existing frameworks transfers cleanly. The rest of this post walks through what the adjacent fields know, why their contributions don't compose, and what a coverage framework for agent policy spaces would need to look like. The pieces exist, they've been studied carefully in fields that don't cite each other, and nobody in the agent compliance space has yet put them together.
Why code coverage doesn't carry over
Line coverage asks: which source lines did any test execute? Branch coverage: which conditional branches? MC/DC, more sharply: does each condition in each boolean expression independently affect the outcome?
For testing the implementation of a policy evaluator, those are still useful. But the thing we're testing isn't the evaluator; it's the agent, against the policy, evaluated by the evaluator. The evaluator's code is fixed and its verdicts are deterministic; whether that code is correct is a question for ordinary code coverage. What varies is the traces the agent produces, and what we want to measure is whether our test traces exercised the policy thoroughly.
The policy itself isn't code in a conventional sense. It's a set of declarative constraints, usually with interacting applicability conditions, some with temporal ordering semantics, some with numeric boundaries. The analogue of a "line" or a "branch" in that representation isn't obvious. Specification testing has studied this for two decades; the answers are just distributed across literatures that don't converse.
Piece 1: specification automaton coverage (from CPS verification)
The most directly applicable prior work comes from cyber-physical systems. Bartocci et al. (2020) translate a signal temporal logic (STL) specification into a symbolic automaton and define coverage as the fraction of automaton states and transitions visited by a test suite. Cabrera Castillos et al. (2014) apply the same idea to automata derived from temporal property patterns: always, eventually, never, precedence, response.
What transfers. Agent policies include many sentences with the same shape as STL properties: "always verify identity before a refund," "eventually resolve the ticket," "never process a refund for a shipped order." Compiling each into a small automaton and measuring which states and transitions are exercised across a trace corpus gives a concrete, principled coverage number.
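A minimal sketch of what that measurement could look like, using the rule "always verify identity before a refund" hand-compiled into a three-state automaton. The event names (`verify`, `refund`), the state names, and the automaton itself are illustrative assumptions on our part, not the construction from Bartocci et al.

```python
from typing import Iterable

# Hypothetical hand-built automaton for "verify identity before any refund".
# States: "unverified" (start), "verified", "violated" (absorbing).
TRANSITIONS = {
    ("unverified", "verify"): "verified",
    ("unverified", "refund"): "violated",
    ("verified", "verify"): "verified",
    ("verified", "refund"): "verified",
    ("violated", "verify"): "violated",
    ("violated", "refund"): "violated",
}
STATES = {"unverified", "verified", "violated"}


def automaton_coverage(traces: Iterable[list[str]]) -> tuple[float, float]:
    """Return (state coverage, transition coverage) over a trace corpus."""
    seen_states: set[str] = set()
    seen_transitions: set[tuple[str, str]] = set()
    for trace in traces:
        state = "unverified"
        seen_states.add(state)
        for event in trace:
            if (state, event) not in TRANSITIONS:
                continue  # events this rule ignores leave the state unchanged
            seen_transitions.add((state, event))
            state = TRANSITIONS[(state, event)]
            seen_states.add(state)
    return (len(seen_states) / len(STATES),
            len(seen_transitions) / len(TRANSITIONS))


corpus = [
    ["greet", "verify", "refund"],
    ["greet", "verify", "verify"],
]
state_cov, transition_cov = automaton_coverage(corpus)
```

On this two-trace corpus, two of the three states and three of the six transitions are visited. The `violated` state is never reached, so the suite says nothing about how a refund without verification is handled.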
What doesn't. Agent policies are discrete, not continuous-signal. The boundary between "compliant" and "non-compliant" in a CPS setting is often a continuous surface; in an agent setting it's usually a discrete event. Most of the CPS coverage literature's machinery for signal robustness and boundary proximity has analogues in the agent case but hasn't been worked out.
Piece 2: constrained combinatorial testing
The combinatorial-testing literature asks: if your system has many input parameters, each with a domain of possible values, which combinations need testing? The answer, from Kuhn and collaborators over roughly twenty years: in the systems they studied, nearly all failures were triggered by interactions of at most six parameters, and most by a single parameter or by pairwise or three-way interactions.
The practical metric is t-wise coverage: the fraction of all possible t-way value combinations covered by at least one test in the suite. Pairwise (t=2) is the default. Constrained combinatorial testing (Jin et al., 2021; Yang et al., 2026) extends this to handle real-world parameter dependencies: some value combinations are impossible and should be excluded from the denominator.
What transfers. Agent policies are full of constraint interactions. "Refunds above $500 require manager approval and same-day verification and a documented reason." A suite that exercises each condition individually but never all three together may pass every rule in isolation and fail the conjunction. t-wise coverage over the constraint-activation space is the metric we want.
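As a rough sketch of that metric, assume each trace has already been reduced to the set of constraints that fired on it. The function name, the constraint names, and the `infeasible` parameter (combinations that cannot co-occur and are dropped from the denominator) are all hypothetical.

```python
from itertools import combinations


def twise_activation_coverage(
    activations: list[set[str]],
    constraints: list[str],
    t: int = 2,
    infeasible: frozenset[frozenset[str]] = frozenset(),
) -> float:
    """Fraction of feasible t-sets of constraints that fired together on some trace."""
    feasible = [
        frozenset(combo)
        for combo in combinations(sorted(constraints), t)
        # A t-set is infeasible if it contains any known-impossible combination.
        if not any(bad <= frozenset(combo) for bad in infeasible)
    ]
    if not feasible:
        return 1.0  # nothing to cover
    covered = sum(
        1 for combo in feasible
        if any(combo <= fired for fired in activations)
    )
    return covered / len(feasible)


constraints = ["manager_approval", "same_day_verification", "documented_reason"]
activations = [
    {"manager_approval", "same_day_verification"},
    {"same_day_verification", "documented_reason"},
    {"manager_approval", "documented_reason"},
]
pairwise = twise_activation_coverage(activations, constraints, t=2)
three_way = twise_activation_coverage(activations, constraints, t=3)
```

This toy suite reaches full pairwise coverage and zero three-way coverage: every pair fires together somewhere, but the full conjunction never does.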
What doesn't. Combinatorial testing treats a test as an unordered assignment of values to parameters; constraints can rule combinations out, but they say nothing about order. Agent policies have temporal dependencies: constraint A's activation depends on what the agent did earlier. We haven't found a treatment of what "pairwise coverage" means when the thing being covered is itself a sequence of constraint activations. The right adaptation is probably t-wise coverage over tuples of (constraint, activation-order), but we haven't seen it written down.
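One possible reading of that adaptation for t = 2, offered as a sketch rather than an established definition: count ordered pairs (a, b) for which some trace has constraint a first firing before constraint b. The function and constraint names are hypothetical, and each trace is assumed to be reduced to the sequence of constraints in the order they fired.

```python
from itertools import permutations


def ordered_pair_coverage(activation_orders: list[list[str]],
                          constraints: list[str]) -> float:
    """Fraction of ordered pairs (a, b) where a first fired before b on some trace."""
    targets = set(permutations(constraints, 2))
    if not targets:
        return 1.0
    seen: set[tuple[str, str]] = set()
    for order in activation_orders:
        # Keep only the first activation of each constraint, preserving order.
        first_seen = list(dict.fromkeys(order))
        for i, a in enumerate(first_seen):
            for b in first_seen[i + 1:]:
                seen.add((a, b))
    return len(seen & targets) / len(targets)


constraints = ["verify_identity", "check_order_status", "issue_refund"]
orders = [
    ["verify_identity", "check_order_status", "issue_refund"],
    ["check_order_status", "verify_identity", "issue_refund"],
]
ordered_cov = ordered_pair_coverage(orders, constraints)
```

Here every unordered pair co-occurs, so plain pairwise coverage would read 100%, while only four of the six ordered pairs are exercised: no trace has `issue_refund` firing before either of the other two.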
Piece 3: mutation testing
Mutation testing asks a sharper question: given your test suite, if I deliberately introduce a small bug into the system, will any test catch it? The fraction of introduced bugs ("mutants") caught is the mutation score. It is one of the strongest adequacy criteria software engineering has.
Fellner et al. (2019) extended mutation testing to hyperproperties (properties over sets of traces rather than individual traces) using HyperLTL. The technical move handles non-determinism: a mutant may change behavior only on some runs, and HyperLTL, which quantifies over traces, lets you formalize "some run of the mutant differs observably from the original" as a first-class property.
What transfers. Mutation operators over policies are well-defined. Take refund <= $500 and produce refund <= $510. Take verify before refund and produce verify after refund. Flip an allow to a deny. The set of plausible policy mutations is finite, enumerable, and human-interpretable — a suite's mutation score over policy mutants is a concrete, meaningful number.
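A small sketch of a policy mutation score, using the three mutations named above. The trace shape, the `make_policy` helper, and the rule itself are illustrative assumptions; a mutant counts as killed when at least one trace in the suite gets a different verdict from the mutant than from the original policy.

```python
from typing import Callable

Trace = dict  # hypothetical shape: {"refund": float, "events": ["verify", "refund", ...]}
Policy = Callable[[Trace], bool]  # returns True when the trace is compliant


def make_policy(limit: float = 500.0, verify_first: bool = True) -> Policy:
    """Build 'refund <= limit AND verify ordered relative to refund'.

    Assumes every trace contains both a "verify" and a "refund" event.
    """
    def policy(trace: Trace) -> bool:
        events = trace["events"]
        verify_at, refund_at = events.index("verify"), events.index("refund")
        ordered = verify_at < refund_at if verify_first else refund_at < verify_at
        return trace["refund"] <= limit and ordered
    return policy


original = make_policy()
mutants = {
    "threshold 500 -> 510": make_policy(limit=510.0),
    "verify before -> after": make_policy(verify_first=False),
    "allow <-> deny": lambda trace: not original(trace),
}


def mutation_score(suite: list[Trace]) -> float:
    """Fraction of mutants whose verdict differs from the original on some trace."""
    killed = sum(
        any(mutant(t) != original(t) for t in suite)
        for mutant in mutants.values()
    )
    return killed / len(mutants)


suite = [
    {"refund": 100.0, "events": ["verify", "refund"]},
    {"refund": 900.0, "events": ["verify", "refund"]},
]
score_before = mutation_score(suite)
score_after = mutation_score(suite + [{"refund": 505.0, "events": ["verify", "refund"]}])
```

With the two-trace suite the threshold mutant survives, because no refund falls between $500 and $510; adding a $505 trace kills it. A surviving mutant points at a specific missing test.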
What doesn't. The canonical mutation-testing literature assumes a deterministic system under test. LLM agents are non-deterministic; the same input may produce different traces on different runs. Fellner's hyperproperty framing handles this in principle; we aren't aware of an implementation that has scaled to the size of agent policy corpora.
Piece 4: boundary coverage
Marculescu and Feldt (2018) define a coverage metric over boundaries: for each constraint, have we tested inputs near the boundary from both the valid and invalid sides? The intuition is that the interesting failures happen at boundaries, not in the interior.
What transfers. Nearly verbatim. Every numeric constraint in an agent policy has boundaries (refund threshold, time window, argument range). Every categorical constraint has a boundary between activation and non-activation. Boundary coverage — the fraction of constraints with at least one trace on each side — is an immediately computable, useful metric.
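For numeric constraints of the form "value <= threshold", the metric might be computed as below. The `NumericConstraint` type, its field names, and the `margin` parameter (how close to the threshold counts as "near") are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericConstraint:
    name: str
    field: str        # key read from each trace
    threshold: float  # constraint is "value <= threshold"
    margin: float     # how close to the threshold counts as near the boundary


def boundary_coverage(constraints: list[NumericConstraint],
                      traces: list[dict]) -> float:
    """Fraction of constraints with a near-boundary trace on both sides."""
    if not constraints:
        return 1.0
    covered = 0
    for c in constraints:
        values = [t[c.field] for t in traces if c.field in t]
        valid_side = any(c.threshold - c.margin <= v <= c.threshold for v in values)
        invalid_side = any(c.threshold < v <= c.threshold + c.margin for v in values)
        if valid_side and invalid_side:
            covered += 1
    return covered / len(constraints)


constraints = [
    NumericConstraint("refund_limit", "refund_amount", threshold=500, margin=25),
    NumericConstraint("return_window", "days_since_purchase", threshold=30, margin=3),
]
traces = [
    {"refund_amount": 495, "days_since_purchase": 10},
    {"refund_amount": 510, "days_since_purchase": 29},
    {"refund_amount": 50, "days_since_purchase": 45},
]
coverage = boundary_coverage(constraints, traces)
```

The refund limit is probed from both sides (495 and 510). The return window is not: 45 days is on the invalid side but well outside the margin, so only one of the two constraints counts as covered.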
What doesn't. Marculescu and Feldt's formulation assumes a scalar output whose distance to the boundary is measurable. Agent policies produce discrete compliance verdicts, so the distance to the boundary has to be defined at the input level. For temporal constraints, the analogous "distance" is something like how close, in the trace ordering, the constraint came to being violated, which is well-defined but less explored.
Piece 5: the piece we can't find named elsewhere
The four above each contribute a dimension. The one we don't find published work on — and which is specific to agent policy compliance — is applicability coverage.
A policy constraint has two layers: "does this constraint apply to this trace?" and "does the trace satisfy it?" The first layer is usually given short shrift in the adjacent literatures, because applicability is typically decided in advance (you know a refund rule applies because this trace contains a refund). In agent systems, applicability is itself a function of the agent's behavior and is frequently the most error-prone part of the spec. A constraint that accidentally applies to every trace is vacuous. A constraint that never applies to any trace is dead.
We hit this directly in our own pipeline — an ensemble cross-check reported 100% agreement, and the number was vacuous because every checker was returning the same default for every trace without ever evaluating whether the rule applied. The incident writeup is separate; the fix came from separating applicability from satisfaction at the checker return signature. Applicability coverage — the fraction of constraints that activated at least once across a trace corpus — is a first-class coverage dimension, and it isn't reducible to any of the four pieces above.
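To illustrate the kind of return signature meant here (a sketch, not our production code): a checker returns applicability and satisfaction as separate fields, so "did not apply" can never be mistaken for "passed". The `Verdict` type and the example checker are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    applicable: bool
    satisfied: Optional[bool]  # None when the constraint did not apply

    def __post_init__(self) -> None:
        # Reject the two inconsistent combinations up front.
        if self.applicable == (self.satisfied is None):
            raise ValueError("satisfied must be set if and only if applicable is True")


def check_refund_requires_verification(events: list[str]) -> Verdict:
    """Hypothetical checker for 'verify identity before a refund'."""
    if "refund" not in events:
        return Verdict(applicable=False, satisfied=None)
    verified_before = "verify" in events[: events.index("refund")]
    return Verdict(applicable=True, satisfied=verified_before)
```

A checker that returns a bare boolean has to pick a default for traces with no refund, and that default is exactly what produced our vacuous agreement number.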
A minimal applicability-coverage metric:
- Activation coverage. Fraction of constraints that fired (applicability = true) on at least one trace.
- Activation diversity. For each constraint, the number of distinct traces it fired on, relative to corpus size.
- Outcome diversity. For each constraint, has it been observed with both compliant and non-compliant outcomes?
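The three quantities above can be sketched in a few lines, assuming each trace's evaluation is a mapping from constraint name to an `(applicable, satisfied)` pair. The function name, the result shape, and the constraint names are hypothetical.

```python
from collections import defaultdict
from typing import Optional

# One dict per trace: constraint name -> (applicable, satisfied or None).
TraceResult = dict[str, tuple[bool, Optional[bool]]]


def applicability_metrics(constraints: list[str],
                          results: list[TraceResult]) -> dict:
    fired: dict[str, int] = defaultdict(int)       # traces each constraint fired on
    outcomes: dict[str, set] = defaultdict(set)    # verdicts seen when it fired
    for trace_result in results:
        for name, (applicable, satisfied) in trace_result.items():
            if applicable:
                fired[name] += 1
                outcomes[name].add(satisfied)
    n_traces = len(results)
    n_constraints = len(constraints)
    return {
        "activation_coverage": (
            sum(1 for c in constraints if fired[c] > 0) / n_constraints
            if n_constraints else 1.0
        ),
        "activation_diversity": {
            c: (fired[c] / n_traces if n_traces else 0.0) for c in constraints
        },
        "outcome_diversity": {c: outcomes[c] == {True, False} for c in constraints},
    }


constraints = ["refund_needs_verify", "no_refund_if_shipped", "escalate_on_threat"]
results = [
    {"refund_needs_verify": (True, True),
     "no_refund_if_shipped": (False, None),
     "escalate_on_threat": (False, None)},
    {"refund_needs_verify": (True, False),
     "no_refund_if_shipped": (True, True),
     "escalate_on_threat": (False, None)},
]
metrics = applicability_metrics(constraints, results)
```

In this toy corpus `escalate_on_threat` never fires, so activation coverage is two out of three, and only `refund_needs_verify` has been seen with both outcomes.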
A suite in which 20% of your constraints never activate has a blind spot, regardless of how good boundary coverage, pairwise coverage, or mutation score is on the other 80%. We haven't seen this quantity named in the literature, and we think it should be.
Putting the pieces together
A coverage framework for agent policy compliance pulls the five dimensions into a prioritization. The cheap metrics run continuously; the expensive ones run at release gates.
Cheap to compute every evaluation run:
- Activation coverage. Did each constraint fire at least once?
- Outcome diversity. Did each constraint produce both compliant and non-compliant verdicts?
These fall out of regular evaluation at essentially no additional cost and are the first line of defense against silent spec failures.
Worth computing periodically (release gates, weekly rollups):
- Boundary coverage. For each constraint with a boundary, at least one trace on each side.
- Pairwise interaction coverage. For each pair of constraints, at least one trace where both fire.
These catch most interaction bugs the first tier misses. Pairwise coverage over even a moderately sized constraint set (50 constraints → 1,225 pairs) is tractable with adversarial test generation.
Worth computing before high-stakes decisions (major releases, compliance audits):
- MC/DC-style constraint coverage. Each condition in each compound constraint independently affects the outcome.
- t-wise coverage for t ≥ 3. Three-way and higher constraint interactions.
- Temporal-sequence coverage. For constraints with ordering semantics, the set of tested orderings.
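For the first item in this tier, a sketch of a unique-cause MC/DC check over a compound constraint: a condition counts as covered when two observed condition vectors differ only in that condition and produce different outcomes. The function name and the three-condition example (manager approval, same-day verification, documented reason) are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable

ConditionVector = tuple[bool, ...]


def mcdc_covered_conditions(
    observed: set[ConditionVector],
    decision: Callable[[ConditionVector], bool],
) -> set[int]:
    """Indices of conditions shown to independently affect the outcome."""
    covered: set[int] = set()
    for a, b in combinations(observed, 2):
        differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        # Unique-cause MC/DC: exactly one condition changes, and the outcome flips.
        if len(differing) == 1 and decision(a) != decision(b):
            covered.add(differing[0])
    return covered


# Conditions: (manager_approval, same_day_verification, documented_reason).
# The compound constraint is their conjunction.
observed = {
    (True, True, True),
    (False, True, True),
    (True, False, True),
}
covered = mcdc_covered_conditions(observed, all)
mcdc_coverage = len(covered) / 3
```

Two of three conditions are shown to matter independently. The suite has no vector where only the documented reason is missing, so a policy that silently ignored that condition would go unnoticed.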
Worth computing on demand:
- Constraint mutation score. Fraction of policy mutations detected by the suite.
- Probabilistic coverage. For non-deterministic agents, coverage expressed as expectations over runs.
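One simple way to read "coverage expressed as expectations over runs", as a sketch: execute the same suite several times, record which constraints fired in each run, and report the mean activation coverage along with how often each constraint fired. The function names and the constraint names are hypothetical, and this is only one of several reasonable definitions.

```python
from statistics import mean


def expected_activation_coverage(runs: list[set[str]],
                                 constraints: list[str]) -> float:
    """Mean, over repeated suite runs, of the fraction of constraints that fired."""
    universe = set(constraints)
    return mean(len(fired & universe) / len(universe) for fired in runs)


def activation_frequency(runs: list[set[str]],
                         constraints: list[str]) -> dict[str, float]:
    """For each constraint, the fraction of runs in which it fired at least once."""
    return {c: sum(c in fired for fired in runs) / len(runs) for c in constraints}


constraints = ["refund_needs_verify", "no_refund_if_shipped",
               "escalate_on_threat", "log_pii_access"]
# Constraints that fired in each of four runs of the same scenarios.
runs = [
    {"refund_needs_verify", "no_refund_if_shipped"},
    {"refund_needs_verify"},
    {"refund_needs_verify", "no_refund_if_shipped", "escalate_on_threat"},
    {"refund_needs_verify"},
]
expected = expected_activation_coverage(runs, constraints)
frequency = activation_frequency(runs, constraints)
```

Any single run here would have reported between 25% and 75% activation coverage. The per-constraint frequencies separate constraints that fire reliably from those that fire only on occasional runs and those that never fire.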
We're deliberately not proposing numeric thresholds. The right levels depend on the test-generation efficiency of the specific system and the risk profile of the policy being tested — 70% activation coverage on a consumer-refund agent and 70% activation coverage on a medical-records agent are not the same thing.
What's actually open
Three research questions that are genuinely open, in the sense that we've looked for prior work and not found it:
Non-determinism. Every coverage metric above assumes that a given trace, evaluated against a given constraint, produces a deterministic verdict. That holds for the evaluator. The problem is that the agent is non-deterministic: the same scenario may produce different traces. How should coverage be measured when the input-to-trace function is itself probabilistic? Fellner's hyperproperty approach gives a theoretical answer; the empirical work hasn't been done.
Policy-size scaling. Real operational policies have hundreds of constraints. Pairwise coverage over 500 constraints is 124,750 pairs; three-way is 20.7 million triples. At some point exhaustive enumeration stops being feasible and you need sampling-based estimators. The combinatorial-testing literature has answers here (IPOG, locating arrays) that haven't been transferred to the policy setting.
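The counts quoted in this post are binomial coefficients, n choose t, and can be checked directly. The helper name is ours.

```python
from math import comb


def interaction_targets(n_constraints: int, t: int) -> int:
    """Number of distinct t-sets of constraints: n choose t."""
    return comb(n_constraints, t)


pairs_50 = interaction_targets(50, 2)       # 1,225
pairs_500 = interaction_targets(500, 2)     # 124,750
triples_500 = interaction_targets(500, 3)   # 20,708,500
```

Going from pairs to triples at 500 constraints multiplies the number of targets by 166, which is why exhaustive enumeration runs out quickly as t grows.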
Coverage-to-risk mapping. All five dimensions above are surrogate metrics. They measure test-suite quality as a proxy for the thing we actually want, which is confidence that production traces don't contain violations the suite would have missed. The mapping from coverage numbers to residual-risk estimates is well-studied in software testing (Musa's reliability theory, empirical mutation-score-to-bug-rate correlations) and mostly unexplored in the agent setting. This is the direction we'd most like to see someone take seriously. It's also the one most likely to take years.
Where we are
We've implemented activation and outcome coverage in our own system and partial boundary / pairwise coverage. The numbers are useful internally — they've caught at least one spec-regression that would otherwise have shipped — and we don't yet have the data to say whether they're useful externally.
What we feel solid on: the five dimensions above are necessary. A framework missing any one of them has a known blind spot we can demonstrate with concrete cases. Whether these five are sufficient, and whether the prioritization above is the right way to compose them, is work the field is going to have to keep doing.
If you're working on testing tools for AI agents, we'd like to compare notes. The combinatorial-testing and mutation-testing communities have decades of answers the agent compliance community hasn't yet asked its questions of; the fastest path forward is probably moving those answers across the gap.
Appendix: sources
- Bartocci, E., Deshmukh, J., Donze, A., Fainekos, G., Maler, O., Nickovic, D., & Sankaranarayanan, S. (2020). Specification-based Monitoring of Cyber-Physical Systems. arXiv 2010.06674.
- Cabrera Castillos, K., Dadeau, F., Julliand, J., Kanso, B., & Taha, S. (2014). A Compositional Automata-based Semantics and Preserving Transformation Rules for Testing Property Patterns. Formal Aspects of Computing.
- Fellner, A., Krenn, W., Schlick, R., Tarrach, T., & Weissenbacher, G. (2019). Model-based, mutation-driven test case generation via heuristic-guided branching search. TOSEM.
- Jin, H., Kuhn, R., Kacker, R., Lei, Y., & Hu, D. (2021). IPOG-D: Efficient test generation for high-strength combinatorial testing. Software Testing, Verification and Reliability.
- Kuhn, D. R., Wallace, D. R., & Gallo, A. (2004). Software fault interactions and implications for software testing. IEEE TSE.
- Marculescu, B., & Feldt, R. (2018). Finding and Correcting Corner Cases via Boundary-based Search. arXiv 1810.06720.
- Yang, B., Zhang, Z., Nie, C., & Wu, H. (2026). Constraint-aware locating arrays for fault identification.