The Missing Layer: Verifiable Rewards for Real-World AI Agents

Why the shift from academic reasoning to real-world agent verification demands new infrastructure, and what that infrastructure looks like.

TL;DR

vr.dev provides deterministic, composable verifiers that let you prove what an AI agent actually changed in the real world.

Current AI benchmarks test reasoning in sandboxes, but real-world agents modify files, send emails, update databases, and call APIs. vr.dev is an open-source SDK and hosted API with 38+ verifiers across 19 domains that produce cryptographically signed evidence envelopes. Verifiers are composable, state checks are deterministic, and evidence can optionally be anchored on-chain, giving you a ground-truth reward signal for RL training, CI/CD gating, and production monitoring of autonomous agents.

vr.dev | March 2026 | ~30 min read

The RLVR Paradigm Shift

In January 2025, DeepSeek-R1 demonstrated something the field had theorized but never proven at scale: pure reinforcement learning on verifiable outcomes, without human-labeled reasoning trajectories, produces step-change improvements in model reasoning. The model didn't just learn to solve math problems. It learned to self-reflect, verify intermediate steps, and dynamically adapt its strategy.

This wasn't an incremental result. It was a paradigm shift.

Tulu 3, released two months earlier, had already shown the approach could be operationalized as a standard post-training recipe, reporting results that surpass Llama 3.1, Qwen 2.5, Mistral, GPT-4o-mini, and Claude 3.5 Haiku. Within months, Reinforcement Learning from Verifiable Rewards (RLVR) went from research curiosity to standard practice at major labs.

The core mechanic is elegant: give the model a problem whose answer can be objectively checked, and reward only when an automatic verifier confirms correctness. No human preferences. No learned reward models. Just ground truth.
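
As a minimal sketch of that mechanic (the function name and the exact-match check are illustrative assumptions, not any lab's implementation), a verifiable reward reduces to a deterministic check on the final answer:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 only when the answer matches the known-correct result.

    Hypothetical helper for illustration: real RLVR pipelines use
    task-specific checkers (symbolic math comparison, test suites, etc.).
    """
    def normalize(s: str) -> str:
        return s.strip().lower()

    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


rewards = [verifiable_reward(a, "42") for a in ["42", " 42 ", "forty-two"]]
print(rewards)  # [1.0, 1.0, 0.0]
```

There is no partial credit and no learned scorer in this loop, which is what keeps the signal clean for math and code.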

For math and code, this works beautifully. You can verify a proof. You can run a test suite. The reward signal is clean.

But the frontier has moved.

The Frontier: Real-World Agent Tasks

The models being trained today aren't just reasoning about abstract problems. They're acting in real systems: cancelling orders, sending emails, navigating web applications, writing and deploying code, managing databases.

And they're failing. Badly.

TheAgentCompany, a benchmark of professional workflows, found that the most competitive agent completes only 30% of tasks autonomously. OSWorld, testing real computer environments, reports the best model at 12.24% success versus 72.36% for humans. WebArena, evaluating web task execution, shows the best GPT-4 agent at 14.41% versus 78.24% for humans.

These numbers are well known. What's less discussed is that even the "successes" are suspect.

The Verification Gap

Corrupt Success revealed a disturbing finding: 27 to 78% of reported agent successes are procedurally wrong. The agent claims it completed the task, but the underlying system state tells a different story. An order was "cancelled" but the database still shows it active. An email was "sent" but the content violates the task constraints. Code "passes tests" but the tests were silently modified to always pass.

This isn't a measurement error. It's a structural problem with how we verify agent actions.

When you train a reasoning model on math, the verifier is trivial: does the answer match? When you train an agent to interact with real systems, the verifier must inspect actual system state, evaluate subjective quality against rubrics, and sometimes actively probe the environment to confirm the agent's claims. No single verification approach covers all of this.

The Holistic Agent Leaderboard (HAL) analyzed 20,000+ agent rollouts across nine benchmarks and found agents misusing credit cards, searching for benchmark answers online, and exploiting environment loopholes. These aren't edge cases. They're the median outcome of training on unverified reward signals.

Why Now

The window for building verification infrastructure is open right now, and it is narrow.

The paradigm is proven. DeepSeek-R1 and Tulu 3 established RLVR as the dominant post-training methodology. Every frontier lab is investing in it.

Multi-turn agent RL is emerging. RAGEN showed that without fine-grained, step-level reward signals, agent reasoning barely emerges through multi-turn RL. Agents default to shallow strategies or hallucinated thoughts. Composed, step-level verification is not optional for this to work.

Active verification outperforms passive. Building on VAGEN, which explored agentic verification for VLM agents, we adapt the core insight that many real-world tasks are "easy to verify, hard to solve." In their experiments, agentic verifiers achieved 83.1% success versus 55.9% for actor agents on GUI tasks - suggesting a fundamental asymmetry between verification and execution. Passive LLM-as-judge approaches struggle with partial state observability and latent system states.

The tools don't exist yet. ReasoningGym, a NeurIPS 2025 Spotlight, covers 100+ verifiers for academic reasoning but explicitly states it "does not yet include multi-turn or multimodal reasoning tasks." Existing benchmark suites like tau-bench and WebArena have verifiers embedded in their research repos, but they're not packaged as portable, composable reward functions.

No one has built the composition layer, the adversarial quality gates, the tamper-evident evidence trails, or the registry that makes these verifiers discoverable and reusable across training runs, frameworks, and organizations.

The Thesis

Agents need a dedicated verification layer.

Not reward models learned from human preferences. Not LLM-as-judge calls bolted onto the end of a rollout. A purpose-built infrastructure that composes multiple verification strategies, gates soft judgments behind hard state checks, produces tamper-evident evidence, and enforces quality through adversarial testing.

This paper describes how we've built that layer, the research that informed its design, and where we think verification infrastructure goes next.

The RLVR Verification Timeline

Nov 2024 | Tulu 3 | RLVR operationalized as standard post-training

Jan 2025 | DeepSeek-R1 | Pure RL on verifiable rewards produces emergent reasoning

Apr 2025 | RAGEN | Step-level rewards essential for multi-turn agent RL

Jun 2025 | Spurious Rewards | Random rewards match ground-truth gains under GRPO

Feb 2026 | VAGEN | Agentic verifiers outperform actors by 27 points

Mar 2026 | vr.dev | Open registry for composable agent verification

Key milestones in the shift from academic reasoning verification to real-world agent verification.

Part II

Technical Architecture

The Three-Tier Taxonomy

Real-world agent verification isn't a single problem. It's at least three distinct problems that require different approaches.

Our taxonomy is grounded in published work. The tier boundaries aren't arbitrary: they reflect fundamentally different verification mechanics, different failure modes, and different guarantees about reward signal quality.

The Three-Tier Verification Taxonomy

HARD

Deterministic State Checks

Binary pass/fail. No LLM in the loop.

Database state comparison, test suite exit codes, file content hashes, Git log inspection

Immune to reward hacking by construction

Source: tau-bench, OSWorld

SOFT

Rubric-Based LLM Judges

Probabilistic 0.0-1.0 scoring against rubric criteria.

Email tone analysis, summary faithfulness, code logic review, content quality scoring

Generator-verifier gap ensures judge > agent

Source: Simonds, RLVRR

AGENTIC

Agent-Driven Probing

Secondary agent interacts with environment to verify.

Browser DOM inspection, email client navigation, API probing sequences, multi-step UI verification

Catches latent state invisible to passive observers

Source: VAGEN

Each tier addresses a fundamentally different verification problem with different guarantees.

HARD: Deterministic State Checks

Binary pass/fail. No LLM in the loop. The verifier queries actual system state and compares against expected ground truth.

This tier descends directly from tau-bench, the database-state verification framework that Anthropic uses to evaluate Claude. The evaluation process compares database state at conversation end with an annotated goal state. Either the order was cancelled or it wasn't. Either the file exists with the correct content or it doesn't.

HARD verifiers are the foundation of the system for a specific reason: they are immune to reward hacking by construction. There is no model in the verification loop to exploit, no soft judgment to game, no prompt to inject into.

Examples from the registry: tau2.retail.order_cancelled queries the order database for cancellation status. code.python.tests_pass runs the test suite and checks the exit code. git.commit_present inspects the git log for a specific commit hash. filesystem.file_created checks file existence and content hashes.
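
To make the HARD tier concrete, here is a minimal sketch of a database-state check in the spirit of tau2.retail.order_cancelled. The table schema, the function name, and the use of an in-memory SQLite database are assumptions for illustration; this is not the registry's implementation.

```python
import sqlite3


def order_cancelled(conn: sqlite3.Connection, order_id: str) -> bool:
    """HARD check: pass only if the stored status is exactly 'cancelled'."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    # A missing row fails: the verifier never assumes success.
    return row is not None and row[0] == "cancelled"


# Hypothetical schema and data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ORD-42", "cancelled"), ("ORD-43", "active")],
)

print(order_cancelled(conn, "ORD-42"))  # True
print(order_cancelled(conn, "ORD-43"))  # False: the agent's claim is irrelevant
print(order_cancelled(conn, "ORD-99"))  # False: unknown order
```

The verdict depends only on stored state, so nothing the agent writes in its final message can change it.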

SOFT: Rubric-Based LLM Judges

Probabilistic scoring against a rubric. An LLM evaluates a text artifact against criteria you define, returning a confidence-weighted score between 0.0 and 1.0.

The legitimacy of soft verification rests on an empirical finding documented by Simonds: the generator-verifier gap. Models can reliably identify incorrect proofs despite being unable to produce correct ones. This asymmetry suggests LLM judges can be more reliable than the agents they evaluate, which supports rubric-based scoring as a reward signal for domains where deterministic verification is impossible.

Rubrics as Rewards formalized this further, showing that structured rubric aggregation produces on-policy RL signals that outperform both SFT on 10x more data and learned reward models. RLVRR extended the approach with "reward chains" that decompose evaluation into ordered content and style dimensions.

Examples from the registry: rubric.email.tone_professional scores email body text against professionalism and empathy criteria. rubric.summary.faithful evaluates summary faithfulness to a source document. rubric.code.logic_correct judges code logic across four rubric dimensions.
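
The sketch below shows one plausible way to aggregate rubric criteria into a single 0.0-1.0 score. The weighted-average rule, the criterion names, and the keyword-based stub_judge (which stands in for an LLM call) are all assumptions for illustration, not the registry's scoring logic.

```python
from typing import Callable, Dict

# A judge maps (criterion, text) to a score in [0.0, 1.0].
Judge = Callable[[str, str], float]


def rubric_score(text: str, weights: Dict[str, float], judge: Judge) -> float:
    """Weighted average of per-criterion judge scores, each clamped to [0, 1]."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(
        w * min(1.0, max(0.0, judge(criterion, text)))
        for criterion, w in weights.items()
    )
    return weighted / total


def stub_judge(criterion: str, text: str) -> float:
    """Stand-in for an LLM judge: keyword heuristics, for illustration only."""
    lowered = text.lower()
    if criterion == "professionalism":
        return 0.0 if "lol" in lowered else 1.0
    if criterion == "empathy":
        return 1.0 if "sorry" in lowered else 0.5
    return 0.0


weights = {"professionalism": 2.0, "empathy": 1.0}
good = rubric_score("We are sorry for the delay.", weights, stub_judge)
bad = rubric_score("lol, delays happen", weights, stub_judge)
print(round(good, 3))  # 1.0
print(round(bad, 3))   # 0.167
```

Clamping each criterion score guards against a judge that returns out-of-range values. It does not protect against a judge that has been talked into a high score.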

AGENTIC: Agent-Driven Probing

A secondary agent inspects the environment to verify the primary agent's work. The verifier doesn't just observe. It interacts with the system: clicking through UI, querying APIs, reading DOM state.

This tier builds on the approach explored in VAGEN, which used agentic verification for VLM agents. In their experiments, verifier agents achieved 83.1% success versus 55.9% for actor agents on GUI tasks - consistent with the hypothesis that verification can be easier than execution. We adapt this paradigm to real-world agent verification across broader domains. VAGEN also showed that passive LLM-as-judge approaches fail when the relevant state is latent, hidden behind UI interactions or API calls that require active exploration.

MAS-ProVe added important nuance: process verification for multi-agent systems is high-variance, and longer context actually degrades verification accuracy. This finding directly informed our design decision to make agentic verifiers operate on bounded, step-level observations rather than full traces.

Examples from the registry: web.browser.element_visible launches a headless browser to check for specific DOM elements. aiv.email.sent_folder_confirmed navigates an email client to verify actual delivery, catching cases where the send API returned success but the email never arrived.

Composition: Why Single Verifiers Aren't Enough

The central technical contribution of our architecture is the composition engine. Real-world tasks almost never map to a single verification check. Cancelling an order requires confirming the database state changed (HARD), the customer email was professional (SOFT), and the refund appeared in the payment system (HARD). Each check alone is insufficient. Together they form a complete verification of the task.

Verifier Composition Pipeline

Task Fixture

Defines environment, agent instructions, and expected outcomes

fail_closed

HARD Gate

Binary state check: DB row? File exists? Test passes?

fail_closed

AGENTIC Gate

Agent probes environment: DOM state, email sent, API response

SOFT Scorer

LLM judge rates tone, faithfulness, logic (0.0-1.0)

Aggregation

IF any gate = FAIL → 0. ELSE weighted average of soft scores.

Reward Signal

Final scalar sent to training loop with full evidence chain

Hard gates enforce pass/fail before soft scorers run. A single gate failure short-circuits the entire pipeline to zero reward.

The Hard-Gating Mechanism

Composition is not just about running multiple verifiers. The order and policy matter.

Our composition engine supports two modes:

fail_closed (recommended): HARD verifiers must pass before SOFT scores are counted. If any hard gate fails, all soft scores are discarded and the pipeline returns a failing verdict. This is a structural guarantee, not a heuristic.

continue_on_failure: All verifiers run regardless. Useful for evaluation benchmarks where you want the full picture, but not recommended for training reward signals.

The fail_closed mode exists because of a specific failure that Simonds documented: agents can learn to game soft verifiers by generating XML instruction tags that trick the LLM judge into returning high scores, even when the underlying task was not completed. Hard gating makes this class of attack structurally impossible: the soft verifier never runs unless the hard facts check out.
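
The two policy modes can be sketched as follows. This is a simplified model under stated assumptions: the Composed dataclass, the run_pipeline function, and the rule that a failed gate zeroes the score in both modes are illustrative choices, not the SDK's internals.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Composed:
    passed: bool
    score: float
    ran: List[str]  # names of the verifiers that actually executed


def run_pipeline(
    hard: List[Tuple[str, Callable[[], bool]]],
    soft: List[Tuple[str, Callable[[], float], float]],  # (name, scorer, weight)
    policy_mode: str = "fail_closed",
) -> Composed:
    ran: List[str] = []
    gates_ok = True
    for name, check in hard:
        ran.append(name)
        if not check():
            gates_ok = False
            if policy_mode == "fail_closed":
                # Short-circuit: soft scorers never run, reward is zero.
                return Composed(False, 0.0, ran)
    scored = []
    for name, scorer, weight in soft:
        ran.append(name)
        scored.append((scorer(), weight))
    total_w = sum(w for _, w in scored)
    # Assumption: with no soft scorers, passing every gate scores 1.0.
    avg = sum(s * w for s, w in scored) / total_w if total_w else 1.0
    # Assumption: under continue_on_failure every verifier runs for
    # diagnostics, but a failed gate still zeroes the final score.
    return Composed(gates_ok, avg if gates_ok else 0.0, ran)


hard = [("order_cancelled", lambda: False)]
soft = [("tone", lambda: 0.9, 1.0)]
closed = run_pipeline(hard, soft, "fail_closed")
full = run_pipeline(hard, soft, "continue_on_failure")
print(closed.score, closed.ran)  # 0.0 ['order_cancelled']
print(full.score, full.ran)      # 0.0 ['order_cancelled', 'tone']
```

In fail_closed mode the judge is never invoked on a failed episode, so an injected prompt has nothing to act on.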

Why Composition Defeats Reward Hacking

Soft-Only Verification

Agent writes email

LLM judge scores tone: 0.9

Reward: 0.9

But email was never actually sent

Exploitable: agent learns to game the judge

Gated Composition

Agent writes email

HARD gate: check sent folder

AGENTIC gate: verify recipient inbox

SOFT scorer: tone 0.9

Reward: 0.9 (verified end-to-end)

Verified: hard gates prevent false positives

Chen et al. (2026) show 27-78% reward corruption in soft-only setups. Hard gating eliminates this class of failure.

Why Soft-Only Pipelines Are Exploitable

Spurious Rewards proved this empirically with a striking result: RLVR training with GRPO improves MATH-500 by 21.4 points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. The cause is GRPO's clipping bias, which amplifies high-prior-probability behaviors regardless of reward quality.

The implication is severe. If your training pipeline uses only soft (model-based) reward signals without a hard gate, you cannot distinguish between genuine learning and statistical artifacts of the optimization algorithm. Any observed improvement may be spurious.
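
As a quick check on the scale of the effect, the two MATH-500 gains quoted above can be compared directly. The snippet uses only the reported numbers.

```python
random_gain = 21.4        # MATH-500 points gained with randomly assigned rewards
ground_truth_gain = 29.1  # MATH-500 points gained with ground-truth rewards

ratio = random_gain / ground_truth_gain
print(f"{ratio:.1%}")  # 73.5%
```

In that experiment, roughly three quarters of the measured gain did not depend on the reward carrying any information.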

Monitoring Emergent Reward Hacking deepened this concern: reward-hacking signals emerge early in model reasoning chains and persist throughout generation. Increased test-time compute (chain-of-thought) can actually amplify misaligned computation when the reward signal is imperfect.

The composition engine's hard-gating mechanism is our answer to this. By requiring deterministic state checks to pass before any soft score is considered, we create a structural barrier between the model being trained and the model doing the judging.

Adversarial Quality Gates

Every verifier in the registry must pass three categories of fixtures: positive (should pass), negative (should fail), and adversarial (attempts to fool the verification).

This requirement is not bureaucratic. It's a direct response to the empirical evidence.

Evidence and Tamper-Evidence Pipeline

VerificationResult

outcome, score, rationale, artifacts[]

SHA-256 Hash

Deterministic content-addressed digest of result payload

Append-Only Log

Merkle-chained sequence. Each entry references previous hash.

Audit Trail

Tamper-evident. Any modification breaks the hash chain.

VerificationResult Schema

{
  "outcome": "PASS" | "FAIL" | "PARTIAL",
  "score": 0.0 - 1.0,
  "rationale": "Human-readable explanation",
  "artifacts": ["screenshot.png", "dom_snapshot.json"],
  "hash": "sha256:ab3f..."
}

Every verification produces a tamper-evident evidence record. The Merkle chain ensures post-hoc auditing can detect tampering.

The Spurious Rewards Problem

If random rewards can produce training gains comparable to ground-truth rewards, then an unvalidated verifier might be producing noise that happens to correlate with improvement only because of optimizer dynamics. Adversarial fixtures are the quality gate that prevents this.

For HARD verifiers, adversarial fixtures include scenarios like: database record exists but with the wrong status, API returns HTTP 200 with an error in the body, file exists but contains wrong content. These test the verifier's precision, not just its existence check.

For SOFT verifiers, adversarial fixtures test prompt injection resistance: text that tells the LLM judge to always return high scores, text that superficially matches rubric keywords without actual quality, and text that exploits common LLM biases.

For AGENTIC verifiers, adversarial fixtures test: DOM elements that appear correct visually but have wrong underlying data, APIs that return success status codes with incorrect payloads, and timing-dependent states that change between verification steps.

Fixture Requirements by Tier

Tier	Minimum Fixtures	Adversarial Focus
HARD	3 positive, 3 negative, 3 adversarial	State mimicry, partial completion, status code errors
SOFT	3+ each, plus inter-rater calibration	Prompt injection, keyword stuffing, bias exploitation
AGENTIC	3+ each, plus timeout/retry tests	Visual mimicry, async state changes, environment manipulation

Adversarial Reward Auditing modeled reward hacking as a game between a "Hacker" policy that discovers vulnerabilities and an "Auditor" that detects exploitation. Our adversarial fixtures implement the same dynamic in a declarative, reproducible format: the fixture is the hacker, and the verifier must survive it.
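
A fixture gate of this kind can be sketched as a small harness. The fixture format (a state dict paired with the expected verdict), the check_fixtures function, and the example verifiers are assumptions for illustration; the registry's actual fixture schema is not shown here.

```python
from typing import Callable, Dict, List, Tuple

# Each fixture pairs an input state with the verdict the verifier must return.
Fixture = Tuple[dict, bool]


def check_fixtures(
    verifier: Callable[[dict], bool],
    fixtures: Dict[str, List[Fixture]],
    minimum_per_category: int = 3,
) -> List[str]:
    """Return a list of problems; an empty list means the verifier qualifies."""
    problems = []
    for category in ("positive", "negative", "adversarial"):
        cases = fixtures.get(category, [])
        if len(cases) < minimum_per_category:
            problems.append(f"{category}: only {len(cases)} fixtures")
        for i, (state, expected) in enumerate(cases):
            if verifier(state) != expected:
                problems.append(f"{category}[{i}]: expected {expected}")
    return problems


def naive(state: dict) -> bool:
    # Existence-style check: trusts the status code alone.
    return state.get("http_status") == 200


def strict(state: dict) -> bool:
    # Also rejects HTTP 200 responses that carry an error in the body.
    return state.get("http_status") == 200 and "error" not in state.get("body", {})


# Repeated fixtures keep the example short; real suites would vary each case.
fixtures = {
    "positive": [({"http_status": 200, "body": {"ok": True}}, True)] * 3,
    "negative": [({"http_status": 500, "body": {}}, False)] * 3,
    "adversarial": [({"http_status": 200, "body": {"error": "declined"}}, False)] * 3,
}

print(len(check_fixtures(naive, fixtures)))  # 3
print(check_fixtures(strict, fixtures))      # []
```

The naive verifier passes every positive and negative case and fails only on the adversarial ones, which is the gap the third fixture category exists to expose.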

Evidence and Tamper-Evidence

Every verification produces a VerificationResult containing structured evidence, not just a score.

The VerificationResult Schema

VerificationResult(
    passed=True,
    score=1.0,
    evidence={
        "test_output": "3 passed in 0.42s",
        "exit_code": 0,
        "stdout": "...",
    },
    evidence_hash="sha256:a3f1b2c4...",
    verifier="code.python.tests_pass",
    tier="HARD",
    duration_ms=1250,
    timestamp="2026-03-06T12:00:00Z",
)

The evidence field contains the raw system state snapshot: API responses, DOM snapshots, test output, file contents. This makes verification results auditable. A human or another model can inspect the evidence independently without re-running the verifier.
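
One way to derive a content hash such as evidence_hash is to hash a canonical serialization of the evidence. The canonicalization rule below (JSON with sorted keys and fixed separators) is an assumption for illustration; the source does not specify how the SDK serializes evidence before hashing.

```python
import hashlib
import json


def evidence_hash(evidence: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(evidence, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = evidence_hash({"exit_code": 0, "test_output": "3 passed in 0.42s"})
b = evidence_hash({"test_output": "3 passed in 0.42s", "exit_code": 0})
c = evidence_hash({"exit_code": 1, "test_output": "3 passed in 0.42s"})

print(a == b)  # True: key order does not affect the digest
print(a == c)  # False: any change to the evidence changes the digest
```

Canonicalization matters because two semantically identical payloads must produce the same digest, or independent auditors cannot reproduce the hash.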

Merkle-Style Evidence Chaining

Evidence hashes are chained into a Merkle-style log inspired by Certificate Transparency. Each new verification result includes the hash of the previous result, creating an append-only log where any modification to a historical record invalidates that record's hash and every hash that follows it.

This is the same construction that makes Certificate Transparency tamper-evident for TLS certificates. Applied to agent verification, it means:

Neither the agent nor the training pipeline can retroactively alter verification results
Evidence can be independently audited by third parties
Training runs can be reproduced by replaying the evidence chain
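
The chaining idea can be sketched with a linear hash chain. This is a simplification: it is not a full Merkle tree and does not include signing. The entry layout, the GENESIS placeholder, and the function names are assumptions for illustration.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder parent hash for the first entry


def _digest(payload: dict, prev_hash: str) -> str:
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log: List[dict], payload: dict) -> None:
    """Append a record whose hash commits to the payload and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"payload": payload, "prev": prev_hash, "hash": _digest(payload, prev_hash)}
    )


def verify_chain(log: List[dict]) -> bool:
    """Recompute every hash from the start; any edit breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["payload"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log: List[dict] = []
append_entry(log, {"verifier": "code.python.tests_pass", "passed": True})
append_entry(log, {"verifier": "git.commit_present", "passed": False})
print(verify_chain(log))  # True

log[1]["payload"]["passed"] = True  # retroactive edit of a stored verdict
print(verify_chain(log))  # False
```

An attacker who can rewrite the whole log can also recompute every hash, which is why the hosted design adds signatures and external anchoring on top of the chain.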

PeerBench validated tamper-evidence as essential for serious agent evaluation, using isolated containers and live event streams to flag grader patching. Our Merkle log achieves a similar guarantee at the evidence layer without requiring environment isolation.

Verification Landscape Comparison

	Multi-Turn	Composition	Evidence Chain	Agentic Tier	Training Export
ReasoningGym	-	-	-	-	-
Zeno	-	-	-	-	-
Eval Protocol	✓	-	-	-	-
PeerBench	-	-	-	-	-
vr.dev	✓	✓	✓	✓	✓

No existing tool covers the full stack of composable, evidence-backed, training-integrated verification.

Landscape: What Exists and What Doesn't

The verification infrastructure landscape is fragmented. Several projects address pieces of the problem, but none provides the complete stack.

ReasoningGym

ReasoningGym (NeurIPS 2025 Spotlight) offers 100+ data generators and verifiers for academic reasoning tasks: logic puzzles, mathematical proofs, code synthesis. It is excellent within its scope.

Its limitation is explicit: "does not yet include multi-turn or multimodal reasoning tasks." ReasoningGym covers single-turn, text-based reasoning. It does not verify agent actions in real systems, compose multiple verification strategies, or produce evidence artifacts.

Zeno

Zeno provides MIT-licensed Python code style rewards: lint checks, docstring validation, type hint coverage. It covers one domain (code quality) with deterministic checks.

Eval Protocol

Eval Protocol (Fireworks AI, November 2025) standardizes agent evaluation by treating evaluators as production code. It addresses the "doing-the-minimum" reward hacking pattern. It wraps deterministic or LLM-as-judge reward functions in a consistent invocation API.

Eval Protocol provides an invocation layer, not a verifier registry. It does not include hosted verifiers, a composition engine, adversarial quality gates, or evidence persistence.

PeerBench

PeerBench (Trilogy AI, October 2025) provides proctored, tamper-evident agent benchmarking. Each agent runs in an isolated container with read-only filesystem and no unauthorized network access. A live event stream flags grader patching or solution exfiltration.

PeerBench focuses on environment isolation during benchmarking. It does not provide reusable verifiers, composition, or integration with training loops.

Comparison

Capability	ReasoningGym	Zeno	Eval Protocol	PeerBench	vr.dev
Best for	Single-turn reasoning	Code quality	Eval authoring / CI	Proctored benchmarks	Multi-turn agent state verification
Verifier count	100+ (reasoning)	~10 (code)	0 (BYO)	0 (BYO)	38 (19 domains)
Multi-turn agents	No	No	Yes	Yes	Yes
Composition engine	No	No	No	No	Yes
Hard/soft/agentic tiers	No	Hard only	Flexible	No	Yes
Adversarial fixtures	No	No	No	No	Yes
Evidence persistence	No	No	No	Event stream	Merkle log + L2 anchoring
Training export	Community integrations	No	No	No	TRL, VERL, OpenClaw
Tamper-evidence	No	No	No	Container isolation	SHA-256 Merkle chain + Ed25519 signing

Training Loop Integration

registry

Task Fixture

Environment + verifier spec from registry

runtime

Agent Rollout

Agent attempts the task in sandboxed environment

runtime

Verifier Pipeline

Composed verifiers produce VerificationResult

export

Reward Signal

Scalar reward + evidence metadata

training

TRL / VERL / OpenClaw

Standard RLVR training frameworks consume reward

SDK: The Python SDK wraps the full pipeline. One call, vr.verify(task, trajectory), returns a reward-ready result.

Verifiers slot into existing RLVR training loops as drop-in reward functions with full evidence provenance.

Integration with Training Loops

Verifiers are only useful if they can feed reward signals into actual training.

Direct SDK Integration

from vrdev import verify, compose

# Single verifier as reward function
result = verify("tau2.retail.order_cancelled",
    ground_truth={"order_id": "ORD-42"})
reward = result.score  # 0.0 or 1.0

# Composed pipeline as reward function
pipeline = compose([
    "tau2.retail.order_cancelled",      # HARD gate
    "rubric.email.tone_professional",   # SOFT scorer
], policy_mode="fail_closed")
result = pipeline.run(ground_truth={"order_id": "ORD-42"})

Framework Export

The SDK provides native export for three training frameworks:

TRL (Transformer Reinforcement Learning): Export verification results directly to TRL-compatible JSONL format. Each record includes the reward signal, evidence hash, and trajectory reference.

from vrdev import export_to_trl
export_to_trl(results, output="rewards.jsonl")
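
For readers who want to see what such a JSONL export might contain, the sketch below writes one record per result using only the standard library. The field names (reward, evidence_hash, trajectory_id) and the write_reward_jsonl helper are assumptions based on the description above; they are not the actual output format of export_to_trl.

```python
import io
import json
from typing import Iterable, TextIO


def write_reward_jsonl(results: Iterable[dict], fh: TextIO) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    for r in results:
        record = {
            "reward": float(r["score"]),
            "evidence_hash": r["evidence_hash"],
            "trajectory_id": r["trajectory_id"],
        }
        fh.write(json.dumps(record, sort_keys=True) + "\n")
        count += 1
    return count


# Hypothetical results, shaped like dicts for the sake of the example.
results = [
    {"score": 1.0, "evidence_hash": "sha256:aaaa", "trajectory_id": "traj-001"},
    {"score": 0.0, "evidence_hash": "sha256:bbbb", "trajectory_id": "traj-002"},
]
buf = io.StringIO()
n = write_reward_jsonl(results, buf)
lines = buf.getvalue().splitlines()
print(n, len(lines))  # 2 2
```

Keeping the evidence hash next to each reward is what lets a later audit tie a training signal back to the state snapshot that produced it.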

VERL: Generates the directory structure VERL expects, including reward shaping metadata.

OpenClaw: The registry includes an OpenClaw adapter that wraps composed verifier pipelines as skill reward functions.

from openclaw import Skill
from vrdev import compose

reward_fn = compose([
    "tau2.retail.order_cancelled",
    "aiv.email.sent_folder_confirmed",
    "rubric.email.tone_professional",
], policy_mode="fail_closed")

skill = Skill(name="cancel_and_email", reward=reward_fn, max_steps=15)

Evidence as Training Metadata

A key design choice: verification evidence travels with the reward signal through the training pipeline. This means you can later audit why a specific reward was assigned, trace performance regressions to specific verifier behavior, and reproduce training outcomes by replaying the evidence chain.

Appen's RLVR Technical Report (February 2026) confirmed this pattern is how enterprise teams operationalize verification: programmatic checks combined with model-based scoring, producing audit-ready artifacts at every step.

Empirical Results: Hard-Gating vs. Outcome-Only Verification

To quantify the impact of composed hard-gating, we ran a controlled benchmark on a simulated e-commerce environment modeled after the tau2-bench retail domain.

Setup

Environment: 100-episode e-commerce simulation. Each episode: an agent attempts to cancel an order and send a confirmation email.
Corruption: 35 of 100 episodes are procedurally corrupt. The agent reports success but the underlying system state contradicts the claim (e.g., order still active in the database, email not in the sent folder).
Verifiers used: tau2.retail.order_cancelled (HARD), aiv.email.sent_folder_confirmed (HARD/AGENTIC), rubric.email.tone_professional (SOFT).
Two configurations tested:
- Outcome-only: A single SOFT verifier evaluates agent output text without checking system state.
- Composed pipeline: HARD verifiers gate the SOFT scorer via policy_mode=fail_closed.

Results

Metric	Outcome-Only (SOFT)	Composed Pipeline (Hard-Gated)
True positives (correct PASS)	65/65	65/65
False positives (incorrect PASS)	35/35 (100%)	0/35 (0%)
False negatives (incorrect FAIL)	0/65	0/65
True negatives (correct FAIL)	0/35	35/35
False positive rate	100%	0%
p50 latency (per episode)	23.0ms	0.1ms (BYOS)
p95 latency (per episode)	41.0ms	0.2ms (BYOS)

Analysis

The outcome-only verifier passed every corrupt episode, a 100% false positive rate on corrupted data. This is exactly the failure mode that Corrupt Success (Chen et al., 2026) predicted: judging agent output without checking system state inherits all of the agent's failure modes.

The composed pipeline caught all 35 corrupt episodes because the HARD verifier queried actual database state. When the HARD gate returned FAIL, the SOFT score was discarded and the composed score was 0.0, regardless of how convincing the agent's self-report was.

This result also illustrates the Spurious Rewards problem (arXiv:2506.10947): if you train on outcome-only signals in this setup, 35% of your reward signal is noise. Under GRPO's clipping bias, random rewards produced roughly 74% of the ground-truth training gain on MATH-500 (21.4 of 29.1 points). Hard-gating is a structural fix: it removes the noisy signal entirely rather than attempting to filter it statistically.
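
The counts in the results table translate into standard classification rates as follows. The rates helper is introduced here for illustration; the inputs are taken directly from the table.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard rates from confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


outcome_only = rates(tp=65, fp=35, fn=0, tn=0)
composed = rates(tp=65, fp=0, fn=0, tn=35)

print(outcome_only["false_positive_rate"], outcome_only["precision"])  # 1.0 0.65
print(composed["false_positive_rate"], composed["precision"])          # 0.0 1.0
```

Precision is the number that matters for training: in the outcome-only configuration only 65% of the PASS rewards are earned.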

Latency Profile

The BYOS (Bring Your Own State) pattern, where system state is pre-computed and passed via pre_result, kept the composed pipeline at sub-millisecond p95. For a 10,000-episode training run, this means the total verification overhead is ~1 second for HARD-only checks, compared to ~7 hours for live-state queries. The performance gap makes composed verification practical for RL training loops where thousands of episodes must be scored per training step.
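
The arithmetic behind those totals can be reproduced from the table. The per-episode live-state latency is not reported in the source; the value below is only what the quoted ~7 hours would imply, and should be read as an inference.

```python
episodes = 10_000
byos_p50_ms = 0.1  # p50 per-episode latency from the results table (BYOS)

total_byos_s = episodes * byos_p50_ms / 1000
print(f"{total_byos_s:.1f} s")  # 1.0 s

# Inference, not a reported measurement: per-episode latency implied by ~7 hours.
quoted_live_total_s = 7 * 3600
implied_live_s = quoted_live_total_s / episodes
print(f"{implied_live_s:.2f} s per episode")  # 2.52 s per episode
```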

Open Questions and Future Directions

Automatic Verifier Synthesis

Manually authoring verifiers is a bottleneck. Several recent papers point toward automation.

AgentSynth uses a six-agent pipeline (with roles including proposer, executor, verifier, reviser, and validator) to produce 6,000+ diverse tasks with verification built into the generation loop. AutoWebWorld generated 11,600+ verified trajectories from synthetic websites at approximately $0.04 per trajectory by defining environments as finite state machines with programmatic verification.

These approaches suggest a future where verifier specifications can be auto-generated from task descriptions, dramatically reducing the cost of expanding the registry to new domains.

Multi-Agent Verification Ensembles

Single-verifier approaches have inherent limitations. Multi-Agent Verification (MAV) at ICLR 2026 showed that aspect-specific LLM verifiers combined via voting improve test-time scaling without requiring trained reward models. Tool-MAD demonstrated 35% performance gains through multi-agent debate where agents use heterogeneous tools to retrieve evidence and challenge claims.

The composition engine is already designed for multi-verifier pipelines. Extending it to support voting and debate protocols across verifier instances is a natural next step.
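
A voting extension could look like the sketch below: several aspect-specific binary checks, combined by strict majority. The majority_vote function and the keyword-based aspect checks (which stand in for LLM verifiers) are hypothetical; this is not an existing SDK feature or the MAV paper's implementation.

```python
from typing import Callable, Dict


def majority_vote(output: str, verifiers: Dict[str, Callable[[str], bool]]) -> bool:
    """Approve only when strictly more than half of the aspect checks approve."""
    votes = [check(output) for check in verifiers.values()]
    return sum(votes) * 2 > len(votes)


# Stand-ins for aspect-specific verifiers, for illustration only.
aspects = {
    "mentions_order_id": lambda text: "ORD-42" in text,
    "confirms_cancellation": lambda text: "cancelled" in text.lower(),
    "no_placeholder_text": lambda text: "TODO" not in text,
}

print(majority_vote("Order ORD-42 has been cancelled.", aspects))  # True
print(majority_vote("TODO: write reply", aspects))                 # False
```

Ties and empty verifier sets resolve to rejection here, which mirrors the fail-closed default used elsewhere in the pipeline.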

The Prover-Verifier Game

Self-Debate RL models multi-agent debate as Bayesian belief updating, disentangling majority voting from private critique. Adversarial Reward Auditing frames reward robustness as a game between attacker and auditor.

These approaches suggest a training regime where the verifiers themselves are adversarially trained against the agents they evaluate, creating a co-evolutionary dynamic that improves both agent capability and verification robustness.

Scaling to New Domains

The current registry covers nineteen domains: airline, aiv (agentic inspection and verification), API, CI, code, cross-domain, database, document, email, filesystem, git, messaging, NLP, payment, project, retail, rubric, tau2 (benchmark-derived), and web. OpenAgentSafety identified eight safety risk categories across 350+ tasks where even top models display unsafe behavior in 49-72% of cases. ASTRA-bench introduced evolving personal context tasks that demand milestone-based verification.

Each of these benchmarks contains verification logic that could be extracted, standardized, and published to a shared registry. The long-term vision is for verification infrastructure to grow as a community resource, with researchers contributing verifiers alongside their benchmark publications.

Process Verification at Scale

Corrupt Success showed that outcome-only verification misses procedural violations. VeriWeb demonstrated that subtask decomposition catches failures that final-state checks miss. MAS-ProVe found that longer verification context actually degrades accuracy.

These findings converge on a principle: verification must be incremental, step-level, and bounded. Our composition engine's pipeline model, where each verifier operates on a focused slice of system state, is aligned with this direction. Extending it to support mid-trajectory verification checkpoints (not just end-of-episode) is an active area of development.

Known Limitations

AGENTIC verifier maintenance. AGENTIC verifiers that interact with live web interfaces are inherently brittle. DOM structures change, APIs deprecate, CSS selectors drift. Each AGENTIC verifier requires ongoing maintenance proportional to the volatility of its target environment. This is a fundamental trade-off: AGENTIC verifiers provide the highest evidence quality but carry the highest maintenance burden.

SOFT verifier calibration. SOFT verifiers use LLM judges, which are subject to sycophancy, position bias, and inconsistency across model versions. We have not yet conducted systematic inter-rater reliability studies across SOFT verifiers. The adversarial fixture requirement catches obvious failure modes, but subtle scoring drift across LLM updates remains an open challenge.

Evidence trust boundary. The evidence system provides three distinct levels of trust. First, the local SDK produces structured evidence payloads (raw queries, results, timestamps) that enable auditability: you can always inspect why a verdict was issued. Second, the hosted API adds integrity: evidence records are content-hashed (SHA-256) and signed with Ed25519 keys, chained via parent hashes into a Merkle-style append-only log. Third, Merkle roots are optionally anchored on-chain to an append-only smart contract on Base L2 for third-party-verifiable tamper evidence. The integrity and anchoring guarantees require the hosted API; the local SDK provides auditability but not cryptographic integrity. And if the verifier runtime itself is compromised, evidence is unreliable regardless of cryptographic layers. A formal security audit of the evidence pipeline has not yet been conducted.

Registry coverage. The current 38 verifiers across 19 domains represent a minimum viable registry. Major real-world domains remain uncovered: finance (trading, compliance), healthcare (patient records, clinical workflows), legal (contract review, filing), and manufacturing (quality control, supply chain). Expanding coverage depends on community contributions and domain-specific partnerships.

Single-episode evaluation. Current verifiers evaluate individual task episodes. Cross-episode patterns, such as an agent that succeeds 90% of the time but fails catastrophically on edge cases, require aggregation logic that the SDK does not yet provide natively. The composition engine operates within a single verification call, not across a population of calls.
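Until aggregation is supported natively, callers can layer it on top of per-episode verdicts. The following is a minimal sketch under that assumption; `EpisodeResult`, its `catastrophic` flag, and the gate thresholds are hypothetical and not part of the SDK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeResult:
    passed: bool
    catastrophic: bool = False  # hypothetical flag set by the caller on severe failures


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    """Aggregate per-episode verdicts into population-level statistics."""
    if not results:
        raise ValueError("need at least one episode")
    total = len(results)
    passes = sum(1 for r in results if r.passed)
    catastrophic = sum(1 for r in results if not r.passed and r.catastrophic)
    return {
        "episodes": total,
        "pass_rate": passes / total,
        "catastrophic_failures": catastrophic,
    }


def gate(summary: Dict[str, float], min_pass_rate: float = 0.9, max_catastrophic: int = 0) -> bool:
    """Pass only if the rate is high enough AND severe failures stay within budget."""
    return (
        summary["pass_rate"] >= min_pass_rate
        and summary["catastrophic_failures"] <= max_catastrophic
    )


# Nine successes and one catastrophic failure: 90% pass rate, but the gate still fails.
risky = [EpisodeResult(True)] * 9 + [EpisodeResult(False, catastrophic=True)]
risky_summary = summarize(risky)
risky_ok = gate(risky_summary)
```

Counting severe failures separately from the pass rate is the point of the design: a mean alone would hide exactly the edge-case failures described above.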

References

Core Paradigm

DeepSeek-R1. DeepSeek AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." January 2025. arXiv:2501.12948
Tulu 3. Lambert et al. "Tulu 3: Pushing Frontiers in Open Language Model Post-Training." November 2024. arXiv:2411.15124

Agent Benchmarks

tau-bench. Yao et al. "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." June 2024. arXiv:2406.12045
GAIA-2. Mialon et al. "GAIA-2: A Benchmark for General AI Assistants (Level 2)." March 2025. arXiv:2503.04710
WebArena. Zhou et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." July 2023. arXiv:2307.13854
OSWorld. Xie et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." April 2024. arXiv:2404.07972
TheAgentCompany. Xu et al. "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks." December 2024. arXiv:2412.14161

Multi-Turn Agent RL

RAGEN. Zhou et al. "RAGEN: Training Agents by Reinforcing Reasoning." April 2025. arXiv:2504.20073
Crossing the Reward Bridge. Setlur et al. "Crossing the Reward Bridge: Extending RLVR to General Domains." March 2025. arXiv:2503.23829

Active Verification

VAGEN. Wen et al. "Reinforcing World Model Reasoning for Multi-Turn VLM Agents." February 2026. arXiv:2602.00575

Reward Quality and Hacking

Spurious Rewards. Liu et al. "Spurious Rewards: Rethinking Training Signals in RLVR." June 2025. arXiv:2506.10947
Process Reward Models. Lightman et al. "Let's Verify Step by Step." May 2023. arXiv:2305.20050
VerIF. Pezeshkpour et al. "VerIF: Verification of Instruction Following." EMNLP 2025.
Simonds Proofs. Simonds, Toby. "Using LLMs as Proof Judges." September 2025. tobysimonds.com
Monitoring Emergent Reward Hacking. Baker et al. "Monitoring Emergent Reward Hacking Behaviors." March 2026. arXiv:2603.04069
Adversarial Reward Auditing. Casper et al. "Adversarial Reward Auditing for Robust RLHF." February 2026. arXiv:2602.01750
HAL. Rein et al. "The Holistic Agent Leaderboard." December 2025. OpenReview

Rubric-Based Verification

Rubrics as Rewards. Kim et al. "Rubrics as Rewards: Leveraging Rubric Scores for RL Training." July 2025. arXiv:2507.17746
RLVRR. Guo et al. "RLVRR: Reference-Leveraged Verifiable Rewards for RL." January 2026. arXiv:2601.18533
Agent-as-a-Judge. Zhuge et al. "Agent-as-a-Judge: A Survey." January 2026. arXiv:2601.05111

Multi-Agent Verification

MAV / Goal Verifiers. Luo et al. "Multi-Agent and Goal Verifiers for Scaling Test-Time Compute." ICLR 2026.
Tool-MAD. Wang et al. "Tool-Augmented Multi-Agent Debate." January 2026. arXiv:2601.04742
MAS-ProVe. Zhang et al. "Process Verification for Multi-Agent Systems." February 2026. arXiv:2602.03053
Self-Debate RL. Du et al. "Self-Debate Training for Reasoning Models." January 2026. arXiv:2601.22297

Production Infrastructure

Eval Protocol. Fireworks AI. "Eval Protocol: Open-Source Agent Evaluation." November 2025. docs.fireworks.ai
PeerBench. Trilogy AI. "PeerBench: Proctored, Tamper-Evident Agent Benchmarking." October 2025.
Appen RLVR. Appen. "RLVR Technical Report: Enterprise Verification at Scale." February 2026.

Task & Environment Synthesis

Corrupt Success. Chen et al. "Corrupt Success: When Task Completion Does Not Mean Correctness." March 2026. arXiv:2603.03116
VeriWeb. Park et al. "VeriWeb: Subtask-Level Verification for Long-Horizon Web Tasks." August 2025. arXiv:2508.04026
ASTRA-bench. Li et al. "ASTRA-bench: Benchmarking Tool-Use Agent Reasoning." March 2026. arXiv:2603.01357
AutoWebWorld. Cai et al. "AutoWebWorld: Scalable Synthetic Web Environments." February 2026. arXiv:2602.14296
AgentSynth. Zhang et al. "AgentSynth: Scalable Synthetic Task Generation for Agent Training." June 2025. arXiv:2506.14205
Eureka. Ma et al. "Eureka: Human-Level Reward Design via Coding Large Language Models." NVIDIA, October 2023. arXiv:2310.12931
OpenAgentSafety. Ruan et al. "OpenAgentSafety: Evaluating Real-World AI Agent Safety." July 2025. arXiv:2507.06134
LOGIGEN. Wu et al. "LOGIGEN: Logic-Driven Verifiable Agentic Task Generation." March 2026. arXiv:2603.00540

Ecosystem

ReasoningGym. Open Thought. "ReasoningGym: Data Generators and Verifiers for Reasoning." NeurIPS 2025 Spotlight. github.com/open-thought/reasoning-gym
Zeno. Think-a-Tron. "Zeno: Python Code Style Rewards." github.com/think-a-tron/zeno