Concepts
The Problem
Research shows that 27-78% of agent "successes" are procedurally wrong. The agent claims it completed a task, but the underlying state tells a different story. An order was "cancelled" but the database still shows it active. An email was "sent" but the content is gibberish. Code "passes tests" but the tests were modified to always pass.
RL training on unverified rewards propagates these false successes into the model weights, creating agents that learn to appear correct rather than be correct.
Three Tiers of Verification
HARD: Deterministic State Checks
Binary pass/fail. No LLM in the loop. The verifier queries actual system state (database records, API responses, file contents) and compares against the expected ground truth.
Examples: tau2.retail.order_cancelled, code.python.tests_pass, git.commit_present
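A HARD verifier can be thought of as a pure function from observed system state to a binary verdict. The `Order` dataclass and `order_cancelled` check below are a hypothetical sketch, not the SDK's actual API:

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    status: str  # e.g. "active", "cancelled"

def order_cancelled(order: Order) -> bool:
    """HARD verifier sketch: a binary check against stored state.
    No LLM involved; the verdict comes straight from the record."""
    return order.status == "cancelled"

# The agent may *claim* cancellation, but only the record decides.
claimed_done = Order(order_id="A-100", status="active")
print(order_cancelled(claimed_done))  # False: the claim contradicts state
```

In a real deployment the `Order` would come from a database query or API call rather than an in-memory object, but the shape is the same: fetch ground truth, compare, return pass/fail.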
SOFT: Rubric-Based LLM Judges
Probabilistic scoring against a rubric. An LLM evaluates a text artifact (email body, summary, code review) against criteria you define. Returns a confidence-weighted score.
Examples: rubric.email.tone_professional, rubric.summary.faithful, rubric.code.logic_correct
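One common way to turn rubric judgments into a single reward is a weighted average over per-criterion scores. The function below is an illustrative sketch (the criterion names and weighting scheme are assumptions); in practice each per-criterion score would come from an LLM judge prompted with the rubric item and the artifact:

```python
def rubric_score(criterion_scores: dict[str, float],
                 weights: dict[str, float]) -> float:
    """Combine per-criterion scores (each 0.0-1.0) into one
    weighted score in [0.0, 1.0]."""
    total_weight = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total_weight

scores = {"tone_professional": 0.9, "addresses_request": 0.7}
weights = {"tone_professional": 1.0, "addresses_request": 2.0}
print(round(rubric_score(scores, weights), 3))  # 0.767
```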
AGENTIC: Agent-Driven Probing
A secondary agent inspects the environment to verify the primary agent's work. The verifier interacts with the system directly: clicking through the UI, querying APIs, and reading DOM state.
Examples: web.browser.element_visible, web.browser.screenshot_match, aiv.email.sent_folder_confirmed
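The essence of agentic verification is that the verifier actively queries the environment instead of trusting the primary agent's transcript. The `FakeMailbox` below is a hypothetical stand-in for a real environment (a mail API, a browser session); only the probe-don't-trust pattern is the point:

```python
class FakeMailbox:
    """Stand-in for a real environment the probing agent would query.
    Hypothetical, for illustration only."""
    def __init__(self):
        self.sent = []

    def send(self, to: str, body: str) -> None:
        self.sent.append({"to": to, "body": body})

    def search_sent(self, to: str) -> list:
        return [m for m in self.sent if m["to"] == to]

def sent_folder_confirmed(mailbox: FakeMailbox, recipient: str) -> bool:
    """AGENTIC-style check: actively inspect the environment
    rather than accept the primary agent's claim."""
    return len(mailbox.search_sent(recipient)) > 0

box = FakeMailbox()
box.send("alice@example.com", "Your order has shipped.")
print(sent_folder_confirmed(box, "alice@example.com"))  # True
print(sent_folder_confirmed(box, "bob@example.com"))    # False
```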
Composition Engine
Verifiers are composed into reward pipelines using the compose() function. Four policy modes:
- fail_closed (recommended): HARD verifiers must pass before SOFT scores are counted. This prevents reward hacking, since an agent cannot game a soft metric while violating hard constraints.
- fail_open: ERROR and UNVERIFIABLE states are excluded from scoring; only an explicit FAIL blocks the pipeline.
- escalation: Run tiers in order (HARD → SOFT → AGENTIC), stopping as soon as a tier passes.
- ensemble: Run all verifiers and aggregate their scores.
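The fail_closed policy can be sketched in a few lines. This is an illustrative reimplementation, not the SDK's `compose()` internals; the `Verdict` enum and function shape are assumptions:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"

def fail_closed(hard_results: list[Verdict],
                soft_scores: list[float]) -> float:
    """Sketch of the fail_closed policy: if any HARD verifier fails,
    the reward is 0.0 regardless of soft scores; otherwise the SOFT
    scores are averaged."""
    if any(v is Verdict.FAIL for v in hard_results):
        return 0.0
    return sum(soft_scores) / len(soft_scores) if soft_scores else 1.0

# Soft metrics look great, but a hard constraint was violated:
print(fail_closed([Verdict.FAIL], [0.95, 0.9]))  # 0.0
print(fail_closed([Verdict.PASS], [0.95, 0.9]))  # 0.925
```

The key property is that no soft score, however high, can rescue a run that failed a deterministic state check.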
Evidence & Trust Model
Every verification produces an evidence record containing:
- Raw system state snapshot (API response, DOM, test output)
- Verdict (pass/fail) and score (0.0-1.0)
- Timestamp and verifier version
- SHA-256 content hash
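A minimal evidence record can be assembled with the standard library alone. The sketch below assumes a JSON-serializable state snapshot and hashes it canonically (sorted keys) so the same state always yields the same hash; field names are illustrative:

```python
import hashlib
import json
import time

def make_evidence(raw_state: dict, verdict: str, score: float,
                  verifier_version: str) -> dict:
    """Sketch of an evidence record: the content hash covers the raw
    snapshot, so later tampering with the stored state is detectable."""
    snapshot = json.dumps(raw_state, sort_keys=True)
    return {
        "raw_state": raw_state,
        "verdict": verdict,
        "score": score,
        "timestamp": time.time(),
        "verifier_version": verifier_version,
        "content_hash": hashlib.sha256(snapshot.encode()).hexdigest(),
    }

rec = make_evidence({"order": "A-100", "status": "cancelled"},
                    verdict="pass", score=1.0, verifier_version="0.3.1")
print(len(rec["content_hash"]))  # 64 hex characters for SHA-256
```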
Three levels of trust are available:
- Auditability (local SDK + hosted API): Every result carries raw evidence, so you can always inspect why a verdict was issued.
- Integrity (hosted API): Evidence records are Ed25519-signed and chained into a Merkle-style log, making post-hoc tampering detectable.
- On-chain anchoring (optional): Merkle roots are periodically published to Base L2 for third-party-verifiable integrity.
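The tamper-evidence idea behind the chained log can be shown with a plain hash chain. This sketch omits Ed25519 signing and Merkle batching and uses only SHA-256; it is not the hosted API's actual scheme:

```python
import hashlib

def chain(records: list) -> list:
    """Hash-chain a sequence of evidence records (as bytes): each
    entry's hash includes the previous hash, so editing any past
    record invalidates every later link."""
    prev = "0" * 64  # genesis value
    out = []
    for rec in records:
        prev = hashlib.sha256(prev.encode() + rec).hexdigest()
        out.append(prev)
    return out

log = chain([b"evidence-1", b"evidence-2", b"evidence-3"])
tampered = chain([b"evidence-1", b"EVIL", b"evidence-3"])
print(log[2] != tampered[2])  # True: tampering ripples forward
```

Anchoring the final (or a Merkle root) hash somewhere append-only, such as an L2 chain, is what lets third parties check integrity without trusting the log's operator.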
Integration
Export evidence to training frameworks:
```python
from vrdev import export_to_trl, export_to_verl

# Generate TRL-compatible reward dataset
export_to_trl(pipeline_results, output="rewards.jsonl")

# Or VERL format
export_to_verl(pipeline_results, output="verl_rewards/")
```