Verify what AI agents actually changed
Deterministic checks, rubric-based scoring, and agentic probes. Composed into trust pipelines for CI, evaluation, and training.

27–78% of agent “successes” are wrong
The agent says it cancelled the order, but the database still shows it active. It says it sent the email, but the content doesn't match the rubric. Training on these false positives teaches models to appear correct instead of being correct.
Benchmark: HARD gating eliminates 100% of false positives
100-episode e-commerce simulation. Soft-only scoring: 35% false positives. HARD-gated pipeline: 0%.
Read the case study →
One platform, two modes
Ship agents you trust today. Train better agents tomorrow.
Regression testing, audit trails, and proof that agents changed real system state
- Catch false successes before users do. HARD verifiers query actual databases, APIs, and file systems. Not agent self-reports.
- Compose checks into CI gates. Chain verifiers with policy_mode="fail_closed" so soft scores only count if hard state checks pass first.
- Evidence payloads & audit trail. Every verdict carries raw evidence (query results, DOM snapshots, file stats). The hosted API adds Ed25519 signing, Merkle-chained integrity, and optional on-chain anchoring on Base L2.
- SDK + CLI + API. pip install vrdev, run from the command line, or hit the REST API directly.
Ground-truth reward signals that prevent reward hacking at training time
- Drop-in reward functions. Use verifiers as the reward signal in TRL, VERL, or OpenClaw. HARD returns 0/1, SOFT returns a continuous score.
- Anti-reward-hacking by design. The composition engine gates soft LLM judges behind hard state checks. Agents can't game soft metrics while violating deterministic constraints.
- Run locally, no HTTP latency. pip install vrdev and call v.verify() in your training loop. No API dependency in the hot path.
- Export to JSONL. Training-ready export for GRPO / DPO workflows with evidence provenance built in.
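The gating idea behind the anti-reward-hacking design can be sketched in a few lines. This is an illustrative sketch of the fail_closed principle, not the vrdev API itself: the hard check is binary, and the soft rubric score only counts once the hard gate passes.

```python
def gated_reward(hard_passed: bool, soft_score: float) -> float:
    """fail_closed gating (conceptual sketch, not the vrdev API):
    the soft LLM-judge score contributes only if the deterministic
    state check passed first."""
    if not hard_passed:
        return 0.0  # a perfect rubric score earns nothing here
    return soft_score

# agent gamed the rubric but the DB check failed: reward is zero
print(gated_reward(False, 0.98))  # 0.0
# hard check passed: the soft score flows through
print(gated_reward(True, 0.98))   # 0.98
```

This is why a sycophantic judge cannot inflate training rewards on its own: the continuous score is multiplied through a binary gate grounded in real system state.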
Don't trust agent outputs. Verify them.
Three layers of verification for AI agents that interact with real systems
Verifier Registry
38 verifiers across 19 domains including retail, airline, telecom, email, calendar, shell, code, web, filesystem, document, database, API, git, messaging, payment, CI, and more. Each verifier checks actual system state: API responses, database records, browser DOM, git history. Three tiers: HARD (deterministic), SOFT (LLM judge), AGENTIC (agent-driven).
Composition Engine
Chain verifiers into reward pipelines with policy_mode="fail_closed". Gate soft LLM judges behind hard state checks so reward hacking can't bypass deterministic constraints. The anti-reward-hacking mechanism baked into your training loop.
Evidence & Audit Trail
Every verification produces a structured evidence record containing the raw API response, verdict, score, and timestamp. The hosted API adds Ed25519 signing and Merkle-chained integrity with optional on-chain anchoring on Base L2. Export to TRL, VERL, or any training framework.
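To make the idea concrete, here is a rough sketch of how an evidence record can be canonicalized and hashed so later tampering is detectable. The field names and helper below are assumptions for illustration, not the actual vrdev schema.

```python
import hashlib
import json

def make_evidence(raw_response: dict, verdict: str, score: float,
                  timestamp: str) -> dict:
    """Hypothetical evidence record: canonical JSON hashed with SHA-256.
    Field names are illustrative, not the real vrdev wire format."""
    record = {
        "raw_response": raw_response,  # e.g. the raw API/DB query result
        "verdict": verdict,
        "score": score,
        "timestamp": timestamp,
    }
    # sort_keys gives a canonical serialization, so the same evidence
    # always produces the same hash
    canonical = json.dumps(record, sort_keys=True).encode()
    record["evidence_hash"] = "sha256:" + hashlib.sha256(canonical).hexdigest()
    return record

rec = make_evidence({"order_status": "cancelled"}, "pass", 1.0,
                    "2025-01-01T00:00:00Z")
print(rec["evidence_hash"][:7])  # sha256:
```

Hashes like this are what a Merkle chain or on-chain anchor would then commit to.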
Bring Your Own State
Already ran the check yourself? Pass pre_result to skip redundant execution and feed your own state into the composition engine. Sub-millisecond overhead for RL training loops. Learn more →
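Conceptually, the BYOS short-circuit looks like this. This is a minimal sketch of the idea under assumed names; vrdev's real pre_result plumbing may differ.

```python
from typing import Callable, Optional

def verify_state(run_probe: Callable[[], dict],
                 pre_result: Optional[dict] = None) -> dict:
    """Hypothetical helper: if the caller already has the state (BYOS),
    skip the live probe entirely."""
    if pre_result is not None:
        return pre_result  # sub-millisecond path for RL training loops
    return run_probe()     # otherwise hit the live system

# the caller already queried the DB, so no live probe runs
cached = {"order_status": "cancelled"}
print(verify_state(lambda: {"order_status": "active"}, pre_result=cached))
```

Passing state in rather than re-probing is what keeps verification overhead negligible inside a tight training loop.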
MCP Server
Six tools for Claude Desktop and Cursor, usable directly from your AI assistant: list, run, compose, explain, search, and reward. Install with pip install vrdev[mcp]. Setup guide →
Free to start. Free locally, forever.
Run all verifiers locally with pip install vrdev. The hosted API is free during launch; paid tiers activate soon.
Install the SDK
pip install vrdev. Run verifiers locally with zero setup. No API key needed.
Optional: Hosted API
Sign up for an API key to get evidence anchoring, audit trails, and team dashboards.
Pay per call (later)
When paid tiers activate: USDC micropayments on Base via x402. Starting at $0.005 per HARD check.
Get started in 3 lines
from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

# single verifier: checks the actual DB state
v = get_verifier("vr/tau2.retail.order_cancelled")
result = v.verify(VerifierInput(
    completions=["Order cancelled"],
    ground_truth={"order_id": "ORD-42"},
))

# composed pipeline: hard gate → soft scorer
pipeline = compose([
    get_verifier("vr/tau2.retail.order_cancelled"),
    get_verifier("vr/rubric.email.tone_professional"),
], policy_mode=PolicyMode.FAIL_CLOSED)
result = pipeline.verify(VerifierInput(
    completions=["Order cancelled, email sent"],
    ground_truth={"order_id": "ORD-42"},
))

print(result[0].passed)         # True / False
print(result[0].evidence_hash)  # sha256:a3f1b2...

Why not just…
“Why not just write a pytest assert?”
You absolutely should write asserts. But a single assert doesn't compose with LLM scoring, doesn't produce an evidence chain, and doesn't come with adversarial test fixtures. vr.dev wraps deterministic checks (HARD tier) with composition, evidence persistence, and training-data export.
“Why not just use LLM-as-judge?”
LLM judges are sycophantic, non-deterministic, and gameable by the agent being evaluated. Our SOFT tier uses LLM judges, but only after the HARD tier confirms actual system state. fail_closed composition means a perfect rubric score is discarded if the database check fails.
“Why not build this internally?”
You can. We did. It took months. The registry gives you 38 battle-tested verifiers across 19 domains with adversarial fixtures, shared composition patterns, and training framework integration, all ready to use after pip install vrdev.
Who this is for
Agent Developers
Building agents that interact with real systems (APIs, databases, browsers) and need proof they worked.
RL/RLHF Researchers
Training agents with GRPO/DPO and need ground-truth reward signals that prevent reward hacking.
Eval & QA Teams
Running regression tests on agent behavior and need deterministic checks beyond “it looks right.”
Platform Teams
Need audit trails and compliance evidence for agents acting on behalf of users in production.
Current scope & limitations
We believe in being upfront about where vr.dev is today.
Registry size
38 verifiers across 19 domains. Growing, but not yet comprehensive. Finance, healthcare, and legal domains are not covered yet.
Signing is hosted-only
Ed25519 signing, Merkle integrity, and on-chain anchoring require the hosted API. The local SDK provides full evidence payloads and auditability, but not cryptographic tamper-evidence.
AGENTIC verifiers need network
HARD and SOFT verifiers run fully offline. AGENTIC verifiers (IMAP, CalDAV, browser) require network access to probe external services. BYOS mode avoids all live I/O.

Ship agents you can verify
pip install vrdev: run locally in seconds. Free forever for local use.
Open source (MIT) · Free: run locally · Hosted API: pay-per-call in USDC