Verify what AI agents actually changed
Deterministic checks, rubric-based scoring, and agentic probes. Composed into trust pipelines for CI, evaluation, and training.

27–78% of agent “successes” are wrong
The agent says it cancelled the order, but the database still shows it active. It says it sent the email, but the content doesn't match the rubric. Training on these false positives teaches models to appear correct instead of being correct.
Benchmark: HARD gating eliminates 100% of false positives
100-episode e-commerce simulation. Soft-only scoring: 35% false positives. HARD-gated pipeline: 0%.
Read the case study →
One platform, two modes
Ship agents you trust today. Train better agents tomorrow.
Regression testing, audit trails, and proof that agents changed real system state
- Catch false successes before users do. HARD verifiers query actual databases, APIs, and file systems. Not agent self-reports.
- Compose checks into CI gates. Chain verifiers with policy_mode="fail_closed" so soft scores only count if hard state checks pass first.
- Evidence payloads & audit trail. Every verdict carries raw evidence (query results, DOM snapshots, file stats). The hosted API adds Ed25519 signing, Merkle-chained integrity, and optional on-chain anchoring on Base L2.
- SDK + CLI + API. pip install vrdev, run from the command line, or hit the REST API directly.
Ground-truth reward signals that prevent reward hacking at training time
- Drop-in reward functions. Use verifiers as the reward signal in TRL, VERL, or OpenClaw. HARD returns 0/1, SOFT returns a continuous score.
- Anti-reward-hacking by design. The composition engine gates soft LLM judges behind hard state checks. Agents can't game soft metrics while violating deterministic constraints.
- Run locally, no HTTP latency. pip install vrdev and call v.verify() in your training loop. No API dependency in the hot path.
- Export to JSONL. Training-ready export for GRPO / DPO workflows with evidence provenance built in.
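The gating idea behind the anti-reward-hacking design can be sketched in a few lines. This is an illustrative sketch of the fail_closed principle, not the vrdev API itself: the hard check is binary, and the soft rubric score only counts once the hard gate passes.

```python
def gated_reward(hard_passed: bool, soft_score: float) -> float:
    """fail_closed gating (conceptual sketch, not the vrdev API):
    the soft LLM-judge score contributes only if the deterministic
    state check passed first."""
    if not hard_passed:
        return 0.0  # a perfect rubric score earns nothing here
    return soft_score

# agent gamed the rubric but the DB check failed: reward is zero
print(gated_reward(False, 0.98))  # 0.0
# hard check passed: the soft score flows through
print(gated_reward(True, 0.98))   # 0.98
```

This is why a sycophantic judge cannot inflate training rewards on its own: the continuous score is multiplied through a binary gate grounded in real system state.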
Don't trust agent outputs. Verify them.
Three layers of verification for AI agents that interact with real systems
Verifier Registry
38 verifiers across 19 domains including retail, airline, telecom, email, calendar, shell, code, web, filesystem, document, database, API, git, messaging, payment, CI, and more. Each verifier checks actual system state: API responses, database records, browser DOM, git history. Three tiers: HARD (deterministic), SOFT (LLM judge), AGENTIC (agent-driven).
Composition Engine
Chain verifiers into reward pipelines with policy_mode="fail_closed". Gate soft LLM judges behind hard state checks so reward hacking can't bypass deterministic constraints. The anti-reward-hacking mechanism baked into your training loop.
Evidence & Audit Trail
Every verification produces a structured evidence record containing the raw API response, verdict, score, and timestamp. The hosted API adds Ed25519 signing and Merkle-chained integrity with optional on-chain anchoring on Base L2. Export to TRL, VERL, or any training framework.
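To make the idea concrete, here is a rough sketch of how an evidence record can be canonicalized and hashed so later tampering is detectable. The field names and helper below are assumptions for illustration, not the actual vrdev schema.

```python
import hashlib
import json

def make_evidence(raw_response: dict, verdict: str, score: float,
                  timestamp: str) -> dict:
    """Hypothetical evidence record: canonical JSON hashed with SHA-256.
    Field names are illustrative, not the real vrdev wire format."""
    record = {
        "raw_response": raw_response,  # e.g. the raw API/DB query result
        "verdict": verdict,
        "score": score,
        "timestamp": timestamp,
    }
    # sort_keys gives a canonical serialization, so the same evidence
    # always produces the same hash
    canonical = json.dumps(record, sort_keys=True).encode()
    record["evidence_hash"] = "sha256:" + hashlib.sha256(canonical).hexdigest()
    return record

rec = make_evidence({"order_status": "cancelled"}, "pass", 1.0,
                    "2025-01-01T00:00:00Z")
print(rec["evidence_hash"][:7])  # sha256:
```

Hashes like this are what a Merkle chain or on-chain anchor would then commit to.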
Bring Your Own State
Already ran the check yourself? Pass pre_result to skip redundant execution and feed your own state into the composition engine. Sub-millisecond overhead for RL training loops. Learn more →
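Conceptually, the BYOS short-circuit looks like this. This is a minimal sketch of the idea under assumed names; vrdev's real pre_result plumbing may differ.

```python
from typing import Callable, Optional

def verify_state(run_probe: Callable[[], dict],
                 pre_result: Optional[dict] = None) -> dict:
    """Hypothetical helper: if the caller already has the state (BYOS),
    skip the live probe entirely."""
    if pre_result is not None:
        return pre_result  # sub-millisecond path for RL training loops
    return run_probe()     # otherwise hit the live system

# the caller already queried the DB, so no live probe runs
cached = {"order_status": "cancelled"}
print(verify_state(lambda: {"order_status": "active"}, pre_result=cached))
```

Passing state in rather than re-probing is what keeps verification overhead negligible inside a tight training loop.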
MCP Server
Six tools for Claude Desktop and Cursor, usable directly from your AI assistant: list, run, compose, explain, search, and reward. Install with pip install vrdev[mcp]. Setup guide →
Free to start. Free locally, forever.
Run all verifiers locally with pip install vrdev. The hosted API is free during launch; paid tiers activate soon.
Install the SDK
pip install vrdev. Run verifiers locally with zero setup. No API key needed.
Optional: Hosted API
Sign up for an API key to get evidence anchoring, audit trails, and team dashboards.
Pay per call (later)
When paid tiers activate: USDC micropayments on Base via x402. Starting at $0.005 per HARD check.
Get started in 3 lines
from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

# single verifier: checks the actual DB state
v = get_verifier("vr/tau2.retail.order_cancelled")
result = v.verify(VerifierInput(
    completions=["Order cancelled"],
    ground_truth={"order_id": "ORD-42"},
))

# composed pipeline: hard gate → soft scorer
pipeline = compose([
    get_verifier("vr/tau2.retail.order_cancelled"),
    get_verifier("vr/rubric.email.tone_professional"),
], policy_mode=PolicyMode.FAIL_CLOSED)
result = pipeline.verify(VerifierInput(
    completions=["Order cancelled, email sent"],
    ground_truth={"order_id": "ORD-42"},
))

print(result[0].passed)         # True / False
print(result[0].evidence_hash)  # sha256:a3f1b2...

Why not just…
“Why not just write a pytest assert?”
You absolutely should write asserts. But a single assert doesn't compose with LLM scoring, doesn't produce an evidence chain, and doesn't come with adversarial test fixtures. vr.dev wraps deterministic checks (HARD tier) with composition, evidence persistence, and training-data export.
“Why not just use LLM-as-judge?”
LLM judges are sycophantic, non-deterministic, and gameable by the agent being evaluated. Our SOFT tier uses LLM judges, but only after the HARD tier confirms actual system state. fail_closed composition means a perfect rubric score is discarded if the database check fails.
“Why not build this internally?”
You can. We did. It took months. The registry gives you 38 battle-tested verifiers across 19 domains with adversarial fixtures, shared composition patterns, and training framework integration, all ready to use after pip install vrdev.
Who this is for
Agent Developers
Building agents that interact with real systems (APIs, databases, browsers) and need proof they worked.
RL/RLHF Researchers
Training agents with GRPO/DPO and need ground-truth reward signals that prevent reward hacking.
Eval & QA Teams
Running regression tests on agent behavior and need deterministic checks beyond “it looks right.”
Platform Teams
Need audit trails and compliance evidence for agents acting on behalf of users in production.
Current scope & limitations
We believe in being upfront about where vr.dev is today.
Registry size
38 verifiers across 19 domains. Growing, but not yet comprehensive. Finance, healthcare, and legal domains are not covered yet.
Signing is hosted-only
Ed25519 signing, Merkle integrity, and on-chain anchoring require the hosted API. The local SDK provides full evidence payloads and auditability, but not cryptographic tamper-evidence.
AGENTIC verifiers need network
HARD and SOFT verifiers run fully offline. AGENTIC verifiers (IMAP, CalDAV, browser) require network access to probe external services. BYOS mode avoids all live I/O.

Ship agents you can verify
pip install vrdev: run locally in seconds. Free forever for local use.
Open source (MIT) · Free: run locally · Hosted API: pay-per-call in USDC