Demos
Self-contained examples showing vrdev's verifiable-rewards pipeline. Each demo embeds its own mock server - no external services required.
pip install vrdev

Retail Support-Ops
cancel → refund → inventory
Three HARD verifiers composed with a fail-closed policy verify that an AI agent actually cancelled the order, processed the refund, and updated inventory.
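The fail-closed composition above can be sketched in plain Python. This is a minimal illustration of the idea, not vrdev's actual API: the `Result` class, `compose_fail_closed`, and the check names are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Result:
    name: str     # e.g. "order_cancelled/status_match" (illustrative names)
    score: float  # 1.0 = check satisfied, 0.0 = violated
    hard: bool    # HARD checks gate the whole episode

def compose_fail_closed(results):
    """Fail-closed: any HARD check below 1.0 zeroes the episode score."""
    if any(r.hard and r.score < 1.0 for r in results):
        return "FAIL", 0.0
    return "PASS", sum(r.score for r in results) / len(results)

results = [
    Result("order_cancelled/status_match",    1.0, hard=True),
    Result("refund_processed/amount_match",   1.0, hard=True),
    Result("inventory_updated/quantity_match", 0.0, hard=True),  # agent claimed, state says otherwise
]
print(compose_fail_closed(results))  # -> ('FAIL', 0.0)
```

The key design choice is that a single failed HARD check cannot be averaged away by the passing checks, which is what produces the 0.00 scores in the FAIL transcripts below.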
Verdict: PASS  Score: 1.00
Breakdown:
  order_cancelled/status_match: 1.0
  order_cancelled/reason_match: 1.0
  refund_processed/status_match: 1.0
  refund_processed/amount_match: 1.0
  inventory_updated/quantity_match: 1.0
  inventory_updated/warehouse_match: 1.0
Verdict: FAIL  Score: 0.00
⚠ Hard gate triggered - order still active, refund pending, inventory wrong
Code Agent
lint → test → commit
Verifies an AI agent that writes Python: lint with ruff (zero violations), tests with pytest, and a git commit check. All three are HARD verifiers - one failure gates the episode.
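A sketch of how such shell-based checks can feed the hard gate, assuming exit code 0 means pass. The real demo shells out to `ruff check`, `pytest`, and `git`; here the commands are stand-in `python -c` invocations so the sketch runs anywhere, and the check names are illustrative.

```python
import subprocess
import sys

def run_check(name, cmd):
    """Run a command; any nonzero exit code counts as a gate failure."""
    proc = subprocess.run(cmd, capture_output=True)
    return name, proc.returncode == 0

checks = [
    ("lint_ruff",      [sys.executable, "-c", "pass"]),                 # stand-in for: ruff check .
    ("tests_pass",     [sys.executable, "-c", "assert True"]),          # stand-in for: pytest -q
    ("commit_present", [sys.executable, "-c", "raise SystemExit(1)"]),  # stand-in for a failing git check
]
results = [run_check(name, cmd) for name, cmd in checks]
episode_score = 1.0 if all(ok for _, ok in results) else 0.0  # fail-closed
print(results, episode_score)
```

Because the composition is fail-closed, the passing lint and test checks cannot rescue the episode once the commit check fails, mirroring the FAIL transcript below.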
Verdict: PASS  Score: 1.00
Breakdown:
  lint_ruff/violation_count: 0
  tests_pass/pass_ratio: 1.0
  commit_present/message_match: 1.0
Verdict: FAIL  Score: 0.00
⚠ Hard gate triggered - 3 unused imports. Tests passed, but the lint failure gates the whole episode.
E-commerce Browser Agent
order placed → refund processed
Cross-domain composition: a WebArena-style order verifier plus a τ²-bench refund verifier. Catches agents that claim "order placed" when the order was actually cancelled.
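The core move is verifying against ground-truth state rather than the agent's transcript. A minimal sketch, in which the mock store, order id, and field names are all invented for illustration:

```python
# Hypothetical ground-truth state, as a mock server's database might hold it.
mock_store = {"orders": {"A100": {"status": "cancelled", "total": 59.90}}}

def verify_order_placed(order_id, claimed_total):
    """Check the agent's 'order placed' claim against actual store state."""
    order = mock_store["orders"].get(order_id)
    checks = {
        "order_found":  order is not None,
        "status_match": order is not None and order["status"] == "placed",
        "total_match":  order is not None and abs(order["total"] - claimed_total) < 0.01,
    }
    return checks, all(checks.values())

# The agent claims the order was placed; the store says it was cancelled.
checks, passed = verify_order_placed("A100", 59.90)
print(checks, passed)  # status_match fails, so the claim is rejected
```

An LLM judge reading only the agent's transcript would see a confident "order placed" and score it highly; the state check fails regardless of how the claim is worded.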
Verdict: PASS  Score: 1.00
Breakdown:
  order_placed/order_found: 1.0
  order_placed/items_match: 1.0
  order_placed/total_match: 1.0
  refund_processed/status_match: 1.0
  refund_processed/amount_match: 1.0
Verdict: FAIL  Score: 0.00
⚠ Hard gate triggered - order cancelled + refund denied
Benchmark: Soft-Only vs Hard-Gated
100 episodes, 35% with corrupt agent outputs. The soft-only LLM judge gave perfect scores to every corrupt episode; hard-gated composition caught all of them with zero false negatives.
Reproducible: python benchmark_gating.py
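A simplified simulation of the comparison (not `benchmark_gating.py` itself, whose internals aren't shown here): a soft judge averages per-check scores, so one corrupted check gets averaged away, while the hard gate fails the episode the moment any check scores zero.

```python
import random

random.seed(0)
episodes = []
for _ in range(100):
    corrupt = random.random() < 0.35
    # Three checks per episode; a corrupt episode silently breaks one of them.
    scores = [1.0, 1.0, 0.0 if corrupt else 1.0]
    episodes.append((corrupt, scores))

# Soft-only judge: average >= 0.5 counts as a pass, so [1, 1, 0] slips through.
soft_missed = sum(1 for corrupt, s in episodes
                  if corrupt and sum(s) / len(s) >= 0.5)
# Hard gate: minimum score must be above 0, so any zeroed check fails the episode.
hard_missed = sum(1 for corrupt, s in episodes
                  if corrupt and min(s) > 0.0)
print(f"soft-only missed {soft_missed} corrupt episodes; hard-gated missed {hard_missed}")
```

Under this toy model the soft judge misses every corrupt episode and the hard gate misses none, which is the qualitative result the benchmark reports.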