Demos

Self-contained examples showing vrdev's verifiable-rewards pipeline. Each demo embeds its own mock server - no external services required.

pip install vrdev

Retail Support-Ops

cancel → refund → inventory

Three HARD verifiers, composed with a fail-closed policy, verify that an AI agent actually cancelled the order, processed the refund, and updated inventory.

tau2.retail.order_cancelled · tau2.retail.refund_processed · tau2.retail.inventory_updated
✓ Pass
Verdict:  PASS
Score:    1.00
Breakdown:
  order_cancelled/status_match:   1.0
  order_cancelled/reason_match:   1.0
  refund_processed/status_match:  1.0
  refund_processed/amount_match:  1.0
  inventory_updated/quantity_match: 1.0
  inventory_updated/warehouse_match: 1.0
✗ Fail
Verdict:  FAIL
Score:    0.00
  ⚠ Hard gate triggered
  order still active, refund pending, inventory wrong
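
The fail-closed composition above can be sketched in plain Python. This is an illustrative sketch, not vrdev's actual API: `VerifierResult` and `compose_fail_closed` are hypothetical names, and the breakdown keys simply mirror the demo output.

```python
from dataclasses import dataclass

@dataclass
class VerifierResult:
    name: str
    breakdown: dict[str, float]  # sub-check scores in [0, 1]

    @property
    def passed(self) -> bool:
        # A HARD verifier passes only if every sub-check is perfect.
        return all(v == 1.0 for v in self.breakdown.values())

def compose_fail_closed(results: list[VerifierResult]) -> tuple[str, float]:
    """Fail-closed: any failing HARD verifier gates the whole episode to 0."""
    if all(r.passed for r in results):
        return "PASS", 1.0
    return "FAIL", 0.0

# Mirrors the passing retail demo: three hard verifiers, all checks at 1.0.
results = [
    VerifierResult("order_cancelled", {"status_match": 1.0, "reason_match": 1.0}),
    VerifierResult("refund_processed", {"status_match": 1.0, "amount_match": 1.0}),
    VerifierResult("inventory_updated", {"quantity_match": 1.0, "warehouse_match": 1.0}),
]
print(compose_fail_closed(results))  # ('PASS', 1.0)
```

Note the asymmetry this buys you: a single failed sub-check (e.g. the order still active) zeroes the score, regardless of how well the other verifiers did.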

Code Agent

lint → test → commit

Verifies an AI agent that writes Python: lint with ruff (zero violations), test with pytest (all passing), and check for a git commit. All three are HARD verifiers - a single failure gates the whole episode.

code.python.lint_ruff · code.python.tests_pass · git.commit_present
✓ Pass
Verdict:  PASS
Score:    1.00
Breakdown:
  lint_ruff/violation_count: 0
  tests_pass/pass_ratio: 1.0
  commit_present/message_match: 1.0
✗ Fail
Verdict:  FAIL
Score:    0.00
  ⚠ Hard gate triggered - 3 unused imports
  Tests passed, but lint failure gates the whole episode
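
Turning raw tool output into sub-check scores is the core of these verifiers. A minimal sketch, assuming ruff's "Found N errors." summary line and pytest's "N passed, M failed" summary (the parsing patterns are assumptions, not vrdev internals):

```python
import re

def ruff_score(output: str) -> float:
    """HARD check: 1.0 only if ruff reports zero violations."""
    m = re.search(r"Found (\d+) error", output)
    violations = int(m.group(1)) if m else 0
    return 1.0 if violations == 0 else 0.0

def pytest_ratio(output: str) -> float:
    """pass_ratio from a pytest summary line, e.g. '3 passed, 1 failed'."""
    passed = sum(int(n) for n in re.findall(r"(\d+) passed", output))
    failed = sum(int(n) for n in re.findall(r"(\d+) failed", output))
    total = passed + failed
    return passed / total if total else 0.0

print(ruff_score("Found 3 errors."))        # 0.0 -> hard gate fires
print(pytest_ratio("12 passed, 0 failed"))  # 1.0
```

This reproduces the failing demo: three unused imports drive `ruff_score` to 0.0, and under fail-closed composition that alone zeroes the episode even with a perfect `pass_ratio`.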

E-commerce Browser Agent

order placed → refund processed

Cross-domain composition: a WebArena-style order verifier plus a τ²-bench refund verifier. Catches agents that claim "order placed" when the order was actually cancelled.

web.ecommerce.order_placed · tau2.retail.refund_processed
✓ Pass
Verdict:  PASS
Score:    1.00
Breakdown:
  order_placed/order_found:  1.0
  order_placed/items_match:  1.0
  order_placed/total_match:  1.0
  refund_processed/status_match: 1.0
  refund_processed/amount_match: 1.0
✗ Fail
Verdict:  FAIL
Score:    0.00
  ⚠ Hard gate triggered - order cancelled + refund denied
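
The key idea is checking the agent's claim against the mock store's ground truth, not against the agent's own transcript. A sketch of such an order verifier; the `claim`/`store` dict shapes here are illustrative, not vrdev's real schema:

```python
def order_placed(claim: dict, store: dict) -> dict[str, float]:
    """Score the agent's claimed order against the store's actual state."""
    order = store.get("orders", {}).get(claim["order_id"])
    if order is None or order["status"] != "placed":
        # Order missing or not in the claimed state: every sub-check fails.
        return {"order_found": 0.0, "items_match": 0.0, "total_match": 0.0}
    return {
        "order_found": 1.0,
        "items_match": 1.0 if order["items"] == claim["items"] else 0.0,
        "total_match": 1.0 if order["total"] == claim["total"] else 0.0,
    }

# The agent claims "order placed", but the store shows it was cancelled:
store = {"orders": {"A1": {"status": "cancelled", "items": ["mug"], "total": 9.99}}}
claim = {"order_id": "A1", "items": ["mug"], "total": 9.99}
print(order_placed(claim, store))  # all sub-checks 0.0 -> hard gate fires
```

Because the verifier reads state the agent cannot rewrite, a confident but false "order placed" claim scores exactly like no order at all.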

Benchmark: Soft-Only vs Hard-Gated

100 episodes, 35% with corrupt agent outputs. The soft-only LLM judge gave a perfect score to every corrupt episode; hard-gated composition caught every corrupt episode with zero false negatives.

Soft-only false positive rate:        100%
Hard-gated false positive rate:       0%
Score divergence on corrupt episodes: 1.000

Reproducible: python benchmark_gating.py
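
The shape of that benchmark can be sketched in a few lines. This is not `benchmark_gating.py` itself, just an illustration of how the two false positive rates are computed, with the soft judge modeled by its observed behavior (a perfect score on every episode):

```python
import random

random.seed(0)
episodes = [{"corrupt": random.random() < 0.35} for _ in range(100)]

soft = [1.0 for _ in episodes]  # observed: the judge scored everything perfect
hard = [0.0 if e["corrupt"] else 1.0 for e in episodes]  # gate reads real state

corrupt = [i for i, e in enumerate(episodes) if e["corrupt"]]
soft_fp = sum(soft[i] == 1.0 for i in corrupt) / len(corrupt)
hard_fp = sum(hard[i] == 1.0 for i in corrupt) / len(corrupt)
print(f"soft-only FP rate: {soft_fp:.0%}, hard-gated FP rate: {hard_fp:.0%}")
```

A false positive here is a corrupt episode scored 1.0; the divergence of 1.000 follows directly, since soft and hard scores disagree by the full range on every corrupt episode.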