# Examples
Runnable demos showing how to compose verifiers into real pipelines.
## Support Operations Pipeline
Cancel an order, process a refund, and verify inventory was updated:
```python
from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

chain = compose(
    [
        get_verifier("vr/tau2.retail.order_cancelled"),
        get_verifier("vr/tau2.retail.refund_processed"),
        get_verifier("vr/tau2.retail.inventory_updated"),
    ],
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = chain.verify(VerifierInput(
    completions=["Order cancelled and refund issued"],
    ground_truth={"order_id": "ORD-42"},
))

print(result[0].passed)  # True only if ALL verifiers pass
```
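To make the fail-closed behavior concrete, here is a minimal plain-Python sketch of the composition semantics. This is an illustration of the idea, not vrdev's implementation: the `Result`, `compose_fail_closed`, and the toy verifiers below are hypothetical names, and the assumption is that under `FAIL_CLOSED` any verifier that fails *or errors* blocks the whole chain.

```python
# Sketch of fail-closed composition semantics (assumed, not vrdev's code):
# every verifier must pass, and an erroring verifier counts as a failure.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    passed: bool


def compose_fail_closed(verifiers: List[Callable[[str], Result]]):
    def verify(completion: str) -> Result:
        for v in verifiers:
            try:
                if not v(completion).passed:
                    return Result(passed=False)  # one failure fails the chain
            except Exception:
                return Result(passed=False)      # FAIL_CLOSED: errors block too
        return Result(passed=True)
    return verify


# Hypothetical verifiers for illustration only
def order_cancelled(completion: str) -> Result:
    return Result("cancelled" in completion)

def erroring(_completion: str) -> Result:
    raise RuntimeError("verifier backend down")


print(compose_fail_closed([order_cancelled])("Order cancelled").passed)            # True
print(compose_fail_closed([order_cancelled, erroring])("Order cancelled").passed)  # False
```

The design point being sketched: a fail-open policy would let the erroring verifier through, while fail-closed treats "could not verify" the same as "did not pass".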
## Code Agent Pipeline
Lint, test, and verify a git commit:
```python
# Reuses the imports from the Support Operations example above.
chain = compose(
    [
        get_verifier("vr/code.python.lint_ruff"),
        get_verifier("vr/code.python.tests_pass"),
        get_verifier("vr/git.commit_present"),
    ],
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = chain.verify(VerifierInput(
    completions=["Fixed the bug and committed"],
    ground_truth={"repo": ".", "test_cmd": "pytest"},
))
```
## Benchmark: HARD Gating Impact
Run 100 episodes comparing HARD-gated vs ungated rewards:
```shell
python demos/benchmark_gating.py
```
**Key finding:** 100% of soft false positives are blocked by HARD gates, reducing reward contamination to 0%.
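The arithmetic behind that finding can be sketched in a few lines. This is an assumed model of the gating (the `gated_reward` helper is hypothetical, not the benchmark's code): a soft reward only counts when every HARD verifier passes, so a soft false positive on an episode with a failed hard check contributes zero reward.

```python
# Assumed HARD-gating semantics: soft score survives only if all hard checks pass.
def gated_reward(soft_score: float, hard_checks: list) -> float:
    return soft_score if all(hard_checks) else 0.0


# A soft false positive: the answer "looks right" to the soft scorer (0.9),
# but a HARD check (e.g. commit present) actually failed.
print(gated_reward(0.9, [True, False]))  # 0.0 -> contamination blocked
print(gated_reward(0.9, [True, True]))   # 0.9 -> legitimate reward kept
```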
## Source Code
All demos are in the demos/ directory of the vrdev repository:
```
demos/
    demo_support_ops.py     # Retail cancel -> refund -> inventory
    demo_code_agent.py      # Lint -> test -> git commit
    demo_browser_agent.py   # E-commerce order + refund
    benchmark_gating.py     # 100-episode HARD vs SOFT benchmark
    CASE_STUDY.md           # Written case study with analysis
```