# Examples
Runnable demos showing how to compose verifiers into real pipelines.
## Support Operations Pipeline
Cancel an order, process a refund, and verify inventory was updated:
```python
from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

chain = compose(
    [
        get_verifier("vr/tau2.retail.order_cancelled"),
        get_verifier("vr/tau2.retail.refund_processed"),
        get_verifier("vr/tau2.retail.inventory_updated"),
    ],
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = chain.verify(VerifierInput(
    completions=["Order cancelled and refund issued"],
    ground_truth={"order_id": "ORD-42"},
))

print(result[0].passed)  # True only if ALL verifiers pass
```
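To make the fail-closed behavior concrete, here is a minimal plain-Python sketch of the composition semantics. This is an illustration of the idea, not vrdev's implementation: the `Result`, `compose_fail_closed`, and the toy verifiers below are hypothetical names, and the assumption is that under `FAIL_CLOSED` any verifier that fails *or errors* blocks the whole chain.

```python
# Sketch of fail-closed composition semantics (assumed, not vrdev's code):
# every verifier must pass, and an erroring verifier counts as a failure.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    passed: bool


def compose_fail_closed(verifiers: List[Callable[[str], Result]]):
    def verify(completion: str) -> Result:
        for v in verifiers:
            try:
                if not v(completion).passed:
                    return Result(passed=False)  # one failure fails the chain
            except Exception:
                return Result(passed=False)      # FAIL_CLOSED: errors block too
        return Result(passed=True)
    return verify


# Hypothetical verifiers for illustration only
def order_cancelled(completion: str) -> Result:
    return Result("cancelled" in completion)

def erroring(_completion: str) -> Result:
    raise RuntimeError("verifier backend down")


print(compose_fail_closed([order_cancelled])("Order cancelled").passed)            # True
print(compose_fail_closed([order_cancelled, erroring])("Order cancelled").passed)  # False
```

The design point being sketched: a fail-open policy would let the erroring verifier through, while fail-closed treats "could not verify" the same as "did not pass".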
## Code Agent Pipeline
Lint, test, and verify a git commit:
```python
# Reuses the imports from the Support Operations example above.
chain = compose(
    [
        get_verifier("vr/code.python.lint_ruff"),
        get_verifier("vr/code.python.tests_pass"),
        get_verifier("vr/git.commit_present"),
    ],
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = chain.verify(VerifierInput(
    completions=["Fixed the bug and committed"],
    ground_truth={"repo": ".", "test_cmd": "pytest"},
))
```
## Benchmark: HARD Gating Impact
Run 100 episodes comparing HARD-gated vs ungated rewards:
```shell
python demos/benchmark_gating.py
```
**Key finding:** 100% of soft false positives are blocked by HARD gates, reducing reward contamination to 0%.
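The arithmetic behind that finding can be sketched in a few lines. This is an assumed model of the gating (the `gated_reward` helper is hypothetical, not the benchmark's code): a soft reward only counts when every HARD verifier passes, so a soft false positive on an episode with a failed hard check contributes zero reward.

```python
# Assumed HARD-gating semantics: soft score survives only if all hard checks pass.
def gated_reward(soft_score: float, hard_checks: list) -> float:
    return soft_score if all(hard_checks) else 0.0


# A soft false positive: the answer "looks right" to the soft scorer (0.9),
# but a HARD check (e.g. commit present) actually failed.
print(gated_reward(0.9, [True, False]))  # 0.0 -> contamination blocked
print(gated_reward(0.9, [True, True]))   # 0.9 -> legitimate reward kept
```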
## Source Code
All demos are in the demos/ directory of the vrdev repository:
```
demos/
    demo_support_ops.py     # Retail cancel -> refund -> inventory
    demo_code_agent.py      # Lint -> test -> git commit
    demo_browser_agent.py   # E-commerce order + refund
    benchmark_gating.py     # 100-episode HARD vs SOFT benchmark
    CASE_STUDY.md           # Written case study with analysis
```