Golden Pipeline Templates

Pre-composed verification pipelines for common agent tasks. Copy, paste, and adapt.

Each template uses the compose() API with policy_mode="fail_closed". SOFT scores only count if all HARD checks pass first.

1. Cancel Order and Verify (Retail)

Use case: Agent cancels an order, processes a refund, and sends a notification email.

Verifiers: 2× HARD + 1× AGENTIC (fail_closed)

from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

pipeline = compose(
    [get_verifier("vr/tau2.retail.order_cancelled"),
     get_verifier("vr/tau2.retail.refund_processed"),
     get_verifier("vr/aiv.email.sent_folder_confirmed")],
    require_hard=True,
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = pipeline.verify(VerifierInput(
    completions=["Order cancelled and confirmation sent"],
    ground_truth={
        "order_id": "ORD-42",
        "expected_refund_amount": 49.99,
        "email_subject": "Your order has been cancelled",
    },
))

print(result[0].passed)     # True only if ALL checks pass
print(result[0].score)      # 1.0 or 0.0
print(result[0].breakdown)  # per-verifier results

Why this works: Even if the agent writes a perfect cancellation email, the pipeline fails if the order is still active or the refund wasn't processed.

2. Code Agent with Quality Gate (Dev)

Use case: Agent writes or modifies code. Must pass linting and tests before style is scored.

Verifiers: 2× HARD gate + 1× SOFT scorer

from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

pipeline = compose(
    [get_verifier("vr/code.python.lint_ruff"),
     get_verifier("vr/code.python.tests_pass"),
     get_verifier("vr/rubric.code.logic_correct")],
    require_hard=True,
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = pipeline.verify(VerifierInput(
    completions=["Fixed the bug and committed"],
    ground_truth={
        "file_path": "src/handler.py",
        "repo": ".",
        "test_cmd": "pytest tests/ -q",
    },
))

# If linting or tests fail, rubric score is zeroed out
# Agent can't get a high score by writing "nice-looking" broken code
print(f"Score: {result[0].score:.2f}")

Why this works: The SOFT rubric only contributes to the score if both HARD gates pass. An agent can't game the LLM judge score while submitting code that doesn't lint or pass tests.

3. Email with Tone Check (Support)

Use case: Agent sends a customer email. Must verify the email was actually sent (AGENTIC) before scoring tone quality (SOFT).

Verifiers: 1× AGENTIC gate + 1× SOFT scorer

from vrdev import get_verifier, compose, VerifierInput
from vrdev.core.types import PolicyMode

pipeline = compose(
    [get_verifier("vr/aiv.email.sent_folder_confirmed"),
     get_verifier("vr/rubric.email.tone_professional")],
    policy_mode=PolicyMode.FAIL_CLOSED,
)

result = pipeline.verify(VerifierInput(
    completions=["Sent a response to the customer"],
    ground_truth={
        "email_subject": "Re: Your support ticket #1234",
        "expected_recipient": "customer@example.com",
    },
))

# SOFT score only counts if the email was actually sent
print(f"Sent: {result[0].passed}")
print(f"Score: {result[0].score:.2f}")

Why this works: An agent that generates a beautifully written email but never actually sends it gets a score of 0.0.

Adapting Templates

Change the policy mode

fail_closed (default): Any HARD/AGENTIC FAIL or ERROR → score 0.0
fail_open: Only explicit FAIL blocks the pipeline; ERROR is tolerated
escalation: Run tiers in order, stop when a tier passes
ensemble: Run all verifiers and aggregate scores

Use with the CLI

vr compose \
  --verifiers vr/tau2.retail.order_cancelled,vr/tau2.retail.refund_processed \
  --policy fail_closed \
  --ground-truth '{"order_id": "ORD-42"}'

Export for RL training

from vrdev import export_to_trl

# Run pipeline on many episodes, export for GRPO/DPO
export_to_trl(results, output="training_data.jsonl")

Building Your Own Pipeline

Browse the Registry to find verifiers for your domain
Compose HARD checks first (state verification), then SOFT (quality scoring)
Use fail_closed to prevent reward hacking
Test with adversarial inputs: use each verifier's built-in adversarial fixtures
Export results to your training framework

← MCP Server