Case Study: Soft-Only vs Hard-Gated Verification

100 verification episodes. 35 corrupt agent outputs. One question: does your reward signal catch false successes, or reward them?

The Problem

When training or evaluating AI agents, reward signals must reflect whether the agent actually changed system state, not just whether its text output sounds correct.

LLM-as-judge evaluators (the “soft-only” approach) score agent text against rubrics. This works well for tone, formatting, and style, but but it has a fundamental blind spot: it can't verify state changes.

An agent that says “I cancelled your order” gets a high rubric score regardless of whether the order was actually cancelled.

Method

We ran 100 verification episodes against a mock retail API.

35 episodes were “corrupt”: the agent claimed success, but the order was still active and the refund was still pending.
65 episodes were “clean”: the agent genuinely completed the task.
All episodes used identical, well-written agent text output.

We evaluated each episode with two strategies:

Strategy	Verifiers	Config
Soft-only	`rubric.email.tone_professional`	Default (threshold ≥ 0.75)
Hard-gated	`tau2.retail.order_cancelled` + `tau2.retail.refund_processed` + `rubric.email.tone_professional`	`require_hard=True`, `policy_mode=FAIL_CLOSED`

Seed: 42. All results reproducible via python benchmark_gating.py.

Results

False Positive Rate

Soft-only

100%

Hard-gated

Avg Score on Corrupt Episodes

Soft-only

1.000

Hard-gated

0.000

Metric	Soft-only	Hard-gated
False positive rate	100%	0%
False positives	35	0
True positives	65	65
Avg score (all)	1.000	0.650
Avg score (corrupt)	1.000	0.000
Score divergence (corrupt)	1.000	N/A
Avg latency	0.1 ms	23.0 ms

Key Finding

Every corrupt agent output was rewarded by the soft-only strategy. The rubric judge gave a perfect 1.0 score to all 35 corrupt episodes because the agent's text was polished and included all expected information, even though the underlying state was wrong.

Hard-gated composition caught all 35 corrupt episodes with zero false positives and zero false negatives. The latency overhead was ~23 ms per episode.

Score Divergence

On corrupt episodes, soft-only assigned an average score of 1.000 while hard-gated assigned 0.000, a divergence of 1.0 (the maximum possible).

In a reinforcement learning context, this means corrupt completions received the same reward as correct completions, making the reward signal useless for learning.

Example: Corrupt Episode Caught

Agent output: "Dear customer, your order ORD-0001 has been cancelled per your request. Refund RF-0001 of $49.99 has been processed." Actual system state: Order ORD-0001: status=active (NOT cancelled) Refund RF-0001: status=pending, amount=$0.00 (NOT processed) Soft-only verdict: PASS (score: 1.0) Hard-gated verdict: FAIL (score: 0.0, hard_gate_failed=true)

Implications

For RL training: Soft-only rewards inject noise proportional to the agent false-success rate. If 35% of agent actions are wrong but scored as correct, the training signal degrades significantly.

For evaluation: Accuracy metrics based on soft-only rewards overcount successes. A benchmark showing “92% task completion” with soft-only scoring may actually be ≤60%.

For production monitoring: Soft-only checks provide a false sense of confidence. Hard verification is required at the state boundary (API, database, filesystem) to detect silent failures.

Reproduce

pip install vrdev cd demos/ python benchmark_gating.py # Results in benchmark_results.json

Source: github.com/vrDotDev/vr-dev/tree/main/demos

See Golden Pipeline Templates →