Case Study: Soft-Only vs Hard-Gated Verification
100 verification episodes. 35 corrupt agent outputs. One question: does your reward signal catch false successes, or reward them?
The Problem
When training or evaluating AI agents, reward signals must reflect whether the agent actually changed system state, not just whether its text output sounds correct.
LLM-as-judge evaluators (the “soft-only” approach) score agent text against rubrics. This works well for tone, formatting, and style, but but it has a fundamental blind spot: it can't verify state changes.
An agent that says “I cancelled your order” gets a high rubric score regardless of whether the order was actually cancelled.
Method
We ran 100 verification episodes against a mock retail API.
35 episodes were “corrupt”: the agent claimed success, but the order was still active and the refund was still pending.
65 episodes were “clean”: the agent genuinely completed the task.
All episodes used identical, well-written agent text output.
We evaluated each episode with two strategies:
| Strategy | Verifiers | Config |
|---|---|---|
| Soft-only | rubric.email.tone_professional | Default (threshold ≥ 0.75) |
| Hard-gated | tau2.retail.order_cancelled + tau2.retail.refund_processed + rubric.email.tone_professional | require_hard=True, policy_mode=FAIL_CLOSED |
Seed: 42. All results reproducible via python benchmark_gating.py.
Results
False Positive Rate
Avg Score on Corrupt Episodes
| Metric | Soft-only | Hard-gated |
|---|---|---|
| False positive rate | 100% | 0% |
| False positives | 35 | 0 |
| True positives | 65 | 65 |
| Avg score (all) | 1.000 | 0.650 |
| Avg score (corrupt) | 1.000 | 0.000 |
| Score divergence (corrupt) | 1.000 | N/A |
| Avg latency | 0.1 ms | 23.0 ms |
Key Finding
Every corrupt agent output was rewarded by the soft-only strategy. The rubric judge gave a perfect 1.0 score to all 35 corrupt episodes because the agent's text was polished and included all expected information, even though the underlying state was wrong.
Hard-gated composition caught all 35 corrupt episodes with zero false positives and zero false negatives. The latency overhead was ~23 ms per episode.
Score Divergence
On corrupt episodes, soft-only assigned an average score of 1.000 while hard-gated assigned 0.000, a divergence of 1.0 (the maximum possible).
In a reinforcement learning context, this means corrupt completions received the same reward as correct completions, making the reward signal useless for learning.
Example: Corrupt Episode Caught
Implications
For RL training: Soft-only rewards inject noise proportional to the agent false-success rate. If 35% of agent actions are wrong but scored as correct, the training signal degrades significantly.
For evaluation: Accuracy metrics based on soft-only rewards overcount successes. A benchmark showing “92% task completion” with soft-only scoring may actually be ≤60%.
For production monitoring: Soft-only checks provide a false sense of confidence. Hard verification is required at the state boundary (API, database, filesystem) to detect silent failures.