# Cost & Latency Profile
Verification speed matters for RL training loops. Here is the performance profile by tier.
## Per-Tier Latency

| Tier | Typical Latency | Cost | LLM Required |
|------|-----------------|------|--------------|
| HARD | 1-50ms | Free | No |
| SOFT | 500-2000ms | LLM API cost | Yes |
| AGENTIC | 2-10s | LLM + browser | Yes |
## Percentile Breakdown

| Tier | p50 | p95 | p99 | Notes |
|------|-----|-----|-----|-------|
| HARD (live DB) | 12ms | 45ms | 95ms | Network-bound SELECT |
| HARD (BYOS/pre_result) | 0.05ms | 0.1ms | 0.3ms | Pure computation |
| HARD (file parse) | 3ms | 8ms | 15ms | JSON/CSV/YAML |
| SOFT (GPT-4o) | 780ms | 1400ms | 2200ms | LLM API latency dominates |
| SOFT (Claude Sonnet) | 1100ms | 1800ms | 2800ms | Slightly higher base |
| SOFT (local LLM) | 200ms | 800ms | 1500ms | Depends on hardware |
| AGENTIC (browser) | 3.2s | 7.5s | 12s | DOM load + interaction |
For RL training loops, use BYOS (pre_result) mode to keep HARD verifiers at sub-millisecond p99. SOFT tier latency is dominated by LLM inference; use local models to eliminate network variance.
## HARD Tier Details
HARD verifiers are the fastest because they perform simple state checks:
| Verifier Type | Latency | Notes |
|--------------|---------|-------|
| File read + parse | 1-10ms | JSON, CSV, YAML, text |
| Database SELECT | 10-50ms | Single-row lookups |
| HTTP GET | 50-200ms | Network-bound |
| Shell command | 100-5000ms | Depends on command (pytest, ruff) |
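For illustration, a file-parse HARD check is just read + parse + compare. This is a minimal sketch (the `verify_json_field` name and signature are hypothetical, not part of the library's API):

```python
import json
from pathlib import Path

def verify_json_field(path: str, field: str, expected) -> bool:
    """HARD check: parse a JSON file and compare one field.

    Latency is dominated by the file read plus json.loads,
    typically a few milliseconds for small files.
    """
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError):
        return False  # missing or malformed file fails the check
    return data.get(field) == expected
```

A missing or unparseable file returns `False` rather than raising, so a failed check scores the episode instead of crashing the loop.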
### With BYOS (pre_result)
When using the pre_result pattern, all HARD verifiers drop to sub-millisecond latency since they skip external I/O entirely:
| Verifier Type | Latency with pre_result |
|--------------|------------------------|
| Database checks | < 0.1ms |
| API checks | < 0.1ms |
| File checks | < 0.1ms (if content pre-loaded) |
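A sketch of why pre_result checks are sub-millisecond: the state snapshot is captured once, and each verification is then a pure in-memory comparison. The `verify_db_row` helper and the snapshot layout below are illustrative assumptions, not the library's actual interface:

```python
# pre_result-style HARD check: the environment state (e.g. a DB row dump)
# is captured once up front, so the verifier itself does no I/O at all.
def verify_db_row(pre_result: dict, expected: dict) -> bool:
    """Return True if every expected column matches the pre-loaded row."""
    return all(pre_result.get(col) == val for col, val in expected.items())

snapshot = {"id": 42, "status": "shipped", "total": 19.99}  # captured once
print(verify_db_row(snapshot, {"status": "shipped"}))  # True
```

Because the hot path is a dictionary lookup, the same snapshot can be verified thousands of times per second during training.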
## SOFT Tier Details
SOFT verifiers call an LLM to evaluate text against a rubric:
- GPT-4o: ~800ms, ~$0.003/verification
- Claude Sonnet: ~1200ms, ~$0.003/verification
- Local models: Variable, no API cost
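The rubric-evaluation step can be sketched as follows. Here `call_llm` is a stand-in for whatever chat-completion client the deployment uses (hosted or local), and the prompt format and score parsing are illustrative assumptions, not the library's actual protocol:

```python
def soft_verify(text: str, rubric: str, call_llm) -> float:
    """Ask an LLM to score `text` against `rubric`; returns a 0-1 score.

    `call_llm` abstracts the client (OpenAI, Anthropic, or a local
    server): it takes a prompt string and returns the model's reply.
    """
    prompt = (
        f"Rubric:\n{rubric}\n\nResponse:\n{text}\n\n"
        "Reply with a single number between 0 and 1."
    )
    reply = call_llm(prompt)
    try:
        # Clamp to [0, 1] in case the model replies out of range.
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable reply counts as a failed check

# Usage with a stubbed client:
print(soft_verify("The answer is 4.", "Correct arithmetic", lambda p: "0.9"))  # 0.9
```

The per-verification latency in the list above is almost entirely this single LLM round trip, which is why swapping the client changes the profile so much.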
## At Scale: RL Training Loops
For a typical RL training episode with 3 HARD + 1 SOFT verifier:
| Scenario | Total Latency | Cost per Episode |
|----------|--------------|-----------------|
| Live state + LLM | ~2.5s | ~$0.003 |
| BYOS + LLM | ~1.2s | ~$0.003 |
| BYOS only (HARD) | < 1ms | Free |
### 10,000 Episodes
| Mode | Total Time | Total Cost |
|------|-----------|------------|
| Live | ~7 hours | ~$30 |
| BYOS + LLM | ~3.3 hours | ~$30 |
| BYOS HARD-only | ~10 seconds | Free |
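These totals follow directly from the per-episode numbers (one SOFT call per episode at ~$0.003):

```python
episodes = 10_000
cost_per_soft = 0.003  # USD per SOFT verification (one per episode)

live_hours = 2.5 * episodes / 3600      # ~2.5s per episode with live state
byos_llm_hours = 1.2 * episodes / 3600  # HARD checks near-free; LLM dominates
hard_only_secs = 0.001 * episodes       # <1ms per episode, HARD-only

print(f"Live:           {live_hours:.1f} h, ${cost_per_soft * episodes:.0f}")
print(f"BYOS + LLM:     {byos_llm_hours:.1f} h, ${cost_per_soft * episodes:.0f}")
print(f"BYOS HARD-only: {hard_only_secs:.0f} s, $0")
```

Note that BYOS alone does not reduce cost when a SOFT verifier is in the loop; it only removes the live-state latency.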
## Optimization Tips
- Use BYOS for training: Pre-compute state once, verify thousands of times
- Run HARD first: Gate expensive SOFT calls behind cheap HARD checks
- Batch API calls: Use the /v1/batch endpoint for bulk verification
- Cache LLM results: With a fixed model and temperature 0, SOFT verdicts are effectively deterministic for the same input + rubric, so they can be cached