Trust Model

vr.dev is built on a simple premise: verification results must be trustworthy enough to train on. This page explains the mechanisms that make that possible.

The Trust Problem

When you use an LLM-as-judge to evaluate agent work, you inherit the LLM's failure modes: hallucination, sycophancy, inconsistency. Training on these weak signals propagates errors into model weights.

vr.dev addresses this with a layered trust architecture.

Four Layers of Trust

Layer 1: Deterministic Verification (HARD tier)

HARD verifiers produce binary pass/fail verdicts by querying actual system state. No LLM in the loop. No ambiguity.

  • Database verifier checks the row exists
  • File verifier reads bytes from disk
  • API verifier makes a real HTTP request

Trust guarantee: If the system state matches, the verdict is correct. Period.
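A HARD verifier of this kind can be sketched in a few lines. The following is a hypothetical illustration, not the vr.dev SDK interface: the function name, result shape, and sqlite3 backend are all assumptions. The point is the pattern: query real state, compare, emit a binary verdict with the raw evidence attached.

```python
import sqlite3

def verify_order_cancelled(conn: sqlite3.Connection, order_id: int) -> dict:
    """HARD verifier sketch: query actual system state, return a binary verdict.

    No LLM involved; the verdict is a direct function of the database row.
    """
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    passed = row is not None and row[0] == "cancelled"
    return {
        "verdict": "PASS" if passed else "FAIL",
        "score": 1.0 if passed else 0.0,
        "evidence": {"query_result": row[0] if row else None},
    }

# Demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (42, 'cancelled')")
print(verify_order_cancelled(conn, 42)["verdict"])  # → PASS
```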

Layer 2: Gated Composition

The composition engine enforces a critical invariant: SOFT scores are only counted when HARD checks pass first.

Episode reward = HARD_gate(order_cancelled) AND HARD_gate(refund_processed)
                ? SOFT_score(email_tone) * weight
                : 0.0

This prevents reward hacking. An agent cannot earn reward for a well-written email about an order it failed to actually cancel.
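The gating rule reduces to a short function. This is a minimal sketch with illustrative names, not the actual composition engine API: if any HARD gate fails, the SOFT score contributes nothing.

```python
def episode_reward(hard_results: list[bool], soft_score: float, weight: float) -> float:
    """Gated composition sketch: SOFT score counts only if every HARD gate passes."""
    if not all(hard_results):
        return 0.0  # any failed gate zeroes the episode reward
    return soft_score * weight

# A well-written email (soft_score=0.9) earns nothing if the cancel gate failed
assert episode_reward([True, False], 0.9, 1.0) == 0.0
assert episode_reward([True, True], 0.9, 1.0) == 0.9
```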

Layer 3: Evidence & Integrity

Every verification produces a structured evidence record:

{
  "verifier_id": "vr/tau2.retail.order_cancelled",
  "verdict": "PASS",
  "score": 1.0,
  "evidence": {
    "query": "SELECT status FROM orders WHERE id = 42",
    "result": {"status": "cancelled"},
    "snapshot_at": "2026-03-15T10:30:00Z"
  },
  "content_hash": "sha256:a1b2c3...",
  "parent_hash": "sha256:d4e5f6...",
  "signature": "ed25519:base64...",
  "signing_key_id": "088e71d4...",
  "verifier_version": "0.1.0"
}

This layer provides two distinct guarantees:

Auditability (local SDK + hosted API): Every result carries the raw evidence (the actual query, its result, the snapshot timestamp). You can always inspect why a verdict was issued. This travels with the reward signal through training pipelines. No separate audit system needed; the evidence is the audit trail.

Integrity (hosted API): Evidence records are content-hashed (SHA-256) and signed with Ed25519 keys. Records are chained via parent_hash, creating a Merkle-style append-only log. This makes it computationally infeasible to alter past results without detection. Note: integrity signing requires the hosted API. The local SDK produces evidence payloads but does not sign them.
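The hash-chain half of this can be sketched with the standard library alone. The Ed25519 signing step is omitted here (it requires a crypto library such as PyNaCl and, per the above, happens only on the hosted API). Field names follow the evidence record shown earlier; the exact canonicalization rule is an assumption.

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the record body.

    Hash and signature fields are excluded so the hash covers only content.
    """
    body = {k: v for k, v in record.items()
            if k not in ("content_hash", "parent_hash", "signature")}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    return f"sha256:{digest}"

def verify_chain(records: list[dict]) -> bool:
    """Each record must hash correctly and link to its predecessor."""
    prev = None
    for rec in records:
        if rec["content_hash"] != content_hash(rec):
            return False  # record body was altered after hashing
        if prev is not None and rec["parent_hash"] != prev["content_hash"]:
            return False  # chain link broken
        prev = rec
    return True
```

Tampering with any record, or reordering the chain, changes a content hash and breaks every subsequent parent link, which is what makes after-the-fact alteration detectable.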

Layer 4: On-Chain Anchoring (optional, hosted API only)

The hosted API periodically publishes Merkle roots to an append-only smart contract on Base (Ethereum L2). This provides an external, immutable timestamp and integrity proof for each batch of evidence records. An inclusion proof, fetched from GET /v1/evidence/{hash}/proof, verifies that an individual evidence record was part of an anchored batch.


This layer is entirely optional. If you don't need third-party-verifiable evidence integrity (e.g., for compliance or dispute resolution), skip it; the local SDK and hosted API both work without on-chain anchoring. The layer exists for cases where evidence must be independently auditable by parties who don't trust the vr.dev execution environment.
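Checking an inclusion proof client-side is straightforward. This sketch assumes a simple (side, sibling-hash) proof format; the hosted API's actual response shape may differ, so treat the structure as illustrative.

```python
import hashlib

def merkle_verify(leaf: bytes, proof: list[tuple[str, bytes]], root: bytes) -> bool:
    """Recompute the Merkle root from a leaf hash and its sibling path.

    `proof` is a list of (side, sibling_hash) pairs, where side is "L" or "R"
    and indicates which side the sibling sits on at that tree level.
    """
    h = leaf
    for side, sibling in proof:
        pair = sibling + h if side == "L" else h + sibling
        h = hashlib.sha256(pair).digest()
    return h == root
```

If the recomputed root matches the anchored root on Base, the record was in the batch; any tampering with the leaf or the path produces a different root.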

Threat Model

| Threat | Mitigation |
|--------|------------|
| Agent claims success but state disagrees | HARD verifiers check real state |
| Agent games soft metrics while violating constraints | Gated composition blocks soft rewards |
| Can't explain why a verdict was issued | Evidence payloads carry raw data (query, result, timestamp) everywhere |
| Verification results altered after the fact | Content-hashed + Ed25519-signed evidence chain (hosted API) |
| Need third-party-verifiable integrity | Optional on-chain Merkle root anchoring on Base L2 |
| LLM judge is inconsistent | HARD tier avoids LLM entirely; SOFT tier uses rubrics |
| Verifier code has bugs | Per-verifier fixture tests (positive, negative, adversarial) |
| Prompt injection via agent output | Verifiers never execute agent text; they query state independently |

Verification Integrity Score

Each verifier's registry entry includes an operational scorecard:

  • Determinism: Is the result reproducible? (deterministic / probabilistic)
  • Evidence quality: What kind of proof? (hard-state / rubric-llm / browser-observed)
  • Permissions: What access does the verifier need? (fs:read, db:read, net:http)
  • Gating: Should this verifier gate downstream rewards?
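A registry entry carrying this scorecard might look like the following. The field names are hypothetical, mirroring the bullets above, and the consumer rule at the end is an illustrative policy, not a vr.dev guarantee.

```python
# Hypothetical registry scorecard entry; field names mirror the bullets above.
scorecard = {
    "verifier_id": "vr/tau2.retail.order_cancelled",
    "determinism": "deterministic",    # vs. "probabilistic"
    "evidence_quality": "hard-state",  # vs. "rubric-llm", "browser-observed"
    "permissions": ["db:read"],        # e.g. "fs:read", "net:http"
    "gating": True,                    # should this verifier gate downstream rewards?
}

# Example consumer rule (an assumption, not vr.dev policy):
# only deterministic, hard-state verifiers are allowed to gate.
can_gate = (scorecard["determinism"] == "deterministic"
            and scorecard["evidence_quality"] == "hard-state"
            and scorecard["gating"])
```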