Decision Integrity

A decision record is only valuable if you can trust it. Haft has multiple mechanisms to ensure decisions are honest, evidence is real, claims are falsifiable, and the knowledge base stays clean.

Adversarial verification gate

Before recording any decision, the agent runs a verification check. The principle: the agent that generated the options cannot be the sole validator of those options (FPF A.12 — External Transformer Principle).

For tactical decisions (quick, reversible):

  • One-line counter-argument: "The strongest argument against this decision is..."
  • If the counter-argument kills the decision — back to exploring

For standard/deep decisions:

  1. Deductive consequences — "If this is correct, what 3 things must be true?"
  2. Strongest counter-argument — genuine, not a strawman
  3. Self-evidence check — "Is the only evidence from this same conversation?"
  4. Tail failure scenarios — low-probability, high-impact failure modes
  5. WLNK challenge — "Is the stated weakest link actually the weakest?"
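The two checklists above can be sketched as a small gate function. This is only an illustration: the check identifiers, the `gate` function, and its return values are hypothetical names, not haft's API, and sending a standard/deep decision back to exploring on a fatal counter-argument is an assumption carried over from the tactical rule.

```python
# Check names paraphrase the lists above; all identifiers are hypothetical.
TACTICAL_CHECKS = ["counter_argument"]
STANDARD_CHECKS = [
    "deductive_consequences",
    "counter_argument",
    "self_evidence",
    "tail_failures",
    "wlnk_challenge",
]


def required_checks(depth: str) -> list[str]:
    """Return the checks a decision of the given depth must answer."""
    if depth == "tactical":
        return list(TACTICAL_CHECKS)
    if depth in ("standard", "deep"):
        return list(STANDARD_CHECKS)
    raise ValueError(f"unknown decision depth: {depth!r}")


def gate(depth: str, answers: dict[str, str], counter_argument_fatal: bool) -> str:
    """Return 'incomplete', 'explore', or 'record' for a proposed decision."""
    missing = [c for c in required_checks(depth) if not answers.get(c, "").strip()]
    if missing:
        return "incomplete"  # some required check has no answer yet
    if counter_argument_fatal:
        return "explore"  # the counter-argument kills the decision
    return "record"
```

A tactical decision needs only the counter-argument answered, while a standard or deep decision stays "incomplete" until all five checks have answers.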

Inductive measurement gate

When a measurement is recorded (verdict: accepted/partial/failed), haft checks whether the decision has a baseline, that is, whether its file hashes were snapshotted. If not:

  • Warning appears in the response: "No baseline found — implementation may not be verified"
  • The measurement is recorded at CL1 (0.4 penalty) instead of CL3 (no penalty)
  • R_eff for an unverified measurement: max(0, 1.0 - 0.4) = 0.6. Still healthy, but visibly lower than 1.0
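The penalty arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name and the `base_score` parameter are assumptions for illustration, and only the two levels mentioned in this section (CL1 and CL3) are modelled.

```python
# Penalties as stated in the text: CL1 costs 0.4, CL3 costs nothing.
# Any other levels are not described here, so they are left out.
CL_PENALTY = {"CL1": 0.4, "CL3": 0.0}


def measurement_r_eff(has_baseline: bool, base_score: float = 1.0) -> tuple[str, float]:
    """Pick the level from baseline presence and apply its penalty."""
    level = "CL3" if has_baseline else "CL1"
    # Clamp at zero, matching max(0, 1.0 - penalty) in the text.
    return level, max(0.0, base_score - CL_PENALTY[level])
```

With a baseline the measurement keeps its full score of 1.0; without one it drops to about 0.6.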

This prevents the agent from calling measure from memory without actually verifying the implementation — a real problem we discovered and fixed during development.

Evidence supersession

When a new measurement is recorded on a decision that already has a measurement, the old measurement is marked verdict='superseded' and excluded from R_eff computation. This prevents old partial measurements from permanently dragging R_eff down.

Superseded evidence stays in the database for audit — it's not deleted, just excluded from the active chain.
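A minimal sketch of the supersession rule, assuming an in-memory list stands in for the database. The `Measurement` class and both function names are hypothetical; how the active measurements are combined into R_eff is not covered in this section, so the sketch stops at selecting the active chain.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    decision_id: str
    verdict: str  # "accepted", "partial", "failed", or "superseded"


def record_measurement(store: list[Measurement], new: Measurement) -> None:
    """Mark earlier measurements on the same decision as superseded, then append."""
    for m in store:
        if m.decision_id == new.decision_id and m.verdict != "superseded":
            m.verdict = "superseded"  # kept for audit, never deleted
    store.append(new)


def active_chain(store: list[Measurement], decision_id: str) -> list[Measurement]:
    """Measurements that still count toward R_eff for this decision."""
    return [
        m for m in store
        if m.decision_id == decision_id and m.verdict != "superseded"
    ]
```

Recording a "partial" measurement followed by an "accepted" one leaves both rows in the store, but only the second in the active chain.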

Note-decision deduplication

Notes and decisions serve different purposes. Notes are observations ("we use Redis here"). Decisions are contracts ("we chose Redis because X, with invariants Y and rollback Z"). When someone tries to record a note that duplicates an existing decision, haft catches it.

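The check uses containment (not Jaccard similarity): the share of the note's words that also appear in the decision title. Because only the note's length is in the denominator, a short note fully covered by a longer title still scores high.

  • >70% of the note's words appear in a decision title: rejected with an explanation
  • 50-70%: warning, note still recorded
  • <50%: passes silently

The same check runs note-vs-note to prevent duplicate notes from accumulating.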
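The thresholds above can be sketched as follows. The tokenization (lowercased alphanumeric runs, compared as sets) and the function names are assumptions; haft's actual word splitting may differ.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; an assumed tokenization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def containment(note: str, decision_title: str) -> float:
    """Fraction of the note's words that also appear in the decision title."""
    note_words = words(note)
    if not note_words:
        return 0.0
    return len(note_words & words(decision_title)) / len(note_words)


def check_note(note: str, decision_title: str) -> str:
    """Apply the thresholds from the text: >70% reject, 50-70% warn, else pass."""
    score = containment(note, decision_title)
    if score > 0.70:
        return "reject"
    if score >= 0.50:
        return "warn"
    return "pass"
```

For example, against the title "Use Redis for the session cache", the note "we use Redis here" shares 2 of its 4 words and yields "warn", while "Use Redis for sessions" shares 3 of 4 and yields "reject".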

Batch cleanup: /h-verify action="reconcile" scans all active notes against all active decisions in one pass and reports overlaps.

Claims with verify_after

Decisions often contain predictions: "this cache will reduce p99 latency by 40%" or "migration will complete within 2 sprints." In v6, these are structured as claims — falsifiable predictions attached to a decision.

A claim has three components:

  • observable — what to measure ("p99 latency of /api/search")
  • threshold — what counts as success ("< 200ms")
  • verify_after — when to check ("2026-05-15")

Claims:
  - observable: "p99 latency of /api/search"
    threshold: "< 200ms"
    verify_after: 2026-05-15

  - observable: "cache hit rate"
    threshold: "> 85%"
    verify_after: 2026-05-01

When a verify_after date passes and the claim remains unverified, /h-verify scan surfaces it as "pending verification." The agent prompts you to collect evidence: did the prediction hold? The result attaches as CL3 evidence if verified against production data, pulling R_eff up or down based on reality.

Claims are falsifiable by design. A claim without a concrete observable and threshold is not a claim — it's a hope. Haft rejects claims that lack these fields.
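A minimal sketch of the claim rules described above: rejecting claims that lack an observable or threshold, and finding claims whose verify_after date has passed. The `Claim` class, the `verified` flag, and the function names are hypothetical, not haft's data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Claim:
    observable: str
    threshold: str
    verify_after: date
    verified: bool = False  # assumed flag; set once evidence is attached


def validate(claim: Claim) -> None:
    """Reject a claim with no concrete observable or threshold."""
    if not claim.observable.strip() or not claim.threshold.strip():
        raise ValueError("a claim needs a concrete observable and threshold")


def pending_verification(claims: list[Claim], today: date) -> list[Claim]:
    """Unverified claims whose verify_after date has already passed."""
    return [c for c in claims if not c.verified and c.verify_after < today]
```

With the two example claims above and a scan run on 2026-05-10, only the cache hit rate claim (due 2026-05-01) is pending; the p99 latency claim is not due until 2026-05-15.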

Decisions as test specs

Full-cycle decisions contain enough structure to serve as test specifications: invariants, post-conditions, admissibility constraints, and affected file paths. Any coding agent can translate these into property-based tests.

Language      Property-based testing library
Go            rapid, gopter
Python        hypothesis
Rust          proptest
TypeScript    fast-check

Test results attach as CL3 evidence — the highest confidence level in the R_eff model. This creates a closed loop: decisions define what must hold, tests verify that it holds, evidence records the verification, and R_eff reflects the trust level.
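To illustrate how an invariant becomes a property test, here is a sketch for a hypothetical decision invariant: "a cached value is never served once its TTL has expired." The `TTLCache` class is a toy stand-in for the code under test, and the randomized loop uses only the standard library in place of a real property-based library such as hypothesis, which would also shrink failing inputs.

```python
import random


class TTLCache:
    """Toy cache with an explicit clock so the test controls time."""

    def __init__(self, ttl: int):
        self.ttl = ttl
        self._items = {}  # key -> (value, stored_at)

    def put(self, key, value, now: int) -> None:
        self._items[key] = (value, now)

    def get(self, key, now: int):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:
            del self._items[key]  # expired entries are dropped on read
            return None
        return value


def check_ttl_invariant(trials: int = 500, seed: int = 0) -> bool:
    """Property: get() returns the value before the TTL elapses, None after."""
    rng = random.Random(seed)
    for _ in range(trials):
        ttl = rng.randint(1, 100)
        stored_at = rng.randint(0, 1000)
        elapsed = rng.randint(0, 200)
        cache = TTLCache(ttl)
        cache.put("k", "v", now=stored_at)
        expected = "v" if elapsed < ttl else None
        if cache.get("k", now=stored_at + elapsed) != expected:
            return False
    return True
```

A passing run of `check_ttl_invariant()` is the kind of result that could be attached to the decision as evidence.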

When code drifts (files change after baseline), R_eff drops, and the tests need to re-run. The integrity system doesn't just check that decisions were made honestly — it checks that reality still matches the decision.
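Drift detection against a baseline of file hashes can be sketched like this. The choice of SHA-256 and the function names are assumptions; this section only states that file hashes are snapshotted.

```python
import hashlib
from pathlib import Path


def file_hash(path: str) -> str:
    """Hex digest of a file's contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths: list[str]) -> dict[str, str]:
    """Record a baseline: one content hash per affected file."""
    return {p: file_hash(p) for p in paths}


def drifted(baseline: dict[str, str]) -> list[str]:
    """Files that were deleted or whose contents changed since the baseline."""
    changed = []
    for path, old_hash in baseline.items():
        if not Path(path).exists() or file_hash(path) != old_hash:
            changed.append(path)
    return changed
```

A non-empty result from `drifted(baseline)` is the signal that the decision's tests should re-run.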