Key Concepts

R_eff (Effective Reliability)

Every decision has a computed trust score: R_eff. It's calculated from evidence attached to the decision — test results, measurements, benchmarks, user feedback.

Formula: R_eff = min(effective_score) across all evidence items.

This is the weakest link principle — the decision is only as trustworthy as its weakest piece of evidence. No averaging, no optimistic roll-ups.

  • R_eff ≥ 0.5 — healthy, decision is trustworthy
  • R_eff < 0.5 — degraded, surfaces in stale scan
  • R_eff < 0.3 — AT RISK, needs immediate attention
  • No evidence — decision is fresh, not degraded (treated as healthy)
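The min rule and the thresholds above can be sketched in a few lines. This is a minimal illustration, not Quint's actual API; the function and status names are invented here:

```python
def r_eff(evidence_scores):
    """Weakest-link reliability: the minimum effective score across
    all evidence items, or None when there is no evidence at all
    (a fresh decision, treated as healthy)."""
    if not evidence_scores:
        return None  # fresh, not degraded
    return min(evidence_scores)  # no averaging, no optimistic roll-ups

def status(score):
    """Map an R_eff value onto the health bands described above."""
    if score is None or score >= 0.5:
        return "healthy"
    if score >= 0.3:
        return "degraded"   # surfaces in the stale scan
    return "at_risk"        # needs immediate attention
```

Note that one weak item drags the whole decision down: `r_eff([0.9, 0.4, 0.8])` is 0.4, so the decision is degraded even though two pieces of evidence are strong.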

Congruence Level (CL)

Not all evidence is equally relevant. A benchmark from the same project (CL3) is more trustworthy than a blog post about a similar stack (CL1). CL penalties reduce the effective score of cross-context evidence:

CL    Context             Penalty   Example
CL3   Same context        0.0       Internal test result
CL2   Similar context     0.1       Decision from a related project (same language)
CL1   Different context   0.4       External documentation, blog post
CL0   Opposed context     0.9       Evidence from a conflicting methodology

CL matters for cross-project recall too. When a decision from another project surfaces during /q-frame, it gets tagged CL2 (same language) or CL1 (different language).
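One way to picture the penalty is as a straight deduction from the evidence's base score, floored at zero. The exact formula is not specified here, so treat the subtraction below as an assumption for illustration:

```python
# Penalty per congruence level, from the table above.
CL_PENALTY = {3: 0.0, 2: 0.1, 1: 0.4, 0: 0.9}

def effective_score(base_score, cl):
    """Apply the congruence penalty to a piece of evidence.
    ASSUMPTION: penalty is subtracted from the base score and
    the result is floored at 0.0."""
    return max(0.0, base_score - CL_PENALTY[cl])
```

Under this reading, a strong external benchmark (base 0.9, CL1) lands at 0.5: barely healthy, and easily the weakest link in a chain of internal CL3 evidence.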

Weakest Link (WLNK)

Every variant in /q-explore must identify its weakest link — the single thing that bounds its quality. This is not a generic "cons" list. It's the specific mechanism that will fail first under stress.

WLNK applies everywhere in Quint: R_eff is min (not average), gate decisions use worst-wins (not voting), evidence chains break at the weakest item.
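The worst-wins gate rule can be sketched the same way as R_eff's min. The verdict names and ranking here are illustrative assumptions, not Quint's actual vocabulary:

```python
# ASSUMPTION: three-level verdicts, ranked from worst to best.
VERDICT_RANK = {"fail": 0, "warn": 1, "pass": 2}

def gate(verdicts):
    """Worst-wins: the gate's overall verdict is the single worst
    individual verdict, not a vote or an average."""
    return min(verdicts, key=VERDICT_RANK.get)
```

So a gate with verdicts `["pass", "pass", "warn"]` comes out as "warn", and one "fail" overrides any number of passes.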

Evidence Decay

Evidence has an optional valid_until date. When evidence expires, its score drops to 0.1 regardless of its original verdict. This pulls R_eff down, making the decision surface as stale.

The intuition: a benchmark from 6 months ago is not as trustworthy as one from last week. Evidence doesn't become false — it becomes weak. 0.1, not 0.0.
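The decay rule is simple enough to state directly. A minimal sketch, assuming `valid_until` is a calendar date and the field/function names are illustrative:

```python
from datetime import date

EXPIRED_SCORE = 0.1  # weak, not false: 0.1, never 0.0

def decayed_score(score, valid_until, today=None):
    """Return the evidence score after decay. Expired evidence drops
    to 0.1 regardless of its original verdict; evidence with no
    valid_until date never decays."""
    today = today or date.today()
    if valid_until is not None and today > valid_until:
        return EXPIRED_SCORE
    return score
```

Because R_eff is a min, a single expired item pulls the whole decision down to 0.1, which is well inside the at-risk band.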

Indicator Roles

When characterizing a problem (/q-char), each comparison dimension gets a role:

  • constraint — hard limit, must satisfy. Variants that violate it are eliminated.
  • target — what you're optimizing. 1-3 targets max.
  • observation — monitor but do NOT optimize. This is Anti-Goodhart: when a metric becomes a target, it ceases to be a good metric. Mark things as observation to prevent reward hacking.
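The three roles can be sketched as plain data plus one elimination check. The indicator names, limits, and dict shape below are invented for illustration:

```python
# Hypothetical indicator set for a service-selection problem.
indicators = [
    {"name": "p99_latency_ms", "role": "constraint", "limit": 200},
    {"name": "throughput_rps", "role": "target"},       # optimize this
    {"name": "error_rate",     "role": "observation"},  # monitor, never optimize
]

def violates_constraints(measurements, indicators):
    """A variant is eliminated if any constraint indicator
    exceeds its hard limit. Targets and observations never
    eliminate a variant."""
    return any(measurements[i["name"]] > i["limit"]
               for i in indicators if i["role"] == "constraint")
```

Observations deliberately have no limit and no optimizer attached: they exist so you notice when a metric moves, without turning it into a target.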

Pareto Front

After comparison (/q-compare), variants are plotted on a Pareto front: the set of options not dominated by any other option. One option dominates another when it is at least as good on every dimension and strictly better on at least one.

If variant A is better on latency but worse on cost, and variant B is the reverse, both are on the Pareto front. Neither dominates the other. Your job is to make the trade-off explicit, not to pretend one option is objectively "best."
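The dominance check and the front itself take only a few lines. A minimal sketch, assuming each variant is a tuple of scores where higher is better on every dimension:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every dimension
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(variants):
    """Keep only the variants that no other variant dominates."""
    return [v for v in variants
            if not any(dominates(o, v) for o in variants if o is not v)]
```

With scores (latency, cost) normalized so higher is better, `(0.9, 0.4)` and `(0.4, 0.9)` both survive, while `(0.3, 0.3)` is dominated by each of them and drops out.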

Parity

A comparison is junk if the options weren't evaluated fairly. Parity means: same inputs, same scope, same budget, same measurement procedure for all variants. If you benchmarked Redis on production hardware and Memcached on a laptop — that's not a fair comparison.

Transformer Mandate

From FPF: a system cannot transform itself. The agent that generates options cannot be the sole validator of those options. In practice:

  • The agent generates variants — the human decides
  • The verification gate challenges decisions before recording
  • Measurements without independent verification get CL1 (self-evidence), not CL3

The two cycles

Decisions don't end at recording. The observation cycle (drift, evidence decay, stale scan) feeds signals back into the decision cycle (frame, explore, compare, decide). Failed measurements create new problems. Stale decisions trigger re-evaluation. See Decision Lifecycle for how quint-code supports each stage.

Next