Changelog

What shipped.

All notable changes to the ReasonRank application. Dates are release dates; @boydco/* platform package bumps are omitted unless they change app behavior.

Unreleased — Money-math integrity + launch hardening

Statistical and security hardening pass on the numbers customers act on. The verified-savings pipeline now compares production to production, refuses to report savings it can't prove, and uses the statistically correct test for paired eval data.

Realized savings baseline was apples-to-oranges. Realized savings compared an EVAL-SUITE cost-per-call baseline against PRODUCTION traces. Applying a recommendation now snapshots the agent's pre-switch trailing-30d production cost-per-call (model_switches.baseline_trace_cost_per_call, migration 0014) and realized savings use it — production vs production. The eval baseline remains only as a legacy fallback.
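
A minimal sketch of the production-vs-production comparison described above. Only baseline_trace_cost_per_call comes from this changelog; the record shape, the eval-baseline field, and both function names are assumptions made for illustration.

```typescript
// Hypothetical record shape: only baseline_trace_cost_per_call is named in the
// changelog; evalBaselineCostPerCall and the function names are illustrative.
interface ModelSwitchBaseline {
  baselineTraceCostPerCall: number | null; // pre-switch trailing-30d production snapshot
  evalBaselineCostPerCall: number;         // legacy eval-suite baseline (fallback only)
}

type BaselineSource = "production" | "eval-legacy";

// Prefer the production snapshot; fall back to the eval baseline only when
// the switch predates the snapshot column.
function pickBaseline(sw: ModelSwitchBaseline): { costPerCall: number; source: BaselineSource } {
  if (sw.baselineTraceCostPerCall !== null) {
    return { costPerCall: sw.baselineTraceCostPerCall, source: "production" };
  }
  return { costPerCall: sw.evalBaselineCostPerCall, source: "eval-legacy" };
}

// Savings per call = baseline cost per call minus post-switch cost per call.
function realizedSavingsPerCall(sw: ModelSwitchBaseline, postSwitchCostPerCall: number): number {
  return pickBaseline(sw).costPerCall - postSwitchCostPerCall;
}
```

Returning the baseline source alongside the number lets a caller flag savings that still rest on the legacy eval baseline.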

`MIN_REALIZED_CALLS` was declared but never enforced — "verified savings" could be computed from a single trace. Realized cost/savings now stay null until ≥20 post-switch traces ran the new model.
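
The gate above might look roughly like this. The threshold of 20 and the constant name come from this entry; the trace shape and function name are hypothetical.

```typescript
// Threshold from the changelog entry; the Trace shape and function name are
// illustrative assumptions, not the application's actual types.
const MIN_REALIZED_CALLS = 20;

interface PostSwitchTrace {
  costUsd: number;
}

// Returns null (rather than a noisy estimate) until enough post-switch
// traces on the new model have been observed.
function realizedCostPerCall(traces: PostSwitchTrace[]): number | null {
  if (traces.length < MIN_REALIZED_CALLS) return null;
  const total = traces.reduce((sum, t) => sum + t.costUsd, 0);
  return total / traces.length;
}
```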

Verification never checked that the switch actually happened. Realized cost summed ALL post-switch traces regardless of model, so a customer who never shipped the change could still "verify" savings. Post-switch traces are now matched to the new model (exact name or dated-snapshot suffix), and a switch_detected flag records whether the majority of traffic moved. When it didn't, savings are withheld and the Opportunities feed surfaces a "switch not detected in production" warning with the on-hold dollar amount.
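
One way the model matching and majority check could be sketched. The accepted suffix formats (`-YYYY-MM-DD` or `-YYYYMMDD`), the strict "more than half" reading of majority, and both function names are assumptions; the changelog does not spell them out.

```typescript
// Assumption: a dated snapshot is the target name plus "-YYYY-MM-DD" or
// "-YYYYMMDD". The real matcher may accept other formats.
function matchesModel(traceModel: string, target: string): boolean {
  if (traceModel === target) return true;
  if (!traceModel.startsWith(target + "-")) return false;
  const suffix = traceModel.slice(target.length + 1);
  return /^(\d{4}-\d{2}-\d{2}|\d{8})$/.test(suffix);
}

// Assumption: "majority" means strictly more than half of post-switch traces.
function switchDetected(traceModels: string[], target: string): boolean {
  if (traceModels.length === 0) return false;
  const matched = traceModels.filter((m) => matchesModel(m, target)).length;
  return matched > traceModels.length / 2;
}
```

Validating the suffix, rather than accepting any prefix match, keeps a target such as `gpt-4o` from matching a different model like `gpt-4o-mini`.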

The confidence interval ignored pairing and clustering. Baseline and candidate run the SAME test cases (paired) with correlated repetitions (clustered), but the CI bootstrapped individual results independently. Ignoring the pairing threw away statistical power, and treating correlated repetitions as independent made the interval too narrow (anti-conservative). The recommendation engine and shadow evals now use a paired cluster bootstrap over shared test cases (pairedClusterBootstrapCI in @reasonrank/core). Confidence requires ≥5 shared cases, so repetitions can no longer manufacture sample size, and the evidence payload records the method used.
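
To illustrate the technique, here is a sketch of a paired cluster bootstrap with a percentile interval. It is not the code of pairedClusterBootstrapCI; the function name, signature, seeded PRNG, iteration count, and percentile method are all assumptions. Only the ≥5 shared-case floor comes from this entry.

```typescript
type ScoresByCase = Record<string, number[]>; // case id -> scores across repetitions

const MIN_SHARED_CASES = 5; // floor stated in the changelog

function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedClusterBootstrapSketch(
  baseline: ScoresByCase,
  candidate: ScoresByCase,
  iterations = 2000,
  alpha = 0.05,
  rng: () => number = mulberry32(42),
): { lower: number; upper: number; sharedCases: number } | null {
  // Pairing: one difference per shared case, after averaging that case's
  // repetitions. Extra repetitions refine a case; they never add cases.
  const diffs: number[] = [];
  for (const caseId of Object.keys(baseline)) {
    const base = baseline[caseId];
    const cand = candidate[caseId];
    if (!base || !cand || base.length === 0 || cand.length === 0) continue;
    diffs.push(mean(cand) - mean(base));
  }
  if (diffs.length < MIN_SHARED_CASES) return null;

  // Clustering: resample whole cases with replacement, never single results.
  const stats: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    for (let j = 0; j < diffs.length; j++) {
      sum += diffs[Math.floor(rng() * diffs.length)];
    }
    stats.push(sum / diffs.length);
  }
  stats.sort((x, y) => x - y);
  const lower = stats[Math.floor((alpha / 2) * (iterations - 1))];
  const upper = stats[Math.ceil((1 - alpha / 2) * (iterations - 1))];
  return { lower, upper, sharedCases: diffs.length };
}
```

Because the resampling unit is the test case, three cases run ten times each still count as three observations and return null here.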

Stability score now uses the sample standard deviation (Bessel's correction); the population formula understated variance at 2–10 reps.
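
For reference, the difference is the divisor: the sample formula divides the sum of squared deviations by n − 1 instead of n. A small sketch follows; the function name is illustrative, and how the stability score is derived from this value is not shown here.

```typescript
// Sample standard deviation with Bessel's correction (divide by n - 1).
// Returns null for fewer than two values, where it is undefined.
function sampleStdDev(xs: number[]): number | null {
  const n = xs.length;
  if (n < 2) return null;
  const m = xs.reduce((s, x) => s + x, 0) / n;
  const sumSq = xs.reduce((s, x) => s + (x - m) * (x - m), 0);
  return Math.sqrt(sumSq / (n - 1));
}
```

With two repetitions scoring 1 and 3, the population formula gives 1 while the sample formula gives √2 ≈ 1.41, which shows how large the understatement is at low repetition counts.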

computeModelMetrics now exposes scoresByCase so every consumer can run paired inference. Verified switches keep refreshing realized cost after verification, so the savings run-rate tracks live traffic instead of freezing.

Recommendations are suppressed (with a log) when either model's cost was computed from the generic fallback price — a savings projection built on a made-up price is never shown.
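
A sketch of the suppression rule, assuming each costed model carries a flag saying whether the generic fallback price was used. The `usedFallbackPrice` field, the type, and the function name are hypothetical.

```typescript
// Hypothetical shape: the changelog does not say how fallback pricing is tracked.
interface CostedModel {
  model: string;
  costPerCall: number;
  usedFallbackPrice: boolean;
}

// Suppress (and log) when either side of the comparison was priced with the
// generic fallback, since the projected savings would rest on a guessed price.
function shouldSuppressRecommendation(
  baseline: CostedModel,
  candidate: CostedModel,
  log: (message: string) => void = () => {},
): boolean {
  const offenders = [baseline, candidate].filter((m) => m.usedFallbackPrice);
  if (offenders.length === 0) return false;
  log("recommendation suppressed: fallback price used for " + offenders.map((m) => m.model).join(", "));
  return true;
}
```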

Gateway is self-hosted, single-tenant only. Removed the shared gateway.reasonrank.ai pilot option from the deployment docs: a shared gateway would mix every tenant's traces into one workspace (single org-scoped PAT) and put customers' raw provider keys in transit through ReasonRank-operated hosts. Docs now require the gateway container to run in the customer's own infrastructure.

Provider-credential ciphertexts are key-versioned (v1. prefix; legacy format still decrypts) so CREDENTIALS_ENCRYPTION_KEY can be rotated via CREDENTIALS_ENCRYPTION_KEY_V1 pinning instead of being frozen forever.
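
Reading the version prefix might be sketched as below. Only the `v1.` prefix and the existence of a legacy, unprefixed format come from this entry; the general `v<N>.` pattern, the return shape, and the function name are assumptions. Key lookup and decryption are deliberately left out.

```typescript
interface ParsedCiphertext {
  keyVersion: number | null; // null means the legacy, unversioned format
  payload: string;
}

// Split a stored value into key version and payload. Anything without a
// "v<N>." prefix is treated as legacy so old rows keep decrypting.
function parseStoredCiphertext(stored: string): ParsedCiphertext {
  const match = /^v(\d+)\.([\s\S]+)$/.exec(stored);
  if (match === null) return { keyVersion: null, payload: stored };
  return { keyVersion: Number(match[1]), payload: match[2] };
}
```

A caller would then choose the decryption key from `keyVersion`, which is what lets an old key stay pinned while the primary key rotates.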

Middleware public-route matching is segment-bounded: /pricing no longer makes /pricing-anything public.
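
Segment-bounded matching can be sketched as: a path is public only if it equals a public prefix or continues it after a `/`. The function name and the list-of-prefixes input are illustrative, not the middleware's actual API.

```typescript
// A route matches a public prefix only on a path-segment boundary.
function isPublicRoute(pathname: string, publicPrefixes: string[]): boolean {
  return publicPrefixes.some(
    (prefix) => pathname === prefix || pathname.startsWith(prefix + "/"),
  );
}
```

A plain `pathname.startsWith(prefix)` check is the bug this entry describes: it treats `/pricing-anything` as public because the string, not the segment, matches.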

README plan table now matches src/lib/plans.ts (Free / Team $299 / Scale $999 / Enterprise, spend-under-management bands) — it previously described a retired Starter/Pro pricing model.

README leads with Cost per Success as the headline metric and documents the verified-savings guarantees.

The beyond-MVP release that turns ReasonRank from an eval harness that delivers *insight* into a cost-optimization platform that delivers an *actionable lever*.

Spend guardrails & transparency. Pre-flight run estimator (exact call count + tokenizer-based token/$ range) with confirm-above-threshold and an optional single-case calibration dry run. Per-org org_eval_settings (monthly budget, per-run cap, max output-token ceiling) enforced in runs.start. Live running cost on the run detail page, an org-wide "pause all runs" kill switch, a ReasonRank-attributed monthly spend meter, and 50/80/100% budget email alerts.
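
The 50/80/100% alerts can be modeled as threshold crossings between two spend readings, so each alert fires once. This is a sketch under that assumption; the function name and signature are hypothetical, and how the application deduplicates alerts is not described here.

```typescript
// Alert levels from the changelog, as fractions of the monthly budget.
const ALERT_THRESHOLDS = [0.5, 0.8, 1.0];

// Returns the thresholds crossed between the previous and current spend
// readings. A threshold already passed at prevSpend is not reported again.
function newlyCrossedThresholds(prevSpend: number, currSpend: number, monthlyBudget: number): number[] {
  if (monthlyBudget <= 0) return [];
  return ALERT_THRESHOLDS.filter(
    (t) => prevSpend < t * monthlyBudget && currSpend >= t * monthlyBudget,
  );
}
```

A single large run can jump past several levels at once, which is why the function returns a list rather than one value.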

Agents, trace ingestion, workflows. "Task" reframed as "Agent" with a production model + real monthly volume. POST /api/ingest (ingestion-key auth, metadata-only by default, opt-in sampled payloads) feeds real traffic; Workflows roll up ordered agents into combined cost/quality.

Savings recommendation engine. Finds the cheapest model that holds quality and projects monthly dollar savings against real volume. Per-agent recommendation cards (Apply/Dismiss) and a global Savings dashboard.
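
In outline, the selection and projection could look like the sketch below. The quality test here is a simple tolerance on a point estimate, chosen purely for illustration; all names and the `maxQualityDrop` parameter are assumptions.

```typescript
// Hypothetical per-model summary; field names are illustrative.
interface ModelResult {
  model: string;
  costPerCall: number;
  quality: number;
}

// Pick the cheapest candidate that is cheaper than the current model and
// whose quality stays within maxQualityDrop of it, then project monthly
// savings against real call volume.
function recommendCheapest(
  current: ModelResult,
  candidates: ModelResult[],
  monthlyCalls: number,
  maxQualityDrop = 0,
): { model: string; monthlySavings: number } | null {
  const viable = candidates
    .filter((c) => c.costPerCall < current.costPerCall)
    .filter((c) => c.quality >= current.quality - maxQualityDrop)
    .sort((a, b) => a.costPerCall - b.costPerCall);
  const best = viable[0];
  if (!best) return null;
  return {
    model: best.model,
    monthlySavings: (current.costPerCall - best.costPerCall) * monthlyCalls,
  };
}
```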

Google Gemini provider end-to-end; refreshed price tables; judge cost now included in aggregate metrics.

Private beta gating. Invite-only signup via email allowlist or invite codes (managed in Admin → Private beta) with an optional comp plan for invited orgs.

Compliance. Self-serve workspace deletion (cascades to all tenant data) and a trace retention/redaction policy swept by the cron job.

The run-processing kick now targets the specific run, so freshly started runs aren't stuck behind older queued work.

New [docs/ingestion-quickstart.md](docs/ingestion-quickstart.md) and a private-beta + retention section in [docs/launch-runbook.md](docs/launch-runbook.md). Refreshed SETUP.md (removed saas-starter/Clerk leftovers).