Changelog

What shipped.

All notable changes to the ReasonRank application. Dates are release dates;
@boydco/* platform package bumps are omitted unless they change app behavior.

Unreleased — Money-math integrity + launch hardening

Statistical and security hardening pass on the numbers customers act on. The
verified-savings pipeline now compares production to production, refuses to
report savings it can't prove, and uses the statistically correct test for
paired eval data.

Fixed — statistics (the four money-math flaws)

  • Realized savings baseline was apples-to-oranges. Realized savings compared an EVAL-SUITE cost-per-call baseline against PRODUCTION traces. Applying a recommendation now snapshots the agent's pre-switch trailing-30d production cost-per-call (model_switches.baseline_trace_cost_per_call, migration 0014) and realized savings use it — production vs production. The eval baseline remains only as a legacy fallback.
  • `MIN_REALIZED_CALLS` was declared but never enforced, so "verified savings" could be computed from a single trace. Realized cost and savings now stay null until ≥20 post-switch traces have run the new model.
  • Verification never checked the switch actually happened. Realized cost summed ALL post-switch traces regardless of model; a customer who never shipped the change could still "verify" savings. Post-switch traces are now matched to the new model (exact name or dated-snapshot suffix), a switch_detected flag records whether the majority of traffic moved, savings are withheld when it didn't, and the Opportunities feed surfaces a "switch not detected in production" warning with the on-hold dollar amount.
  • The confidence interval ignored pairing and clustering. Baseline and candidate run the SAME test cases (paired) with correlated repetitions (clustered), but the CI bootstrapped individual results independently. That discarded the variance reduction pairing provides (under-powered) and treated correlated repetitions as independent samples (anti-conservative). The recommendation engine and shadow evals now use a paired cluster bootstrap over shared test cases (pairedClusterBootstrapCI in @reasonrank/core); confidence requires ≥5 shared cases, so repetitions can no longer manufacture sample size, and the evidence payload records the method used.
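
As an illustration of the paired cluster bootstrap described in the last item above, here is a minimal TypeScript sketch. It is not the pairedClusterBootstrapCI implementation from @reasonrank/core: the function name, the `ScoresByCase` shape, the option names, and the defaults (2,000 resamples, a 95% percentile interval) are assumptions made for the example. Only the ≥5 shared-case floor comes from the entry above. The sketch collapses each case's repetitions to a mean before resampling, which is one simple way to treat the test case as the cluster.

```typescript
// Hypothetical shape: test-case id -> scores from that case's repetitions.
type ScoresByCase = Record<string, number[]>;

interface PairedCI {
  meanDiff: number;    // mean of per-case (candidate - baseline) differences
  lower: number;       // lower percentile bound
  upper: number;       // upper percentile bound
  sharedCases: number; // number of clusters actually resampled
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

function pairedClusterBootstrapSketch(
  baseline: ScoresByCase,
  candidate: ScoresByCase,
  opts: { resamples?: number; alpha?: number; minSharedCases?: number; rng?: () => number } = {},
): PairedCI | null {
  const { resamples = 2000, alpha = 0.05, minSharedCases = 5, rng = Math.random } = opts;

  // Pairing: keep only the test cases that both models ran.
  const shared = Object.keys(baseline).filter(
    (id) => (baseline[id] ?? []).length > 0 && (candidate[id] ?? []).length > 0,
  );
  // Sample size is the number of shared cases, not the number of repetitions.
  if (shared.length < minSharedCases) return null;

  // Clustering: collapse each case's repetitions to one paired difference.
  const diffs = shared.map(
    (id) => mean(candidate[id] ?? []) - mean(baseline[id] ?? []),
  );

  // Resample whole cases with replacement; both arms move together.
  const boot: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rng() * diffs.length)];
    }
    boot.push(sum / diffs.length);
  }
  boot.sort((x, y) => x - y);

  // Percentile interval read off the sorted bootstrap means.
  const at = (q: number): number =>
    boot[Math.min(boot.length - 1, Math.max(0, Math.floor(q * boot.length)))];

  return {
    meanDiff: mean(diffs),
    lower: at(alpha / 2),
    upper: at(1 - alpha / 2),
    sharedCases: shared.length,
  };
}
```

With this shape, extra repetitions sharpen each case's mean but never add resampled units: two cases with 50 repetitions each still return null.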

Fixed — statistics (supporting)

  • Stability score now uses the sample standard deviation (Bessel's correction); the population formula understated variance at 2–10 reps.
  • computeModelMetrics exposes scoresByCase so every consumer can run paired inference.
  • Verified switches keep refreshing realized cost after verification, so the savings run-rate tracks live traffic instead of freezing.
  • Recommendations are suppressed (with a log) when either model's cost was computed from the generic fallback price — a savings projection built on a made-up price is never shown.
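
To make the Bessel's-correction item above concrete, the sketch below compares the two formulas. The function names are hypothetical, and how the stability score is derived from the standard deviation is not shown because the changelog does not specify it. The population standard deviation is smaller than the sample one by a factor of sqrt((n-1)/n): about 29% low at n = 2 and about 5% low at n = 10.

```typescript
// Sample standard deviation: divides by n - 1 (Bessel's correction).
// Returns null for fewer than two scores, where spread is undefined.
function sampleStdDev(scores: number[]): number | null {
  const n = scores.length;
  if (n < 2) return null;
  const avg = scores.reduce((a, b) => a + b, 0) / n;
  const sumSq = scores.reduce((acc, x) => acc + (x - avg) * (x - avg), 0);
  return Math.sqrt(sumSq / (n - 1));
}

// Population standard deviation: divides by n. Shown only for comparison;
// it understates spread when the scores are a small sample of repetitions.
function populationStdDev(scores: number[]): number | null {
  const n = scores.length;
  if (n < 1) return null;
  const avg = scores.reduce((a, b) => a + b, 0) / n;
  const sumSq = scores.reduce((acc, x) => acc + (x - avg) * (x - avg), 0);
  return Math.sqrt(sumSq / n);
}
```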

Security

  • Gateway is self-hosted, single-tenant only. Removed the shared gateway.reasonrank.ai pilot option from the deployment docs: a shared gateway would mix every tenant's traces into one workspace (single org-scoped PAT) and put customers' raw provider keys in transit through ReasonRank-operated hosts. Docs now require the gateway container to run in the customer's own infrastructure.
  • Provider-credential ciphertexts are key-versioned (v1. prefix; legacy format still decrypts) so CREDENTIALS_ENCRYPTION_KEY can be rotated via CREDENTIALS_ENCRYPTION_KEY_V1 pinning instead of being frozen forever.
  • Middleware public-route matching is segment-bounded: /pricing no longer makes /pricing-anything public.
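
The segment-bounded matching in the last item above can be sketched as follows. This is an illustrative helper, not the actual middleware code: the function name and the route list are hypothetical, and it assumes that sub-paths of a public route (such as /pricing/enterprise) are meant to stay public. The key point is that a route matches only when the path equals it or continues with a "/", so a bare prefix match is never enough.

```typescript
// Hypothetical list of public routes, for illustration only.
const PUBLIC_ROUTES: readonly string[] = ["/pricing", "/login"];

// A path is public when it equals a public route exactly or extends it
// at a segment boundary ("/pricing/..."). A plain startsWith("/pricing")
// would also have accepted "/pricing-anything".
function isPublicPath(
  pathname: string,
  publicRoutes: readonly string[] = PUBLIC_ROUTES,
): boolean {
  return publicRoutes.some(
    (route) => pathname === route || pathname.startsWith(route + "/"),
  );
}
```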

Docs

  • README plan table now matches src/lib/plans.ts (Free / Team $299 / Scale $999 / Enterprise, spend-under-management bands) — it previously described a retired Starter/Pro pricing model.
  • README leads with Cost per Success as the headline metric and documents the verified-savings guarantees.

Previous — Private beta

The beyond-MVP release that turns ReasonRank from an eval harness that delivers
*insight* into a cost-optimization platform that delivers an *actionable lever*.

Added

  • Spend guardrails & transparency. Pre-flight run estimator (exact call count + tokenizer-based token/$ range) with confirm-above-threshold and an optional single-case calibration dry run. Per-org org_eval_settings (monthly budget, per-run cap, max output-token ceiling) enforced in runs.start. Live running cost on the run detail page, an org-wide "pause all runs" kill switch, a ReasonRank-attributed monthly spend meter, and 50/80/100% budget email alerts.
  • Agents, trace ingestion, workflows. "Task" reframed as "Agent" with a production model + real monthly volume. POST /api/ingest (ingestion-key auth, metadata-only by default, opt-in sampled payloads) feeds real traffic; Workflows roll up ordered agents into combined cost/quality.
  • Savings recommendation engine. Finds the cheapest model that holds quality and projects monthly dollar savings against real volume. Per-agent recommendation cards (Apply/Dismiss) and a global Savings dashboard.
  • Google Gemini provider end-to-end; refreshed price tables; judge cost now included in aggregate metrics.
  • Private beta gating. Invite-only signup via email allowlist or invite codes (managed in Admin → Private beta) with an optional comp plan for invited orgs.
  • Compliance. Self-serve workspace deletion (cascades to all tenant data) and a trace retention/redaction policy swept by the cron job.

Changed

  • Aggregate metrics sum both call cost and judge cost for true spend.
  • Run-processing kick targets the specific run so freshly-started runs aren't stuck behind older queued work.

Fixed

  • kickRunProcessing(runId) now forwards the run id to the cron route.
  • Added a no-double-bill regression test for the executor's resume-skip filter.

Docs

  • New [docs/ingestion-quickstart.md](docs/ingestion-quickstart.md) and a private-beta + retention section in [docs/launch-runbook.md](docs/launch-runbook.md). Refreshed SETUP.md (removed saas-starter/Clerk leftovers).