Why we use a paired cluster bootstrap to recommend model switches

ReasonRank Engineering · 2026-07-02

Recommending a model switch is a statistical claim: the cheaper model is not meaningfully worse than the one you run today. Get the statistics wrong and the recommendation engine becomes a machine for confidently breaking production. This post is the reasoning behind the test we actually run.

The naive test, and the two ways it lies

Suppose you score two models on 20 test cases with 3 repetitions each. You now have 60 scores per model. The tempting move is a t-test (or a flat bootstrap) over those 60-vs-60 samples.

That test lies to you twice.

Lie #1: repetitions are not independent samples. Three runs of the same prompt against the same model are highly correlated: they share the case's difficulty, its ambiguity, its rubric quirks. Treating 20 cases × 3 reps as n=60 instead of n≈20 narrows your confidence interval by a factor of up to √3 (about 1.7) for free. The interval looks tight because you counted the same evidence three times.
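To put a number on that, here is a minimal sketch using the standard design-effect formula for equal-sized clusters (often attributed to Kish). The function names and the `rho` input (the correlation between repetitions of the same case) are assumptions for illustration, not something our pipeline exposes.

```typescript
// Design effect for equal-sized clusters: deff = 1 + (m - 1) * rho,
// where m is repetitions per case and rho is the within-case correlation.
// rho is an assumed input here; in practice it would have to be estimated.
function effectiveSampleSize(cases: number, repsPerCase: number, rho: number): number {
  const designEffect = 1 + (repsPerCase - 1) * rho;
  return (cases * repsPerCase) / designEffect;
}

// Factor by which a naive "every score is independent" interval is too narrow.
function naiveIntervalShrinkage(repsPerCase: number, rho: number): number {
  return Math.sqrt(1 + (repsPerCase - 1) * rho);
}

console.log(effectiveSampleSize(20, 3, 1)); // 20: perfectly correlated reps add no evidence
console.log(effectiveSampleSize(20, 3, 0)); // 60: only fully independent reps justify n=60
console.log(naiveIntervalShrinkage(3, 1).toFixed(2)); // "1.73", i.e. √3
```

Real repetitions sit somewhere between those two extremes, which is why √3 is the worst case for 3 reps rather than a constant.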

Lie #2: unpaired comparisons throw away your design. Both models scored the same cases. Case-to-case difficulty variance — usually the biggest variance component — cancels out if you compare within cases and is pure noise if you don't.
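A small worked example makes the gap concrete. The scores below are made up for illustration: five cases whose difficulty varies a lot, with the candidate trailing the baseline by one to three points on each. The sketch compares the standard error of the difference computed two ways.

```typescript
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

// Illustrative per-case mean scores (invented numbers, not real results).
const perCase = [
  { baseline: 0.90, candidate: 0.88 },
  { baseline: 0.50, candidate: 0.49 },
  { baseline: 0.70, candidate: 0.67 },
  { baseline: 0.30, candidate: 0.29 },
  { baseline: 0.80, candidate: 0.77 },
];
const caseCount = perCase.length;

// Unpaired: treat the two columns as unrelated samples.
// Case-to-case difficulty lands in both variances and dominates the result.
const unpairedSE = Math.sqrt(
  sampleVariance(perCase.map((c) => c.baseline)) / caseCount +
    sampleVariance(perCase.map((c) => c.candidate)) / caseCount,
);

// Paired: difference within each case first, so difficulty cancels out.
const pairedDeltas = perCase.map((c) => c.candidate - c.baseline);
const pairedSE = Math.sqrt(sampleVariance(pairedDeltas) / caseCount);

console.log(unpairedSE.toFixed(3)); // about 0.150
console.log(pairedSE.toFixed(4)); // about 0.0045
```

Same data, same mean difference of about −0.02, but the unpaired standard error is roughly thirty times larger, because it is mostly measuring how different the cases are from each other.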

What we run instead

For each test case both models scored, compute the mean score per model within that case, and take the per-case difference. Then bootstrap over cases — resampling whole cases with replacement, keeping each case's repetitions together — and read the confidence interval off the distribution of the mean difference:

// per-case paired deltas; repetitions stay inside their case (cluster)
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;
const deltas = sharedCases.map(
  (c) => mean(candidate.scoresByCase[c]) - mean(baseline.scoresByCase[c])
);
// resample CASES with replacement, 2000 iterations → CI on mean(deltas)

The cluster is the test case. Resampling clusters — not individual scores — means correlated repetitions can never masquerade as independent evidence.
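For readers who want the whole loop in one place, here is a self-contained sketch of the procedure described above. It is not our production code: the type names, the `pairedClusterBootstrap` signature, the seeded `mulberry32` generator, and the simple percentile interval are all assumptions made for illustration.

```typescript
type ScoresByCase = Record<string, number[]>;

interface BootstrapResult {
  meanDelta: number; // candidate minus baseline, averaged over cases
  lower: number;
  upper: number;
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// mulberry32: a small seeded PRNG so a run is reproducible (Math.random also works).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedClusterBootstrap(
  candidate: ScoresByCase,
  baseline: ScoresByCase,
  iterations = 2000,
  confidence = 0.95,
  random: () => number = Math.random,
): BootstrapResult {
  if (iterations < 1) throw new Error("iterations must be at least 1");

  // One paired delta per case that BOTH models scored.
  const deltas: number[] = [];
  for (const caseId of Object.keys(candidate)) {
    const cand = candidate[caseId];
    const base = baseline[caseId];
    if (cand === undefined || base === undefined) continue;
    if (cand.length === 0 || base.length === 0) continue;
    deltas.push(mean(cand) - mean(base));
  }
  if (deltas.length === 0) throw new Error("no shared cases to compare");

  // Resample whole cases with replacement. Each delta already summarizes all
  // of its case's repetitions, so drawing a delta draws the entire cluster.
  const bootMeans: number[] = [];
  for (let iter = 0; iter < iterations; iter++) {
    let sum = 0;
    for (let draw = 0; draw < deltas.length; draw++) {
      sum += deltas[Math.floor(random() * deltas.length)] as number;
    }
    bootMeans.push(sum / deltas.length);
  }
  bootMeans.sort((x, y) => x - y);

  // Simple percentile interval (an assumption; other bootstrap intervals exist).
  const tail = (1 - confidence) / 2;
  const quantile = (p: number): number => {
    const idx = Math.min(bootMeans.length - 1, Math.max(0, Math.floor(p * bootMeans.length)));
    return bootMeans[idx] as number;
  };
  return { meanDelta: mean(deltas), lower: quantile(tail), upper: quantile(1 - tail) };
}

// Usage with invented scores: 4 cases × 3 repetitions per model.
const candidateScores: ScoresByCase = {
  refund: [0.80, 0.82, 0.81], summary: [0.60, 0.58, 0.62],
  sql: [0.90, 0.91, 0.89], triage: [0.40, 0.42, 0.41],
};
const baselineScores: ScoresByCase = {
  refund: [0.82, 0.83, 0.81], summary: [0.61, 0.60, 0.62],
  sql: [0.90, 0.92, 0.91], triage: [0.43, 0.42, 0.44],
};
const result = pairedClusterBootstrap(candidateScores, baselineScores, 2000, 0.95, mulberry32(42));
console.log(result);
```

Because the statistic is the mean of per-case deltas, resampling the deltas is equivalent to resampling cases and recomputing each case's means. With only four cases the interval will be wide, which is the point: the width reflects how many cases you have, not how many scores.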

The decision rule is non-inferiority, not "who won"

We don't ask "is the cheap model better?" We ask: can we rule out that it's more than 2% worse? A switch is only recommended when the entire 95% confidence interval on the quality delta sits above the −2% tolerance line, that is, when the interval's lower bound is greater than −2%. If the interval straddles the line, the recommendation is flagged unproven and excluded from every headline dollar figure, and the UI tells you exactly how many more test cases you need for a confident call.
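The rule is short enough to write down. In this sketch the delta is candidate minus baseline, expressed as a fraction of the score scale (so 2% is 0.02), and the names are hypothetical. The third outcome, for an interval that sits entirely below the line, is not spelled out above; treating it as "inferior" is an assumption.

```typescript
type Verdict = "recommend" | "unproven" | "inferior";

interface DeltaInterval {
  lower: number; // lower bound of the CI on (candidate - baseline)
  upper: number; // upper bound of the same CI
}

// tolerance is the non-inferiority margin as a positive number (0.02 = 2%).
function nonInferiorityVerdict(ci: DeltaInterval, tolerance = 0.02): Verdict {
  if (ci.lower > -tolerance) return "recommend"; // whole interval clears the line
  if (ci.upper < -tolerance) return "inferior"; // assumption: whole interval is below it
  return "unproven"; // interval straddles the line
}

console.log(nonInferiorityVerdict({ lower: -0.012, upper: 0.004 })); // "recommend"
console.log(nonInferiorityVerdict({ lower: -0.035, upper: 0.010 })); // "unproven"
console.log(nonInferiorityVerdict({ lower: -0.080, upper: -0.030 })); // "inferior"
```

Note that the first interval includes zero and negative values. The candidate is not shown to be better, only shown to be no more than 2% worse, which is the claim a switch recommendation actually needs.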

Why not just run more repetitions?

Because repetitions buy you precision about the cases you already have, not about your workload. Ten reps of 5 cases tell you a lot about those 5 prompts and almost nothing about the sixth one your users will send tomorrow. Cases are the unit of generalization, which is exactly why they're the unit of resampling. When the data is thin, the honest output is "add 3 more test cases," not a narrower interval.
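The arithmetic behind that claim can be sketched with a simple two-level variance model: the variance of the mean delta is the between-case variance divided by the number of cases, plus the within-case (repetition) variance divided by cases × reps. The model and the two standard deviations below are assumptions chosen for illustration, not measurements from our data.

```typescript
// Standard error of the mean per-case delta under a two-level model:
//   Var = betweenCaseSd^2 / cases + withinCaseSd^2 / (cases * reps)
// More reps only shrink the second term; the first is a floor set by case count.
function standardErrorOfMeanDelta(
  cases: number,
  reps: number,
  betweenCaseSd: number,
  withinCaseSd: number,
): number {
  return Math.sqrt(betweenCaseSd ** 2 / cases + withinCaseSd ** 2 / (cases * reps));
}

// Assumed spreads: 0.05 between cases, 0.05 between repetitions of one case.
console.log(standardErrorOfMeanDelta(5, 3, 0.05, 0.05).toFixed(4)); // "0.0258"
console.log(standardErrorOfMeanDelta(5, 10, 0.05, 0.05).toFixed(4)); // "0.0235"
console.log(standardErrorOfMeanDelta(5, 1000, 0.05, 0.05).toFixed(4)); // "0.0224"
console.log(standardErrorOfMeanDelta(8, 3, 0.05, 0.05).toFixed(4)); // "0.0204"
```

Under these assumptions, going from 3 reps to 1000 reps on 5 cases helps less than adding 3 more cases at the original 3 reps.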

If you want to see the test run against your own workload, the free cost calculator will tell you what a switch is worth at list prices — and a trial workspace will tell you whether quality actually holds.