Full technical report

Quantifying LLM Cost Savings from Cache-Aware Inference Arbitrage

A Statistical Report

In this benchmark, the same LLM session could cost up to 4x as much depending on the provider used. The gap comes from token pricing, provider prompt-caching behavior, and users’ request patterns. When those factors diverge, a cache-aware cost-arbitrage layer can reduce spend. The question is how much it saves and how reliably.

We benchmarked Auriko's cache-aware cost-arbitrage engine against five inference providers and an inference routing peer. The benchmark sent over 80,000 API requests to 37 models across 3 workload types. These requests form over 22,000 sessions. Auriko reduced dollar-weighted spend against six tested baselines: 32.8% against the routing peer (95% CI [30.6%, 34.9%]) and 7.7–38.3% across five single-provider baselines.

This report presents the benchmark results, explains the cost drivers behind the savings, and documents the methodology, statistical tests, and robustness checks.


The Opportunity: Cost Dispersion Across Providers

Inference cost for the same requested model varies across providers. For some models we tested, the most expensive path costs 4x the cheapest for an identical request.

We compared 11 models across providers. For each model, we divided each provider's average cost by the lowest-cost provider's average cost and multiplied by 100. A score of 100 means cheapest. A score of 300 means three times the cost for the same request.

Figure 1: The same requested model costs up to 4× more on different providers

The variation appears across models in the study. The pattern holds across workload types. Coding agent workloads show the widest gap; single-turn and conversation workloads show comparable spreads (details in Robustness and Sensitivity).

Figure 2: Cost dispersion persists across all workload types

This cost dispersion means that within the tested provider set, a cheapest provider choice exists for each measured workload. The rest of this report presents a statistical study of the cost reduction observed for Auriko's cache-aware routing approach. The next section describes how we conducted the study.


Experimental Design

This study adopts a matched-pairs design. For each model and workload combination, every call to Auriko and each comparator uses identical prompts. Request parameters — temperature, top-p, frequency penalty, presence penalty, and maximum output length — are held constant across all calls. Both sides are dispatched concurrently to control for time-of-day effects. Each combination is repeated multiple times. If either side of a pair errors, both sides are excluded before analysis.

Coverage

DimensionCoverage
Comparators1 routing peer + 5 inference providers
Models37 models
WorkloadsSingle-turn, multi-turn conversation, and multi-step coding agent sessions
Requests made80,634 API requests
LLM sessions22,416 sessions
Requested paired sessions10,752
Clean paired target comparisons9,594
Benchmark window2026-05-28 to 2026-06-07 UTC

Requests made counts individual API calls. LLM sessions group related calls into workload sessions, including multi-turn conversations and multi-step agent runs. Requested paired sessions are target-level session pairings before error exclusion; clean paired target comparisons are the same pairings after symmetric exclusion when either side errored.

Metric

Cost is the dollar amount returned by each API response. For session-level analysis, request costs are summed within each workload session. For the routing peer, cost is adjusted for their platform fee, matching what a user of that service effectively pays:

adjusted cost = API-returned cost × (1 + platform fee %)

Since Auriko charges no platform fees, no cost adjustment is needed.

Statistical Methods

MethodPurpose
Stratified bootstrap (95% CI, 5,000 replicates, model + scenario strata)Confidence intervals
Sign test (one-sided)Directional evidence at the session level
Wilcoxon signed-rank test (one-sided)Paired magnitude test
Holm–Bonferroni correctionMultiple-comparison control on model-level tests
Effect size (Hedges' g)Standardized magnitude measure
Win rate inference (binomial + Wilson CI)Session-level directional evidence

Validity Controls

ControlWhat it checks
Symmetric exclusionSessions with errors on either side are removed from both sides, preventing one-sided inflation
Output-volume normalizationOutput costs scaled to match between sides
Output-volume filteringRestricted to sessions where both sides produced similar output (±10%)
Error-inclusion testRe-including error sessions to test whether exclusion materially changes the result

The Evidence: Cost Reduction

The dispersion section showed that the same requested model can cost materially different amounts on different providers. This section presents Auriko's realized cost savings against each comparator target.

Primary Findings

Auriko reduced dollar-weighted spend against six comparator targets, with target-level savings ranging from 7.7% to 38.3%.

The chart below shows the distribution of session-level cost differences for each comparator target, together with the percentage of wins, ties, and losses — where a win is a session where Auriko costs less than the comparator. Auriko costs less in 60–90% of non-tie sessions across six targets.

Figure 3: Session savings distribution and win/tie/loss

The chart below shows session win rates with 95% confidence intervals, ties excluded. Auriko costs less in the majority of non-tie sessions against each comparator target, with confidence intervals above the 50% even-odds line.

Figure 4: Auriko cost less in 61–90% of non-tie sessions

For each comparator target, the chart below shows dollar-weighted savings with 95% confidence intervals. Savings range from 7.7% (Model provider E) to 38.3% (Model provider A). Confidence intervals are computed via stratified bootstrap.

Figure 5: Savings were positive against all tested baselines

These panels show how cost accumulates over a multi-turn conversation. Each line is the average cumulative cost across models tested against that comparator. The gap between the lines is the savings, and it widens with each turn, a pattern consistent with routing choices and reusable prompt context becoming more valuable over successive turns. This is conversation-scenario data only; single-turn workloads have no turn progression to show.

Figure 6: Cumulative cost gaps widened over conversation turns

The per-model breakdown below shows the cumulative-cost gap is also visible at the individual-model level.

Figure 7: Most models ended with lower Auriko cost

Cost Reduction Across Workloads

Each cell represents one comparator target paired with one workload scenario, annotated with the dollar-weighted savings percentage for that pair. Color encodes magnitude. Cost reduction is positive across workload types and comparator targets.

For most targets, multi-turn workloads show larger cost reduction than single-turn requests, consistent with reusable prompt context becoming more valuable over successive turns.

Figure 8: Savings were positive in all 18 workload-target cells

Cost Reduction Across Models

A common objection to average savings is that one model with a large cost difference could drive the entire result. This subsection tests whether savings are broad across the tested model set.

TargetModels testedPositive savingsHolm-significant
Routing peer151512
Model provider A121111
Model provider B322522
Model provider C755
Model provider D211
Model provider E32216
Model provider cohort373225

Holm-significant models survived Holm–Bonferroni multiple-comparison correction, a conservative family-wise correction across all model-level tests within the comparator group.

Figure 9: Most tested models had positive dollar-weighted savings

Each dot is one model's dollar-weighted savings with a 95% bootstrap confidence interval. Blue markers indicate models that survived Holm–Bonferroni correction. Models near or below zero have wide confidence intervals reflecting small sample sizes.

Statistical Tests

TargetSign test pWilcoxon p
Routing peer< 0.001< 0.001
Model provider A< 0.001< 0.001
Model provider B< 0.001< 0.001
Model provider C< 0.001< 0.001
Model provider D< 0.001< 0.001
Model provider E< 0.001< 0.001

For each comparator target, we computed the session-level cost difference d = comparator cost − Auriko cost, where positive values indicate Auriko was cheaper. We tested the null hypothesis that Auriko has no lower session-level cost (H₀: the distribution of d is centered at zero) against the one-sided alternative (H₁: d is shifted positive) using two nonparametric tests.

The sign test counts non-tie sessions where Auriko costs less than the comparator versus more, and tests whether the lower-cost proportion exceeds 50%. The Wilcoxon signed-rank test incorporates magnitude by ranking the absolute differences, then testing whether positive ranks systematically outweigh negative ranks. Zero differences are included in the ranking (Wilcox method). We report both tests because they are complementary: the sign test is robust to outliers (a single extreme session cannot drive the result), while the Wilcoxon is more powerful because it uses rank information.

Both tests reject H₀ at p < 0.001 for six comparator targets, with p-values ranging from 10⁻¹² to 10⁻²⁵⁴.

Summary Table

TargetDollar-weighted savings95% CIWin / Tie / Loss (%)
Routing peer32.8%[30.6%, 34.9%]85.4 / 0.1 / 14.6
Model provider A38.3%[36.9%, 39.7%]87.4 / 2.6 / 10.0
Model provider B25.3%[24.7%, 26.0%]71.7 / 3.9 / 24.4
Model provider C19.8%[18.1%, 21.4%]71.4 / 1.7 / 26.9
Model provider D11.8%[10.4%, 13.3%]69.0 / 1.7 / 29.3
Model provider E7.7%[7.1%, 8.4%]42.5 / 30.2 / 27.3

Dollar-weighted savings = (total comparator spend − total Auriko spend) / total comparator spend. 95% CI is a stratified bootstrap interval (5,000 replicates, model + scenario strata) around dollar-weighted savings.

Model provider D has 416 clean paired sessions and 2 models. It is useful as a measured target result, but its narrow coverage means it should not be treated as broad evidence.

Model provider E has the narrowest target-level savings and the highest tie rate. Its raw lower-cost rate is 42.5% because 30.2% of sessions are ties; excluding ties, Auriko costs less in 60.9% of non-tie sessions.

Dropping the highest-savings direct baseline from the cohort, model provider cohort savings remain at approximately 16.8%. The cohort result is not carried by a single anonymized baseline.

Cost Reduction Consistency

The preceding sections reported statistically significant cost differences across six comparators. Statistical significance provides evidence against the null hypothesis, but does not quantify magnitude — with sufficient sample size, even a negligible difference produces a significant p-value. Effect size measures how large the savings are relative to session-to-session cost variability, independent of sample size.

We report Hedges' g: the mean per-session dollar difference (comparator minus Auriko) divided by its standard deviation, with a small-sample bias correction applied. Higher values indicate savings that are consistent across sessions; lower values indicate savings that are positive on average but more variable per session.

Figure 10: Effect sizes were positive across all comparators

Hedges' g ranges from 0.21 (Model provider E, N = 3,077) to 0.65 (Model provider A, N = 1,647). Six comparators produce positive effect sizes.

The target-level results answer the magnitude question. The next question is which measured cost components are associated with these savings.


The Mechanics: Cache-Aware Arbitrage

The actual cost of an LLM request depends on more than headline token price. On the provider side, five variables shape the realized cost: minimum token thresholds for cache activation, block granularity, cache expiration timing, write and storage costs, and context-length pricing tiers. On the user side, the workload pattern (prefix lengths, token reuse rates, request timing, and conversation depth) determines how effectively each provider's caching mechanics activate. The interaction between these two sides means the lowest-cost provider varies per request.

This section examines two measured patterns in the test data: provider cost arbitrage, where Auriko's realized route is compared with the tested provider cost range, and higher prompt-cache hit measures for the tested workloads. For a full treatment of the cost model, see How Auriko Optimizes LLM Inference Cost.

Provider Cost Arbitrage

The cost dispersion section showed that the same requested model can cost up to 4× more on different providers. This chart shows where Auriko's realized cost appears within that dispersion. For each model, the gray bar spans the cost range across the five tested single-provider baselines. Auriko's realized cost is near the low end of the range. The routing peer's cost is higher in the range.

Figure 11: Auriko routes near the cheapest provider for most models

Prompt-Cache Hit Measures

Providers charge a reduced rate for prompt tokens served from cache. Auriko had a higher token-weighted cache hit rate (cache-read tokens as a share of total input tokens) against six comparator targets, with deltas ranging from +2.9 to +11.6 percentage points.

Figure 12: Auriko had higher token-weighted cache hit rates

The previous chart measured the share of input tokens served from cache. This chart measures a different dimension: the share of requests with any cache hit. Auriko has a higher request-level hit rate against five of six targets. Model provider C is the exception, where the comparator leads by 2.5 percentage points.

Figure 13: Request-level cache hit rates were higher for five targets

Session-level cache hit rate delta correlates positively with cost reduction across six comparator targets (Spearman ρ = 0.26 to 0.61).

Figure 14: Cache-hit deltas correlated with session savings

Auriko's higher cache hit rate is consistent with selecting provider routes whose caching mechanics fit the measured usage pattern. Over successive conversation turns, the reusable prompt context grows and the measured cache-hit gap widens.

Figure 15: Auriko's cache-aware algorithm improved cache-hit rates


Robustness and Sensitivity

We tested the all-comparison aggregate result against 242 perturbations of the analysis sample: removing each scenario, comparator target, or model one at a time; re-including error sessions; normalizing completion-token costs; and restricting the analysis to sessions with similar output volumes. The result remained positive across every reported robustness-check family. The lowest aggregate estimate was 16.4% in the completion-matched subset.

Stress-Test Summary

The baseline dollar-weighted cost reduction across six comparator targets is 24.6% (95% CI [23.6%, 25.7%]). The table below reports the lowest estimate from each robustness-check family — the single variant in that family that produces the lowest cost reduction — along with its bootstrap confidence interval and the more conservative p-value from the sign test and Wilcoxon signed-rank test. It summarizes the all-comparison aggregate floor estimates, not all 242 robustness rows.

The full robustness package contains 242 rows across target-level, direct-provider-cohort, and all-comparison scopes: 24 scenario leave-one-out rows, 11 baseline leave-one-out rows, 175 model leave-one-out rows, 8 error-inclusion rows, and 24 completion-matched rows.

TestControls forVariantsFloor estimate95% CIp-value
Leave-one-scenario-outWorkload-mix dependence323.6% (drop agent)[22.8%, 24.4%]< 10⁻¹⁸⁴
Leave-one-baseline-outSingle-comparator dependence620.1% (drop Model provider A)[19.3%, 20.9%]< 10⁻³¹⁰
Leave-one-model-outSingle-model dependence3719.3% (drop deepseek-v4-pro)[18.7%, 20.0%]< 10⁻³⁰⁰
Error-inclusionClean-success filter bias118.1%[16.9%, 19.2%]< 10⁻³⁰⁰
Completion-matched (±10%)Output-length variation1 (5,716 sessions)16.4%[15.7%, 17.1%]< 10⁻²⁹³
Completion-normalizedOutput-volume cost scaling122.6%Per-target CIs above zeroAll < 0.001

Five panels show per-target results under each stress test; the sixth shows the maximum shift from baseline for each test family. The table above reports the floor from each panel.

The error-inclusion row reintroduces 1,158 platform-complete sessions with at least one side marked as error; aggregate savings remain 18.1%.

Figure 16: Savings remained positive across robustness stress tests

Leave-One-Out Stability

The three leave-one-out checks remove each scenario, comparator target, or model one at a time and recompute dollar-weighted cost reduction on the remaining sample. Scenario leave-one-out produces the tightest range (23.6%–26.4%), indicating that the result is not dependent on any one workload type in the leave-one-out check. Model leave-one-out shows the largest single-element influence: excluding deepseek-v4-pro yields 19.3% savings (95% CI [18.7%, 20.0%]), reflecting that model's unusually large measured savings in the benchmark.

Baseline leave-one-out ranges from 20.1% to 30.5%, depending on which comparator is excluded. This spread reflects the natural variation in per-target cost reduction, which ranges from 7.7% to 38.3%. Across scenario, baseline, and model leave-one-out checks, the aggregate CI lower bound remains above 18%.

Per-target detail for output-volume controls follows.

Completion Normalization

We test output-volume sensitivity by scaling output costs symmetrically: whichever side produced fewer completion tokens is scaled up to match the other side. Prompt and cache costs are preserved as observed. After normalization, savings remain positive at 9.9–38.4% across targets, indicating that output-volume differences do not fully explain the result. Every per-target bootstrap interval remains above zero; the lowest lower bound is 9.5%.

Figure 17: Savings remained positive after completion normalization

Completion-Matched Sensitivity

As a second output-volume check, we restrict the analysis to sessions where Auriko and the comparator produced similar output volumes. Even in the tightest band reported in the robustness table (±10%), savings remain positive for most targets. A stricter ±5% output-matched diagnostic, tracked separately from the 242-row robustness table, retains 4,547 sessions (47.4% of paired sessions) and still produces 16.1% aggregate savings. This is evidence that output-volume mismatch does not explain the full measured savings.

Figure 18: Savings remained positive in completion-matched subsets

Workload-Controlled Dispersion

One objection to provider cost dispersion is that different providers might serve different workload mixes. We therefore compare uncontrolled provider spread against spread measured only within common workload cells: identical prompts sent to all providers for each model. The dispersion persists after this workload control, consistent with the cost-dispersion pattern shown earlier.

The controlled audit covers 36 models across 1,556 common workload cells. The median controlled spend multiple is 1.57x; 32 of 36 models remain above 1.1x, 20 remain above 1.5x, and 7 remain above 2x.

Figure 19: Cost dispersion remained after workload control

Across the reported all-comparison stress-test families, aggregate savings ranged from 16.4% to 30.5%. Where aggregate confidence intervals are reported, their lower bounds remained above zero; completion-normalized per-target intervals also remained above zero. The next section defines the scope and boundaries of these results.


Scope of Results

This benchmark compared Auriko's realized costs against six comparator targets (one routing peer and five model providers) across 37 models and three workload types. The benchmark covers over 80,000 API requests forming over 22,000 sessions, collected over the benchmark window described in Experimental Design. Cost reduction was positive for each comparator target, each workload type, and 18 target-scenario cells, with dollar-weighted cost reduction ranging from 7.7% to 38.3% by target.

Comparator coverage. The benchmark tests six targets. The inference provider market includes providers not tested here.

Model coverage. 37 models are represented. Production workloads may include models outside this set. The model-level analysis shows cost reduction is distributed across models, not concentrated in a few.

Workload specificity. Cost reduction magnitudes are specific to this benchmark's workload mix. Different prompt lengths, conversation depths, or model distributions would produce different magnitudes.

Cost only. This benchmark measures cost. It does not measure latency, quality, or reliability differences across routes.

Validity analysis. Known limitations, confound controls, and threats to validity are examined in the Robustness and Sensitivity section.


Methodology

Metric definitions, experimental design, and statistical methods.

Glossary of Metrics

MetricDefinitionFormulaUsed in
Cache hit rate (request-level)Share of requests with any cache-read tokensrequests with cache_read > 0 / successful requestsThe Mechanics: Cache-Aware Arbitrage
Cache hit rate (token-weighted)Cache hit rate weighted by total input tokensΣ(cache_read_tokens) / Σ(input_tokens)The Mechanics: Cache-Aware Arbitrage
Clean paired sessionA session where both sides completed without errorBoth sides non-null, non-error, symmetric exclusionAll sections
Completion-matched bandSubset where output-token ratios fall within a rangeauriko_output / comp_output ∈ [lower, upper]Robustness
Dollar-weighted savings (%)Aggregate savings as share of comparator spend(Σ comp_cost − Σ auriko_cost) / Σ comp_cost × 100The Evidence: Cost Reduction
Effect size (Hedges' g)Bias-corrected standardized mean differencemean(diffs) / sd(diffs), corrected for small NThe Evidence: Cost Reduction
Indexed spend (cheapest = 100)Provider cost relative to cheapest for the same requested modelprovider_cost / min(provider_costs) × 100Cost Dispersion Across Providers
Output-token ratioRatio of Auriko to comparator output tokensauriko_completion_tokens / comp_completion_tokensRobustness
Session win / tie / lossPer-session outcome based on cost differenceWin: comp > auriko; Loss: auriko > comp; Tie: equalThe Evidence: Cost Reduction
Symmetric completion normalizationScale shorter side's output tokens to matchAdjusts output cost component only; prompt/cache preservedRobustness

Statistical Methods

Confidence intervals. Percentile bootstrap on dollar-weighted savings. 5,000 replicates, stratified by model and scenario. 2.5th–97.5th percentile (95% CI). Fixed seed for reproducibility.

Aggregate hypothesis tests. For each comparator target, session-level cost differences (d = comparator cost − Auriko cost) are tested with two one-sided nonparametric tests. The sign test counts non-tie sessions where Auriko costs less than the comparator versus more, and tests whether the lower-cost proportion exceeds 50%. The Wilcoxon signed-rank test ranks absolute differences and tests whether positive ranks systematically outweigh negative ranks.

Per-model hypothesis tests. Within each target, per-model cost differences are tested for normality (Shapiro-Wilk). Normal distributions use a paired t-test; non-normal distributions use a Wilcoxon signed-rank test. Both are two-sided.

Multiple comparisons. Holm–Bonferroni step-down procedure at α = 0.05 on per-model p-values within each target.

Effect size. Hedges' g: mean of paired cost differences divided by their standard deviation, with small-sample bias correction.

Win rate inference. Binomial test against H₀: win rate = 50% (ties excluded). Wilson score interval on session win rate.

Heterogeneity. Cochran's Q tests whether model-level savings are homogeneous. I² estimates the proportion of variance due to true heterogeneity. τ² (DerSimonian-Laird) estimates between-model variance.

Limitations

LimitationScope
Point-in-time snapshotResults reflect the measurement window (2026-05-28 to 2026-06-07 UTC). Provider pricing and model availability change.
Workload representativenessProduction workloads may differ in prompt length, complexity, or domain.
Model selection37 models tested. Results apply to tested models only.
Geographic constraintsAll sessions generated from a single geographic region. Provider-selection behavior may differ in other regions.
Routing algorithm evolutionResults reflect the algorithm version active during the measurement window.

See cost optimization live

See how Auriko reduces inference cost