The variables that affect your inference cost

How provider caching mechanics interact with your workload to determine inference cost, and how to optimize for it.


You compare input rates, output rates, cache reads, and cache writes across providers. Then you pick the cheapest one on paper and move on. That's where most teams start, and it's a reasonable first pass.

The problem is that those rates only tell you what a cached token costs if it gets cached. They don't tell you which tokens actually get cached, under what conditions, or how your traffic interacts with each provider's caching mechanics. Two providers quoting identical rates can still produce very different bills.


The variables between the rate and the bill

The cache rate on a pricing page shows the discount you could get. Whether you actually get that discount depends on how each provider's caching mechanics interact with your specific workload:

Minimum token thresholds. Some providers won't cache anything below a certain token count. If your system prompt falls short of the threshold, every token pays full price on every request. In that case, the cache rate on the pricing page doesn't apply.

Block granularity. Some providers cache in fixed-size blocks, and tokens that don't fill a complete block pay full price. On shorter prompts, the tokens left outside block boundaries add up.

Expiration behavior. Cache duration varies by provider. A cached entry is only useful if another matching request arrives before it expires. Frequent requests are more likely to reuse the cache because less time passes between them. With bursty traffic, entries can expire between bursts, so expected reuse may not translate into actual cache hits.

Cache write and storage costs. Some providers charge a premium the first time a prompt enters the cache, and some charge for cache storage over time. Those costs only pay off if subsequent requests reuse the cached prompt. On a short conversation, the write and storage costs can exceed the read savings.

Tiered pricing. Provider rates can shift based on context length. A request at 200,000 tokens may hit a different pricing tier with different cache read and write rates than one at 50,000. The rate on the pricing page may not be the rate that applies to your request.

These factors interact. For example, a provider with a 90% cache discount and a 1,024-token minimum gives you a 0% effective discount on an 800-token prompt. A provider with 75% off and no minimum gives you 75%. In that case, the deeper published discount does not lead to a lower bill, because the caching mechanism never activates.
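
To make that interaction concrete, here is a minimal sketch of how an effective discount falls out of published rates once a threshold and block rounding apply. The rules below are simplified illustrations, not any specific provider's mechanics:

```python
def effective_discount(prompt_tokens: int, discount: float,
                       min_tokens: int = 0, block_size: int = 1) -> float:
    """Effective discount on a prompt under simplified caching mechanics.

    Illustrative assumptions: nothing caches below `min_tokens`, and
    only tokens filling complete blocks of `block_size` are cached;
    the remainder pays full price.
    """
    if prompt_tokens < min_tokens:
        return 0.0  # below the threshold, the cache never activates
    cacheable = (prompt_tokens // block_size) * block_size
    return discount * cacheable / prompt_tokens

# The example above: an 800-token prompt.
print(effective_discount(800, discount=0.90, min_tokens=1024))  # 0.0
print(effective_discount(800, discount=0.75))                   # 0.75
```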


The cheapest provider depends on the request

Provider mechanics are only one side of the equation.

Your usage pattern (prompt structure, prefix length, how often prompts repeat, request timing, output volume, and conversation depth) interacts with each provider's mechanics differently. The provider that's cheapest for a chatbot with a long, stable system prompt can be the most expensive for short, varied requests. The one that wins on high-reuse workloads can lose when traffic is bursty, because cached entries are more likely to expire between requests.

Expected conversation length also matters. A provider that charges a write premium to populate the cache may cost more on the first request. If the conversation continues for several turns, that upfront cost can be amortized across discounted cache reads. The cheapest provider for the entire conversation isn't necessarily the one that's cheapest for the first message.
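
A rough breakeven sketch, with every number invented for illustration: assume cache writes cost 1.25x the base input rate and cache reads cost 0.1x of it. Under those assumptions, caching the shared prefix loses money on a one-turn conversation and wins from the second turn on:

```python
def prefix_cost_cached(turns: int, prefix_tokens: int, rate: float,
                       write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Cost of a conversation's shared prefix with caching (illustrative).

    Turn 1 pays the write premium to populate the cache; later turns pay
    the discounted read rate. Ignores per-turn new tokens, storage fees,
    and expiration.
    """
    return (prefix_tokens * rate * write_mult
            + (turns - 1) * prefix_tokens * rate * read_mult)

def prefix_cost_uncached(turns: int, prefix_tokens: int, rate: float) -> float:
    return turns * prefix_tokens * rate

for turns in (1, 2, 5):  # 2,000-token prefix at an assumed $3 per 1M tokens
    print(turns, prefix_cost_cached(turns, 2000, 3e-6),
          prefix_cost_uncached(turns, 2000, 3e-6))
# turns=1: 0.0075 vs 0.0060 -> caching costs more on the first message
# turns=2: 0.0081 vs 0.0120 -> caching already wins
```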

When the workload changes, the cost ranking can change too. There often isn't one universally cheapest provider. There's a cheapest provider for this request, on this workload, right now.


How Auriko handles this

Auriko models both sides of the equation.

On the provider side, we run a data engine that tracks, tests, and measures caching behavior across every provider we route to: discount depths, minimum thresholds, block granularity, write costs, expiration behavior, and pricing tiers that shift with context length. Providers change their infrastructure over time, and we keep this data updated.

On your side, we estimate request-level variables from usage patterns: token counts, prefix lengths, reuse rates, request timing, output volumes, and conversation depth. This lets us resolve the correct pricing tier and estimate cache effectiveness for each provider per request. We never see, log, or store your prompts, responses, or content; see our Data policy for details.

For each request, we compute the expected cost at every available provider and route to the cheapest one.
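
Conceptually, that decision is an argmin over providers' expected costs. The sketch below is our own simplification with made-up parameters, not Auriko's production model, which also resolves tiers, block granularity, write premiums, and expiration from measured data:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    input_rate: float      # $ per input token at full price
    output_rate: float     # $ per output token
    cache_discount: float  # fraction off cached input tokens
    min_cache_tokens: int  # threshold below which nothing caches

def expected_cost(p: Provider, prompt_tokens: int, cached_prefix: int,
                  hit_prob: float, output_tokens: int) -> float:
    """Expected cost of one request at one provider (simplified)."""
    if cached_prefix < p.min_cache_tokens:
        cached_prefix = 0  # the threshold gates caching entirely
    saved = hit_prob * cached_prefix * p.input_rate * p.cache_discount
    return prompt_tokens * p.input_rate - saved + output_tokens * p.output_rate

providers = [  # rates invented for illustration
    Provider("A", 3e-6, 15e-6, cache_discount=0.90, min_cache_tokens=1024),
    Provider("B", 3e-6, 15e-6, cache_discount=0.75, min_cache_tokens=0),
]
best = min(providers, key=lambda p: expected_cost(p, 900, 800, 0.8, 400))
print(best.name)  # "B": the shallower discount wins below A's threshold
```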

Auriko also handles the provider-specific steps needed to activate caching. Every provider's caching works differently: some require explicit directives in the request, some use session affinity to route repeat requests to the same cache, some need cache keys for storage-local lookup, and some have per-model token thresholds that gate when caching fires at all. Auriko handles each automatically.
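
To illustrate the shape of that work, here is a hypothetical adapter. The mechanism names, request fields, and header are all invented for this sketch; each real provider defines its own syntax and requirements:

```python
import hashlib

def activate_caching(mechanism: str, request: dict, prefix: str) -> dict:
    """Attach the provider-specific signal that turns caching on.

    Everything here is hypothetical: real providers each define their
    own directives, affinity schemes, and cache-key formats.
    """
    key = hashlib.sha256(prefix.encode()).hexdigest()[:16]
    if mechanism == "explicit-directive":
        # Mark the cacheable segment directly in the request body.
        request["cache_hint"] = {"prefix_chars": len(prefix)}
    elif mechanism == "session-affinity":
        # Sticky header so repeat requests land on the same cache.
        request.setdefault("headers", {})["x-session-key"] = key
    elif mechanism == "cache-key":
        # Explicit key for storage-local lookup.
        request["cache_key"] = key
    # "Automatic" providers need nothing added here, but per-model
    # token thresholds still gate whether caching fires at all.
    return request

req = activate_caching("session-affinity", {"prompt": "..."},
                       prefix="You are a helpful assistant. ...")
```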

You can route purely on cost. Or you can balance cost against time to first token, throughput, and other performance dimensions. You control the priority.
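
As an illustration, that preference might be expressed in a request like the one below. The field names are hypothetical, not Auriko's documented API:

```python
# Hypothetical shape of a routing preference; field names are illustrative.
routing_preferences = {
    "optimize": "cost",  # route purely on expected cost
    # or weight cost against performance dimensions:
    # "optimize": {"cost": 0.7, "ttft": 0.2, "throughput": 0.1},
}
```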

Every request is backed by automatic failover. Routing is capacity-aware: requests route around providers approaching their limits before those limits are hit, not after. Cost optimization doesn't come at the cost of reliability.

When caching produces savings, the response includes the amount saved, both as a percentage and as a dollar amount. Savings are visible per request.
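
A hypothetical shape for that metadata, with illustrative field names:

```python
# Hypothetical response metadata; actual field names may differ.
response_meta = {
    "cache_savings": {
        "percent": 38.5,  # percent of the uncached price saved
        "usd": 0.0042,    # dollar amount saved on this request
    }
}
```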

Get started

Auriko is in beta.