How Cache Hit Rate Changes AI API Cost

Cache Hit Rate Controls Repeated Input Cost

Prompt caching is not simply a switch that makes API calls cheaper. It lowers cost only when repeated input is billed as cached input instead of normal input. The higher the hit rate, the lower the effective price you pay per input token.

For models that support caching, input tokens usually fall into two groups:

  • Cache miss: input processed for the first time and billed at normal input price.
  • Cache hit: input reused from cache and billed at a lower price.
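The two groups translate into a simple cost formula. The sketch below is only an illustration: the function name and both prices are placeholders, not values from any provider's price list, and it ignores any extra charge for writing to the cache that some providers apply.

```python
def input_cost(miss_tokens, hit_tokens, input_price_per_m, cached_price_per_m):
    """Return the input cost in dollars for a given mix of miss and hit tokens.

    Prices are expressed in dollars per one million tokens.
    """
    miss_cost = miss_tokens / 1_000_000 * input_price_per_m
    hit_cost = hit_tokens / 1_000_000 * cached_price_per_m
    return miss_cost + hit_cost


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
INPUT_PRICE = 3.00    # dollars per 1M normal input tokens (assumption)
CACHED_PRICE = 0.30   # dollars per 1M cached input tokens (assumption)

# One million input tokens, first with no caching, then with 60% served from cache.
no_cache = input_cost(1_000_000, 0, INPUT_PRICE, CACHED_PRICE)
with_cache = input_cost(400_000, 600_000, INPUT_PRICE, CACHED_PRICE)

print(f"no cache:   ${no_cache:.2f}")
print(f"with cache: ${with_cache:.2f}")
```

Replace the two price constants with the rates published for your model before drawing conclusions from the numbers.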

You can enter both values separately in the text model calculator to see how different hit rates affect total cost. If you need the pricing mechanics first, read how much prompt caching can save.

How to Estimate Hit Rate

Cache hit rate is the share of total input tokens that can be reused.

cache hit rate = cached input tokens / total input tokens

For example, suppose a request includes 20K input tokens: 12K from a stable system prompt, tool instructions, or knowledge context, and 8K from the current user message. If the entire stable part is reused, the theoretical hit rate is 12K / 20K = 60%.
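The same calculation can be written as a small helper. This is a minimal sketch; the function and variable names are illustrative, and the token counts are the ones from the example above.

```python
def cache_hit_rate(cached_tokens, total_tokens):
    """Return cached input tokens as a fraction of total input tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    return cached_tokens / total_tokens


stable_tokens = 12_000   # system prompt, tool instructions, knowledge context
user_tokens = 8_000      # current user message, never reused

rate = cache_hit_rate(stable_tokens, stable_tokens + user_tokens)
print(f"theoretical hit rate: {rate:.0%}")  # prints "theoretical hit rate: 60%"
```

Treat the result as an upper bound: it assumes every stable token is actually served from cache on every request.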

What Should Be Cached

Good cache candidates usually have three traits:

  1. They repeat across many requests.
  2. They are long enough to matter.
  3. They do not need to change every time.

Common examples include:

  • system prompts
  • long policy instructions
  • tool descriptions
  • stable knowledge summaries
  • repeated document prefixes

User input, real-time state, timestamps, and temporary context should usually stay outside your cache assumption.
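One way to apply this split is to label each part of a prompt as stable or volatile and place the stable parts first. The sketch below assumes the provider reuses a matching prefix of the request, which is a common design but should be checked against your provider's documentation. The `PromptPart` class, the part names, and the token counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromptPart:
    name: str
    tokens: int    # token count, assumed to be measured elsewhere
    stable: bool   # True if the text is identical across requests


def order_for_caching(parts):
    """Put stable parts first so they form one reusable prefix.

    sorted() is a stable sort, so the relative order within each group is kept.
    """
    return sorted(parts, key=lambda part: not part.stable)


def cacheable_share(parts):
    """Return the fraction of tokens that belong to stable parts."""
    total = sum(part.tokens for part in parts)
    stable = sum(part.tokens for part in parts if part.stable)
    return stable / total if total else 0.0


parts = [
    PromptPart("timestamp", 20, stable=False),
    PromptPart("system prompt", 3_000, stable=True),
    PromptPart("tool descriptions", 6_000, stable=True),
    PromptPart("user message", 980, stable=False),
]

ordered = order_for_caching(parts)
share = cacheable_share(parts)

print([part.name for part in ordered])
print(f"cacheable share: {share:.0%}")
```

In this made-up layout, a timestamp placed at the very start of the prompt would change the prefix on every request, which is why it is moved behind the stable parts.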

Use 0%, 50%, and 80% Budget Scenarios

Do not budget with only one ideal hit rate. Compare at least three scenarios:

| Hit Rate | Meaning | Use |
| --- | --- | --- |
| 0% | No caching | Conservative budget |
| 50% | Half of input is reusable | Practical estimate |
| 80% | Most context is stable | Optimistic estimate |

If your budget works at 0%, cost risk is low. If it only works at 80%, validate the caching strategy before relying on it.
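The three scenarios can be compared side by side with a few lines of code. This is a rough sketch: the prices, the monthly token volume, and the budget are placeholder assumptions, not real figures.

```python
# Hypothetical inputs; replace them with your own numbers.
INPUT_PRICE = 3.00                   # dollars per 1M normal input tokens (assumption)
CACHED_PRICE = 0.30                  # dollars per 1M cached input tokens (assumption)
MONTHLY_INPUT_TOKENS = 500_000_000   # assumed monthly input volume
MONTHLY_INPUT_BUDGET = 1_000.00      # assumed input budget in dollars


def monthly_input_cost(total_tokens, hit_rate, input_price, cached_price):
    """Return the monthly input cost in dollars for a given cache hit rate."""
    hit_tokens = total_tokens * hit_rate
    miss_tokens = total_tokens - hit_tokens
    return (miss_tokens * input_price + hit_tokens * cached_price) / 1_000_000


scenarios = {"conservative": 0.0, "practical": 0.5, "optimistic": 0.8}

costs = {
    name: monthly_input_cost(MONTHLY_INPUT_TOKENS, rate, INPUT_PRICE, CACHED_PRICE)
    for name, rate in scenarios.items()
}
within_budget = {name: cost <= MONTHLY_INPUT_BUDGET for name, cost in costs.items()}

for name, cost in costs.items():
    status = "ok" if within_budget[name] else "over budget"
    print(f"{name:>12}: ${cost:,.2f} ({status})")
```

With these placeholder numbers, the budget holds at 50% and 80% but not at 0%, so the plan depends on caching working at least moderately well.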

Caching Does Not Reduce Output Cost

Prompt caching affects input-related cost. For writing, coding, report generation, and agent workflows, output tokens can still dominate the bill.

Even with a high cache hit rate, follow the practical AI API cost reduction guide and keep these factors under control:

  • output length
  • retry frequency
  • whether explanations are required
  • whether structured short answers can replace long prose
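A quick calculation shows how output can outweigh input even when caching works well. The sketch reuses the 20K-token request from earlier with a 60% hit rate; the 4K output length and all three prices are hypothetical.

```python
def request_cost(miss_tokens, hit_tokens, output_tokens,
                 input_price, cached_price, output_price):
    """Return (input_cost, output_cost) in dollars for one request.

    All prices are in dollars per one million tokens.
    """
    input_cost = (miss_tokens * input_price + hit_tokens * cached_price) / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost, output_cost


# Hypothetical prices and output length.
input_cost, output_cost = request_cost(
    miss_tokens=8_000,
    hit_tokens=12_000,
    output_tokens=4_000,
    input_price=3.00,
    cached_price=0.30,
    output_price=15.00,
)
output_share = output_cost / (input_cost + output_cost)

print(f"input:  ${input_cost:.4f}")
print(f"output: ${output_cost:.4f}")
print(f"output share of total: {output_share:.0%}")
```

Under these assumptions, output accounts for more than half of the request cost, so shortening responses would save more than raising the hit rate further.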

When Cache Engineering Is Worth It

If request volume is low and inputs are short, cache engineering may not be worth the effort. It matters more when:

  • many requests repeat the same context
  • each request has a long system prompt
  • agent tool instructions are large
  • document Q&A reuses stable context
  • enterprise workflows share background knowledge

Estimate savings with prompt caching analysis before committing engineering time.
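A back-of-the-envelope savings estimate can help with that decision. The sketch below compares a high-volume and a low-volume workload; the request counts, token counts, hit fraction, and prices are all made-up assumptions.

```python
def monthly_savings(requests, stable_tokens_per_request, hit_fraction,
                    input_price, cached_price):
    """Estimate monthly dollars saved by billing stable tokens at the cached price.

    hit_fraction is the share of stable tokens actually served from cache.
    Prices are in dollars per one million tokens.
    """
    cached_tokens = requests * stable_tokens_per_request * hit_fraction
    return cached_tokens * (input_price - cached_price) / 1_000_000


INPUT_PRICE = 3.00    # hypothetical
CACHED_PRICE = 0.30   # hypothetical

# Long shared context, many requests.
high_volume = monthly_savings(200_000, 12_000, 1.0, INPUT_PRICE, CACHED_PRICE)
# Short prompts, few requests.
low_volume = monthly_savings(2_000, 500, 1.0, INPUT_PRICE, CACHED_PRICE)

print(f"high volume: ${high_volume:,.2f} per month")
print(f"low volume:  ${low_volume:,.2f} per month")
```

If the estimate for your workload looks like the low-volume case, the engineering effort is unlikely to pay for itself.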

Summary

Cache hit rate is a key variable in AI API cost planning. It can reduce repeated input cost, but it does not reduce output cost, and it should not be assumed too optimistically. Compare multiple hit-rate scenarios before launch so you know whether caching truly supports your product economics.
