Cache Hit Rate Controls Repeated Input Cost
Prompt caching is not simply a switch that makes API calls cheaper. It lowers cost only when repeated input can be billed as cached input instead of normal input. The higher the hit rate, the lower your repeated input cost.
For models that support caching, input tokens usually fall into two groups:
- Cache miss: input processed for the first time and billed at normal input price.
- Cache hit: input reused from cache and billed at a lower price.
You can enter both values separately in the text model calculator to see how different hit rates affect total cost. If you need the pricing mechanics first, read how much prompt caching can save.
How to Estimate Hit Rate
Cache hit rate is the share of total input tokens that is reused from cache and billed as cached input.
cache hit rate = cached input tokens / total input tokens
For example, a request may include 20K input tokens: 12K from a stable system prompt, tool instructions, or knowledge context, and 8K from the current user message. If the stable part is reused on every request, the theoretical hit rate is 60% (12K / 20K). Real hit rates are usually lower, because the first request with a given context is always a miss.
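The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula; the function name `cache_hit_rate` and the variable names are made up for this example and do not come from any provider SDK.

```python
def cache_hit_rate(cached_input_tokens: int, total_input_tokens: int) -> float:
    """Return cached input tokens as a fraction of total input tokens."""
    if total_input_tokens <= 0:
        raise ValueError("total_input_tokens must be positive")
    if not 0 <= cached_input_tokens <= total_input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and the total")
    return cached_input_tokens / total_input_tokens


# Numbers from the example: 12K stable context plus 8K current user message.
stable_tokens = 12_000
user_tokens = 8_000

rate = cache_hit_rate(stable_tokens, stable_tokens + user_tokens)
print(f"Theoretical hit rate: {rate:.0%}")  # prints "Theoretical hit rate: 60%"
```

In production, the cached and total token counts would come from whatever usage data your provider reports, not from a design-time estimate like this one.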
What Should Be Cached
Good cache candidates usually have three traits:
- They repeat across many requests.
- They are long enough to matter.
- They do not need to change every time.
Common examples include:
- system prompts
- long policy instructions
- tool descriptions
- stable knowledge summaries
- repeated document prefixes
User input, real-time state, timestamps, and temporary context should usually be excluded from your cached-token estimate.
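One way to keep the two kinds of content apart is to build the stable part once and append the changing part per request. The sketch below assumes your provider matches cached input on a shared leading prefix, which you should confirm in its documentation. All names (`SYSTEM_PROMPT`, `POLICY_TEXT`, `TOOL_DESCRIPTIONS`, `build_prompt`) and the placeholder strings are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical stable content: identical across requests, so a cache candidate.
SYSTEM_PROMPT = "You are a support assistant for ExampleCo."
POLICY_TEXT = "Refund policy: ... (long, rarely changing text)"
TOOL_DESCRIPTIONS = "Tools: lookup_order(order_id), create_ticket(summary)"

STABLE_PREFIX = "\n\n".join([SYSTEM_PROMPT, POLICY_TEXT, TOOL_DESCRIPTIONS])


def build_prompt(user_message: str, now: datetime) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix) for a single request.

    Timestamps and user input go in the suffix only, so the prefix stays
    byte-for-byte identical from one request to the next.
    """
    volatile_suffix = f"Current time: {now.isoformat()}\nUser: {user_message}"
    return STABLE_PREFIX, volatile_suffix


prefix_a, suffix_a = build_prompt("Where is my order?", datetime.now(timezone.utc))
prefix_b, suffix_b = build_prompt("I want a refund.", datetime.now(timezone.utc))
print(prefix_a == prefix_b)  # the stable part does not change between requests
```

If a timestamp or user-specific value were placed at the top of the prompt instead, the prefix would differ on every request and the stable text behind it would be unlikely to count as cached input.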
Use 0%, 50%, and 80% Budget Scenarios
Do not budget with only one ideal hit rate. Compare at least three scenarios:
| Hit Rate | Meaning | Use |
|---|---|---|
| 0% | No caching | Conservative budget |
| 50% | Half of input is reusable | Practical estimate |
| 80% | Most context is stable | Optimistic estimate |
If your budget works at 0%, cost risk is low. If it only works at 80%, validate the caching strategy before relying on it.
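The three scenarios can be compared with a short calculation. This is a minimal sketch under stated assumptions: the monthly token volume and both per-million-token prices are placeholders, not any provider's real pricing, and the model ignores cache-write surcharges and cache expiry, which some providers apply.

```python
def input_cost_usd(
    total_input_tokens: int,
    hit_rate: float,
    miss_price_per_mtok: float,
    hit_price_per_mtok: float,
) -> float:
    """Input cost in USD when hit_rate of the input tokens is billed as cached."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_tokens = total_input_tokens * hit_rate
    miss_tokens = total_input_tokens - hit_tokens
    return (miss_tokens * miss_price_per_mtok + hit_tokens * hit_price_per_mtok) / 1_000_000


# Placeholder assumptions; replace with your own volume and your provider's prices.
MONTHLY_INPUT_TOKENS = 500_000_000
MISS_PRICE_PER_MTOK = 3.00  # USD per 1M normal input tokens (hypothetical)
HIT_PRICE_PER_MTOK = 0.30   # USD per 1M cached input tokens (hypothetical)

scenarios = {
    rate: input_cost_usd(MONTHLY_INPUT_TOKENS, rate, MISS_PRICE_PER_MTOK, HIT_PRICE_PER_MTOK)
    for rate in (0.0, 0.5, 0.8)
}

for rate, cost in scenarios.items():
    print(f"{rate:>4.0%} hit rate: ${cost:,.2f} per month")
```

With these placeholder numbers the monthly input cost is $1,500 at 0%, $825 at 50%, and $420 at 80%. The gap between the first and last figure is the amount of budget that depends on the caching strategy actually working.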
Caching Does Not Reduce Output Cost
Prompt caching affects input-related cost. For writing, coding, report generation, and agent workflows, output tokens can still dominate the bill.
Even with a high cache hit rate, you still need to control the following (see the practical AI API cost reduction guide):
- output length
- retry frequency
- whether explanations are required
- whether structured short answers can replace long prose
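To see why output still matters, the sketch below splits one request's cost into input and output at two hit rates. The token counts and all three prices are hypothetical placeholders chosen for illustration, and the function names are invented for this example.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    hit_rate: float,
    miss_price_per_mtok: float,
    hit_price_per_mtok: float,
    output_price_per_mtok: float,
) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one request."""
    hit_tokens = input_tokens * hit_rate
    miss_tokens = input_tokens - hit_tokens
    input_cost = (miss_tokens * miss_price_per_mtok + hit_tokens * hit_price_per_mtok) / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost, output_cost


def output_share(hit_rate: float) -> float:
    """Fraction of one request's cost that comes from output tokens."""
    # Hypothetical request: 20K input tokens, 2K output tokens.
    # Hypothetical prices (USD per 1M tokens): 3.00 input, 0.30 cached, 15.00 output.
    input_cost, output_cost = request_cost_usd(20_000, 2_000, hit_rate, 3.00, 0.30, 15.00)
    return output_cost / (input_cost + output_cost)


for hit_rate in (0.0, 0.8):
    print(f"{hit_rate:.0%} hit rate: output is {output_share(hit_rate):.0%} of request cost")
```

Under these assumptions the output cost per request is the same at both hit rates, so its share of the bill grows from about a third to almost two thirds as caching shrinks the input side.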
When Cache Engineering Is Worth It
If request volume is low and inputs are short, cache engineering may not be worth the effort. It matters more when:
- many requests repeat the same context
- each request has a long system prompt
- agent tool instructions are large
- document Q&A reuses stable context
- enterprise workflows share background knowledge
Estimate savings with prompt caching analysis before committing engineering time.
Summary
Cache hit rate is a key variable in AI API cost planning. It can reduce repeated input cost, but it does not reduce output cost, and it should not be estimated too optimistically. Compare multiple hit-rate scenarios before launch so you know whether your product economics hold up without ideal caching.