Do Not Put Cache Savings Into the Budget Until You Can Explain Them
Prompt caching can reduce AI API cost, but only when the request structure is stable enough to reuse cached input. If the cacheable prefix changes on every request, few requests hit the cache, and a budget that assumes cache discounts will understate the real bill. For the broader savings model, start with the prompt caching savings guide and use this checklist to validate the budget assumptions.
Use this checklist before relying on prompt caching in a production cost estimate.
1. Identify the Cacheable Prefix
Start by separating fixed content from dynamic content.
Usually cacheable:
- system instructions
- stable role definitions
- tool schemas
- safety policies
- fixed examples
- long reference documents reused across requests
Usually dynamic:
- user message
- current timestamp
- session id
- retrieved snippets that change per request
- recent chat history
- personalized account data
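Caches typically match on an exact prefix from the start of the request. If a dynamic value appears before the fixed content ends, everything after it stops matching and the cache hit rate may drop sharply.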
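The separation above can be sketched in Python. This is a minimal illustration, not a provider API: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and the sketch assumes the prompt is assembled as plain text with stable content first.

```python
import hashlib
import json


def build_prompt(system_text, tool_schemas, fixed_examples, dynamic_parts):
    """Return (prefix, full_prompt) with all stable content placed first."""
    # Stable content only: nothing here should vary between requests.
    prefix = "\n\n".join([
        system_text,
        json.dumps(tool_schemas, sort_keys=True),
        fixed_examples,
    ])
    # Dynamic content (user message, timestamp, session id) goes last.
    suffix = "\n\n".join(dynamic_parts)
    return prefix, prefix + "\n\n" + suffix


def prefix_fingerprint(prefix):
    """Short hash of the prefix, suitable for logging and comparison."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]


tools = [{"name": "search", "description": "Search the knowledge base"}]
prefix_a, _ = build_prompt(
    "You are a support assistant.", tools, "Example: ...",
    ["time: 2024-05-01T12:00:00", "user: reset my password"],
)
prefix_b, _ = build_prompt(
    "You are a support assistant.", tools, "Example: ...",
    ["time: 2024-05-01T12:05:00", "user: update my email"],
)

# Both requests share one prefix even though the dynamic parts differ.
print(prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b))
```

Logging the fingerprint with each request makes it easy to see how many distinct prefixes a feature actually sends.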
2. Check Tool Schema Stability
AI apps that use tools often send large tool definitions with each request. These definitions may be cacheable if they are stable.
Check whether:
- tool order stays the same
- tool names stay the same
- descriptions are not regenerated per request
- optional tools are not added or removed unpredictably between requests
- feature flags do not change the schema frequently
Tool schemas can be a large part of input tokens, so unstable schemas can erase expected savings.
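One way to check schema stability is to serialize tool definitions deterministically and compare a hash across requests. The sketch below assumes each tool is a dictionary with a `name` key; `serialize_tools` and `tools_fingerprint` are hypothetical helper names, and sorting by name is just one possible fixed order.

```python
import hashlib
import json


def serialize_tools(tools):
    """Serialize tool definitions in one deterministic order and layout."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def tools_fingerprint(tools):
    """Short hash of the serialized tools; changes when any schema changes."""
    serialized = serialize_tools(tools)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:16]


search = {"name": "search", "description": "Search the knowledge base"}
refund = {"name": "refund", "description": "Issue a refund"}

# Input order no longer matters once tools are sorted before sending.
same = tools_fingerprint([search, refund]) == tools_fingerprint([refund, search])
# A regenerated description produces a different fingerprint.
changed = tools_fingerprint([search]) != tools_fingerprint(
    [{"name": "search", "description": "Search the docs"}]
)
print(same, changed)
```

If the fingerprint changes more often than the tool set is deliberately updated, something in the pipeline is regenerating or reordering the schema.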
3. Estimate Cache Hit Rate Conservatively
Do not assume an 80% or 90% hit rate without evidence. Start with a conservative scenario table:
| Scenario | Cache Hit Rate |
|---|---|
| new feature, unknown usage | 20% to 40% |
| repeated workflow, stable prompt | 50% to 70% |
| high-volume fixed assistant | 70% to 90% |
Then replace assumptions with measured logs after launch.
4. Compare Cached and Uncached Input
A prompt caching budget should show both versions:
| Budget Field | Why It Matters |
|---|---|
| total input tokens | shows full request size |
| cacheable input tokens | shows possible discount area |
| uncached input tokens | billed at the full input price |
| output tokens | usually not reduced by input caching |
| cache write cost | may apply when creating cache entries |
| cache read cost | applies when cache is hit |
Use the text model calculator for baseline input and output estimates, then compare with cache hit rate cost planning.
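The budget fields above can be combined into a rough per-request cost model. The prices below are placeholders, not any provider's real rates, and the sketch makes a simplifying assumption that every cache miss is billed as a cache write. For a provider with no write surcharge, set `cache_write` equal to `full_input`.

```python
from dataclasses import dataclass

PER_TOKEN = 1 / 1_000_000  # prices are quoted per million tokens


@dataclass(frozen=True)
class Prices:
    """Placeholder prices per million tokens; replace with real rates."""
    full_input: float = 2.00
    cache_read: float = 0.20
    cache_write: float = 2.50
    output: float = 8.00


def request_cost(cacheable, uncached, output, hit_rate, prices=Prices()):
    """Expected cost of one request at a given cache hit rate (0.0 to 1.0)."""
    # Uncached input and output are paid on every request.
    always = (uncached * prices.full_input + output * prices.output) * PER_TOKEN
    on_hit = cacheable * prices.cache_read * PER_TOKEN
    # Assumption: a miss rewrites the cache and pays the write price.
    on_miss = cacheable * prices.cache_write * PER_TOKEN
    return always + hit_rate * on_hit + (1 - hit_rate) * on_miss


def uncached_cost(cacheable, uncached, output, prices=Prices()):
    """Baseline with caching disabled: all input at the full price."""
    total_input = cacheable + uncached
    return (total_input * prices.full_input + output * prices.output) * PER_TOKEN


baseline = uncached_cost(8_000, 2_000, 500)
for hit_rate in (0.2, 0.5, 0.9):
    cost = request_cost(8_000, 2_000, 500, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {cost:.5f} vs uncached {baseline:.5f}")
```

With these placeholder prices, the 20% scenario comes out slightly above the uncached baseline because most requests pay the write price. That is why the hit rate assumption deserves more scrutiny than the discount itself.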
5. Watch for Cache Breakers
Common cache breakers include:
- timestamps in the system prompt
- request ids inside fixed instructions
- changing example order
- dynamic retrieval content before fixed tool schemas
- different locale text mixed into the same prompt layout
- A/B test variants inserted into the prefix
Move dynamic fields later in the request where possible.
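Some cache breakers can be caught with a simple scan of the prefix text before it ships. The patterns below are illustrative and far from exhaustive. `find_cache_breakers` is a hypothetical helper, and a clean result does not prove the prefix is stable.

```python
import re

# Illustrative patterns for values that tend to change on every request.
BREAKER_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_cache_breakers(prefix):
    """Return (pattern_name, matched_text) pairs found in the prefix."""
    findings = []
    for name, pattern in BREAKER_PATTERNS.items():
        for match in pattern.finditer(prefix):
            findings.append((name, match.group(0)))
    return findings


risky = "You are a support assistant. Current time: 2024-05-01T12:30:00"
clean = "You are a support assistant. Answer briefly."

print(find_cache_breakers(risky))
print(find_cache_breakers(clean))
```

A check like this fits well in a unit test for the prompt template, so a timestamp added to the system prompt fails the build instead of quietly lowering the hit rate.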
6. Include Retries
Retries can reduce the value of caching if failed requests repeatedly create new prefixes or if long outputs are generated again.
Record:
- retry count
- timeout rate
- whether retries reuse the same cacheable prefix
- whether duplicate user submissions are deduplicated
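The retry fields above can be summarized from attempt logs. The sketch assumes a hypothetical log shape in which each attempt record carries a `request_id`, an `attempt` number, and a `prefix_hash` (a hash of the cacheable prefix that was sent). These field names are illustrative, not a standard.

```python
from collections import defaultdict


def retry_summary(attempts):
    """Count retries and how many of them reused the first attempt's prefix."""
    by_request = defaultdict(list)
    for record in attempts:
        by_request[record["request_id"]].append(record)

    retries = 0
    retries_reusing_prefix = 0
    for records in by_request.values():
        records.sort(key=lambda record: record["attempt"])
        first_hash = records[0]["prefix_hash"]
        for record in records[1:]:  # everything after the first attempt
            retries += 1
            if record["prefix_hash"] == first_hash:
                retries_reusing_prefix += 1

    return {
        "requests": len(by_request),
        "retries": retries,
        "retries_reusing_prefix": retries_reusing_prefix,
    }


attempts = [
    {"request_id": "r1", "attempt": 1, "prefix_hash": "aaa"},
    {"request_id": "r1", "attempt": 2, "prefix_hash": "aaa"},
    {"request_id": "r2", "attempt": 1, "prefix_hash": "aaa"},
    {"request_id": "r2", "attempt": 2, "prefix_hash": "bbb"},
    {"request_id": "r3", "attempt": 1, "prefix_hash": "aaa"},
]
print(retry_summary(attempts))
```

Retries that do not reuse the prefix are the ones to investigate, since they pay for input again without any cache benefit.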
For launch planning, combine this checklist with the AI cost launch checklist.
7. Log Cache Metrics
At minimum, log:
| Field | Purpose |
|---|---|
| model | confirms pricing tier |
| input tokens | total input cost basis |
| cached input tokens | confirms cache read volume |
| cache creation tokens | tracks cache write behavior |
| output tokens | explains remaining cost |
| feature name | separates workflows |
| cache hit or miss | makes hit rate measurable |
Without these fields, you cannot prove whether prompt caching is working.
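Once these fields are logged, the measured hit rate per feature is a short aggregation. The sketch below assumes hypothetical record keys (`feature`, `input_tokens`, `cached_input_tokens`), that `input_tokens` includes the cached portion, and that a request counts as a hit when it reports any cached input tokens. Check how your provider reports usage before relying on those assumptions.

```python
from collections import defaultdict


def hit_rate_by_feature(records):
    """Return request-level hit rate and cached token share per feature."""
    totals = defaultdict(
        lambda: {"requests": 0, "hits": 0, "input": 0, "cached": 0}
    )
    for record in records:
        total = totals[record["feature"]]
        total["requests"] += 1
        # Assumption: any cached input tokens means the request was a hit.
        if record["cached_input_tokens"] > 0:
            total["hits"] += 1
        total["input"] += record["input_tokens"]
        total["cached"] += record["cached_input_tokens"]

    return {
        feature: {
            "request_hit_rate": total["hits"] / total["requests"],
            "cached_token_share": (
                total["cached"] / total["input"] if total["input"] else 0.0
            ),
        }
        for feature, total in totals.items()
    }


logs = [
    {"feature": "search", "input_tokens": 1000, "cached_input_tokens": 800},
    {"feature": "search", "input_tokens": 1000, "cached_input_tokens": 0},
    {"feature": "summary", "input_tokens": 500, "cached_input_tokens": 0},
]
print(hit_rate_by_feature(logs))
```

The cached token share is usually the more useful number for budgeting, because it reflects how much of the input was actually billed at the cache read price.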
Summary
Prompt caching helps most when fixed instructions, tool schemas, or reusable documents make up a large part of the input. Before including savings in a budget, identify the cacheable prefix, estimate hit rate conservatively, compare cached and uncached costs, avoid cache breakers, and log real cache metrics after launch.