An earlier piece, “Output tokens dominate AI API cost”, covered an underweighted fact: output token unit price is 3-5× input, and in production workloads output’s share of cost commonly hits 60%+.
This piece is the execution follow-up: if output is the cost driver, what do we do about it? I’ve collected 8 methods that have a measurable effect in real workloads, ordered by modification cost, low to high.
| # | Method | Mod cost | Measured savings |
|---|---|---|---|
| 1 | Force structured output format | very low | 20-40% |
| 2 | Hard length limit in system prompt | very low | 10-25% |
| 3 | Remove explanatory pre/postfix | low | 10-20% |
| 4 | Use enum/code over natural language labels | low | 15-30% |
| 5 | Stream + early termination | medium | 5-30% |
| 6 | Flatten nested schema | medium | 15-25% |
| 7 | Post-process truncation + important-first ordering | medium | 10-20% |
| 8 | Few-shot example for format | medium | 20-40% |
Let’s go through each.
Method 1: force structured output format
What: change “respond in a paragraph” to “respond in JSON / YAML / table”.
Why it works: natural language carries a lot of transition words, politeness, and explanatory framing. Think “Sure, I can help with that. Here's the breakdown of...”. Those 30 tokens of fluff disappear in JSON output. The model skips the fluff when it sees a schema.
Example:
❌ Natural language:
Based on the description, this ticket appears to be a billing issue.
The user is asking about why they were charged twice. The priority
is high because billing problems affect customer trust. I would
suggest assigning this to the billing team for review.
✅ JSON:
{
  "category": "billing",
  "priority": "high",
  "assignee": "billing",
  "reason": "duplicate charge"
}
Output drops from ~70 tokens to ~25, a 64% reduction.
Note: the JSON schema lives in the system prompt (cache-friendly); only the filled-in fields ship in the output. Input cost barely moves.
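On the consuming side, the structured reply only pays off if it parses cleanly. Here is a minimal sketch of a validator, assuming the four field names from the example above. The allowed category and priority values and the function name `parse_ticket` are hypothetical, not something the workload above prescribes.

```python
import json

# Hypothetical allowed values; the example above only shows "billing" and "high".
ALLOWED_CATEGORIES = {"billing", "technical", "account"}
ALLOWED_PRIORITIES = {"low", "medium", "high"}
REQUIRED_FIELDS = ("category", "priority", "assignee", "reason")


def parse_ticket(raw: str) -> dict:
    """Parse the model's JSON reply and reject anything off-schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {data['priority']!r}")
    return data


reply = '{"category": "billing", "priority": "high", "assignee": "billing", "reason": "duplicate charge"}'
ticket = parse_ticket(reply)
print(ticket["category"], ticket["priority"])
```

A reply that slips back into prose fails at `json.loads`, which gives you a cheap signal for how often the format instruction is being ignored.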
Method 2: hard length limit in system prompt
What: explicitly tell the model the output length cap.
Why it works: models respond to specific numbers much better than to abstract instructions. “Be concise” barely lands; “Output 50 words or less” actually compresses.
Effective wording:
❌ “Be concise” / “Don’t be verbose”
❌ “Keep it short”
✅ “Output ≤ 50 words”
✅ “Limit response to 3 sentences max”
✅ “Output exactly 1 paragraph, no more than 80 tokens”
You can branch by scenario:
Output rules:
- For simple questions: 1-2 sentences
- For lists: max 5 items
- For code: only the requested function, no surrounding text
- Never explain your reasoning unless explicitly asked
In one customer support workload, adding this single block alone dropped average output tokens from 180 to 110.
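A length rule is only useful if you check that it holds. The sketch below measures how often logged outputs exceed a word cap. It assumes you already log raw output text; the function names and the sample strings are made up for illustration, and whitespace-split words are only a rough proxy for tokens.

```python
from typing import Sequence


def word_count(text: str) -> int:
    """Count whitespace-separated words (a rough proxy, not a token count)."""
    return len(text.split())


def over_limit_rate(outputs: Sequence[str], max_words: int = 50) -> float:
    """Fraction of outputs that exceed the word cap stated in the system prompt."""
    if not outputs:
        return 0.0
    over = sum(1 for output in outputs if word_count(output) > max_words)
    return over / len(outputs)


# Illustrative sample: one short reply, one 60-word reply.
outputs = ["Refund issued.", " ".join(["word"] * 60)]
rate = over_limit_rate(outputs, max_words=50)
print(rate)
```

If the rate stays high after adding the rule, the wording of the limit is probably too soft for that scenario.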
Method 3: remove explanatory pre/postfix
What: explicitly forbid filler such as “Sure, here's...”, “I hope this helps!”, and “Let me know if...”.
Why it works: these sentences appear in every output, costing 15-30 tokens per call. At 100K calls/day that’s 1.5-3M tokens wasted daily.
Add to system prompt:
- Do not start responses with "Sure", "Of course", "I'd be happy to", "Here's", "Based on"
- Do not end with "I hope this helps", "Let me know if you need more", "Feel free to ask"
- Start with the actual answer
- End when the answer ends
The model will actually comply.
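To verify compliance rather than assume it, you can scan logged outputs for the banned phrases. This is a rough sketch using the phrase lists from the rules above; the function names are hypothetical, and the 80-character tail window is an arbitrary choice.

```python
import re

# Banned openers and closers, taken from the system prompt rules above.
OPENER_RE = re.compile(r"^(sure|of course|i'd be happy to|here's|based on)\b")
CLOSERS = ("i hope this helps", "let me know if you need more", "feel free to ask")


def normalize(text: str) -> str:
    """Lowercase, trim, and map curly apostrophes to straight ones."""
    return text.strip().lower().replace("\u2019", "'")


def has_filler_opener(text: str) -> bool:
    """True if the output starts with one of the banned openers."""
    return OPENER_RE.match(normalize(text)) is not None


def has_filler_closer(text: str) -> bool:
    """True if a banned closer appears in the last 80 characters."""
    tail = normalize(text)[-80:]
    return any(phrase in tail for phrase in CLOSERS)


print(has_filler_opener("Sure, here's the breakdown."))
print(has_filler_closer("Refund issued. I hope this helps!"))
```

Running both checks over a day of logs gives you a violation rate you can track before and after the prompt change.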
Method 4: use enum / code over natural language labels
What: replace status, category, level enum values with short codes.
Why it works: in "priority": "high" the value is 1-2 tokens; in "priority": "very high importance" it is 4-5. In bulk classification workloads that difference multiplies across every call.
Example:
❌ Verbose:
{
  "status": "successfully completed",
  "severity": "moderate to high impact",
  "next_action": "human review required"
}
✅ Compact:
{
  "status": "ok",
  "severity": "M",
  "next": "review"
}
Trade-off: your parsing layer needs to understand the codes — but that’s a one-time write. Output token savings are continuous.
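The one-time parsing layer can be as small as a lookup table. Here is a sketch that expands the compact codes from the example back into readable labels. The `CODEBOOK` only covers the three codes shown above; a real one would list every code you allow, and the names here are hypothetical.

```python
import json

# Code-to-label tables; only the codes from the example above are filled in.
CODEBOOK = {
    "status": {"ok": "successfully completed"},
    "severity": {"M": "moderate to high impact"},
    "next": {"review": "human review required"},
}


def decode(raw: str) -> dict:
    """Expand compact codes from the model into human-readable labels."""
    record = json.loads(raw)
    decoded = {}
    for field, code in record.items():
        labels = CODEBOOK.get(field)
        if labels is None or code not in labels:
            # Fail loudly so an invented code never reaches downstream systems.
            raise ValueError(f"unknown code {code!r} for field {field!r}")
        decoded[field] = labels[code]
    return decoded


decoded = decode('{"status": "ok", "severity": "M", "next": "review"}')
print(decoded)
```

Rejecting unknown codes matters more than the expansion itself: it catches the cases where the model drifts back to free-form labels.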
Method 5: stream + early termination
What: use streaming responses (SSE), and have the client cut the connection once “enough output” has arrived.
Why it works: in many scenarios the model continues generating beyond marginal value — but the API bills for the full generation. Early termination clips that tail.
Typical applications:
- Summarization: user reads first 100 words and is done, model wants to write 500
- List generation: user needs top 3, model defaults to 10
- Code generation: model finishes the function but wants to “explain” too
Pseudocode:
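async def stream_until_enough(prompt):
    accumulated_output = ""
    async for chunk in stream_response(prompt):
        accumulated_output += chunk  # track everything received so far
        yield chunk
        if condition_met(accumulated_output):
            await close_stream()  # cut the connection so the tail is not generated
            break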
Caveat: the savings depend on the provider actually stopping generation when the stream is closed. Tokens generated before the cut are typically still billed, and some providers may bill beyond that. Check provider docs.
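To try the pattern without a provider account, here is a runnable sketch with a fake stream standing in for the API. `fake_stream`, `enough`, and `collect_until_enough` are hypothetical names; a real SDK will have its own way to close a response, which here is approximated by the async generator's `aclose()`.

```python
import asyncio
from typing import AsyncIterator, Iterable


async def fake_stream(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Stand-in for a provider's streaming response (not a real API)."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk


def enough(text: str, max_items: int = 3) -> bool:
    """Stop condition: the list already has max_items complete lines."""
    return text.count("\n") >= max_items


async def collect_until_enough(stream) -> str:
    """Read chunks until the stop condition holds, then close the stream."""
    accumulated = ""
    try:
        async for chunk in stream:
            accumulated += chunk
            if enough(accumulated):
                break
    finally:
        await stream.aclose()  # with a real SDK, call its own close/cancel method here
    return accumulated


# The "model" wants to write 10 items; the client only needs the top 3.
chunks = [f"{i}. item {i}\n" for i in range(1, 11)]
result = asyncio.run(collect_until_enough(fake_stream(chunks)))
print(result)
```

The stop condition is the part worth tuning per scenario: a line count for lists, a word count for summaries, or a closing brace for code.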
Method 6: flatten nested schema
What: change deeply nested JSON to flat structure.
Why it works: nested JSON like { "user": { "profile": { "name": ... spends tokens on every layer of brackets, commas, and indentation. The same information in a flat structure saves 30%+ of the structural tokens.
Example:
❌ Deeply nested (45 tokens):
{
  "user": {
    "profile": {
      "name": "Alice",
      "tier": "pro"
    },
    "stats": {
      "calls_today": 42,
      "tokens_today": 8400
    }
  }
}
✅ Flat (28 tokens):
{
  "user_name": "Alice",
  "user_tier": "pro",
  "calls_today": 42,
  "tokens_today": 8400
}
Trade-off: flat structure is less semantically organized, but in output context your consumer typically parses to objects immediately — flatness doesn’t hurt downstream.
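If downstream code still expects the nested shape, you can rebuild it after parsing at zero token cost. A minimal sketch, assuming the flat keys from the example above; the `FLAT_TO_PATH` mapping and the `to_nested` helper are hypothetical.

```python
# Maps each flat key from the model's output to its path in the nested object.
FLAT_TO_PATH = {
    "user_name": ("user", "profile", "name"),
    "user_tier": ("user", "profile", "tier"),
    "calls_today": ("user", "stats", "calls_today"),
    "tokens_today": ("user", "stats", "tokens_today"),
}


def to_nested(flat: dict) -> dict:
    """Rebuild the nested structure from a flat record."""
    nested: dict = {}
    for key, value in flat.items():
        path = FLAT_TO_PATH[key]  # KeyError flags a field the mapping does not know
        node = nested
        for part in path[:-1]:
            node = node.setdefault(part, {})  # create intermediate levels on demand
        node[path[-1]] = value
    return nested


flat = {"user_name": "Alice", "user_tier": "pro", "calls_today": 42, "tokens_today": 8400}
nested = to_nested(flat)
print(nested)
```

The nesting then lives in your code, where it is free, instead of in the model's output, where every bracket is billed.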
Method 7: post-process truncation + important-first ordering
What: have the model put the most important content first, so the tail can be truncated in post-processing.
Why it works: sometimes long output is justified, but you really only need the first 80%. If important content lives up front, truncation costs less.
System prompt pattern:
Output structure:
1. Direct answer (1-2 sentences) — REQUIRED
2. Key reasoning (2-3 bullets) — OPTIONAL
3. Detailed explanation — OMIT unless asked
4. Examples — OMIT unless asked
Order matters: write 1, then 2 only if helpful, then 3, then 4.
In post-processing you can truncate at section 1 or 2 by scenario. The principle: make “verbosity” something you can trim.
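Trimming by section is a few lines once the output has predictable boundaries. The sketch below assumes the model separates sections with a blank line, which the prompt pattern above does not guarantee, so you may need to ask for that explicitly. `keep_sections` and the sample text are hypothetical.

```python
def keep_sections(text: str, keep: int) -> str:
    """Keep only the first `keep` blank-line-separated sections of an output."""
    sections = [section.strip() for section in text.split("\n\n") if section.strip()]
    return "\n\n".join(sections[:keep])


# Illustrative output following the 1-2-3 structure from the prompt pattern.
output = (
    "Refund the duplicate charge.\n\n"
    "- Charged twice on the same invoice\n"
    "- Both charges settled\n\n"
    "Detailed explanation of the billing pipeline..."
)
short = keep_sections(output, keep=1)
medium = keep_sections(output, keep=2)
print(short)
```

Truncating after the fact trims what the user sees. To trim the bill as well, the same section boundary can serve as the stop condition for the streaming cut in Method 5.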
Method 8: few-shot example for format
What: include one or two ideal output examples directly in the prompt.
Why it works: models follow example-based instructions more reliably than rule-based ones. Give an 80-token example, the model will replicate at ~80 tokens. Give “be concise,” it improvises.
Best for:
- Workloads needing stable output length (review generation, tagging, summaries)
- Workloads with fixed templates
Cache-friendly: examples in the system prompt benefit from prompt caching, so input cost barely moves.
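One way to keep the examples cache-friendly is to build the system prompt deterministically, so the text is byte-identical on every call. A small sketch, with a hypothetical `build_system_prompt` helper and a made-up example pair:

```python
from typing import Sequence, Tuple


def build_system_prompt(task: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Assemble a system prompt with fixed few-shot examples, in a stable order."""
    parts = [task.strip(), "", "Follow the format and length of these examples exactly:"]
    for index, (user_input, ideal_output) in enumerate(examples, start=1):
        parts += [
            "",
            f"Example {index} input: {user_input}",
            f"Example {index} output: {ideal_output}",
        ]
    return "\n".join(parts)


# Made-up example pair for a ticket classification workload.
examples = [("Charged twice this month", '{"category": "billing", "priority": "high"}')]
prompt = build_system_prompt("Classify the support ticket.", examples)
print(prompt)
```

Keep anything that varies per request (user input, timestamps) out of this string, otherwise the cached prefix stops matching.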
Implementation priority
If you can only do 3, ROI order:
- Method 1 (structured output) — one-time setup, perpetual benefit, applies almost everywhere
- Method 3 (kill fluff prefix/postfix) — add 5 lines of system prompt, immediate 10-20% drop
- Method 2 (hard length limit) — reinforces 1; prevents long outputs from running away
Do these three, measure savings for a week, then decide whether 4-8 are worth pursuing.
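For the measurement step, comparing average output tokens before and after the change is usually enough. A minimal sketch, assuming you log an output token count per call; the sample numbers are illustrative, chosen only to reproduce the 180 and 110 averages quoted in Method 2.

```python
from statistics import mean
from typing import Sequence


def reduction_pct(before: Sequence[int], after: Sequence[int]) -> float:
    """Percent drop in average output tokens per call, rounded to one decimal."""
    avg_before = mean(before)
    avg_after = mean(after)
    return round(100 * (avg_before - avg_after) / avg_before, 1)


# Illustrative per-call output token counts (averages: 180 before, 110 after).
before = [170, 180, 190]
after = [100, 110, 120]
saving = reduction_pct(before, after)
print(saving)  # 38.9
```

Compare like with like: same traffic mix, same model, and a long enough window that one unusual day does not dominate the average.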
Don’t do these “fake compressions”
Final list of moves that look like compression but aren’t useful (or backfire):
❌ Asking the model to summarize its own output: extra call, extra cost, and quality may be worse
❌ Forcing output too short: the model can’t answer, the user re-asks, total cost goes up
❌ Stripping all explanation and keeping only conclusions: in scenarios needing accountability (medical, financial, legal) this hurts trust
❌ Using vague language like “use your judgment”: the model improvises, output variance widens, the long tail gets longer
The goal of output compression is to preserve information density, cut redundant form — not “shorter is better.”
Further reading:
- Why compressing output matters first: output tokens dominate AI API cost
- Pair with input-side optimization: prompt caching budget checklist
- Full cost-reduction lever set: 7 actionable ways to reduce AI API cost
- Don’t know where your spend is going: 7 signals your AI API cost is running away