# 2026 AI API Pricing Guide: 30+ Models Compared

If you only read one line: as of June 2026, ranked by combined input + output + cache value, the five most economical APIs are DeepSeek V4 Pro, Gemini 2.5 Flash Lite, Claude Haiku 4.5, GPT-5.4 mini, and Mistral Small 4. The three most expensive but most capable are GPT-5.5, Claude Opus 4.7, and qwen3.7-max.

This is a pillar page designed to let you finish a model selection decision in 10 minutes: start with the full 30+ model price table, then work through the four billing models, the long-context cost curve, six real-world business scenarios with monthly bill estimates, a hidden-cost checklist, five optimization plays, and five high-frequency FAQs.

Pricing data updated 2026-05-28 (sourced from each provider’s official pricing page — see sourceUrl per model). Next refresh: 2026-07-15.

## AI API Price Table (30+ Models · June 2026)

Unit: USD / 1M tokens (Chinese-region models priced in CNY where noted). Cache Read is the discounted input rate when a cached prefix matches. The “Category” column splits reasoning models from plain text models — they bill differently, see the next section.

### OpenAI (5 models)

| Model | Input | Output | Cache Read | Category | Notes |
|---|---|---|---|---|---|
| GPT-5.5 | $5.00 | $30.00 | $0.50 | reasoning | Standard rate < 270K context; batch -50% |
| GPT-5.4 | $2.50 | $15.00 | $0.25 | reasoning | The price/perf sweet spot |
| GPT-5.4 mini | $0.75 | $4.50 | $0.075 | reasoning | Reasoning-capable small model |
| GPT-4.1 | $2.00 | $8.00 | $0.50 | text | Mature workhorse |
| GPT-4.1 mini | $0.40 | $1.60 | $0.10 | text | Best for long-tail tasks |

Quick read: above 270K context, GPT-5.5 pricing doubles. Batch mode cuts the bill in half but adds up to 24h of latency.

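### Anthropic (3 models)

| Model | Input | Output | Cache Write | Cache Read | Category |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $6.25 | $0.50 | reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 | reasoning |
| Claude Haiku 4.5 | $1.00 | $5.00 | $1.25 | $0.10 | text |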

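Quick read: Anthropic is the only family here that bills cache writes separately. The first write costs 1.25× uncached input and every subsequent hit costs 0.10×, so the 0.25× write premium is already recovered by the first hit (which saves 0.90×). This guide uses 4 reuses as a conservative break-even rule of thumb.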

### Google (4 models)

| Model | Input | Output | Cache Read | Category |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.125 | reasoning |
| Gemini 2.5 Flash | $0.30 | $2.50 | $0.03 | reasoning |
| Gemini 2.5 Flash Lite | $0.10 | $0.40 | $0.01 | reasoning |
| Gemini 3.1 Flash Lite | $0.25 | $1.50 | $0.025 | reasoning |

Quick read: Gemini 2.5 Flash Lite is the 2026 price floor for Western APIs, with input at $0.10/M, 4-10× cheaper than peers (in this table it is matched on input only by Mistral's smallest models). Trade-off: capability is one tier below Pro.

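### DeepSeek / MiniMax / Zhipu / Qwen / Moonshot (China, 15 models)

| Provider | Model | Input | Output | Cache Read | Currency |
|---|---|---|---|---|---|
| DeepSeek | V4 Pro | $0.14 | $0.28 | $0.0028 | USD |
| MiniMax | M2.7 | $2.10 | $8.40 | $0.42 | USD |
| MiniMax | M2.7-highspeed | $4.20 | $16.80 | $0.42 | USD |
| MiniMax | M2.5 | $2.10 | $8.40 | $0.21 | USD |
| Zhipu | GLM-5.1 | $6.00 | $24.00 | $1.30 | USD |
| Zhipu | GLM-5-Turbo | $5.00 | $22.00 | $1.20 | USD |
| Zhipu | GLM-5 | $4.00 | $18.00 | $1.00 | USD |
| Qwen | qwen3.7-max | $12.00 | $36.00 | | USD |
| Qwen | qwen3-max | $2.50 | $10.00 | | USD |
| Qwen | qwen-max | $2.40 | $9.60 | | USD |
| Qwen | qwen-plus | $0.80 | $2.00 | | USD |
| Qwen | qwen-turbo | $0.30 | $0.60 | | USD |
| Moonshot | kimi-k2.6 | $6.50 | $27.00 | $1.10 | USD |
| Moonshot | kimi-k2.5 | $4.00 | $21.00 | $0.70 | USD |
| Moonshot | moonshot-v1-32k | $5.00 | $20.00 | | USD |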

Quick read: DeepSeek V4 Pro is the price floor among the China-region models ($0.14 in / $0.28 out), has the lowest cache-read rate in the whole table ($0.0028), and is especially friendly to Chinese-language workloads. But qwen3.7-max and GLM-5.1 are actually pricier than Claude Sonnet, so don’t assume “Chinese == cheaper”; check the specific tier.

### Mistral (10 models)

| Model | Input | Output | Category |
|---|---|---|---|
| Mistral Large 3 | $0.50 | $1.50 | reasoning |
| Mistral Medium 3.5 | $1.50 | $7.50 | reasoning |
| Mistral Small 4 | $0.10 | $0.30 | text |
| Magistral Medium | $2.00 | $5.00 | reasoning |
| Magistral Small | $0.50 | $1.50 | reasoning |
| Devstral 2 | $0.40 | $2.00 | reasoning |
| Devstral Small 2 | $0.10 | $0.30 | text |
| Codestral | $0.30 | $0.90 | text |
| Ministral 3 3B | $0.10 | $0.10 | text |
| Ministral 3 8B | $0.15 | $0.15 | text |

Quick read: Ministral 3B/8B is the cheapest chat option under EU data-residency requirements; Codestral is the European pick for codegen. Mistral has no cache-read tier, so long-context tasks favor Claude/Gemini.

Full fields (sourceUrl, updatedAt, pricingFormula) live in the on-site Pricing Hub under “Model Library”.


## The Four Billing Models You Actually Need to Know

Different providers compute the bill in different ways. Memorize these four and every model on the planet plugs into the same formulas.

### Mode A: Pure tokens (input + output)

**Used by**: GPT-4.1, qwen-turbo, all Mistral models.

```text
total = input_tokens × P_in / 1M  +  output_tokens × P_out / 1M
```
Example — qwen-plus on 1M input + 0.3M output = $0.80 + $0.60 = **$1.40**.
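
As a rough illustration, the Mode A formula can be written as a small Python helper. This is only a sketch: the function name `pure_token_cost` is made up for this example and is not part of any provider SDK, and prices are assumed to be quoted per 1M tokens as in the table above.

```python
def pure_token_cost(input_tokens: int, output_tokens: int,
                    p_in: float, p_out: float) -> float:
    """Mode A bill in USD; p_in and p_out are prices per 1M tokens."""
    return input_tokens * p_in / 1_000_000 + output_tokens * p_out / 1_000_000


# qwen-plus from the price table: $0.80 in / $2.00 out
cost = pure_token_cost(1_000_000, 300_000, p_in=0.80, p_out=2.00)
print(f"${cost:.2f}")
```

Swapping in another Mode A model only requires changing the two unit prices.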

### Mode B: Tokens + cache read (input-side discount)

**Used by**: GPT-5 family, all Gemini models.

```text
total = input_miss × P_in  +  input_hit × P_cache  +  output × P_out
```
Example — GPT-5.4 on 1M input (40% cache hit) + 0.3M output =
0.6M × $2.50 + 0.4M × $0.25 + 0.3M × $15.00 = **$6.10**.

vs. uncached: $7.00. **40% cache hit saves 13%**.
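
The same arithmetic as a hedged Python sketch, assuming the cache hit rate is expressed as a fraction of input tokens and all prices are per 1M tokens. `cached_cost` is an illustrative name, not a provider API.

```python
def cached_cost(input_tokens, output_tokens, hit_rate, p_in, p_cache, p_out):
    """Mode B bill in USD; prices are per 1M tokens, hit_rate is between 0 and 1."""
    hit = input_tokens * hit_rate    # tokens billed at the cache-read rate
    miss = input_tokens - hit        # tokens billed at the full input rate
    return (miss * p_in + hit * p_cache + output_tokens * p_out) / 1_000_000


# GPT-5.4 from the price table: $2.50 in / $0.25 cache read / $15.00 out
with_cache = cached_cost(1_000_000, 300_000, 0.40, 2.50, 0.25, 15.00)
no_cache = cached_cost(1_000_000, 300_000, 0.00, 2.50, 0.25, 15.00)
saving = 1 - with_cache / no_cache
print(f"${with_cache:.2f} vs ${no_cache:.2f}, saving {saving:.0%}")
```

Because output is billed at full rate regardless of caching, the saving on the total bill is always smaller than the saving on input alone.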

### Mode C: Tokens + cache write + cache read (Anthropic)

**Used by**: Claude Haiku / Sonnet / Opus.

```text
total = input_miss × P_in
      + cache_write_count × tokens × P_cache_write
      + cache_read_count  × tokens × P_cache_read
      + output × P_out
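```

The Anthropic break-even rule:

> One cache write = 1.25× uncached input; every later hit = 0.10× uncached. On these multipliers alone the 0.25× write premium is repaid by the first hit; **≥ 4 reuses** is the conservative threshold used in this guide.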

Concrete on Sonnet 4.6: writing a 1M-token system prompt once costs $3.75; each cache hit costs $0.30. If you call this prompt ≥ 4 times a day (almost every production app does), caching is a free win. Detailed ROI math: [prompt-caching-roi-breakeven](/en/posts/prompt-caching-roi-breakeven/).
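
To check the rule against your own reuse count, here is a minimal Python sketch. It assumes the first send writes the cache and every later send is a hit, which means the cache never expires between calls; that is a simplification, not a statement about Anthropic's actual cache lifetime. The function name is hypothetical.

```python
def anthropic_prefix_cost(prefix_tokens, uses, p_in, p_write, p_read, cached=True):
    """USD cost of sending the same prefix `uses` times (prices per 1M tokens).

    Assumption: send 1 is a cache write, sends 2..uses are cache hits.
    """
    if uses == 0:
        return 0.0
    if not cached:
        return uses * prefix_tokens * p_in / 1_000_000
    return (prefix_tokens * p_write + (uses - 1) * prefix_tokens * p_read) / 1_000_000


# Sonnet 4.6 from the price table: $3.00 in / $3.75 cache write / $0.30 cache read.
# uses=5 means one write followed by four reuses.
for uses in (1, 2, 5):
    plain = anthropic_prefix_cost(1_000_000, uses, 3.00, 3.75, 0.30, cached=False)
    cache = anthropic_prefix_cost(1_000_000, uses, 3.00, 3.75, 0.30, cached=True)
    print(uses, f"uncached ${plain:.2f}", f"cached ${cache:.2f}")
```

A prefix that is written and never reused is the only case in this model where caching costs more than not caching.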

### Mode D: Reasoning tokens billed separately

**Used by**: DeepSeek R-series, Magistral, qwen3.7-max.

Reasoning models generate a **chain-of-thought** before producing the visible answer. Those internal tokens **bill at output rate but are invisible in the response** — a common cause of "my bill came back 6× what I expected".

```text
real_output = visible_output + reasoning_tokens (hidden)
bill = input × P_in + real_output × P_out
```

Measured: a single hard problem on Magistral Medium produced 800 visible output tokens + 4500 reasoning tokens. Bill = 5300 × output rate, **6.6× the visible token count**.
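
A small Python sketch of the Mode D formula makes the multiplier explicit. The input size of 1,000 tokens is an assumption added for illustration (the measurement above does not state it), and `reasoning_bill` is a made-up helper name.

```python
def reasoning_bill(input_tokens, visible_output, reasoning_tokens, p_in, p_out):
    """Mode D bill in USD (prices per 1M tokens) plus the hidden-token multiplier."""
    real_output = visible_output + reasoning_tokens   # what is actually billed
    bill = (input_tokens * p_in + real_output * p_out) / 1_000_000
    multiplier = real_output / visible_output         # billed vs. visible output
    return bill, real_output, multiplier


# Magistral Medium from the price table: $2.00 in / $5.00 out.
# The 1,000 input tokens are an assumed value.
bill, real_output, multiplier = reasoning_bill(1_000, 800, 4_500, 2.00, 5.00)
print(real_output, f"{multiplier:.1f}x", f"${bill:.4f}")
```

Tracking `multiplier` per request is one way to spot workloads where hidden reasoning tokens are driving the bill.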

---

## The Long-Context Cost Curve

A 2026 shift to watch: **multiple providers now surcharge ultra-long contexts**.

| Model | < 270K context | ≥ 270K context |
|---|---|---|
| GPT-5.5 | $5 / $30 | $10 / $60 (×2) |
| GPT-5.4 | $2.50 / $15 | $5 / $30 (×2) |
| Claude Opus 4.7 | $5 / $25 | $7.5 / $37.5 (×1.5) |
| Gemini 2.5 Pro | $1.25 / $10 | $2.50 / $15 (×1.5-2) |

**Trigger metric**: total tokens in a single request (history + current prompt + output) crossing 270K.
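
One way to model the surcharge in Python is sketched below. It assumes the higher rate applies to the whole request once the total crosses the threshold, which is how the table presents it; providers may bill differently, so treat this as an estimate. The `RATES` dictionary and `request_cost` are illustrative names, with multipliers derived from the table above.

```python
LONG_CONTEXT_THRESHOLD = 270_000  # total tokens per request, per the trigger metric

# model -> (input price, output price, input multiplier, output multiplier)
# Prices are USD per 1M tokens; multipliers come from the table above.
RATES = {
    "GPT-5.5": (5.00, 30.00, 2.0, 2.0),
    "Claude Opus 4.7": (5.00, 25.00, 1.5, 1.5),
    "Gemini 2.5 Pro": (1.25, 10.00, 2.0, 1.5),
}


def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request, including the long-context surcharge."""
    p_in, p_out, m_in, m_out = RATES[model]
    if input_tokens + output_tokens >= LONG_CONTEXT_THRESHOLD:
        # Assumption: the surcharge applies to every token in the request.
        p_in, p_out = p_in * m_in, p_out * m_out
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000


print(request_cost("GPT-5.5", 260_000, 5_000))  # below the threshold
print(request_cost("GPT-5.5", 268_000, 5_000))  # crosses the threshold
```

Note that a request can cross the threshold through its output alone, since the trigger counts history, prompt, and output together.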

**Field-tested moves**:
1. **Rolling summarization** — every 200K cumulative, summarize the conversation back down to 50K.
2. **Switch to Flash Lite** for retrieval-only / extraction-only tasks: Gemini 2.5 Flash Lite at $0.10 input handles long context cheaply, at the cost of one capability tier.

Full curves and switch thresholds: [rag-long-context-api-cost](/en/posts/rag-long-context-api-cost/). For a developer-side view of the same price landscape, PromptNet has a parallel writeup: [2026 AI API pricing comparison](https://www.promptnet.cn/2026/05/23/ai-api-pricing-comparison-2026/).

---

## Multi-Modal Costs

The price table is text-only. Other modalities are billed separately:

- **Image input**: GPT-5 series $0.005-0.015 / image (resolution-tiered); Claude $0.024 / image (flat); Gemini $0.0025 / image (cheapest).
- **Image generation**: DALL-E 3 $0.04-0.08 / image; Imagen 3 $0.03; Stable Diffusion 3 $0.02.
- **Audio transcription**: Whisper $0.006 / minute; Deepgram Nova-3 $0.0043 / minute.
- **TTS**: OpenAI tts-1 $0.015 / 1K chars; ElevenLabs $0.30 / 1K chars (premium quality but 20× the price).
- **Video understanding**: Gemini 2.5 Pro $1.40 / minute of video.
- **Video generation**: Sora $0.50-1.50 / second; Veo 2 $0.35 / second.

Full per-modality calculators: [image](/en/image/), [audio](/en/audio/), [video](/en/video/).

---

## Six Real-World Scenarios with Monthly Bill Estimates

All baselines below use **Claude Sonnet 4.6 ($3 in / $15 out)**. To swap models, just plug new unit prices into each formula.
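
The scenario formulas share one shape, which can be sketched as a Python helper. This is an illustrative sketch: it assumes a 30-day month and no caching, and `monthly_cost` and `PRICES` are names invented here. The sample volumes are the customer-support profile from Scenario 1 (6M input + 3M output tokens per day).

```python
def monthly_cost(daily_input_m, daily_output_m, p_in, p_out, days=30):
    """Monthly bill in USD; daily volumes are in millions of tokens."""
    return days * (daily_input_m * p_in + daily_output_m * p_out)


# (input, output) prices in USD per 1M tokens, from the price table
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek V4 Pro": (0.14, 0.28),
}

for model, (p_in, p_out) in PRICES.items():
    print(model, f"${monthly_cost(6, 3, p_in, p_out):,.0f}")
```

Swapping models is then a one-line change to the price tuple, which is what the scenarios do by hand.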

### Scenario 1: Customer-support chatbot

**Profile**: 10K DAU, 3 turns / user / day, 200 in + 100 out tokens per turn.

```text
Daily tokens   = 10000 × 3 × (200 + 100) = 9M
Monthly tokens = 270M  (input 180M / output 90M)
Monthly cost   = 180 × $3 + 90 × $15 = $1890
```

With a 30% prompt-cache hit rate on input (54M of the 180M input tokens billed at $0.30): **$1890 → ≈ $1744 (-8%)**. Switch to Haiku 4.5 ($1/$5): **$630**. Switch to DeepSeek V4 Pro: **$50**.

### Scenario 2: RAG knowledge-base Q&A

**Profile**: each query retrieves 8K context + 500-token question and produces 600-token answer; 5K queries/day.

```text
Daily input  = 5000 × 8500 = 42.5M  (40% cache hit)
Daily output = 5000 × 600  = 3M
Daily cost   = 25.5 × $3 + 17 × $0.30 + 3 × $15 = $126
Monthly cost ≈ $3780
```

Switch to Gemini 2.5 Flash ($0.30/$2.50, cache read $0.03) + 60% cache: **$3780 → ≈ $400 (-89%)**. Detailed math: [estimate-rag-chatbot-cost](/en/posts/estimate-rag-chatbot-cost/).

### Scenario 3: AI Agent (multi-turn + tool use)

**Profile**: each task = 8 LLM calls, avg 5K input (incl. history) + 800 output (incl. reasoning); 1000 tasks/day.

```text
Daily input  = 1000 × 8 × 5K  = 40M
Daily output = 1000 × 8 × 800 = 6.4M
Daily cost   = 40 × $3 + 6.4 × $15 = $216
Monthly cost ≈ $6480
```

Output is only 14% of the tokens here but 44% of the cost (typically 30-50% in agent workloads), so **output compression is the highest-leverage optimization**; see [ai-output-token-compression-methods](/en/posts/ai-output-token-compression-methods/). If you're running agents through Claude Code, prompt-cache discipline and rolling-context tricks are the other lever; PromptNet has a focused [Claude API cost-control & budget guide](https://www.promptnet.cn/2026/06/02/claude-api-cost-control-budget-guide/) on that.

### Scenario 4: Codegen (IDE / Copilot-style)

**Profile**: each completion = 3K context + 200 output tokens; 500 active devs × 80 completions/day.

```text
Daily input  = 500 × 80 × 3K   = 120M  (50% cache hit)
Daily output = 500 × 80 × 0.2K = 8M
Daily cost   = 60 × $3 + 60 × $0.30 + 8 × $15 = $318
Monthly cost ≈ $9540
```

Codegen is dominated by **context**: it makes up about 94% of the tokens and about 62% of the cost. Switch to Codestral ($0.30/$0.90, no cache tier): **≈ $1300/month**, but capability drops a tier, so the product team must weigh the trade-off.

### Scenario 5: Content production (long output)

**Profile**: 200 articles × 1500 words/day (~2.2K output tokens) + 1K input prompt.

```text
Daily input  = 200 × 1K   = 0.2M
Daily output = 200 × 2.2K = 0.44M
Daily cost   = 0.2 × $3 + 0.44 × $15 = $7.2
Monthly cost ≈ $216
```

Output drives about 90% of cost in long-form generation. Switch to Gemini 2.5 Flash ($0.30/$2.50): **≈ $35/month**.

### Scenario 6: Information extraction (structured output + batch)

**Profile**: 500K user comments/day, 500 input + 100 output tokens each (structured JSON).

```text
Daily input  = 500K × 500 = 250M
Daily output = 500K × 100 = 50M
Standard     = 250 × $3 + 50 × $15 = $1500/day
Batch (-50%) = $750/day = $22500/month
```

Run extraction on **batch API + Mistral Small 4 ($0.10/$0.30)** and the bill drops to about $600/month, **roughly 99% cheaper than Sonnet at standard rate** ($45,000/month).
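
The batch comparison can be recomputed from the unit prices with a short Python sketch. It assumes the 50% batch discount applies equally to input and output and that a month has 30 days; `batch_monthly_cost` is a hypothetical helper.

```python
def batch_monthly_cost(daily_input_m, daily_output_m, p_in, p_out,
                       batch_discount=0.5, days=30):
    """Return (standard, batch) monthly bills in USD; volumes in millions of tokens."""
    standard = days * (daily_input_m * p_in + daily_output_m * p_out)
    return standard, standard * (1 - batch_discount)


# Scenario 6 volumes: 250M input + 50M output tokens per day
sonnet_std, sonnet_batch = batch_monthly_cost(250, 50, 3.00, 15.00)
small_std, small_batch = batch_monthly_cost(250, 50, 0.10, 0.30)

print(f"Sonnet 4.6:      standard ${sonnet_std:,.0f}, batch ${sonnet_batch:,.0f}")
print(f"Mistral Small 4: standard ${small_std:,.0f}, batch ${small_batch:,.0f}")
print(f"Saving vs. Sonnet standard: {1 - small_batch / sonnet_std:.1%}")
```

Running both models through the same function keeps the comparison honest: the only inputs that change are the two unit prices.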

---

## Hidden Costs (Why Bills Routinely Land 30%+ Over Forecast)

The price table is just the starting point. Here are the six items that quietly eat your budget:

| Hidden item | Typical surcharge | When triggered |
|---|---|---|
| Rate-limit retries | +5-15% | Tier 1/2 accounts, traffic spikes |
| Data residency (EU / China-mainland) | +10% | Compliance |
| Observability (Helicone, Langfuse, etc.) | $20-200 / month | Production-grade |
| Failed-request retries | +3-8% | Network jitter, timeouts |
| Invisible reasoning tokens | ×2-7 (workload-dependent) | Reasoning models |
| Long-context surcharge | ×1.5-2 | > 270K context |

How to actually trace these to specific line items: [check-ai-api-bill-against-pricing](/en/posts/check-ai-api-bill-against-pricing/) and [ai-api-cost-runaway-7-signals](/en/posts/ai-api-cost-runaway-7-signals/).

---

## Five Optimization Plays

### 1. Model routing — switch tiers by context length

```text
if context_tokens < 4K:    Claude Haiku 4.5
elif context_tokens < 32K: Claude Sonnet 4.6
elif context_tokens < 200K: Claude Sonnet 4.6 + cache
else:                       Gemini 2.5 Pro  # avoids Anthropic >270K surcharge
```
Measured savings: 35-60%. Full routing playbook: [model-selection-cost-balancing-guide](/en/posts/model-selection-cost-balancing-guide/).
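
The routing rule above translates directly into Python. This is a sketch only: `pick_model` is an invented name, the thresholds are the ones from the pseudocode, and a real router would also weigh task difficulty and latency.

```python
def pick_model(context_tokens: int) -> tuple[str, bool]:
    """Return (model name, use_prompt_cache) for a request of the given size."""
    if context_tokens < 4_000:
        return "Claude Haiku 4.5", False
    if context_tokens < 32_000:
        return "Claude Sonnet 4.6", False
    if context_tokens < 200_000:
        return "Claude Sonnet 4.6", True   # large prefix: turn on prompt caching
    return "Gemini 2.5 Pro", False         # very long context goes to Gemini


for size in (1_500, 20_000, 150_000, 400_000):
    print(size, pick_model(size))
```

Returning the cache flag alongside the model name keeps the caching decision in the same place as the tier decision.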

### 2. Prompt cache — only enable if reuse ≥ 4

This guide's break-even threshold for Anthropic is **4 reuses**, which is conservative: on list prices the 1.25× write is already repaid by the first 0.10× hit. If your system prompt fires ≥ 4 times/day (almost every production app), caching is a free win. OpenAI and Gemini charge no cache-write fee, so always-on is fine. ROI math: [prompt-caching-roi-breakeven](/en/posts/prompt-caching-roi-breakeven/).

### 3. Output compression — output is 5-7× more expensive than input

Highest-leverage optimization. Three plays:
- Pin "answer ≤ 200 words" in the system prompt.
- Use `response_format: json_schema` to force structure (saves 30-50% output).
- Set `reasoning_effort: low` on reasoning models — cuts hidden reasoning tokens by ~70%.

Before/after benchmarks: [ai-output-token-compression-methods](/en/posts/ai-output-token-compression-methods/).

### 4. Batch API — half-price if you can wait 24h

OpenAI, Anthropic, and Mistral all offer 50%-off batch endpoints. Fits: content production, information extraction, data labeling. Doesn't fit: live chat, agents.

### 5. Chinese-region downgrade route

For Chinese-language tasks and latency-tolerant non-core paths, **DeepSeek V4 Pro / Qwen Plus / Doubao Seed** are 1/10-the-price substitutes — provided you accept the API stability and compliance trade-offs. Deeper comparison: [deepseek-api-cost-coding-chat-batch](/en/posts/deepseek-api-cost-coding-chat-batch/).

---

## Five High-Frequency FAQs

### Q1: Which is cheaper, ChatGPT or Claude API?

It depends on the tier. In 2026 **Claude is cheaper only at the high end**:
- High end: Claude Opus 4.7 ($5/$25) vs. GPT-5.5 ($5/$30): same input, output 17% cheaper on Claude.
- Mid: Claude Sonnet 4.6 ($3/$15) vs. GPT-5.4 ($2.50/$15): same output, input 17% cheaper on GPT.
- Low: Claude Haiku 4.5 ($1/$5) vs. GPT-4.1 mini ($0.40/$1.60): GPT is 60-70% cheaper.

But add the **long-context surcharge** (GPT-5.5 ×2 vs. Opus ×1.5) and Claude becomes the cheaper choice for long-running sessions. Detailed comparison: [compare-claude-gpt-gemini-api-cost](/en/posts/compare-claude-gpt-gemini-api-cost/).

### Q2: How much more expensive is GPT-5.5 vs. Claude Opus 4.7?

Inputs match ($5 each), but output costs 20% more on GPT-5.5 ($30 vs. $25). **Crucially**:
- Long context (>270K): GPT-5.5 ×2 ($10/$60), Opus ×1.5 ($7.50/$37.50). Opus comes out 25% cheaper on input and 37.5% cheaper on output, roughly 30% on a typical mix.
- Cached: Opus cache read $0.50, GPT-5.5 cache read $0.50: a flat tie.

Bottom line: **GPT-5.5 is up to 20% pricier on short workloads (output only); Opus is roughly 30% cheaper on long-context workloads.**
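
How much cheaper Opus is depends on the input/output mix, which a short Python sketch can show. The helper names are invented for this example, volumes are in millions of tokens, and the long-context multipliers (×2 and ×1.5) are assumed to apply to the whole request.

```python
def blended_cost(p_in, p_out, input_m, output_m, multiplier=1.0):
    """USD cost for the given volumes (millions of tokens) at per-1M prices."""
    return multiplier * (input_m * p_in + output_m * p_out)


def opus_discount(input_m, output_m, long_context=False):
    """Fraction by which Claude Opus 4.7 undercuts GPT-5.5 for this token mix."""
    gpt = blended_cost(5.00, 30.00, input_m, output_m, 2.0 if long_context else 1.0)
    opus = blended_cost(5.00, 25.00, input_m, output_m, 1.5 if long_context else 1.0)
    return 1 - opus / gpt


# A 10:1 input-to-output mix, short vs. long context
print(f"short: {opus_discount(10, 1):.1%}")
print(f"long:  {opus_discount(10, 1, long_context=True):.1%}")
```

Input-heavy workloads sit near the low end of the range and output-heavy ones near the high end, so it is worth plugging in your own ratio.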

### Q3: How much can caching actually save?

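Depends on hit rate. The figures below are savings on input cost only; the saving on the total bill is smaller because output is billed at full rate: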

| Model | 30% hit | 60% hit | 90% hit |
|---|---|---|---|
| Claude Sonnet 4.6 | -27% | -54% | -81% |
| GPT-5.4 | -27% | -54% | -81% |
| Gemini 2.5 Pro | -27% | -54% | -81% |

For Anthropic, also account for the **cache-write cost** (1.25× uncached for the first write): a prefix that is written but never reused costs more with caching than without, and this guide treats fewer than 4 reuses as not worth the setup.
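
The percentages in the table equal hit rate × 0.9, because all three models price cache reads at 0.10× the input rate. A minimal Python sketch, with illustrative function names, shows both the input-side saving and the smaller saving on the total bill:

```python
def input_saving(hit_rate, cache_read_ratio=0.10):
    """Fractional saving on input cost when hit_rate of input tokens are cache reads."""
    return hit_rate * (1 - cache_read_ratio)


def total_saving(hit_rate, input_m, output_m, p_in, p_out, cache_read_ratio=0.10):
    """Fractional saving on the whole bill; output is unaffected by caching."""
    input_cost = input_m * p_in
    total = input_cost + output_m * p_out
    return input_saving(hit_rate, cache_read_ratio) * input_cost / total


for rate in (0.30, 0.60, 0.90):
    print(f"{rate:.0%} hit -> input saving {input_saving(rate):.0%}")

# GPT-5.4 with 1M input + 0.3M output and a 40% hit rate
print(f"total saving: {total_saving(0.40, 1, 0.3, 2.50, 15.00):.0%}")
```

The gap between the two numbers grows as output takes a larger share of the bill.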

### Q4: How do Chinese-region models (DeepSeek/Qwen/Doubao) compare to Western ones?

On price: DeepSeek V4 Pro ($0.14 / $0.28) is the cheapest China-region model in this 30+ model comparison and 7-18× cheaper than Claude Haiku; only Ministral 3 3B ($0.10 / $0.10) undercuts it on both input and output. Trade-offs:
- ✅ Chinese-language quality is comparable or better (DeepSeek and Qwen approach Claude Sonnet on Chinese benchmarks).
- ⚠️ API uptime: 99.5% (China-region) vs. 99.9% (Western).
- ⚠️ Long-context ceiling: 128K-256K (China-region) vs. 1M-2M (Claude/Gemini).
- ❌ Function-calling stability: weaker than Claude/GPT.

**Rule of thumb**: route non-core paths (preprocessing, content gen, extraction) to Chinese-region models; keep core agent / tool-call paths on Claude / GPT.

### Q5: How do I forecast my monthly AI API budget?

Three steps:
1. Pick the closest of the six scenarios above; plug your daily token volumes into the formula to get a baseline monthly cost.
2. Add a 30% buffer for hidden costs (retries, observability, long-context surcharges).
3. Roll forward with the template at [token-cost-calculator-api-budget](/en/posts/token-cost-calculator-api-budget/).
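
The first two steps can be sketched in Python as follows. This is a rough estimate under stated assumptions: a 30-day month, no caching, and the flat 30% buffer from step 2. `forecast_monthly_budget` is a hypothetical helper, and the sample volumes are the agent profile from Scenario 3.

```python
def forecast_monthly_budget(daily_input_m, daily_output_m, p_in, p_out,
                            buffer=0.30, days=30):
    """Return (baseline, budget) in USD; daily volumes are in millions of tokens."""
    baseline = days * (daily_input_m * p_in + daily_output_m * p_out)
    return baseline, baseline * (1 + buffer)


# Scenario 3 volumes on Claude Sonnet 4.6 ($3 in / $15 out)
baseline, budget = forecast_monthly_budget(40, 6.4, 3.00, 15.00)
print(f"baseline ${baseline:,.0f}, budget with buffer ${budget:,.0f}")
```

After the two-week trial run, replace the assumed daily volumes with measured ones and rerun the forecast.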

Start small (< $500/month) on new projects, run for 2 weeks, then re-forecast — the second pass is 3-5× more accurate than the initial estimate. For solo founders building content/affiliate sites where AI cost mixes with hosting/SEO budget, OppMint has a useful breakdown in [AI tools for solo founders building content sites](https://www.oppmint.com/ai-tools-solo-founder-content-website/).
