
How to Estimate API Costs for a RAG Chatbot


A RAG chatbot may look like a simple question-and-answer flow, but its real cost comes from retrieved documents, system prompts, conversation history, and generated answers. Estimating only the user’s question length will usually understate the monthly bill.

Break Down One Request

A typical RAG request includes system instructions, the user question, retrieved document chunks, recent conversation history, and the generated answer. The user question is often the shortest part. Retrieved context and history usually dominate input tokens.
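As a rough illustration of that breakdown, the sketch below models one request as five token counts and reports each part's share of the input. The `RequestTokens` class and every number except the 6,000 retrieved tokens and 800-token answer used later in this article are assumptions chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RequestTokens:
    """Token counts for one RAG request (hypothetical structure)."""
    system: int      # system instructions
    question: int    # the user's question
    retrieved: int   # retrieved document chunks
    history: int     # recent conversation history
    answer: int      # generated answer (billed as output)

    @property
    def input_tokens(self) -> int:
        # Everything except the generated answer is billed as input.
        return self.system + self.question + self.retrieved + self.history

    def input_shares(self) -> dict:
        """Fraction of input tokens contributed by each input component."""
        total = self.input_tokens
        if total == 0:
            return {}
        parts = ("system", "question", "retrieved", "history")
        return {name: getattr(self, name) / total for name in parts}


# Illustrative numbers only: system, question, and history sizes are assumed.
example = RequestTokens(system=400, question=40, retrieved=6000,
                        history=1200, answer=800)

for name, share in example.input_shares().items():
    print(f"{name:>10}: {share:.1%}")
```

With numbers in this range, the question itself is a tiny fraction of the input, which is why estimating from question length alone understates the bill.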

Use a Monthly Formula

Start with a simple estimate:

monthly cost = cost per request × daily requests × 30

Then break one request into:

input cost = uncached input tokens × input price + cached tokens × cache read price
output cost = output tokens × output price

For example, a support question that sends 6,000 retrieved tokens and receives an 800-token answer is billed for several thousand more input tokens than a short chat request, even before system prompts and history are counted.
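The formulas above can be sketched in a few lines of Python. The function names, the per-million-token price units, the placeholder prices, and the 1,000 daily requests are all assumptions for illustration; substitute your provider's actual rates and your own traffic figures.

```python
def request_cost(uncached_input_tokens, cached_tokens, output_tokens,
                 input_price, cache_read_price, output_price):
    """Cost of one request. Prices are assumed to be per 1,000,000 tokens."""
    input_cost = (uncached_input_tokens * input_price
                  + cached_tokens * cache_read_price) / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost + output_cost


def monthly_cost(cost_per_request, daily_requests, days=30):
    """monthly cost = cost per request x daily requests x 30"""
    return cost_per_request * daily_requests * days


# Placeholder prices (NOT any provider's real rates), per million tokens.
INPUT_PRICE = 1.00
CACHE_READ_PRICE = 0.10
OUTPUT_PRICE = 4.00

# The support-question example: 6,000 retrieved tokens, 800-token answer,
# nothing served from cache.
per_request = request_cost(6000, 0, 800,
                           INPUT_PRICE, CACHE_READ_PRICE, OUTPUT_PRICE)
print(f"per request: ${per_request:.4f}")
print(f"per month at 1,000 requests/day: ${monthly_cost(per_request, 1000):.2f}")
```

Keeping the per-request and monthly steps separate makes it easy to rerun the estimate with a different traffic assumption.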

Measure Retrieval Size

The easiest place to lose control is top-k retrieval. Returning 8 chunks instead of 3 may improve answer quality only slightly while more than doubling retrieved input tokens.

Track average retrieved chunks, average chunk length, and no-answer rate before launch. These numbers are more useful than page views when forecasting API bills. You can put them into the token budget template as a separate RAG scenario row.
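One way to collect these numbers is to log the token length of every retrieved chunk per request and summarize the log. The record layout below (`chunk_tokens`, `answered`) is a hypothetical format, and the four sample records are made up to show the calculation.

```python
from statistics import mean


def retrieval_stats(requests):
    """Summarize retrieval logs.

    Each record is assumed to be a dict with:
      'chunk_tokens': list of token counts, one per retrieved chunk
      'answered':     False when the bot returned a no-answer response
    """
    if not requests:
        raise ValueError("need at least one logged request")
    all_chunks = [t for r in requests for t in r["chunk_tokens"]]
    return {
        "avg_chunks": mean(len(r["chunk_tokens"]) for r in requests),
        "avg_chunk_tokens": mean(all_chunks) if all_chunks else 0,
        "avg_retrieved_tokens": mean(sum(r["chunk_tokens"]) for r in requests),
        "no_answer_rate": sum(not r["answered"] for r in requests) / len(requests),
    }


# Made-up sample log for illustration.
sample_log = [
    {"chunk_tokens": [500, 450, 550], "answered": True},
    {"chunk_tokens": [400, 600], "answered": True},
    {"chunk_tokens": [500, 500, 500], "answered": False},
    {"chunk_tokens": [500, 500, 500, 500], "answered": True},
]

stats = retrieval_stats(sample_log)
for key, value in stats.items():
    print(f"{key}: {value}")
```

The `avg_retrieved_tokens` figure is the one to feed into the input side of the cost formula.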

Where Caching Helps

System prompts, tool instructions, and fixed templates are good candidates for caching. User questions and retrieved content change more often, so they may not reliably hit cache.

When prompt caching is available, estimate fixed and dynamic context separately. Fixed parts can use cache-read pricing, while retrieved content is usually billed at the normal input price. Use the prompt caching savings guide to calibrate the cache assumption.
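A minimal sketch of that split, assuming a single cache hit rate applies to the fixed portion and ignoring any cache-write surcharge a provider may apply. The function name, token counts, hit rate, and prices are all illustrative assumptions.

```python
def input_cost_with_cache(fixed_tokens, dynamic_tokens, cache_hit_rate,
                          input_price, cache_read_price):
    """Input cost per request, with prices per 1,000,000 tokens.

    fixed_tokens:   system prompt, tool instructions, fixed templates
    dynamic_tokens: user question, retrieved chunks, history
    cache_hit_rate: assumed fraction of requests whose fixed part hits cache
    Simplification: cache-write costs are not modeled.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = fixed_tokens * cache_hit_rate
    uncached = fixed_tokens * (1 - cache_hit_rate) + dynamic_tokens
    return (uncached * input_price + cached * cache_read_price) / 1_000_000


# Placeholder prices per million tokens, not real provider rates.
with_cache = input_cost_with_cache(1000, 6000, 0.9, 1.00, 0.10)
without_cache = input_cost_with_cache(1000, 6000, 0.0, 1.00, 0.10)
saving = 1 - with_cache / without_cache
print(f"with cache:    ${with_cache:.5f}")
print(f"without cache: ${without_cache:.5f}")
print(f"saving:        {saving:.1%}")
```

Because only the fixed portion benefits, the saving stays modest whenever retrieved content dominates the request, which is the usual case for RAG.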

Watch Output Length

RAG answers are often longer than classification or extraction outputs. A model with low input pricing can still become expensive if it writes long responses.

Run 50 to 100 real questions before launch and measure average output length. Use that number in the calculator instead of a best-case guess. During model selection, compare text models against the model pricing table so the estimate is not tied to one provider.
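Once the test run is done, summarizing the recorded output lengths takes only a few lines. The sketch assumes you already have one output token count per answer, for instance from the usage data your provider returns. The ten values below are invented and far fewer than the 50 to 100 samples suggested above.

```python
from statistics import mean, median


def summarize_output_lengths(output_token_counts):
    """Summarize measured answer lengths (in tokens) from a test run."""
    if not output_token_counts:
        raise ValueError("need at least one measured answer")
    return {
        "samples": len(output_token_counts),
        "mean": mean(output_token_counts),
        "median": median(output_token_counts),
        "max": max(output_token_counts),
    }


# Invented sample for illustration; use your own measured counts.
measured = [420, 610, 805, 390, 1150, 760, 540, 905, 680, 740]
summary = summarize_output_lengths(measured)
print(summary)
```

The mean is the number to put in the cost formula; the maximum is a useful input when choosing an answer-length cap.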

Launch Checklist

  • Limit maximum retrieved chunks
  • Set a maximum answer length
  • Use different context budgets for free and paid users
  • Track no-answer requests
  • Estimate cache savings only for stable prompt sections
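Several checklist items can be enforced in code rather than left as guidelines. The sketch below shows one possible approach: a per-tier budget that caps chunk count, context tokens, and answer length. The `ContextBudget` class, the tier names, and all limits are hypothetical values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    max_chunks: int           # limit maximum retrieved chunks
    max_context_tokens: int   # cap on total retrieved tokens
    max_answer_tokens: int    # pass as the model's output-length limit


# Hypothetical limits for illustration only.
BUDGETS = {
    "free": ContextBudget(max_chunks=3, max_context_tokens=1500,
                          max_answer_tokens=400),
    "paid": ContextBudget(max_chunks=8, max_context_tokens=6000,
                          max_answer_tokens=800),
}


def select_chunks(chunks, tier):
    """Keep the most relevant chunks that fit the tier's budget.

    chunks: list of (text, token_count) pairs, most relevant first.
    Returns (selected_texts, tokens_used).
    """
    budget = BUDGETS[tier]
    selected, used = [], 0
    for text, tokens in chunks[:budget.max_chunks]:
        if used + tokens > budget.max_context_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(text)
        used += tokens
    return selected, used


ranked = [("a", 600), ("b", 600), ("c", 600), ("d", 600)]
print(select_chunks(ranked, "free"))
print(select_chunks(ranked, "paid"))
```

Stopping at the first chunk that does not fit keeps the selection in relevance order; skipping it and trying smaller chunks further down is an alternative if chunk sizes vary widely.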

RAG cost control is not just about choosing a cheaper model. It is about reducing unnecessary context per request.
