LLM Cost Optimization: Where Your Token Budget Actually Goes
You open your cloud billing dashboard and the number doesn’t look right. Your document analysis feature just finished its first full month in production. The API cost is way beyond the pilot budget. Same feature that cost almost nothing with 200 test documents. Finance sends a pointed email. Then a meeting invite.
Every prototype decision scaled directly into the bill. A 4,000-token system prompt on every request. A frontier model for every document, no matter how simple. No caching. Full conversation history on every turn. Nobody panicked during the pilot because 200 documents looked cheap. Multiply by 35,000 documents a month and the meter is still running.
The taxi drove the same route to your house before every ride. Nobody questioned it until the monthly statement arrived.
- System prompt + conversation history = 70% of your token spend. These are the highest-impact targets, not the user query.
- Model routing delivers the largest single cost reduction by sending simple queries to smaller models. A few days of engineering.
- Prompt caching and semantic caching stack. Prompt caching cuts cost on every request. Semantic caching eliminates requests entirely. Run both.
- Conversation history grows linearly and nobody notices until turn 10 when you’re sending 8,000+ tokens of stale context per request.
- Cost attribution by feature and team turns politics into accountability. Without unit economics, optimization stays abstract and nobody owns the spend.
Token Economics: The Math You Should Do First
Typical RAG request: system prompt (2,000 tokens) + retrieved context (3,000) + conversation history (5,000) + user query (100). Ten thousand input tokens before the model generates a single word. At 100,000 requests per day, fractions of a cent pile up into real money. The system prompt is identical across all requests. You’re paying the taxi to drive the same route to your house 100,000 times a day before it goes anywhere useful.
System prompt and conversation history together account for 70% of input tokens in most apps. The user query, the part that actually varies, is typically under 2%. You’re paying for the prologue, not the plot.
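The arithmetic behind that claim is worth doing explicitly. A back-of-envelope sketch; the $3-per-million-input-tokens price is an assumption for illustration, not any provider's published rate:

```python
# Back-of-envelope input-token cost for the request shape above.
# The $3-per-million-input-tokens price is illustrative only.
system_prompt, context, history, query = 2_000, 3_000, 5_000, 100
tokens_per_request = system_prompt + context + history + query  # 10,100
requests_per_day = 100_000
price_per_million = 3.00  # USD, assumed

daily_cost = tokens_per_request * requests_per_day / 1_000_000 * price_per_million
overhead_share = (system_prompt + history) / tokens_per_request

print(f"${daily_cost:,.0f} per day on input tokens alone")  # $3,030 per day
print(f"{overhead_share:.0%} is prompt + history")          # 69%
```

Roughly $3,000 a day before a single output token, and about 70% of it is the prologue.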
Model Routing: The Biggest Single Win
Not every request needs a frontier model. A “what are your business hours?” query and an “analyze this 50-page contract for liability exposure” query cost the same if both hit the same endpoint. Hiring a limo to pick up groceries.
A lightweight classifier routes each request to the cheapest model that can handle it. Simple questions go to a small model at a fraction of the cost. Complex reasoning goes to the frontier model. UberX for the commute, Uber Black for the client dinner. For apps with diverse request complexity, this one change delivers the largest single cost reduction available. Routing pays off when:
- Request volume exceeds 10,000 per day (below this, engineering cost exceeds savings)
- Request complexity varies meaningfully across the user base
- Quality benchmarks exist for each task type to detect routing errors
- Latency budget allows the 5-15ms overhead of a classification step
- Fallback path routes to frontier model when classifier confidence is low
| Task Complexity | Examples | Routed To | Cost per 1K requests | Why |
|---|---|---|---|---|
| Simple | Classification, extraction, yes/no, short lookup | Small model (Haiku, GPT-4o-mini) | ~$0.02 | These tasks don’t need reasoning. Small models handle them at 95%+ accuracy |
| Medium | Summarization, structured output, moderate analysis | Mid-tier model (Sonnet, GPT-4o) | ~$0.30 | Needs better instruction following but not deep reasoning |
| Complex | Multi-step reasoning, creative drafting, ambiguous edge cases | Large model (Opus, GPT-4) | ~$2.00 | Only 10-20% of requests actually need this. Route only what qualifies |
A classifier at the front saves 5-20x on total inference cost.
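Where that multiplier comes from: using the per-tier costs in the table and an assumed 70/20/10 complexity split (hypothetical, but typical of mixed workloads — measure your own distribution before trusting it):

```python
# Blended cost per 1K requests with routing vs. sending everything
# to the large model. The 70/20/10 split is an assumption.
cost_per_1k = {"simple": 0.02, "medium": 0.30, "complex": 2.00}
split = {"simple": 0.70, "medium": 0.20, "complex": 0.10}

routed = sum(cost_per_1k[tier] * share for tier, share in split.items())
unrouted = cost_per_1k["complex"]  # every request hits the large model

print(f"routed:   ${routed:.3f} per 1K requests")  # $0.274
print(f"unrouted: ${unrouted:.2f} per 1K requests")
print(f"savings:  {unrouted / routed:.1f}x")       # ~7.3x
```

Shift the split toward simple queries and the multiplier climbs toward the top of the 5-20x range.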
The classifier itself is cheap. A few hundred labeled examples, a small model fine-tuned on your request distribution. The hard part is defining “simple” versus “complex” for your specific domain. Get this wrong and complex queries hit a small model, producing garbage that users retry on. Doubling your cost and halving their trust.
Don’t: Blanket-downgrade all requests to a cheaper model. Switching from a frontier model to a budget model across the board saves money and tanks quality on the requests that actually needed reasoning. Nobody notices until satisfaction scores drop.
Do: Route by measured complexity. Simple queries to small models, complex queries to frontier. Preserve quality where it matters, save aggressively where it doesn’t.
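A minimal two-tier routing sketch. The keyword heuristic below stands in for a real fine-tuned classifier, and the tier names are placeholders; what matters is the shape — score, check confidence, fall back to the frontier model when the signal is ambiguous:

```python
# Toy router: a real one would be a small model fine-tuned on a few
# hundred labeled examples, with a middle tier as well. Low-confidence
# requests fall back to the frontier model, per the checklist above.
SIMPLE_HINTS = ("what are", "when is", "classify", "extract", "yes or no")
COMPLEX_HINTS = ("analyze", "draft", "compare", "liability", "multi-step")

def route(query: str, confidence_threshold: float = 0.6) -> str:
    q = query.lower()
    simple_score = sum(h in q for h in SIMPLE_HINTS)
    complex_score = sum(h in q for h in COMPLEX_HINTS)
    total = simple_score + complex_score
    if total == 0:
        return "frontier"  # no signal: low confidence, fall back
    confidence = max(simple_score, complex_score) / total
    if confidence < confidence_threshold:
        return "frontier"  # ambiguous: fall back rather than guess
    return "small" if simple_score > complex_score else "frontier"

print(route("what are your business hours?"))        # small
print(route("analyze this contract for liability"))  # frontier
print(route("hello"))                                # frontier (no signal)
```

The fallback branches are the point: a router that guesses on ambiguous queries is the one that sends contract analysis to a small model.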
Caching: Two Layers That Stack
Model routing cuts cost per request. Caching eliminates requests entirely. The two compound.
Prompt caching (Anthropic, OpenAI): reuses computed work for repeated prefixes. Big savings on the cached portion. Strict requirement: the prefix must be exactly the same, character for character. The driver already knows the route to your house; no need to look it up every ride. One team’s five-minute fix (moving a dynamic timestamp out of the system prompt) cut their bill nearly in half. Five minutes. The most profitable five minutes that engineer ever worked.
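That five-minute fix generalizes: anything dynamic inside the system prompt breaks the exact-prefix requirement. A sketch of the before and after (message structure is illustrative; the principle is provider-agnostic):

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a document analysis assistant."  # static, cacheable

def broken_prefix(user_query: str) -> list[dict]:
    # Timestamp inside the system prompt: the prefix changes on every
    # call, so the provider's prompt cache never hits.
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\nCurrent time: {now}"},
        {"role": "user", "content": user_query},
    ]

def cacheable_prefix(user_query: str) -> list[dict]:
    # Same information, but the dynamic part rides in the user turn.
    # The system prompt stays byte-identical across requests and caches.
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Current time: {now}\n\n{user_query}"},
    ]
```

Audit your system prompt for timestamps, request IDs, user names: each one silently disables the cache.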
Semantic caching: responses stored and indexed by meaning via embeddings. When a new query is similar enough to one already answered, return the cached response. “Same place as yesterday? I know a shortcut.” Set the cosine similarity threshold between 0.92 and 0.97. Too loose returns wrong answers. Too strict rarely triggers. Customer service apps with repetitive question patterns eliminate most of the redundant calls. 5-15ms lookup versus 500-2,000ms inference.
| Strategy | Effort | Ongoing Cost | Typical Savings | Best For |
|---|---|---|---|---|
| Prompt caching | Low (hours) | None | High, on cached prefix | Long system prompts, RAG apps |
| Model routing | Medium (days) | Classifier upkeep | High, on average spend | High-volume, mixed complexity |
| Semantic caching | Medium (days) | Cache infra | Moderate, via call elimination | Repetitive queries (support, FAQ) |
| History summarization | Low (hours) | Extra API call per N turns | High, after turn 5 | Chat apps |
| Output truncation | Low (hours) | None | Low-to-moderate | Verbose generation tasks |
The two caching strategies look similar but work at different layers. Understanding when each fires (and when they stack) matters for setting expectations:
| Dimension | Prompt Caching | Semantic Caching |
|---|---|---|
| How it works | Caches the prefix (system prompt + context). Model reuses cached tokens on subsequent calls | Embeds the query, searches for similar past queries, returns cached response if similarity > threshold |
| Cache key | Exact prefix match | Embedding similarity (threshold 0.92-0.97) |
| Hit rate | High for repeated system prompts | Varies. Higher for FAQ-style, lower for novel queries |
| Cost saving | Reduces input token cost on cached portion | Eliminates entire inference call on hit |
| Latency | Often lower (cached prefix skips reprocessing) | Sub-100ms on cache hit (no model call) |
| Risk | None. Exact match only | Semantic false positives return wrong answer |
| Best for | Long system prompts reused across requests | High-volume, repetitive query patterns |
| Stack together? | Yes: reduces cost on semantic-cache misses | Yes: eliminates calls before prompt cache is reached |
Conversation History: The Silent Cost Multiplier
Caching handles repeated patterns. It can’t help with the cost that grows with every turn of every conversation.
By turn 10 of a chat conversation, you’re sending 8,000+ tokens of mostly irrelevant history with every request. By turn 20, double that. The taxi driver narrating every turn from every ride you’ve ever taken before starting today’s trip. Stanford HAI documents AI processing costs growing faster than efficiency gains across the industry. Conversation history is a big reason why.
Three strategies, each with different trade-offs:
Sliding window: keep only the last N turns. Predictable ceiling, simple to build. The user loses context from earlier turns, which matters in long troubleshooting conversations. But most conversations don’t go past 5 turns. (Nobody reads Terms of Service either.)
Summarization: compress older turns into a summary after N turns. Steep token reduction, preserves key context. Costs one extra API call per summarization point. Worth it for any chat app with average sessions over 5 turns.
Hybrid retrieval: store all turns in a vector database, retrieve only the 2-3 most relevant per query. Keeps the token count stable regardless of conversation length. More complex to build, but the most efficient for sessions that run long. Searching your email for the relevant thread instead of reading every email you’ve ever sent.
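The first two strategies combine naturally: window the recent turns, fold everything older into one summary turn. A sketch, with `summarize` stubbed in place of the one extra model call the summarization strategy costs:

```python
# Sliding window + summarization combined. `summarize` is a stub
# standing in for one extra model call per summarization point.
def summarize(turns: list[str]) -> str:
    return "[summary of earlier turns]"  # placeholder for a model call

def build_history(turns: list[str], window: int = 4) -> list[str]:
    # Short conversations: send everything, nothing to compress.
    if len(turns) <= window:
        return list(turns)
    # Longer ones: one summary turn plus the last `window` turns,
    # so the token count stops growing linearly with session length.
    return [summarize(turns[:-window])] + turns[-window:]

turns = [f"turn {i}" for i in range(1, 11)]  # a 10-turn conversation
print(build_history(turns, window=4))
# ['[summary of earlier turns]', 'turn 7', 'turn 8', 'turn 9', 'turn 10']
```

Hybrid retrieval replaces the summary turn with a vector lookup over all past turns, at the cost of running an index.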
Building Cost Visibility Before You Optimize
Provider dashboards show total spend. Useless for anything except panic. You need cost per feature, per team, per user group. Reading the receipt instead of just paying the bill.
# Wrap every LLM call with cost attribution
metadata = {
    "feature": "document-analysis",
    "team": "product-search",
    "prompt_template": "v3.2",
    "user_cohort": "power-users",
}
response = llm.complete(prompt=prompt, metadata=metadata)

# Log token usage for aggregation, tagged with the same attribution
log_cost_event(
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    model=response.model,
    **metadata,
)
Dashboard targets: cost per task completion, cost per DAU, cost trend by feature and prompt template version. When template v3 costs noticeably more than v2 with the same quality, roll back. FinOps turns visibility into accountability. MLOps makes it part of the pipeline.
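Rolling those logged events up into per-feature numbers is a few lines. A sketch, with placeholder per-million-token prices (plug in your provider's actual rates) and a hypothetical event shape matching the logging call above:

```python
from collections import defaultdict

# Prices per million tokens are assumed placeholders, not real rates.
PRICE_PER_MILLION = {
    "small":    {"input": 0.25, "output": 1.25},
    "frontier": {"input": 3.00, "output": 15.00},
}

def cost_of(event: dict) -> float:
    p = PRICE_PER_MILLION[event["model"]]
    return (event["input_tokens"] * p["input"]
            + event["output_tokens"] * p["output"]) / 1_000_000

def cost_by_feature(events: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e["feature"]] += cost_of(e)
    return dict(totals)

events = [
    {"feature": "document-analysis", "model": "frontier",
     "input_tokens": 10_000, "output_tokens": 1_000},
    {"feature": "faq", "model": "small",
     "input_tokens": 500, "output_tokens": 100},
]
print(cost_by_feature(events))
```

Group by `prompt_template` instead of `feature` and the same aggregation flags a template version whose cost jumped.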
When Optimization Is Premature
| Optimize now | Optimize later |
|---|---|
| 10,000+ daily requests | Under 1,000 daily requests |
| Cost growing faster than revenue | Still figuring out if anyone wants the product |
| Multiple features sharing LLM budget | Single-feature, predictable volume |
| Users complaining about latency | Latency within acceptable range |
Under 1,000 requests per day? Engineering time exceeds token savings. Ship, validate, then optimize. Build the visibility instrumentation first regardless. It costs almost nothing and gives you data when you need to optimize later.
What the Industry Gets Wrong About LLM Costs
“Use a cheaper model.” Model downgrade is the bluntest tool in the box. Switching every request to a budget model saves money and wrecks the responses that needed real reasoning. Model routing preserves quality by sending each request to the right tier. A scalpel, not a sledgehammer.
“Token costs are the main expense.” For many production deployments, engineering time spent debugging, rewriting prompts, and managing infrastructure exceeds the API bill. Two engineers burning a sprint to shave pennies off a prompt that's rarely called? The savings never catch up to the salaries. Know which costs are actually big before attacking them.
“Caching doesn’t work for personalized responses.” Prompt caching works for any app with a repeated system prompt, regardless of how personalized the output is. The system prompt is the static portion. Personalizing the user query and retrieved context does not break the cached prefix. Teams skip prompt caching because they assume “personalized” means “uncacheable.” It doesn’t.
That document analysis feature? System prompt cached. Simple documents routed to smaller models. Repeated queries never hit the API. You open the billing dashboard next month and the number looks right. Finance doesn’t send an email. The meter still runs. It just runs a lot slower, and you know exactly where every token goes.