LLM Cost Optimization: Where Your Token Budget Actually Goes
You open your cloud billing dashboard and the number doesn’t look right. Your document analysis feature just finished its first full month in production. The API cost is way beyond the pilot budget. Same feature that cost almost nothing with 200 test documents. Finance sends a pointed email. Then a meeting invite.
Every prototype decision scaled directly into the bill. A 4,000-token system prompt on every request. A frontier model for every document, no matter how simple. No caching. Full conversation history on every turn. Nobody panicked during the pilot because 200 documents looked cheap. Multiply by 35,000 documents a month and the meter is still running.
The taxi drove the same route to your house before every ride. Nobody questioned it until the monthly statement arrived.
- System prompt + conversation history = 70% of your token spend. These are the highest-impact targets, not the user query.
- Model routing delivers the largest single cost reduction by sending simple queries to smaller models. A few days of engineering.
- Prompt caching and semantic caching stack. Prompt caching cuts cost on every request. Semantic caching eliminates requests entirely. Run both.
- Conversation history grows linearly and nobody notices until turn 10 when you’re sending 8,000+ tokens of stale context per request.
- Cost attribution by feature and team turns politics into accountability. Without unit economics, optimization stays abstract and nobody owns the spend.
Token Economics: The Math You Should Do First
Typical RAG request: system prompt (2,000 tokens) + retrieved context (3,000) + conversation history (5,000) + user query (100). Ten thousand input tokens before the model generates a single word. At 100,000 requests per day, fractions of a cent pile up into real money. The system prompt is identical across all requests. You’re paying the taxi to drive the same route to your house 100,000 times a day before it goes anywhere useful.
System prompt and conversation history together account for 70% of input tokens in most apps. The user query, the part that actually varies, is typically under 2%. You’re paying for the prologue, not the plot.
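The arithmetic behind that claim is worth doing explicitly. A back-of-envelope sketch; the $3-per-million-input-tokens price is an assumption for illustration, not any provider's published rate:

```python
# Back-of-envelope input-token cost for the request shape above.
# The $3-per-million-input-tokens price is illustrative only.
system_prompt, context, history, query = 2_000, 3_000, 5_000, 100
tokens_per_request = system_prompt + context + history + query  # 10,100
requests_per_day = 100_000
price_per_million = 3.00  # USD, assumed

daily_cost = tokens_per_request * requests_per_day / 1_000_000 * price_per_million
overhead_share = (system_prompt + history) / tokens_per_request

print(f"${daily_cost:,.0f} per day on input tokens alone")  # $3,030 per day
print(f"{overhead_share:.0%} is prompt + history")          # 69%
```

Roughly $3,000 a day before a single output token, and about 70% of it is the prologue.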
Model Routing: The Biggest Single Win
Not every request needs a frontier model. A “what are your business hours?” query and an “analyze this 50-page contract for liability exposure” query cost the same if both hit the same endpoint. Hiring a limo to pick up groceries.
A lightweight classifier routes each request to the cheapest model that can handle it. Simple questions go to a small model at a fraction of the cost. Complex reasoning goes to the frontier model. UberX for the commute, Uber Black for the client dinner. For apps with diverse request complexity, this one change delivers the largest single cost reduction available. Routing pays off when:
- Request volume exceeds 10,000 per day (below this, engineering cost exceeds savings)
- Request complexity varies meaningfully across the user base
- Quality benchmarks exist for each task type to detect routing errors
- Latency budget allows the 5-15ms overhead of a classification step
- Fallback path routes to frontier model when classifier confidence is low
| Task Complexity | Examples | Routed To | Cost per 1K requests | Why |
|---|---|---|---|---|
| Simple | Classification, extraction, yes/no, short lookup | Small model (Haiku, GPT-4o-mini) | ~$0.02 | These tasks don’t need reasoning. Small models handle them at 95%+ accuracy |
| Medium | Summarization, structured output, moderate analysis | Mid-tier model (Sonnet, GPT-4o) | ~$0.30 | Needs better instruction following but not deep reasoning |
| Complex | Multi-step reasoning, creative drafting, ambiguous edge cases | Large model (Opus, GPT-4) | ~$2.00 | Only 10-20% of requests actually need this. Route only what qualifies |
A classifier at the front saves 5-20x on total inference cost.
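Where that multiplier comes from: using the per-tier costs in the table and an assumed 70/20/10 complexity split (hypothetical, but typical of mixed workloads — measure your own distribution before trusting it):

```python
# Blended cost per 1K requests with routing vs. sending everything
# to the large model. The 70/20/10 split is an assumption.
cost_per_1k = {"simple": 0.02, "medium": 0.30, "complex": 2.00}
split = {"simple": 0.70, "medium": 0.20, "complex": 0.10}

routed = sum(cost_per_1k[tier] * share for tier, share in split.items())
unrouted = cost_per_1k["complex"]  # every request hits the large model

print(f"routed:   ${routed:.3f} per 1K requests")  # $0.274
print(f"unrouted: ${unrouted:.2f} per 1K requests")
print(f"savings:  {unrouted / routed:.1f}x")       # ~7.3x
```

Shift the split toward simple queries and the multiplier climbs toward the top of the 5-20x range.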
The classifier itself is cheap. A few hundred labeled examples, a small model fine-tuned on your request distribution. The hard part is defining “simple” versus “complex” for your specific domain. Get this wrong and complex queries hit a small model, producing garbage that users retry on. Doubling your cost and halving their trust.
Don’t: Blanket-downgrade all requests to a cheaper model. Switching from a frontier model to a budget model across the board saves money and tanks quality on the requests that actually needed reasoning. Nobody notices until satisfaction scores drop.
Do: Route by measured complexity. Simple queries to small models, complex queries to frontier. Preserve quality where it matters, save aggressively where it doesn’t.
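A minimal two-tier routing sketch. The keyword heuristic below stands in for a real fine-tuned classifier, and the tier names are placeholders; what matters is the shape — score, check confidence, fall back to the frontier model when the signal is ambiguous:

```python
# Toy router: a real one would be a small model fine-tuned on a few
# hundred labeled examples, with a middle tier as well. Low-confidence
# requests fall back to the frontier model, per the checklist above.
SIMPLE_HINTS = ("what are", "when is", "classify", "extract", "yes or no")
COMPLEX_HINTS = ("analyze", "draft", "compare", "liability", "multi-step")

def route(query: str, confidence_threshold: float = 0.6) -> str:
    q = query.lower()
    simple_score = sum(h in q for h in SIMPLE_HINTS)
    complex_score = sum(h in q for h in COMPLEX_HINTS)
    total = simple_score + complex_score
    if total == 0:
        return "frontier"  # no signal: low confidence, fall back
    confidence = max(simple_score, complex_score) / total
    if confidence < confidence_threshold:
        return "frontier"  # ambiguous: fall back rather than guess
    return "small" if simple_score > complex_score else "frontier"

print(route("what are your business hours?"))        # small
print(route("analyze this contract for liability"))  # frontier
print(route("hello"))                                # frontier (no signal)
```

The fallback branches are the point: a router that guesses on ambiguous queries is the one that sends contract analysis to a small model.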
Caching: Two Layers That Stack
Model routing cuts cost per request. Caching eliminates requests entirely. The two compound.
Prompt caching (Anthropic, OpenAI): reuses computed work for repeated prefixes. Big savings on the cached portion. Strict requirement: the prefix must be exactly the same, character for character. The driver already knows the route to your house; no need to look it up every ride. One team’s five-minute fix (moving a dynamic timestamp out of the system prompt) cut their bill nearly in half. Five minutes. The most profitable five minutes that engineer ever worked.
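That five-minute fix generalizes: anything dynamic inside the system prompt breaks the exact-prefix requirement. A sketch of the before and after (message structure is illustrative; the principle is provider-agnostic):

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a document analysis assistant."  # static, cacheable

def broken_prefix(user_query: str) -> list[dict]:
    # Timestamp inside the system prompt: the prefix changes on every
    # call, so the provider's prompt cache never hits.
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\nCurrent time: {now}"},
        {"role": "user", "content": user_query},
    ]

def cacheable_prefix(user_query: str) -> list[dict]:
    # Same information, but the dynamic part rides in the user turn.
    # The system prompt stays byte-identical across requests and caches.
    now = datetime.now(timezone.utc).isoformat()
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Current time: {now}\n\n{user_query}"},
    ]
```

Audit your system prompt for timestamps, request IDs, user names: each one silently disables the cache.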
Semantic caching: responses stored and indexed by meaning via embeddings. When a new query is similar enough to one already answered, return the cached response. “Same place as yesterday? I know a shortcut.” Set the cosine similarity threshold between 0.92 and 0.97. Too loose returns wrong answers. Too strict rarely triggers. Customer service apps with repetitive question patterns eliminate most of the redundant calls. 5-15ms lookup versus 500-2,000ms inference.
| Strategy | Effort | Ongoing Cost | Typical Savings | Best For |
|---|---|---|---|---|
| Prompt caching | Low (hours) | None | High, on cached prefix | Long system prompts, RAG apps |
| Model routing | Medium (days) | Classifier upkeep | High, on average spend | High-volume, mixed complexity |
| Semantic caching | Medium (days) | Cache infra | Moderate, via call elimination | Repetitive queries (support, FAQ) |
| History summarization | Low (hours) | Extra API call per N turns | High, after turn 5 | Chat apps |
| Output truncation | Low (hours) | None | Low-to-moderate | Verbose generation tasks |
The two caching strategies look similar but work at different layers. Understanding when each fires (and when they stack) matters for setting expectations:
| Dimension | Prompt Caching | Semantic Caching |
|---|---|---|
| How it works | Caches the prefix (system prompt + context). Model reuses cached tokens on subsequent calls | Embeds the query, searches for similar past queries, returns cached response if similarity > threshold |
| Cache key | Exact prefix match | Embedding similarity (threshold 0.92-0.97) |
| Hit rate | High for repeated system prompts | Varies. Higher for FAQ-style, lower for novel queries |
| Cost saving | Reduces input token cost on cached portion | Eliminates entire inference call on hit |
| Latency | Often lower (cached prefix skips reprocessing) | Sub-100ms on cache hit (no model call) |
| Risk | None. Exact match only | Semantic false positives return wrong answer |
| Best for | Long system prompts reused across requests | High-volume, repetitive query patterns |
| Stack together? | Yes: reduces cost on semantic-cache misses | Yes: eliminates calls before prompt cache is reached |
Conversation History: The Silent Cost Multiplier
Caching handles repeated patterns. It can’t help with the cost that grows with every turn of every conversation.
By turn 10 of a chat conversation, you’re sending 8,000+ tokens of mostly irrelevant history with every request. By turn 20, double that. The taxi driver narrating every turn from every ride you’ve ever taken before starting today’s trip. Stanford HAI documents AI processing costs growing faster than efficiency gains across the industry. Conversation history is a big reason why.
Three strategies, each with different trade-offs:
Sliding window: keep only the last N turns. Predictable ceiling, simple to build. The user loses context from earlier turns, which matters in long troubleshooting conversations. But most conversations don’t go past 5 turns. (Nobody reads Terms of Service either.)
Summarization: compress older turns into a summary after N turns. Steep token reduction, preserves key context. Costs one extra API call per summarization point. Worth it for any chat app with average sessions over 5 turns.
Hybrid retrieval: store all turns in a vector database, retrieve only the 2-3 most relevant per query. Keeps the token count stable regardless of conversation length. More complex to build, but the most efficient for sessions that run long. Searching your email for the relevant thread instead of reading every email you’ve ever sent.
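The first two strategies combine naturally: window the recent turns, fold everything older into one summary turn. A sketch, with `summarize` stubbed in place of the one extra model call the summarization strategy costs:

```python
# Sliding window + summarization combined. `summarize` is a stub
# standing in for one extra model call per summarization point.
def summarize(turns: list[str]) -> str:
    return "[summary of earlier turns]"  # placeholder for a model call

def build_history(turns: list[str], window: int = 4) -> list[str]:
    # Short conversations: send everything, nothing to compress.
    if len(turns) <= window:
        return list(turns)
    # Longer ones: one summary turn plus the last `window` turns,
    # so the token count stops growing linearly with session length.
    return [summarize(turns[:-window])] + turns[-window:]

turns = [f"turn {i}" for i in range(1, 11)]  # a 10-turn conversation
print(build_history(turns, window=4))
# ['[summary of earlier turns]', 'turn 7', 'turn 8', 'turn 9', 'turn 10']
```

Hybrid retrieval replaces the summary turn with a vector lookup over all past turns, at the cost of running an index.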
Building Cost Visibility Before You Optimize
Provider dashboards show total spend. Useless for anything except panic. You need cost per feature, per team, per user group. Reading the receipt instead of just paying the bill.
# Wrap every LLM call with cost attribution
metadata = {
    "feature": "document-analysis",
    "team": "product-search",
    "prompt_template": "v3.2",
    "user_cohort": "power-users",
}
response = llm.complete(prompt=prompt, metadata=metadata)

# Log token usage for aggregation, tagged with the same attribution
log_cost_event(
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    model=response.model,
    **metadata,
)
Dashboard targets: cost per task completion, cost per DAU, cost trend by feature and prompt template version. When template v3 costs noticeably more than v2 with the same quality, roll back. FinOps turns visibility into accountability. MLOps makes it part of the pipeline.
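Rolling those logged events up into per-feature numbers is a few lines. A sketch, with placeholder per-million-token prices (plug in your provider's actual rates) and a hypothetical event shape matching the logging call above:

```python
from collections import defaultdict

# Prices per million tokens are assumed placeholders, not real rates.
PRICE_PER_MILLION = {
    "small":    {"input": 0.25, "output": 1.25},
    "frontier": {"input": 3.00, "output": 15.00},
}

def cost_of(event: dict) -> float:
    p = PRICE_PER_MILLION[event["model"]]
    return (event["input_tokens"] * p["input"]
            + event["output_tokens"] * p["output"]) / 1_000_000

def cost_by_feature(events: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e["feature"]] += cost_of(e)
    return dict(totals)

events = [
    {"feature": "document-analysis", "model": "frontier",
     "input_tokens": 10_000, "output_tokens": 1_000},
    {"feature": "faq", "model": "small",
     "input_tokens": 500, "output_tokens": 100},
]
print(cost_by_feature(events))
```

Group by `prompt_template` instead of `feature` and the same aggregation flags a template version whose cost jumped.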
When Optimization Is Premature
| Optimize now | Optimize later |
|---|---|
| 10,000+ daily requests | Under 1,000 daily requests |
| Cost growing faster than revenue | Still figuring out if anyone wants the product |
| Multiple features sharing LLM budget | Single-feature, predictable volume |
| Users complaining about latency | Latency within acceptable range |
Under 1,000 requests per day? Engineering time exceeds token savings. Ship, validate, then optimize. Build the visibility instrumentation first regardless. It costs almost nothing and gives you data when you need to optimize later.
What the Industry Gets Wrong About LLM Costs
“Use a cheaper model.” Model downgrade is the bluntest tool in the box. Switching every request to a budget model saves money and wrecks the responses that needed real reasoning. Model routing preserves quality by sending each request to the right tier. A scalpel, not a sledgehammer.
“Token costs are the main expense.” For many production deployments, engineering time spent debugging, rewriting prompts, and managing infrastructure exceeds the API bill. Two engineers burning a sprint to shave pennies off a prompt that's rarely called? The savings never catch up to the salaries. Know which costs are actually big before attacking them.
“Caching doesn’t work for personalized responses.” Prompt caching works for any app with a repeated system prompt, regardless of how personalized the output is. The system prompt is the static portion. Personalizing the user query and retrieved context does not break the cached prefix. Teams skip prompt caching because they assume “personalized” means “uncacheable.” It doesn’t.
That document analysis feature? System prompt cached. Simple documents routed to smaller models. Repeated queries never hit the API. You open the billing dashboard next month and the number looks right. Finance doesn’t send an email. The meter still runs. It just runs a lot slower, and you know exactly where every token goes.