LLM Cost Optimization: Cut Inference Spend 40-90%

Metasphere Engineering · 9 min read

You open your cloud billing dashboard on a Monday morning and the number does not look right. Your LLM-powered document analysis feature just finished its first full month in production, and the API cost is deep into five figures. The same feature cost almost nothing during a two-week pilot with 200 test documents. Your finance team sends a pointed email.

The architectural decisions made during prototyping scaled directly into the production bill: a 4,000-token system prompt sent on every request, GPT-4 Turbo for every document regardless of complexity, no caching layer, and full conversation history on every turn. Nobody panicked during prototyping because 200 documents at a couple of dollars each looked fine. In production, 35,000 documents per month at roughly the same per-document cost is a very different conversation. The unit cost barely changed. The volume did. Every architectural shortcut from the prototype was now multiplied by 35,000.

This pattern repeats across every LLM deployment we see. Costs are invisible during development because volumes are small. They scale non-linearly with specific design choices. And they are billed by a third party with no natural feedback loop until the invoice arrives and someone starts asking hard questions.

Token Economics: The Math You Should Do First

Every design decision in an LLM application has a token cost, and the token cost is the cost. Understanding where tokens go is the first step to controlling spend. Most teams skip this step entirely. Do not be most teams.

A typical RAG application sends the following on every request: a system prompt (1,500-4,000 tokens with detailed instructions), retrieved context chunks (2,000-6,000 tokens depending on your top-K setting), conversation history in chat applications (grows with each turn, 500-10,000+ tokens), and the user’s actual question (50-200 tokens). Add those up. A single request easily hits 8,000-15,000 input tokens before the model generates a single output token.

Do the arithmetic explicitly. Take a system prompt of 2,000 tokens, retrieved context of 3,000 tokens, and conversation history of 5,000 tokens. That is 10,000 input tokens per request. At a few dollars per million input tokens, each request costs a few cents. Sounds small. At 100,000 requests per day, those cents compound into thousands of dollars per day and tens of thousands per month. Just input tokens. Output tokens add another 30-50% on top.
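That arithmetic is worth encoding once so anyone on the team can re-run it with their own numbers. A minimal sketch, where the per-million-token price is an illustrative placeholder, not any provider's actual rate:

```python
# Back-of-envelope input-token cost model. The price below is an
# illustrative placeholder -- plug in your provider's actual rate.
PRICE_PER_M_INPUT = 3.00  # dollars per million input tokens (assumed)

def monthly_input_cost(system=2_000, context=3_000, history=5_000,
                       query=100, requests_per_day=100_000, days=30):
    tokens = system + context + history + query
    per_request = tokens / 1_000_000 * PRICE_PER_M_INPUT
    return tokens, per_request * requests_per_day * days

tokens, cost = monthly_input_cost()
print(f"{tokens} input tokens/request -> ${cost:,.0f}/month, input only")
```

Run it before shipping, not after the invoice: the per-request number always looks harmless until it is multiplied by daily volume.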

Now look at where those tokens actually go. The system prompt is identical across all requests. Identical. The conversation history grows linearly with conversation length. The retrieved context varies but often contains redundant chunks. Each of these is a distinct optimization target, and ignoring any of them is leaving money on the table.

[Figure: Token cost accumulation per LLM request. Stacked bar showing a 2,000-token system prompt (identical every request; cache the prefix for 50-90% savings), 3,000 tokens of retrieved context (varies per query; reduce top-K or compress chunks), 5,000 tokens of conversation history (grows each turn; summarize after ~5 turns), and a 100-token user query. Total: 10,100 tokens per request, or about 1 billion input tokens per day at 100,000 requests/day.]

The breakdown above shows where those tokens go in a typical RAG request and the optimization strategy for each component. The system prompt and conversation history together account for 70% of input tokens in most applications, making them the highest-leverage optimization targets.

Model Routing: The 40-70% Win

Not every request needs your most expensive model. This sounds obvious. Most teams ignore it anyway. A question like “What are your business hours?” does not require GPT-4 or Claude Opus. A 7B-parameter model handles it correctly at 5-20x lower cost per token. Complex reasoning tasks, multi-step analysis, and nuanced generation still need the frontier model. The trick is routing each request to the cheapest model that handles it adequately.

The implementation is a lightweight classifier, either rule-based or a small fine-tuned model, that evaluates incoming requests on estimated complexity. Simple classification: short queries, FAQ-style questions, straightforward extraction tasks go to the small model. Medium complexity: summarization, structured analysis, moderate QA goes to a mid-tier model. Complex: multi-step reasoning, code generation, creative writing, long-context analysis goes to frontier.

The classifier itself costs nearly nothing. A rule-based router using query length, keyword presence, and task type adds zero API cost. An embedding-based classifier adds a negligible cost per request for the embedding call. Against the savings from routing simple queries to cheaper models, the ROI is overwhelming.
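A rule-based router of the kind described above fits in a few lines. This is a sketch: the keywords, the length threshold, and the model tier names are all illustrative assumptions, not a tuned classifier.

```python
# Minimal rule-based model router -- a sketch, not a production classifier.
# Keywords, the length threshold, and tier names are illustrative assumptions.
COMPLEX_HINTS = ("analyze", "compare", "write code", "step by step", "explain why")
MEDIUM_HINTS = ("summarize", "extract", "classify", "structure")

def route(query: str) -> str:
    q = query.lower()
    if len(q.split()) > 60 or any(h in q for h in COMPLEX_HINTS):
        return "frontier-model"   # multi-step reasoning, long context
    if any(h in q for h in MEDIUM_HINTS):
        return "mid-tier-model"   # summarization, structured analysis
    return "small-model"          # FAQ-style questions, short extraction

print(route("What are your business hours?"))
print(route("Summarize this meeting transcript"))
```

In practice you would tune the rules against a labeled sample of real traffic, or swap in an embedding-based classifier once the rule set stops being enough.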

Consider an AI-powered customer support platform processing 80,000 queries per day. Before routing, every query hits the frontier model at full price. After routing, 62% of queries go to a smaller model at a fraction of the cost, 28% to the mid-tier, and 10% to the frontier model. Average per-query cost drops by roughly 70% with no measurable quality degradation on appropriately routed queries.

Caching: Two Layers That Stack

Model routing slashes cost per request. Caching eliminates requests entirely. These two layers stack, and the combination is where the real savings live.

Prompt caching operates at the provider level. It stores the computed attention states for your system prompt prefix and reuses them across requests. When the cached prefix represents a large fraction of total input tokens, which is common in RAG applications with long system prompts, savings hit 50-90% on the cached portion. Anthropic and OpenAI both offer this.

The implementation requirement is strict: the cached prefix must be byte-for-byte identical across requests. If your system prompt includes a timestamp (Current date: March 14, 2026), the cache misses on every request because the prefix changes. Move dynamic content to the end of the prompt, after the cacheable prefix. This is a five-minute architectural change with a massive cost impact. Do it today.
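The fix looks like this in practice. A sketch using the generic chat-messages shape; provider-specific cache markers (such as Anthropic's cache_control on the system block) attach to the static part, and the prompt text here is a stand-in:

```python
from datetime import date

# The static instructions never change, so the provider can cache this prefix.
STATIC_SYSTEM_PROMPT = (
    "You are a document analysis assistant. Follow the policies below.\n"
    "... (imagine ~2,000 tokens of stable instructions here) ..."
)

def build_messages(question: str) -> list:
    # Dynamic content (today's date) goes AFTER the cacheable prefix,
    # in the user turn, so the system prompt stays byte-for-byte identical.
    dynamic = f"Current date: {date.today().isoformat()}\n\n{question}"
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": dynamic},
    ]
```

The design rule generalizes: anything that varies per request, timestamps, request IDs, user names, belongs after the stable prefix, never inside it.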

Semantic caching operates at your application level. You cache full LLM responses indexed by the semantic meaning of the request using vector embeddings. When a new request arrives, compute its embedding, search your cache for semantically similar past queries, and return the cached response if similarity exceeds your threshold (typically 0.92-0.97 cosine similarity).
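The lookup logic is simple enough to sketch. Here `embed` stands in for whatever embedding model you call; the threshold default sits in the range mentioned above, and a real deployment would use an approximate-nearest-neighbor index rather than a linear scan:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Cache full responses keyed by query embedding. `embed` is whatever
    embedding model you use; a real system would use an ANN index, not a scan."""
    def __init__(self, embed, threshold=0.95):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query):
        if not self.entries:
            return None
        q = self.embed(query)
        score, response = max((cosine(q, e), r) for e, r in self.entries)
        return response if score >= self.threshold else None  # hit: no API call

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

Add a TTL per entry in production so cached answers about policies or prices cannot outlive the facts they describe.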

For customer service applications where many users ask variations of “How do I reset my password?” or “What is your return policy?”, semantic caching can eliminate 30-50% of model invocations entirely. The cache hit eliminates the API call, returning a response in 5-15ms instead of 500-2,000ms.

The two approaches are complementary. Prompt caching saves on every request by reusing prefix computation. Semantic caching eliminates inference entirely for repeated patterns. Running both gives you the deepest cost reduction.

Conversation History Management

Chat applications have a specific cost trap that sneaks up on every team: conversation history that grows without bound. Each turn includes the full conversation history in the prompt. By turn 10, your input tokens include 8,000+ tokens of conversation history, most of which is no longer relevant to the current question. You are paying for the model to re-read a conversation it already had.

Three approaches, and the right one depends on your use case:

Sliding window: Keep only the last N turns (typically 5-8). Simple, predictable cost ceiling. The trade-off is losing context from early turns. Works well for customer support where each question tends to be self-contained.

Summarization: After N turns, summarize the conversation into a compressed representation (300-500 tokens) and replace the raw history. The summary captures key decisions and context without the verbatim back-and-forth. Costs one additional API call every N turns but reduces ongoing per-turn costs by 60-80%.

Hybrid retrieval: Store all conversation turns in a vector database. On each new turn, retrieve only the 2-3 most relevant past exchanges based on semantic similarity to the current query. This gives the model targeted context without the full history. Most effective for long-running conversations (20+ turns) in advisory or analysis applications.
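The first two approaches compose naturally: keep a verbatim window and fold everything older into a running summary. A sketch where the window size is an assumed setting and `summarize` stands in for the one extra LLM call:

```python
# Sliding window plus summarization hook -- a sketch. WINDOW is an assumed
# setting; `summarize` stands in for one additional LLM call every N turns.
WINDOW = 6  # keep the last N turns verbatim

def compact_history(turns, summary, summarize):
    if len(turns) <= WINDOW:
        return turns, summary
    older, recent = turns[:-WINDOW], turns[-WINDOW:]
    # Fold older turns into a ~300-500 token summary instead of resending them.
    return recent, summarize(summary, older)

turns = [f"turn {i}" for i in range(10)]
recent, summary = compact_history(turns, "", lambda s, older: f"{len(older)} turns folded")
print(len(recent), summary)
```

The prompt for each new turn then carries the summary plus the recent window, which caps history cost at a known ceiling instead of letting it grow with conversation length.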

Building Cost Visibility Before You Need It

Cost optimization is impossible without visibility at the right granularity. Provider dashboards show total spend. That is useless. They do not show which features, which teams, or which prompt designs drive the most cost.

Building cost attribution requires instrumenting every LLM call with metadata: the feature name, the team, the prompt template version, the user cohort, and the token counts returned by the API. Aggregate this into a dashboard that surfaces unit economics. Cost per task completion. Cost per active user per day. Cost trend by feature and by prompt template version. Build this before you need it, not after the bill arrives.
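Instrumentation of this kind is a thin wrapper around your existing SDK calls. A minimal sketch; the field names, the response's `usage` shape, and the in-memory record list are all illustrative stand-ins for your provider and metrics pipeline:

```python
import time

RECORDS = []  # stand-in for your metrics pipeline (e.g. a log stream)

def instrumented_call(client_call, *, feature, team, template_version, **kwargs):
    """Wrap any LLM SDK call and emit a cost-attribution record.
    Field names and the response's `usage` shape are illustrative."""
    t0 = time.monotonic()
    response = client_call(**kwargs)
    RECORDS.append({
        "feature": feature,
        "team": team,
        "template_version": template_version,
        "input_tokens": response["usage"]["input_tokens"],
        "output_tokens": response["usage"]["output_tokens"],
        "latency_ms": round((time.monotonic() - t0) * 1000),
    })
    return response

# Demo with a fake client call that returns a usage block.
fake = lambda **kw: {"usage": {"input_tokens": 10_100, "output_tokens": 450}}
instrumented_call(fake, feature="doc-analysis", team="platform",
                  template_version="v3", prompt="...")
```

Aggregating these records by feature and template version is what turns "the bill is high" into "template v3 costs 50% more per completion than v2."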

Here is where this gets powerful: when you can see that prompt template v3 costs 50% more per completion than v2 with equivalent quality scores, you roll back immediately. When you can see that Feature X consumes a disproportionate share of your monthly bill for a small user cohort, you have a concrete conversation about whether that feature justifies its cost. Without numbers, that conversation is just politics.

The cost optimization and FinOps discipline provides the organizational framework for turning visibility into accountable spending targets. The MLOps and model lifecycle practice ensures cost tracking is integrated into your model deployment pipeline, not bolted on as an afterthought.

Without this instrumentation pipeline, optimization conversations stay abstract and nothing changes. With it, every team owns their LLM spend the same way they own their cloud compute budget, with clear metrics, attribution, and thresholds that trigger action before the monthly invoice forces it. The teams that control their AI costs are the teams that can see their AI costs. Everyone else is flying blind into a billing cliff.

Engineer Predictable LLM Costs Before Finance Notices

LLM API costs that look manageable at prototype scale routinely hit five figures per month in production without architectural guardrails. Metasphere designs cost-efficient LLM architectures that maintain quality while keeping inference costs predictable and attributable.

Optimize Your AI Costs

Frequently Asked Questions

What is prompt caching and how much can it save?

Prompt caching stores computed attention states for repeated prompt prefixes, typically your system prompt, and reuses them across requests. For applications with long system prompts and retrieved context, caching reduces cost by 50-90% on the cached portion. Both Anthropic and OpenAI offer prompt caching. The requirement: a byte-for-byte consistent prefix. Dynamic content like timestamps or request IDs in the system prompt defeats caching entirely.

What is model routing and when is it worth implementing?

Model routing uses a small classifier to direct each request to the cheapest adequate model. Simple questions and short classification tasks go to a 7B-parameter model at 5-20x lower cost. Complex reasoning goes to the frontier model. For applications with diverse request complexity, routing typically reduces average inference cost by 40-70% with minimal quality impact on appropriately routed tasks.

What is semantic caching and what are its limitations?

Semantic caching stores LLM responses indexed by query meaning via embeddings. Similar requests, not just identical ones, can be served from cache. Limitations: cached responses become outdated, similarity thresholds require tuning (too strict means low hit rate, too loose means wrong answers), and embedding lookup adds 5-15ms latency. Most valuable for high-volume applications with repeated question patterns.

How do you attribute LLM costs to teams or features?

Track token usage per request with metadata linking to team, feature, and user segment. Include tags or per-key tracking in every API call. Build a cost pipeline aggregating by team that reports unit economics: cost per task completion, cost per active user per day. Without this visibility, optimization stays abstract and nobody is accountable for specific spending.

When does self-hosting an open-source model become cost-effective?

Self-hosting Llama, Mistral, or Qwen becomes cost-competitive when call volume reaches tens of millions of tokens per day, your use case tolerates the quality gap versus frontier models, and your team can operate GPU infrastructure. Most teams underestimate the break-even point when total operational cost including GPU management, model updates, and scaling infrastructure is included.