
AI Agent Orchestration in Production

Metasphere Engineering · 17 min read

You wire up a prototype over lunch. The agent calls a search tool, summarizes the result, feeds it into an API, and returns a neat structured answer. Works great in the notebook. You demo it to the team. Everyone nods. The PM starts drafting the launch announcement.

Then production happens.

The agent hits a 429 on the search tool. Retries. Gets a different result set this time. Makes up a parameter name for the downstream API. Gets a 400 back. Then, helpfully and confidently, tries to call a tool that doesn’t exist. Full confidence. Main character energy. Token meter spinning the whole time. By the time someone notices, the run has burned through more tokens than your entire pilot budget.

You hired a brilliant new employee, gave them access to every system in the company, and left for the weekend. The model is fine. What’s missing is supervision. Spending limits. Restricted access. A manager who checks their work before it goes out the door.

Key takeaways
  • Agent workflows are distributed systems with an unpredictable decision-maker at the center. Same input, different decisions, every time.
  • Three orchestration patterns hold up under load. Sequential chains, parallel fan-out, and human-in-the-loop gates. Everything else is a variation or a mistake.
  • Durable state is required the moment a workflow needs to pause for approval, wait for a webhook, or survive a restart.
  • Tool calling is the most fragile layer. Schema validation, retry with context feedback, and hard timeouts are non-negotiable.
  • Set cost caps before the first production run. A per-run token budget and a step limit catch runaway loops before your finance team does.

Orchestration Patterns That Actually Work

Sequential chains are the simplest and the most commonly botched. The mistake: passing the full output of each step forward. A 10-step chain dumping 5,000 tokens of piled-up context before the final step even starts thinking. The new hire’s desk is already buried in printouts from the first nine steps. Can’t find the original assignment under the pile. Drowning in their own paperwork. Compress between steps. Pass only what the next step needs.
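A compressed handoff can be sketched in a few lines. The step functions and the truncation-based `compress` below are toy stand-ins for LLM-backed steps and a real summarization call, not a production implementation:

```python
def run_chain(steps, compress, task):
    """Sequential chain: each step sees only a compressed summary,
    never the full accumulated output of all prior steps."""
    context = task
    for step in steps:
        output = step(context)
        context = compress(output)  # hand forward only what the next step needs
    return context

# Toy stand-ins for LLM-backed steps (assumptions, not a real agent stack):
steps = [
    lambda ctx: ctx + " -> extracted fields",
    lambda ctx: ctx + " -> transformed records",
    lambda ctx: ctx + " -> validated output",
]
compress = lambda text: text[-60:]  # naive truncation; real code would summarize

result = run_chain(steps, compress, "reconcile invoices")
```

The point is the shape, not the truncation: context entering each step stays bounded no matter how many steps the chain has.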

Parallel fan-out runs independent subtasks at the same time and merges results. The trap: assuming tasks are independent when they’re not. If subtask B’s quality depends on A’s result, you have a hidden sequential dependency. A whiteboard catches these in five minutes. Production catches them at 3am.
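When subtasks really are independent, the fan-out itself is simple. A minimal sketch with `asyncio` (the `fetch` coroutine stands in for a real search or API call):

```python
import asyncio

async def fetch(source: str) -> str:
    # Stand-in for an independent subtask (search, retrieval, API call).
    await asyncio.sleep(0)  # simulate I/O
    return f"result from {source}"

async def fan_out(sources):
    # Dispatch genuinely independent subtasks concurrently, then merge.
    # asyncio.gather preserves input order in its results.
    results = await asyncio.gather(*(fetch(s) for s in sources))
    return " | ".join(results)

merged = asyncio.run(fan_out(["source-a", "source-b", "source-c"]))
```

If you find yourself feeding one `fetch` result into another inside the gather, you have found the hidden sequential dependency.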

Human-in-the-loop gates pause for approval before anything that touches production state, triggers a transaction, or goes external. Not something to add later. The new hire can draft the email, but someone else clicks send. The challenge: keeping workflow state alive while it waits for a human. A Lambda can’t sit for two hours. You need durable state. And that’s the hardest engineering problem in agent orchestration.

[Diagram: Three Agent Orchestration Patterns. Sequential chains pass compressed state between steps — each step sees only what it needs (best for ETL and pipelines). Parallel fan-out dispatches to multiple sources and aggregates (best for research and search). Human-in-the-loop pauses at a durable gate, with state saved so the workflow resumes after hours or days (best for high-risk actions). Choose the pattern that matches the task's risk and parallelism profile.]

State Management for Long-Running Agents

If your workflow finishes in seconds, a Lambda is fine. The moment it needs to pause for approval, wait for a webhook, or retry after a delay, you need durable orchestration. State that lives through restarts, timeouts, and deployments. The new hire goes home at 5pm and picks up exactly where they left off the next morning. No lost context. No starting over.

AWS Step Functions saves state transitions natively. A workflow picks up from where it failed, not from scratch. Express Workflows cap at five minutes. A single approval gate blows right past that. Standard Workflows run up to a year.

Temporal replays from event history. A workflow that failed at step 47 picks up from step 47 with full context. For expensive LLM calls at every step, that replay efficiency saves real money. Steep learning curve. Worth it for complex agent workflows.

Inngest targets serverless event-driven workflows with a simpler developer experience. You trade maturity and advanced features for that simplicity.

[Diagram: Agent State — Three Persistence Patterns. AWS Step Functions: state machine persistence with built-in retry, timeouts, and a visual debugger; zero infrastructure, but a 25,000-event history limit; best for bounded workflows. Temporal: event-replay durability and deterministic re-execution with unlimited workflow length; handles multi-day agent tasks at the cost of operational complexity; best for long-running agents. Custom event sourcing: append-only event log with a full audit trail and domain-specific projections; total control, but you build everything yourself; best for compliance-heavy domains. Start with Step Functions; graduate to Temporal when workflows exceed hours.]
Pick based on where your team can debug under pressure, not feature comparison charts.

| | Step Functions | Temporal | Inngest |
|---|---|---|---|
| Best for | AWS-native teams | Cross-cloud, long-running | Greenfield, fast shipping |
| Max duration | 1 year (Standard) | Unlimited | Depends on plan |
| Failure recovery | Pick up from failed state | Replay from event history | Retry with backoff |
| Learning curve | Low (if in AWS) | High (worth it for complex) | Low |
| Self-host option | No | Yes | Yes |
| Agent fit | Short workflows with gates | Long sessions with expensive steps | Event-driven, simpler chains |

Durable state keeps your workflow alive. But the layer that breaks most often isn’t state management. It’s the tool calls themselves.

Tool Calling Reliability

The model makes up arguments that look like valid JSON but mean nothing. The target function could be down. Or the model confidently calls a tool that doesn’t exist. Addressing a package to an office nobody’s heard of. All three happen in production. Often in the same run.

Three layers of defense. You need all three.

Strict JSON Schema validation. Every tool gets a schema. Check inputs before running anything. Reject anything that doesn’t match.

{
  "name": "calculate_invoice",
  "parameters": {
    "type": "object",
    "properties": {
      "amount": { "type": "number", "minimum": 0 },
      "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
      "invoice_id": { "type": "string", "pattern": "^INV-[0-9]{6}$" }
    },
    "required": ["amount", "currency", "invoice_id"]
  }
}

Retry with context feedback. The naive approach retries the same call with the same arguments. That almost never works. Use exponential backoff, and after 2-3 failures, fall back to another tool or return a structured error. The detail that makes retries actually work: pass the error message back to the LLM as context. Models correct themselves well when they can see what went wrong. Terribly when they can’t. Tell the new hire why the form was rejected, and they fix it. Hand it back without a word, and they submit the same thing again.
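The retry-with-feedback loop can be sketched as follows. `llm_propose` and `execute` are stand-ins for your real model call and tool layer — the point is that the error string flows back into the next proposal instead of being discarded:

```python
import time

def call_with_feedback(llm_propose, execute, max_retries=3):
    """Retry loop that feeds each failure back to the model.
    llm_propose(error) returns tool arguments; execute(args) raises on failure."""
    error = None
    for attempt in range(max_retries):
        args = llm_propose(error)   # model sees what went wrong last time
        try:
            return execute(args)
        except Exception as exc:
            error = str(exc)        # becomes retry context, not discarded
            time.sleep(0)           # replace with exponential backoff, e.g. 2 ** attempt
    return {"status": "failed", "last_error": error}  # structured error, not a crash

# Demo: a fake model that fixes a made-up parameter name once it sees the error.
def llm_propose(error):
    return {"limit": 10} if error else {"page_size": 10}  # wrong name on first try

def execute(args):
    if "limit" not in args:
        raise ValueError("unknown parameter; expected 'limit'")
    return {"status": "ok", "rows": args["limit"]}

result = call_with_feedback(llm_propose, execute)
```

After the retries are exhausted, the loop returns a structured failure rather than raising — the orchestration layer decides what happens next, not the tool layer.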

Timeout enforcement. Hard timeouts per tool call. 30-60 seconds for API calls, 5-10 minutes for data processing. Without these, a single hung integration blocks the entire workflow while your token meter runs. In cloud-native architectures, circuit breakers serve the same purpose for service-to-service calls.
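A per-call deadline can be sketched with `concurrent.futures` — the hung tool below simulates an upstream API that never answers:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as ToolTimeout

def run_with_timeout(tool_fn, args: dict, timeout_s: float) -> dict:
    """Hard deadline per tool call: return a structured timeout error
    instead of letting a hung integration block the whole workflow."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tool_fn, **args)
        try:
            return {"status": "ok", "result": future.result(timeout=timeout_s)}
        except ToolTimeout:
            return {"status": "timeout", "tool": tool_fn.__name__}

def hung_tool(query: str) -> str:
    time.sleep(1)  # stands in for an upstream API that never answers
    return query

outcome = run_with_timeout(hung_tool, {"query": "acme"}, timeout_s=0.2)
```

One caveat with this sketch: `ThreadPoolExecutor` cannot kill the worker thread, so shutdown still waits for it. For true cancellation of hung tools, run them in subprocesses or rely on the HTTP client's own timeout.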

[Diagram: Tool Calling — Validate Before Execute. LLM output (tool call plus arguments) passes through JSON Schema validation (required args present?), then a permission check (is this action allowed?), then execution with logging and an audit trail. Invalid calls are retried with the error as context. Schema validation catches 90% of hallucinated calls before execution.]

Guardrails Architecture

An agent without guardrails is the new hire with the corporate credit card and admin access on day one. No spending limit. No approval chain. Full permissions to every system. What could go wrong? (Everything. Everything could go wrong.)

Prerequisites
  1. Every tool has a JSON Schema definition with strict validation
  2. Per-run token budget is set (50,000-200,000 tokens depending on workflow complexity)
  3. Step limit of 15-25 tool calls per run is enforced at the orchestration layer
  4. Human-in-the-loop gates exist for any action that changes production state
  5. All tool calls are logged with full arguments, results, and token cost
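Prerequisites 2 and 3 boil down to a small object the orchestration loop charges after every step. A sketch, using the budget ranges from the list above (the specific numbers are illustrative):

```python
class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Per-run caps enforced at the orchestration layer, not left
    to the agent's good judgment. Limits are illustrative."""
    def __init__(self, max_tokens: int = 100_000, max_steps: int = 20):
        self.max_tokens, self.max_steps = max_tokens, max_steps
        self.tokens_used, self.steps = 0, 0

    def charge(self, tokens: int) -> None:
        self.steps += 1
        self.tokens_used += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")

budget = RunBudget(max_tokens=30_000, max_steps=20)
halted = False
try:
    while True:                  # simulated runaway loop
        budget.charge(tokens=4_000)
except BudgetExceeded:
    halted = True                # caught by the cap, not by finance
```

The orchestrator calls `charge` after every tool call and reasoning step; the exception is the kill switch, and the agent never gets a vote.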

Input validation sits between the user’s request and the agent’s first step. Reject prompt injection attempts, confirm scope, clean up prompts built from user input. Off-the-shelf frameworks handle toxicity and PII detection. The checks that actually matter are specific to your business: a financial agent rejecting unauthorized accounts, a healthcare agent refusing dosing guidance. Those you write yourself.

Output filtering inspects every tool call before it runs. A read-only agent that suddenly tries to call a DELETE endpoint has either been prompt-injected or made up its permissions. Block it, log it, page someone.
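The read-only check is one function sitting in front of every tool execution. A sketch — the role names and HTTP-verb framing are assumptions about how your tools are modeled:

```python
READ_ONLY_METHODS = {"GET", "HEAD"}

def filter_output(agent_role: str, method: str, endpoint: str) -> dict:
    """Inspect a proposed tool call before it runs. A read-only agent
    asking for a mutating verb gets blocked; the caller logs and pages."""
    if agent_role == "read_only" and method not in READ_ONLY_METHODS:
        return {
            "allowed": False,
            "reason": f"{method} {endpoint} blocked for read-only agent",
        }
    return {"allowed": True}

verdict = filter_output("read_only", "DELETE", "/invoices/42")
```

The filter only decides; logging and paging stay in the caller so every blocked call leaves an audit trail.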

Cost caps set a per-run token budget and time limit. Without these, a single infinite loop shreds your monthly budget in an afternoon. The post-mortem always includes “we were going to add cost caps next sprint.” (Famous last words, cloud edition.)

Good AI system guardrails treat the agent like an untrusted new hire. Minimum permissions. Every action logged. Damage limited by hard caps.

[Diagram: Agent Workflow Execution — guardrails, retries, and cost tracking on every step. Example run, "Reconcile invoice data": the request is validated against scope policy (prompt-injection scan plus scope check); the LLM selects the search_invoices tool with arguments {"vendor": "Acme", "status": "pending"}, which pass JSON Schema validation; the tool hits a 429 and retries (backoff 1s, 4s, 16s, max 3 retries, error context passed back to the LLM, fallback to an alternate tool); the 3,200-token response is truncated to 800 tokens; the agent synthesizes a structured result (3 discrepancies found). A running token budget (12K → 38K → 62K against a 100K cap) is checked after each step until the workflow completes: 3 steps, 62K tokens, 4.2s elapsed.]

Agent Memory Patterns

You’ve secured the tools and capped the cost. The subtler problem: what does the agent remember between steps?

Forget everything, and the agent is useless. Remember everything, and the context window fills up until the agent ignores its own instructions. The desk is so buried in printouts from the last forty steps that the original assignment is somewhere at the bottom. Goldfish with a PhD. Memory comes down to what to keep, what to compress, and what to look up on demand.

Conversation memory uses a sliding window with summaries. Compress old exchanges into summaries, keep recent ones word-for-word. The summary loses detail, but the alternative is worse: the model forgetting its own instructions under all that piled-up context.
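The sliding-window-with-summary pattern is compact enough to sketch directly. The default `summarize` below is a placeholder for a real LLM summarization call:

```python
def window_memory(messages, keep_recent=4, summarize=None):
    """Sliding-window conversation memory: older turns collapse into one
    summary entry, recent turns are kept verbatim."""
    summarize = summarize or (lambda old: f"[summary of {len(old)} earlier turns]")
    if len(messages) <= keep_recent:
        return list(messages)
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

history = [f"turn {i}" for i in range(10)]
context = window_memory(history, keep_recent=4)
```

Ten turns of history become five context entries: one summary plus the four most recent exchanges, word-for-word.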

Episodic memory stores records of past runs, pulled up as examples when similar tasks come along. An agent that has processed 50 invoice reconciliations carries 50 real examples of decisions and edge cases. The new hire’s notebook. Without it, every invoice is the first invoice.

Semantic memory uses RAG-backed vector retrieval. Simple idea. In practice, retrieval quality makes or breaks it. When retrieval returns 10 chunks and only 2 matter, the other 8 actively hurt the reasoning. Noise drowning out the signal. The NIST AI Risk Management Framework covers what can go wrong. The fix: strict filtering and relevance scoring before anything reaches the context window.

The Observability Gap

Memory sorted. Tools hardened. Cost capped. Something goes wrong anyway.

Your APM dashboard says 3.2 seconds, 200 OK. Tells you nothing about what the agent actually did. Agent observability captures the full trail: which tools were considered, which were picked, what arguments were sent, what came back, and how the agent used those results to pick its next move.

Without it, debugging means staring at “step 7 failed” with no idea why the agent chose step 7 in the first place. The new hire made a mistake, but nobody kept a log of what they did all day. Good luck writing the incident report.

Four metrics matter beyond latency. Token cost per run catches runaway loops before they become budget fires. Tool call success rate shows which integrations need fixing. Step count distribution shows which task types make your agents struggle. Guardrail rejection rate shows how often the agent pushes against its fences. Track these next to normal APM. Agent observability that skips these is flying blind.

Real Failure Modes

Observability tells you what went wrong. Know the failure modes upfront and you can design around them. Four patterns cause most production agent incidents.

Infinite loops. Agent calls a tool, gets an error, retries the exact same way. Keeps going until the cost cap catches it. The new hire trying the same broken printer over and over, expecting a different result. Einstein’s definition of insanity, but with a billing API. A hard step limit of 15-25 steps prevents this, but only if the limit is set before the first production run.
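Beyond the hard step limit, a cheap loop detector is to watch for the same (tool, args) pair repeating back-to-back. A sketch:

```python
def consecutive_repeats(calls) -> int:
    """Length of the trailing run of identical (tool, args) calls.
    The orchestrator kills the run when this crosses a small threshold."""
    if not calls:
        return 0
    last, n = calls[-1], 0
    for call in reversed(calls):
        if call != last:
            break
        n += 1
    return n

calls = [("search", "acme"), ("search", "acme"), ("search", "acme")]
stuck = consecutive_repeats(calls) >= 3
```

Three identical calls in a row is almost never progress; it's the broken printer, and the check costs nothing per step.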

Hallucinated tool calls. The model invents tool names that don’t exist. Gets worse as your tool list grows past 20 entries. The new hire confidently submitting a form to a department the company never had. Check every call against the registered tool list. Return a clear error listing what’s available. The model usually corrects itself on the next try.
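The registry check is a dictionary lookup with a helpful failure path. The tool names below are hypothetical stand-ins:

```python
# Hypothetical tool registry; real entries would wrap actual integrations.
TOOL_REGISTRY = {
    "search_invoices": lambda **kw: {"rows": []},
    "get_vendor": lambda **kw: {"vendor": kw.get("name")},
}

def dispatch(tool_name: str, args: dict):
    """Check every call against the registered tool list. On a made-up
    name, return a clear error listing what IS available, so the model
    can self-correct on the next try."""
    tool = TOOL_REGISTRY.get(tool_name)
    if tool is None:
        return {
            "error": f"unknown tool '{tool_name}'",
            "available_tools": sorted(TOOL_REGISTRY),
        }
    return tool(**args)

resp = dispatch("fetch_invoices", {})  # hallucinated name
```

Returning the list of valid tools in the error is the feedback loop from the retry section applied to tool selection itself.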

Context window exhaustion. A 10-step workflow with wordy API responses can fill a 128K context window before the final step. The model loses access to its system prompt, task description, and guardrails. The original task is buried somewhere in the archaeological layers of piled-up context. Fix: trim tool responses hard. Return only what the agent needs for its next step, not the full API response.

Cascading retries. Tool A fails, agent tries Tool B. Tool B returns different data, causing Tool C to fail with unexpected input. Each retry looks reasonable on its own. Zoom out and the agent has played a game of telephone with itself. The final output has nothing to do with the original task. Circuit breakers at the workflow level catch this before things snowball.

Anti-pattern

Don’t: Expose more than 15-20 tools to a single agent. As the tool list grows, made-up tool names increase and selection accuracy drops. The agent starts guessing instead of choosing.

Do: Use specialized sub-agents with focused tool sets. A “research agent” with 5 retrieval tools and a “data agent” with 5 database tools outperform a single agent with 10 tools. Route to the right sub-agent with regular code, not LLM reasoning.
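"Regular code, not LLM reasoning" for routing can be as plain as keyword rules. The keywords and agent names below are illustrative; real routing would use whatever deterministic signals your domain offers:

```python
def route(task: str) -> str:
    """Deterministic sub-agent routing: plain rules decide which focused
    sub-agent gets the task. No LLM call, no tokens, no surprises."""
    data_keywords = ("sql", "table", "database", "query", "export")
    if any(k in task.lower() for k in data_keywords):
        return "data_agent"   # e.g. 5 database tools
    return "research_agent"   # e.g. 5 retrieval tools

a = route("Export last quarter's invoices to CSV")
b = route("Find sources on agent orchestration patterns")
```

Each sub-agent then sees only its own five tools, which is exactly what keeps selection accuracy high.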

When NOT to Use Agents

| Use an agent | Use regular code instead |
|---|---|
| Task requires reading unstructured input | Input is structured and predictable |
| Tool selection depends on unclear context | Steps are fixed and never change |
| What was found changes the next action | Branching logic fits in an if/else |
| Natural language understanding is core to the task | The task is data transformation or ETL |
| Error recovery requires reasoning about alternatives | Errors have known, scriptable fixes |

An agent adds latency (seconds per reasoning step), cost (every decision burns tokens), and unpredictability. Using an LLM to check if status == "approved" is slower, more expensive, and less reliable than one line of code. Teams make this mistake all the time because the prototype handles it fine and nobody stops to ask whether an LLM was even the right tool. Using a flamethrower to light a candle. Technically works.

The autonomous AI agents guide covers security architecture for agents that operate on their own.

What the Industry Gets Wrong About AI Agents

“Agents replace traditional automation.” They add to it. For every workflow where reasoning adds genuine value, ten more exist where a state machine is faster, cheaper, and predictable. Agent demos are more impressive than state machine demos. And a Formula 1 car is more impressive than a delivery van. That has nothing to do with which one moves your freight.

“Better models fix reliability.” A model that makes up tool names less often still makes up tool names. The architecture around the model matters more than the model version. Waiting for “the next release” to fix production issues is fixing the wrong layer. You’re waiting for a smarter new hire instead of writing a training manual.

“Agent frameworks handle orchestration.” LangChain and similar frameworks handle the happy path. They don’t handle durable state across restarts. They don’t handle approval gates that last hours. They don’t track cost across multi-step workflows. The framework gets you to a demo. The infrastructure around it gets you to production. Mixing those up is how teams end up rewriting their agent stack six months in.

The Agent Hammer Problem: When your most exciting tool is an agent, every workflow starts looking like it needs reasoning. Most don't. The teams building the most reliable agent systems are the ones most willing to replace agent reasoning with regular code wherever the logic is predictable.

Our take: Production agent systems should be 80% regular code and 20% LLM reasoning. The agent handles what genuinely needs reading messy input or picking between tools based on context. Everything else? Normal code. Normal tests. The pressure to “use more AI” builds systems that are slower, more expensive, and less reliable than what they replaced. Resist it.

That lunch prototype worked because it ran once, with perfect inputs. Production generative AI means thousands of runs with inputs nobody expected, against APIs that fail at the worst moment. You wouldn’t give the new hire unsupervised access to production on day one. Don’t give your agent that either.

Build Agent Workflows That Survive Production

The demo that chains three prompts together is not production-ready. Production agent orchestration needs durable state machines, structured tool calling, cost guardrails, and human-in-the-loop gates. Without them, a single hallucinated tool call can drain a budget or corrupt production data.


Frequently Asked Questions

What is the difference between an AI agent and a prompt chain?


A prompt chain is a fixed sequence of LLM calls where each step’s output feeds the next. An agent decides on the fly which tools to call, in what order, based on what it finds along the way. The practical difference is branching. Agents handle far more task types than chains, but their unpredictable paths make them harder to test and debug. Use chains when the workflow is predictable. Use agents when it genuinely needs reasoning over tool selection.

How do you prevent infinite loops in multi-step agent workflows?


Set a hard step limit per run. Most production agent systems cap at 15-25 tool calls per run. Track tool call frequency and kill the workflow if any single tool gets called more than 3 times in a row. That pattern catches loops before they burn meaningful cost. Combine with a per-run token budget that kills the workflow when exceeded.

What is the typical latency overhead of adding guardrails to agent tool calls?


Input validation and output filtering add milliseconds per tool call depending on how complex the check is. Across a multi-step workflow, guardrail overhead is tiny compared to LLM inference latency, which takes seconds per reasoning step. The cost of not having guardrails is hallucinated actions that take hours to reverse.

When should you use Step Functions versus Temporal for agent orchestration?


Step Functions fits when you’re already in AWS and your workflows fit its limits: Express Workflows cap at five minutes, while Standard Workflows run up to a year. Temporal fits when workflows run for days or longer, need complex retry rules per activity, or span multiple cloud providers. Temporal’s replay-based model handles long-running agent sessions where the LLM may need to wait for human approval. Step Functions is better at high-throughput short-lived workflows.

How much does agent observability differ from standard application monitoring?


Standard APM tracks request latency and error rates. Agent observability also needs to capture the full reasoning trail: which tools were considered, which were picked, what arguments were generated, and whether outputs passed checks. Without agent-specific tracing, debugging production failures is painfully slow because you can’t retrace the steps that caused the failure.