AI Agent Orchestration in Production
You wire up a prototype over lunch. The agent calls a search tool, summarizes the result, feeds it into an API, and returns a neat structured answer. Works great in the notebook. You demo it to the team. Everyone nods. The PM starts drafting the launch announcement.
Then production happens.
The agent hits a 429 on the search tool. Retries. Gets a different result set this time. Makes up a parameter name for the downstream API. Gets a 400 back. Then it tries to call a tool that doesn’t exist. Full confidence. Main character energy. Token meter spinning the whole time. By the time someone notices, the run has burned through more tokens than your entire pilot budget.
It’s the equivalent of hiring a brilliant new employee, giving them access to every system in the company, and leaving for the weekend. The model is fine. What’s missing is supervision. Spending limits. Restricted access. A manager who checks their work before it goes out the door.
- Agent workflows are distributed systems with an unpredictable decision-maker at the center. Same input, different decisions, every time.
- Three orchestration patterns hold up under load. Sequential chains, parallel fan-out, and human-in-the-loop gates. Everything else is a variation or a mistake.
- Durable state is required the moment a workflow needs to pause for approval, wait for a webhook, or survive a restart.
- Tool calling is the most fragile layer. Schema validation, retry with context feedback, and hard timeouts are non-negotiable.
- Set cost caps before the first production run. A per-run token budget and a step limit catch runaway loops before your finance team does.
Orchestration Patterns That Actually Work
Sequential chains are the simplest and the most commonly botched. The mistake: passing the full output of each step forward. A 10-step chain dumping 5,000 tokens of piled-up context before the final step even starts thinking. The new hire’s desk is already buried in printouts from the first nine steps. Can’t find the original assignment under the pile. Drowning in their own paperwork. Compress between steps. Pass only what the next step needs.
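Here’s a minimal sketch of compress-between-steps in Python. The step functions and `summarize_for_next_step` are hypothetical stand-ins for your model client; the shape to notice is that each step receives the original task plus a digest of the previous result, never the full accumulated transcript.

```python
from typing import Callable

Step = Callable[[str], str]

def summarize_for_next_step(output: str, max_chars: int = 1500) -> str:
    """Hypothetical compression step. In practice this is a cheap LLM call or a
    rule-based extractor that keeps only the fields the next step needs."""
    return output[:max_chars]  # placeholder: truncate instead of summarize

def run_chain(task: str, steps: list[Step]) -> str:
    context = task
    for step in steps:
        raw_output = step(context)
        # Pass forward a digest, not the piled-up transcript.
        # A 10-step chain stays small instead of snowballing.
        context = (
            f"Original task: {task}\n"
            f"Previous step result: {summarize_for_next_step(raw_output)}"
        )
    return context
```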
Parallel fan-out runs independent subtasks at the same time and merges results. The trap: assuming tasks are independent when they’re not. If subtask B’s quality depends on A’s result, you have a hidden sequential dependency. A whiteboard catches these in five minutes. Production catches them at 3am.
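A sketch of fan-out with `asyncio.gather`, assuming the subtasks really are independent. The subtask function is a hypothetical stand-in for an agent or tool call.

```python
import asyncio

async def run_subtask(name: str, prompt: str) -> dict:
    """Hypothetical subtask: one independent agent or tool call per topic."""
    await asyncio.sleep(0)  # stand-in for real work
    return {"name": name, "result": f"summary for {prompt}"}

async def fan_out(task: str, subtopics: list[str]) -> list[dict]:
    # Only safe when subtasks are genuinely independent. If B needs A's output,
    # this is a hidden sequential chain: don't fan out.
    results = await asyncio.gather(
        *(run_subtask(topic, f"{task}: {topic}") for topic in subtopics)
    )
    return list(results)

# merged = asyncio.run(fan_out("market scan", ["pricing", "competitors", "regulation"]))
```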
Human-in-the-loop gates pause for approval before anything that touches production state, triggers a transaction, or goes external. Not something to add later. The new hire can draft the email, but someone else clicks send. The challenge: keeping workflow state alive while it waits for a human. A Lambda can’t sit for two hours. You need durable state. And that’s the hardest engineering problem in agent orchestration.
State Management for Long-Running Agents
If your workflow finishes in seconds, a Lambda is fine. The moment it needs to pause for approval, wait for a webhook, or retry after a delay, you need durable orchestration. State that lives through restarts, timeouts, and deployments. The new hire goes home at 5pm and picks up exactly where they left off the next morning. No lost context. No starting over.
AWS Step Functions saves state transitions natively. A workflow picks up from where it failed, not from scratch. Express Workflows cap at five minutes. A single approval gate blows right past that. Standard Workflows run up to a year.
Temporal replays from event history. A workflow that failed at step 47 picks up from step 47 with full context. For expensive LLM calls at every step, that replay efficiency saves real money. Steep learning curve. Worth it for complex agent workflows.
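For illustration, a minimal approval-gate workflow sketched with Temporal’s Python SDK. The workflow and activity names are invented for this example. The expensive LLM work lives in activities, so on replay their recorded results are reused rather than recomputed, and the wait for the `approve` signal can last hours without holding a process open.

```python
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def draft_report(task: str) -> str:
    # Expensive LLM call lives in an activity; its result is recorded in the
    # event history and is not re-run on replay.
    return f"draft for: {task}"

@activity.defn
async def publish_report(draft: str) -> str:
    return f"published: {draft}"

@workflow.defn
class ApprovalGateWorkflow:
    def __init__(self) -> None:
        self._approved = False

    @workflow.signal
    def approve(self) -> None:
        self._approved = True

    @workflow.run
    async def run(self, task: str) -> str:
        draft = await workflow.execute_activity(
            draft_report, task, start_to_close_timeout=timedelta(minutes=5)
        )
        # Durable pause: this can sit for hours without a process holding memory.
        # A human approves via a signal, then the run resumes from here.
        await workflow.wait_condition(lambda: self._approved)
        return await workflow.execute_activity(
            publish_report, draft, start_to_close_timeout=timedelta(minutes=5)
        )
```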
Inngest targets serverless event-driven workflows with a simpler developer experience. You trade maturity and advanced features for that simplicity.
| | Step Functions | Temporal | Inngest |
|---|---|---|---|
| Best for | AWS-native teams | Cross-cloud, long-running | Greenfield, fast shipping |
| Max duration | 1 year (Standard) | Unlimited | Depends on plan |
| Failure recovery | Pick up from failed state | Replay from event history | Retry with backoff |
| Learning curve | Low (if in AWS) | High (worth it for complex) | Low |
| Self-host option | No | Yes | Yes |
| Agent fit | Short workflows with gates | Long sessions with expensive steps | Event-driven, simpler chains |
Durable state keeps your workflow alive. But the layer that breaks most often isn’t state management. It’s the tool calls themselves.
Tool Calling Reliability
Tool calls fail in three ways. The model makes up arguments that look like valid JSON but mean nothing. The target function is down. Or the model confidently calls a tool that doesn’t exist, addressing a package to an office nobody’s heard of. All three happen in production. Often in the same run.
Three layers of defense. You need all three.
Strict JSON Schema validation. Every tool gets a schema. Check inputs before running anything. Reject anything that doesn’t match.
```json
{
  "name": "calculate_invoice",
  "parameters": {
    "type": "object",
    "properties": {
      "amount": { "type": "number", "minimum": 0 },
      "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
      "invoice_id": { "type": "string", "pattern": "^INV-[0-9]{6}$" }
    },
    "required": ["amount", "currency", "invoice_id"]
  }
}
```
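A sketch of the validation gate in Python, assuming the `jsonschema` package and a registry keyed by tool name (the registry itself is an illustration, not part of the schema above). Arguments that fail validation never reach the tool; the error string goes back to the model instead.

```python
import jsonschema

# Hypothetical registry: tool name -> its parameters schema (like the one above).
TOOL_SCHEMAS = {
    "calculate_invoice": {
        "type": "object",
        "properties": {
            "amount": {"type": "number", "minimum": 0},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
            "invoice_id": {"type": "string", "pattern": "^INV-[0-9]{6}$"},
        },
        "required": ["amount", "currency", "invoice_id"],
        "additionalProperties": False,
    }
}

def validate_tool_call(tool_name: str, args: dict) -> str | None:
    """Return None if the call is valid, otherwise an error message
    that can be fed back to the model as context."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return f"Unknown tool '{tool_name}'. Available: {sorted(TOOL_SCHEMAS)}"
    try:
        jsonschema.validate(instance=args, schema=schema)
    except jsonschema.ValidationError as err:
        return f"Invalid arguments for '{tool_name}': {err.message}"
    return None
```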
Retry with context feedback. The naive approach retries the same call with the same arguments. That almost never works. Use exponential backoff, and after 2-3 failures, fall back to another tool or return a structured error. The detail that makes retries actually work: pass the error message back to the LLM as context. Models correct themselves well when they can see what went wrong. Terribly when they can’t. Tell the new hire why the form was rejected, and they fix it. Hand it back without a word, and they submit the same thing again.
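A sketch of retry-with-feedback. `propose_tool_call` and `execute_tool` are hypothetical stand-ins for your model client and tool runner; what matters is that each retry carries the previous failure as context instead of re-sending the same arguments.

```python
import time

MAX_ATTEMPTS = 3

def call_tool_with_feedback(task: str, propose_tool_call, execute_tool):
    error_context = None
    for attempt in range(MAX_ATTEMPTS):
        # The model sees what went wrong last time and can correct itself.
        call = propose_tool_call(task, previous_error=error_context)
        try:
            return execute_tool(call)
        except Exception as err:
            error_context = f"Call {call} failed: {err}"
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    # After repeated failures, return a structured error instead of looping forever.
    return {"status": "failed", "reason": error_context}
```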
Timeout enforcement. Hard timeouts per tool call. 30-60 seconds for API calls, 5-10 minutes for data processing. Without these, a single hung integration blocks the entire workflow while your token meter runs. In cloud-native architectures, circuit breakers serve the same purpose for service-to-service calls.
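A sketch of a hard per-call timeout with `asyncio.wait_for`. The 45-second default is an illustrative value inside the 30-60 second range above.

```python
import asyncio

class ToolTimeout(Exception):
    pass

async def run_with_timeout(coro, tool_name: str, timeout_s: float = 45.0):
    try:
        # A hung integration raises here instead of blocking the whole workflow.
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise ToolTimeout(f"{tool_name} exceeded {timeout_s}s") from None
```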
Guardrails Architecture
An agent without guardrails is the new hire with the corporate credit card and admin access on day one. No spending limit. No approval chain. Full permissions to every system. What could go wrong? (Everything. Everything could go wrong.) The minimum checklist:
- Every tool has a JSON Schema definition with strict validation
- Per-run token budget is set (50,000-200,000 tokens depending on workflow complexity)
- Step limit of 15-25 tool calls per run is enforced at the orchestration layer
- Human-in-the-loop gates exist for any action that changes production state
- All tool calls are logged with full arguments, results, and token cost
Input validation sits between the user’s request and the agent’s first step. Reject prompt injection attempts, confirm scope, clean up prompts built from user input. Off-the-shelf frameworks handle toxicity and PII detection. The checks that actually matter are specific to your business: a financial agent rejecting unauthorized accounts, a healthcare agent refusing dosing guidance. Those you write yourself.
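A sketch of the kind of business-specific input check you write yourself. The allow-list and field names are hypothetical.

```python
ALLOWED_ACCOUNTS = {"ACC-1001", "ACC-1002"}  # hypothetical scope for this agent

def validate_request(user_request: dict) -> str | None:
    """Runs before the agent's first step. Returns a rejection reason or None."""
    account = user_request.get("account_id")
    if account not in ALLOWED_ACCOUNTS:
        return f"Account {account!r} is outside this agent's authorized scope."
    if "ignore previous instructions" in user_request.get("text", "").lower():
        return "Request looks like a prompt injection attempt."
    return None
```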
Output filtering inspects every tool call before it runs. A read-only agent that suddenly tries to call a DELETE endpoint has either been prompt-injected or made up its permissions. Block it, log it, page someone.
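A sketch of an output filter that inspects each proposed tool call against the agent’s declared role before anything runs. The permission model and alerting hook are illustrative.

```python
READ_ONLY_METHODS = {"GET", "HEAD"}

def log_and_page(agent_role: str, tool_call: dict) -> None:
    """Stand-in for real logging and paging."""
    print(f"BLOCKED: {agent_role} attempted {tool_call}")

def check_tool_call(agent_role: str, tool_call: dict) -> bool:
    """Return True if the call may run. A read-only agent proposing a DELETE
    gets blocked, logged, and escalated instead of executed."""
    method = tool_call.get("http_method", "GET").upper()
    if agent_role == "read_only" and method not in READ_ONLY_METHODS:
        log_and_page(agent_role, tool_call)
        return False
    return True
```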
Cost caps set a per-run token budget and time limit. Without these, a single infinite loop shreds your monthly budget in an afternoon. The post-mortem always includes “we were going to add cost caps next sprint.” (Famous last words, cloud edition.)
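A sketch of a per-run budget guard enforced at the orchestration layer. The 100,000-token, 20-step, and 15-minute limits are illustrative values inside the ranges above.

```python
import time

class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_tokens: int = 100_000, max_steps: int = 20, max_seconds: int = 900):
        self.max_tokens, self.max_steps, self.max_seconds = max_tokens, max_steps, max_seconds
        self.tokens_used = 0
        self.steps = 0
        self.started = time.monotonic()

    def charge(self, tokens: int) -> None:
        """Call after every LLM or tool step. Raises before a loop becomes a budget fire."""
        self.tokens_used += tokens
        self.steps += 1
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens_used}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit exceeded: {self.steps}")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("run time limit exceeded")
```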
Good AI system guardrails treat the agent like an untrusted new hire. Minimum permissions. Every action logged. Damage limited by hard caps.
Agent Memory Patterns
You’ve secured the tools and capped the cost. The subtler problem: what does the agent remember between steps?
Forget everything, and the agent is useless. Remember everything, and the context window fills up until the agent ignores its own instructions. The desk is so buried in printouts from the last forty steps that the original assignment is somewhere at the bottom. Goldfish with a PhD. Memory comes down to what to keep, what to compress, and what to look up on demand.
Conversation memory uses a sliding window with summaries. Compress old exchanges into summaries, keep recent ones word-for-word. The summary loses detail, but the alternative is worse: the model forgetting its own instructions under all that piled-up context.
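A sketch of the sliding window: the most recent exchanges stay verbatim, everything older gets folded into one summary block. The `summarize` helper stands in for a cheap model call.

```python
KEEP_VERBATIM = 6  # most recent messages kept word-for-word

def summarize(messages: list[str]) -> str:
    """Hypothetical compression call; in practice a small, cheap model."""
    return "Summary of earlier conversation: " + " | ".join(m[:80] for m in messages)

def build_context(history: list[str]) -> list[str]:
    if len(history) <= KEEP_VERBATIM:
        return history
    old, recent = history[:-KEEP_VERBATIM], history[-KEEP_VERBATIM:]
    # One compressed block for the past, full fidelity for the present.
    return [summarize(old), *recent]
```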
Episodic memory stores records of past runs, pulled up as examples when similar tasks come along. An agent that has processed 50 invoice reconciliations carries 50 real examples of decisions and edge cases. The new hire’s notebook. Without it, every invoice is the first invoice.
Semantic memory uses RAG-backed vector retrieval. Simple idea. In practice, retrieval quality makes or breaks it. When retrieval returns 10 chunks and only 2 matter, the other 8 actively hurt the reasoning. Noise drowning out the signal. The NIST AI Risk Management Framework covers what can go wrong. The fix: strict filtering and relevance scoring before anything reaches the context window.
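A sketch of that filtering step. The 0.75 threshold and the cap of three chunks are illustrative; tune them against your own retrieval evaluations.

```python
def filter_chunks(scored_chunks: list[tuple[str, float]],
                  min_score: float = 0.75, max_chunks: int = 3) -> list[str]:
    """scored_chunks: (text, relevance score in [0, 1]) pairs from your vector store.
    Only chunks above the threshold get through, and never more than max_chunks."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:max_chunks]]
```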
The Observability Gap
Memory sorted. Tools hardened. Cost capped. Something goes wrong anyway.
Your APM dashboard says 3.2 seconds, 200 OK. Tells you nothing about what the agent actually did. Agent observability captures the full trail: which tools were considered, which were picked, what arguments were sent, what came back, and how the agent used those results to pick its next move.
Without it, debugging means staring at “step 7 failed” with no idea why the agent chose step 7 in the first place. The new hire made a mistake, but nobody kept a log of what they did all day. Good luck writing the incident report.
Four metrics matter beyond latency. Token cost per run catches runaway loops before they become budget fires. Tool call success rate shows which integrations need fixing. Step count distribution shows which task types make your agents struggle. Guardrail rejection rate shows how often the agent pushes against its fences. Track these next to normal APM. Agent observability that skips them is flying blind.
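A sketch of the per-run record worth emitting alongside normal APM metrics. Field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRunMetrics:
    run_id: str
    token_cost: int = 0              # catches runaway loops
    tool_calls: int = 0
    tool_failures: int = 0           # success rate = 1 - failures / calls
    steps: int = 0                   # step count distribution per task type
    guardrail_rejections: int = 0    # how often the agent pushes against its fences
    trace: list[dict] = field(default_factory=list)  # full tool-call trail

    def record_tool_call(self, name: str, args: dict, ok: bool, tokens: int) -> None:
        self.trace.append({"tool": name, "args": args, "ok": ok, "tokens": tokens})
        self.tool_calls += 1
        self.tool_failures += 0 if ok else 1
        self.token_cost += tokens
        self.steps += 1
```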
Real Failure Modes
Observability tells you what went wrong. Know the failure modes upfront and you can design around them. Four patterns cause most production agent incidents.
Infinite loops. Agent calls a tool, gets an error, retries the exact same way. Keeps going until the cost cap catches it. The new hire trying the same broken printer over and over, expecting a different result. Einstein’s definition of insanity, but with a billing API. A hard step limit of 15-25 steps prevents this, but only if the limit is set before the first production run.
Hallucinated tool calls. The model invents tool names that don’t exist. Gets worse as your tool list grows past 20 entries. The new hire confidently submitting a form to a department the company never had. Check every call against the registered tool list. Return a clear error listing what’s available. The model usually corrects itself on the next try.
Context window exhaustion. A 10-step workflow with wordy API responses can fill a 128K context window before the final step. The model loses access to its system prompt, task description, and guardrails. The original task is buried somewhere in the archaeological layers of piled-up context. Fix: trim tool responses hard. Return only what the agent needs for its next step, not the full API response.
Cascading retries. Tool A fails, agent tries Tool B. Tool B returns different data, causing Tool C to fail with unexpected input. Each retry looks reasonable on its own. Zoom out and the agent has played a game of telephone with itself. The final output has nothing to do with the original task. Circuit breakers at the workflow level catch this before things snowball.
Don’t: Expose more than 15-20 tools to a single agent. As the tool list grows, made-up tool names increase and selection accuracy drops. The agent starts guessing instead of choosing.
Do: Use specialized sub-agents with focused tool sets. A “research agent” with 5 retrieval tools and a “data agent” with 5 database tools outperform a single agent with 10 tools. Route to the right sub-agent with regular code, not LLM reasoning.
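A sketch of that routing in plain code. The sub-agent names, tool lists, and keyword rule are hypothetical; the point is that the dispatch itself is deterministic, testable, and costs zero tokens.

```python
# Hypothetical sub-agents, each with its own small tool set (defined elsewhere).
SUB_AGENTS = {
    "research": ["web_search", "fetch_page", "summarize_doc"],
    "data": ["run_sql", "get_table_schema", "export_csv"],
}

def route(task: str) -> str:
    """Plain if/else routing: no LLM reasoning involved in choosing the sub-agent."""
    if any(word in task.lower() for word in ("report", "research", "find", "summarize")):
        return "research"
    return "data"

# route("find recent pricing changes and summarize") -> "research"
```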
When NOT to Use Agents
| Use an agent | Use regular code instead |
|---|---|
| Task requires reading unstructured input | Input is structured and predictable |
| Tool selection depends on unclear context | Steps are fixed and never change |
| What was found changes the next action | Branching logic fits in an if/else |
| Natural language understanding is core to the task | The task is data transformation or ETL |
| Error recovery requires reasoning about alternatives | Errors have known, scriptable fixes |
An agent adds latency (seconds per reasoning step), cost (every decision burns tokens), and unpredictability. Using an LLM to check if status == "approved" is slower, more expensive, and less reliable than one line of code. Teams make this mistake all the time because the prototype handles it fine and nobody stops to ask whether an LLM was even the right tool. Using a flamethrower to light a candle. Technically works.
The autonomous AI agents guide covers security architecture for agents that operate on their own.
What the Industry Gets Wrong About AI Agents
“Agents replace traditional automation.” They add to it. For every workflow where reasoning adds genuine value, ten more exist where a state machine is faster, cheaper, and predictable. Agent demos are more impressive than state machine demos. And a Formula 1 car is more impressive than a delivery van. That has nothing to do with which one moves your freight.
“Better models fix reliability.” A model that makes up tool names less often still makes up tool names. The architecture around the model matters more than the model version. Waiting for “the next release” to fix production issues is fixing the wrong layer. You’re waiting for a smarter new hire instead of writing a training manual.
“Agent frameworks handle orchestration.” LangChain and similar frameworks handle the happy path. They don’t handle durable state across restarts. They don’t handle approval gates that last hours. They don’t track cost across multi-step workflows. The framework gets you to a demo. The infrastructure around it gets you to production. Mixing those up is how teams end up rewriting their agent stack six months in.
That lunch prototype worked because it ran once, with perfect inputs. Production generative AI means thousands of runs with inputs nobody expected, against APIs that fail at the worst moment. You wouldn’t give the new hire unsupervised access to production on day one. Don’t give your agent that either.