
Conversational AI: Voice and Chat Architecture

Metasphere Engineering · 10 min read

You built a conversational AI demo. The product manager asked it a question, it responded in a second, the answer was accurate, everyone applauded. Then you connected it to the actual telephony system, added the speech-to-text pipeline, routed it through the dialog manager, generated a response, synthesized speech, and played it back. The user waited 4.5 seconds of dead air. They said “hello?” twice. Then they pressed zero for a human agent. The demo was a hit. Production was a disaster.

The demo-to-production gap in conversational AI is not about model quality. GPT-4, Claude, Gemini. They all generate impressive text. The gap is architectural. It lives in the latency budget between the user finishing a sentence and the system starting its response. It lives in the guardrails that prevent the model from confidently stating something false to a customer. It lives in the channel abstraction that lets you serve web chat, voice, WhatsApp, and SMS from the same dialog engine without duplicating logic for each platform.

Building conversational AI systems that work in production requires solving these architectural problems first. The model is the easy part. Everything around it is where teams fail.

Intent Classification vs LLM-Native Conversations

The first architectural decision is how conversations are managed. Two approaches dominate, and picking the wrong one will cost you months.

Intent classifiers (Rasa, Dialogflow CX, Amazon Lex) parse user input into a predefined intent taxonomy. “What’s my order status?” maps to order.status. “Cancel my subscription” maps to subscription.cancel. Each intent triggers a specific dialog flow with defined slots to fill and actions to execute. This approach is predictable, testable, and fast. Classification runs in under 50ms. The downside: anything outside the taxonomy gets a blank stare.

LLM-native management uses a large language model to interpret user input, maintain context, and generate responses without a predefined intent taxonomy. The model reasons about what the user wants, what information is needed, and what action to take. This handles ambiguity, topic switching, and novel requests that intent classifiers reject. The downside: it is slower, less predictable, and occasionally hallucinates.

The production answer is almost always hybrid. Do not pick one. Use both. High-volume, well-defined paths (order tracking, password reset, appointment scheduling) go through intent classification. These represent 70-80% of traffic and benefit from the speed and predictability of structured flows. The remaining 20-30% (complex questions, multi-topic conversations, edge cases) route to the LLM handler.
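The hybrid routing decision can be sketched in a few lines. This is a minimal illustration, not a production router: `classify_intent` and `call_llm` are hypothetical stand-ins for a real intent classifier (Rasa, Dialogflow CX, Lex) and an LLM client, and the confidence threshold is something you tune per deployment.

```python
# Hybrid router sketch: structured flows handle confident intent matches,
# everything else falls through to the LLM handler.
INTENT_CONFIDENCE_THRESHOLD = 0.80  # tune per deployment

STRUCTURED_FLOWS = {
    "order.status": lambda slots: f"Order {slots['order_id']} is on the way.",
    "subscription.cancel": lambda slots: "Your subscription is cancelled.",
}

def classify_intent(text: str) -> tuple[str, float]:
    """Placeholder classifier; a real system calls Rasa/Dialogflow/Lex."""
    if "order" in text.lower():
        return "order.status", 0.94
    return "unknown", 0.30

def call_llm(text: str, slots: dict) -> str:
    """Placeholder for a streaming LLM completion call."""
    return "LLM-generated response"

def route(text: str, slots: dict) -> str:
    intent, confidence = classify_intent(text)
    flow = STRUCTURED_FLOWS.get(intent)
    if flow and confidence >= INTENT_CONFIDENCE_THRESHOLD:
        return flow(slots)        # fast, predictable structured path
    return call_llm(text, slots)  # slower fallback for everything else
```

The key design choice is that the LLM is the fallback, not the front door: the classifier runs first precisely because it is the cheap, fast path.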

Voice Pipeline Architecture

Voice is where the latency problem gets brutal. Voice conversations add four processing stages that text chat does not have, and each one eats into your latency budget. The pipeline is: ASR (automatic speech recognition) to convert audio to text, NLU (natural language understanding) to extract meaning, dialog management to determine the response, TTS (text-to-speech) to convert the response back to audio.

Latency Budget

The total budget for voice is 500ms from end-of-speech detection to first audio output. That is not a nice-to-have target. Anything longer and users start talking over the system or hanging up. Here is how that budget typically breaks down:

  • End-of-speech detection: 200-400ms (VAD - voice activity detection - waits for silence to confirm the user stopped talking)
  • ASR: 80-150ms (streaming ASR like Deepgram or Google Speech-to-Text v2 processes audio in chunks, delivering partial results before the user finishes speaking)
  • NLU + Dialog: 100-300ms (intent classification is fast; LLM inference is the bottleneck)
  • TTS: 80-150ms for first chunk (streaming TTS begins generating audio from the first sentence while the rest is still being composed)

The single biggest optimization is streaming at every stage. This is not optional. Streaming ASR sends partial transcripts to the NLU layer before the user finishes speaking. Streaming LLM inference sends tokens to the TTS engine as they are generated. Streaming TTS begins audio playback from the first sentence. This pipelining reduces perceived latency by 40-60% compared to waiting for each stage to complete fully. The difference between pipelined and sequential processing is the difference between a system that feels responsive and one that feels broken.
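The pipelining idea can be illustrated with composed generators, where each stage yields chunks as soon as it has them. This is only a structural sketch: the stage bodies are toy stand-ins for real streaming ASR/LLM/TTS clients, and a production pipeline would run the stages concurrently (async tasks or threads) rather than as synchronous generators.

```python
# Streaming pipeline sketch: each stage is a generator, so downstream
# stages can consume chunks before upstream stages finish.
def streaming_asr(audio_chunks):
    """Yield growing partial transcripts as audio arrives."""
    words = []
    for chunk in audio_chunks:
        words.append(chunk)  # pretend each audio chunk decodes to one word
        yield " ".join(words)

def streaming_llm(partial_transcripts):
    """Wait for the final transcript (real systems can start earlier),
    then stream response tokens."""
    for _final in partial_transcripts:
        pass
    for token in ("Your", "order", "shipped."):
        yield token

def streaming_tts(tokens):
    """Synthesise audio per token/sentence fragment as it arrives."""
    for token in tokens:
        yield f"<audio:{token}>"

def pipeline(audio_chunks):
    return list(streaming_tts(streaming_llm(streaming_asr(audio_chunks))))
```

The perceived-latency win comes from the first TTS chunk reaching the user while later LLM tokens are still being generated.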

Deepgram’s Nova-2 model delivers streaming ASR with a median latency of 100ms. ElevenLabs and PlayHT offer streaming TTS at similar speeds. The NLU/dialog stage is where the variance lives. A Rasa classifier responds in 20ms, while an LLM inference call ranges from 200ms (GPT-4o mini) to 800ms (Claude with complex reasoning).

[Figure: Voice pipeline with streaming optimization. Audio input flows through streaming ASR (Deepgram Nova-2, ~100ms), NLU and dialog management (intent classification, ~150ms), and streaming TTS (ElevenLabs / PlayHT, ~100ms to first chunk), with stages overlapping on partial transcripts and first-sentence audio. The latency budget comparison shows sequential processing at ~900ms versus pipelined streaming at ~460ms — 49% faster and under the 500ms target.]

Context Management

Conversations are stateful, and this is where most conversational AI systems quietly fall apart. A user says “book a flight to London” and then “actually, make it Paris.” The system needs to understand that “it” refers to the destination, not the entire booking. Get this wrong and every multi-turn conversation becomes an exercise in frustration.

Dialog State Tracking

For structured flows, maintain a session object with extracted slots, current dialog state, and conversation history. The state machine tracks which slots are filled, which are pending, and what confirmations are needed.
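A minimal session object for this looks like the sketch below. Slot names and the shape of the state are illustrative; the point is that slot updates overwrite previous values, which is exactly what makes "actually, make it Paris" work.

```python
# Slot-filling session sketch for structured dialog flows.
from dataclasses import dataclass, field

@dataclass
class DialogSession:
    required_slots: tuple          # slots this flow must fill
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def fill(self, name: str, value) -> None:
        self.slots[name] = value   # corrections simply overwrite

    def pending(self) -> list:
        return [s for s in self.required_slots if s not in self.slots]

    def ready(self) -> bool:
        return not self.pending()

session = DialogSession(required_slots=("destination", "date"))
session.fill("destination", "London")
session.fill("destination", "Paris")  # "actually, make it Paris"
```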

For LLM-based conversations, context management means curating the prompt window carefully. A raw conversation history grows unbounded and eventually exceeds the context window. Three approaches work well in production: summarizing earlier turns into a compressed context block, maintaining a structured “memory” object that tracks key facts extracted from the conversation, and using sliding windows that keep the last N turns verbatim while summarizing everything before.
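The sliding-window approach can be sketched as follows. `summarize` here is a hypothetical stand-in for an LLM summarisation call, and the window size is a knob you tune against your context budget.

```python
# Sliding-window context curation: last N turns verbatim,
# older turns compressed into a summary block.
WINDOW = 4  # turns kept verbatim

def summarize(turns):
    """Placeholder; a real system asks the LLM for a compressed summary."""
    return f"[summary of {len(turns)} earlier turns]"

def build_prompt(history):
    older, recent = history[:-WINDOW], history[-WINDOW:]
    parts = []
    if older:
        parts.append(summarize(older))
    parts.extend(recent)
    return "\n".join(parts)
```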

Channel Abstraction

Production conversational AI must serve multiple channels: web chat, WhatsApp Business API, Slack, SMS (Twilio), voice (telephony), and potentially Apple Messages for Business or Google Business Messages. Building separate dialog logic for each channel is the wrong approach. It leads to unmaintainable duplication that grows worse with every channel you add.

The clean architecture is a three-layer stack: channel adapters at the edge, a normalized message bus in the middle, and a channel-unaware dialog engine at the core. Each adapter converts platform-specific webhooks and message formats into a canonical internal format. The dialog engine processes that canonical format and returns a channel-agnostic response. The outbound adapter serializes the response back to the platform’s format, handling media constraints (WhatsApp supports buttons, SMS does not), character limits, and rich card formats. Teams using this pattern support 5+ channels with under 200 lines of adapter code per channel.
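A sketch of the canonical format and one adapter on each side makes the pattern concrete. Field names and webhook shapes are illustrative, not any platform's actual payload schema.

```python
# Channel abstraction sketch: adapters normalise inbound messages to a
# canonical format; outbound adapters enforce platform constraints.
from dataclasses import dataclass, field

@dataclass
class CanonicalMessage:
    text: str
    sender_id: str
    session_id: str
    channel: str
    metadata: dict = field(default_factory=dict)

@dataclass
class BotResponse:
    text: str
    buttons: list = field(default_factory=list)

def whatsapp_inbound(webhook: dict) -> CanonicalMessage:
    """Illustrative inbound adapter; real WhatsApp webhooks differ."""
    return CanonicalMessage(
        text=webhook["message"]["body"],
        sender_id=webhook["from"],
        session_id=webhook["from"],
        channel="whatsapp",
    )

def sms_outbound(resp: BotResponse) -> str:
    """SMS has no buttons: flatten them into numbered text options."""
    lines = [resp.text] + [f"{i}. {b}" for i, b in enumerate(resp.buttons, 1)]
    return "\n".join(lines)[:160]  # respect the SMS character limit
```

The dialog engine only ever sees `CanonicalMessage` and returns `BotResponse`; everything platform-specific lives in the adapters.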

Designing this abstraction well is a UX engineering problem as much as a backend one. Users have different expectations on each channel. WhatsApp users expect quick-reply buttons. Voice users need different confirmation patterns than text users. The channel metadata in the canonical message format lets the dialog engine adjust its behavior without coupling to platform specifics.

Guardrails for Customer-Facing AI

An internal chatbot that occasionally hallucinates is annoying. A customer-facing AI that confidently tells a user incorrect information about their account is a liability that will cost you real money and real trust. Guardrails are non-negotiable for production deployment. Do not skip this section.

Hallucination Prevention

Three layers, applied in sequence:

  1. Retrieval grounding: every factual claim must be traceable to a source document. RAG (retrieval-augmented generation) fetches relevant documents from a verified knowledge base, and the model is instructed to only use information from retrieved sources. This alone reduces hallucination from 15-20% to 5-8%.

  2. Fact verification: a deterministic post-processor checks extracted claims against structured data. If the model says “your order ships in 2 days” but the order record shows 5 days, the response is blocked and regenerated. This layer catches the claims RAG misses.

  3. Topic boundaries: a classifier rejects prompts that fall outside the defined domain. If a customer asks the support bot about stock prices, the system responds with a redirect rather than letting the model improvise. NeMo Guardrails and Guardrails AI both provide configurable topic boundary enforcement.
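The second layer is the easiest to show in code. This is a deliberately naive sketch: claim extraction here is a single regex, where a production system would use structured extraction, and the regenerate path is reduced to a sentinel string.

```python
# Deterministic fact-verification sketch: check a draft response's
# shipping claim against the structured order record before sending.
import re

def verify_shipping_claim(response: str, order_record: dict) -> bool:
    """Return True if the stated shipping time matches the record."""
    match = re.search(r"ships in (\d+) days?", response)
    if not match:
        return True  # no checkable claim found
    return int(match.group(1)) == order_record["ships_in_days"]

def guarded_send(response: str, order_record: dict) -> str:
    if verify_shipping_claim(response, order_record):
        return response
    return "BLOCKED: regenerate"  # real systems re-prompt the model
```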

Brand Voice Enforcement

System prompts define the persona, but enforcement needs a structural layer. Fine-tuned classifiers (trained on 500-1000 labeled examples of on-brand vs off-brand responses) score each response before sending. Responses below the threshold get regenerated with a stronger system prompt or fall back to a template. This catches the edge cases where the model drops character under adversarial prompting or unusual inputs.
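The gating logic around that classifier is simple; the sketch below assumes a hypothetical `score_brand_voice` standing in for the fine-tuned scorer, and the threshold and fallback copy are illustrative.

```python
# Brand-voice gate sketch: score, regenerate once, then fall back.
BRAND_THRESHOLD = 0.7
FALLBACK = "Thanks for reaching out. Let me connect you with more help."

def score_brand_voice(response: str) -> float:
    """Placeholder; a real system calls the fine-tuned classifier."""
    return 0.2 if "!!!" in response else 0.9

def enforce_brand_voice(response: str, regenerate) -> str:
    if score_brand_voice(response) >= BRAND_THRESHOLD:
        return response
    retry = regenerate()  # re-run with a stronger system prompt
    if score_brand_voice(retry) >= BRAND_THRESHOLD:
        return retry
    return FALLBACK       # template fallback as the last resort
```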

Handoff to Human Agents

No conversational AI system handles 100% of conversations. Accept that now and design for it. The handoff architecture (detecting when to escalate, what context to transfer, and how to make the transition seamless) determines whether the human agent resolves the issue quickly or spends three minutes re-asking questions the AI already covered.

Detection Signals

Trigger escalation on: explicit user request (“let me talk to a person”), sentiment degradation detected over consecutive turns, repeated intent classification failures (the system doesn’t understand what the user wants), or confidence scores dropping below threshold on consecutive responses. The threshold should be tunable per use case. A support bot for billing questions might escalate after two failed intents. An autonomous agent handling complex workflows might tolerate more ambiguity before escalating.
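Those four signals combine into a single check over recent turns. The thresholds below are illustrative defaults, not recommendations, and the turn schema is an assumption for the sketch.

```python
# Escalation-trigger sketch over a list of turns, where each turn is
# {"text": ..., "intent": ..., "confidence": ..., "sentiment": ...}.
MAX_FAILED_INTENTS = 2   # tune per use case
MIN_CONFIDENCE = 0.4

def should_escalate(turns: list) -> bool:
    last = turns[-1]
    if "talk to a person" in last["text"].lower():
        return True                                      # explicit request
    recent = turns[-2:]
    if len(recent) == 2 and all(t["sentiment"] < 0 for t in recent):
        return True                                      # sentiment degradation
    if sum(1 for t in turns if t["intent"] == "unknown") >= MAX_FAILED_INTENTS:
        return True                                      # repeated intent failures
    if len(recent) == 2 and all(t["confidence"] < MIN_CONFIDENCE for t in recent):
        return True                                      # confidence collapse
    return False
```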

Context Transfer

The handoff payload should include: full conversation transcript, detected intent and confidence, all extracted entities (account number, order ID, issue category), a one-sentence summary generated by the AI (“Customer is asking about a delayed shipment for order 4821, has been waiting 7 days, tone is frustrated”), and the specific failure reason that triggered handoff. Agents who receive this context resolve issues 35% faster than those who start from scratch.
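The payload itself is just a structured object assembled from session state. Field names here are illustrative; in a real system the one-sentence summary comes from an LLM call over the transcript.

```python
# Handoff payload sketch: everything the human agent needs in one object.
from dataclasses import dataclass

@dataclass
class HandoffPayload:
    transcript: list       # full conversation history
    intent: str            # last detected intent
    confidence: float
    entities: dict         # account number, order ID, issue category, ...
    summary: str           # one-sentence AI-generated summary
    failure_reason: str    # which signal triggered the escalation

def build_handoff(session: dict) -> HandoffPayload:
    return HandoffPayload(
        transcript=session["history"],
        intent=session["intent"],
        confidence=session["confidence"],
        entities=session["entities"],
        summary=session["summary"],
        failure_reason=session["failure_reason"],
    )
```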

The Training Data Flywheel

The most underappreciated architectural component is the feedback loop. Every production conversation generates training data: successful completions validate good responses, handoff conversations reveal gaps, low-confidence turns identify weak spots in the intent taxonomy. Most teams ignore this goldmine.

Build the pipeline to capture this data automatically. Tag conversations by outcome (resolved, escalated, abandoned). Feed successful resolutions back as positive examples for fine-tuning. Route escalated conversations through a review queue where human agents label the correct intent and ideal response. This flywheel compounds. Each month’s production traffic improves next month’s model accuracy. Teams running this loop see classification accuracy improve 2-3 percentage points per quarter without manual annotation effort.
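The routing step of that pipeline can be sketched as a simple dispatch on outcome. Queue names and labels are illustrative assumptions.

```python
# Feedback-loop routing sketch: tag each conversation by outcome and
# send it to the right downstream destination.
def route_for_training(conversation: dict) -> str:
    outcome = conversation["outcome"]  # "resolved" | "escalated" | "abandoned"
    if outcome == "resolved":
        return "fine_tune_positive"    # feed back as a positive example
    if outcome == "escalated":
        return "human_review_queue"    # agent labels correct intent/response
    return "failure_analysis"          # abandoned: inspect the drop-off point
```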

The architecture that makes conversational AI work in production is not about choosing the right LLM. It is about the pipeline stages, latency budget management, guardrail layers, and feedback loops around the model. Investing in NLP engineering for these structural components pays off far more than upgrading the model version. A well-architected system with a mid-tier model will outperform a poorly architected system with the best model available. Every time. Serverless inference patterns help manage the bursty traffic patterns typical of conversational workloads without over-provisioning GPU capacity.

Build Conversational AI That Works in Production

The chatbot demo always impresses. Production is where it falls apart. Metasphere architects voice and chat systems with the latency budgets, guardrail layers, and channel abstraction that separate impressive demos from reliable products.


Frequently Asked Questions

What is an acceptable end-to-end latency for voice-based conversational AI?


Under 500 milliseconds feels natural and conversational. Between 500ms and 1.2 seconds feels sluggish but tolerable. Over 2 seconds feels broken and users start talking over the system. The largest latency contributor is typically the LLM inference step at 200-800ms. Streaming the first tokens of the TTS response while the LLM is still generating reduces perceived latency by 40-60%, bringing most systems under the 500ms perceptual threshold.

When should you use intent classification versus LLM-native conversation management?


Intent classifiers (Rasa, Dialogflow) work well for bounded domains with under 50 distinct intents and predictable user flows like order tracking, appointment scheduling, or FAQ retrieval. LLM-native management handles open-ended domains, complex multi-turn reasoning, and situations where users frequently go off-script. The hybrid approach wins in production: use intent classification for high-volume predictable paths (80% of traffic) and route the remaining 20% to an LLM handler.

How do you prevent hallucination in customer-facing conversational AI?


Layer three controls: a retrieval-augmented generation pipeline that grounds responses in verified content, a deterministic fact-checker that validates claims against a structured knowledge base before sending, and topic boundary enforcement that rejects prompts outside the defined domain. Systems using all three layers achieve under 2% hallucination rates in production, compared to 15-20% for unguarded LLM responses.

What is the best architecture for supporting multiple channels from a single conversational AI system?


Build a channel abstraction layer that normalizes all inbound messages to a common format (text, sender ID, session ID, channel metadata) and all outbound messages to a channel-agnostic response object. Each channel adapter (web chat, WhatsApp, Slack, SMS, voice) handles serialization, media constraints, and platform-specific features. This lets the dialog engine remain channel-unaware. Teams using this pattern support 5+ channels with under 200 lines of adapter code per channel.

When should a conversational AI system hand off to a human agent?


Trigger handoff on three signals: explicit user request (always honor immediately), sentiment degradation (two consecutive negative-sentiment turns), or confidence collapse (model confidence below 0.4 on three consecutive turns). Transfer the full conversation transcript, detected intent, and any extracted entities to the human agent. The median handoff rate for well-tuned production systems is 8-12% of conversations, and systems that transfer context reduce human agent resolution time by 35%.