Conversational AI: Voice and Chat Architecture
You built a conversational AI demo. The product manager asked it a question, it responded in a second, the answer was accurate, everyone applauded. Ship it. Then you connected it to the actual telephony system, added the speech-to-text pipeline, routed it through the dialog manager, generated a response, synthesized speech, and played it back. The user waited 4.5 seconds of dead air. They said “hello?” twice. Then they pressed zero for a human agent.
Picture a simultaneous interpreter: they heard the question, thought about it, translated it in their head, double-checked the phrasing, and finally opened their mouth. By then the speaker had moved on. The demo was a hit. Production was a disaster. The gap between them is entirely architectural.
- The latency budget for voice is 500ms end-to-end. STT, dialog management, LLM inference, TTS. Exceed it and users press zero.
- Intent classifiers are predictable but brittle. LLMs are flexible but hallucinate. The hybrid approach routes the predictable majority through intents and the long tail to the LLM.
- Streaming at every pipeline stage transforms perceived latency. Start TTS on the first sentence while the LLM is still generating the second. Start translating before the speaker finishes.
- Guardrails at the output layer are non-negotiable. The model will confidently state refund policies that don’t exist. Fact-check before delivery.
- Human handoff architecture determines customer experience. If the agent repeats questions the bot already asked, the handoff failed. The interpreter left the room without passing their notes.
The W3C Web Speech API and OpenAI Realtime API are shaping browser-based speech interfaces. Building AI systems that hold up in production means solving architecture first, model selection second.
Intent Classification vs LLM-Native Dialog
Intent classifiers respond in under 50ms. The interpreter who knows “where’s the bathroom?” by heart. Anything in the known set of questions gets a clean, predictable answer. Anything outside it gets a blank stare.
LLM-native dialog handles ambiguity, new requests, and multi-turn reasoning well. The interpreter handling a nuanced philosophical question. Also slower, more expensive, and will occasionally tell a customer that your return policy is 90 days when it’s 30. Confident. Wrong.
The hybrid approach routes the predictable majority through intent classification and the rest to the LLM handler. A confidence threshold (typically 0.85) controls the split. High-confidence intents like “where is my order?” go straight to slot-filling and template responses. Low-confidence or ambiguous inputs go to the LLM with retrieval grounding. The interpreter uses the phrasebook for common questions and does real translation for everything else.
| Aspect | Intent Classifier | LLM-Native | Hybrid |
|---|---|---|---|
| Response time | Under 50ms | 200-800ms | 50-800ms depending on route |
| Coverage | Fixed taxonomy (under 50 intents) | Open-ended | Full coverage |
| Hallucination risk | None (template responses) | High without guardrails | Low (LLM path has guardrails) |
| Maintenance | Re-train on new intents | Prompt engineering | Both, split by path |
| Best for | Order tracking, FAQ, scheduling | Complex reasoning, edge cases | Production at scale |
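A minimal sketch of this routing, assuming a classifier that returns an (intent, confidence) pair; `intent_handlers` and `llm_handler` are placeholders for your template path and your retrieval-grounded LLM path:

```python
CONFIDENCE_THRESHOLD = 0.85  # typical starting point; tune per domain

def route(utterance: str, classifier, intent_handlers: dict, llm_handler):
    intent, confidence = classifier.predict(utterance)
    if confidence >= CONFIDENCE_THRESHOLD and intent in intent_handlers:
        # High-confidence known intent: slot-filling and a template response
        return intent_handlers[intent](utterance)
    # Low-confidence or out-of-taxonomy input: LLM with retrieval grounding
    return llm_handler(utterance)
```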
Voice Pipeline Latency
Voice adds four processing stages that each eat latency: ASR (speech-to-text), NLU, dialog management, and TTS (text-to-speech). Run them one after another and you get 1.5 seconds of dead air. The interpreter who waits for the full paragraph before starting. Pipeline them with streaming at every stage and perceived latency drops below 500ms. Start translating sentence one while sentence two is still being spoken.
| Pipeline Stage | Latency | Optimization |
|---|---|---|
| End-of-speech (VAD) | 200-400ms | Tune silence threshold per use case |
| ASR (speech to text) | 80-150ms | Streaming ASR sends partial results before user finishes |
| NLU + Dialog | 100-300ms | Intent classifier: 20ms. LLM: 200-800ms. Route accordingly |
| TTS (text to speech) | 80-150ms first chunk | Streaming TTS starts audio from first sentence |
| Total (sequential) | 1,500ms+ | Feels broken. Users hang up |
| Total (pipelined) | Under 500ms perceived | Feels instant. System responds while user finishes |
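What the pipelined row looks like in practice, as a sketch: buffer streaming LLM output and hand each completed sentence to TTS immediately. Here `token_stream` stands in for any streaming LLM client and `tts_speak` for a streaming TTS call; the sentence boundary is deliberately naive.

```python
import re

SENTENCE_END = re.compile(r"[.!?]\s")  # naive boundary; real systems segment smarter

def stream_llm_to_tts(token_stream, tts_speak):
    """Hand each completed sentence to TTS while the LLM is still generating."""
    buffer = ""
    for chunk in token_stream:              # chunks arrive as the LLM generates
        buffer += chunk
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            tts_speak(sentence)             # audio starts on sentence one
    if buffer.strip():
        tts_speak(buffer)                   # flush the final partial sentence
```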
Context Management Across Turns
“Book a flight to London.” “Actually, make it Paris.” The system must track that “it” refers to the destination. Coreference resolution in casual speech is hard. Users don’t speak in structured queries. They speak like humans. The interpreter has to remember that “the other thing” means “Paris” from three sentences ago.
For intent-based flows, context lives in a session object: slots (destination: Paris, date: pending), dialog state (current step, next step), and turn history. For LLM-native flows, context is a sliding window of recent turns with summarized older turns and extracted memory facts (prefers aisle seat, frequent London route).
The critical design choice: how long to retain context. Too short, and the bot forgets what the user said three turns ago. Goldfish memory. Too long, and irrelevant context pollutes the LLM’s attention. Five verbatim turns plus a compressed summary of earlier history strikes the right balance for most production deployments.
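One way to implement that balance, sketched with a placeholder `summarize` callable standing in for the compression step:

```python
from dataclasses import dataclass, field

MAX_VERBATIM_TURNS = 5  # keep the five most recent turns word-for-word

@dataclass
class SessionContext:
    turns: list = field(default_factory=list)    # recent (role, text) pairs
    summary: str = ""                            # compressed older history
    facts: list = field(default_factory=list)    # e.g. "prefers aisle seat"

    def add_turn(self, role: str, text: str, summarize) -> None:
        self.turns.append((role, text))
        if len(self.turns) > MAX_VERBATIM_TURNS:
            evicted = self.turns.pop(0)
            self.summary = summarize(self.summary, evicted)  # compress, don't drop

    def prompt_context(self) -> str:
        recent = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"Summary: {self.summary}\nKnown facts: {'; '.join(self.facts)}\n{recent}"
```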
Channel Abstraction
Channel adapters at the edge. Normalized message bus in the middle. Channel-unaware dialog engine at the core. The interpreter works the same way whether they’re at a podium, on a phone call, or typing in a chat window. Different delivery. Same translation.
Each adapter converts platform-specific webhooks and message formats into a standard internal format. The dialog engine processes that format and returns a channel-agnostic response.
```python
# Channel adapter: normalize WhatsApp webhook to canonical format
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class CanonicalMessage:
    text: str
    sender_id: str
    session_id: str
    channel: str
    supports_buttons: bool
    max_length: int

class WhatsAppAdapter:
    def normalize(self, webhook: dict) -> CanonicalMessage:
        return CanonicalMessage(
            text=webhook["messages"][0]["text"]["body"],
            sender_id=webhook["messages"][0]["from"],
            session_id=f"wa-{webhook['messages'][0]['from']}",
            channel="whatsapp",
            supports_buttons=True,
            max_length=4096,
        )

    def serialize(self, response: CanonicalResponse) -> dict:
        # Convert to WhatsApp Cloud API format
        ...
```
The outbound adapter converts the response back into the platform's format, handling media capabilities (WhatsApp supports buttons, SMS does not), character limits, and rich card formats. Channel metadata lets the dialog engine adjust behavior without coupling to platform details. Teams using this pattern support 5+ channels with under 200 lines of adapter code per channel.
Get the abstraction right: one day per new channel. Get it wrong: one month. UX engineering as much as backend plumbing.
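A sketch of that outbound shaping, shared across adapters; the field names on `response` are illustrative, not a real API:

```python
def shape_for_channel(response, supports_buttons: bool, max_length: int) -> dict:
    text = response.text
    if response.buttons and not supports_buttons:
        # SMS-style fallback: fold button labels into numbered plain text
        options = "\n".join(f"{i}. {label}" for i, label in enumerate(response.buttons, 1))
        text = f"{text}\n{options}"
    if len(text) > max_length:
        text = text[: max_length - 1] + "…"  # truncate to the channel limit
    return {"text": text}
```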
| Channel | Supports Buttons | Max Length | Rich Cards | Voice |
|---|---|---|---|---|
| Web chat | Yes | Unlimited | Yes | No |
| WhatsApp | Yes (3 max) | 4,096 chars | Yes | Audio messages |
| SMS | No | 160 chars | No | No |
| Slack | Yes | 3,000 chars | Yes (blocks) | No |
| Voice (IVR) | No (DTMF only) | N/A | No | Yes |
Guardrails for Customer-Facing AI
The model will confidently state refund policies that don’t exist. It will quote prices that changed last quarter. It will promise delivery timelines that no fulfillment system can honor. The interpreter who makes up facts because they sound plausible. Three layers prevent this:
Retrieval grounding. Every factual claim must trace to a source document. RAG pulls relevant documents from a verified knowledge base, and the model is told to use only what’s in those sources. This alone cuts hallucination rates sharply.
Fact checking. A deterministic post-processor checks claims against real data. If the model says “your order ships in 2 days” but the order record shows 5 days, the response gets blocked and regenerated. The interpreter’s editor catches the wrong number before it reaches the audience.
Topic boundaries. A classifier rejects prompts outside the defined domain. Customer asks the support bot about stock prices? The system responds with a redirect rather than letting the model improvise. Stay in your lane. Dedicated guardrail frameworks provide configurable boundary enforcement.
Don’t: Ship the LLM response directly to the user with only a system prompt as guardrail. System prompts are suggestions, not enforcement. The model will break them under tricky or unexpected inputs. Asking the interpreter to “please don’t make things up” is not a quality control system.
Do: Treat guardrails as infrastructure, not prompts. RAG grounding, fact checking, and topic boundaries are separate pipeline stages with their own failure modes and monitoring.
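As a sketch, with placeholder callables for the retrieval store (`retrieve`), the grounded LLM call (`generate`), the deterministic checker (`fact_check`), and the topic classifier (`in_domain`):

```python
FALLBACK = "Let me connect you with someone who can confirm that for you."

def guarded_reply(user_msg: str, retrieve, generate, fact_check, in_domain) -> str:
    # Topic boundary: reject out-of-domain prompts before spending tokens
    if not in_domain(user_msg):
        return "I can help with orders, billing, and support questions."
    # Retrieval grounding: the model sees only verified source documents
    sources = retrieve(user_msg)
    draft = generate(user_msg, sources)
    # Fact check: deterministic comparison against real records
    if fact_check(draft):
        return draft
    draft = generate(user_msg, sources)              # blocked: regenerate once
    return draft if fact_check(draft) else FALLBACK  # then fall back to a template
```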
Brand voice adds a fourth, optional layer. A fine-tuned classifier scores responses against brand tone guidelines before sending. Below threshold, the response regenerates or falls back to a template. Not needed everywhere, but critical for consumer-facing brands where tone consistency matters as much as accuracy.
Handoff to Human Agents
Trigger handoff on three signals: an explicit user request (always honor it immediately), sentiment declining across two consecutive turns, or confidence collapse (model confidence below 0.4 on three consecutive turns). Every other trigger is optional. These three are not.
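As a sketch, with illustrative field names on the turn records:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    asked_for_human: bool  # explicit request detected this turn
    sentiment: float       # higher = more positive
    confidence: float      # model confidence for this turn

def handoff_reason(turns: list) -> str | None:
    if turns[-1].asked_for_human:
        return "user_request"                  # always honor immediately
    if len(turns) >= 2 and turns[-1].sentiment < turns[-2].sentiment:
        return "sentiment_decline"             # worsening across consecutive turns
    if len(turns) >= 3 and all(t.confidence < 0.4 for t in turns[-3:]):
        return "confidence_collapse"           # below 0.4, three turns running
    return None
```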
The handoff itself is where most systems fail. If the human agent says “Can you tell me what the issue is?” to a customer who just spent five turns explaining it to the bot, trust is gone. The interpreter left the room without passing their notes. Transfer the full transcript, extracted entities, detected intent, AI-generated summary, and the specific reason the escalation fired. Agents with full context resolve issues noticeably faster. Autonomous AI agents may tolerate more ambiguity before escalating, but the handoff contract stays the same: pass the notes.
- Conversation transcript transfer to agent console verified end-to-end
- Entity extraction populates agent screen before first human message
- Sentiment detection model tuned against domain-specific language
- Confidence threshold calibrated on 1,000+ sample conversations
- Fallback template responses cover the top 10 handoff scenarios
The Training Data Flywheel
Every conversation is training data. Tag each one by outcome: resolved, escalated, abandoned. Feed resolved conversations back as positive examples. Route escalations through a review queue where human agents annotate what the bot should have said. This flywheel compounds: accuracy gains stack each quarter without a separate annotation project.
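A minimal sketch of the outcome routing; the queue names are illustrative:

```python
positive_examples: list = []  # resolved: direct positive training signal
review_queue: list = []       # escalated: agents annotate the ideal reply
mining_queue: list = []       # abandoned: mined for failure patterns

def route_for_training(conversation) -> None:
    bucket = {
        "resolved": positive_examples,
        "escalated": review_queue,
        "abandoned": mining_queue,
    }[conversation.outcome]
    bucket.append(conversation)
```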
The insight most teams miss: abandoned conversations are more valuable than escalated ones. An escalation means the system knew its limits. An abandonment means the user gave up before the system did. Mining abandoned conversations for failure patterns shows the blind spots that neither the bot nor the handoff logic caught. The customers who leave without complaining. They don’t come back.
What the Industry Gets Wrong About Conversational AI
“A better model fixes the latency.” Faster inference helps, but the model is one stage in a five-stage pipeline (STT, NLU, LLM, TTS, audio playback). A model that responds in 200ms instead of 400ms saves 200ms. Streaming the entire pipeline saves 800ms+. Architecture beats model speed every time. A faster interpreter doesn’t help if the microphone and speakers add three seconds.
“Build one bot, deploy to every channel.” The dialog logic should be channel-agnostic. The response formatting can’t be. WhatsApp supports buttons and carousels. SMS supports 160 characters of plain text. Voice has no visual options at all. “One bot, every channel” is a correct principle applied incorrectly when it means identical responses everywhere. Same interpreter, different audiences. Adjust the delivery.
That 4.5 seconds of dead air from the demo-to-production gap? With streaming at every stage, channel abstraction handling format differences, and three-layer guardrails catching hallucinations before delivery, the same system responds in under a second. The interpreter is fast, accurate, and knows when to hand off. Engineering the NLP pipeline architecture pays more than any model upgrade.