
Conversational AI: Voice and Chat Architecture

Metasphere Engineering · 14 min read

You built a conversational AI demo. The product manager asked it a question, it responded in a second, the answer was accurate, everyone applauded. Ship it. Then you connected it to the actual telephony system, added the speech-to-text pipeline, routed it through the dialog manager, generated a response, synthesized speech, and played it back. The user waited 4.5 seconds of dead air. They said “hello?” twice. Then they pressed zero for a human agent.

Picture a simultaneous interpreter who hears the question, thinks it over, translates it in their head, double-checks the phrasing, and finally opens their mouth. By then the speaker has moved on. The demo was a hit. Production was a disaster. The gap between them is entirely architectural.

Key takeaways
  • The latency budget for voice is 500ms end-to-end across STT, dialog management, LLM inference, and TTS. Exceed it and users press zero.
  • Intent classifiers are predictable but brittle. LLMs are flexible but hallucinate. The hybrid approach routes the predictable majority through intents and the long tail to the LLM.
  • Streaming at every pipeline stage transforms perceived latency. Start TTS on the first sentence while the LLM is still generating the second. Start translating before the speaker finishes.
  • Guardrails at the output layer are non-negotiable. The model will confidently state refund policies that don’t exist. Fact-check before delivery.
  • Human handoff architecture determines customer experience. If the agent repeats questions the bot already asked, the handoff failed. The interpreter left the room without passing their notes.

The W3C Web Speech API and OpenAI Realtime API are shaping browser-based speech interfaces. Building AI systems that hold up in production means solving architecture first, model selection second.

Intent Classification vs LLM-Native Dialog

Intent classifiers respond in under 50ms. The interpreter who knows “where’s the bathroom?” by heart. Anything in the known set of questions gets a clean, predictable answer. Anything outside it gets a blank stare.

LLM-native dialog handles ambiguity, new requests, and multi-turn reasoning well. The interpreter handling a nuanced philosophical question. Also slower, more expensive, and will occasionally tell a customer that your return policy is 90 days when it’s 30. Confident. Wrong.

The hybrid approach routes the predictable majority through intent classification and the rest to the LLM handler. A confidence threshold (typically 0.85) controls the split. High-confidence intents like “where is my order?” go straight to slot-filling and template responses. Low-confidence or ambiguous inputs go to the LLM with retrieval grounding. The interpreter uses the phrasebook for common questions and does real translation for everything else.
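The routing decision itself is small. A minimal sketch, where a toy keyword lookup stands in for a trained intent classifier and all names are illustrative:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # typical production split point

@dataclass
class IntentResult:
    intent: str
    confidence: float

class KeywordClassifier:
    """Toy stand-in for a trained intent classifier."""
    INTENTS = {"where is my order": "order_status",
               "cancel my order": "order_cancel"}

    def classify(self, utterance: str) -> IntentResult:
        key = utterance.lower().strip("?!. ")
        if key in self.INTENTS:
            return IntentResult(self.INTENTS[key], 0.97)
        return IntentResult("unknown", 0.30)

def route(utterance: str, classifier: KeywordClassifier) -> str:
    result = classifier.classify(utterance)
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return f"intent:{result.intent}"  # slot-filling + template path
    return "llm"                          # retrieval-grounded LLM path

clf = KeywordClassifier()
route("Where is my order?", clf)                        # "intent:order_status"
route("Explain your sustainability commitments.", clf)  # "llm"
```

The threshold is the tuning knob: raise it and more traffic pays the LLM tax, lower it and more off-script requests get a wrong template.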

| Aspect | Intent Classifier | LLM-Native | Hybrid |
| --- | --- | --- | --- |
| Response time | Under 50ms | 200-800ms | 50-800ms depending on route |
| Coverage | Fixed taxonomy (under 50 intents) | Open-ended | Full coverage |
| Hallucination risk | None (template responses) | High without guardrails | Low (LLM path has guardrails) |
| Maintenance | Re-train on new intents | Prompt engineering | Both, split by path |
| Best for | Order tracking, FAQ, scheduling | Complex reasoning, edge cases | Production at scale |

Voice Pipeline Latency

Voice adds four processing stages that each eat latency: ASR (speech-to-text), NLU, dialog management, and TTS (text-to-speech). Run them one after another and you get 1.5 seconds of dead air. The interpreter who waits for the full paragraph before starting. Pipeline them with streaming at every stage and perceived latency drops below 500ms. Start translating sentence one while sentence two is still being spoken.

[Figure: Voice pipeline, speech to response in under 2 seconds. User speaks; VAD detects end of speech; streaming ASR emits partial text; NLU + dialog performs intent and slot filling with context from history; response generation; streaming TTS; audio output. Each stage streams to the next, so the user hears the response before TTS finishes generating it.]
| Pipeline Stage | Latency | Optimization |
| --- | --- | --- |
| End-of-speech (VAD) | 200-400ms | Tune silence threshold per use case |
| ASR (speech to text) | 80-150ms | Streaming ASR sends partial results before user finishes |
| NLU + Dialog | 100-300ms | Intent classifier: 20ms. LLM: 200-800ms. Route accordingly |
| TTS (text to speech) | 80-150ms first chunk | Streaming TTS starts audio from first sentence |
| Total (sequential) | 1,500ms+ | Feels broken. Users hang up |
| Total (pipelined) | <500ms perceived | Feels instant. System responds while user finishes |
The Dead Air Threshold

Up to 500ms of silence feels natural in conversation. Between 500ms and 1.2 seconds, the user notices but tolerates. Beyond 2 seconds, they assume the system is broken. This threshold is set by human perception, not engineering. You can’t negotiate with it. The interpreter who pauses too long loses the room. Every architectural decision in the voice pipeline exists to stay under this line.
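Sentence-level streaming is the core trick. A minimal sketch, with stub functions standing in for real streaming LLM and TTS APIs: each sentence is handed to TTS the moment it completes, while later tokens are still arriving.

```python
import re

def llm_tokens():
    # Stub: a streaming LLM would yield tokens as they are generated
    for tok in "Your order shipped today. It arrives Thursday.".split(" "):
        yield tok + " "

def sentences(token_stream):
    """Buffer tokens and emit each sentence the moment it completes."""
    buf = ""
    for tok in token_stream:
        buf += tok
        while m := re.search(r"(.+?[.!?])\s+", buf):
            yield m.group(1)        # hand this sentence to TTS now
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()

def synthesize(sentence: str) -> bytes:
    return sentence.encode()        # stub for a streaming TTS call

# The first audio chunk is playable before the second sentence exists
audio_chunks = [synthesize(s) for s in sentences(llm_tokens())]
```

In a real pipeline each stage runs concurrently (async tasks or queues); the generator chain here shows only the ordering guarantee that makes perceived latency collapse.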

Context Management Across Turns

“Book a flight to London.” “Actually, make it Paris.” The system must track that “it” refers to the destination. Coreference resolution in casual speech is hard. Users don’t speak in structured queries. They speak like humans. The interpreter has to remember that “the other thing” means “Paris” from three sentences ago.

For intent-based flows, context lives in a session object: slots (destination: Paris, date: pending), dialog state (current step, next step), and turn history. For LLM-native flows, context is a sliding window of recent turns with summarized older turns and extracted memory facts (prefers aisle seat, frequent London route).
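A minimal sketch of the session object for the intent-based path, with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    intent: str = ""
    slots: dict = field(default_factory=dict)
    dialog_state: str = "start"   # current step in the flow
    history: list = field(default_factory=list)

s = Session(intent="book_flight")
s.history.append(("user", "Book a flight to London."))
s.slots["destination"] = "London"
# "Actually, make it Paris." -> coreference resolves to the destination slot
s.history.append(("user", "Actually, make it Paris."))
s.slots["destination"] = "Paris"
```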

Conversational AI context management across turns with session state and memory tiersThree layers of context: session state tracks slots and intent within a conversation, conversation history provides recent turns for coherence, and long-term memory stores user preferences across sessions. Each feeds into the prompt assembly for the next response.Context Management: Three Memory TiersSession StateCurrent intent: book_flightSlots: origin=NYC, dest=LAXMissing: date, passengersLives within one conversationConversation HistoryLast 5-10 turnsSummarized older turnsToken budget: 2-4K maxCoherence across turnsLong-Term MemoryUser preferencesPast interactions summaryVector-indexed retrievalPersists across sessionsPrompt AssemblySystem prompt + context tiers + user message70% of input tokens are context. The user query is under 2%.

The critical design choice: how long to retain context. Too short, and the bot forgets what the user said three turns ago. Goldfish memory. Too long, and irrelevant context pollutes the LLM’s attention. Five verbatim turns plus a compressed summary of earlier history strikes the right balance for most production deployments.
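The trimming logic itself is small. A sketch, with a stub standing in for the LLM summarization call:

```python
def summarize(turns: list[str]) -> str:
    # Stub for an LLM summarization call over the older turns
    return f"[{len(turns)} earlier turns summarized]"

def build_context(history: list[str], keep_verbatim: int = 5) -> dict:
    """Keep the last N turns verbatim; compress everything older."""
    older, recent = history[:-keep_verbatim], history[-keep_verbatim:]
    return {"summary": summarize(older) if older else "",
            "recent_turns": recent}
```

With an eight-turn history and the default window, the first three turns collapse into the summary and the last five stay verbatim.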

Channel Abstraction

Channel adapters at the edge. Normalized message bus in the middle. Channel-unaware dialog engine at the core. The interpreter works the same way whether they’re at a podium, on a phone call, or typing in a chat window. Different delivery. Same translation.

Each adapter converts platform-specific webhooks and message formats into a standard internal format. The dialog engine processes that format and returns a channel-agnostic response.

# Channel adapter: normalize WhatsApp webhook to canonical format
from dataclasses import dataclass

@dataclass
class CanonicalMessage:
    text: str
    sender_id: str
    session_id: str
    channel: str
    supports_buttons: bool
    max_length: int

class WhatsAppAdapter:
    def normalize(self, webhook: dict) -> CanonicalMessage:
        msg = webhook["messages"][0]
        return CanonicalMessage(
            text=msg["text"]["body"],
            sender_id=msg["from"],
            session_id=f"wa-{msg['from']}",
            channel="whatsapp",
            supports_buttons=True,
            max_length=4096,
        )

    def serialize(self, response: "CanonicalResponse") -> dict:
        # Convert to WhatsApp Cloud API format
        ...

The outbound adapter turns the response back to the platform’s format, handling media limits (WhatsApp supports buttons, SMS does not), character limits, and rich card formats. Channel metadata lets the dialog engine adjust behavior without coupling to platform details. Teams using this pattern support 5+ channels with under 200 lines of adapter code per channel.
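The outbound direction can be sketched the same way. Here is a hypothetical SMS adapter that folds buttons into numbered reply options and splits at the 160-character limit; CanonicalResponse is an assumed shape, mirroring the inbound message format above:

```python
from dataclasses import dataclass, field
import textwrap

@dataclass
class CanonicalResponse:
    text: str
    buttons: list = field(default_factory=list)

class SMSAdapter:
    MAX_LENGTH = 160  # plain-text SMS segment limit

    def serialize(self, response: CanonicalResponse) -> list[str]:
        text = response.text
        if response.buttons:
            # SMS has no buttons: fold them into numbered reply options
            opts = " ".join(f"{i + 1}) {b}" for i, b in enumerate(response.buttons))
            text = f"{text} Reply with: {opts}"
        return textwrap.wrap(text, self.MAX_LENGTH)

sms = SMSAdapter()
parts = sms.serialize(CanonicalResponse("Your order shipped.", ["Track", "Help"]))
```

The dialog engine emitted one response; the adapter decided it becomes numbered text options rather than tappable buttons.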

Get the abstraction right and a new channel takes a day. Get it wrong and it takes a month. This is UX engineering as much as backend plumbing.

| Channel | Supports Buttons | Max Length | Rich Cards | Voice |
| --- | --- | --- | --- | --- |
| Web chat | Yes | Unlimited | Yes | No |
| WhatsApp | Yes (3 max) | 4,096 chars | Yes | Audio messages |
| SMS | No | 160 chars | No | No |
| Slack | Yes | 3,000 chars | Yes (blocks) | No |
| Voice (IVR) | No (DTMF only) | N/A | No | Yes |

Guardrails for Customer-Facing AI

The model will confidently state refund policies that don’t exist. It will quote prices that changed last quarter. It will promise delivery timelines that no fulfillment system can honor. The interpreter who makes up facts because they sound plausible. Three layers prevent this:

Retrieval grounding. Every factual claim must trace to a source document. RAG pulls relevant documents from a verified knowledge base, and the model is told to use only what’s in those sources. This alone cuts hallucination rates sharply.

Fact checking. A deterministic post-processor checks claims against real data. If the model says “your order ships in 2 days” but the order record shows 5 days, the response gets blocked and regenerated. The interpreter’s editor catches the wrong number before it reaches the audience.
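A sketch of one such check, using the shipping example above; the regex and order-record shape are illustrative:

```python
import re

def check_shipping_claim(response: str, order: dict) -> bool:
    """Return True if the response's shipping claim matches the order record."""
    m = re.search(r"ships in (\d+) days?", response)
    if m is None:
        return True  # no checkable claim in this response
    return int(m.group(1)) == order["ship_days"]

order = {"ship_days": 5}
check_shipping_claim("Your order ships in 2 days.", order)  # False: block and regenerate
check_shipping_claim("Your order ships in 5 days.", order)  # True: deliver
```

The point is that this layer is deterministic: no model judges the model. A production version would run one such extractor per claim type (dates, prices, order status).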

Topic boundaries. A classifier rejects prompts outside the defined domain. Customer asks the support bot about stock prices? The system responds with a redirect rather than letting the model improvise. Stay in your lane. Dedicated guardrail frameworks provide configurable boundary enforcement.
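The control flow is simple even if the classifier is not. A toy keyword gate standing in for a trained topic classifier:

```python
IN_DOMAIN = {"order", "refund", "shipping", "return", "delivery"}

def in_scope(utterance: str) -> bool:
    words = set(utterance.lower().strip("?!.").split())
    return bool(words & IN_DOMAIN)

def gate(utterance: str) -> str:
    if not in_scope(utterance):
        return "redirect"  # canned template: state the domain, offer a human
    return "llm"           # proceed to the grounded LLM path

gate("What's your stock price today?")  # "redirect"
gate("Where is my shipping update?")    # "llm"
```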

Anti-pattern

Don’t: Ship the LLM response directly to the user with only a system prompt as guardrail. System prompts are suggestions, not enforcement. The model will break them under tricky or unexpected inputs. Asking the interpreter to “please don’t make things up” is not a quality control system.

Do: Treat guardrails as infrastructure, not prompts. RAG grounding, fact checking, and topic boundaries are separate pipeline stages with their own failure modes and monitoring.

Brand voice adds a fourth, optional layer. A fine-tuned classifier scores responses against brand tone guidelines before sending. Below threshold, the response regenerates or falls back to a template. Not needed everywhere, but critical for consumer-facing brands where tone consistency matters as much as accuracy.

Handoff to Human Agents

Trigger handoff on three signals: explicit user request (always honor right away), sentiment getting worse over two consecutive turns, or confidence collapse (model confidence below 0.4 on three turns). Every other trigger is optional. These three are not.
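The three triggers can be sketched as one predicate; the thresholds follow the text, everything else is illustrative:

```python
def should_handoff(explicit_request: bool,
                   sentiment: list[float],
                   confidence: list[float]) -> bool:
    if explicit_request:
        return True  # always honor immediately
    # Sentiment getting worse over two consecutive turns
    worsening = len(sentiment) >= 3 and sentiment[-1] < sentiment[-2] < sentiment[-3]
    # Confidence collapse: below 0.4 on three turns running
    collapsed = len(confidence) >= 3 and all(c < 0.4 for c in confidence[-3:])
    return worsening or collapsed
```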

The handoff itself is where most systems fail. If the human agent says “Can you tell me what the issue is?” to a customer who just spent five turns explaining it to the bot, trust is gone. The interpreter left the room without passing their notes. Transfer the full transcript, extracted entities, detected intent, AI-generated summary, and the specific reason the escalation fired. Agents with full context resolve issues noticeably faster. Autonomous AI agents may tolerate more ambiguity before escalating, but the handoff contract stays the same: pass the notes.

Prerequisites
  1. Conversation transcript transfer to agent console verified end-to-end
  2. Entity extraction populates agent screen before first human message
  3. Sentiment detection model tuned against domain-specific language
  4. Confidence threshold calibrated on 1,000+ sample conversations
  5. Fallback template responses cover the top 10 handoff scenarios

The Training Data Flywheel

Every conversation is training data. Tag by outcome: resolved, escalated, abandoned. Feed resolved conversations as positive examples. Route escalations through a review queue where human agents annotate what the bot should have said. This flywheel compounds. Accuracy gains stack each quarter without manual annotation.
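A sketch of the outcome tagging, with tag names taken from the text and destinations illustrative:

```python
def route_for_training(conversation: dict) -> str:
    outcome = conversation["outcome"]  # tagged at conversation close
    if outcome == "resolved":
        return "positive_examples"   # feed back as-is
    if outcome == "escalated":
        return "review_queue"        # agents annotate what the bot should have said
    if outcome == "abandoned":
        return "failure_mining"      # the user gave up before the system did
    raise ValueError(f"unknown outcome: {outcome}")
```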

The insight most teams miss: abandoned conversations are more valuable than escalated ones. An escalation means the system knew its limits. An abandonment means the user gave up before the system did. Mining abandoned conversations for failure patterns shows the blind spots that neither the bot nor the handoff logic caught. The customers who leave without complaining. They don’t come back.

What the Industry Gets Wrong About Conversational AI

“A better model fixes the latency.” Faster inference helps, but the model is one stage in a five-stage pipeline (STT, NLU, LLM, TTS, audio playback). A model that responds in 200ms instead of 400ms saves 200ms. Streaming the entire pipeline saves 800ms+. Architecture beats model speed every time. A faster interpreter doesn’t help if the microphone and speakers add three seconds.

“Build one bot, deploy to every channel.” The dialog logic should be channel-agnostic. The response formatting can’t be. WhatsApp supports buttons and carousels. SMS supports 160 characters of plain text. Voice has no visual options at all. “One bot, every channel” is a correct principle applied incorrectly when it means identical responses everywhere. Same interpreter, different audiences. Adjust the delivery.

Our take

Invest in the streaming pipeline before the model. A mediocre model with streaming end-to-end feels faster than the best model with sequential processing. Users don’t judge response quality independently of response speed. The first word arriving in 300ms buys the rest of the response time to be slow. The interpreter who starts speaking quickly sounds more competent than the one who waits too long, even if the translation is identical. Latency perception is set by the first token, not the last.

That 4.5 seconds of dead air from the demo-to-production gap? With streaming at every stage, channel abstraction handling format differences, and three-layer guardrails catching hallucination before delivery, the same system responds in under a second. The interpreter is fast, accurate, and knows when to hand off. Engineering the pipeline architecture pays more than any model upgrade.

Your Chatbot Demo Is Lying to You

The chatbot demo always impresses. Production is where it falls apart. Voice pipeline latency budgets under 500ms, layered hallucination guardrails, and channel abstraction across web, WhatsApp, and voice are what separate impressive demos from reliable products.


Frequently Asked Questions

What is an acceptable end-to-end latency for voice-based conversational AI?


Under 500 milliseconds feels natural. Between 500ms and 1.2 seconds feels slow but tolerable. Over 2 seconds feels broken and users start talking over the system. The biggest latency hit is usually the LLM inference step at 200-800ms. Streaming the first tokens of TTS while the LLM is still generating cuts perceived latency by more than half, pulling most systems under 500ms.

When should you use intent classification versus LLM-native conversation management?


Intent classifiers work well for bounded domains with under 50 distinct intents and predictable flows like order tracking, scheduling, or FAQ retrieval. LLM-native management handles open-ended domains, complex multi-turn reasoning, and users who go off-script. The hybrid approach wins in production: intent classification for the high-volume predictable paths, LLM for the long tail.

How do you prevent hallucination in customer-facing conversational AI?


Layer three controls: a RAG pipeline that grounds responses in verified content, a fact-checker that validates claims against a knowledge base before sending, and topic boundaries that reject prompts outside the defined domain. Systems using all three layers see very low hallucination rates in production, compared to the double-digit error rates you get from unguarded LLM responses.

What is the best architecture for supporting multiple channels from a single conversational AI system?


Build a channel abstraction layer that turns all inbound messages into a common format (text, sender ID, session ID, channel metadata) and all outbound messages into a standard response object. Each channel adapter (web chat, WhatsApp, Slack, SMS, voice) handles the format conversion, media limits, and platform quirks. The dialog engine never knows which channel it’s talking to. Teams using this pattern support 5+ channels with under 200 lines of adapter code per channel.

When should a conversational AI system hand off to a human agent?


Trigger handoff on three signals: the user asks for a human (always honor right away), sentiment getting worse over two turns in a row, or confidence collapse (model confidence below 0.4 three turns running). Transfer the full transcript, detected intent, and extracted entities to the human agent. Well-tuned systems hand off a small fraction of conversations, and passing context cuts human agent resolution time noticeably.