How voice AI actually works (and where it still struggles)

2026-05-28 · Nexora Labs · voice-ai · explainer · technical-overview

Voice AI is the technology that lets a software agent answer the phone, hold a real conversation, and act on what was said — booking an appointment, looking up a balance, escalating to a human when the call needs one. Strip away the marketing, and a working voice AI deployment is a pipeline of well-known components stitched together with careful operational discipline. This post walks through each component, where the failure modes live, and what running voice AI on real customer calls actually looks like in 2026.

The four-stage pipeline

A voice AI agent on a live phone call runs four stages in a loop, several times per turn. Each stage is its own model, its own engineering surface, and its own latency budget. Get any one of them wrong and the conversation feels off — the caller waits too long, the agent misunderstands, the response sounds robotic, or the turn-taking gets confused. Get all four right and the caller often does not notice the agent is software.

The stages are (1) automatic speech recognition, which turns the audio of the caller speaking into a stream of text tokens; (2) large language model reasoning, which interprets the text, decides what to do, and generates a response; (3) text-to-speech synthesis, which converts the response text into audio; and (4) turn-taking control, which decides when the agent should speak, when it should pause to let the caller finish, and when it should interrupt or be interrupted itself. Modern stacks run these in a streaming pipeline so the agent can start responding before the caller has finished speaking when the context allows it.
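The loop described above can be sketched in a few lines. This is a minimal illustration rather than any vendor's implementation: `transcribe`, `reason`, and `synthesise` are hypothetical stubs standing in for real models, text stands in for audio, and turn-taking is reduced to a single `end_of_turn` flag that a real stack would derive from the audio stream.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AudioChunk:
    """Hypothetical container: text stands in for real audio samples."""
    spoken_text: str
    end_of_turn: bool  # in a real stack the turn-taking model decides this


def transcribe(chunk: AudioChunk) -> str:
    # Stage 1 stub (ASR): a real system would run a speech model here.
    return chunk.spoken_text


def reason(transcript: str) -> str:
    # Stage 2 stub (LLM): echo-style reply so the example is deterministic.
    return f"You said: {transcript}"


def synthesise(reply: str) -> bytes:
    # Stage 3 stub (TTS): encode text as bytes in place of audio.
    return reply.encode("utf-8")


def run_call(chunks: Iterable[AudioChunk]) -> List[bytes]:
    """Accumulate partial transcripts and respond once per caller turn."""
    partial: List[str] = []
    responses: List[bytes] = []
    for chunk in chunks:
        partial.append(transcribe(chunk))  # stage 1 runs on every chunk
        if chunk.end_of_turn:              # stage 4: the caller has finished
            transcript = " ".join(partial)
            responses.append(synthesise(reason(transcript)))  # stages 2 and 3
            partial.clear()
    return responses


call = [
    AudioChunk("I need to", end_of_turn=False),
    AudioChunk("move my appointment", end_of_turn=True),
]
audio_out = run_call(call)
```

The point of the shape is that transcription runs on every chunk while reasoning and synthesis wait for the turn boundary; a production pipeline would additionally stream each stage's output into the next.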

Stage 1 — automatic speech recognition (ASR / STT)

Speech-to-text (ASR in engineering docs, STT in product docs) is the layer that turns raw audio into text the rest of the pipeline can reason over. The current generation of ASR models is transformer-based, trained on enormous multilingual corpora, and runs with sub-300-millisecond word-level latency in production. Basic transcription is not the hard part; modern models are very good at clean speech in well-known accents. The hard parts are accents outside the training distribution, background noise, overlapping speech when two people talk at once, code-switching between languages mid-sentence, and unusual proper nouns the model has never seen.

For a voice AI deployment, the ASR layer is the most accent-sensitive surface in the stack. A model that performs beautifully on US English may degrade across the broad range of New Zealand or Australian accents, and may struggle even more with regional UK accents or non-native English speakers. Custom acoustic-model fine-tuning helps. So does running a small reference set of real customer calls through the ASR layer before the pilot to measure the word-error rate the model is likely to hit in production. At Nexora the standard rehearsal pack is 50 reference calls per voice profile, so the team has concrete evidence before launch, not assumptions.
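Word-error rate is the standard way to score a rehearsal pack like this: the word-level edit distance between a human reference transcript and the ASR output, divided by the number of reference words. The sketch below pools errors across calls. The function names and the two sample transcripts are made up for illustration, and a real evaluation would usually normalise punctuation and number formats before comparing.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum substitutions, deletions and insertions turning ref into hyp."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled WER: total word errors divided by total reference words."""
    errors = 0
    reference_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        reference_words += len(ref)
    if reference_words == 0:
        raise ValueError("need at least one non-empty reference transcript")
    return errors / reference_words


# Hypothetical (human transcript, ASR output) pairs for illustration only.
rehearsal_pack = [
    ("i would like to book an appointment", "i would like to look an appointment"),
    ("my balance please", "my balance"),
]
wer = corpus_wer(rehearsal_pack)  # 2 errors over 10 reference words
```

Pooling errors over all reference words, instead of averaging per-call rates, keeps short calls from dominating the score.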

Stage 2 — LLM reasoning and tool use

Once the ASR layer has produced text, the large language model takes over. This is the stage that gets the most marketing attention and is often the least understood. The LLM is doing three jobs at once: interpreting what the caller said, deciding what to do about it, and generating the response text that will go to TTS. The "deciding what to do" piece is where tool use and function calling enter the picture. Production voice agents do not just talk; they look up balances, create tickets, update CRM records, schedule appointments, and check inventory by calling real APIs against real systems of record.

The integration layer is therefore at least as important as the model layer. A voice agent that can hold a witty conversation but cannot read your Salesforce or your helpdesk is not a useful product. The standard deployment shape today is an agent definition that includes (a) a system prompt describing the agent's role, (b) a knowledge base the agent grounds answers against (Confluence, your help centre, your product documentation), and (c) a tool registry — the catalogue of API calls the agent can make against your CRM, helpdesk, telephony provider, payments system, and so on. The agent picks the right tool at the right moment, fills in the right arguments from the conversation context, and waits for the tool result before continuing.
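As a rough sketch of that deployment shape, the three parts of an agent definition can be modelled as one small data structure with a dispatch method. Everything here is hypothetical: `AgentDefinition`, `lookup_balance`, and the in-memory `FAKE_BALANCES` table stand in for a real platform's schema and a real CRM or billing API, and the tool name and arguments that a model would emit are hard-coded.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class AgentDefinition:
    system_prompt: str              # (a) the agent's role
    knowledge_base: Dict[str, str]  # (b) grounding content, title -> text
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)  # (c)

    def register_tool(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call_tool(self, name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Run a tool the model asked for and wrap the outcome for the model."""
        tool = self.tools.get(name)
        if tool is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "result": tool(**arguments)}
        except (TypeError, KeyError) as exc:
            # Wrong or missing arguments, or a record that does not exist.
            return {"ok": False, "error": repr(exc)}


# Hypothetical system of record: an in-memory stand-in for a billing API.
FAKE_BALANCES = {"A-100": 42.50}


def lookup_balance(account_id: str) -> float:
    return FAKE_BALANCES[account_id]


agent = AgentDefinition(
    system_prompt="You are a billing assistant for a hypothetical company.",
    knowledge_base={"refund policy": "Refunds are processed within 5 working days."},
)
agent.register_tool("lookup_balance", lookup_balance)

# The model would emit the tool name and arguments; here they are hard-coded.
found = agent.call_tool("lookup_balance", {"account_id": "A-100"})
missing = agent.call_tool("lookup_balance", {"account_id": "Z-999"})
unknown = agent.call_tool("cancel_account", {})
```

Returning a structured error instead of raising lets the agent tell the caller that a lookup failed and keep the conversation going.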

Hallucination — the model confidently saying something untrue — is the most-discussed risk and the most preventable one. The fix is RAG (retrieval-augmented generation): the agent consults the knowledge base before answering, and where the knowledge base is silent, the agent defers ("Let me check with a colleague and come back to you" or "I do not have that information, can I take your contact details so we can call you back"). RAG works only as well as the source content. A poorly maintained knowledge base produces a poorly grounded agent. The work of running voice AI in production is at least as much knowledge-base hygiene as it is prompt engineering.
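The consult-then-defer behaviour can be illustrated with a toy retriever. This sketch assumes keyword overlap as the relevance score, which is a deliberate simplification; production RAG typically uses embedding search. The stopword list, the `min_overlap` threshold, and the sample knowledge base are all hypothetical.

```python
import re
from typing import Dict, Optional, Set

DEFERRAL = (
    "I do not have that information, can I take your contact details "
    "so we can call you back?"
)
STOPWORDS = {
    "a", "are", "can", "do", "i", "is", "my", "the", "to", "we",
    "what", "you", "your",
}


def keywords(text: str) -> Set[str]:
    """Lower-cased word set with common filler words removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(
    question: str, knowledge_base: Dict[str, str], min_overlap: int = 2
) -> Optional[str]:
    """Return the best-matching passage, or None when the knowledge base is silent."""
    question_words = keywords(question)
    best_passage, best_score = None, 0
    for title, passage in knowledge_base.items():
        score = len(question_words & keywords(title + " " + passage))
        if score > best_score:
            best_passage, best_score = passage, score
    return best_passage if best_score >= min_overlap else None


def answer(question: str, knowledge_base: Dict[str, str]) -> str:
    passage = retrieve(question, knowledge_base)
    if passage is None:
        return DEFERRAL  # defer instead of guessing
    # A real agent would pass the passage to the LLM as grounding context;
    # returning it verbatim keeps this example deterministic.
    return passage


kb = {
    "opening hours": "We are open 9am to 5pm, Monday to Friday.",
    "refund policy": "Refunds are processed within 5 working days.",
}
grounded = answer("What are your opening hours?", kb)
deferred = answer("Do you ship to Antarctica?", kb)
```

The threshold is the important design choice: when nothing in the knowledge base matches well enough, the agent takes the deferral path instead of improvising.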

Stage 3 — text-to-speech synthesis (TTS)

The TTS layer turns the model's response text into audio the caller hears. Modern neural TTS models can produce speech that is genuinely difficult to distinguish from a recorded human voice in side-by-side blind tests under good acoustic conditions. Voice profiles can be custom-trained from a small reference set — typically 30-60 minutes of high-quality recorded speech — so the agent speaks in a specific voice the customer brand owns. The voice profile is not generic; it does not sound like every other contact centre.

The hard parts of TTS in 2026 are emotional range, disfluency handling, and the moments when the LLM produces text that is syntactically fine but reads awkwardly aloud. A response that looks natural in a chat window can be unspeakable on a phone call: long parenthetical asides, numbers that should be spoken as words rather than digits, abbreviations the TTS layer does not expand correctly. Production voice agents include a "voice-aware" output layer that the LLM has been tuned against, so the model learns to write text that the TTS layer can read aloud naturally, not just text that reads well on a screen.
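The three problems named above (parenthetical asides, digits, unexpanded abbreviations) can also be handled by a rule-based pass before text reaches TTS. The sketch below is that simpler alternative, not the tuned-model approach the paragraph describes. The abbreviation table is hypothetical, and numbers are only spelled out for standalone integers from 0 to 999; decimals and comma-grouped figures are left untouched.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical abbreviation table; a real deployment would maintain its own.
ABBREVIATIONS = {"appt": "appointment", "approx": "approximately", "mins": "minutes"}
ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b", re.IGNORECASE
)
# Standalone 1-3 digit integers, skipping decimals and comma-grouped numbers.
NUMBER_PATTERN = re.compile(r"(?<![\d.,])\b\d{1,3}\b(?![.,]?\d)")


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in British-style English."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def make_speakable(text: str) -> str:
    text = re.sub(r"\s*\([^)]*\)", "", text)  # drop parenthetical asides
    # Expansions are lower-case, so capitalisation of the abbreviation is lost.
    text = ABBREVIATION_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)
    return NUMBER_PATTERN.sub(lambda m: number_to_words(int(m.group())), text)


spoken = make_speakable(
    "Your appt (rescheduled last week) is in 14 days and lasts approx 45 mins."
)
```

Here `spoken` becomes "Your appointment is in fourteen days and lasts approximately forty-five minutes."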

Stage 4 — turn-taking and full-duplex behaviour

Turn-taking is the stage with no good marketing name and the largest impact on whether the call feels natural. Old-school IVR menus do not have turn-taking; you press a digit and the system responds. A voice AI agent has to know when the caller has finished a sentence, when the caller has paused mid-thought, when the agent itself should interrupt because the caller is going in the wrong direction, and when the agent should yield because the caller has tried to interrupt. Get this wrong and the agent sounds robotic regardless of how good the TTS is — long pauses after every caller utterance, or worse, the agent talking over the caller because it failed to detect the caller still had more to say.

Modern voice agents run full-duplex — both sides can speak at the same time, briefly, the way humans do. The agent detects barge-in (caller starts speaking while the agent is mid-sentence) and yields the floor immediately. The agent also produces backchannels — short verbal acknowledgements ("mm-hmm", "right", "I see") — at the moments humans would, so the caller knows they have been heard. The implementation is a small specialised model that runs on the audio stream itself, separate from the LLM, with millisecond-level latency. Without that model, the call feels stilted; with it, the call feels like a phone call.
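A heavily simplified version of that behaviour is a two-state machine driven by voice-activity frames. This sketch assumes a separate detector already labels each 20 ms frame as caller speech or silence, and it leaves out backchannels. The frame size and both thresholds are hypothetical values chosen for illustration, and production systems use a learned model where this uses fixed timers.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class State(Enum):
    LISTENING = "listening"
    AGENT_SPEAKING = "agent_speaking"


def turn_events(
    frames: Iterable[bool],
    frame_ms: int = 20,         # hypothetical frame size
    end_silence_ms: int = 500,  # silence that ends the caller's turn
    barge_in_ms: int = 100,     # caller speech that interrupts the agent
) -> List[Tuple[int, str]]:
    """Return (timestamp_ms, event) pairs for a stream of caller activity flags."""
    events: List[Tuple[int, str]] = []
    state = State.LISTENING
    heard_speech = False
    silence_ms = 0
    speech_ms = 0
    for index, caller_active in enumerate(frames):
        now_ms = (index + 1) * frame_ms
        if caller_active:
            speech_ms += frame_ms
            silence_ms = 0
        else:
            silence_ms += frame_ms
            speech_ms = 0
        if state is State.LISTENING:
            heard_speech = heard_speech or caller_active
            # A short pause mid-thought stays below the threshold.
            if heard_speech and silence_ms >= end_silence_ms:
                events.append((now_ms, "agent_starts"))
                state = State.AGENT_SPEAKING
                heard_speech = False
        elif speech_ms >= barge_in_ms:
            # Barge-in: the caller spoke over the agent, so yield the floor.
            events.append((now_ms, "agent_yields"))
            state = State.LISTENING
            heard_speech = True
    return events


# Caller speaks, pauses 200 ms mid-thought, finishes, then interrupts the agent.
frames = [True] * 10 + [False] * 10 + [True] * 5 + [False] * 30 + [True] * 10
events = turn_events(frames)
```

With these thresholds the 200 ms pause is ignored, the agent starts speaking at 1000 ms, and it yields at 1200 ms once the caller has talked over it for 100 ms.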

Latency: the operational constraint everything else lives within

End-to-end latency is the single biggest determinant of whether a voice AI deployment ships or stalls. The target for natural conversation is approximately 600-800 milliseconds from the moment the caller stops speaking to the moment the agent's voice starts responding. Push past about 1.2 seconds and the call feels delayed. Push past 2 seconds and the caller starts repeating themselves because they assume the line dropped. Hitting 600-800ms requires every stage in the pipeline to be streaming (no batching), running on infrastructure close to the caller geographically, and built to overlap stages where possible (the LLM can start producing tokens before ASR has finished, TTS can start synthesising before the LLM has finished). The engineering work is more about pipeline orchestration than about any single model.
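The arithmetic behind that claim is easy to check with a toy latency budget. The per-stage numbers below are invented for illustration, not measurements. The model is deliberately crude: in batch mode each stage waits for the previous one to finish completely, and in streaming mode each stage starts as soon as the previous one emits its first output. The thresholds in `classify` are the ones quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageTiming:
    name: str
    first_output_ms: int  # delay until the stage emits its first usable output
    total_ms: int         # delay until the stage has finished completely


def batch_latency(stages: List[StageTiming]) -> int:
    """Every stage waits for the previous stage to finish completely."""
    return sum(stage.total_ms for stage in stages)


def streaming_latency(stages: List[StageTiming]) -> int:
    """Every stage starts on the previous stage's first output."""
    return sum(stage.first_output_ms for stage in stages)


def classify(latency_ms: int) -> str:
    if latency_ms <= 800:
        return "on target"
    if latency_ms <= 1200:
        return "above target"
    if latency_ms <= 2000:
        return "feels delayed"
    return "caller may assume the line dropped"


# Hypothetical timings, measured from the moment the caller stops speaking.
budget = [
    StageTiming("end-of-turn detection", first_output_ms=200, total_ms=200),
    StageTiming("ASR finalisation", first_output_ms=100, total_ms=300),
    StageTiming("LLM", first_output_ms=250, total_ms=900),
    StageTiming("TTS", first_output_ms=150, total_ms=1200),
]
streaming_ms = streaming_latency(budget)  # 200 + 100 + 250 + 150 = 700
batch_ms = batch_latency(budget)          # 200 + 300 + 900 + 1200 = 2600
```

With identical models, the streaming arrangement lands at 700 ms while the batch arrangement takes 2600 ms, which is why orchestration matters more than any single stage.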

Where voice AI still struggles

In 2026 the technology is dramatically better than it was even two years ago, but it still has known weaknesses. Heavy regional accents outside the model's training distribution remain the largest accuracy gap. Emotional conversations (frustrated customers, distressed callers, callers with genuine welfare concerns) are surfaces where routing to a human is almost always the correct answer; the agent should detect the tone and hand off, not push through. Highly specialised technical conversations (deep medical, deep legal, complex multi-step troubleshooting) still benefit from human escalation. The right deployment shape lets the agent handle the routine 70-90% of conversation volume and routes the rest cleanly.
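A hand-off policy of this kind can be expressed as a small routing function. The sketch assumes upstream classifiers already supply a distress score and a topic label; `TurnSignals`, the topic names, and the 0.6 threshold are hypothetical, and a real deployment would tune them against reviewed calls.

```python
from dataclasses import dataclass

# Hypothetical topic labels that should always go to a human.
SPECIALIST_TOPICS = {"medical", "legal", "multi_step_troubleshooting"}


@dataclass
class TurnSignals:
    distress_score: float  # 0.0 to 1.0, from a hypothetical tone classifier
    topic: str             # hypothetical topic label for the conversation


def route(signals: TurnSignals, distress_threshold: float = 0.6) -> str:
    """Decide who should handle the rest of the call."""
    # Distress is checked first: tone outranks topic when both apply.
    if signals.distress_score >= distress_threshold:
        return "human: caller distress"
    if signals.topic in SPECIALIST_TOPICS:
        return "human: specialist topic"
    return "agent"


routine = route(TurnSignals(distress_score=0.1, topic="appointment_booking"))
upset = route(TurnSignals(distress_score=0.8, topic="appointment_booking"))
specialist = route(TurnSignals(distress_score=0.2, topic="legal"))
```

The returned reason string is what a human agent would see attached to the escalation, so it is worth keeping explicit.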

The other operational reality: voice AI is not "fire and forget". A production deployment requires ongoing tuning — adjustments to the system prompt as edge cases emerge, knowledge-base updates as products and policies change, voice-profile refresh when the brand evolves, and routine review of conversations that ended badly. The teams that succeed treat the agent like a new colleague who joined two weeks ago and needs coaching: regular sample-call review, regular knowledge-base maintenance, regular outcome measurement. The teams that fail treat the agent like a deploy-and-forget software install. The technology rewards the former and punishes the latter.

What a working deployment looks like

A typical production voice AI deployment in 2026 looks like this: the agent answers a portion of inbound calls (often 30-70%) without human involvement, escalates the rest to a human queue with full transcript context attached, handles outbound calls (reminders, renewals, payment-due notifications) at higher volume than the human team would otherwise reach, and writes every conversation back to the CRM or helpdesk as an activity record. The business metric that matters is usually first-call resolution rate on the agent-handled portion — how often the caller got what they needed without escalation — together with the cost-per-conversation and the customer satisfaction score on the agent-handled calls. Done well, the agent shifts the human team's workload toward the harder conversations that genuinely need a human, rather than replacing the human team wholesale.
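The metrics named above are simple ratios over conversation records. The sketch below assumes a hypothetical `Conversation` record with made-up sample values; real field names depend on what the CRM or helpdesk stores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Conversation:
    escalated: bool      # True if the call was handed to a human
    resolved: bool       # caller got what they needed
    cost: float          # cost of handling this conversation
    csat: Optional[int]  # 1-5 survey score, None if the caller skipped it


def agent_metrics(conversations: List[Conversation]) -> Dict[str, float]:
    """Summarise the agent-handled portion of a batch of conversations."""
    handled = [c for c in conversations if not c.escalated]
    if not handled:
        raise ValueError("no agent-handled conversations to measure")
    scores = [c.csat for c in handled if c.csat is not None]
    return {
        "agent_handled_share": len(handled) / len(conversations),
        "first_call_resolution": sum(c.resolved for c in handled) / len(handled),
        "cost_per_conversation": sum(c.cost for c in handled) / len(handled),
        # CSAT is averaged only over callers who answered the survey.
        "csat": sum(scores) / len(scores) if scores else float("nan"),
    }


# Hypothetical sample: three agent-handled calls and one escalation.
sample = [
    Conversation(escalated=False, resolved=True, cost=0.40, csat=5),
    Conversation(escalated=False, resolved=True, cost=0.60, csat=4),
    Conversation(escalated=False, resolved=False, cost=0.50, csat=None),
    Conversation(escalated=True, resolved=True, cost=4.00, csat=3),
]
metrics = agent_metrics(sample)
```

For this sample the agent handles 75% of calls, resolves two of its three conversations, averages 0.50 per conversation, and scores 4.5 on satisfaction.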

If you are evaluating voice AI for your team, the questions worth pushing on during the demo are: which ASR provider is used, how does voice-profile training work, where does latency end up in real conditions (not on a hand-tuned demo), which integration patterns ship pre-built, how does the agent ground its answers in your authoritative content, how is escalation handled, what audit trail does each conversation produce, and how does the vendor handle the inevitable need for ongoing tuning. A 14-day pilot on real customer calls is the most reliable way to get honest answers to those questions. Sandbox demos are too curated; production calls are where the truth lives.
