Voice AI
Barge-in
Barge-in is the user-experience term for letting the caller interrupt the agent’s prompt at any point — effectively skipping ahead. Originally a feature of touch-tone IVR systems, barge-in is now expected in voice AI. A well-tuned barge-in implementation distinguishes between a deliberate interruption and an inadvertent cough or background noise, so the agent only yields when the caller is actually trying to speak.
RELATED: Interruption handling
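One simple way to separate a deliberate interruption from a cough is to require a minimum run of continuous speech before the agent yields. The sketch below assumes a voice-activity detector that emits one boolean per audio frame; the function name, the 20 ms frame size, and the 10-frame threshold are illustrative assumptions, not values from any particular product.

```python
from typing import Iterable, Optional


def detect_barge_in(frames: Iterable[bool], min_speech_frames: int = 10) -> Optional[int]:
    """Return the frame index at which barge-in is confirmed, or None.

    frames: per-frame voice-activity flags (True = speech), assumed to come
        from an upstream VAD while the agent prompt is playing.
    min_speech_frames: consecutive speech frames required before yielding
        (10 frames of 20 ms each would be 200 ms; an illustrative value).
    """
    run = 0
    for index, is_speech in enumerate(frames):
        # Any non-speech frame resets the run, so short bursts never qualify.
        run = run + 1 if is_speech else 0
        if run >= min_speech_frames:
            return index
    return None


# A short burst (cough) versus sustained speech (deliberate interruption).
cough = [False] * 5 + [True] * 4 + [False] * 20
interruption = [False] * 5 + [True] * 15

cough_result = detect_barge_in(cough)
interruption_result = detect_barge_in(interruption)
```

A longer threshold rejects more noise but makes the agent slower to yield, so the value is a tuning decision rather than a constant.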
Full-duplex voice
A voice system is full-duplex when both the caller and the AI agent can speak and listen at the same time, the way two humans converse on a phone line. Older half-duplex voice bots required the caller to wait for the bot to finish before speaking. Full-duplex stacks listen continuously, detect when the caller starts talking mid-sentence, and either pause or yield the turn. This is the baseline for a natural-feeling phone interaction and a hard prerequisite for sub-second response latency.
RELATED: Latency (end-to-end) · Turn-taking
Interruption handling
Interruption handling is how a voice agent reacts when the caller starts speaking while the agent is still talking. The well-mannered response is to stop mid-sentence, listen, and pick up where the caller leaves off. Naive implementations either ignore the interruption (talking over the caller) or fully reset (losing conversational context). Robust interruption handling requires both real-time voice-activity detection and a state model that can resume gracefully after the user’s side of the exchange.
RELATED: Turn-taking
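The state model mentioned above can be as small as a record of what the agent actually managed to say before it was cut off. The sketch below is one possible shape, with hypothetical class and field names; it assumes the playback layer can report how many words were spoken before the interruption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Conversation:
    """Keeps context across an interruption instead of resetting it."""
    history: List[Dict] = field(default_factory=list)

    def agent_interrupted(self, planned: str, words_spoken: int) -> None:
        # Split the planned reply into the part the caller heard and the rest.
        words = planned.split()
        self.history.append({
            "role": "agent",
            "text": " ".join(words[:words_spoken]),
            "interrupted": True,
            "unsaid": " ".join(words[words_spoken:]),
        })

    def caller_said(self, text: str) -> None:
        self.history.append({"role": "caller", "text": text})


convo = Conversation()
convo.agent_interrupted("Your order ships on Friday and arrives Monday", words_spoken=5)
convo.caller_said("Actually, can I change the address?")
```

Keeping the unsaid remainder lets the next turn decide whether that information is still worth delivering after the caller has spoken.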
Latency (end-to-end)
End-to-end latency is the elapsed time from the caller finishing their utterance to the first audible sound of the agent’s reply. It is the sum of speech-to-text transcription, language-model inference, and text-to-speech synthesis, plus network hops. For a voice interaction to feel conversational rather than transactional, end-to-end latency should sit below roughly 500 milliseconds. Latency above one second cues the caller that they are talking to a machine and increases hang-ups.
RELATED: STT (Speech-to-Text) · TTS (Text-to-Speech)
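The latency sum described above is easy to write down as a budget. The per-stage figures below are made-up illustrative numbers, not measurements, and the function and key names are hypothetical.

```python
TARGET_MS = 500  # the rough conversational threshold given in the definition


def end_to_end_latency_ms(stt_ms: int, llm_ms: int, tts_ms: int, network_ms: int) -> int:
    """Sum the stage latencies for a strictly sequential pipeline."""
    return stt_ms + llm_ms + tts_ms + network_ms


# Illustrative budget only; real figures depend on models, hardware, and routing.
budget = {"stt_ms": 120, "llm_ms": 220, "tts_ms": 90, "network_ms": 50}
total_ms = end_to_end_latency_ms(**budget)
within_target = total_ms < TARGET_MS
```

With these numbers the total is 480 ms, which leaves little headroom: adding 30 ms to any single stage would push the interaction past the target.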
STT (Speech-to-Text)
Speech-to-Text is the component that transcribes a caller’s audio into written text the language model can reason over. Modern STT models stream partial transcripts in real time so the downstream LLM can begin planning a response before the caller has finished speaking. STT accuracy varies by accent, background noise, and language coverage. For voice AI deployments the relevant metric is not raw word-error rate but rather how often the resulting transcript causes the agent to take the wrong action.
RELATED: TTS (Text-to-Speech) · Latency (end-to-end)
TTS (Text-to-Speech)
Text-to-Speech is the component that converts the language model’s written response back into audible speech. Quality is measured along three axes: naturalness of the voice, expressiveness of inflection, and synthesis latency. High-end TTS engines support emotional tone, pace control, and pronunciation hints for proper nouns. Latency matters as much as quality — a beautiful voice that takes a second to start speaking still breaks the conversational rhythm.
RELATED: STT (Speech-to-Text) · Voice cloning
Turn-taking
Turn-taking is the protocol that decides who speaks next in a conversation. Human speakers signal turn boundaries with pitch, pause length, and gaze. Voice AI systems approximate this with voice-activity detection, pause-threshold tuning, and predictive end-of-utterance models. Poor turn-taking is the most common cause of unnatural-feeling agents: the bot either talks over the caller or sits silent after the caller has clearly finished. Good turn-taking is mostly invisible — you only notice it when it fails.
RELATED: Full-duplex voice
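Pause-threshold tuning, the simplest of the techniques listed above, can be sketched as follows. It assumes per-frame voice-activity flags from an upstream detector; the 20 ms frame size and 600 ms threshold are illustrative assumptions, and predictive end-of-utterance models are not covered here.

```python
from typing import Iterable, Optional


def end_of_utterance(
    frames: Iterable[bool], frame_ms: int = 20, pause_threshold_ms: int = 600
) -> Optional[int]:
    """Return the frame index at which the caller's turn is judged finished.

    The turn ends once the caller has spoken and then stayed silent for at
    least pause_threshold_ms. Returns None if that never happens.
    """
    needed = pause_threshold_ms // frame_ms
    heard_speech = False
    silence = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence = 0  # a mid-sentence pause that ends resets the count
        elif heard_speech:
            silence += 1
            if silence >= needed:
                return index
    return None


# Speech, a short 200 ms hesitation, more speech, then a long silence.
caller_audio = [True] * 10 + [False] * 10 + [True] * 5 + [False] * 40
turn_end = end_of_utterance(caller_audio)
```

A low threshold makes the agent cut in during hesitations; a high one produces the silent gap described above, which is why this value is usually tuned per use case.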
Voice cloning
Voice cloning creates a custom synthetic voice from a sample of human recordings. With sufficient reference audio — typically 30 minutes to several hours of clean speech — the resulting voice can read arbitrary new text in a tone that closely matches the source speaker. Enterprise voice deployments use cloning to maintain a consistent brand voice across thousands of agent calls. Responsible use requires explicit consent from the source speaker and clear disclosure to end-users.
RELATED: TTS (Text-to-Speech)
Chat AI
Channel handoff
Channel handoff is the moment a conversation moves from one surface to another — web chat to SMS, SMS to phone call, AI agent to human teammate. The technical challenge is preserving conversational state, customer identity, and the working context so the receiving channel picks up where the sending one left off. Well-implemented handoff is invisible to the customer; poorly implemented handoff forces them to repeat themselves and is one of the largest sources of customer-experience friction.
RELATED: Omnichannel chat
Omnichannel chat
Omnichannel chat means a single conversational agent serves customers across multiple messaging surfaces — web chat, WhatsApp, SMS, Telegram, Messenger, in-app — while preserving the same identity, context, and conversation history. The implementation hinges on a unified customer record keyed off phone, email, or platform-specific identifier. Done well, a customer can start an enquiry on WhatsApp at lunch, switch to the web chat from a desk in the afternoon, and find the agent already knows what they were asking about.
RELATED: Channel handoff
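The unified customer record described above can be illustrated with an in-memory directory that maps any known identifier to a single customer. Everything here (class name, identifier format, normalisation rule) is a hypothetical sketch; a production system would also need persistence and a policy for merging two records that turn out to be the same person.

```python
from typing import Dict, List, Tuple

Identifier = Tuple[str, str]  # (kind, value), e.g. ("phone", "+6421555000")


def normalise(identifier: Identifier) -> Identifier:
    kind, value = identifier
    return kind.lower(), value.strip().lower()


class CustomerDirectory:
    def __init__(self) -> None:
        self._by_identifier: Dict[Identifier, int] = {}
        self._history: Dict[int, List[Tuple[str, str]]] = {}
        self._next_id = 1

    def resolve(self, *identifiers: Identifier) -> int:
        """Return the customer id for any known identifier, linking new ones."""
        keys = [normalise(i) for i in identifiers]
        known = [self._by_identifier[k] for k in keys if k in self._by_identifier]
        if known:
            customer_id = known[0]  # merging conflicting records is out of scope
        else:
            customer_id = self._next_id
            self._next_id += 1
            self._history[customer_id] = []
        for key in keys:
            self._by_identifier[key] = customer_id
        return customer_id

    def log(self, identifier: Identifier, channel: str, text: str) -> None:
        self._history[self.resolve(identifier)].append((channel, text))

    def history(self, identifier: Identifier) -> List[Tuple[str, str]]:
        return list(self._history[self.resolve(identifier)])


directory = CustomerDirectory()
directory.resolve(("phone", "+6421555000"), ("email", "Ana@Example.com"))
directory.log(("phone", "+6421555000"), "whatsapp", "Where is my order?")
seen_on_web = directory.history(("email", "ana@example.com"))
```

The message logged from WhatsApp under the phone number is visible when the same customer later appears on web chat under their email address.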
Streaming responses
Streaming responses means the agent emits its reply token-by-token rather than waiting for the full answer to be generated before sending anything. In chat, the user sees text typing out in real time; in voice, the TTS engine can begin speaking before the LLM has finished planning the sentence. Streaming materially reduces perceived latency — a four-second reply that starts in 400ms feels faster than a two-second reply that starts in two seconds.
RELATED: Latency (end-to-end)
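For the voice case, one common arrangement is to buffer streamed tokens only until a sentence boundary and hand each finished sentence to the TTS engine straight away. The sketch below assumes tokens arrive as plain strings and uses a deliberately naive boundary rule (terminal punctuation); abbreviations and decimals would need more careful handling.

```python
from typing import Iterable, Iterator


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, without waiting for the rest."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check: flush when the buffer ends in terminal punctuation.
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


# Hypothetical token stream from a language model.
token_stream = ["Your", " order", " shipped", ".", " It", " arrives", " Monday", "."]
spoken_chunks = list(sentences_from_tokens(token_stream))
```

Because the function is a generator, the first sentence is available to the consumer after four tokens rather than after all eight.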
WhatsApp Business API
The WhatsApp Business API is the programmatic interface businesses use to send and receive WhatsApp messages at scale. Unlike the consumer app, it requires a verified business profile and message templates pre-approved by Meta for outbound notifications; access is typically arranged through a Business Solution Provider or directly via Meta’s Cloud API. Replies to inbound messages can be free-form, but only within a 24-hour customer service window after the customer’s last message; outside that window the business must use an approved template. Common use cases include order updates, appointment reminders, customer support, and conversational commerce in markets where WhatsApp is the dominant messaging channel.
RELATED: Omnichannel chat
Conversational AI
Agent orchestration
Agent orchestration is the runtime that decides, on each turn, what the AI agent should do next — ask a clarifying question, call an external tool, look something up in a knowledge base, hand off to a human. Orchestration sits between the language model and everything it interacts with, enforcing guardrails, sequencing tool calls, and managing conversational state across turns. Quality of orchestration is what separates a demo-grade chatbot from a production-grade agent.
RELATED: Intent classification · Tool use
Function calling
Function calling is the lower-level protocol that lets a language model emit a structured request (a JSON payload naming a function and its arguments) instead of free-form text. The orchestration layer recognises the request, invokes the named function, and feeds the result back to the model. Most modern LLM providers expose function calling as a first-class feature. It is the substrate underneath tool use, and because the request is structured and can be validated before anything runs, it is what makes conversational AI predictable enough to be useful in production.
RELATED: Tool use
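The recognise, invoke, and feed-back cycle can be sketched with a small dispatcher. The payload shape (`name` and `arguments` keys), the registry, and the `check_balance` stub are assumptions made for illustration; each LLM provider defines its own schema, so this is not any specific vendor's format.

```python
import json


def check_balance(account_id: str) -> dict:
    # Hypothetical stub standing in for a real backend lookup.
    return {"account_id": account_id, "balance": 42.5}


# Only functions listed here can be invoked by the model.
REGISTRY = {"check_balance": check_balance}


def dispatch(model_output: str) -> str:
    """Parse a structured request, run the named function, return JSON for the model."""
    try:
        request = json.loads(model_output)
        function = REGISTRY[request["name"]]
        result = function(**request["arguments"])
        return json.dumps({"ok": True, "result": result})
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Malformed JSON, unknown function, or wrong arguments: report, do not crash.
        return json.dumps({"ok": False, "error": f"{type(exc).__name__}: {exc}"})


good = dispatch('{"name": "check_balance", "arguments": {"account_id": "A1"}}')
unknown = dispatch('{"name": "delete_everything", "arguments": {}}')
```

Returning errors as structured data, rather than raising, gives the model a chance to correct its request on the next turn.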
Hallucination
A hallucination is a confidently-stated fact the language model has invented — a fabricated date, a non-existent feature, a policy that does not apply. Hallucinations are the dominant failure mode of conversational AI in customer-facing settings because the model has no native sense of what it does and does not know. Practical mitigations include grounding answers in retrieved context, structuring critical answers as tool calls, and instructing the model to acknowledge uncertainty rather than guess.
RELATED: RAG (Retrieval-Augmented Generation)
Intent classification
Intent classification is the step that maps a caller’s utterance to a discrete action the system knows how to handle — "check my balance", "book an appointment", "speak to a person". Classical systems used hand-tuned classifiers; modern conversational AI delegates intent recognition to the language model itself, which infers intent in context as part of its reasoning. The remaining engineering work is defining the action set and ensuring the LLM has the tools to actually execute each action.
RELATED: Agent orchestration
Knowledge base
A knowledge base in the conversational AI sense is the indexed corpus the agent can search over to answer questions — product documentation, internal policies, past tickets, FAQs. The corpus is typically chunked, embedded into vectors, and stored in a vector database. Quality of answers depends as much on how the source material is structured as on the model doing the answering. Well-maintained knowledge bases require ongoing editorial work, not a one-time data import.
RELATED: RAG (Retrieval-Augmented Generation)
RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation pairs a language model with a search step over the customer’s own documents. On each turn the system retrieves the most relevant snippets from a knowledge base and supplies them to the model as context. The model then writes a response grounded in the retrieved material. RAG is the dominant pattern for question-answering over proprietary content because it sidesteps the need to fine-tune the model itself and keeps source material under the customer’s control.
RELATED: Knowledge base
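The retrieve-then-generate flow can be shown end to end with a toy retriever. To stay self-contained, the sketch swaps real embeddings and a vector database for word counts and cosine similarity, which is a much cruder technique; the chunk texts, function names, and prompt wording are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Stand-in for an embedding model: a simple bag of lowercase words.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: List[str], k: int = 1) -> str:
    """Assemble the context-plus-question text that would be sent to the model."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "Refunds are issued within 14 days of purchase.",
    "Our support line is open 9am to 5pm on weekdays.",
    "Shipping to New Zealand takes 3 to 5 business days.",
]
question = "How long do refunds take?"
top_chunk = retrieve(question, knowledge_base)[0]
prompt = build_prompt(question, knowledge_base)
```

The model call itself is omitted; the point is that the prompt already contains the relevant snippet, so the answer can be grounded in the customer's own material.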
Tool use
Tool use is the capability for a language model to call external functions — query a database, post to an API, schedule a calendar event — as part of producing its response. The model decides when a tool is needed, formulates the arguments, receives the result, and continues reasoning. Reliable tool use depends on clear function signatures, well-typed return values, and explicit error handling. It is the mechanism by which conversational AI moves from "talking about a task" to "actually doing it".
RELATED: Agent orchestration · Function calling
Deployment
Audit trail
An audit trail is a per-conversation record of every decision the agent made — user input, model output, tools invoked, knowledge sources retrieved, handoff events. It is the artefact compliance teams use to reconstruct what happened on any given call. Useful audit trails are queryable, exportable, and tied to a stable conversation identifier. They are the substrate for both ongoing quality review and any post-incident investigation.
RELATED: Zero-retention routing
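A queryable, exportable trail tied to a stable conversation identifier might look like the following in-memory sketch. The class names, event kinds, and JSON Lines export format are assumptions for illustration; a real deployment would write to durable, access-controlled storage.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class AuditEvent:
    conversation_id: str
    kind: str      # e.g. "user_input", "model_output", "tool_call", "handoff"
    detail: dict
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


class AuditTrail:
    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, conversation_id: str, kind: str, detail: dict) -> AuditEvent:
        event = AuditEvent(conversation_id, kind, detail)
        self._events.append(event)
        return event

    def for_conversation(self, conversation_id: str) -> List[AuditEvent]:
        """Query: every event for one conversation, in the order recorded."""
        return [e for e in self._events if e.conversation_id == conversation_id]

    def export_jsonl(self, conversation_id: str) -> str:
        """Export: one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.for_conversation(conversation_id))


trail = AuditTrail()
trail.record("conv-1", "user_input", {"text": "Where is my order?"})
trail.record("conv-1", "tool_call", {"name": "lookup_order", "arguments": {"order_id": "A17"}})
trail.record("conv-2", "user_input", {"text": "Hello"})
exported = trail.export_jsonl("conv-1")
```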
BYO cloud (Bring Your Own Cloud)
Bring Your Own Cloud is a deployment model where the conversational AI platform runs inside the customer’s own cloud account rather than in a shared vendor tenant. The customer keeps direct control over data residency, network egress, and identity controls; the vendor supplies the software and operates it under shared-responsibility terms. BYO cloud is most often requested by enterprises that have strict residency requirements or that want call recordings never to leave their account.
RELATED: On-prem deployment
Multi-region residency
Multi-region residency means the platform can be configured to keep customer data within a chosen jurisdiction — New Zealand, Australia, the United States, the United Kingdom, the European Union, and so on. Routing, storage, and model inference all stay inside the elected region. It is the operational answer to data-sovereignty requirements that vary market by market. Buyers should ask vendors which regions are actually available, since "multi-region" is sometimes marketed as a capability before all regions are live.
RELATED: Audit trail
On-prem deployment
On-prem deployment runs the conversational AI stack inside the customer’s own data centre or private network, with no dependency on a vendor-hosted cloud. It is the strictest residency posture and is requested where regulation, internal policy, or air-gap requirements rule out public-cloud egress. Trade-offs are higher operational overhead and slower release cycles than a managed cloud option. For most enterprise buyers, BYO cloud is the practical middle ground between vendor SaaS and full on-prem.
RELATED: BYO cloud (Bring Your Own Cloud)
PCI redaction
PCI redaction is the automatic masking of payment-card numbers in transcripts, recordings, and audit logs. When a caller reads out a credit-card number, the redaction layer detects the pattern in real time and substitutes a placeholder before the data hits any persistent store. Implementations may also pause audio recording during card-entry windows. PCI redaction is a baseline engineering control for any voice or chat system that handles card-not-present transactions; it is a control, not a certification.
RELATED: Audit trail
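For text transcripts, pattern detection is often paired with a Luhn checksum so that order numbers and other long digit strings are not masked by mistake. The sketch below makes that assumption and works only on digits already written as numerals; spoken digits ("four one one one") would need normalising first, and audio redaction is a separate problem. The pattern, placeholder, and function names are illustrative.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_cards(text: str, placeholder: str = "[REDACTED_CARD]") -> str:
    """Mask card-like digit runs that pass the Luhn check; leave others untouched."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()
    return CARD_PATTERN.sub(replace, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
masked = redact_cards("My card is 4111 1111 1111 1111 thanks")
untouched = redact_cards("Order 1234 5678 9012 3456 shipped")
```

Running the redaction before anything is written to a transcript or log is what keeps the card number out of persistent storage.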
Zero-retention routing
Zero-retention routing is a configuration in which prompts and model outputs are not retained by the upstream LLM provider — they are processed in memory and discarded after the response. The provider commits contractually that no training data is derived from the customer’s traffic. It is the baseline privacy posture enterprise buyers expect when sensitive conversational data passes through a hosted model. Without zero-retention routing, otherwise-sound data-handling controls leak at the LLM hop.
RELATED: Audit trail