Most people hear “voice AI” and picture Siri getting a question wrong, or a meeting bot transcribing their calls. Neither of those is what I mean when I talk about voice AI. Both are the shallow end of the pool.
Voice AI, in its most useful form, is a system that can listen, understand context, reason, and speak back in real time at a quality level where the person on the other end isn’t constantly compensating for the technology. That bar sounds obvious. It is remarkably hard to actually hit.
I’ve spent years building Speak AI, a voice and audio intelligence platform now used by over 250,000 people across more than 100 countries. That experience has given me a particular view of where voice AI is genuinely useful, where it still breaks, and why most of what gets covered in the press misses the interesting part entirely.
The stack underneath voice AI
When you talk to a voice AI agent, at minimum three things are happening: your speech is being converted to text (ASR: automatic speech recognition), that text is being processed by a language model to generate a response, and that response is being converted back into speech (TTS: text to speech). Each of these has latency. Each has failure modes. And they have to work together fast enough that the conversation feels natural, not like you’re waiting for a web page to load in 2003.
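To make the three stages concrete, here is a minimal sketch of a single conversational turn with per-stage timing. The `asr`, `llm`, and `tts` callables are hypothetical stand-ins for whatever providers you use, not a real API, and the fake implementations exist only so the example runs on its own.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnResult:
    transcript: str
    reply_text: str
    reply_audio: bytes
    stage_ms: Dict[str, float] = field(default_factory=dict)

    @property
    def total_ms(self) -> float:
        # End-to-end latency for this turn, summed across stages.
        return sum(self.stage_ms.values())


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> TurnResult:
    """Run ASR -> LLM -> TTS once and record how long each stage took."""
    stage_ms: Dict[str, float] = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        stage_ms[name] = (time.perf_counter() - start) * 1000.0
        return out

    transcript = timed("asr", asr, audio)
    reply_text = timed("llm", llm, transcript)
    reply_audio = timed("tts", tts, reply_text)
    return TurnResult(transcript, reply_text, reply_audio, stage_ms)


# Hypothetical stand-ins so the sketch is runnable without any provider.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"book a table", fake_asr, fake_llm, fake_tts)
print(result.reply_text, result.stage_ms)
```

This version runs the stages strictly one after another, which is the simplest thing to reason about. Production systems usually stream between stages so that speech can start before the full response is generated.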
The language model piece is the one that’s improved most dramatically in the last two years. What hasn’t kept pace as cleanly is the orchestration layer: how you handle interruptions, how you maintain context across a long call, how you gracefully recover when the model says something wrong, and how you tune the experience for a specific use case rather than a generic one.
That orchestration layer is where most of the real work lives, and it’s the part that’s almost never discussed in coverage of voice AI.
Phone agents and web agents: the two use cases that actually matter
There are two environments where voice AI is proving genuinely useful right now. Phone and web. They are different problems with different constraints.
Phone agents handle inbound and outbound calls. Customer support, appointment scheduling, lead qualification, intake forms. The advantage here is that people are already comfortable talking on the phone. The constraints are significant: you’re often dealing with poor audio quality, background noise, accents, and a user who did not opt into talking to an AI and may be frustrated before the call starts. The bar for quality is high because the comparison is a real person.
Web agents are embedded in products and websites. Voice-first interfaces for demos, support widgets, onboarding flows. The user context is better. They’ve chosen to engage. But the design challenge is different. You have to make speaking feel like the natural choice, not a novelty. Most web voice interfaces fail because they treat voice as a feature rather than a primary interaction model.
Meeting transcription is a third category, but it’s largely a solved problem at this point. There are dozens of products that do it well. It’s also not really “voice AI” in the sense that matters. It’s voice-to-text with some summarization. Useful, but not the interesting frontier.
What makes it hard
Latency is the obvious challenge. If there’s more than about 800 milliseconds between when someone finishes speaking and when the agent responds, the conversation starts feeling broken. Getting that number down while maintaining response quality, on commodity cloud infrastructure, at scale, is not trivial.
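One way to work with that number is to treat it as a budget and see which stage is eating it. The sketch below assumes the roughly 800 millisecond threshold mentioned above; the stage names and the per-stage timings are made-up illustrative values, not measurements.

```python
from typing import Dict

BUDGET_MS = 800.0  # the approximate threshold from the text


def check_budget(stage_ms: Dict[str, float], budget_ms: float = BUDGET_MS) -> dict:
    """Sum per-stage latencies and report whether the turn fits the budget."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "within_budget": total <= budget_ms,
        "over_by_ms": max(0.0, total - budget_ms),
        "slowest_stage": max(stage_ms, key=stage_ms.get),
    }


# Hypothetical timings for one turn, in milliseconds.
report = check_budget(
    {
        "endpointing": 200,      # deciding the caller has stopped talking
        "asr": 150,
        "llm_first_token": 350,
        "tts_first_audio": 180,
        "network": 60,
    }
)
print(report)
```

Measuring time to first token and time to first audio, rather than time to a complete response, reflects what the caller actually experiences as the pause.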
Interruptions are harder than latency. In real conversation, people interrupt each other constantly. They start a sentence, change direction, add a clarification mid-thought. A voice agent that can’t handle this naturally will either stop and restart awkwardly or ignore the interruption entirely. Either way, it immediately breaks the illusion of a real conversation.
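A minimal sketch of interruption handling (often called barge-in) as a small state machine. The class and method names are hypothetical; in a real system `user_started_speaking` would be driven by voice activity detection and `play_next` by your TTS playback loop.

```python
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


class TurnManager:
    def __init__(self) -> None:
        self.state = State.LISTENING
        self.pending: List[str] = []  # sentences queued for playback
        self.heard: List[str] = []    # sentences the caller actually heard

    def user_finished(self) -> None:
        # Caller stopped talking; the agent starts generating a reply.
        self.state = State.THINKING

    def start_reply(self, sentences: List[str]) -> None:
        self.pending = list(sentences)
        self.state = State.SPEAKING

    def play_next(self) -> Optional[str]:
        # Play one sentence at a time so an interruption can land between them.
        if self.state is not State.SPEAKING or not self.pending:
            return None
        sentence = self.pending.pop(0)
        self.heard.append(sentence)
        if not self.pending:
            self.state = State.LISTENING
        return sentence

    def user_started_speaking(self) -> int:
        # Barge-in: drop whatever has not been spoken yet and listen again.
        if self.state in (State.SPEAKING, State.THINKING):
            dropped = len(self.pending)
            self.pending.clear()
            self.state = State.LISTENING
            return dropped
        return 0


tm = TurnManager()
tm.user_finished()
tm.start_reply(["We open at nine.", "We close at five.", "Weekends are by appointment."])
first = tm.play_next()
dropped = tm.user_started_speaking()
```

Tracking `heard` separately from what the model generated matters: the conversation history should reflect only what the caller actually heard before they cut in.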
Context memory matters more in voice than in text because the user can’t scroll back. If a voice agent forgets something you said three minutes ago and asks you again, it’s a bad experience in a way that a chatbot asking the same question just isn’t. The expectation of continuity is higher when you’re speaking.
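As a rough illustration, the simplest form of call memory is a record of facts the caller has already given, checked before the agent asks for anything. The slot names and the `CallMemory` class are hypothetical and assume a scheduling-style call.

```python
from typing import Dict, List, Optional

REQUIRED_SLOTS = ["name", "date", "time"]  # hypothetical fields for a booking call


class CallMemory:
    def __init__(self) -> None:
        self.facts: Dict[str, str] = {}

    def remember(self, slot: str, value: str) -> None:
        self.facts[slot] = value

    def next_missing(self, required: List[str]) -> Optional[str]:
        # Only ask for what the caller has not already told us.
        for slot in required:
            if slot not in self.facts:
                return slot
        return None

    def as_prompt_context(self) -> str:
        # Compact summary to include in every LLM prompt for this call.
        return "; ".join(f"{k}={v}" for k, v in self.facts.items())


mem = CallMemory()
mem.remember("name", "Dana")
ask_next = mem.next_missing(REQUIRED_SLOTS)
mem.remember("date", "Tuesday")
mem.remember("time", "3pm")
```

Injecting `as_prompt_context()` into each turn is one way to keep known facts in front of the model even after older turns have been trimmed from the history.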
Then there’s failure handling. What happens when the agent mishears something, or the user says something completely outside the expected range, or the connection degrades? Graceful failure in a voice interface is an underrated design problem. The systems that handle it well have thought hard about it. Most haven’t.
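One common shape for this is a small decision policy driven by how confident the recognizer was and how many times the turn has already failed. This is a sketch under assumptions: the thresholds are placeholder values to tune against real calls, and it assumes your ASR provider exposes a confidence score between 0 and 1.

```python
def next_action(
    asr_confidence: float,
    consecutive_failures: int,
    *,
    confirm_below: float = 0.8,  # hypothetical threshold
    reject_below: float = 0.5,   # hypothetical threshold
    max_failures: int = 2,       # hypothetical retry limit
) -> str:
    """Decide how to respond to one recognized utterance."""
    if asr_confidence >= confirm_below:
        return "proceed"    # heard clearly, act on it
    if consecutive_failures >= max_failures:
        return "escalate"   # stop looping, hand off to a person
    if asr_confidence >= reject_below:
        return "confirm"    # "Did you say Tuesday at three?"
    return "reprompt"       # "Sorry, I didn't catch that."


print(next_action(0.65, 0))
```

The retry limit is the important part. An agent that asks the caller to repeat themselves indefinitely is worse than one that admits defeat and transfers the call.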
Where this is going
The cost of voice AI infrastructure has dropped significantly and will keep dropping. That means the barrier to building a voice agent is lower than it’s ever been, which means more experiments, more products, and faster iteration on what actually works.
The more interesting shift is toward always-on, ambient voice interfaces: devices and contexts where the voice agent is present continuously rather than activated for a specific task. That changes the design problem in fundamental ways. Persistence, proactivity, and knowing when not to speak become as important as knowing what to say.
Multimodal integration is also accelerating. Voice combined with vision opens up use cases that feel closer to science fiction than current deployment. Within two or three years, I don’t think that comparison will hold.
If you want to go deeper on how to integrate agents into a real working workflow, I wrote about how I use AI agents every day to build a software company.
When to build versus when to buy
The honest answer is almost always: buy the components, build the orchestration.
ASR, TTS, and LLM inference are commodities at this point. There are good providers for each. Building your own ASR model is an enormous undertaking that makes sense for only a handful of companies with specific requirements. Using an existing provider for transcription and focusing your engineering effort on the orchestration layer is the right division of labor for most teams. How the agent behaves, how it handles context and failures, how it’s tuned to your domain: that’s where your effort belongs.
The exception is if your use case has a domain-specific audio challenge that general models handle poorly. Heavy accents in a specific language, industry-specific vocabulary that gets consistently misrecognized, noise conditions that exceed what generic noise cancellation handles. In those cases, fine-tuning on domain-specific audio data is worth exploring. But that’s a later optimization, not a starting point.
Start with the fastest path to a working prototype that you can test with real users. The architecture decisions that matter can’t be made in the abstract. They have to be made in response to real usage data. Get something in front of real users, collect the failure cases, and optimize for those specifically. That iteration cycle is faster than getting the architecture perfect before you launch, and the learnings are more valuable.
Building a voice agent: what the actual stack looks like
I’ll be specific because the vague version of this advice is everywhere and the specific version is harder to find.
For phone-based voice agents, you need three infrastructure layers: real-time audio transport (WebRTC-based, bidirectional, low latency), carrier connectivity to the phone network (for inbound and outbound calls), and a RAG (retrieval-augmented generation) layer sitting in front of the language model. Each is a meaningful choice with multiple providers, and the tradeoffs matter more than which vendor you pick.
The real-time transport handles the WebRTC layer, moving audio bidirectionally at low enough latency that conversation feels natural. There are open source options you can self-host for cost control and commercial options that trade control for managed infrastructure. The right choice depends on your volume, your regions, and whether you want to own the operational complexity. Test latency in the geographies your users are actually in before you commit to anything.
Carrier connectivity is commodity. It’s the plumbing that connects your agent to the phone network. For inbound calls, you route a phone number to your agent infrastructure. For outbound, you make programmatic calls that hand off to the same stack. Don’t over-engineer this layer. Multiple providers do it reliably and the differences between them are minor for most use cases.
The RAG layer is where most of the actual intelligence lives for domain-specific agents. The pattern is: take your knowledge base (documentation, FAQs, product information, call scripts, whatever the agent needs to know), chunk it, embed it, and store it in a vector database. When a call comes in, the agent queries the knowledge base in real time and injects relevant context into the language model prompt before generating a response. This is what separates a generic voice agent from one that actually knows your business.
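The chunk, embed, store, query pattern can be sketched in a few lines. Everything here is a toy stand-in chosen so the example runs with only the standard library: `embed` is a bag-of-words counter rather than a real embedding model, `KnowledgeBase` is an in-memory list rather than a vector database, and the clinic content is invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk(text: str, max_words: int = 40) -> List[str]:
    """Split text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class KnowledgeBase:
    def __init__(self) -> None:
        self.rows: List[Tuple[str, Counter]] = []  # stand-in for a vector database

    def add_document(self, text: str, max_words: int = 40) -> None:
        for piece in chunk(text, max_words):
            self.rows.append((piece, embed(piece)))

    def search(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question: str, kb: KnowledgeBase, k: int = 2) -> str:
    # Inject the retrieved chunks ahead of the caller's question.
    context = "\n".join(f"- {c}" for c in kb.search(question, k))
    return f"Answer using only this context:\n{context}\n\nCaller asked: {question}"


kb = KnowledgeBase()
kb.add_document("The clinic is open Monday to Friday from 9am to 5pm.")
kb.add_document("Patients can park for free in the lot behind our building.")
kb.add_document("To cancel an appointment, call at least 24 hours ahead.")

top = kb.search("When is the clinic open?", k=1)
print(build_prompt("When is the clinic open?", kb, k=1))
```

In a voice agent this lookup sits on the critical path of every turn, so its latency counts against the same response budget as ASR, inference, and TTS.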
The quality of your knowledge base determines the quality of your agent more than any other variable. I’ve seen teams spend weeks tuning their prompts and almost no time improving their knowledge base content. The ratio should probably be inverted. Garbage in, garbage out. It’s not a cliche when it’s true.
Prompt engineering for voice is different from text
The same prompt that works well for a text-based chatbot will perform poorly for a voice agent. The constraints are different in ways that aren’t obvious until you’ve run into them.
Responses have to be shorter. A text response can run several paragraphs. The user can read at their pace. A spoken response at the same length sounds like a speech. The optimal spoken response is usually two to four sentences. If the agent needs to convey more information than that, it needs to do it across multiple conversational turns, not one long answer.
Responses have to be speakable. Markdown formatting, parenthetical asides, bullet points, numbered lists: these are all text conventions that don’t translate to speech. The TTS system will either try to render them literally or strip them in ways that produce awkward output. Write prompts that constrain the model to produce clean prose in the kind of sentence structure that sounds natural when spoken aloud.
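Prompt constraints do most of the work, but a defensive cleanup pass before TTS is cheap insurance. The sketch below is one possible approach, not a complete markdown parser: it strips the most common text-only conventions with regular expressions and caps the reply at a few sentences. The function name and the default cap are assumptions.

```python
import re


def to_speakable(text: str, max_sentences: int = 4) -> str:
    """Strip common markdown and asides, then keep the first few sentences."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)          # fenced code
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)             # links -> label only
    text = re.sub(r"^\s*(?:[-*+]|\d+[.)])\s+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"^\s*#+\s*", "", text, flags=re.MULTILINE)        # heading markers
    text = re.sub(r"[*_`]+", "", text)                               # emphasis and inline code marks
    text = re.sub(r"\s*\([^)]*\)", "", text)                         # parenthetical asides
    text = re.sub(r"\s+", " ", text).strip()                         # collapse whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])


raw = (
    "**Sure!** Here are the steps:\n"
    "- Open the app (version 2 or later).\n"
    "- Tap *Settings*.\n"
    "- Choose [Billing](https://example.com/billing)."
)
spoken = to_speakable(raw, max_sentences=3)
print(spoken)
```

Truncating at a sentence boundary is a blunt instrument. It is better to have the model produce short answers in the first place and treat this as a safety net.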
Responses have to handle the unexpected gracefully. In text interactions, an unusual or out-of-scope question can be redirected with a short message. In voice, that same redirect can feel jarring if the agent’s tone shifts. The prompts that work best for voice agents spend as much time on the failure cases as the success cases. What to say when the question is ambiguous, what to say when the knowledge base doesn’t have an answer, how to ask a clarifying question without breaking conversational flow.
Persona consistency matters more than in text. A voice agent with a defined character, a specific name, a specific communication style, specific things it will and won’t do, produces more coherent experiences than a generic assistant persona. Users calibrate their expectations based on the first few turns of a conversation. If those turns are consistent, the rest of the conversation goes better even when things get complicated.
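Pulling the four points above together, here is a sketch of a system prompt builder for a voice agent. The wording of each instruction, the function name, and the example persona are all hypothetical; treat it as a starting template to test against real calls rather than a proven prompt.

```python
from typing import Iterable


def build_voice_system_prompt(
    name: str,
    role: str,
    style: str,
    max_sentences: int = 3,
    refuses: Iterable[str] = (),
) -> str:
    lines = [
        # Persona: a specific name, role, and communication style.
        f"You are {name}, {role}. Your speaking style is {style}.",
        # Length: keep each turn short and spread detail across turns.
        f"Reply in at most {max_sentences} short sentences. If more is needed, offer to continue.",
        # Speakability: nothing that only makes sense on a screen.
        "Use plain spoken prose only: no markdown, lists, headings, or parenthetical asides.",
        # Failure cases: ambiguity and missing knowledge.
        "If the request is ambiguous, ask one brief clarifying question.",
        "If the provided context does not contain the answer, say so plainly "
        "and offer to connect the caller with a person.",
    ]
    for topic in refuses:
        lines.append(f"Do not discuss {topic}; politely redirect instead.")
    return "\n".join(lines)


prompt = build_voice_system_prompt(
    "Maya",
    "a scheduling assistant for a dental clinic",
    "warm and concise",
    refuses=["medical advice"],
)
print(prompt)
```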
The economics at scale
Building a voice agent is cheap in prototype. Running it in production at volume is a different calculation.
The cost components are: ASR (audio in), LLM inference (processing and response generation), TTS (audio out), and real-time transport infrastructure. At our usage levels with Speak AI, LLM inference is typically the largest cost component, followed by TTS. ASR has gotten cheap enough that it’s rarely the constraining factor. Transport infrastructure costs scale with concurrent sessions, not total minutes.
The number that matters for business model design is cost per minute of conversation. Depending on the LLM you’re using, the length of your knowledge base context, and the TTS provider, this can range from less than a cent per minute on an optimized stack to several cents per minute on a naive one. Those differences compound very fast at production volume. The architecture decisions you make in the prototype phase will be very expensive to undo at scale.
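A back-of-the-envelope model makes the sensitivity visible. Every price and usage figure below is a hypothetical placeholder, not a quote from any provider or a number from our own bills; substitute your own rates. Transport is left out because, as noted above, it scales with concurrent sessions rather than minutes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pricing:  # all values hypothetical, in dollars
    asr_per_audio_min: float
    tts_per_1k_chars: float
    llm_in_per_1k_tokens: float
    llm_out_per_1k_tokens: float


@dataclass(frozen=True)
class CallProfile:
    turns_per_min: float
    context_tokens_per_turn: float  # system prompt + retrieved context + history
    output_tokens_per_turn: float
    chars_spoken_per_turn: float


def cost_per_minute(p: Pricing, c: CallProfile) -> dict:
    """Break one minute of conversation into ASR, LLM, and TTS cost."""
    llm = c.turns_per_min * (
        c.context_tokens_per_turn / 1000 * p.llm_in_per_1k_tokens
        + c.output_tokens_per_turn / 1000 * p.llm_out_per_1k_tokens
    )
    tts = c.turns_per_min * c.chars_spoken_per_turn / 1000 * p.tts_per_1k_chars
    asr = p.asr_per_audio_min
    return {"asr": asr, "llm": llm, "tts": tts, "total": asr + llm + tts}


pricing = Pricing(
    asr_per_audio_min=0.004,
    tts_per_1k_chars=0.015,
    llm_in_per_1k_tokens=0.002,
    llm_out_per_1k_tokens=0.008,
)
naive = CallProfile(
    turns_per_min=4,
    context_tokens_per_turn=3000,
    output_tokens_per_turn=60,
    chars_spoken_per_turn=250,
)
trimmed = replace(naive, context_tokens_per_turn=1000)  # same call, tighter context

naive_cost = cost_per_minute(pricing, naive)
trimmed_cost = cost_per_minute(pricing, trimmed)
print(naive_cost["total"], trimmed_cost["total"])
```

With these placeholder figures, the only change between the two profiles is how much context is sent on each turn, which is why knowledge base context length shows up as a first-order cost variable.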
Caching is your most powerful cost lever. If a significant portion of your calls follow similar patterns, similar questions, similar knowledge base lookups, you can cache the context retrieval and save meaningful LLM tokens. We’ve reduced effective per-call costs significantly this way. The setup is not complicated. Most teams building their first voice agent don’t think about it until they see a billing statement that forces the conversation.
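A minimal sketch of caching the retrieval step, assuming exact-match lookup on a normalized version of the question and a small least-recently-used eviction policy. The `RetrievalCache` class and the `retrieve` callable are hypothetical; a production setup might match on semantic similarity instead, and this sketch does not cover caching the model’s responses.

```python
import re
from collections import OrderedDict
from typing import Callable, List


def normalize(question: str) -> str:
    # Lowercase and drop punctuation so trivially different phrasings share a key.
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


class RetrievalCache:
    def __init__(self, retrieve: Callable[[str], List[str]], max_entries: int = 1000) -> None:
        self._retrieve = retrieve
        self._store: "OrderedDict[str, List[str]]" = OrderedDict()
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    def get(self, question: str) -> List[str]:
        key = normalize(question)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._retrieve(question)
        self._store[key] = value
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value


calls: List[str] = []


def slow_lookup(question: str) -> List[str]:
    # Stand-in for an embedding call plus a vector database query.
    calls.append(question)
    return ["We are open 9am to 5pm on weekdays."]


cache = RetrievalCache(slow_lookup, max_entries=2)
cache.get("What are your hours?")
cache.get("what are your HOURS")
print(cache.hits, cache.misses, len(calls))
```

Tracking hits and misses from day one tells you whether your calls are actually repetitive enough for caching to pay off.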
What I’ve learned building in this space
The most durable insight from building Speak AI is that the closer your voice AI product is to the user’s actual workflow, the more value it delivers and the stickier it becomes. Generic transcription is a commodity. Transcription that’s integrated into how a specific type of researcher captures, analyses, and shares their data is a product people build their work around.
The same principle applies to voice agents. A phone agent that handles appointment scheduling for a specific type of medical practice, built around the actual vocabulary and questions and exceptions of that context, will outperform a generic scheduling agent every time. The work of understanding the domain is most of the work.
Voice AI is not a single technology. It’s a set of components that, put together thoughtfully and tuned to a specific context, can produce something that genuinely changes how people interact with software. That’s still early, still being figured out, and worth paying close attention to.
The questions worth asking before you build
Before starting on a voice agent, there are a few questions that will save a lot of iteration if you answer them honestly first.
Is voice the right modality for this use case? Voice works well for tasks where the user’s hands are occupied, where the interaction is short enough to hold in working memory, or where the natural conversation format is genuinely better than a form or menu. It works poorly for tasks involving complex visual information, multi-step forms with lookup requirements, or contexts where the user is in a public space and doesn’t want to speak aloud. Starting with the modality because it’s interesting is different from starting with a use case that’s genuinely better served by voice.
What does success look like and how will you measure it? Task completion rate, escalation rate (how often does the voice agent fail and transfer to a human), session length, and user satisfaction score are the metrics that tell you whether a voice agent is actually working. Define these before you build. The tendency is to instrument after launch, by which point you’ve lost the early data you need to understand baseline performance.
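These metrics are simple enough to compute from a per-call log on day one. The sketch below assumes a hypothetical `Session` record with one row per call; the field names and the 1 to 5 satisfaction scale are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    completed: bool              # did the caller get what they came for
    escalated: bool              # was the call transferred to a human
    duration_s: float
    csat: Optional[int] = None   # 1 to 5, None if the caller gave no rating


def summarize(sessions: List[Session]) -> dict:
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    rated = [s.csat for s in sessions if s.csat is not None]
    return {
        "task_completion_rate": sum(s.completed for s in sessions) / n,
        "escalation_rate": sum(s.escalated for s in sessions) / n,
        "avg_session_s": sum(s.duration_s for s in sessions) / n,
        # Average only over calls that were actually rated.
        "avg_csat": sum(rated) / len(rated) if rated else None,
    }


stats = summarize(
    [
        Session(completed=True, escalated=False, duration_s=60, csat=5),
        Session(completed=True, escalated=False, duration_s=90, csat=4),
        Session(completed=False, escalated=True, duration_s=150),
        Session(completed=True, escalated=False, duration_s=100, csat=3),
    ]
)
print(stats)
```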
What happens when it fails? The failure path in a voice agent is not a 404 page. It’s a person who is frustrated, possibly in a time-sensitive situation, and needs to get to a resolution. How the agent handles its own limitations, gracefully acknowledging that it can’t help and routing to the right alternative, is as important as how it handles the cases it can handle. Design the failure path with the same care you design the success path.
One more thing worth saying: most of the interesting work in voice AI right now is not being done by the companies getting the most press coverage. It’s being done by teams building narrowly scoped, deeply tuned agents for specific industries. Healthcare intake, legal client intake, real estate qualification, field service dispatch. These are not glamorous applications. They are genuinely valuable ones. The companies building them are often profitable before they’re well-known. That’s the version of the voice AI story that I think matters most in the next few years.