Voice AI Latency: Benchmarks, Budgets, and How to Hit Sub-Second
Rayvoc Team
In human conversation, the gap between one person finishing and the other starting is 200–300 milliseconds. It’s one of the most stable findings in linguistics — consistent across languages and cultures. Our brains are tuned to it, which is why anything slower feels off long before we can articulate why.
For voice AI, the practical thresholds look like this: under ~500ms feels genuinely conversational. Up to ~800ms feels like a thoughtful human pause. Past one second, callers start assuming the line dropped — they say “hello?”, talk over the agent, or hang up. Past two seconds, you’re not having a conversation; you’re using a voice-activated IVR.
Measured platform latencies in 2026 range from roughly 600ms at the fast end (Retell) to 950–1450ms for typical configurations. That spread is the difference between an agent callers mistake for a person and one they immediately clock as a robot. This guide breaks down where the milliseconds go, the six levers that pull them out, and how to measure latency without lying to yourself.
First, agree on the metric: voice-to-voice TTFA
The only honest latency number is voice-to-voice time to first audio: the elapsed time from the caller’s last syllable to the first audible syllable of the agent’s reply, measured at the caller’s ear — including the telephone network.
Vendors quote all sorts of partial numbers: model time-to-first-token, TTS synthesis latency, “API response time.” Each is real, but none is what the caller experiences. A platform can have a 300ms model and still deliver a 1.4-second turn if audio transport, endpointing, and synthesis buffering eat the rest.
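To make the definition concrete, here is a minimal sketch of how voice-to-voice TTFA could be read off a caller-side recording. It assumes a two-channel recording of a single turn that has already been reduced to per-frame energy values; the function name, the 20ms frame size, and the 0.1 energy threshold are illustrative choices, not part of any particular platform.

```python
from typing import Optional, Sequence


def voice_to_voice_ttfa_ms(
    caller_energy: Sequence[float],
    agent_energy: Sequence[float],
    frame_ms: int = 20,       # assumed frame size
    threshold: float = 0.1,   # assumed "this frame contains speech" level
) -> Optional[int]:
    """Gap between the end of caller speech and the start of agent audio.

    Both inputs are per-frame energy values for one turn, taken from the two
    channels of a recording made at the caller's side.
    """
    caller_active = [i for i, e in enumerate(caller_energy) if e > threshold]
    if not caller_active:
        return None  # the caller never spoke
    caller_end_frame = caller_active[-1] + 1  # first silent frame after speech

    for i in range(caller_end_frame, len(agent_energy)):
        if agent_energy[i] > threshold:
            return (i - caller_end_frame) * frame_ms
    return None  # the agent never answered within the recording


# Caller talks for 10 frames; agent audio first appears at frame 35.
caller = [0.5] * 10 + [0.0] * 40
agent = [0.0] * 35 + [0.6] * 15
ttfa = voice_to_voice_ttfa_ms(caller, agent)
```

Because both channels come from the same caller-side recording, the telephone network leg is included in the result automatically.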
The latency budget: where a turn’s milliseconds go
A pipeline-architecture voice agent turn passes through five stages. Here’s a realistic budget for each, with typical and optimized ranges:
| Stage | What happens | Typical range | Optimized |
|---|---|---|---|
| Endpointing / turn detection | Deciding the caller is done speaking | 300–800ms | 100–250ms |
| STT finalization | Closing out the streaming transcript | 100–300ms | 50–150ms |
| LLM time-to-first-token | Model starts generating | 300–800ms | 150–350ms |
| TTS time-to-first-byte | First audio chunk synthesized | 100–300ms | 50–150ms |
| Network & media transport | Audio to/from the caller (PSTN/SIP) | 50–200ms | 30–100ms |
| Voice-to-voice total | Sum of all five stages | 850–2400ms | 380–1000ms |
Two structural observations:
LLM time-to-first-token is the dominant slice — typically the largest single contributor and by far the most variable. It depends on model size, prompt length, tool-call overhead, provider load, and the physical distance between your media servers and the inference endpoint.
Endpointing is the hidden tax. A naive 700ms silence timeout adds 700ms to every single turn before anything else even starts. It’s the most commonly ignored stage and often the cheapest to fix.
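The totals in the budget table are plain sums of the per-stage ranges, which is easy to check. The sketch below copies the ranges from the table; the dictionary keys and helper names are made up for illustration.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low_ms, high_ms)

# Per-stage ranges copied from the budget table above.
TYPICAL: Dict[str, Range] = {
    "endpointing": (300, 800),
    "stt_finalization": (100, 300),
    "llm_ttft": (300, 800),
    "tts_ttfb": (100, 300),
    "transport": (50, 200),
}
OPTIMIZED: Dict[str, Range] = {
    "endpointing": (100, 250),
    "stt_finalization": (50, 150),
    "llm_ttft": (150, 350),
    "tts_ttfb": (50, 150),
    "transport": (30, 100),
}


def total_range(budget: Dict[str, Range]) -> Range:
    """Best-case and worst-case voice-to-voice totals for a budget."""
    return (
        sum(low for low, _ in budget.values()),
        sum(high for _, high in budget.values()),
    )


def worst_case_share(budget: Dict[str, Range], stage: str) -> float:
    """Fraction of the worst-case turn consumed by one stage."""
    return budget[stage][1] / total_range(budget)[1]


typical_total = total_range(TYPICAL)      # (850, 2400)
optimized_total = total_range(OPTIMIZED)  # (380, 1000)
```

In the typical worst case, endpointing and LLM time-to-first-token each account for a third of the turn (800 of 2400ms), which is why those two stages get most of the attention.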
Six levers to pull latency out
1. Stream everything
No stage should wait for the previous one to finish. STT emits partial transcripts while the caller is still talking; the LLM starts generating on stable partials; TTS starts synthesizing on the first sentence while the model is still writing the second. A non-streaming handoff anywhere in the chain adds the full duration of that stage to your turn time.
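A simplified model shows why one blocking handoff matters. In this sketch, a stage that streams contributes only the time to its first output, while a stage that does not stream contributes its full duration. The stage timings are invented for illustration and the model ignores overlap between stages, so treat it as a rough approximation rather than a simulator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int   # time until the stage emits its first chunk
    full_duration_ms: int  # time until the stage has emitted everything
    streams: bool          # True if the next stage may start on the first chunk


def time_to_first_audio_ms(stages: Sequence[Stage]) -> int:
    """Sum each stage's contribution to the wait before the first audio."""
    return sum(
        s.first_output_ms if s.streams else s.full_duration_ms for s in stages
    )


# Illustrative (made-up) timings for one turn.
fully_streamed = [
    Stage("stt", 100, 150, streams=True),
    Stage("llm", 250, 1200, streams=True),
    Stage("tts", 100, 900, streams=True),
]
# Same pipeline, but TTS waits for the complete LLM response.
blocking_llm = [
    Stage("stt", 100, 150, streams=True),
    Stage("llm", 250, 1200, streams=False),
    Stage("tts", 100, 900, streams=True),
]

streamed_ms = time_to_first_audio_ms(fully_streamed)  # 450
blocking_ms = time_to_first_audio_ms(blocking_llm)    # 1400
```

With these numbers, a single non-streaming handoff roughly triples the wait, even though no individual stage got slower.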
2. Co-locate media and inference
Every network hop between your media server, STT, LLM, and TTS adds round-trip time and, worse, jitter. A platform whose media layer sits in one cloud while calling out to models in another pays 30–80ms per hop, several hops per turn. Putting media processing and model inference in the same region (ideally the same datacenter) removes most of that tax.
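The per-hop cost compounds quickly. As a back-of-the-envelope sketch, assume three cross-cloud hops per turn (media to STT, to LLM, to TTS); the hop count is an assumption, since the real number depends on the architecture.

```python
from typing import Tuple


def hop_overhead_ms(hops_per_turn: int, rtt_ms_per_hop: float) -> float:
    """Network overhead added to one turn by cross-service round trips."""
    return hops_per_turn * rtt_ms_per_hop


def overhead_range_ms(
    hops_per_turn: int, rtt_range_ms: Tuple[float, float]
) -> Tuple[float, float]:
    """Best-case and worst-case overhead for a range of per-hop round trips."""
    low, high = rtt_range_ms
    return (
        hop_overhead_ms(hops_per_turn, low),
        hop_overhead_ms(hops_per_turn, high),
    )


# Three hops (assumed) at the 30-80ms per hop quoted above.
cross_cloud = overhead_range_ms(3, (30, 80))  # (90, 240)
```

At the upper end, that is about a quarter of a second spent purely on network round trips before any model work is counted.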
3. Go speech-to-speech where it fits
Native speech-to-speech models collapse STT → LLM → TTS into a single model that consumes and produces audio directly. The savings can be real, but they depend heavily on the model: Grok’s voice model measures ~0.78s TTFA in independent testing, versus ~1.49s for GPT-4o Realtime, which is no faster than a typical pipeline. You trade some component-level control (specific voices, swappable STT) for raw speed. The right answer is often per-use-case: speech-to-speech for latency-critical conversational flows, pipeline for flows that need a specific voice or model. Rayvoc supports both architectures on the same platform.
4. Use semantic endpointing, not silence timers
Fixed silence thresholds force a brutal trade-off: short timeouts cut off callers mid-thought (“my account number is… four four two…”), long ones add dead air to every turn. Semantic endpointing combines voice activity detection with a model that judges whether the utterance is complete — “What’s my balance?” ends a turn; “What’s my…” doesn’t. This routinely recovers 300–500ms per turn while reducing interruptions. Pair it with barge-in so callers can interrupt naturally when the agent misjudges.
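The decision logic can be sketched in a few lines. In this illustration a toy heuristic stands in for the completeness model (a real system would use a trained classifier), and the 200ms and 1200ms thresholds are assumed values, not recommendations.

```python
# Words that rarely end a finished utterance (toy list, for illustration only).
DANGLING_WORDS = {"my", "the", "a", "an", "is", "and", "to", "of", "for", "or"}


def looks_complete(transcript: str) -> bool:
    """Toy stand-in for a model that judges whether an utterance is finished."""
    text = transcript.strip().lower()
    if not text or text.endswith(("...", "\u2026", ",")):
        return False  # trailing ellipsis or comma: the caller is mid-thought
    words = text.rstrip(".?!").split()
    return bool(words) and words[-1] not in DANGLING_WORDS


def should_end_turn(
    transcript: str,
    silence_ms: int,
    fast_ms: int = 200,       # assumed short pause for complete utterances
    fallback_ms: int = 1200,  # assumed hard timeout for incomplete ones
) -> bool:
    """End the turn quickly if the utterance is complete, slowly otherwise."""
    if silence_ms >= fallback_ms:
        return True  # long silence: yield the turn regardless of content
    return silence_ms >= fast_ms and looks_complete(transcript)


done = should_end_turn("What's my balance?", silence_ms=250)        # True
waiting = should_end_turn("my account number is", silence_ms=500)   # False
```

The complete question ends the turn after a short pause, while the unfinished one keeps the agent listening until the fallback timeout.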
5. Pick (and measure) your model deliberately
Time-to-first-token varies enormously across models and providers — and across times of day on shared endpoints. A smaller model with a tight prompt often beats a frontier model with a 4,000-token system prompt for both latency and task success on structured calls. If your platform supports bringing your own model, benchmark candidates with production-shaped prompts, not “hello world.”
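A benchmark harness for this can be very small. The sketch below times how long a streaming call takes to yield its first token; `fake_model_stream` is a hypothetical stand-in for a real provider client, which you would replace with a call that sends your production system prompt and tools.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Milliseconds from issuing the request to receiving the first token."""
    started = time.perf_counter()
    stream = iter(start_stream())
    next(stream, None)  # blocks until the first token (or end of stream)
    return (time.perf_counter() - started) * 1000.0


def fake_model_stream(ttft_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call."""
    time.sleep(ttft_s)  # simulated queueing and prompt processing
    yield "Hello"
    yield " there"


ttft_ms = measure_ttft_ms(lambda: fake_model_stream(0.05))
```

Run it many times per candidate, at the hours you actually take calls, and keep every sample rather than a single number.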
6. Deploy regionally
Speed of light is non-negotiable. A caller in Frankfurt routed through a US-East media server pays roughly 90ms of network round-trip time on every turn before any AI runs. If your callers are international, your media ingress, and ideally your inference, should be too. This is also an underrated argument for BYOC (bring your own carrier): your carrier’s local breakout often beats a platform’s single-region telephony.
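The physical floor is easy to estimate: great-circle distance divided by the speed of light in fiber, which is roughly two thirds of its speed in a vacuum. The coordinates below are approximate and the result is a lower bound, not a prediction, since real routes are longer and add switching delay.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # light in glass covers roughly 200 km per millisecond

# Approximate (latitude, longitude) pairs, for illustration.
FRANKFURT: Tuple[float, float] = (50.11, 8.68)
NORTHERN_VIRGINIA: Tuple[float, float] = (39.04, -77.49)


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points on the Earth's surface."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_one_way_ms(distance_km: float) -> float:
    """Theoretical minimum one-way propagation delay over fiber."""
    return distance_km / FIBER_KM_PER_MS


distance = great_circle_km(*FRANKFURT, *NORTHERN_VIRGINIA)
floor_ms = min_one_way_ms(distance)
```

For this pair the distance comes out at roughly 6,500 km, so no amount of engineering gets the one-way delay much below the low 30s of milliseconds. Only moving the media server closer to the caller does.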
Pipeline vs. speech-to-speech: a quick decision frame
| | Pipeline (STT→LLM→TTS) | Speech-to-speech |
|---|---|---|
| Typical TTFA | 800–1500ms (600ms well-optimized) | ~780ms (Grok) – 1490ms (GPT-4o Realtime) |
| Component choice | Full — swap any STT/LLM/TTS | Locked to the model |
| Voice control | Any TTS voice, cloning | Model’s built-in voices |
| Cost control | Optimize each layer | Single (often premium) rate |
| Best for | Complex tool-calling flows, specific voices | Latency-critical natural conversation |
How to measure honestly
Most published latency numbers are somewhere between optimistic and fictional. To benchmark properly:
- Measure voice-to-voice TTFA at the caller’s side — record the call from a real phone, not from inside the platform. The PSTN leg is part of the experience.
- Report p50 and p95, never the average. Latency distributions are long-tailed; the mean hides the 2-second turns that make callers hang up. A platform with 700ms p50 and 2100ms p95 feels worse than one with 850ms p50 and 1100ms p95.
- Test over real phone calls, not WebRTC demos. Browser demos skip the carrier network and routinely flatter results by 100–250ms.
- Test with production-shaped prompts — your real system prompt, tools, and context length. Time-to-first-token scales with input size.
- Test at your peak hours. Shared inference endpoints degrade under load; a benchmark run at 3am tells you nothing about Monday 9am.
- Track per-stage timings continuously in production. Latency is not a launch checklist item; it regresses every time someone lengthens the prompt.
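The p50/p95 point is worth seeing in numbers. This sketch uses a nearest-rank percentile and two synthetic sample sets built to match the example above (700ms p50 with 2100ms p95, versus 850ms p50 with 1100ms p95); the data is invented for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic per-turn TTFA samples in milliseconds (20 turns each).
spiky = [700] * 18 + [2100] * 2
steady = [850] * 18 + [1100] * 2

spiky_stats = (percentile(spiky, 50), percentile(spiky, 95), mean(spiky))
steady_stats = (percentile(steady, 50), percentile(steady, 95), mean(steady))
# spiky:  p50=700, p95=2100, mean=840
# steady: p50=850, p95=1100, mean=875
```

Judged by the mean alone, the spiky platform looks faster (840ms versus 875ms), yet one turn in ten takes over two seconds. The p95 is what exposes it.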
Latency also isn’t free to buy down — faster models and premium infrastructure show up on the invoice. For how the layers price out, see our pricing teardown.
Where Rayvoc fits
Rayvoc is engineered end to end for sub-second voice-to-voice response: streaming at every stage, media servers co-located with inference, native speech-to-speech support (including Grok voice at ~0.78s TTFA, with auto-detection across 20 languages), semantic endpointing, and a telecom layer that’s part of the same stack — no extra hop to a third-party carrier API. Every call in the dashboard shows a per-stage latency waterfall, so you’re working from measurements, not marketing. Read more on the low-latency platform page.
We’re pre-launch — join the waitlist and you’ll get a 14-day trial with a real phone number to measure it yourself.