Voice AI Latency: Benchmarks, Budgets, and How to Hit Sub-Second

June 10, 2026 · Rayvoc Team

In human conversation, the gap between one person finishing and the other starting is 200–300 milliseconds. It’s one of the most stable findings in linguistics — consistent across languages and cultures. Our brains are tuned to it, which is why anything slower feels off long before we can articulate why.

For voice AI, the practical thresholds look like this: under ~500ms feels genuinely conversational. Up to ~800ms feels like a thoughtful human pause. Past one second, callers start assuming the line dropped — they say “hello?”, talk over the agent, or hang up. Past two seconds, you’re not having a conversation; you’re using a voice-activated IVR.

Measured platform latencies in 2026 range from roughly 600ms at the fast end (Retell) to 950–1450ms for typical configurations. That spread is the difference between an agent callers mistake for a person and one they immediately clock as a robot. This guide breaks down where the milliseconds go, the six levers that pull them out, and how to measure latency without lying to yourself.

First, agree on the metric: voice-to-voice TTFA

The only honest latency number is voice-to-voice time to first audio: the elapsed time from the caller’s last syllable to the first audible syllable of the agent’s reply, measured at the caller’s ear — including the telephone network.

Vendors quote all sorts of partial numbers: model time-to-first-token, TTS synthesis latency, “API response time.” Each is real, but none is what the caller experiences. A platform can have a 300ms model and still deliver a 1.4-second turn if audio transport, endpointing, and synthesis buffering eat the rest.

The latency budget: where a turn’s milliseconds go

A pipeline-architecture voice agent turn passes through five stages. Here’s a realistic budget for each, with typical and optimized ranges:

Stage	What happens	Typical range	Optimized
Endpointing / turn detection	Deciding the caller is done speaking	300–800ms	100–250ms
STT finalization	Closing out the streaming transcript	100–300ms	50–150ms
LLM time-to-first-token	Model starts generating	300–800ms	150–350ms
TTS time-to-first-byte	First audio chunk synthesized	100–300ms	50–150ms
Network & media transport	Audio to/from the caller (PSTN/SIP)	50–200ms	30–100ms
Voice-to-voice total		850–2400ms	380–1000ms

Two structural observations:

LLM time-to-first-token is the dominant slice — typically the largest single contributor and by far the most variable. It depends on model size, prompt length, tool-call overhead, provider load, and the physical distance between your media servers and the inference endpoint.

Endpointing is the hidden tax. A naive 700ms silence timeout adds 700ms to every single turn before anything else even starts. It’s the most commonly ignored stage and often the cheapest to fix.
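The totals in the budget table can be checked by summing each column. The sketch below copies the table's figures and assumes the stages run strictly back to back, which is the worst case for a pipeline; the names `BUDGET_MS`, `totals`, and `share_of_worst_case` are illustrative only.

```python
# Stage budgets in milliseconds, copied from the table above:
# (typical_low, typical_high, optimized_low, optimized_high)
BUDGET_MS = {
    "endpointing": (300, 800, 100, 250),
    "stt_finalization": (100, 300, 50, 150),
    "llm_ttft": (300, 800, 150, 350),
    "tts_ttfb": (100, 300, 50, 150),
    "network_transport": (50, 200, 30, 100),
}


def totals(budget):
    """Sum each column across stages, assuming stages run back to back."""
    columns = zip(*budget.values())
    return tuple(sum(col) for col in columns)


def share_of_worst_case(budget, stage):
    """Fraction of the typical worst-case turn spent in one stage."""
    typical_high_total = totals(budget)[1]
    return budget[stage][1] / typical_high_total


print(totals(BUDGET_MS))  # (850, 2400, 380, 1000)
print(round(share_of_worst_case(BUDGET_MS, "endpointing"), 2))  # 0.33
```

In the typical worst case, endpointing alone accounts for about a third of the turn, which is why it is worth tuning before touching the model.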

Six levers to pull latency out

1. Stream everything

No stage should wait for the previous one to finish. STT emits partial transcripts while the caller is still talking; the LLM starts generating on stable partials; TTS starts synthesizing on the first sentence while the model is still writing the second. A non-streaming handoff anywhere in the chain adds the full duration of that stage to your turn time.
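One concrete piece of this is the LLM-to-TTS handoff. The sketch below shows a minimal way to forward each sentence to synthesis as soon as it is complete, instead of waiting for the whole reply. `fake_llm_stream` is a hypothetical stand-in for a streaming model response, and the punctuation rule is deliberately naive (it would split on the period in "$42.50", for example).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no punctuation


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical token stream standing in for a streaming LLM response."""
    yield from ["Your", " balance", " is", " $42", ".", " Anything", " else", "?"]


for sentence in sentences_from_tokens(fake_llm_stream()):
    # In a real pipeline this is where the sentence would go to TTS.
    print("send to TTS:", sentence)
```

Because the function is a generator, the first sentence is available after five tokens; the model can still be writing the second while the first is being synthesized.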

2. Co-locate media and inference

Every network hop between your media server, STT, LLM, and TTS adds round-trip time and, worse, jitter. A platform whose media layer sits in one cloud while calling out to models in another pays 30–80ms per hop, several hops per turn. Putting media processing and model inference in the same region (ideally the same datacenter) removes most of that tax.
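To see how quickly the per-hop cost adds up, here is a back-of-the-envelope calculation. The 30–80ms figures come from the paragraph above; the count of three hops per turn (one each to STT, LLM, and TTS) is an assumption for illustration, and jitter is ignored.

```python
def hop_overhead_ms(per_hop_ms: float, hops_per_turn: int) -> float:
    """Network time added to one turn by cross-cloud hops, ignoring jitter."""
    return per_hop_ms * hops_per_turn


HOPS_PER_TURN = 3  # assumption: one hop each to STT, LLM, and TTS

low = hop_overhead_ms(30, HOPS_PER_TURN)
high = hop_overhead_ms(80, HOPS_PER_TURN)
print(f"{low:.0f}-{high:.0f}ms per turn")  # 90-240ms per turn
```

Under that assumption, cross-cloud hops alone can consume a large share of a 500ms conversational target.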

3. Go speech-to-speech where it fits

Native speech-to-speech models collapse STT → LLM → TTS into a single model that consumes and produces audio directly. The savings are real: Grok’s voice model measures ~0.78s TTFA in independent testing, versus ~1.49s for GPT-4o Realtime. You trade some component-level control (specific voices, swappable STT) for raw speed. The right answer is often per-use-case: speech-to-speech for latency-critical conversational flows, pipeline for flows that need a specific voice or model. Rayvoc supports both architectures on the same platform.

4. Use semantic endpointing, not silence timers

Fixed silence thresholds force a brutal trade-off: short timeouts cut off callers mid-thought (“my account number is… four four two…”), long ones add dead air to every turn. Semantic endpointing combines voice activity detection with a model that judges whether the utterance is complete — “What’s my balance?” ends a turn; “What’s my…” doesn’t. This routinely recovers 300–500ms per turn while reducing interruptions. Pair it with barge-in so callers can interrupt naturally when the agent misjudges.
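The idea can be sketched in a few lines: pick the silence timeout based on whether the partial transcript looks finished. This is only a toy, assuming English input. The `looks_complete` word-list heuristic stands in for the completeness model described above, and both thresholds are hypothetical values, not recommendations.

```python
# Hypothetical thresholds; tune against real call recordings.
SHORT_TIMEOUT_MS = 200   # used when the utterance looks complete
LONG_TIMEOUT_MS = 1200   # used when the caller seems mid-thought

# Words that rarely end a finished English utterance (toy heuristic).
DANGLING_WORDS = {"my", "the", "a", "an", "is", "and", "or", "to", "of", "for"}


def looks_complete(partial_transcript: str) -> bool:
    """Toy stand-in for a completeness model: is the last word dangling?"""
    words = partial_transcript.lower().strip(" .?!…").split()
    if not words:
        return False
    return words[-1].strip(".,…") not in DANGLING_WORDS


def end_of_turn(partial_transcript: str, silence_ms: int) -> bool:
    """Combine VAD silence duration with the completeness judgment."""
    if looks_complete(partial_transcript):
        timeout = SHORT_TIMEOUT_MS
    else:
        timeout = LONG_TIMEOUT_MS
    return silence_ms >= timeout


print(end_of_turn("What's my balance?", silence_ms=250))  # True
print(end_of_turn("What's my…", silence_ms=250))          # False
```

A word list like this misjudges plenty of real utterances (a half-dictated account number, for instance), which is why production systems use a trained model for the completeness check and keep barge-in as the safety net.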

5. Pick (and measure) your model deliberately

Time-to-first-token varies enormously across models and providers — and across times of day on shared endpoints. A smaller model with a tight prompt often beats a frontier model with a 4,000-token system prompt for both latency and task success on structured calls. If your platform supports bringing your own model, benchmark candidates with production-shaped prompts, not “hello world.”
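A minimal harness for this kind of benchmark only needs to time the gap between issuing a request and receiving the first streamed token. In the sketch below, `fake_model_stream` is a hypothetical stand-in with simulated delays; in practice you would pass a function that opens your provider's streaming call with your real system prompt and tools.

```python
import time
from typing import Callable, Iterable


def time_to_first_token_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Milliseconds from issuing the request until the first token arrives."""
    started = time.perf_counter()
    for _ in start_stream():
        # Return on the very first token; later tokens are not timed.
        return (time.perf_counter() - started) * 1000.0
    raise RuntimeError("stream ended without producing a token")


def fake_model_stream():
    """Hypothetical stand-in for a provider's streaming call."""
    time.sleep(0.05)  # simulated time-to-first-token
    yield "Hello"
    time.sleep(0.5)   # later tokens do not affect the measurement
    yield " there"


samples = [time_to_first_token_ms(fake_model_stream) for _ in range(5)]
print([round(s) for s in samples])
```

Run it at different times of day and keep the raw samples, since the spread matters as much as the typical value.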

6. Deploy regionally

Speed of light is non-negotiable. A caller in Frankfurt routed through a US-East media server pays roughly 90ms of round-trip network time before any AI runs. If your callers are international, your media ingress, and ideally your inference, should be too. This is also an underrated argument for BYOC (bring your own carrier): your carrier’s local breakout often beats a platform’s single-region telephony.
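The physical floor is easy to estimate. The sketch below assumes light travels at roughly 200,000 km/s in fiber and uses an approximate great-circle distance of 6,500 km between Frankfurt and the US East Coast; both constants are assumptions for illustration, and real routes are longer than the great-circle path, so measured latency sits above this floor.

```python
FIBER_SPEED_KM_PER_S = 200_000   # roughly two-thirds of c in glass (assumption)
FRANKFURT_TO_US_EAST_KM = 6_500  # approximate great-circle distance (assumption)


def one_way_floor_ms(distance_km: float) -> float:
    """Lower bound on one-way propagation delay over fiber, in milliseconds."""
    return distance_km / FIBER_SPEED_KM_PER_S * 1000


one_way = one_way_floor_ms(FRANKFURT_TO_US_EAST_KM)
print(f"one-way floor: {one_way:.1f}ms, round-trip floor: {2 * one_way:.1f}ms")
```

No amount of software optimization gets below this number; only moving the media server closer to the caller does.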

Pipeline vs. speech-to-speech: a quick decision frame

	Pipeline (STT→LLM→TTS)	Speech-to-speech
Typical TTFA	800–1500ms (600ms well-optimized)	~780ms (Grok) – 1490ms (GPT-4o Realtime)
Component choice	Full — swap any STT/LLM/TTS	Locked to the model
Voice control	Any TTS voice, cloning	Model’s built-in voices
Cost control	Optimize each layer	Single (often premium) rate
Best for	Complex tool-calling flows, specific voices	Latency-critical natural conversation

How to measure honestly

Most published latency numbers are somewhere between optimistic and fictional. To benchmark properly:

Measure voice-to-voice TTFA at the caller’s side — record the call from a real phone, not from inside the platform. The PSTN leg is part of the experience.
Report p50 and p95, never the average. Latency distributions are long-tailed; the mean hides the 2-second turns that make callers hang up. A platform with 700ms p50 and 2100ms p95 feels worse than one with 850ms p50 and 1100ms p95.
Test over real phone calls, not WebRTC demos. Browser demos skip the carrier network and routinely flatter results by 100–250ms.
Test with production-shaped prompts — your real system prompt, tools, and context length. Time-to-first-token scales with input size.
Test at your peak hours. Shared inference endpoints degrade under load; a benchmark run at 3am tells you nothing about Monday 9am.
Track per-stage timings continuously in production. Latency is not a launch checklist item; it regresses every time someone lengthens the prompt.
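To make the p50/p95 point concrete, here is a small sketch using the nearest-rank percentile method on made-up TTFA samples. The data is invented for illustration: eighteen fast turns and two slow ones.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n), 1-indexed."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up voice-to-voice TTFA samples in ms: mostly fast, with a slow tail.
ttfa_ms = [650, 680, 700, 700, 710, 720, 690, 705, 695, 700,
           715, 685, 700, 710, 690, 700, 705, 695, 2100, 2300]

print("mean:", sum(ttfa_ms) / len(ttfa_ms))  # 847.5
print("p50:", percentile(ttfa_ms, 50))       # 700
print("p95:", percentile(ttfa_ms, 95))       # 2100
```

The mean of 847.5ms describes no turn the caller actually experienced: most turns took about 700ms, and the ones that drive hang-ups took over two seconds. Only the p50/p95 pair shows both.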

Latency also isn’t free to buy down — faster models and premium infrastructure show up on the invoice. For how the layers price out, see our pricing teardown.

Where Rayvoc fits

Rayvoc is engineered end to end for sub-second voice-to-voice response: streaming at every stage, media servers co-located with inference, native speech-to-speech support (including Grok voice at ~0.78s TTFA, with auto-detection across 20 languages), semantic endpointing, and a telecom layer that’s part of the same stack — no extra hop to a third-party carrier API. Every call in the dashboard shows a per-stage latency waterfall, so you’re working from measurements, not marketing. Read more on the low-latency platform page.

We’re pre-launch — join the waitlist and you’ll get a 14-day trial with a real phone number to measure it yourself.