rayvoc.ai

Architecture

What happens when your agent picks up

From the first ring to the last word — a technical tour of the Rayvoc stack: telephony ingress, the media layer, the two model architectures we support, and the engineering that keeps every turn under a second.

Call flow (diagram):
Caller (any phone)
→ Rayvoc DID (our carrier network) or BYOC trunk (your carrier, via SIP)
→ Media layer (streaming · VAD · barge-in)
→ Pipeline: STT → LLM → TTS (bring your own engines, every stage streams) or Speech-to-speech model (e.g. Grok voice · one model, lowest latency)

Inbound shown; outbound is the same path in reverse, initiated by the campaign engine. Tools, knowledge bases, and transfer logic attach at the model layer.

1. The call arrives — on our network or yours

Inbound calls hit either a Rayvoc DID on our carrier network or your own carrier via a BYOC SIP trunk. Either way, the call lands on a media server in the region closest to the caller. Because the telecom layer is part of the platform — not a third-party API we poll — call control events (answer, DTMF, transfer, hangup) are handled in-process with no extra network hop.

2. The media layer turns sound into structure

The media layer manages the audio streams in both directions. It runs voice activity detection to know when someone is speaking, semantic endpointing to know when they’ve finished, and barge-in handling so a caller can interrupt the agent mid-sentence and the agent stops talking instead of bulldozing on.
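As a rough illustration of barge-in, the sketch below flags an interruption once caller speech persists for a few consecutive frames while the agent is talking. It is not Rayvoc's implementation: the energy threshold stands in for a real VAD model, semantic endpointing is not covered, and the names `BargeInDetector`, `energy_threshold`, and `min_speech_frames` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one audio frame (samples in [-1.0, 1.0])."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


@dataclass
class BargeInDetector:
    """Flags barge-in once caller speech lasts `min_speech_frames` frames in a row."""
    energy_threshold: float = 0.01  # hypothetical tuning value, not a Rayvoc default
    min_speech_frames: int = 3      # debounce so a cough or click is ignored
    _speech_run: int = field(default=0, init=False, repr=False)

    def caller_interrupted(self, frame: List[float], agent_speaking: bool) -> bool:
        # Count consecutive frames whose energy clears the threshold.
        if frame_energy(frame) >= self.energy_threshold:
            self._speech_run += 1
        else:
            self._speech_run = 0
        # Only an interruption if the agent is currently producing audio.
        return agent_speaking and self._speech_run >= self.min_speech_frames
```

When `caller_interrupted` returns true, the caller would stop agent playback and hand the floor back to the caller. The debounce trades a few frames of delay for fewer false stops.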

3. Two model architectures, one platform

Pipeline mode chains three models: speech-to-text transcribes the caller, your LLM decides what to say, and text-to-speech says it. Every stage streams — the LLM starts on partial transcripts, synthesis starts on the first sentence. You choose each component: any OpenAI-compatible LLM, any TTS or STT engine, or our managed defaults.
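To make "synthesis starts on the first sentence" concrete, here is a minimal sketch using Python generators. The LLM and TTS functions are stand-ins (`fake_llm_stream` and `fake_tts` are hypothetical, not real engine APIs). The point is only that the first sentence is handed to synthesis before the model has finished generating.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check; abbreviations like "Dr." would need more care.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment without punctuation


def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical)."""
    yield from ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]


def fake_tts(sentence: str) -> bytes:
    """Stand-in for a TTS engine; returns placeholder bytes instead of audio."""
    return sentence.encode("utf-8")


def speak(tokens: Iterable[str]) -> Iterator[bytes]:
    """Synthesize sentence by sentence while tokens are still arriving."""
    for sentence in sentences_from_tokens(tokens):
        yield fake_tts(sentence)
```

Calling `next(speak(fake_llm_stream()))` returns audio for the first sentence after consuming only six tokens, which is the effect streaming is meant to achieve.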

Speech-to-speech mode uses a single model that listens and speaks natively, with no transcription step at all. With Grok’s voice models this also brings automatic language detection across 20 languages and the lower time-to-first-audio of the two architectures. It’s the right default when latency and naturalness matter more than picking your own voice vendor.

4. Tools make it an agent, not an answering machine

Agents act through tool calls — functions you define with a JSON schema and a webhook. Mid-call, the model can check an order status, book an appointment, look up a caller by their number, or hand off to a human with a warm transfer and a context summary.

tool.json
{
  "name": "book_appointment",
  "description": "Book a slot in the clinic calendar",
  "parameters": {
    "type": "object",
    "properties": {
      "patient_name": { "type": "string" },
      "datetime":     { "type": "string", "format": "date-time" }
    },
    "required": ["patient_name", "datetime"]
  },
  "webhook": "https://api.yourapp.com/rayvoc/book"
}
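On your side of the webhook, the handler might look something like the sketch below. It assumes the tool arguments arrive as a JSON object matching the `parameters` schema above. The request and response shapes are assumptions for illustration, as is the function name `handle_book_appointment`, so check the platform's documentation for the actual contract.

```python
from datetime import datetime
from typing import Any, Dict

REQUIRED = ("patient_name", "datetime")


def handle_book_appointment(args: Dict[str, Any]) -> Dict[str, Any]:
    """Validate tool arguments against the schema and return a result dict."""
    missing = [key for key in REQUIRED if key not in args]
    if missing:
        return {"ok": False, "error": f"missing fields: {', '.join(missing)}"}
    if not all(isinstance(args[key], str) for key in REQUIRED):
        return {"ok": False, "error": "patient_name and datetime must be strings"}
    try:
        # fromisoformat accepts a trailing "Z" only on Python 3.11+, so normalise it.
        slot = datetime.fromisoformat(args["datetime"].replace("Z", "+00:00"))
    except ValueError:
        return {"ok": False, "error": "datetime is not a valid date-time"}
    # A real handler would write to the calendar here; this sketch only echoes.
    return {"ok": True, "patient_name": args["patient_name"], "slot": slot.isoformat()}
```

Returning a structured error instead of raising gives the model something it can act on mid-call, such as asking the caller to repeat the date.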

5. Everything is observable

Every call produces a transcript, a recording (where consented; see the trust center), tool-call logs, and a per-stage latency waterfall. You see exactly how long recognition, the model, and synthesis took on every turn, which is how you keep a production agent honest. For more on this, see why latency matters.
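A per-stage waterfall for one turn can be sketched as follows. The span format, the stage names, and the timings are all invented for illustration and do not reflect Rayvoc's log schema. Because the stages stream and overlap, the turn total is the span from first start to last end, not the sum of the stage durations.

```python
from typing import Dict, List, Tuple

# (stage, start_ms, end_ms), measured from the end of the caller's speech.
Span = Tuple[str, int, int]


def waterfall(spans: List[Span]) -> Dict[str, int]:
    """Per-stage durations plus the wall-clock total for the turn."""
    durations = {stage: end - start for stage, start, end in spans}
    first_start = min(start for _, start, _ in spans)
    last_end = max(end for _, _, end in spans)
    durations["turn_total"] = last_end - first_start
    return durations


def render(spans: List[Span], ms_per_char: int = 20) -> str:
    """Draw one text bar per stage, offset by its start time."""
    lines = []
    for stage, start, end in spans:
        pad = " " * (start // ms_per_char)
        bar = "#" * max(1, (end - start) // ms_per_char)
        lines.append(f"{stage:<4}{pad}{bar} {end - start} ms")
    return "\n".join(lines)


# Illustrative numbers only: the LLM starts before STT finishes, TTS before the LLM does.
example_turn = [("stt", 0, 120), ("llm", 90, 400), ("tts", 300, 520)]
```

For `example_turn`, the stage durations add up to 650 ms while the turn total is 520 ms. That gap is the time saved by overlapping stages.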

Frequently asked questions

Pipeline or speech-to-speech — which should I use?

Use a pipeline (STT → LLM → TTS) when you need maximum control: a specific LLM, a specific voice vendor, custom vocabulary in recognition. Use a speech-to-speech model (like Grok voice) when you want the lowest latency and the most natural turn-taking. Rayvoc supports both on the same phone number — you can switch with a config change.

How do agents take real actions during a call?

Through tool calling. You define functions with JSON schemas and point them at your webhooks or APIs. When the model decides to call one — to check an order, book a slot, or transfer the call — Rayvoc invokes your endpoint and feeds the result back into the conversation, all mid-call.

What happens when a call needs a human?

Agents can warm-transfer to any phone number or SIP destination, with a whispered context summary to the human before connecting the caller. Escalation rules can trigger on caller request, sentiment, or your own tool logic.
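The three triggers named above could be combined along these lines. This is a sketch under assumptions, not the platform's rule engine: the `TurnSignals` fields, the sentiment scale, the phrase list, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrases that count as an explicit request for a person.
HANDOFF_PHRASES = ("human", "real person", "representative")


@dataclass
class TurnSignals:
    caller_text: str
    sentiment: float              # assumed score in [-1.0, 1.0]; lower is more negative
    tool_requested: bool = False  # set when one of your tools asks for a handoff


def should_escalate(sig: TurnSignals, sentiment_floor: float = -0.6) -> Optional[str]:
    """Return the reason for a warm transfer, or None to keep the agent on the call."""
    text = sig.caller_text.lower()
    if any(phrase in text for phrase in HANDOFF_PHRASES):
        return "caller_request"
    if sig.tool_requested:
        return "tool_logic"
    if sig.sentiment <= sentiment_floor:
        return "sentiment"
    return None
```

Returning the reason rather than a bare boolean lets it be included in the context summary the human hears before the caller is connected.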

Can I run inbound and outbound on the same agent?

Yes. An agent is a definition; you attach it to numbers for inbound and to campaigns for outbound. The same instructions, tools, and knowledge serve both directions.

See it answer your own number

Every account starts with a 14-day free trial — 1 concurrent channel, a real phone number, and full platform access.