AI Call Center Orchestration: The Complete Engineering & Cost Guide (2026)
| Metric | Result |
|---|---|
| Concurrent Calls | Unlimited (with proper infrastructure) |
| Response Latency | 300ms – 700ms (human-perceived as natural) |
| First Contact Resolution | 85–92% for routine queries |
| Cost vs. Human Agent | 70–90% reduction in per-call cost at scale |
| 24/7 Availability | Always on, no shift premiums |
There's a dirty secret the Instagram ads selling you "AI call centers in five minutes" won't tell you: the moment you hit 200 concurrent calls, the SaaS dream starts falling apart.
Latency spikes. Vendor lock-in bites. Your HIPAA compliance officer starts asking uncomfortable questions. And your cost-per-minute — which looked so clean on the pricing page — suddenly has four hidden line items you didn't account for.
This guide is for the builders, the architects, and the business owners who have moved past the demo phase and are asking the real questions:
- What does this actually cost at scale?
- When do I need real engineers instead of no-code tools?
- How does SIP trunking actually work with AI?
- What happens when a patient asks to speak to a human?
We'll cover all of it — the open-source frameworks, the SaaS platforms, the infrastructure economics, the telephony plumbing, and what we learned building a voice AI system that replaced a 20-person call centre for a UK medical clinic.
Part 1: The Architecture Stack — What You're Actually Building
Before you compare platforms or prices, you need to understand that an AI call centre is not a single product. It is a pipeline of five distinct components, each of which can be swapped, self-hosted, or rented as a cloud service.
Phone Call → [Telephony] → [STT] → [LLM] → [TTS] → [Telephony] → Caller
                 ↕                    ↕
            [SIP Trunk]        [CRM / EHR / DB]
The Five Components
- Telephony Layer (VoIP/SIP) — The plumbing that turns a phone call into a digital audio stream your software can process. Twilio, Telnyx, SignalWire, VoIP Studio.
- Speech-to-Text (STT) — Converts the caller's voice into text in real time. Deepgram Nova-3 is the current benchmark. Alternatives: OpenAI Whisper, AssemblyAI, Google STT.
- Large Language Model (LLM) — The brain that reads the transcription, understands intent, applies your business rules, and generates a response. GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, or self-hosted Llama 3.
- Text-to-Speech (TTS) — Converts the LLM's response into natural-sounding audio. ElevenLabs, Cartesia, OpenAI TTS, or open-source options like Kokoro/Piper.
- Orchestration Layer — The glue that connects all of the above in real time, manages turn detection (knowing when a caller stops talking), handles interruptions, and executes business logic like call transfers and CRM lookups.
This last component — orchestration — is where all the interesting (and difficult) engineering lives.
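To make the division of labour concrete, here is a deliberately over-simplified sketch of the pipeline as a single processing loop. Every function is a stub standing in for a real provider; nothing here is a real SDK call:

```python
def stt(audio_chunk: bytes) -> str:
    """Stub: a real system streams this audio to Deepgram/Whisper."""
    return "I'd like to book an appointment"

def llm(transcript: str, history: list) -> str:
    """Stub: a real system calls GPT-4o/Claude with your business prompt."""
    history.append({"role": "user", "content": transcript})
    reply = "Sure, what day works best for you?"
    history.append({"role": "assistant", "content": reply})
    return reply

def tts(text: str) -> bytes:
    """Stub: a real system streams this text to ElevenLabs/Cartesia."""
    return text.encode("utf-8")  # pretend this is synthesised audio

def handle_turn(audio_in: bytes, history: list) -> bytes:
    """One conversational turn. The orchestration layer's real job is
    running this loop with streaming, interruptions, and latency budgets."""
    return tts(llm(stt(audio_in), history))

history: list = []
audio_out = handle_turn(b"\x00" * 160, history)  # 20 ms of 8 kHz silence
```

The point of the sketch: each stage is independently swappable, which is exactly why the comparison tables in the rest of this guide treat STT, LLM, TTS, and telephony as separate line items.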
Part 2: Open-Source Frameworks — Building the Wheel Yourself
If you're reading this with a technical team and a real deployment in mind, these are the frameworks that the serious builders use. They give you complete control over every component, data sovereignty, and zero per-minute vendor tax. The trade-off? You need engineers who know what they're doing.
LiveKit Agents (GitHub: livekit/agents)
The most production-mature open-source framework for real-time voice AI.
LiveKit Agents is built on top of LiveKit's battle-tested WebRTC/SIP infrastructure — the same stack that powers real-time communications for some of the largest applications on the web. The agents framework (Apache 2.0 licence) gives you:
- Pluggable STT/LLM/TTS providers: Deepgram, OpenAI, ElevenLabs, Azure, Cartesia — swap them in via simple plugin configuration.
- Semantic turn detection: Not a primitive silence detector. LiveKit's turn detection uses a small model that understands conversational context, reducing interruption errors significantly.
- WebRTC + SIP support: Handles inbound calls from a SIP trunk (Twilio, Telnyx) or WebRTC web/mobile clients in the same framework.
- Job scheduling: Built-in worker queue so calls are dispatched to available agent processes without you building a scheduler.
- Python and Node.js: Full support for both languages.
When to use it: When you want full infrastructure control, your call volumes are predictable (or highly variable and you want to manage your own autoscaling), and you have a Python/Node.js engineering team. LiveKit Cloud exists if you want hosted signalling without self-hosting everything.
What you're still responsible for: Telephony SIP trunk configuration, LLM prompt engineering, TTS voice procurement, CRM integrations, and all deployment/DevOps.
Pipecat (GitHub: pipecat-ai/pipecat)
The "Swiss Army Knife" for real-time conversational AI pipelines.
Pipecat, built by Daily.co, takes a composable pipeline approach. You assemble a directed graph of processors — a Deepgram STT processor feeds into a LangChain LLM processor feeds into an ElevenLabs TTS processor — and Pipecat handles the buffering, chunking, and real-time streaming between them.
What makes Pipecat special:
- Frame-based architecture: Audio frames, text frames, function call frames — everything is typed. This makes debugging pipeline issues dramatically easier.
- Interruption handling: Built-in support for detecting when a caller interrupts the AI mid-sentence, cutting the TTS output immediately and re-routing to STT.
- Pipecat Flows: A companion library for defining structured dialogue state machines — useful when you need the agent to follow a specific script with branching logic (e.g., appointment booking flows where step 2 only happens if step 1 is confirmed).
- Broad provider support: 40+ integrated providers for STT, LLM, TTS, and transport layers.
When to use it: When your conversation flows have complex conditional branching (medical intake forms, insurance verification, multi-step qualification scripts) and you need fine control over how each processing step behaves.
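The branching idea behind Flows can be illustrated with a plain state table. To be clear, this is not Pipecat's actual API, just the pattern it formalises: each state names its allowed transitions, and step 2 is unreachable until step 1 is confirmed:

```python
# Generic dialogue state machine for an appointment-booking flow.
# State and outcome names are illustrative, not Pipecat identifiers.
FLOW = {
    "collect_name":    {"confirmed": "collect_date",    "retry": "collect_name"},
    "collect_date":    {"confirmed": "confirm_booking", "retry": "collect_date"},
    "confirm_booking": {"confirmed": "done",            "retry": "collect_date"},
}

def next_state(current: str, outcome: str) -> str:
    """Advance the dialogue; a failed turn loops back instead of skipping ahead."""
    if current == "done":
        return "done"
    return FLOW[current][outcome]

state = "collect_name"
state = next_state(state, "confirmed")  # name captured -> ask for a date
state = next_state(state, "retry")      # caller misheard -> stay on the date step
state = next_state(state, "confirmed")  # date captured -> confirmation step
```

The value of making this explicit (rather than trusting the LLM to "remember" the script) is that the agent physically cannot skip a required step.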
Bolna (GitHub: bolna-ai/bolna)
Purpose-built for front-desk automation.
Bolna is a narrower, more opinionated open-source framework specifically targeting the use case of automating front-desk phone operations — appointment booking, lead qualification, basic customer service. It ships with pre-built integrations for Twilio, Plivo, Deepgram, and a selection of TTS providers, plus a dashboard for managing agents without writing code for every variable.
Bolna is an interesting middle ground: more accessible than LiveKit Agents for smaller teams, more customisable than a full SaaS wrapper.
When to use it: Small-to-medium practices, real estate offices, dental chains — scenarios where you want open-source flexibility but the specific use case is well-defined enough that Bolna's opinionated structure works in your favour.
Deepgram Voice Agent API
Not a framework — but worth knowing as a primitives provider.
Deepgram's Voice Agent API is a single WebSocket connection that handles STT, LLM (via OpenAI or Anthropic integration), and TTS in one managed service. You send audio in, you get audio back. For prototyping or low-volume deployments, this removes enormous complexity.
The trade-off is loss of control: you can't swap in a custom STT model, you can't change how turn detection works, and you're paying API pricing at all volumes. But for a rapid proof-of-concept or a low-stakes internal tool, it's an excellent starting point.
Comparison Table: Open-Source Frameworks
| Framework | Language | Telephony | Complexity | Best For |
|---|---|---|---|---|
| LiveKit Agents | Python / JS | SIP + WebRTC | High | Production-scale, full control |
| Pipecat | Python | WebRTC (Daily) | Medium-High | Complex conversation flows |
| Bolna | Python | Twilio / Plivo | Medium | Front-desk automation, SMBs |
| Deepgram Voice Agent API | Any (WebSocket) | N/A (no SIP) | Low | Rapid prototyping |
| Ultravox (Fixie.ai) | Python | SIP | Medium | Low-latency, open-weight LLMs |
Part 3: SaaS Platforms — Not Recreating the Wheel (Yet)
Here's the thing nobody says out loud: for most businesses under 50,000 minutes/month, a well-configured SaaS platform is not the lazy option — it's the correct option. The engineering time to build and maintain a custom stack costs more than the per-minute premium, and your energy is better spent on the prompt engineering, CRM integrations, and business logic that actually differentiates your product.
The inflection point is typically ~500 hours (30,000 minutes) of voice per month. Below that, SaaS wins. Above that, the economics start favouring custom infrastructure, especially for regulated industries.
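You can sanity-check that inflection point with simple arithmetic. The rates below are illustrative: a mid-range SaaS all-in rate against a custom stack with heavy fixed costs. With a leaner fixed-cost base the crossover drops toward the 30,000-minute figure, and regulated industries shift it earlier still:

```python
SAAS_RATE = 0.15        # $/min, illustrative mid-range all-in SaaS rate
CUSTOM_MARGINAL = 0.03  # $/min, illustrative self-hosted marginal cost
CUSTOM_FIXED = 13_000.0 # $/month, illustrative infra + engineering time

def monthly_cost_saas(minutes: float) -> float:
    """Pure usage pricing: no fixed cost, every minute billed."""
    return minutes * SAAS_RATE

def monthly_cost_custom(minutes: float) -> float:
    """Self-hosted: heavy fixed cost, low marginal cost per minute."""
    return CUSTOM_FIXED + minutes * CUSTOM_MARGINAL

# The lines cross where the fixed cost equals the per-minute saving:
crossover_minutes = CUSTOM_FIXED / (SAAS_RATE - CUSTOM_MARGINAL)  # ~108,333
```

The structure of the answer matters more than the exact numbers: below the crossover you are paying for idle capacity, above it you are paying a vendor tax on every minute.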
Let's break down the real platforms.
Vapi AI
The most developer-friendly SaaS orchestration layer.
Vapi is not a phone company. It's a voice AI orchestration platform — a managed version of the "glue layer" between your telephony, STT, LLM, and TTS. You bring your own OpenAI/Anthropic/ElevenLabs API keys, and Vapi handles the real-time pipeline management.
Pricing (2026):
- Platform fee: $0.05/min (base orchestration only)
- +LLM: GPT-4o adds ~$0.05/min; Claude 3.5 Sonnet adds ~$0.04–0.06/min
- +TTS: ElevenLabs adds ~$0.05–0.07/min; OpenAI TTS adds ~$0.015/min
- +STT: Deepgram Nova-2 adds ~$0.0075/min
- +Telephony: Twilio via Vapi adds ~$0.01/min; BYOT (Bring Your Own Twilio) is free
- HIPAA compliance: $1,000/month add-on
- Effective all-in rate: ~$0.18–$0.33/minute with premium stack
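Since these line items bill independently and stack linearly, the all-in rate is just a sum. A quick sanity check using the per-minute figures from the list above (premium picks: GPT-4o, ElevenLabs, Deepgram, Twilio via Vapi):

```python
# Per-minute line items from Vapi's pricing list above (premium stack)
premium_stack = {
    "vapi_platform":  0.05,
    "llm_gpt4o":      0.05,
    "tts_elevenlabs": 0.07,
    "stt_deepgram":   0.0075,
    "telephony":      0.01,
}
per_minute = sum(premium_stack.values())  # 0.1875, the low end of ~$0.18-0.33
monthly_at_10k_min = per_minute * 10_000  # ~$1,875 before the HIPAA add-on
```

Note which items are missing: the $1,000/month HIPAA fee and phone number rental are flat charges, which is why quoted "per-minute" rates understate low-volume costs.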
Who it's for: Development teams that want to build a custom experience (system prompts, tool calls, webhooks) without managing infrastructure. Vapi has excellent documentation and a generous free tier ($10 credit).
The real limitation: Shared infrastructure. At scale, every customer is competing for the same processing resources. Latency is typically 400–800ms in good conditions; during peak times it can creep higher.
Retell AI
The strongest feature set for complex call flows.
Retell is built for businesses that need more than simple Q&A — think multi-step qualification scripts, dynamic data injection mid-call, and sophisticated call routing logic.
Pricing (2026):
- Base rate: $0.07+/min (no platform fee, included in per-minute)
- LLM: GPT-4o mini $0.006/min; Claude 3.5 Sonnet $0.02–0.06/min; GPT-4o $0.05/min
- TTS: Standard voices $0.03–0.05/min; ElevenLabs $0.07/min
- Telephony (Retell's Twilio): $0.01/min; BYOT: free
- Knowledge base: Free for first 10; $0.005/min or $8/month thereafter
- Phone numbers: $2/month each
- Effective all-in rate: ~$0.13–$0.31/minute depending on stack
- Enterprise: Custom pricing, as low as $0.05/min at volume
Standout features: Real-time knowledge base retrieval during calls, custom LLM endpoints (bring your own fine-tuned model), post-call analytics with sentiment analysis, and robust webhook support for CRM integration.
Bland AI
The outbound calling specialist.
Bland AI has carved out a niche in high-volume outbound calling — lead qualification, appointment confirmation campaigns, debt collection, survey calls. It has the most mature outbound tooling of any platform in this list.
Pricing (2026, post-December 2025 restructure):
- Free: $0.14/min connected; $0.05/min transfer time
- Build ($299/month): $0.12/min connected; $0.04/min transfer
- Scale ($499/month): $0.11/min connected; $0.03/min transfer
- Failed/short calls (<10 sec): $0.015 per attempt
- SMS: $0.02/message
- A realistic 1,000 calls × 3 min on Scale: $830+ before add-ons
The catch: Bland's pricing looks attractive on a per-minute basis, but the add-ons can surprise you. Voice cloning, custom LLM endpoints, and HIPAA compliance are all tiered extras.
Vogent AI
The in-house LLM play — lowest effective cost at volume.
Vogent is the most interesting pricing story of 2026. Unlike Vapi and Retell (which pass through third-party LLM costs), Vogent runs its own in-house language models trained specifically for voice AI tasks. This means no GPT API surcharges.
Pricing (2026):
- Base pay-as-you-go: $0.09/min
- +Premium voices: $0.051/min additional
- Enterprise (10M+ minutes): As low as $0.06/min with up to 50% volume discount
- No subscription required — truly usage-based
Features: IVR navigation (the AI can press dial-pad keys to navigate legacy phone trees), human handoff, post-call automation, knowledge base integration, HIPAA compliant out of the box.
The trade-off: Vogent's in-house models are excellent for structured tasks (qualification, scheduling, data capture) but less capable than GPT-4o or Claude 3.5 Sonnet for nuanced, unscripted conversations. For highly regulated or emotionally complex calls, you pay more on other platforms for better reasoning.
ElevenLabs Conversational AI
The best voices in the industry — in a platform.
ElevenLabs started as a TTS provider and is rapidly expanding into full conversational AI. Their platform is built on top of their commercial-grade voice synthesis (the same voices that power thousands of other applications), giving them a genuine moat on voice quality and emotional expressiveness.
Pricing (2026, credit-based):
- Creator ($22/month): ~250 minutes of conversational AI
- Pro ($99/month): ~1,100 minutes; overage ~$0.09/min
- Scale ($330/month): ~3,600 minutes; overage $0.096/min
- Business ($1,320/month): ~13,750 minutes; low-latency TTS as low as $0.05/min
- HIPAA compliance: Custom enterprise pricing
- SIP trunking: Yes — connect your existing PBX directly to ElevenLabs
Where ElevenLabs wins: Any use case where voice quality is a brand differentiator. Luxury goods, premium healthcare, high-end financial services — anywhere the sound of the AI matters as much as what it says. Their emotional expressiveness is unmatched.
Where it struggles: The credit system obscures real per-minute costs. When you factor in LLM costs (not included in base pricing), the true all-in rate can exceed $0.20/min even on the Scale plan.
Platform Comparison Table
| Platform | Base Rate | All-In Estimate | HIPAA | SIP | Best For |
|---|---|---|---|---|---|
| Vapi | $0.05/min + pass-through | $0.18–$0.33/min | $1k/mo add-on | ✅ | Dev teams, custom builds |
| Retell | $0.07/min + pass-through | $0.13–$0.31/min | Enterprise | ✅ | Complex flows, analytics |
| Bland AI | $0.11–$0.14/min | $0.14–$0.25/min | Extra | ✅ | Outbound campaigns |
| Vogent | $0.09/min | $0.09–$0.14/min | ✅ Included | ✅ | Cost-efficient, structured tasks |
| ElevenLabs | Credit-based | $0.10–$0.20+/min | Enterprise | ✅ | Premium voice quality |
| Twilio (native AI) | $0.10/min AI + telco | $0.11–$0.15/min | ✅ | ✅ Native | Existing Twilio infrastructure |
Part 4: The Telephony Layer — SIP Trunking Explained (Without the Jargon)
This is the section most AI influencers skip because it's genuinely technical and unglamorous. But if you're serious about running AI-powered phone infrastructure, you must understand what SIP trunking is and how it fits into the AI pipeline.
What Is SIP Trunking?
SIP (Session Initiation Protocol) is the internet protocol that manages the signalling for phone calls — establishing, maintaining, and terminating them. A SIP trunk is essentially a virtual phone line that uses your internet connection instead of a physical copper wire.
When a customer calls your AI call centre:
- The call arrives at your SIP trunk provider (Twilio, Telnyx, SignalWire, VoIP Studio)
- The SIP provider converts the PSTN (public phone network) call into a SIP session
- The SIP session routes to your media server (or the SaaS platform's media server)
- RTP (Real-time Transport Protocol) carries the actual voice audio as packets
- Your AI platform receives PCM audio frames via WebSocket or a direct RTP stream
- The AI processes, responds, and pumps audio back the same way
The critical technical nuance: AI voice systems run on WebSocket streaming architectures. Traditional telephony runs on SIP/RTP. The "SIP-to-AI connector" or gateway is what bridges these two incompatible protocol worlds. Most SaaS platforms handle this for you transparently. When you build custom on LiveKit or Pipecat, you're configuring this bridge yourself.
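To get a feel for what the bridge is repacketising: telephony commonly delivers 8 kHz μ-law audio in 20 ms RTP packets, while AI pipelines typically want 16 kHz 16-bit PCM frames over WebSocket. The frame sizes are simple arithmetic (these formats are typical defaults, not universal):

```python
def frame_bytes(sample_rate_hz: int, bytes_per_sample: int, frame_ms: int) -> int:
    """Bytes in one audio frame the SIP-to-AI bridge must repacketise."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000

rtp_frame = frame_bytes(8_000, 1, 20)    # 160 bytes: 20 ms of 8 kHz mu-law
pcm_frame = frame_bytes(16_000, 2, 20)   # 640 bytes: 20 ms of 16 kHz PCM16
frames_per_second = 1000 // 20           # 50 packets each way, every second
```

Fifty small packets per second in each direction is why jitter and packet loss on your internet link show up directly as audio glitches, and why latency budgets are counted in tens of milliseconds per stage.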
Key SIP Trunk Providers & Costs
Twilio Elastic SIP Trunking (2026)
- Channels: $1.00/channel/month (month-to-month) or $0.50/channel/month (annual)
- Inbound calls (US): $0.0045/min (local number) / $0.0055/min (toll-free)
- Outbound calls (US): $0.003–$0.013/min
- Phone numbers: ~$1.00/month (local US); $2.00–10.00+ (international/toll-free)
- Twilio AI native assistant: +$0.10/min on top of telephony
Telnyx
- Inbound: $0.002–$0.004/min
- Outbound: $0.002–$0.008/min
- Conversational AI add-on: $0.06/min
- Numbers: from $1.00/month
- Known for aggressive pricing vs Twilio and superior international coverage
SignalWire
- Per-minute rates competitive with Telnyx
- Built on FreeSWITCH open-source — favoured by builders who want low-level control
- STIR/SHAKEN compliant (spam call labelling protection for outbound campaigns)
VoIP Studio
- Business VoIP subscription with AI call centre integration
- Plans from ~$16/user/month for calling features
- Good fit for businesses already using VoIP Studio as their office phone system who want to bolt AI onto existing infrastructure
VoIP Studio + AI: The Pragmatic Path for SMBs
For small and medium businesses already running on a VoIP phone system, VoIP Studio presents an interesting integration pattern. Rather than rebuilding telephony from scratch:
- Your existing VoIP Studio numbers receive calls
- A SIP forwarding rule routes calls to your AI platform's SIP endpoint (Vapi, Retell, or your custom LiveKit server)
- When the AI needs to transfer to a human, it transfers back into the VoIP Studio queue
- Human agents handle the escalation on their existing desk phones or softphones
This "bolt-on" approach minimises disruption, preserves existing numbers and routing, and lets you add AI capacity incrementally.
Part 5: Call Transfers & Human Handoffs — The Architecture That Actually Matters in Production
Think back to every terrible automated phone experience you've ever had. The AI confidently misunderstands you three times in a row, you mutter some variation of "talk to a human," and then you're placed on hold for 12 minutes while the agent asks you every question the AI just asked.
This is a handoff failure. And it's entirely preventable with good engineering.
The Two Types of Call Transfer
Cold Transfer (Blind Transfer)
The AI sends a SIP REFER message to your telephony provider, which instructs the caller's endpoint to re-establish a connection to the target phone number (a human agent's extension or a queue). The AI session ends immediately.
Caller → [AI Agent] → SIP REFER → [Human Queue]
         (session ends)           (new session starts)
Pros: Fast, minimal latency, no risk of the briefing failing. Cons: The human agent gets no context. The caller has to re-explain everything.
Warm Transfer (Attended / Whisper Transfer)
A second SIP INVITE establishes a connection between the AI and the human agent before the caller is bridged to the human. The AI "whispers" a briefing to the agent (a text or audio summary), then bridges the caller into the call.
Caller → [AI Agent] → SIP INVITE → [Human Agent]
→ Whisper: "John Smith, appointment rescheduling, prefers morning slots"
→ Bridge caller to human
Pros: The human agent has full context before they say hello. Zero repeat information for the caller. Cons: Slightly more complex to implement. If the agent briefing fails (agent unavailable, busy), you need a fallback.
The correct production approach is always warm transfer, unless the transfer is happening at millisecond notice due to an emergency (a patient describing chest pain, for example, where you skip the whisper and connect to emergency personnel immediately).
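That policy ("warm unless emergency") reduces to a small decision function. A sketch, with illustrative trigger phrases and return values of our own choosing:

```python
# Illustrative emergency phrases; a real deployment would use a tuned classifier
EMERGENCY_TERMS = {"chest pain", "can't breathe", "bleeding", "fainted"}

def choose_transfer(utterance: str, agent_available: bool) -> str:
    """Cold transfer for emergencies, warm transfer otherwise, with a
    queue-plus-context fallback when no agent is free for the whisper step."""
    text = utterance.lower()
    if any(term in text for term in EMERGENCY_TERMS):
        return "cold_emergency"      # skip the whisper; connect immediately
    if not agent_available:
        return "queue_with_context"  # hold the caller, still send the summary
    return "warm_transfer"           # whisper briefing, then bridge the caller
```

The fallback branch is the part most implementations forget: a warm transfer with no available agent must degrade gracefully, not dead-end the call.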
What Context Should Transfer With the Call
The technical implementation (SIP REFER/INVITE) only moves the audio stream. The intelligence of the handoff — the context — must travel separately. In production systems, this means:
- SIP headers: Pack structured data into custom X-Context-* SIP headers on the REFER/INVITE. Most enterprise PBX systems can read these and display them on the agent's screen.
- Webhook call summary: Simultaneously POST a JSON summary to your CRM webhook endpoint at the moment of transfer, so the agent's screen pops with the full call summary.
- Screen pop integration: For platforms like Salesforce, Zendesk, or HubSpot, use their CTI (Computer Telephony Integration) SDKs to display the AI's conversation summary, caller history, and reason-for-transfer before the agent says a word.
{
"caller_id": "+14155552671",
"caller_name": "Sarah Mitchell",
"crm_contact_id": "CON-81923",
"summary": "Calling to reschedule appointment #A-2847 from 14 April to any morning slot in the week of 21 April. Confirmed insurance details are current.",
"transfer_reason": "Caller requested human agent for rescheduling",
"sentiment": "neutral",
"duration_ai_call": "2m 14s"
}
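Packing a summary like this into X-Context-* headers is a simple flattening step. A sketch (the header naming scheme is our convention; check what your PBX actually parses and displays):

```python
def to_sip_headers(summary: dict, max_len: int = 128) -> dict:
    """Flatten a call summary into X-Context-* custom SIP headers.
    Values are truncated: SIP header values should stay short and simple."""
    headers = {}
    for key, value in summary.items():
        name = "X-Context-" + key.replace("_", "-").title()
        headers[name] = str(value)[:max_len]
    return headers

headers = to_sip_headers({
    "caller_name": "Sarah Mitchell",
    "transfer_reason": "Caller requested human agent",
})
```

Long free-text fields like the conversation summary belong in the webhook payload, not in headers, which is why the two mechanisms are used together.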
AI-to-AI Agent Transfers
This is the scenario that becomes critical as you move toward multi-agent call centre architectures — a pattern ValueStreamAI uses in enterprise deployments.
Imagine the call starts with a triage agent that determines intent, then routes to one of three specialist agents:
- A scheduling agent (appointment booking, rescheduling, cancellations)
- A clinical information agent (prescription refills, test results, HIPAA-compliant data retrieval from the EHR)
- A billing agent (invoice queries, payment plans, insurance verification)
Each agent has a different system prompt, different tool access, different LLM settings optimised for their task. The routing between them is not a phone call transfer — it's an internal message passing system where the conversation context, collected data, and session state are serialised and passed to the next agent process.
In a LiveKit Agents deployment, the routing logic looks roughly like this (method and type names are simplified for illustration):
# Simplified example: routing from triage to scheduling agent
async def route_to_specialist(ctx: JobContext, intent: str, context: ConversationContext):
if intent == "scheduling":
await ctx.dispatch_job(
agent_name="scheduling_agent",
payload=context.to_dict()
)
elif intent == "billing":
await ctx.dispatch_job(
agent_name="billing_agent",
payload=context.to_dict()
)
The caller experiences a brief "Let me transfer you to our scheduling team" message, then the new specialist agent picks up the call seamlessly — same audio session, new AI brain, full context from the previous conversation.
Part 6: The Real Cost Breakdown — Building vs. Buying at Every Stage
This is the section that will save you either a lot of money or a lot of regret, depending on which direction you're going.
Stage 1: Idea to Proof of Concept (0–1,000 monthly minutes)
The Verdict: Use a SaaS Platform. Don't Even Think About Custom.
At this stage, you're validating whether the use case works, whether the LLM handles your edge cases, and whether your business logic is correctly specified. You are not optimising for cost.
Recommended stack:
- Vapi or Retell for orchestration (free tier or pay-as-you-go)
- GPT-4o mini or Claude 3.5 Haiku as the LLM (lower cost, sufficient for most tasks)
- Standard OpenAI TTS or Cartesia for voices
- Twilio for telephony
Estimated monthly cost at 1,000 minutes:
| Component | Rate | Cost |
|---|---|---|
| Orchestration (Vapi) | $0.05/min | $50 |
| LLM (GPT-4o mini via Vapi) | $0.006/min | $6 |
| TTS (OpenAI TTS) | $0.015/min | $15 |
| STT (Deepgram) | $0.0075/min | $7.50 |
| Telephony (Twilio inbound) | $0.0045/min | $4.50 |
| Total | ~$0.083/min | ~$83/month |
At this volume, you have no server costs, no DevOps, no GPU bills. Your only expense is learning. This is the right call.
Stage 2: Growth to Operational (1,000–30,000 monthly minutes)
The Verdict: Stay on SaaS, but Negotiate Enterprise Rates.
At 30,000 minutes/month, you're spending $2,500–$10,000/month on-platform depending on your stack. This starts to feel significant, but you're still nowhere near the break-even point for custom infrastructure.
Key decisions at this stage:
- Move to Retell Enterprise or Bland Scale for better per-minute rates
- Consider Vogent if your use cases are structured (appointment booking, outbound qualification)
- Start building your custom system prompts, tool integrations, and post-call analytics
Estimated monthly cost at 30,000 minutes (Retell Enterprise):
| Component | Rate | Cost |
|---|---|---|
| Full stack (Retell enterprise) | ~$0.10/min | $3,000 |
| Twilio phone numbers (20) | $2/number | $40 |
| CRM webhook server (single EC2) | Flat | $50 |
| Total | ~$0.103/min | ~$3,090/month |
The numbers still make sense. Save your engineering bandwidth for the integrations that actually create business value.
Stage 3: Production Scale (30,000–500,000 monthly minutes)
The Verdict: Hybrid Architecture — SaaS orchestration, self-hosted LLM.
This is where the economics get genuinely interesting. You're spending $30,000–$100,000/month on API costs, and the LLM component (GPT-4o or Claude) is usually 30–40% of your total bill.
The smart move at this stage:
- Self-host an open-weight LLM (Llama 3 70B or Mistral-Large fine-tuned on your domain)
- Keep the orchestration layer on a managed platform (LiveKit Cloud, Pipecat hosted)
- Keep TTS with ElevenLabs or Cartesia (still hard to beat commercially at quality)
Self-hosted LLM infrastructure cost:
| Component | Spec | Monthly Cost |
|---|---|---|
| GPU server (A100 80GB × 4) | AWS p4d / Lambda Labs | $3,500–$7,000 |
| Inference throughput | 500+ req/sec for voice | Managed by above |
| Engineering (1 MLOps engineer) | Ongoing maintenance | $8,000–$15,000 |
| LLM self-hosted total | | ~$11,500–$22,000/month |
Compare this to 300,000 minutes × $0.05/min (GPT-4o via Vapi) = $15,000/month in LLM costs alone.
The break-even on self-hosted LLM for voice AI is approximately 200,000–300,000 minutes per month. Below that: pay the API tax. Above that: bring it in-house.
Stage 4: Enterprise Scale (500,000+ monthly minutes)
The Verdict: Full custom stack. Own every component.
At this scale, you're a serious voice AI business. You have:
- A dedicated LiveKit Agents deployment on Kubernetes
- Self-hosted LLM cluster (4×H100 GPU minimum, costing ~$160,000 CapEx + $10,000/month OpEx)
- Fine-tuned TTS model on your own voices (Kokoro or Matcha-TTS) — eliminating per-minute TTS costs entirely
- Self-hosted Deepgram STT via their on-premise licence
- Direct interconnects with SIP trunk providers (bypassing Twilio markup)
18-month TCO comparison (estimated at 1M minutes/month):
| Approach | Infrastructure | Inference | Engineering | Total 18mo |
|---|---|---|---|---|
| Cloud APIs (SaaS) | $420,000 | $380,000 | $60,000 | $860,000 |
| Self-hosted custom stack | $180,000 (CapEx) | $45,000 | $120,000 | $345,000 |
Savings: ~60% over 18 months. The payback period on the hardware investment starts at month 9.
Part 7: GPU Costs — The Honest Numbers
Nobody talks about this with enough specificity, so let's fix that.
Cloud GPU Pricing (2026)
| GPU | Provider | On-Demand Price/hr | Notes |
|---|---|---|---|
| NVIDIA A10 | AWS, Lambda | $0.60–$1.20/hr | Good for inference at medium scale |
| NVIDIA A100 40GB | Lambda Labs | $1.99/hr | Best value for voice AI inference |
| NVIDIA A100 80GB | AWS p4d | $3.50–$5.00/hr | Complex model inference |
| NVIDIA H100 80GB | CoreWeave | $2.50–$4.50/hr | State-of-the-art inference |
| 8×H100 cluster | Any hyperscaler | $20–$40/hr | Large model training/fine-tuning |
For voice AI specifically, you don't need H100s for inference. An A100 40GB can serve a quantised Llama 3 70B at sufficient throughput for 200–400 concurrent calls. The engineering cost of proper batching and request routing matters as much as the GPU itself.
When Self-Hosted GPU Beats Cloud API
The inflection point for LLM inference specifically is approximately 200,000–300,000 minutes per month, consistent with the Stage 3 break-even above. Below that, paying OpenAI or Anthropic per token is cheaper once you factor in idle GPU time. Above that, the maths shifts.
A100 x2 on Lambda Labs:
$1.99/hr × 2 GPUs × 730 hrs/month = $2,906/month
Handles ~1M minutes/month of Llama 3 70B inference (quantised)
Same volume on GPT-4o:
1,000,000 minutes × $0.05/min LLM cost = $50,000/month
The case for self-hosting at scale is overwhelming. The challenge is the engineering capability required to do it properly.
Part 8: Case Study — Building a Voice AI System for a Medical Clinic
This isn't hypothetical. This is a condensed version of what we built.
The Situation: A private clinic in the UK had 20 staff handling inbound patient calls covering: appointment booking, rescheduling, prescription refill requests, test result queries, and billing queries. Average call volume: 800–1,200 calls/day. Average handling time: 4.2 minutes. Staff cost: approximately £280,000/year in salaries. Missed call rate during peak hours: 18%.
The Brief: Replace 80% of inbound call volume with AI. Maintain UK GDPR compliance, with HIPAA-grade safeguards for clinical data. Integrate with their Semble EHR system. Enable clean handoff to human agents for sensitive or complex cases.
The Architecture We Built
Call → Twilio SIP → LiveKit Agents → Triage Bot
                                         |
                      ┌──────────────────┤
                      ↓                  ↓
             Scheduling Agent    Clinical Info Agent
             (Llama 3 70B)       (GPT-4o, restricted)
                      |
            ┌─────────┴──────────┐
            ↓                    ↓
      Semble EHR API        Human Agent
      (appointments)        (warm transfer)
Triage agent (always GPT-4o): Identifies intent within the first 30 seconds. Routes to scheduling, clinical, or human queue. Handles approximately 45% of calls end-to-end (appointment confirmations, cancellations, general FAQs).
Scheduling agent (Llama 3 70B, fine-tuned on clinic's scheduling rules): Integrates directly with Semble's API to read availability, book slots, send SMS confirmations. Handles 40% of total call volume.
Clinical info agent (GPT-4o, strict tool restrictions): Can retrieve test result statuses from Semble (read-only), check prescription refill eligibility, and route prescription requests to the pharmacy integration. Handles 10% of calls. All clinical queries trigger a structured audit log — every retrieval is recorded with patient ID, timestamp, and agent session ID for CQC compliance.
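The audit requirement is straightforward to satisfy if every EHR read passes through a single choke point that emits an append-only log line. A minimal sketch, with field names of our own choosing rather than anything CQC mandates:

```python
import json
from datetime import datetime, timezone

def audit_ehr_read(patient_id: str, resource: str, session_id: str) -> str:
    """One append-only JSON line per clinical data retrieval."""
    record = {
        "event": "ehr_read",
        "patient_id": patient_id,
        "resource": resource,              # e.g. "test_result_status"
        "agent_session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)

line = audit_ehr_read("P-1029", "test_result_status", "sess-8841")
```

Routing every tool call through a wrapper like this, rather than logging inside each agent, is what makes the audit trail complete by construction.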
Human agent escalation (remaining 5%): Warm transfer to the clinic's existing Zoom Phone system. The SIP REFER sends a webhook payload to the clinic's CRM (Cliniko), which screen-pops the agent's dashboard with the patient's record and a plain-English summary of the AI conversation.
The Handoff Triggers (Medical-Specific)
The AI is explicitly trained to escalate immediately — mid-sentence if necessary — in the following scenarios:
- Any mention of an emergency: "chest pain," "can't breathe," "bleeding," "fainted" → Immediate cold transfer to emergency line, no whisper delay.
- Explicit clinical advice request: "Should I take my medication if..." → Warm transfer with: "I'm connecting you with one of our clinical team who can properly advise you."
- Upset or distressed caller: Sentiment monitoring. If frustration score exceeds threshold for 3 consecutive turns → Warm transfer.
- Explicit human request: "Can I speak to a real person?" → Immediate warm transfer, no pushback.
- Identity verification failure: If the AI cannot confirm patient identity within 3 attempts → Transfer.
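Most of these triggers are stateless keyword or intent checks, but the sentiment trigger needs memory across turns. A sketch of the "three consecutive frustrated turns" rule, with an illustrative threshold and scores:

```python
class FrustrationMonitor:
    """Escalate after N consecutive turns above a frustration threshold."""

    def __init__(self, threshold: float = 0.7, consecutive: int = 3):
        self.threshold = threshold
        self.consecutive = consecutive
        self.streak = 0

    def observe(self, frustration_score: float) -> bool:
        """Returns True when a warm transfer should be triggered."""
        if frustration_score >= self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any calm turn resets the count
        return self.streak >= self.consecutive

monitor = FrustrationMonitor()
turns = [0.9, 0.8, 0.2, 0.9, 0.95, 0.85]
escalate = [monitor.observe(score) for score in turns]
# escalate[-1] is True: three consecutive frustrated turns after the reset
```

Requiring consecutive turns (rather than a single spike) prevents escalating on one misheard utterance while still catching genuinely deteriorating calls.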
The Results
| Metric | Before (Human) | After (AI + Human) | Change |
|---|---|---|---|
| Calls handled per day | 850 avg | 1,240 avg | +46% capacity |
| Missed call rate | 18% | 3% | -83% |
| Avg handling time (AI calls) | 4.2 min | 2.1 min | -50% |
| Staff headcount (calls) | 20 FTE | 6 FTE (escalations only) | -70% |
| Annual cost | £280,000 | £78,000 (6 staff + AI infra) | -72% |
| Patient satisfaction (CSAT) | 3.8/5 | 4.2/5 | +10.5% |
The CSAT improvement is worth noting. Patients specifically commented that the AI "always answered immediately" and "never put them on hold." The frustration with the old system wasn't the human relationship — it was the wait time and the missed calls.
Part 9: Niche Applications — Where AI Call Centers Excel
Real Estate
Use case: Inbound lead qualification, property inquiry handling, viewing appointment scheduling.
Why it works: Real estate leads are highly time-sensitive. A lead that calls at 11pm on a Sunday and gets a human-quality AI response converts at dramatically higher rates than a voicemail. The AI can qualify budget, desired area, property type, and timeline in 90 seconds, then book a viewing or escalate to an agent.
Key integration: CRM (HubSpot, Follow Up Boss) for lead capture and calendar booking.
Recommended platform at startup stage: Retell AI with a custom knowledge base of property listings.
Insurance
Use case: First notice of loss (FNOL) calls, policy enquiries, renewal campaigns.
Complexity consideration: Insurance calls often involve legal language, policy-specific details, and emotional stress (accidents, claims). The LLM must be tightly grounded in policy documents, and hallucination guardrails are non-negotiable. This is a use case where a frontier model (GPT-4o or Claude 3.5 Sonnet, not a smaller model) is mandatory.
Critical requirement: All conversations must be recorded, transcribed, and stored for regulatory compliance. Build this into your architecture from day one.
Legal (Intake Calls)
Use case: Initial client intake, case type qualification, conflict of interest screening.
Compliance complexity: Attorney-client privilege considerations mean audio recordings require careful handling. Many firms route AI intake calls through a separate phone number specifically to clarify the non-privileged nature of the initial AI interaction before privilege attaches.
What works well: Structured data collection (incident date, jurisdiction, basic facts) that a human intake coordinator reviews post-call.
Automotive Dealerships
Use case: Service appointment scheduling, parts availability enquiries, vehicle trade-in lead capture.
Why outbound works here: Service reminder campaigns — "Your vehicle is due for a service" — have extremely high answer rates because the calls are non-threatening and the value proposition is clear. Bland AI's outbound stack is well-suited to this use case.
E-Commerce & Retail
Use case: Order status, returns, delivery queries, product recommendations.
Integration depth: The AI needs real-time access to your OMS (Shopify, Magento, SAP Commerce) to answer "where is my order?" The deeper the integration, the higher the resolution rate without human escalation.
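The "where is my order?" flow is typically exposed to the AI as a tool function wrapping the OMS API. A minimal sketch, assuming a generic lookup callable — the real implementation would call Shopify, Magento, or SAP Commerce, and the response shape here is an assumption:

```python
def order_status_tool(order_id: str, oms_lookup) -> str:
    """Answer an order-status query via the OMS. `oms_lookup` is any
    callable returning a dict like {"status": ..., "eta": ...} or None;
    the field names are illustrative, not a specific OMS schema."""
    order = oms_lookup(order_id)
    if order is None:
        return "I couldn't find that order number — let me transfer you."
    if order["status"] == "shipped":
        return f"Your order has shipped and should arrive {order['eta']}."
    return f"Your order is currently {order['status']}."

# Stubbed OMS for illustration: a dict's .get acts as the lookup.
fake_oms = {"A100": {"status": "shipped", "eta": "Tuesday"}}.get
```

The key design choice is the graceful-failure branch: an unrecognised order number escalates rather than letting the AI guess, which is where hallucinated answers would otherwise creep in.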
Part 10: The "Build vs. Buy" Decision Framework
Here's the question in its simplest form:
Are you building a call centre feature, or are you building a call centre product?
If it's a feature — you want to add voice AI to an existing product or business — start with SaaS. Get it working. Learn what your users actually need. Iterate. The platform cost is the R&D tax you pay to avoid expensive guesses.
If it's a product — voice AI is your core value proposition, and you'll be selling it to others or running it at scale — you need custom infrastructure eventually. The question is not whether to build it, but when.
| Situation | Recommendation |
|---|---|
| <1,000 min/month, idea stage | Vapi or Retell, pay-as-you-go |
| 1,000–30,000 min/month, proving ROI | Retell Enterprise or Vogent |
| 30,000–200,000 min/month, scaling | Hybrid: SaaS orchestration + self-hosted LLM |
| 200,000+ min/month, core business | Full custom: LiveKit Agents + self-hosted stack |
| HIPAA/regulated, any scale | Ensure HIPAA BAA is signed; consider self-hosted from 10k+ min |
| Premium voice quality is brand-critical | ElevenLabs Conversational AI (any scale) |
| Outbound campaign focus | Bland AI Scale + BYOT Twilio |
The ValueStreamAI Voice AI Architecture
When we build for clients, we don't pick a single platform and live with its limitations. We architect modularly:
- Autonomy: AI agents that proactively make outbound calls for reminders, confirmations, and re-engagement campaigns — not just waiting for inbound calls.
- Tool Use: Deep CRM, EHR, and ERP integrations via MCP-standard connectors. The AI can book, reschedule, look up records, and trigger downstream workflows during the call.
- Planning: Multi-agent routing where a triage agent decides which specialist agent handles the call — not a single monolithic prompt trying to do everything.
- Memory: Persistent caller context across multiple calls via a vector store. A returning patient is greeted by name; the AI knows their booking history and preferences.
- Multi-Step Reasoning: Conditional escalation logic with sentiment monitoring, compliance guardrails, and human-in-the-loop checkpoints for high-stakes decisions.
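The triage-plus-specialists pattern can be sketched as a minimal intent router. The intent labels and specialist names here are illustrative assumptions; in production, the intent comes from an LLM classification turn, not a keyword table:

```python
def triage(intent: str) -> str:
    """Toy triage router: map a classified caller intent to a specialist
    agent. Unknown or ambiguous intents default to a human — the safe
    fallback for a multi-agent voice system."""
    routes = {
        "book_appointment": "booking",
        "reschedule": "booking",
        "test_results": "clinical_info",
        "prescription": "clinical_info",
    }
    return routes.get(intent, "human")
```

The point of the pattern is prompt isolation: each specialist agent carries only the instructions and tools for its own domain, so a booking agent physically cannot retrieve clinical records, and the triage layer is the single place routing policy lives.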
The Competitor Pulse Check: Open-Source vs. SaaS vs. Custom
| Factor | SaaS (Vapi/Retell) | Self-Hosted (LiveKit) | ValueStreamAI Custom |
|---|---|---|---|
| Time to deploy | Hours | Weeks–months | 6–12 weeks |
| Per-minute cost (scale) | $0.10–$0.33 | $0.02–$0.08 | $0.015–$0.06 |
| Data sovereignty | Vendor cloud | Your infra | Your infra / Private VPC |
| Customisation | Limited/Moderate | Full | Full |
| Human handoff quality | Basic (cold transfer) | Full warm transfer support | Full warm + context pass |
| HIPAA out of the box | Add-on ($$$) | DIY | Engineered in |
| Ongoing maintenance | None | High (DevOps team needed) | Managed by us |
| Ideal for | 0–30k min/month | Engineering teams | 10k+ min/month enterprises |
Project Scope & Pricing Tiers (ValueStreamAI Voice AI)
Here's what it costs to build properly:
- Voice AI Pilot (4–6 Weeks): $5,000 – $25,000
  - Ideal for: Single inbound use case (appointment booking, basic FAQ handling), SaaS orchestration stack, Twilio telephony, one CRM integration.
- Custom Voice Agent Ecosystem (8–14 Weeks): $25,000 – $75,000
  - Ideal for: Multi-agent routing (triage + 2–3 specialist agents), warm transfer implementation, full CRM/EHR integration, HIPAA compliance architecture, custom voice, post-call analytics.
- Enterprise Voice Infrastructure (14+ Weeks): $75,000+
  - Ideal for: 100k+ minutes/month operations, self-hosted LLM deployment, full data sovereignty, on-premise or private VPC, custom TTS fine-tuning, multi-site telephony architecture, MLOps pipeline for ongoing model improvement.
For a real-time cost model specific to your call volume, use our Interactive ROI Calculator.
Frequently Asked Questions
What is the difference between a SIP trunk and a VoIP system?
VoIP (Voice over Internet Protocol) is the broad category — making phone calls over the internet rather than traditional copper lines. SIP trunking is the specific protocol used to connect your VoIP system to the public telephone network (PSTN). When an AI call centre handles calls, it sits behind a SIP trunk that converts incoming PSTN calls to the digital streams the AI processes.
Can AI voice agents handle Scottish or regional accents accurately?
Yes — with proper STT configuration. Deepgram's Nova-3 model handles UK regional accents well out of the box. For strong regional accents (Glaswegian, Geordie, thick Irish), we recommend fine-tuning a Whisper model on accent-specific audio data. We've done this for healthcare clients in Scotland and achieved >95% transcription accuracy.
How does warm transfer work technically in a SIP environment?
During a warm transfer, the AI establishes a second SIP INVITE to the target agent while keeping the original caller's session active (three-way call bridge). The AI briefs the agent via a "whisper" audio stream heard only by the agent. Once confirmed, the AI bridges the caller into the existing session and drops out, leaving the caller and human agent in a direct two-party call.
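The sequence is easiest to see as a small state machine. This sketch only models the call states — the actual SIP signalling (INVITE, REFER, re-INVITE) is handled by the telephony layer and is not shown:

```python
from enum import Enum, auto

class TransferState(Enum):
    BRIDGED_AI = auto()      # caller talking to the AI agent
    AGENT_RINGING = auto()   # second INVITE out to the human agent
    WHISPER = auto()         # AI briefs the agent on a private audio leg
    BRIDGED_HUMAN = auto()   # caller bridged to agent; AI has dropped out

def warm_transfer_steps():
    """The warm-transfer progression described above, in order. A cold
    transfer would skip WHISPER and go straight to BRIDGED_HUMAN."""
    return [TransferState.BRIDGED_AI, TransferState.AGENT_RINGING,
            TransferState.WHISPER, TransferState.BRIDGED_HUMAN]
```

Modelling it explicitly matters for error handling: if the agent leg fails during AGENT_RINGING or WHISPER, the caller is still safely bridged to the AI and can be told what's happening instead of hearing dead air.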
Is it legally compliant to record AI call centre conversations?
In most jurisdictions, yes — with disclosure. In the UK, Ofcom guidelines require that callers are informed if a call is being recorded. In most US states, single-party consent applies (the business recording is one party). California, Illinois, and a handful of others require two-party consent. Your call greeting should always include a disclosure statement. For HIPAA-covered calls, recordings must be stored with appropriate encryption, access controls, and audit logging.
What happens when the AI doesn't understand a caller?
A properly engineered AI call centre has three escalation layers for comprehension failures: (1) a clarification request if intent is unclear, (2) a rephrased question with a different approach if the first clarification fails, and (3) a warm transfer to a human agent if uncertainty persists after two attempts. The system should never trap a caller in an endless clarification loop — this is the most common source of poor AI call centre experiences, and it is entirely avoidable.
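Those three layers are simple to enforce as a hard rule on the attempt counter — a minimal sketch:

```python
def handle_low_confidence(attempt: int) -> str:
    """Three escalation layers for comprehension failures:
    clarify, rephrase, then warm transfer. The attempt counter is the
    guard that makes an infinite confusion loop impossible."""
    if attempt == 1:
        return "clarify"        # e.g. "Sorry, could you say that again?"
    if attempt == 2:
        return "rephrase"       # ask differently, e.g. offer concrete options
    return "warm_transfer"      # uncertainty persists after two attempts
```

Because the ceiling is enforced in code rather than in the prompt, no amount of LLM misbehaviour can keep the caller stuck past the second failed attempt.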
At what call volume does it make sense to stop using SaaS platforms?
The rule of thumb is approximately 30,000 minutes/month for LLM self-hosting, and 200,000+ minutes/month for a fully custom stack (orchestration, STT, TTS, telephony). Before those thresholds, the engineering and operational overhead of custom infrastructure exceeds the cost savings.
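The crossover is straightforward arithmetic: SaaS scales linearly with minutes, while self-hosting trades a fixed monthly cost (GPUs, DevOps time) for a much lower per-minute rate. The per-minute figures below are drawn from the comparison table earlier in this guide; the $6,000/month fixed cost is an illustrative assumption, not a quote:

```python
def monthly_cost(minutes: int, saas_per_min: float = 0.20,
                 custom_per_min: float = 0.04,
                 custom_fixed: float = 6000.0) -> tuple:
    """Return (saas_cost, custom_cost) for a given monthly call volume.
    Defaults are illustrative mid-range figures, not vendor pricing."""
    saas = minutes * saas_per_min
    custom = custom_fixed + minutes * custom_per_min
    return saas, custom

# At 30,000 min/month: roughly $6,000 SaaS vs $7,200 custom — SaaS still wins.
# At 60,000 min/month: roughly $12,000 SaaS vs $8,400 custom — custom wins.
```

Run your own numbers: the break-even point moves substantially with your negotiated SaaS rate and how much of the self-hosted stack you already operate.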
Internal Resources
- Voice AI Services: Enterprise Conversational Intelligence
- Self-Hosted AI LLMs vs. Cloud APIs: The Data Sovereignty Guide
- The 2026 Enterprise AI Strategy Playbook
- Business Process Automation Guide 2026
- Why No-Code Fails Enterprise Scaling
External References
- LiveKit Agents — GitHub (Apache 2.0)
- Pipecat — GitHub (BSD)
- Bolna — GitHub (open-source)
- Vapi AI Pricing
- Retell AI Pricing
- ElevenLabs SIP Trunking Documentation
- Twilio Elastic SIP Trunking Pricing
Building a voice AI system for your business? Book a free strategy session with our engineering team — we'll audit your current call handling workflows and map the exact architecture and cost model for your scale.
