AI Call Center Orchestration: Engineering Guide, Costs & Platforms (2026)

Metric	Result
Concurrent Calls	Unlimited (with proper infrastructure)
Response Latency	300ms – 700ms (human-perceived as natural)
First Contact Resolution	85–92% for routine queries
Cost vs. Human Agent	70–90% reduction in per-call cost at scale
24/7 Availability	Always on, no shift premiums

There's a dirty secret the Instagram ads selling you "AI call centers in five minutes" won't tell you: the moment you hit 200 concurrent calls, the SaaS dream starts falling apart.

Latency spikes. Vendor lock-in bites. Your HIPAA compliance officer starts asking uncomfortable questions. And your cost-per-minute — which looked so clean on the pricing page — suddenly has four hidden line items you didn't account for.

This guide is for the builders, the architects, and the business owners who have moved past the demo phase and are asking the real questions:

What does this actually cost at scale?
When do I need real engineers instead of no-code tools?
How does SIP trunking actually work with AI?
What happens when a patient asks to speak to a human?

We'll cover all of it — the open-source frameworks, the SaaS platforms, the infrastructure economics, the telephony plumbing, and what we learned building a voice AI system that replaced a 20-person call centre for a UK medical clinic.

Part 1: The Architecture Stack — What You're Actually Building

Before you compare platforms or prices, you need to understand that an AI call centre is not a single product. It is a pipeline of five distinct components, each of which can be swapped, self-hosted, or rented as a cloud service.

Phone Call → [Telephony] → [STT] → [LLM] → [TTS] → [Telephony] → Caller
                  ↕                          ↕
             [SIP Trunk]              [CRM / EHR / DB]

The Five Components

Telephony Layer (VoIP/SIP) — The plumbing that turns a phone call into a digital audio stream your software can process. Twilio, Telnyx, SignalWire, VoIP Studio.
Speech-to-Text (STT) — Converts the caller's voice into text in real time. Deepgram Nova-3 is the current benchmark. Alternatives: OpenAI Whisper, AssemblyAI, Google STT.
Large Language Model (LLM) — The brain that reads the transcription, understands intent, applies your business rules, and generates a response. GPT-5.5, Claude 3.5 Sonnet, Gemini 2.0, or self-hosted DeepSeek V4.
Text-to-Speech (TTS) — Converts the LLM's response into natural-sounding audio. ElevenLabs, Cartesia, OpenAI TTS, or open-source options like Kokoro/Piper.
Orchestration Layer — The glue that connects all of the above in real time, manages turn detection (knowing when a caller stops talking), handles interruptions, and executes business logic like call transfers and CRM lookups.

This last component — orchestration — is where all the interesting (and difficult) engineering lives.

Part 2: Open-Source Frameworks — Building the Wheel Yourself

If you're reading this with a technical team and a real deployment in mind, these are the frameworks that the serious builders use. They give you complete control over every component, data sovereignty, and zero per-minute vendor tax. The trade-off? You need engineers who know what they're doing.

LiveKit Agents (GitHub: `livekit/agents`)

The most production-mature open-source framework for real-time voice AI.

LiveKit Agents is built on top of LiveKit's battle-tested WebRTC/SIP infrastructure — the same stack that powers real-time communications for some of the largest applications on the web. The agents framework (Apache 2.0 licence) gives you:

Pluggable STT/LLM/TTS providers: Deepgram, OpenAI, ElevenLabs, Azure, Cartesia — swap them in via simple plugin configuration.
Semantic turn detection: Not a primitive silence detector. LiveKit's turn detection uses a small model that understands conversational context, reducing interruption errors significantly.
WebRTC + SIP support: Handles inbound calls from a SIP trunk (Twilio, Telnyx) or WebRTC web/mobile clients in the same framework.
Job scheduling: Built-in worker queue so calls are dispatched to available agent processes without you building a scheduler.
Python and Node.js: Full support for both languages.

When to use it: When you want full infrastructure control, your call volumes are predictable (or highly variable and you want to manage your own autoscaling), and you have a Python/Node.js engineering team. LiveKit Cloud exists if you want hosted signalling without self-hosting everything.

What you're still responsible for: Telephony SIP trunk configuration, LLM prompt engineering, TTS voice procurement, CRM integrations, and all deployment/DevOps.

Pipecat (GitHub: `pipecat-ai/pipecat`)

The "Swiss Army Knife" for real-time conversational AI pipelines.

Pipecat, built by Daily.co, takes a composable pipeline approach. You assemble a directed graph of processors — a Deepgram STT processor feeds into a LangChain LLM processor feeds into an ElevenLabs TTS processor — and Pipecat handles the buffering, chunking, and real-time streaming between them.

What makes Pipecat special:

Frame-based architecture: Audio frames, text frames, function call frames — everything is typed. This makes debugging pipeline issues dramatically easier.
Interruption handling: Built-in support for detecting when a caller interrupts the AI mid-sentence, cutting the TTS output immediately and re-routing to STT.
Pipecat Flows: A companion library for defining structured dialogue state machines — useful when you need the agent to follow a specific script with branching logic (e.g., appointment booking flows where step 2 only happens if step 1 is confirmed).
Broad provider support: 40+ integrated providers for STT, LLM, TTS, and transport layers.

When to use it: When your conversation flows have complex conditional branching (medical intake forms, insurance verification, multi-step qualification scripts) and you need fine control over how each processing step behaves.

Bolna (GitHub: `bolna-ai/bolna`)

Purpose-built for front-desk automation.

Bolna is a narrower, more opinionated open-source framework specifically targeting the use case of automating front-desk phone operations — appointment booking, lead qualification, basic customer service. It ships with pre-built integrations for Twilio, Plivo, Deepgram, and a selection of TTS providers, plus a dashboard for managing agents without writing code for every variable.

Bolna is an interesting middle ground: more accessible than LiveKit Agents for smaller teams, more customisable than a full SaaS wrapper.

When to use it: Small-to-medium practices, real estate offices, dental chains — scenarios where you want open-source flexibility but the specific use case is well-defined enough that Bolna's opinionated structure works in your favour.

Deepgram Voice Agent API

Not a framework — but worth knowing as a primitives provider.

Deepgram's Voice Agent API is a single WebSocket connection that handles STT, LLM (via OpenAI or Anthropic integration), and TTS in one managed service. You send audio in, you get audio back. For prototyping or low-volume deployments, this removes enormous complexity.

The trade-off is loss of control: you can't swap in a custom STT model, you can't change how turn detection works, and you're paying API pricing at all volumes. But for a rapid proof-of-concept or a low-stakes internal tool, it's an excellent starting point.

Comparison Table: Open-Source Frameworks

Framework	Language	Telephony	Complexity	Best For
LiveKit Agents	Python / JS	SIP + WebRTC	High	Production-scale, full control
Pipecat	Python	WebRTC (Daily)	Medium-High	Complex conversation flows
Bolna	Python	Twilio / Plivo	Medium	Front-desk automation, SMBs
Deepgram Voice Agent API	Any (WebSocket)	N/A (no SIP)	Low	Rapid prototyping
Ultravox (Fixie.ai)	Python	SIP	Medium	Low-latency, open-weight LLMs

Part 3: SaaS Platforms — Not Recreating the Wheel (Yet)

Here's the thing nobody says out loud: for most businesses under 50,000 minutes/month, a well-configured SaaS platform is not the lazy option — it's the correct option. The engineering time to build and maintain a custom stack costs more than the per-minute premium, and your energy is better spent on the prompt engineering, CRM integrations, and business logic that actually differentiates your product.

The inflection point is typically ~500 hours (30,000 minutes) of voice per month. Below that, SaaS wins. Above that, the economics start favouring custom infrastructure, especially for regulated industries.

Let's break down the real platforms.

Vapi AI

The most developer-friendly SaaS orchestration layer.

Vapi is not a phone company. It's a voice AI orchestration platform — a managed version of the "glue layer" between your telephony, STT, LLM, and TTS. You bring your own OpenAI/Anthropic/ElevenLabs API keys, and Vapi handles the real-time pipeline management.

Pricing (2026):

Platform fee: $0.05/min (base orchestration only)
+LLM: GPT-5.5 adds ~$0.05/min; Claude 3.5 Sonnet adds ~$0.04–0.06/min
+TTS: ElevenLabs adds ~$0.05–0.07/min; OpenAI TTS adds ~$0.015/min
+STT: Deepgram Nova-2 adds ~$0.0075/min
+Telephony: Twilio via Vapi adds ~$0.01/min; BYOT (Bring Your Own Twilio) is free
HIPAA compliance: $1,000/month add-on
Effective all-in rate: ~$0.18–$0.33/minute with premium stack

Who it's for: Development teams that want to build a custom experience (system prompts, tool calls, webhooks) without managing infrastructure. Vapi has excellent documentation and a generous free tier ($10 credit).

The real limitation: Shared infrastructure. At scale, every customer is competing for the same processing resources. Latency is typically 400–800ms in good conditions; during peak times it can creep higher.

Retell AI

The strongest feature set for complex call flows.

Retell is built for businesses that need more than simple Q&A — think multi-step qualification scripts, dynamic data injection mid-call, and sophisticated call routing logic.

Pricing (2026):

Base rate: $0.07+/min (no platform fee, included in per-minute)
LLM: GPT-5.5 mini $0.006/min; Claude 3.5 Sonnet $0.02–0.06/min; GPT-5.5 $0.05/min
TTS: Standard voices $0.03–0.05/min; ElevenLabs $0.07/min
Telephony (Retell's Twilio): $0.01/min; BYOT: free
Knowledge base: Free for first 10; $0.005/min or $8/month thereafter
Phone numbers: $2/month each
Effective all-in rate: ~$0.13–$0.31/minute depending on stack
Enterprise: Custom pricing, as low as $0.05/min at volume

Standout features: Real-time knowledge base retrieval during calls, custom LLM endpoints (bring your own fine-tuned model), post-call analytics with sentiment analysis, and robust webhook support for CRM integration.

Bland AI

The outbound calling specialist.

Bland AI has carved out a niche in high-volume outbound calling — lead qualification, appointment confirmation campaigns, debt collection, survey calls. It has the most mature outbound tooling of any platform in this list.

Pricing (2026, post-December 2025 restructure):

Free: $0.14/min connected; $0.05/min transfer time
Build ($299/month): $0.12/min connected; $0.04/min transfer
Scale ($499/month): $0.11/min connected; $0.03/min transfer
Failed/short calls (<10 sec): $0.015 per attempt
SMS: $0.02/message
A realistic 1,000 calls × 3 min on Scale: $830+ before add-ons

The catch: Bland's pricing looks attractive on a per-minute basis, but the add-ons can surprise you. Voice cloning, custom LLM endpoints, and HIPAA compliance are all tiered extras.

Vogent AI

The in-house LLM play — lowest effective cost at volume.

Vogent is the most interesting pricing story of 2026. Unlike Vapi and Retell (which pass through third-party LLM costs), Vogent runs its own in-house language models trained specifically for voice AI tasks. This means no GPT API surcharges.

Pricing (2026):

Base pay-as-you-go: $0.09/min
+Premium voices: $0.051/min additional
Enterprise (10M+ minutes): As low as $0.06/min with up to 50% volume discount
No subscription required — truly usage-based

Features: IVR navigation (the AI can press dial-pad keys to navigate legacy phone trees), human handoff, post-call automation, knowledge base integration, HIPAA compliant out of the box.

The trade-off: Vogent's in-house models are excellent for structured tasks (qualification, scheduling, data capture) but less capable than GPT-5.5 or Claude 3.5 Sonnet for nuanced, unscripted conversations. For highly regulated or emotionally complex calls, you pay more on other platforms for better reasoning.

ElevenLabs Conversational AI

The best voices in the industry — in a platform.

ElevenLabs started as a TTS provider and is rapidly expanding into full conversational AI. Their platform is built on top of their commercial-grade voice synthesis (the same voices that power thousands of other applications), giving them a genuine moat on voice quality and emotional expressiveness.

Pricing (2026, credit-based):

Creator ($22/month): ~250 minutes of conversational AI
Pro ($99/month): ~1,100 minutes; overage ~$0.09/min
Scale ($330/month): ~3,600 minutes; overage $0.096/min
Business ($1,320/month): ~13,750 minutes; low-latency TTS as low as $0.05/min
HIPAA compliance: Custom enterprise pricing
SIP trunking: Yes — connect your existing PBX directly to ElevenLabs

Where ElevenLabs wins: Any use case where voice quality is a brand differentiator. Luxury goods, premium healthcare, high-end financial services — anywhere the sound of the AI matters as much as what it says. Their emotional expressiveness is unmatched.

Where it struggles: The credit system obscures real per-minute costs. When you factor in LLM costs (not included in base pricing), the true all-in rate can exceed $0.20/min even on the Scale plan.

Platform Comparison Table

Platform	Base Rate	All-In Estimate	HIPAA	SIP	Best For
Vapi	$0.05/min + pass-through	$0.18–$0.33/min	$1k/mo add-on	✅	Dev teams, custom builds
Retell	$0.07/min + pass-through	$0.13–$0.31/min	Enterprise	✅	Complex flows, analytics
Bland AI	$0.11–$0.14/min	$0.14–$0.25/min	Extra	✅	Outbound campaigns
Vogent	$0.09/min	$0.09–$0.14/min	✅ Included	✅	Cost-efficient, structured tasks
ElevenLabs	Credit-based	$0.10–$0.20+/min	Enterprise	✅	Premium voice quality
Twilio (native AI)	$0.10/min AI + telco	$0.08–$0.15/min	✅	✅ Native	Existing Twilio infrastructure

Part 4: The Telephony Layer — SIP Trunking Explained (Without the Jargon)

This is the section most AI influencers skip because it's genuinely technical and unglamorous. But if you're serious about running AI-powered phone infrastructure, you must understand what SIP trunking is and how it fits into the AI pipeline.

What Is SIP Trunking?

SIP (Session Initiation Protocol) is the internet protocol that manages the signalling for phone calls — establishing, maintaining, and terminating them. A SIP trunk is essentially a virtual phone line that uses your internet connection instead of a physical copper wire.

When a customer calls your AI call centre:

The call arrives at your SIP trunk provider (Twilio, Telnyx, SignalWire, VoIP Studio)
The SIP provider converts the PSTN (public phone network) call into a SIP session
The SIP session routes to your media server (or the SaaS platform's media server)
RTP (Real-time Transport Protocol) carries the actual voice audio as packets
Your AI platform receives PCM audio frames via WebSocket or a direct RTP stream
The AI processes, responds, and pumps audio back the same way

The critical technical nuance: AI voice systems run on WebSocket streaming architectures. Traditional telephony runs on SIP/RTP. The "SIP-to-AI connector" or gateway is what bridges these two incompatible protocol worlds. Most SaaS platforms handle this for you transparently. When you build custom on LiveKit or Pipecat, you're configuring this bridge yourself.

Key SIP Trunk Providers & Costs

Twilio Elastic SIP Trunking (2026)

Channels: $1.00/channel/month (month-to-month) or $0.50/channel/month (annual)
Inbound calls (US): $0.0045/min (local number) / $0.0055/min (toll-free)
Outbound calls (US): $0.003–$0.013/min
Phone numbers: ~$1.00/month (local US); $2.00–10.00+ (international/toll-free)
Twilio AI native assistant: +$0.10/min on top of telephony

Telnyx

Inbound: $0.002–$0.004/min
Outbound: $0.002–$0.008/min
Conversational AI add-on: $0.06/min
Numbers: from $1.00/month
Known for aggressive pricing vs Twilio and superior international coverage

SignalWire

Per-minute rates competitive with Telnyx
Built on FreeSWITCH open-source — favoured by builders who want low-level control
STIR/SHAKEN compliant (spam call labelling protection for outbound campaigns)

VoIP Studio

Business VoIP subscription with AI call centre integration
Plans from ~$16/user/month for calling features
Good fit for businesses already using VoIP Studio as their office phone system who want to bolt AI onto existing infrastructure

VoIP Studio + AI: The Pragmatic Path for SMBs

For small and medium businesses already running on a VoIP phone system, VoIP Studio presents an interesting integration pattern. Rather than rebuilding telephony from scratch:

Your existing VoIP Studio numbers receive calls
A SIP forwarding rule routes calls to your AI platform's SIP endpoint (Vapi, Retell, or your custom LiveKit server)
When the AI needs to transfer to a human, it transfers back into the VoIP Studio queue
Human agents handle the escalation on their existing desk phones or softphones

This "bolt-on" approach minimises disruption, preserves existing numbers and routing, and lets you add AI capacity incrementally.

The Telephony Access Audit: Do This Before Any Architecture Decision

The integration blocker that kills more voice AI projects than any other isn't the AI itself — it's the phone infrastructure. Before selecting a platform or writing a line of code, verify the state of your existing telephony stack.

If you're on a modern cloud-based phone system (RingCentral, Telnyx, Twilio, VoIP Studio), integration is typically straightforward. If you're on an on-premise PBX, a legacy SIP trunk with limited configuration access, or an older hotel or contact centre system, you'll face constraints that must be understood before scoping. Four questions to answer in week one: Who controls your SIP trunk credentials? Can your current PBX forward calls to an external SIP endpoint? What configuration access do you have to your existing phone numbers? And is there a vendor support contract involved that gates the changes you need?

In every call centre AI engagement we run, the first thing we do is a telephony access audit. It's a single conversation that either confirms the path is clear or surfaces the blockers when they're still a conversation — not a mid-build scope change.

Part 5: Call Transfers & Human Handoffs — The Architecture That Actually Matters in Production

Think back to every terrible automated phone experience you've ever had. The AI confidently misunderstands you three times in a row, you mutter some variation of "talk to a human," and then you're placed on hold for 12 minutes while the agent asks you every question the AI just asked.

This is a handoff failure. And it's entirely preventable with good engineering.

The Two Types of Call Transfer

Cold Transfer (Blind Transfer)

The AI sends a SIP REFER message to your telephony provider, which instructs the caller's endpoint to re-establish a connection to the target phone number (a human agent's extension or a queue). The AI session ends immediately.

Caller → [AI Agent] → SIP REFER → [Human Queue]
         (session ends)              (new session starts)

Pros: Fast, minimal latency, no risk of the briefing failing. Cons: The human agent gets no context. The caller has to re-explain everything.

Warm Transfer (Attended / Whisper Transfer)

A second SIP INVITE establishes a connection between the AI and the human agent before the caller is bridged to the human. The AI "whispers" a briefing to the agent (a text or audio summary), then bridges the caller into the call.

Caller → [AI Agent] → SIP INVITE → [Human Agent]
                    → Whisper: "John Smith, appointment rescheduling, prefers morning slots"
                    → Bridge caller to human

Pros: The human agent has full context before they say hello. Zero repeat information for the caller. Cons: Slightly more complex to implement. If the agent briefing fails (agent unavailable, busy), you need a fallback.

The correct production approach is always warm transfer, unless the transfer is happening at millisecond notice due to an emergency (a patient describing chest pain, for example, where you skip the whisper and connect to emergency personnel immediately).

What Context Should Transfer With the Call

The technical implementation (SIP REFER/INVITE) only moves the audio stream. The intelligence of the handoff — the context — must travel separately. In production systems, this means:

SIP headers: Pack structured data into the X-Context-* custom SIP headers on the REFER/INVITE. Most enterprise PBX systems can read these and display them on the agent's screen.
Webhook call summary: Simultaneously POST a JSON summary to your CRM webhook endpoint at the moment of transfer, so the agent's screen pops with the full call summary.
Screen pop integration: For platforms like Salesforce, Zendesk, or HubSpot, use their CTI (Computer Telephony Integration) SDKs to display the AI's conversation summary, caller history, and reason-for-transfer before the agent says a word.

{
  "caller_id": "+14155552671",
  "caller_name": "Sarah Mitchell",
  "crm_contact_id": "CON-81923",
  "summary": "Calling to reschedule appointment #A-2847 from 14 April to any morning slot in the week of 21 April. Confirmed insurance details are current.",
  "transfer_reason": "Caller requested human agent for rescheduling",
  "sentiment": "neutral",
  "duration_ai_call": "2m 14s"
}

AI-to-AI Agent Transfers

This is the scenario that becomes critical as you move toward multi-agent call centre architectures — a pattern ValueStreamAI uses in enterprise deployments.

Imagine the call starts with a triage agent that determines intent, then routes to one of three specialist agents:

A scheduling agent (appointment booking, rescheduling, cancellations)
A clinical information agent (prescription refills, test results, HIPAA-compliant data retrieval from the EHR)
A billing agent (invoice queries, payment plans, insurance verification)

Each agent has a different system prompt, different tool access, different LLM settings optimised for their task. The routing between them is not a phone call transfer — it's an internal message passing system where the conversation context, collected data, and session state are serialised and passed to the next agent process.

In a LiveKit Agents deployment, this looks like:

# Simplified example: routing from triage to scheduling agent
async def route_to_specialist(ctx: JobContext, intent: str, context: ConversationContext):
    if intent == "scheduling":
        await ctx.dispatch_job(
            agent_name="scheduling_agent",
            payload=context.to_dict()
        )
    elif intent == "billing":
        await ctx.dispatch_job(
            agent_name="billing_agent", 
            payload=context.to_dict()
        )

The caller experiences a brief "Let me transfer you to our scheduling team" message, then the new specialist agent picks up the call seamlessly — same audio session, new AI brain, full context from the previous conversation.

Part 6: The Real Cost Breakdown — Building vs. Buying at Every Stage

This is the section that will save you either a lot of money or a lot of regret, depending on which direction you're going.

Stage 1: Idea to Proof of Concept (0–1,000 monthly minutes)

The Verdict: Use a SaaS Platform. Don't Even Think About Custom.

At this stage, you're validating whether the use case works, whether the LLM handles your edge cases, and whether your business logic is correctly specified. You are not optimising for cost.

Recommended stack:

Vapi or Retell for orchestration (free tier or pay-as-you-go)
GPT-5.5 mini or Claude 3.5 Haiku as the LLM (lower cost, sufficient for most tasks)
Standard OpenAI TTS or Cartesia for voices
Twilio for telephony

Estimated monthly cost at 1,000 minutes:

Component	Rate	Cost
Orchestration (Vapi)	$0.05/min	$50
LLM (GPT-5.5 mini via Vapi)	$0.006/min	$6
TTS (OpenAI TTS)	$0.015/min	$15
STT (Deepgram)	$0.0075/min	$7.50
Telephony (Twilio inbound)	$0.0045/min	$4.50
Total	~$0.083/min	~$83/month

At this volume, you have no server costs, no DevOps, no GPU bills. Your only expense is learning. This is the right call.

Stage 2: Growth to Operational (1,000–30,000 monthly minutes)

The Verdict: Stay on SaaS, but Negotiate Enterprise Rates.

At 30,000 minutes/month, you're spending $2,500–$10,000/month on-platform depending on your stack. This starts to feel significant, but you're still nowhere near the break-even point for custom infrastructure.

Key decisions at this stage:

Move to Retell Enterprise or Bland Scale for better per-minute rates
Consider Vogent if your use cases are structured (appointment booking, outbound qualification)
Start building your custom system prompts, tool integrations, and post-call analytics

Estimated monthly cost at 30,000 minutes (Retell Enterprise):

Component	Rate	Cost
Full stack (Retell enterprise)	~$0.10/min	$3,000
Twilio phone numbers (20)	$2/number	$40
CRM webhook server (single EC2)	Flat	$50
Total	~$0.103/min	~$3,090/month

The numbers still make sense. Save your engineering bandwidth for the integrations that actually create business value.

Stage 3: Production Scale (30,000–500,000 monthly minutes)

The Verdict: Hybrid Architecture — SaaS orchestration, self-hosted LLM.

This is where the economics get genuinely interesting. You're spending $30,000–$100,000/month on API costs, and the LLM component (GPT-5.5 or Claude) is usually 30–40% of your total bill.

The smart move at this stage:

Self-host an open-weight LLM (DeepSeek V4 or Mistral-Large fine-tuned on your domain)
Keep the orchestration layer on a managed platform (LiveKit Cloud, Pipecat hosted)
Keep TTS with ElevenLabs or Cartesia (still hard to beat commercially at quality)

Self-hosted LLM infrastructure cost:

Component	Spec	Monthly Cost
GPU server (A100 80GB × 4)	AWS p4d / Lambda Labs	$3,500–$7,000
Inference throughput	500+ req/sec for voice	Managed by above
Engineering (1 MLOps engineer)	Ongoing maintenance	$8,000–$15,000
LLM self-hosted total		~$11,500–$22,000/month

Compare this to 300,000 minutes × $0.05/min (GPT-5.5 via Vapi) = $15,000/month in LLM costs alone.

The break-even on self-hosted LLM for voice AI is approximately 200,000–300,000 minutes per month. Below that: pay the API tax. Above that: bring it in-house.

Stage 4: Enterprise Scale (500,000+ monthly minutes)

The Verdict: Full custom stack. Own every component.

At this scale, you're a serious voice AI business. You have:

A dedicated LiveKit Agents deployment on Kubernetes
Self-hosted LLM cluster (4×H100 GPU minimum, costing ~$160,000 CapEx + $10,000/month OpEx)
Fine-tuned TTS model on your own voices (Kokoro or Matcha-TTS) — eliminating per-minute TTS costs entirely
Self-hosted Deepgram STT via their on-premise licence
Direct interconnects with SIP trunk providers (bypassing Twilio markup)

18-month TCO comparison (estimated at 1M minutes/month):

Approach	Infrastructure	Inference	Engineering	Total 18mo
Cloud APIs (SaaS)	$420,000	$380,000	$60,000	$860,000
Self-hosted custom stack	$180,000 (CapEx)	$45,000	$120,000	$345,000

Savings: ~60% over 18 months. The payback period on the hardware investment starts at month 9.

Part 7: GPU Costs — The Honest Numbers

Nobody talks about this with enough specificity, so let's fix that.

Cloud GPU Pricing (2026)

GPU	Provider	On-Demand Price/hr	Notes
NVIDIA A10	AWS, Lambda	$0.60–$1.20/hr	Good for inference at medium scale
NVIDIA A100 40GB	Lambda Labs	$1.99/hr	Best value for voice AI inference
NVIDIA A100 80GB	AWS p4d	$3.50–$5.00/hr	Complex model inference
NVIDIA H100 80GB	CoreWeave	$2.50–$4.50/hr	State-of-the-art inference
8×H100 cluster	Any hyperscaler	$20–$40/hr	Large model training/fine-tuning

For voice AI specifically, you don't need H100s for inference. An A100 40GB can serve a quantised DeepSeek V4 at sufficient throughput for 200–400 concurrent calls. The engineering cost of proper batching and request routing matters as much as the GPU itself.

When Self-Hosted GPU Beats Cloud API

The inflection point is approximately 500 hours (30,000 minutes) of LLM inference per month. Below that, paying OpenAI or Anthropic per token is cheaper when you factor in idle GPU time. Above that, the maths shifts.

A100 x2 on Lambda Labs:
$1.99/hr × 2 GPUs × 730 hrs/month = $2,906/month
Handles ~1M minutes/month of DeepSeek V4 inference (quantised)

Same volume on GPT-5.5:
1,000,000 minutes × $0.05/min LLM cost = $50,000/month

The case for self-hosting at scale is overwhelming. The challenge is the engineering capability required to do it properly.

Part 8: Case Study — Building a Voice AI System for a Medical Clinic

This isn't hypothetical. This is a condensed version of what we built.

The Situation: A private clinic in the UK had 20 staff handling inbound patient calls covering: appointment booking, rescheduling, prescription refill requests, test result queries, and billing queries. Average call volume: 800–1,200 calls/day. Average handling time: 4.2 minutes. Staff cost: approximately £280,000/year in salaries. Missed call rate during peak hours: 18%.

The Brief: Replace 80% of inbound call volume with AI. Maintain HIPAA/UK GDPR compliance. Integrate with their Semble EHR system. Enable clean handoff to human agents for sensitive or complex cases.

The Architecture We Built

Call → Twilio SIP → LiveKit Agents → Triage Bot
                                          |
                     ┌────────────────────┤
                     ↓                    ↓
             Scheduling Agent       Clinical Info Agent
             (DeepSeek V4)         (GPT-5.5, restricted)
                     |
           ┌─────────┴──────────┐
           ↓                    ↓
     Semble EHR API       Human Agent
     (appointments)      (warm transfer)

Triage agent (always GPT-5.5): Identifies intent within the first 30 seconds. Routes to scheduling, clinical, or human queue. Handles approximately 45% of calls end-to-end (appointment confirmations, cancellations, general FAQs).

Scheduling agent (DeepSeek V4, fine-tuned on clinic's scheduling rules): Integrates directly with Semble's API to read availability, book slots, send SMS confirmations. Handles 40% of total call volume.

Clinical info agent (GPT-5.5, strict tool restrictions): Can retrieve test result statuses from Semble (read-only), check prescription refill eligibility, and route prescription requests to the pharmacy integration. Handles 10% of calls. All clinical queries trigger a structured audit log — every retrieval is recorded with patient ID, timestamp, and agent session ID for CQC compliance.

Human agent escalation (remaining 5%): Warm transfer to the clinic's existing Zoom Phone system. The SIP REFER sends a webhook payload to the clinic's CRM (Cliniko), which screen-pops the agent's dashboard with the patient's record and a plain-English summary of the AI conversation.

The Handoff Triggers (Medical-Specific)

The AI is explicitly trained to escalate immediately — mid-sentence if necessary — in the following scenarios:

Any mention of an emergency: "chest pain," "can't breathe," "bleeding," "fainted" → Immediate cold transfer to emergency line, no whisper delay.
Explicit clinical advice request: "Should I take my medication if..." → Warm transfer with: "I'm connecting you with one of our clinical team who can properly advise you."
Upset or distressed caller: Sentiment monitoring. If frustration score exceeds threshold for 3 consecutive turns → Warm transfer.
Explicit human request: "Can I speak to a real person?" → Immediate warm transfer, no pushback.
Identity verification failure: If the AI cannot confirm patient identity within 3 attempts → Transfer.

The Results

Metric	Before (Human)	After (AI + Human)	Change
Calls handled per day	850 avg	1,240 avg	+46% capacity
Missed call rate	18%	3%	-83%
Avg handling time (AI calls)	4.2 min	2.1 min	-50%
Staff headcount (calls)	20 FTE	6 FTE (escalations only)	-70%
Annual cost	£280,000	£78,000 (6 staff + AI infra)	-72%
Patient satisfaction (CSAT)	3.8/5	4.2/5	+10.5%

The CSAT improvement is worth noting. Patients specifically commented that the AI "always answered immediately" and "never put them on hold." The frustration with the old system wasn't the human relationship — it was the wait time and the missed calls.

Part 9: Niche Applications — Where AI Call Centers Excel

Real Estate

Use case: Inbound lead qualification, property inquiry handling, viewing appointment scheduling.

Why it works: Real estate leads are highly time-sensitive. A lead that calls at 11pm on a Sunday and gets a human-quality AI response converts at dramatically higher rates than a voicemail. The AI can qualify budget, desired area, property type, and timeline in 90 seconds, then book a viewing or escalate to an agent.

Key integration: CRM (HubSpot, Follow Up Boss) for lead capture and calendar booking.

Recommended platform at startup stage: Retell AI with a custom knowledge base of property listings.

Insurance

Use case: First notice of loss (FNOL) calls, policy enquiries, renewal campaigns.

Complexity consideration: Insurance calls often involve legal language, policy-specific details, and emotional stress (accidents, claims). The LLM must be configured with extreme precision for accuracy, and hallucination guardrails are non-negotiable. This is a use case where GPT-5.5 or Claude 3.5 Sonnet (not smaller models) are mandatory.

Critical requirement: All conversations must be recorded, transcribed, and stored for regulatory compliance. Build this into your architecture from day one.

Legal (Intake Calls)

Use case: Initial client intake, case type qualification, conflict of interest screening.

Compliance complexity: Attorney-client privilege considerations mean audio recordings require careful handling. Many firms route AI intake calls through a separate phone number specifically to clarify the non-privileged nature of the initial AI interaction before privilege attaches.

What works well: Structured data collection (incident date, jurisdiction, basic facts) that a human intake coordinator reviews post-call.

Automotive Dealerships

Use case: Service appointment scheduling, parts availability enquiries, vehicle trade-in lead capture.

Why outbound works here: Service reminder campaigns — "Your vehicle is due for a service" — have extremely high answer rates because the calls are non-threatening and the value proposition is clear. Bland AI's outbound stack is well-suited to this use case.

E-Commerce & Retail

Use case: Order status, returns, delivery queries, product recommendations.

Integration depth: The AI needs real-time access to your OMS (Shopify, Magento, SAP Commerce) to answer "where is my order?" The deeper the integration, the higher the resolution rate without human escalation.

Part 10: The "Build vs. Buy" Decision Framework

Here's the question in its simplest form:

Are you building a call centre feature, or are you building a call centre product?

If it's a feature — you want to add voice AI to an existing product or business — start with SaaS. Get it working. Learn what your users actually need. Iterate. The platform cost is the R&D tax you pay to avoid expensive guesses.

If it's a product — voice AI is your core value proposition, and you'll be selling it to others or running it at scale — you need custom infrastructure eventually. The question is not whether to build it, but when.

Situation	Recommendation
<1,000 min/month, idea stage	Vapi or Retell, pay-as-you-go
1,000–30,000 min/month, proving ROI	Retell Enterprise or Vogent
30,000–200,000 min/month, scaling	Hybrid: SaaS orchestration + self-hosted LLM
200,000+ min/month, core business	Full custom: LiveKit Agents + self-hosted stack
HIPAA/regulated, any scale	Ensure HIPAA BAA is signed; consider self-hosted from 10k+ min
Premium voice quality is brand-critical	ElevenLabs Conversational AI (any scale)
Outbound campaign focus	Bland AI Scale + BYOT Twilio

The ValueStreamAI Voice AI Architecture

When we build for clients, we don't pick a single platform and live with its limitations. We architect modularly:

Autonomy: AI agents that proactively make outbound calls for reminders, confirmations, and re-engagement campaigns — not just waiting for inbound calls.
Tool Use: Deep CRM, EHR, and ERP integrations via MCP-standard connectors. The AI can book, reschedule, look up records, and trigger downstream workflows during the call.
Planning: Multi-agent routing where a triage agent decides which specialist agent handles the call — not a single monolithic prompt trying to do everything.
Memory: Persistent caller context across multiple calls via a vector store. A returning patient is greeted by name; the AI knows their booking history and preferences.
Multi-Step Reasoning: Conditional escalation logic with sentiment monitoring, compliance guardrails, and human-in-the-loop checkpoints for high-stakes decisions.

The Competitor Pulse Check: Open-Source vs. SaaS vs. Custom

Factor	SaaS (Vapi/Retell)	Self-Hosted (LiveKit)	ValueStreamAI Custom
Time to deploy	Hours	Weeks-months	6–12 weeks
Per-minute cost (scale)	$0.10–$0.33	$0.02–$0.08	$0.015–$0.06
Data sovereignty	Vendor cloud	Your infra	Your infra / Private VPC
Customisation	Limited/Moderate	Full	Full
Human handoff quality	Basic (cold transfer)	Full warm transfer support	Full warm + context pass
HIPAA out of the box	Add-on ($$$)	DIY	Engineered in
Ongoing maintenance	None	High (DevOps team needed)	Managed by us
Ideal for	0–30k min/month	Engineering teams	10k+ min/month enterprises

Project Scope & Pricing Tiers (ValueStreamAI Voice AI)

Here's what it costs to build properly:

Voice AI Pilot (4–6 Weeks): $5,000 – $25,000
- Ideal for: Single inbound use case (appointment booking, basic FAQ handling), SaaS orchestration stack, Twilio telephony, one CRM integration.
Custom Voice Agent Ecosystem (8–14 Weeks): $25,000 – $75,000
- Ideal for: Multi-agent routing (triage + 2–3 specialist agents), warm transfer implementation, full CRM/EHR integration, HIPAA compliance architecture, custom voice, post-call analytics.
Enterprise Voice Infrastructure (14+ Weeks): $75,000+
- Ideal for: 100k+ minutes/month operations, self-hosted LLM deployment, full data sovereignty, on-premise or private VPC, custom TTS fine-tuning, multi-site telephony architecture, MLOps pipeline for ongoing model improvement.

For a real-time cost model specific to your call volume, use our Interactive ROI Calculator.

Real Caller Testing Before Full Autonomy

Every voice AI pilot we've built goes through the same final gate before autonomous operation: a controlled batch of real caller interactions with full logging and human review. Without exception, the first 100 real calls surface 4–6 failure modes that survived weeks of internal testing — because internal testers know what the system is supposed to do and approach it with that knowledge. Real callers don't. They speak with accents the test team didn't represent, they ask questions mid-flow that interrupt the expected conversation path, and they express frustration in ways that internal QA never simulated.

The mitigation isn't to delay launch indefinitely. It's to run the first real-caller batch with a human review gate in place: the agent handles calls, but a team member reviews transcripts before any autonomous action is taken at scale. Once the first 100 calls are reviewed and failure modes are addressed, the approval gate can be progressively removed. Deploying without this step means your real callers are your QA process — which is expensive in a different way.

Frequently Asked Questions

What is the difference between a SIP trunk and a VoIP system?

VoIP (Voice over Internet Protocol) is the broad category — making phone calls over the internet rather than traditional copper lines. SIP trunking is the specific protocol used to connect your VoIP system to the public telephone network (PSTN). When an AI call centre handles calls, it sits behind a SIP trunk that converts incoming PSTN calls to the digital streams the AI processes.

Can AI voice agents handle Scottish or regional accents accurately?

Yes — with proper STT configuration. Deepgram's Nova-3 model handles UK regional accents well out of the box. For strong regional accents (Glaswegian, Geordie, thick Irish), we recommend fine-tuning a Whisper model on accent-specific audio data. We've done this for healthcare clients in Scotland and achieved >95% transcription accuracy.

How does warm transfer work technically in a SIP environment?

During a warm transfer, the AI establishes a second SIP INVITE to the target agent while keeping the original caller's session active (three-way call bridge). The AI briefs the agent via a "whisper" audio stream heard only by the agent. Once confirmed, the AI bridges the caller into the existing session and drops out, leaving the caller and human agent in a direct two-party call.

Is it legally compliant to record AI call centre conversations?

In most jurisdictions, yes — with disclosure. In the UK, Ofcom guidelines require that callers are informed if a call is being recorded. In most US states, single-party consent applies (the business recording is one party). California, Illinois, and a handful of others require two-party consent. Your call greeting should always include a disclosure statement. For HIPAA-covered calls, recordings must be stored with appropriate encryption, access controls, and audit logging.

What happens when the AI doesn't understand a caller?

A properly engineered AI call centre has three escalation layers for comprehension failures: (1) a clarification request if intent is unclear, (2) a rephrasing with a different approach if the first clarification fails, and (3) a warm transfer to a human agent if uncertainty persists after two attempts. The system should never loop a caller in an infinite confusion cycle — this is the most common source of poor AI call centre experiences and is entirely avoidable.

At what call volume does it make sense to stop using SaaS platforms?

The rule of thumb is approximately 30,000 minutes/month for LLM self-hosting, and 200,000+ minutes/month for a fully custom stack (orchestration, STT, TTS, telephony). Before those thresholds, the engineering and operational overhead of custom infrastructure exceeds the cost savings.

Internal Resources

External References

LiveKit Agents — GitHub (Apache 2.0)
Pipecat — GitHub (BSD)
Bolna — GitHub (open-source)
Vapi AI Pricing
Retell AI Pricing
ElevenLabs SIP Trunking Documentation
Twilio Elastic SIP Trunking Pricing

Building a voice AI system for your business? Book a free strategy session with our engineering team — we'll audit your current call handling workflows and map the exact architecture and cost model for your scale.

Disclaimer: This article is for informational purposes only and does not constitute financial, legal, or professional advice. Consult a qualified professional before making business or investment decisions.

ShareLinkedIn X / Twitter

ValueStreamAI Engineering Team

AI Automation Specialists · Paisley, Scotland & Pembroke Pines, FL

ValueStreamAI builds custom agentic AI systems for SMBs and enterprises across the US and UK. Learn more about us →

#AI Call Center#Voice AI#SIP Trunking#LiveKit#Vapi#ElevenLabs#Twilio#AI Agent Orchestration#Call Center Automation#Human Handoff#VoIP#Pipecat

← back to blog

AI Call Center Orchestration: The Complete Engineering & Cost Guide (2026)

Part 1: The Architecture Stack — What You're Actually Building

The Five Components

Part 2: Open-Source Frameworks — Building the Wheel Yourself

LiveKit Agents (GitHub: livekit/agents)

Pipecat (GitHub: pipecat-ai/pipecat)

Bolna (GitHub: bolna-ai/bolna)

Deepgram Voice Agent API

Comparison Table: Open-Source Frameworks

Part 3: SaaS Platforms — Not Recreating the Wheel (Yet)

Vapi AI

Retell AI

Bland AI

Vogent AI

ElevenLabs Conversational AI

Platform Comparison Table

Part 4: The Telephony Layer — SIP Trunking Explained (Without the Jargon)

What Is SIP Trunking?

Key SIP Trunk Providers & Costs

VoIP Studio + AI: The Pragmatic Path for SMBs

The Telephony Access Audit: Do This Before Any Architecture Decision

Part 5: Call Transfers & Human Handoffs — The Architecture That Actually Matters in Production

The Two Types of Call Transfer

What Context Should Transfer With the Call

AI-to-AI Agent Transfers

Part 6: The Real Cost Breakdown — Building vs. Buying at Every Stage

Stage 1: Idea to Proof of Concept (0–1,000 monthly minutes)

Stage 2: Growth to Operational (1,000–30,000 monthly minutes)

Stage 3: Production Scale (30,000–500,000 monthly minutes)

Stage 4: Enterprise Scale (500,000+ monthly minutes)

Part 7: GPU Costs — The Honest Numbers

Cloud GPU Pricing (2026)

When Self-Hosted GPU Beats Cloud API

Part 8: Case Study — Building a Voice AI System for a Medical Clinic

The Architecture We Built

The Handoff Triggers (Medical-Specific)

The Results

Part 9: Niche Applications — Where AI Call Centers Excel

Real Estate

Insurance

Legal (Intake Calls)

Automotive Dealerships

E-Commerce & Retail

Part 10: The "Build vs. Buy" Decision Framework

The ValueStreamAI Voice AI Architecture

The Competitor Pulse Check: Open-Source vs. SaaS vs. Custom

Project Scope & Pricing Tiers (ValueStreamAI Voice AI)

Real Caller Testing Before Full Autonomy

Frequently Asked Questions

What is the difference between a SIP trunk and a VoIP system?

Can AI voice agents handle Scottish or regional accents accurately?

How does warm transfer work technically in a SIP environment?

Is it legally compliant to record AI call centre conversations?

What happens when the AI doesn't understand a caller?

At what call volume does it make sense to stop using SaaS platforms?

Internal Resources

External References

Thirty minutes.We'll tell you exactlywhere your ROI is.

LiveKit Agents (GitHub: `livekit/agents`)

Pipecat (GitHub: `pipecat-ai/pipecat`)

Bolna (GitHub: `bolna-ai/bolna`)

Thirty minutes.
We'll tell you exactly
where your ROI is.