Most AI projects fail not because they chose the wrong model, but because they chose the wrong architecture. A GPT-4o wrapper bolted onto a monolithic backend might survive a demo. It will not survive 10,000 concurrent users, a compliance audit, a model deprecation, or the moment you need to add a second agent that talks to the first.
The fundamental shift happening across enterprise AI in 2026 is architectural: teams that built their first AI features as monolithic additions to existing applications are now rebuilding them as distributed, composable services — where each agent, each retrieval pipeline, and each model integration is its own independently deployable unit. This is not over-engineering. It is the only pattern that survives contact with real production requirements.
This guide maps every major AI system architecture pattern — from RAG pipelines to multi-agent microservices — with concrete diagrams, cloud service recommendations for AWS, Azure, and GCP, and the open standards (MCP, A2A, AGENTS.md) that make these systems interoperable by design. If you are still at the stage of deciding whether to build an agent at all, our practical AI agent development guide covers that decision in depth.
| Architecture Signal | Benchmark (2026) |
|---|---|
| Monolith-to-microservices AI migration | 60% of enterprise AI teams mid-migration |
| MCP adoption | 150+ organisations including AWS, Google, Microsoft, OpenAI |
| LLM cost reduction (2024 → 2026) | ~80% across all major providers |
| Legacy integration as #1 AI barrier | Cited by 60% of enterprise AI leaders |
1. The Architectural Decision Before Anything Else
Before picking a framework or a model, every AI system architect must answer four questions:
- Knowledge problem — Does the system need to reason over private, dynamic, or domain-specific data the base model was never trained on? → RAG or fine-tuning layer.
- Action problem — Does the system need to do things in the world — write to a CRM, trigger a workflow, execute code? → Tool-use layer with structured outputs.
- Latency problem — Does the UX require token-by-token streaming, real-time voice, or sub-100ms responses? → Streaming architecture (SSE or WebSockets).
- Integration problem — Does the system need to work alongside legacy enterprise software, multiple AI vendors, or other agents? → MCP + A2A + event-driven integration layer.
Every other technical choice flows from these four answers. Teams that skip this step end up refactoring within three months. The right AI system architecture decision starts here — not with the model choice.
2. Monolith vs. Microservices: The Core AI Architecture Debate
The AI Monolith
In a monolithic AI architecture, the entire AI capability — prompt construction, LLM calls, RAG retrieval, tool execution, response formatting — lives inside a single application. This is almost always how teams start, and it is entirely appropriate for early-stage projects.
┌─────────────────────────────────────────────────────────┐
│ AI MONOLITH APPLICATION │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌───────┐ │
│ │ Prompt │ │ RAG / │ │ LLM │ │ Tool │ │
│ │ Builder │→ │ Retrieval │→ │ Client │→ │ Use │ │
│ └──────────┘ └───────────┘ └──────────┘ └───────┘ │
│ ↑ ↓ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Single Database / Single Deploy │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Where the monolith breaks:
- Scaling one AI component (e.g., the retrieval pipeline under heavy load) means scaling the entire application
- Swapping the LLM provider requires a full redeploy
- A second team building a second AI feature duplicates the entire stack
- A compliance requirement to isolate data processing forces a full restructure
The AI Microservices Architecture
In a microservices AI architecture, each capability is an independently deployable service with a well-defined API. The LLM gateway, the retrieval service, each agent, the memory store — all separate services that communicate over standard protocols.
┌─────────────────┐
│ API GATEWAY │
│ (Kong / APIM) │
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ ORCHESTRATOR │ │ RAG SERVICE │ │ MEMORY SERVICE │
│ AGENT │ │ (Retrieval) │ │ (Vector Store) │
│ (LangGraph) │ │ │ │ │
└───────┬────────┘ └───────┬────────┘ └───────┬────────┘
│ │ │
┌───────▼────────┐ ┌───────▼────────┐ ┌───────▼────────┐
│ LLM GATEWAY │ │ EMBED SERVICE │ │ TOOL SERVICES │
│ (LiteLLM / │ │ (text-embed- │ │ (MCP Servers) │
│ Portkey) │ │ 3-large) │ │ │
└───────┬────────┘ └────────────────┘ └────────────────┘
│
┌────────┴─────────┐
│ │
┌───▼───┐ ┌───────┐ ┌▼──────┐
│Claude │ │ GPT-5 │ │Gemini │
│ API │ │ API │ │ API │
└───────┘ └───────┘ └───────┘
Each service is containerised, independently scalable, and replaceable without touching the rest of the system. This is the pattern that survives.
3. AI Agents as Microservices
The most important architectural insight of 2026 is this: an AI agent is a microservice. This reframing transforms how you think about your AI system architecture: each agent gets its own deployment, its own scaling policy, and its own interface contract — a defined input schema, a defined output schema, a single responsibility, and communication over a standard protocol. The difference from a traditional microservice is that its internal logic is driven by an LLM rather than deterministic code.
The Agent-as-Service Pattern
┌─────────────────────────────────────────────────────────┐
│ AGENT MICROSERVICE │
│ │
│ Input: Task JSON (A2A Task or HTTP POST) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ System │ │ Tools │ │ Memory / RAG │ │
│ │ Prompt │ │ (via MCP)│ │ (Vector Search) │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
│ ↘ ↓ ↙ │
│ ┌──────────────────────────────┐ │
│ │ LLM Runtime │ │
│ │ (Claude / GPT-5 / Gemini) │ │
│ └──────────────────────────────┘ │
│ ↓ │
│ Output: Structured JSON (Pydantic / Zod validated) │
│ │
│ Deployment: Docker container → K8s pod / Cloud Run │
└─────────────────────────────────────────────────────────┘
This framing has profound implications:
- Single Responsibility — each agent does one thing well (invoice processing, compliance checking, customer triage). A "do everything" agent is an antipattern.
- Horizontal scaling — spin up more agent pods under load, exactly like any other service
- Independent deployment — update the invoice agent's prompt or model without touching the compliance agent
- Standard interfaces — inputs and outputs are typed JSON schemas; the internal LLM is an implementation detail
- Observability — traces, logs, and metrics work the same way as any other service
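A minimal sketch of this contract, assuming FastAPI and Pydantic; the service name, schemas, and run_agent helper are illustrative, not a prescribed interface:

```python
# Agent-as-microservice sketch. The typed request/response models are the
# service contract; the LLM behind run_agent() is an implementation detail.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="invoice-agent")

class InvoiceTask(BaseModel):        # input schema: what callers must send
    invoice_id: str
    document_url: str

class InvoiceResult(BaseModel):      # output schema: what callers can rely on
    invoice_id: str
    vendor_name: str
    total_amount: float
    flags: list[str] = []

async def run_agent(task: InvoiceTask) -> InvoiceResult:
    # Placeholder for the real agent loop: prompt + MCP tools + validation.
    return InvoiceResult(invoice_id=task.invoice_id,
                         vendor_name="(extracted)", total_amount=0.0)

@app.post("/tasks", response_model=InvoiceResult)
async def handle_task(task: InvoiceTask) -> InvoiceResult:
    return await run_agent(task)
```

Swapping the model inside run_agent changes nothing for callers: the schema is the interface.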
Specialist Agent Registry Pattern
In a multi-agent system, an orchestrating agent needs to know which specialist agents exist and what they can do. The Agent Card (from the A2A protocol) is the standard for this:
┌──────────────────────────────────────────────────────────┐
│ ORCHESTRATOR AGENT │
│ "Complete a new vendor onboarding" │
└──────────────────┬───────────────────────────────────────┘
│ Discovers agents via Agent Cards (A2A)
┌───────────┼───────────┬───────────────┐
▼ ▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Document │ │ Risk & │ │ ERP │ │ Notification │
│ Extract │ │Compliance│ │ Onboard │ │ Agent │
│ Agent │ │ Agent │ │ Agent │ │ │
└────────────┘ └──────────┘ └──────────┘ └──────────────┘
↓ ↓ ↓ ↓
Reads PDF Checks SAP API Sends email
(MCP: S3) sanctions DB (MCP: SAP) (MCP: SES)
(MCP: Postgres)
Each specialist agent exposes an Agent Card — a JSON document declaring what it can do, what inputs it accepts, and how to reach it. The orchestrator queries the registry, selects the right specialists, and delegates via A2A task handoffs.
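An illustrative Agent Card for the compliance specialist above. The field names follow the general shape of the A2A specification, but treat this as a sketch rather than the normative schema:

```json
{
  "name": "risk-compliance-agent",
  "description": "Runs sanctions and credit checks on prospective vendors",
  "url": "https://agents.internal.example.com/compliance",
  "version": "1.2.0",
  "capabilities": { "streaming": false },
  "skills": [
    {
      "id": "vendor-risk-check",
      "name": "Vendor risk check",
      "description": "Returns a risk score and flags for a vendor record"
    }
  ]
}
```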
4. Cloud-Native AI Architecture: AWS, Azure, and GCP
AWS AI Architecture
AWS provides the most mature set of managed services for production AI systems. A reference architecture for a multi-agent enterprise system on AWS:
┌──────────────────────────────────────────────────────────────────────┐
│ AWS AI REFERENCE ARCHITECTURE │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────────┐ │
│ │ API GATEWAY │─────▶│ ECS / EKS CLUSTER │ │
│ │ (AWS APIGW) │ │ │ │
│ └────────────────┘ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │Orchestrator│ │ RAG Service │ │ │
│ ┌────────────────┐ │ │ Agent │ │ (FastAPI) │ │ │
│ │ EVENT BUS │─────▶│ │ (LangGraph)│ └───────┬────────┘ │ │
│ │ (EventBridge) │ │ └──────┬─────┘ │ │ │
│ └────────────────┘ │ │ ┌────────▼────────┐ │ │
│ │ ┌──────▼──────┐ │ OpenSearch │ │ │
│ ┌────────────────┐ │ │ LLM Gateway │ │ (Vector Store) │ │ │
│ │ QUEUE / ASYNC │ │ │ (LiteLLM) │ └─────────────────┘ │ │
│ │ (SQS) │─────▶│ └──────┬──────┘ │ │
│ └────────────────┘ │ │ │ │
│ └─────────┼────────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ OBJECT STORE │ │ ┌──────────────────────────┐ │
│ │ (S3) │◀───MCP Server──┤ │ AWS BEDROCK │ │
│ └────────────────┘ │ │ (Claude / Llama / Titan)│ │
│ ├─▶│ or direct Anthropic API │ │
│ ┌────────────────┐ │ └──────────────────────────┘ │
│ │ SECRET STORE │ │ │
│ │ (Secrets Mgr) │ │ ┌──────────────────────────┐ │
│ └────────────────┘ └─▶│ TOOL MCP SERVERS │ │
│ │ Lambda functions wrapping│ │
│ ┌────────────────┐ │ Salesforce / RDS / SAP │ │
│ │ OBSERVABILITY │ └──────────────────────────┘ │
│ │ (CloudWatch + │ │
│ │ X-Ray) │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key AWS services per layer:
| Layer | AWS Service | Role |
|---|---|---|
| Compute | ECS Fargate / EKS | Agent container hosting |
| Serverless tools | Lambda | MCP server implementations |
| LLM access | Bedrock / direct API | Claude, Llama, Titan |
| Vector store | OpenSearch Serverless | RAG retrieval |
| Object storage | S3 | Document ingestion source |
| Async messaging | SQS + EventBridge | Event-driven agent triggers |
| Secrets | Secrets Manager | API keys, credentials |
| Observability | CloudWatch + X-Ray | Traces, logs, metrics |
| API layer | API Gateway + WAF | Rate limiting, auth |
Azure AI Architecture
Azure's strength is deep enterprise integration, particularly for organisations already on Microsoft 365 and Entra ID (formerly Azure Active Directory):
┌──────────────────────────────────────────────────────────────────────┐
│ AZURE AI REFERENCE ARCHITECTURE │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────────┐ │
│ │ API MGMT │─────▶│ AZURE KUBERNETES (AKS) │ │
│ │ (APIM) │ │ │ │
│ └────────────────┘ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │Orchestrator│ │ RAG Service │ │ │
│ ┌────────────────┐ │ │ Agent │ │ (FastAPI) │ │ │
│ │ SERVICE BUS │─────▶│ │ (LangGraph)│ └───────┬────────┘ │ │
│ │ (Async events) │ │ └──────┬─────┘ │ │ │
│ └────────────────┘ │ │ ┌────────▼────────┐ │ │
│ │ ┌──────▼──────┐ │ AI Search │ │ │
│ ┌────────────────┐ │ │ LLM Gateway │ │ (Vector Index) │ │ │
│ │ BLOB STORAGE │ │ │ (LiteLLM) │ └─────────────────┘ │ │
│ │ (Source docs) │─────▶│ └──────┬──────┘ │ │
│ └────────────────┘ └─────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────┐ │ ┌──────────────────────────┐ │
│ │ KEY VAULT │ ├─▶│ AZURE OPENAI SERVICE │ │
│ │ (Secrets) │ │ │ GPT-4o / GPT-5 / Ada │ │
│ └────────────────┘ │ └──────────────────────────┘ │
│ │ │
│ ┌────────────────┐ │ ┌──────────────────────────┐ │
│ │ ENTRA ID │ └─▶│ AZURE FUNCTIONS │ │
│ │ (AuthN/AuthZ) │ │ MCP servers: Dynamics, │ │
│ └────────────────┘ │ SharePoint, SQL, Teams │ │
│ └──────────────────────────┘ │
│ ┌────────────────┐ │
│ │ MONITOR + │ ← Traces every agent call end-to-end │
│ │ APP INSIGHTS │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key Azure services per layer:
| Layer | Azure Service | Role |
|---|---|---|
| Compute | AKS / Container Apps | Agent container hosting |
| Serverless tools | Azure Functions | MCP server implementations |
| LLM access | Azure OpenAI Service | GPT-4o, GPT-5, Ada embeddings |
| Vector store | Azure AI Search | RAG with hybrid retrieval |
| Object storage | Blob Storage | Document source |
| Async messaging | Service Bus + Event Grid | Event-driven triggers |
| Identity | Entra ID (AAD) | Enterprise SSO, RBAC |
| Secrets | Key Vault | API keys, certificates |
| Observability | Monitor + App Insights | Distributed tracing |
GCP AI Architecture
GCP's differentiator is Vertex AI — a fully managed platform for building, deploying, and orchestrating AI agents, with native access to Gemini models and a 1M-token context window:
┌──────────────────────────────────────────────────────────────────────┐
│ GCP AI REFERENCE ARCHITECTURE │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────────┐ │
│ │ CLOUD ENDPOINTS│─────▶│ GOOGLE KUBERNETES ENGINE (GKE) │ │
│ │ / APIGEE │ │ │ │
│ └────────────────┘ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │Orchestrator│ │ RAG Service │ │ │
│ ┌────────────────┐ │ │ Agent │ │ (FastAPI) │ │ │
│ │ PUB/SUB │─────▶│ │(ADK/LGraph)│ └───────┬────────┘ │ │
│ │ (Async events)│ │ └──────┬─────┘ │ │ │
│ └────────────────┘ │ │ ┌────────▼────────┐ │ │
│ │ ┌──────▼──────┐ │ VERTEX AI │ │ │
│ ┌────────────────┐ │ │ LLM Gateway │ │ VECTOR SEARCH │ │ │
│ │ CLOUD STORAGE │ │ │ (LiteLLM) │ └─────────────────┘ │ │
│ │ (Source docs) │─────▶│ └──────┬──────┘ │ │
│ └────────────────┘ └─────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────┐ │ ┌──────────────────────────┐ │
│ │ SECRET MANAGER│ ├─▶│ VERTEX AI + GEMINI 2.5 │ │
│ │ │ │ │ Pro (1M ctx) / Flash- │ │
│ └────────────────┘ │ │ Lite ($0.075/1M tokens) │ │
│ │ └──────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ BIGQUERY │ │ ┌──────────────────────────┐ │
│ │ (Analytics + │◀───MCP Server──┴─▶│ CLOUD RUN │ │
│ │ agent logs) │ │ MCP servers: BigQuery, │ │
│ └────────────────┘ │ Salesforce, SAP, SQL │ │
│ └──────────────────────────┘ │
│ ┌────────────────┐ │
│ │ CLOUD TRACE + │ ← OpenTelemetry-native observability │
│ │ CLOUD LOGGING │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
5. RAG Architecture in Production
RAG (Retrieval-Augmented Generation) is the foundational pattern for grounding LLM responses in private or live data. Rather than baking knowledge into model weights, RAG retrieves relevant context at inference time.
The Full RAG Pipeline Architecture
INGESTION PIPELINE (offline / scheduled)
════════════════════════════════════════
┌──────────┐ ┌───────────┐ ┌────────────┐ ┌──────────────┐
│ Source │──▶│ Chunking │──▶│ Embedding │──▶│ Vector Store │
│ (S3 / │ │ Service │ │ Service │ │(Pinecone / │
│ Blob / │ │(recursive │ │(text-embed │ │ pgvector / │
│ GCS) │ │ + overlap)│ │ -3-large) │ │ OpenSearch) │
└──────────┘ └───────────┘ └────────────┘ └──────────────┘
QUERY PIPELINE (real-time, per request)
════════════════════════════════════════
┌──────────┐ ┌───────────┐ ┌────────────┐ ┌──────────────┐
│ User │──▶│ Embed │──▶│ Hybrid │──▶│ Reranking │
│ Query │ │ Query │ │ Search │ │ (Cohere / │
│ │ │ │ │(Dense + │ │ FlashRank) │
└──────────┘ └───────────┘ │ BM25) │ └──────┬───────┘
└────────────┘ │
┌────────▼───────┐
│ LLM (with │
│ retrieved │
│ context) │
└────────────────┘
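To make the hybrid search step concrete, here is a sketch of dense + BM25 fusion using reciprocal rank fusion, one common merging strategy. rank_bm25 is a real library; dense_search is a hypothetical wrapper around your vector store, and chunking and tokenisation are heavily simplified:

```python
# Hybrid retrieval sketch: dense + BM25 merged with reciprocal rank fusion.
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Sparse leg: BM25 over whitespace-tokenised chunks.
    bm25 = BM25Okapi([c.split() for c in chunks])
    scores = bm25.get_scores(query.split())
    sparse = sorted(range(len(chunks)), key=lambda i: -scores[i])

    # Dense leg: chunk indices ranked by vector similarity.
    dense = dense_search(query)  # hypothetical vector-store wrapper

    # Reciprocal rank fusion: reward chunks ranked highly by either leg.
    fused: dict[int, float] = {}
    for ranking in (sparse, dense):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top]
```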
RAG Variant Comparison
| RAG Type | Best For | Complexity |
|---|---|---|
| Naive RAG | Proof of concept, small doc sets | Low |
| Hybrid RAG (Dense + BM25) | Production — consistently best recall | Medium |
| Graph RAG | Multi-hop reasoning, entity relationships | High |
| Agentic RAG | Agent decides when/what to retrieve | High |
| Multimodal RAG | PDFs with tables, charts, images | High |
6. Fine-Tuning Architecture
Fine-tuning adapts a base model's weights to your domain vocabulary, output format, or task behaviour. It teaches the model how to behave, not what to know — a critical distinction.
LoRA and QLoRA: The Production Standard
Full fine-tuning of a 70B model costs tens of thousands of dollars. In 2026, virtually all production fine-tuning uses LoRA (Low-Rank Adaptation) or QLoRA:
BASE MODEL (frozen weights)
↓
┌─────────────────────────────────────────────┐
│ LORA ADAPTER (0.1–1% of total parameters) │
│ Small trainable rank-decomposition matrices│
│ added to attention layers │
└─────────────────────────────────────────────┘
↓
Fine-tuned behaviour at a fraction of the cost
QLoRA adds: 4-bit quantization of base model
→ 7B model: 14GB (FP16) → ~5GB (QLoRA)
→ Quality loss: typically < 5% on benchmarks
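A minimal QLoRA setup sketch using Hugging Face transformers, bitsandbytes, and peft. The model id, rank, and target modules are illustrative choices, not recommendations:

```python
# QLoRA sketch: 4-bit quantised base model + small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantise the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct", quantization_config=bnb
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapters on attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically well under 1% trainable
```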
Fine-Tuning vs. RAG: The Decision Matrix
| Scenario | RAG | Fine-Tuning |
|---|---|---|
| Dynamic / frequently updated knowledge | ✅ Ideal | ❌ Requires retraining |
| Citable sources required | ✅ Native | ❌ Not possible |
| Consistent output format / tone | ⚠️ Prompt engineering only | ✅ Ideal |
| Domain-specific jargon | ⚠️ Partial | ✅ Ideal |
| Data privacy (no data leaves premises) | ✅ Self-hosted embeddings | ✅ Self-hosted model |
The production answer in 2026 is often both: a fine-tuned model as the inference engine, RAG providing the dynamic knowledge layer.
7. Quantization: Shrinking Models for Deployment
Quantization reduces the numerical precision of model weights — from FP32 or FP16 down to INT8 or INT4 — to shrink memory footprint and accelerate inference with minimal quality loss.
Model Size by Quantization Format (70B parameter model)
══════════════════════════════════════════════════════════
FP32 ████████████████████████████████████████ ~280 GB
FP16 ████████████████████████ ~140 GB
INT8 █████████████████ ~70 GB
INT4 ████████ ~35 GB ← fits on a single H100
1-bit ██ ~9 GB ← Experimental
| Format | Quality Loss | Production Readiness | Tool |
|---|---|---|---|
| FP16 | None | Cloud inference, training | Native |
| INT8 | < 1% | Production serving | bitsandbytes, vLLM |
| INT4 GPTQ/AWQ | 2–5% | GPU-accelerated serving | AutoGPTQ, AutoAWQ |
| GGUF (INT4) | 2–5% | CPU-friendly local | llama.cpp, Ollama |
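Serving a pre-quantised checkpoint is typically a few lines. A sketch using vLLM's offline inference API, with an illustrative AWQ model id:

```python
# Loading and serving a pre-quantised model with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-70B-AWQ", quantization="awq")
params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.generate(["Summarise our Q3 vendor risk policy."], params)
print(outputs[0].outputs[0].text)
```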
8. Function Calling, Structured Outputs, and JSON Mode
These three capabilities form the data contract layer of any AI system. Getting them right is the difference between a reliable pipeline and a brittle one.
How Function Calling Works
┌─────────────────────────────────────────────────────────┐
│ 1. Developer defines tool schemas (JSON Schema) │
│ 2. LLM receives prompt + tool definitions │
│ 3. LLM decides: respond in text OR call a tool │
│ 4. If tool call: LLM outputs structured tool_use block │
│ 5. Application executes the tool │
│ 6. Tool result injected back into context │
│ 7. LLM generates final response │
└─────────────────────────────────────────────────────────┘
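A sketch of steps 2 through 7 using the Anthropic Messages API shape; the tool, its schema, the model string, and the lookup_vendor executor are all illustrative:

```python
# Function-calling loop sketch (Anthropic Messages API shape).
import anthropic

client = anthropic.Anthropic()
tools = [{
    "name": "get_vendor_record",
    "description": "Fetch a vendor record from the ERP by vendor id",
    "input_schema": {
        "type": "object",
        "properties": {"vendor_id": {"type": "string"}},
        "required": ["vendor_id"],
    },
}]
messages = [{"role": "user", "content": "Is vendor V-1042 approved?"}]

response = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
)
for block in response.content:
    if block.type == "tool_use":                # step 4: model chose a tool
        result = lookup_vendor(block.input)     # step 5: hypothetical executor
        messages += [
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),         # step 6: inject result
            }]},
        ]
        response = client.messages.create(      # step 7: final response
            model="claude-sonnet-4-5", max_tokens=1024,
            tools=tools, messages=messages,
        )
```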
Provider Comparison
| Provider | API Style | Parallel Tool Calls | Tool Reliability (Q1 2026) |
|---|---|---|---|
| OpenAI GPT-4o / GPT-5 | tools + tool_choice | ✅ Native | 8.6 / 10 |
| Anthropic Claude 3.5+ | tools + tool_use blocks | ✅ Native | 8.4 / 10 |
| Google Gemini 2.5 | function_declarations (Vertex) | ✅ Native | 8.2 / 10 |
All three providers have converged on nearly identical JSON Schema–based formats. A tool definition written for Claude adapts to GPT-5 or Gemini in minutes — intentional convergence driven by the MCP standard.
JSON Mode vs. Structured Outputs
| Capability | Guarantee | When to Use |
|---|---|---|
| JSON Mode | Valid JSON, any shape | Quick prototyping |
| Structured Outputs | Valid JSON matching exact schema | Production pipelines, inter-agent data |
| Function calling | Valid tool invocation arguments | When the model needs to act |
In multi-agent microservices architectures, use Structured Outputs for all inter-agent data exchange. One agent's output is another agent's input — a schema violation causes silent downstream failures.
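One way to enforce that contract, sketched with the OpenAI Python SDK's parse helper and a Pydantic schema. The model name is illustrative, and Claude and Gemini expose equivalent schema-constrained modes:

```python
# Structured Outputs sketch: the schema IS the inter-agent contract.
from openai import OpenAI
from pydantic import BaseModel

class RiskAssessment(BaseModel):
    vendor_id: str
    risk_score: float       # 0.0 (clean) to 1.0 (blocked)
    flags: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-5",
    messages=[{"role": "user", "content": "Assess vendor V-1042."}],
    response_format=RiskAssessment,
)
assessment = completion.choices[0].message.parsed  # validated RiskAssessment
```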
9. Streaming Architecture: SSE vs. WebSockets
REQUEST TYPE PROTOCOL CHOICE
══════════════════════════════════════════════
User sends → AI streams back → Server-Sent Events (SSE)
(chat, Q&A, doc generation)
User and AI both stream → WebSockets
(voice AI, collaborative AI,
real-time interruption)
SSE Architecture (Standard LLM Streaming)
Client (Browser / App)
│ HTTP GET /stream
▼
API Gateway (AWS APIGW / Azure APIM / GCP Endpoints)
│
▼
Agent Service (ECS / AKS / Cloud Run)
│ Streams tokens as SSE events
│ event: token
│ data: {"text": "The", "index": 0}
▼
LLM Provider (Claude / GPT-5 / Gemini)
│ Native SSE streaming response
▼
Client receives + renders tokens in real time
SSE runs over plain HTTP — no protocol upgrade, no extra infrastructure, auto-reconnects on drop. With HTTP/3 (QUIC) now covering ~85% of client-server traffic, SSE scales cleanly to enterprise loads.
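A minimal SSE endpoint sketch in FastAPI, matching the event format in the diagram above; stream_llm_tokens is a hypothetical async wrapper over your provider's streaming client:

```python
# SSE streaming sketch: plain HTTP, one event per token.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
async def stream(prompt: str):
    async def events():
        index = 0
        async for token in stream_llm_tokens(prompt):  # hypothetical wrapper
            payload = json.dumps({"text": token, "index": index})
            yield f"event: token\ndata: {payload}\n\n"
            index += 1
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")
```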
WebSocket Architecture (Voice / Bidirectional AI)
Client (Browser / Mobile)
│ WS Upgrade → ws://
▼
WebSocket Gateway (API GW / nginx)
│
▼
Real-Time Agent Service
├── Audio IN stream → Whisper / Deepgram (Speech-to-Text)
├── Text processing → LLM (GPT-5 Realtime / Gemini Live)
└── Audio OUT stream → ElevenLabs / Azure TTS
│
▼
Client receives audio response while still streaming input
OpenAI's Realtime API and Gemini Live both use WebSocket endpoints for bidirectional audio. In self-hosted deployments, an inference runtime such as vLLM typically sits behind a WebSocket gateway that bridges the real-time connection to the model server.
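A minimal bidirectional sketch in FastAPI. transcribe_chunk and speak are hypothetical STT and LLM+TTS stages; real deployments add buffering, voice-activity detection, and interruption handling:

```python
# Bidirectional WebSocket sketch: the client streams audio in while the
# agent streams audio back.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            audio_in = await ws.receive_bytes()       # inbound stream
            text = await transcribe_chunk(audio_in)   # hypothetical STT
            if text:
                async for audio_out in speak(text):   # hypothetical LLM + TTS
                    await ws.send_bytes(audio_out)    # outbound stream
    except WebSocketDisconnect:
        pass
```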
10. The Open Standards Layer: MCP, A2A, and AGENTS.md
This is the most architecturally significant development of 2025–2026. All major AI labs have converged on shared interoperability standards, now governed by the Linux Foundation's Agentic AI Foundation (AAIF) — with AWS, Anthropic, Block, Google, Microsoft, and OpenAI as platinum members. For teams building on top of these standards, our agentic AI development services guide covers the full implementation picture.
Model Context Protocol (MCP)
MCP is the universal standard for connecting AI agents to tools, data, and external systems. It replaces bespoke per-system integrations with a single protocol.
WITHOUT MCP (custom integrations — the old way)
════════════════════════════════════════════════
Agent ──── custom code ──── Salesforce
Agent ──── custom code ──── PostgreSQL
Agent ──── custom code ──── SharePoint
Agent ──── custom code ──── SAP
→ 4 integrations. Add a new agent: 4 more. Add a new system: N more.
WITH MCP (protocol-based — the new standard)
════════════════════════════════════════════
Agent ──── MCP Client ──── MCP Server ──── Salesforce
└── MCP Server ──── PostgreSQL
└── MCP Server ──── SharePoint
└── MCP Server ──── SAP
→ Write an MCP server once. Any agent, any model uses it.
OpenAI is deprecating its Assistants API in favour of MCP (mid-2026 sunset). Organisations implementing MCP-first architecture report a 60% reduction in tool integration time.
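Writing an MCP server is deliberately boring. A sketch using the official Python SDK's FastMCP helper, where the tool and the query_vendor_db data-access function are illustrative:

```python
# MCP server sketch using the official Python SDK's FastMCP helper.
# Any MCP-capable agent, on any model, can discover and call this tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("vendor-directory")

@mcp.tool()
def lookup_vendor(vendor_id: str) -> dict:
    """Return the vendor record for the given vendor id."""
    return query_vendor_db(vendor_id)  # hypothetical data-access helper

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; HTTP transports also available
```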
Agent2Agent Protocol (A2A)
Where MCP connects agents to tools, A2A connects agents to other agents. Released by Google (April 2025), now under the Linux Foundation with 150+ supporting organisations.
A2A MULTI-AGENT ORCHESTRATION FLOW
════════════════════════════════════
User Request: "Process and onboard this new vendor"
│
▼
Orchestrator Agent
│ Reads Agent Registry (Agent Cards)
│
├─ A2A Task → Document Extraction Agent
│ └─ MCP: reads PDF from S3 / GCS / Blob
│ └─ Returns: structured vendor data (JSON)
│
├─ A2A Task → Risk & Compliance Agent
│ └─ MCP: queries sanctions DB, credit API
│ └─ Returns: risk score + flags
│
├─ A2A Task → ERP Onboarding Agent
│ └─ MCP: writes to SAP / Oracle / Dynamics
│ └─ Returns: vendor ID + status
│
└─ A2A Task → Notification Agent
└─ MCP: sends email via SES / SendGrid
└─ Returns: confirmation
AGENTS.md
Released by OpenAI (August 2025) and donated to the AAIF, AGENTS.md is a markdown convention for giving AI coding agents project-specific context — structure, standards, test procedures, deployment workflows. Over 60,000 open-source projects have adopted it.
your-repo/
├── AGENTS.md ← AI agents read this to understand the project
├── README.md ← Humans read this
├── src/
└── tests/
Just as README.md tells humans how a project works, AGENTS.md tells AI agents how to operate within it.
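An illustrative AGENTS.md skeleton; the section names are common conventions rather than a fixed schema, and the commands are placeholders:

```markdown
# AGENTS.md (illustrative skeleton)

## Setup
- `npm install`, then copy `.env.example` to `.env`

## Testing
- Run `npm test` before every commit; CI rejects untested changes

## Conventions
- TypeScript strict mode; no default exports
- API handlers live in `src/api/`, one file per route

## Constraints
- Never edit files under `migrations/` by hand
```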
How the Standards Compose
┌──────────────────────────────────────────────────────────┐
│ UNIFIED INTEROPERABILITY STACK │
│ │
│ AGENTS.md → How to work in this codebase │
│ MCP → What tools and data are available │
│ A2A → What other agents can be delegated to │
│ Structured Outputs → How data flows reliably between │
│ all of the above │
│ │
│ Result: swap Claude for GPT-5 for Gemini with zero │
│ changes to your integration, memory, or agent layer. │
└──────────────────────────────────────────────────────────┘
11. AI Agent Memory Architecture
Memory is the most underrated component in production agent systems. The four-layer model:
MEMORY ARCHITECTURE FOR PRODUCTION AI AGENTS
═════════════════════════════════════════════
┌─────────────────────────────────────────────────────────┐
│ LAYER 1: WORKING MEMORY (In-Context) │
│ Storage: LLM context window (200K–1M tokens) │
│ Duration: Current session only │
│ Contents: Active conversation, recent tool results │
│ Cost trap: DO NOT brute-force past sessions here │
└─────────────────────────────────────────────────────────┘
↓ Session ends → embed + store ↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 2: EPISODIC MEMORY (Vector Database) │
│ Storage: Pinecone / pgvector / OpenSearch │
│ Duration: Long-term, retrieved by relevance │
│ Contents: Past interactions, user preferences │
│ Tools: Mem0, Zep, LangMem │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: SEMANTIC MEMORY (Knowledge + Entities) │
│ Storage: Vector DB + Knowledge Graph (Neo4j / Neptune) │
│ Duration: Persistent │
│ Contents: Domain knowledge, entity relationships │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ LAYER 4: PROCEDURAL MEMORY (How-To) │
│ Storage: Tool definitions, AGENTS.md, runbooks │
│ Duration: Persistent │
│ Contents: MCP tool schemas, workflow playbooks │
└─────────────────────────────────────────────────────────┘
The context window trap: Gemini 2.5 Pro's 1M-token context looks like a memory solution. But at $1.25/M input tokens, passing 500K tokens of history costs $0.625 per call. At volume, that is thousands of dollars per day. Selective retrieval from a vector store almost always beats brute-force context for cost efficiency.
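The arithmetic behind the trap, as a back-of-envelope script using the Gemini 2.5 Pro price from section 13 (call volumes are illustrative):

```python
# Back-of-envelope: brute-force context vs. selective retrieval.
# Price per the Gemini 2.5 Pro row in section 13: $1.25 per 1M input tokens.
PRICE_PER_TOKEN = 1.25 / 1_000_000

def daily_input_cost(tokens_per_call: int, calls_per_day: int) -> float:
    return tokens_per_call * calls_per_day * PRICE_PER_TOKEN

brute_force = daily_input_cost(500_000, 10_000)  # full history every call
selective = daily_input_cost(8_000, 10_000)      # top-k retrieved memories
print(f"brute force: ${brute_force:,.0f}/day")   # $6,250/day
print(f"selective:   ${selective:,.0f}/day")     # $100/day
```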
12. Integrating AI Agents with Legacy Systems
Nearly 60% of enterprise AI leaders cite legacy integration as their primary barrier to advanced AI adoption. Over 75% of ERP AI projects stall at integration boundaries.
The Four Integration Patterns
PATTERN 1: MCP SERVER AS ADAPTER
══════════════════════════════════
AI Agent → MCP Client → MCP Server → Legacy System
(wraps auth, (SAP / Oracle /
transform, Mainframe)
retries)
Best for: Any system with a queryable interface.
Benefit: Decouples AI layer from integration layer.
Swap models without touching integrations.
PATTERN 2: API GATEWAY TRANSLATION
════════════════════════════════════
AI Agent → REST/JSON → API Gateway → SOAP/RPC → Legacy
(Kong / APIM transform)
Best for: Legacy SOAP, RPC, or proprietary protocol systems.
PATTERN 3: EVENT-DRIVEN (Async)
════════════════════════════════
Legacy System → Kafka / SQS / Pub/Sub → AI Agent
(event stream) (reacts async)
Best for: Mainframes, batch ERP systems, systems that can't
tolerate synchronous AI latency.
PATTERN 4: BROWSER AUTOMATION (Last Resort)
════════════════════════════════════════════
AI Agent → Playwright → Web UI → Legacy System
(browser automation)
Best for: Absolutely no API available.
Warning: Fragile. Replace with Pattern 1 as soon as possible.
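A sketch of Pattern 3's consuming side with boto3 and SQS; the queue URL and the handle_event entry point are illustrative:

```python
# Pattern 3 sketch: an agent consuming legacy-system events from SQS.
# The legacy system (or a change-data-capture bridge) publishes events;
# the agent reacts asynchronously, never blocking the legacy workload.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/erp-events"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        handle_event(event)                  # hypothetical agent entry point
        sqs.delete_message(                  # acknowledge only after success
            QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
        )
```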
Why Open Standards Are Non-Negotiable for Enterprise
Legacy integration is a long-term commitment. Any architecture that ties your integration layer to a single AI vendor creates existential risk. MCP solves this: your MCP servers are vendor-agnostic. You can replace Claude with GPT-5 or Gemini without touching a single integration. This is the architectural argument for open standards, not an ideological one.
13. LLM Costs and Deployment: The 2026 Economics
Cloud API Pricing (April 2026)
LLM prices have fallen ~80% from 2024 to 2026. Current landscape:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| OpenAI GPT-5 Flagship | $1.75 | $14.00 | 128K |
| OpenAI GPT-5 Mini | $0.25 | $2.00 | 128K |
| Anthropic Claude Sonnet | $3.00 | $15.00 | 200K |
| Anthropic Claude Haiku | $0.25 | $1.25 | 200K |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | 1M |
| Google Gemini 2.0 Flash-Lite | $0.075 | $0.30 | 1M |
| DeepSeek V3 (API) | $0.27 | $1.10 | 64K |
Verify against provider pricing pages — this market moves fast.
Self-Hosted vs. Cloud Architecture
DEPLOYMENT ARCHITECTURE DECISION TREE
══════════════════════════════════════
Is your data regulated (HIPAA, GDPR, FCA, SOC 2)?
YES → Self-hosted is mandatory regardless of cost
NO → Continue...
Token volume > 2–3M/day sustained?
YES → Self-hosted hybrid (keep cloud APIs for frontier tasks)
NO → Cloud API is cheaper when engineering overhead is included
Do you need frontier model capability (GPT-5, Claude Sonnet)?
YES → Cloud API (open-weight models not at frontier parity)
NO → Llama 3.3 70B INT4 (self-hosted) is competitive and cheap
RECOMMENDED HYBRID PATTERN (2026):
High-volume / low-complexity tasks → Gemini Flash-Lite / Haiku ($0.075–$0.25/1M)
Complex reasoning / planning tasks → GPT-5 Flagship / Claude Sonnet
Regulated data / on-prem required → vLLM + Llama 3.3 70B INT4
Routing layer → LiteLLM or OpenRouter (model-agnostic)
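That routing layer can be a few lines. A sketch using LiteLLM's provider-agnostic completion() call, where the model strings and the complex_task flag stand in for your own routing rules:

```python
# Model routing sketch with LiteLLM's provider-agnostic completion().
from litellm import completion

def route(task_text: str, complex_task: bool) -> str:
    model = (
        "anthropic/claude-sonnet-4-5"        # reasoning tier (illustrative)
        if complex_task
        else "gemini/gemini-2.0-flash-lite"  # cheap tier for high volume
    )
    resp = completion(
        model=model, messages=[{"role": "user", "content": task_text}]
    )
    return resp.choices[0].message.content
```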
Self-hosting toolchain:
| Tool | Throughput | Best For |
|---|---|---|
| vLLM | 793 tokens/sec (H100) | Production multi-user GPU serving |
| Ollama | 41 tokens/sec | Local development, prototyping |
| llama.cpp | Variable | CPU inference, edge deployment |
Break-even against cloud frontier APIs typically occurs at 2–3M tokens/day over a 12-month hardware amortisation window. Companies running hybrid AI system architectures report 60–80% inference spend reduction without quality loss. For a deeper cost analysis, see our self-hosted LLMs vs cloud APIs guide.
14. The ValueStreamAI 5-Pillar Agentic Architecture
Every AI system we build is evaluated against this framework. It is our engineering checklist, not a marketing slide.
- Autonomy — The system acts from triggers (webhooks, schedules, events), not just user prompts. It decides whether to act, not just how to respond.
- Tool Use — The agent connects to external systems via MCP-standard interfaces. Not just retrieval — writes, creates, triggers.
- Planning — For multi-step goals, the agent decomposes tasks, sequences sub-steps, handles failures, and replans when tool results deviate from expectations.
- Memory — Four-layer memory architecture: working, episodic (vector RAG), semantic (knowledge graph), procedural (tool definitions).
- Multi-Step Reasoning — The agent handles conditional logic, retries, edge cases, and self-correction loops before committing to irreversible actions.
15. Architecture Comparison: Open vs. Locked-In
| Factor | ValueStreamAI (Open Standards) | Single-Vendor Lock-in | DIY Custom Stack |
|---|---|---|---|
| Standards compliance | MCP + A2A + AGENTS.md | Proprietary | None |
| Model portability | Swap any LLM, zero rework | Re-architect to switch | Variable |
| Legacy integration | MCP adapter pattern | Vendor-specific connectors | Bespoke per system |
| Multi-cloud | Native | Single cloud | Variable |
| Memory architecture | 4-layer | Context window only | None or basic |
| Streaming | SSE + WebSocket (routed by use case) | SSE only | Variable |
16. Project Scope and Investment Tiers
- Architecture Audit & Roadmap (2 Weeks): £3,500 – £7,500
  - For: Teams already building but unsure if their architecture will scale. We review your current stack and deliver a prioritised remediation plan.
- RAG Knowledge Pipeline (4–6 Weeks): £8,000 – £20,000
  - For: Internal knowledge bases, document Q&A, compliance research. Includes embedding pipeline, hybrid retrieval, and evaluation harness on your cloud of choice.
- Function-Calling Agent with MCP (6–8 Weeks): £15,000 – £35,000
  - For: Agents that take real actions in your business systems. Includes MCP server setup for each integration target (CRM, ERP, databases).
- Multi-Agent Microservices System (10–16 Weeks): £35,000 – £90,000
  - For: Full agentic workflows across departments. Includes orchestration layer, A2A agent registry, 4-layer memory architecture, and observability on AWS / Azure / GCP.
- Self-Hosted AI Infrastructure (8–12 Weeks): £20,000 – £60,000
  - For: Regulated industries requiring data sovereignty. Includes vLLM deployment, quantized model selection, GPU provisioning, and security hardening.
Frequently Asked Questions
What is the difference between a monolith AI architecture and an AI microservices architecture?
In a monolith AI architecture, all AI capabilities — prompt construction, retrieval, LLM calls, tool execution — live in a single application. It is fast to build but becomes a bottleneck at scale: you cannot independently scale, update, or replace individual components. In a microservices AI architecture, each capability is an independently deployable service with a defined API. The LLM gateway, RAG service, each agent, and the memory store are separate containers. This follows the same principles as traditional microservices, with the difference that agent services are driven by LLM reasoning rather than deterministic code.
What is MCP and why does it matter for enterprise AI architecture?
The Model Context Protocol (MCP) is an open standard (now under the Linux Foundation) for connecting AI models to tools, data sources, and external systems. Instead of each AI team writing bespoke integrations for each system, you write a standardised MCP server once and every agent — regardless of which LLM provider powers it — can use it. For enterprise architecture, MCP decouples the AI intelligence layer from the integration layer, which means you can swap models without rewriting integrations, and add new integrations without modifying agents.
When should AI agents be built as microservices vs. embedded in a monolith?
Build monolith-first when you are at proof-of-concept stage, have a single team, and the AI feature is not mission-critical. Migrate to microservices when: (1) you have more than one agent or AI service, (2) you need to independently scale components under load, (3) multiple teams own different agents, or (4) compliance requires isolated data processing. The migration is straightforward if you designed your monolith with clean service interfaces from the start.
What is the Agent2Agent (A2A) protocol and how does it differ from MCP?
MCP connects AI agents to tools and data. A2A connects AI agents to other AI agents. They are complementary layers. An orchestrating agent uses MCP to query a database and A2A to hand off a task to a specialist sub-agent. Both are Linux Foundation projects supported by all major AI providers. Together they form the interoperability foundation for multi-agent enterprise systems — enabling agents built by different vendors, on different models, to collaborate without bespoke integration work.
How do I choose between AWS, Azure, and GCP for an AI system architecture?
Choose AWS if you need the broadest ecosystem of managed services, particularly for event-driven and serverless architectures (Lambda, SQS, EventBridge), and if you want access to multiple models via Bedrock. Choose Azure if your organisation already uses Microsoft 365, Entra ID (formerly Azure Active Directory), or Dynamics — Azure OpenAI Service and native Entra ID integration reduce enterprise friction significantly. Choose GCP if you want access to Gemini 2.5 Pro's 1M-token context window natively, or if you already use BigQuery for data infrastructure (Vertex AI integrates directly). All three cloud platforms now have first-class support for containerised AI workloads on Kubernetes and serverless container hosting.
What is AGENTS.md and how does it make a codebase AI-agent-ready?
AGENTS.md is a markdown file (contributed to the Linux Foundation by OpenAI) that lives at the root of a repository and tells AI coding agents how to work in that project. It covers: how to run tests, how to build and deploy, coding conventions, what directories contain what, and any constraints agents must respect. It has been adopted by over 60,000 open-source projects. The analogy is simple: README.md tells humans how a project works; AGENTS.md tells AI agents how to work within it.
Building on the Right Foundation
The AI system architecture landscape in 2026 is defined by convergence: all major providers building toward the same open standards (MCP, A2A, AGENTS.md), the same structured output patterns, and the same hybrid deployment models. The winners will not be the teams with access to the most powerful models — those are increasingly commoditised. They will be the teams that design their systems as composable microservices, built on open standards, deployable on any cloud, with memory and integration architectures that survive model upgrades and vendor changes.
If you are starting a new AI system project or modernising an existing one, the practical checklist:
- Start with four questions: knowledge problem, action problem, latency problem, integration problem.
- Design each AI capability as an independently deployable microservice from day one.
- Use MCP for every tool and data integration — no bespoke connectors.
- Use A2A-compatible agent interfaces even if you only have one agent today.
- Use structured outputs for every inter-agent and inter-service data exchange.
- Design your memory architecture explicitly across all four layers.
- Choose cloud deployment (AWS, Azure, GCP) based on existing infrastructure and compliance requirements.
- Add AGENTS.md to every repository that AI coding agents will work in.
Talk to the ValueStreamAI team about architecting your system on this foundation.
