Most AI projects fail not because they chose the wrong model, but because they chose the wrong architecture. A GPT-4o wrapper bolted onto a monolithic backend might survive a demo. It will not survive 10,000 concurrent users, a compliance audit, a model deprecation, or the moment you need to add a second agent that talks to the first.
The fundamental shift happening across enterprise AI in 2026 is architectural: teams that built their first AI features as monolithic additions to existing applications are now rebuilding them as distributed, composable services — where each agent, each retrieval pipeline, and each model integration is its own independently deployable unit. This is not over-engineering. It is the only pattern that survives contact with real production requirements.
This guide maps every major AI system architecture pattern — from RAG pipelines to multi-agent microservices — with concrete diagrams, cloud service recommendations for AWS, Azure, and GCP, and the open standards (MCP, A2A, AGENTS.md) that make these systems interoperable by design. If you are still at the stage of deciding whether to build an agent at all, our practical AI agent development guide covers that decision in depth.
| Architecture Signal | Benchmark (2026) |
|---|---|
| Monolith-to-microservices AI migration | 60% of enterprise AI teams mid-migration |
| MCP adoption | 150+ organisations including AWS, Google, Microsoft, OpenAI |
| LLM cost reduction (2024 → 2026) | ~80% across all major providers |
| Legacy integration as #1 AI barrier | Cited by 60% of enterprise AI leaders |
1. The Architectural Decision Before Anything Else
Before picking a framework or a model, every AI system architect must answer four questions:
- Knowledge problem — Does the system need to reason over private, dynamic, or domain-specific data the base model was never trained on? → RAG or fine-tuning layer.
- Action problem — Does the system need to do things in the world — write to a CRM, trigger a workflow, execute code? → Tool-use layer with structured outputs.
- Latency problem — Does the UX require token-by-token streaming, real-time voice, or sub-100ms responses? → Streaming architecture (SSE or WebSockets).
- Integration problem — Does the system need to work alongside legacy enterprise software, multiple AI vendors, or other agents? → MCP + A2A + event-driven integration layer.
Every other technical choice flows from these four answers. Teams that skip this step end up refactoring within three months. The right AI system architecture decision starts here — not with the model choice.
2. Monolith vs. Microservices: The Core AI Architecture Debate
The AI Monolith
In a monolithic AI architecture, the entire AI capability — prompt construction, LLM calls, RAG retrieval, tool execution, response formatting — lives inside a single application. This is almost always how teams start, and it is entirely appropriate for early-stage projects.
┌─────────────────────────────────────────────────────────┐
│ AI MONOLITH APPLICATION │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌───────┐ │
│ │ Prompt │ │ RAG / │ │ LLM │ │ Tool │ │
│ │ Builder │→ │ Retrieval │→ │ Client │→ │ Use │ │
│ └──────────┘ └───────────┘ └──────────┘ └───────┘ │
│ ↑ ↓ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Single Database / Single Deploy │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Where the monolith breaks:
- Scaling one AI component (e.g., the retrieval pipeline under heavy load) means scaling the entire application
- Swapping the LLM provider requires a full redeploy
- A second team building a second AI feature duplicates the entire stack
- A compliance requirement to isolate data processing forces a full restructure
The AI Microservices Architecture
In a microservices AI architecture, each capability is an independently deployable service with a well-defined API. The LLM gateway, the retrieval service, each agent, the memory store — all separate services that communicate over standard protocols.
┌─────────────────┐
│ API GATEWAY │
│ (Kong / APIM) │
└────────┬────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ ORCHESTRATOR │ │ RAG SERVICE │ │ MEMORY SERVICE │
│ AGENT │ │ (Retrieval) │ │ (Vector Store) │
│ (LangGraph) │ │ │ │ │
└───────┬────────┘ └───────┬────────┘ └───────┬────────┘
│ │ │
┌───────▼────────┐ ┌───────▼────────┐ ┌───────▼────────┐
│ LLM GATEWAY │ │ EMBED SERVICE │ │ TOOL SERVICES │
│ (LiteLLM / │ │ (text-embed- │ │ (MCP Servers) │
│ Portkey) │ │ 3-large) │ │ │
└───────┬────────┘ └────────────────┘ └────────────────┘
│
┌────────┴─────────┐
│ │
┌───▼───┐ ┌───────┐ ┌▼──────┐
│Claude │ │ GPT-5 │ │Gemini │
│ API │ │ API │ │ API │
└───────┘ └───────┘ └───────┘
Each service is containerised, independently scalable, and replaceable without touching the rest of the system. This is the pattern that survives.
3. AI Agents as Microservices
The most important architectural insight of 2026 is this: an AI agent is a microservice. This reframing transforms how you think about your AI system architecture: each agent gets its own deployment, its own scaling policy, and its own interface contract — a defined input schema, a defined output schema, a single responsibility, and communication over a standard protocol. The difference from a traditional microservice is that its internal logic is driven by an LLM rather than deterministic code.
The Agent-as-Service Pattern
┌─────────────────────────────────────────────────────────┐
│ AGENT MICROSERVICE │
│ │
│ Input: Task JSON (A2A Task or HTTP POST) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ System │ │ Tools │ │ Memory / RAG │ │
│ │ Prompt │ │ (via MCP)│ │ (Vector Search) │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
│ ↘ ↓ ↙ │
│ ┌──────────────────────────────┐ │
│ │ LLM Runtime │ │
│ │ (Claude / GPT-5 / Gemini) │ │
│ └──────────────────────────────┘ │
│ ↓ │
│ Output: Structured JSON (Pydantic / Zod validated) │
│ │
│ Deployment: Docker container → K8s pod / Cloud Run │
└─────────────────────────────────────────────────────────┘
This framing has profound implications:
- Single Responsibility — each agent does one thing well (invoice processing, compliance checking, customer triage). A "do everything" agent is an antipattern.
- Horizontal scaling — spin up more agent pods under load, exactly like any other service
- Independent deployment — update the invoice agent's prompt or model without touching the compliance agent
- Standard interfaces — inputs and outputs are typed JSON schemas; the internal LLM is an implementation detail
- Observability — traces, logs, and metrics work the same way as any other service
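A minimal sketch of this contract, assuming FastAPI and Pydantic; the service name, schemas, and run_agent helper are illustrative, not a prescribed interface:

```python
# Agent-as-microservice sketch. The typed request/response models are the
# service contract; the LLM behind run_agent() is an implementation detail.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="invoice-agent")

class InvoiceTask(BaseModel):        # input schema: what callers must send
    invoice_id: str
    document_url: str

class InvoiceResult(BaseModel):      # output schema: what callers can rely on
    invoice_id: str
    vendor_name: str
    total_amount: float
    flags: list[str] = []

async def run_agent(task: InvoiceTask) -> InvoiceResult:
    # Placeholder for the real agent loop: prompt + MCP tools + validation.
    return InvoiceResult(invoice_id=task.invoice_id,
                         vendor_name="(extracted)", total_amount=0.0)

@app.post("/tasks", response_model=InvoiceResult)
async def handle_task(task: InvoiceTask) -> InvoiceResult:
    return await run_agent(task)
```

Swapping the model inside run_agent changes nothing for callers: the schema is the interface.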
Specialist Agent Registry Pattern
In a multi-agent system, an orchestrating agent needs to know which specialist agents exist and what they can do. The Agent Card (from the A2A protocol) is the standard for this:
┌──────────────────────────────────────────────────────────┐
│ ORCHESTRATOR AGENT │
│ "Complete a new vendor onboarding" │
└──────────────────┬───────────────────────────────────────┘
│ Discovers agents via Agent Cards (A2A)
┌───────────┼───────────┬───────────────┐
▼ ▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Document │ │ Risk & │ │ ERP │ │ Notification │
│ Extract │ │Compliance│ │ Onboard │ │ Agent │
│ Agent │ │ Agent │ │ Agent │ │ │
└────────────┘ └──────────┘ └──────────┘ └──────────────┘
↓ ↓ ↓ ↓
Reads PDF Checks SAP API Sends email
(MCP: S3) sanctions DB (MCP: SAP) (MCP: SES)
(MCP: Postgres)
Each specialist agent exposes an Agent Card — a JSON document declaring what it can do, what inputs it accepts, and how to reach it. The orchestrator queries the registry, selects the right specialists, and delegates via A2A task handoffs.
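An illustrative Agent Card for the compliance specialist above. The field names follow the general shape of the A2A specification, but treat this as a sketch rather than the normative schema:

```json
{
  "name": "risk-compliance-agent",
  "description": "Runs sanctions and credit checks on prospective vendors",
  "url": "https://agents.internal.example.com/compliance",
  "version": "1.2.0",
  "capabilities": { "streaming": false },
  "skills": [
    {
      "id": "vendor-risk-check",
      "name": "Vendor risk check",
      "description": "Returns a risk score and flags for a vendor record"
    }
  ]
}
```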
4. Cloud-Native AI Architecture: AWS, Azure, and GCP
AWS AI Architecture
AWS provides the most mature set of managed services for production AI systems. A reference architecture for a multi-agent enterprise system on AWS:
┌──────────────────────────────────────────────────────────────────────┐
│ AWS AI REFERENCE ARCHITECTURE │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────────┐ │
│ │ API GATEWAY │─────▶│ ECS / EKS CLUSTER │ │
│ │ (AWS APIGW) │ │ │ │
│ └────────────────┘ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │Orchestrator│ │ RAG Service │ │ │
│ ┌────────────────┐ │ │ Agent │ │ (FastAPI) │ │ │
│ │ EVENT BUS │─────▶│ │ (LangGraph)│ └───────┬────────┘ │ │
│ │ (EventBridge) │ │ └──────┬─────┘ │ │ │
│ └────────────────┘ │ │ ┌────────▼────────┐ │ │
│ │ ┌──────▼──────┐ │ OpenSearch │ │ │
│ ┌────────────────┐ │ │ LLM Gateway │ │ (Vector Store) │ │ │
│ │ QUEUE / ASYNC │ │ │ (LiteLLM) │ └─────────────────┘ │ │
│ │ (SQS) │─────▶│ └──────┬──────┘ │ │
│ └────────────────┘ │ │ │ │
│ └─────────┼────────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ OBJECT STORE │ │ ┌──────────────────────────┐ │
│ │ (S3) │◀───MCP Server──┤ │ AWS BEDROCK │ │
│ └────────────────┘ │ │ (Claude / Llama / Titan)│ │
│ ├─▶│ or direct Anthropic API │ │
│ ┌────────────────┐ │ └──────────────────────────┘ │
│ │ SECRET STORE │ │ │
│ │ (Secrets Mgr) │ │ ┌──────────────────────────┐ │
│ └────────────────┘ └─▶│ TOOL MCP SERVERS │ │
│ │ Lambda functions wrapping│ │
│ ┌────────────────┐ │ Salesforce / RDS / SAP │ │
│ │ OBSERVABILITY │ └──────────────────────────┘ │
│ │ (CloudWatch + │ │
│ │ X-Ray) │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key AWS services per layer:
| Layer | AWS Service | Role |
|---|---|---|
| Compute | ECS Fargate / EKS | Agent container hosting |
| Serverless tools | Lambda | MCP server implementations |
| LLM access | Bedrock / direct API | Claude, Llama, Titan |
| Vector store | OpenSearch Serverless | RAG retrieval |
| Object storage | S3 | Document ingestion source |
| Async messaging | SQS + EventBridge | Event-driven agent triggers |
| Secrets | Secrets Manager | API keys, credentials |
| Observability | CloudWatch + X-Ray | Traces, logs, metrics |
| API layer | API Gateway + WAF | Rate limiting, auth |
Azure AI Architecture
Azure's strength is deep enterprise integration, particularly for organisations already on Microsoft 365 and Entra ID (formerly Azure Active Directory):
┌──────────────────────────────────────────────────────────────────────┐
│ AZURE AI REFERENCE ARCHITECTURE │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────────┐ │
│ │ API MGMT │─────▶│ AZURE KUBERNETES (AKS) │ │
│ │ (APIM) │ │ │ │
│ └────────────────┘ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │Orchestrator│ │ RAG Service │ │ │
│ ┌────────────────┐ │ │ Agent │ │ (FastAPI) │ │ │
│ │ SERVICE BUS │─────▶│ │ (LangGraph)│ └───────┬────────┘ │ │
│ │ (Async events) │ │ └──────┬─────┘ │ │ │
│ └────────────────┘ │ │ ┌────────▼────────┐ │ │
│ │ ┌──────▼──────┐ │ AI Search │ │ │
│ ┌────────────────┐ │ │ LLM Gateway │ │ (Vector Index) │ │ │
│ │ BLOB STORAGE │ │ │ (LiteLLM) │ └─────────────────┘ │ │
│ │ (Source docs) │─────▶│ └──────┬──────┘ │ │
│ └────────────────┘ └─────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────┐ │ ┌──────────────────────────┐ │
│ │ KEY VAULT │ ├─▶│ AZURE OPENAI SERVICE │ │
│ │ (Secrets) │ │ │ GPT-4o / GPT-5 / Ada │ │
│ └────────────────┘ │ └──────────────────────────┘ │
│ │ │
│ ┌────────────────┐ │ ┌──────────────────────────┐ │
│ │ ENTRA ID │ └─▶│ AZURE FUNCTIONS │ │
│ │ (AuthN/AuthZ) │ │ MCP servers: Dynamics, │ │
│ └────────────────┘ │ SharePoint, SQL, Teams │ │
│ └──────────────────────────┘ │
│ ┌────────────────┐ │
│ │ MONITOR + │ ← Traces every agent call end-to-end │
│ │ APP INSIGHTS │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Key Azure services per layer:
| Layer | Azure Service | Role |
|---|---|---|
| Compute | AKS / Container Apps | Agent container hosting |
| Serverless tools | Azure Functions | MCP server implementations |
| LLM access | Azure OpenAI Service | GPT-4o, GPT-5, Ada embeddings |
| Vector store | Azure AI Search | RAG with hybrid retrieval |
| Object storage | Blob Storage | Document source |
| Async messaging | Service Bus + Event Grid | Event-driven triggers |
| Identity | Entra ID (AAD) | Enterprise SSO, RBAC |
| Secrets | Key Vault | API keys, certificates |
| Observability | Monitor + App Insights | Distributed tracing |
GCP AI Architecture
GCP's differentiator is Vertex AI — a fully managed platform for building, deploying, and orchestrating AI agents, with native access to Gemini models and a 1M-token context window:
┌──────────────────────────────────────────────────────────────────────┐
│ GCP AI REFERENCE ARCHITECTURE │
│ │
│ ┌────────────────┐ ┌──────────────────────────────────────┐ │
│ │ CLOUD ENDPOINTS│─────▶│ GOOGLE KUBERNETES ENGINE (GKE) │ │
│ │ / APIGEE │ │ │ │
│ └────────────────┘ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │Orchestrator│ │ RAG Service │ │ │
│ ┌────────────────┐ │ │ Agent │ │ (FastAPI) │ │ │
│ │ PUB/SUB │─────▶│ │(ADK/LGraph)│ └───────┬────────┘ │ │
│ │ (Async events)│ │ └──────┬─────┘ │ │ │
│ └────────────────┘ │ │ ┌────────▼────────┐ │ │
│ │ ┌──────▼──────┐ │ VERTEX AI │ │ │
│ ┌────────────────┐ │ │ LLM Gateway │ │ VECTOR SEARCH │ │ │
│ │ CLOUD STORAGE │ │ │ (LiteLLM) │ └─────────────────┘ │ │
│ │ (Source docs) │─────▶│ └──────┬──────┘ │ │
│ └────────────────┘ └─────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────┐ │ ┌──────────────────────────┐ │
│ │ SECRET MANAGER│ ├─▶│ VERTEX AI + GEMINI 2.5 │ │
│ │ │ │ │ Pro (1M ctx) / Flash- │ │
│ └────────────────┘ │ │ Lite ($0.075/1M tokens) │ │
│ │ └──────────────────────────┘ │
│ ┌────────────────┐ │ │
│ │ BIGQUERY │ │ ┌──────────────────────────┐ │
│ │ (Analytics + │◀───MCP Server──┴─▶│ CLOUD RUN │ │
│ │ agent logs) │ │ MCP servers: BigQuery, │ │
│ └────────────────┘ │ Salesforce, SAP, SQL │ │
│ └──────────────────────────┘ │
│ ┌────────────────┐ │
│ │ CLOUD TRACE + │ ← OpenTelemetry-native observability │
│ │ CLOUD LOGGING │ │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
5. RAG Architecture in Production
RAG (Retrieval-Augmented Generation) is the foundational pattern for grounding LLM responses in private or live data. Rather than baking knowledge into model weights, RAG retrieves relevant context at inference time.
The Full RAG Pipeline Architecture
INGESTION PIPELINE (offline / scheduled)
════════════════════════════════════════
┌──────────┐ ┌───────────┐ ┌────────────┐ ┌──────────────┐
│ Source │──▶│ Chunking │──▶│ Embedding │──▶│ Vector Store │
│ (S3 / │ │ Service │ │ Service │ │(Pinecone / │
│ Blob / │ │(recursive │ │(text-embed │ │ pgvector / │
│ GCS) │ │ + overlap)│ │ -3-large) │ │ OpenSearch) │
└──────────┘ └───────────┘ └────────────┘ └──────────────┘
QUERY PIPELINE (real-time, per request)
════════════════════════════════════════
┌──────────┐ ┌───────────┐ ┌────────────┐ ┌──────────────┐
│ User │──▶│ Embed │──▶│ Hybrid │──▶│ Reranking │
│ Query │ │ Query │ │ Search │ │ (Cohere / │
│ │ │ │ │(Dense + │ │ FlashRank) │
└──────────┘ └───────────┘ │ BM25) │ └──────┬───────┘
└────────────┘ │
┌────────▼───────┐
│ LLM (with │
│ retrieved │
│ context) │
└────────────────┘
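To make the hybrid search step concrete, here is a sketch of dense + BM25 fusion using reciprocal rank fusion, one common merging strategy. rank_bm25 is a real library; dense_search is a hypothetical wrapper around your vector store, and chunking and tokenisation are heavily simplified:

```python
# Hybrid retrieval sketch: dense + BM25 merged with reciprocal rank fusion.
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Sparse leg: BM25 over whitespace-tokenised chunks.
    bm25 = BM25Okapi([c.split() for c in chunks])
    scores = bm25.get_scores(query.split())
    sparse = sorted(range(len(chunks)), key=lambda i: -scores[i])

    # Dense leg: chunk indices ranked by vector similarity.
    dense = dense_search(query)  # hypothetical vector-store wrapper

    # Reciprocal rank fusion: reward chunks ranked highly by either leg.
    fused: dict[int, float] = {}
    for ranking in (sparse, dense):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top]
```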
RAG Variant Comparison
| RAG Type | Best For | Complexity |
|---|---|---|
| Naive RAG | Proof of concept, small doc sets | Low |
| Hybrid RAG (Dense + BM25) | Production — consistently best recall | Medium |
| Graph RAG | Multi-hop reasoning, entity relationships | High |
| Agentic RAG | Agent decides when/what to retrieve | High |
| Multimodal RAG | PDFs with tables, charts, images | High |
6. Fine-Tuning Architecture
Fine-tuning adapts a base model's weights to your domain vocabulary, output format, or task behaviour. It teaches the model how to behave, not what to know — a critical distinction.
LoRA and QLoRA: The Production Standard
Full fine-tuning of a 70B model costs tens of thousands of dollars. In 2026, virtually all production fine-tuning uses LoRA (Low-Rank Adaptation) or QLoRA:
BASE MODEL (frozen weights)
↓
┌─────────────────────────────────────────────┐
│ LORA ADAPTER (0.1–1% of total parameters) │
│ Small trainable rank-decomposition matrices│
│ added to attention layers │
└─────────────────────────────────────────────┘
↓
Fine-tuned behaviour at a fraction of the cost
QLoRA adds: 4-bit quantization of base model
→ 7B model: 14GB (FP16) → ~5GB (QLoRA)
→ Quality loss: typically < 5% on benchmarks
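A minimal QLoRA setup sketch using Hugging Face transformers, bitsandbytes, and peft. The model id, rank, and target modules are illustrative choices, not recommendations:

```python
# QLoRA sketch: 4-bit quantised base model + small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantise the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct", quantization_config=bnb
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapters on attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically well under 1% trainable
```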
Fine-Tuning vs. RAG: The Decision Matrix
| Scenario | RAG | Fine-Tuning |
|---|---|---|
| Dynamic / frequently updated knowledge | ✅ Ideal | ❌ Requires retraining |
| Citable sources required | ✅ Native | ❌ Not possible |
| Consistent output format / tone | ⚠️ Prompt engineering only | ✅ Ideal |
| Domain-specific jargon | ⚠️ Partial | ✅ Ideal |
| Data privacy (no data leaves premises) | ✅ Self-hosted embeddings | ✅ Self-hosted model |
The production answer in 2026 is often both: a fine-tuned model as the inference engine, RAG providing the dynamic knowledge layer.
7. Quantization: Shrinking Models for Deployment
Quantization reduces the numerical precision of model weights — from FP32 or FP16 down to INT8 or INT4 — to shrink memory footprint and accelerate inference with minimal quality loss.
Model Size by Quantization Format (70B parameter model)
══════════════════════════════════════════════════════════
FP32 ████████████████████████████████████████ ~280 GB
FP16 ████████████████████████ ~140 GB
INT8 █████████████████ ~70 GB
INT4 ████████ ~35 GB ← fits on a single H100
1-bit ██ ~9 GB ← Experimental
| Format | Quality Loss | Production Readiness | Tool |
|---|---|---|---|
| FP16 | None | Cloud inference, training | Native |
| INT8 | < 1% | Production serving | bitsandbytes, vLLM |
| INT4 GPTQ/AWQ | 2–5% | GPU-accelerated serving | AutoGPTQ, AutoAWQ |
| GGUF (INT4) | 2–5% | CPU-friendly local | llama.cpp, Ollama |
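Serving a pre-quantised checkpoint is typically a few lines. A sketch using vLLM's offline inference API, with an illustrative AWQ model id:

```python
# Loading and serving a pre-quantised model with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-70B-AWQ", quantization="awq")
params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.generate(["Summarise our Q3 vendor risk policy."], params)
print(outputs[0].outputs[0].text)
```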
8. Function Calling, Structured Outputs, and JSON Mode
These three capabilities form the data contract layer of any AI system. Getting them right is the difference between a reliable pipeline and a brittle one.
How Function Calling Works
┌─────────────────────────────────────────────────────────┐
│ 1. Developer defines tool schemas (JSON Schema) │
│ 2. LLM receives prompt + tool definitions │
│ 3. LLM decides: respond in text OR call a tool │
│ 4. If tool call: LLM outputs structured tool_use block │
│ 5. Application executes the tool │
│ 6. Tool result injected back into context │
│ 7. LLM generates final response │
└─────────────────────────────────────────────────────────┘
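A sketch of steps 2 through 7 using the Anthropic Messages API shape; the tool, its schema, the model string, and the lookup_vendor executor are all illustrative:

```python
# Function-calling loop sketch (Anthropic Messages API shape).
import anthropic

client = anthropic.Anthropic()
tools = [{
    "name": "get_vendor_record",
    "description": "Fetch a vendor record from the ERP by vendor id",
    "input_schema": {
        "type": "object",
        "properties": {"vendor_id": {"type": "string"}},
        "required": ["vendor_id"],
    },
}]
messages = [{"role": "user", "content": "Is vendor V-1042 approved?"}]

response = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=1024, tools=tools, messages=messages
)
for block in response.content:
    if block.type == "tool_use":                # step 4: model chose a tool
        result = lookup_vendor(block.input)     # step 5: hypothetical executor
        messages += [
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),         # step 6: inject result
            }]},
        ]
        response = client.messages.create(      # step 7: final response
            model="claude-sonnet-4-5", max_tokens=1024,
            tools=tools, messages=messages,
        )
```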
Provider Comparison
| Provider | API Style | Parallel Tool Calls | Tool Reliability (Q1 2026) |
|---|---|---|---|
| OpenAI GPT-4o / GPT-5 | tools + tool_choice | ✅ Native | 8.6 / 10 |
| Anthropic Claude 3.5+ | tools + tool_use blocks | ✅ Native | 8.4 / 10 |
| Google Gemini 2.5 | function_declarations (Vertex) | ✅ Native | 8.2 / 10 |
All three providers have converged on nearly identical JSON Schema–based formats. A tool definition written for Claude adapts to GPT-5 or Gemini in minutes — intentional convergence driven by the MCP standard.
JSON Mode vs. Structured Outputs
| Capability | Guarantee | When to Use |
|---|---|---|
| JSON Mode | Valid JSON, any shape | Quick prototyping |
| Structured Outputs | Valid JSON matching exact schema | Production pipelines, inter-agent data |
| Function calling | Valid tool invocation arguments | When the model needs to act |
In multi-agent microservices architectures, use Structured Outputs for all inter-agent data exchange. One agent's output is another agent's input — a schema violation causes silent downstream failures.
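One way to enforce that contract, sketched with the OpenAI Python SDK's parse helper and a Pydantic schema. The model name is illustrative, and Claude and Gemini expose equivalent schema-constrained modes:

```python
# Structured Outputs sketch: the schema IS the inter-agent contract.
from openai import OpenAI
from pydantic import BaseModel

class RiskAssessment(BaseModel):
    vendor_id: str
    risk_score: float       # 0.0 (clean) to 1.0 (blocked)
    flags: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-5",
    messages=[{"role": "user", "content": "Assess vendor V-1042."}],
    response_format=RiskAssessment,
)
assessment = completion.choices[0].message.parsed  # validated RiskAssessment
```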
9. Streaming Architecture: SSE vs. WebSockets
REQUEST TYPE PROTOCOL CHOICE
══════════════════════════════════════════════
User sends → AI streams back → Server-Sent Events (SSE)
(chat, Q&A, doc generation)
User and AI both stream → WebSockets
(voice AI, collaborative AI,
real-time interruption)
SSE Architecture (Standard LLM Streaming)
Client (Browser / App)
│ HTTP GET /stream
▼
API Gateway (AWS APIGW / Azure APIM / GCP Endpoints)
│
▼
Agent Service (ECS / AKS / Cloud Run)
│ Streams tokens as SSE events
│ event: token
│ data: {"text": "The", "index": 0}
▼
LLM Provider (Claude / GPT-5 / Gemini)
│ Native SSE streaming response
▼
Client receives + renders tokens in real time
SSE runs over plain HTTP — no protocol upgrade, no extra infrastructure, auto-reconnects on drop. With HTTP/3 (QUIC) now covering ~85% of client-server traffic, SSE scales cleanly to enterprise loads.
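A minimal SSE endpoint sketch in FastAPI, matching the event format in the diagram above; stream_llm_tokens is a hypothetical async wrapper over your provider's streaming client:

```python
# SSE streaming sketch: plain HTTP, one event per token.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
async def stream(prompt: str):
    async def events():
        index = 0
        async for token in stream_llm_tokens(prompt):  # hypothetical wrapper
            payload = json.dumps({"text": token, "index": index})
            yield f"event: token\ndata: {payload}\n\n"
            index += 1
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")
```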
WebSocket Architecture (Voice / Bidirectional AI)
Client (Browser / Mobile)
│ WS Upgrade → ws://
▼
WebSocket Gateway (API GW / nginx)
│
▼
Real-Time Agent Service
├── Audio IN stream → Whisper / Deepgram (Speech-to-Text)
├── Text processing → LLM (GPT-5 Realtime / Gemini Live)
└── Audio OUT stream → ElevenLabs / Azure TTS
│
▼
Client receives audio response while still streaming input
OpenAI's Realtime API and Gemini Live both use WebSocket endpoints for bidirectional audio. In self-hosted deployments, an inference runtime such as vLLM typically sits behind a WebSocket gateway that bridges the real-time connection to the model server.
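A minimal bidirectional sketch in FastAPI. transcribe_chunk and speak are hypothetical STT and LLM+TTS stages; real deployments add buffering, voice-activity detection, and interruption handling:

```python
# Bidirectional WebSocket sketch: the client streams audio in while the
# agent streams audio back.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            audio_in = await ws.receive_bytes()       # inbound stream
            text = await transcribe_chunk(audio_in)   # hypothetical STT
            if text:
                async for audio_out in speak(text):   # hypothetical LLM + TTS
                    await ws.send_bytes(audio_out)    # outbound stream
    except WebSocketDisconnect:
        pass
```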
10. The Open Standards Layer: MCP, A2A, and AGENTS.md
This is the most architecturally significant development of 2025–2026. All major AI labs have converged on shared interoperability standards, now governed by the Linux Foundation's Agentic AI Foundation (AAIF) — with AWS, Anthropic, Block, Google, Microsoft, and OpenAI as platinum members. For teams building on top of these standards, our agentic AI development services guide covers the full implementation picture.
Model Context Protocol (MCP)
MCP is the universal standard for connecting AI agents to tools, data, and external systems. It replaces bespoke per-system integrations with a single protocol.
WITHOUT MCP (custom integrations — the old way)
════════════════════════════════════════════════
Agent ──── custom code ──── Salesforce
Agent ──── custom code ──── PostgreSQL
Agent ──── custom code ──── SharePoint
Agent ──── custom code ──── SAP
→ 4 integrations. Add a new agent: 4 more. Add a new system: N more.
WITH MCP (protocol-based — the new standard)
════════════════════════════════════════════
Agent ──── MCP Client ──── MCP Server ──── Salesforce
└── MCP Server ──── PostgreSQL
└── MCP Server ──── SharePoint
└── MCP Server ──── SAP
→ Write an MCP server once. Any agent, any model uses it.
OpenAI is deprecating its Assistants API in favour of MCP (mid-2026 sunset). Organisations implementing MCP-first architecture report a 60% reduction in tool integration time.
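Writing an MCP server is deliberately boring. A sketch using the official Python SDK's FastMCP helper, where the tool and the query_vendor_db data-access function are illustrative:

```python
# MCP server sketch using the official Python SDK's FastMCP helper.
# Any MCP-capable agent, on any model, can discover and call this tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("vendor-directory")

@mcp.tool()
def lookup_vendor(vendor_id: str) -> dict:
    """Return the vendor record for the given vendor id."""
    return query_vendor_db(vendor_id)  # hypothetical data-access helper

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; HTTP transports also available
```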
Agent2Agent Protocol (A2A)
Where MCP connects agents to tools, A2A connects agents to other agents. Released by Google (April 2025), now under the Linux Foundation with 150+ supporting organisations.
A2A MULTI-AGENT ORCHESTRATION FLOW
════════════════════════════════════
User Request: "Process and onboard this new vendor"
│
▼
Orchestrator Agent
│ Reads Agent Registry (Agent Cards)
│
├─ A2A Task → Document Extraction Agent
│ └─ MCP: reads PDF from S3 / GCS / Blob
│ └─ Returns: structured vendor data (JSON)
│
├─ A2A Task → Risk & Compliance Agent
│ └─ MCP: queries sanctions DB, credit API
│ └─ Returns: risk score + flags
│
├─ A2A Task → ERP Onboarding Agent
│ └─ MCP: writes to SAP / Oracle / Dynamics
│ └─ Returns: vendor ID + status
│
└─ A2A Task → Notification Agent
└─ MCP: sends email via SES / SendGrid
└─ Returns: confirmation
AGENTS.md
Released by OpenAI (August 2025) and donated to the AAIF, AGENTS.md is a markdown convention for giving AI coding agents project-specific context — structure, standards, test procedures, deployment workflows. Over 60,000 open-source projects have adopted it.
your-repo/
├── AGENTS.md ← AI agents read this to understand the project
├── README.md ← Humans read this
├── src/
└── tests/
Just as README.md tells humans how a project works, AGENTS.md tells AI agents how to operate within it.
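An illustrative AGENTS.md skeleton; the section names are common conventions rather than a fixed schema, and the commands are placeholders:

```markdown
# AGENTS.md (illustrative skeleton)

## Setup
- `npm install`, then copy `.env.example` to `.env`

## Testing
- Run `npm test` before every commit; CI rejects untested changes

## Conventions
- TypeScript strict mode; no default exports
- API handlers live in `src/api/`, one file per route

## Constraints
- Never edit files under `migrations/` by hand
```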
How the Standards Compose
┌──────────────────────────────────────────────────────────┐
│ UNIFIED INTEROPERABILITY STACK │
│ │
│ AGENTS.md → How to work in this codebase │
│ MCP → What tools and data are available │
│ A2A → What other agents can be delegated to │
│ Structured Outputs → How data flows reliably between │
│ all of the above │
│ │
│ Result: swap Claude for GPT-5 for Gemini with zero │
│ changes to your integration, memory, or agent layer. │
└──────────────────────────────────────────────────────────┘
11. AI Agent Memory Architecture
Memory is the most underrated component in production agent systems. The four-layer model:
MEMORY ARCHITECTURE FOR PRODUCTION AI AGENTS
═════════════════════════════════════════════
┌─────────────────────────────────────────────────────────┐
│ LAYER 1: WORKING MEMORY (In-Context) │
│ Storage: LLM context window (200K–1M tokens) │
│ Duration: Current session only │
│ Contents: Active conversation, recent tool results │
│ Cost trap: DO NOT brute-force past sessions here │
└─────────────────────────────────────────────────────────┘
↓ Session ends → embed + store ↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 2: EPISODIC MEMORY (Vector Database) │
│ Storage: Pinecone / pgvector / OpenSearch │
│ Duration: Long-term, retrieved by relevance │
│ Contents: Past interactions, user preferences │
│ Tools: Mem0, Zep, LangMem │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: SEMANTIC MEMORY (Knowledge + Entities) │
│ Storage: Vector DB + Knowledge Graph (Neo4j / Neptune) │
│ Duration: Persistent │
│ Contents: Domain knowledge, entity relationships │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ LAYER 4: PROCEDURAL MEMORY (How-To) │
│ Storage: Tool definitions, AGENTS.md, runbooks │
│ Duration: Persistent │
│ Contents: MCP tool schemas, workflow playbooks │
└─────────────────────────────────────────────────────────┘
The context window trap: Gemini 2.5 Pro's 1M-token context looks like a memory solution. But at $1.25/M input tokens, passing 500K tokens of history costs $0.625 per call. At volume, that is thousands of dollars per day. Selective retrieval from a vector store almost always beats brute-force context for cost efficiency.
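The arithmetic behind the trap, as a back-of-envelope script using the Gemini 2.5 Pro price from section 13 (call volumes are illustrative):

```python
# Back-of-envelope: brute-force context vs. selective retrieval.
# Price per the Gemini 2.5 Pro row in section 13: $1.25 per 1M input tokens.
PRICE_PER_TOKEN = 1.25 / 1_000_000

def daily_input_cost(tokens_per_call: int, calls_per_day: int) -> float:
    return tokens_per_call * calls_per_day * PRICE_PER_TOKEN

brute_force = daily_input_cost(500_000, 10_000)  # full history every call
selective = daily_input_cost(8_000, 10_000)      # top-k retrieved memories
print(f"brute force: ${brute_force:,.0f}/day")   # $6,250/day
print(f"selective:   ${selective:,.0f}/day")     # $100/day
```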
12. Integrating AI Agents with Legacy Systems
Nearly 60% of enterprise AI leaders cite legacy integration as their primary barrier to advanced AI adoption. Over 75% of ERP AI projects stall at integration boundaries.
The Four Integration Patterns
PATTERN 1: MCP SERVER AS ADAPTER
══════════════════════════════════
AI Agent → MCP Client → MCP Server → Legacy System
(wraps auth, (SAP / Oracle /
transform, Mainframe)
retries)
Best for: Any system with a queryable interface.
Benefit: Decouples AI layer from integration layer.
Swap models without touching integrations.
PATTERN 2: API GATEWAY TRANSLATION
════════════════════════════════════
AI Agent → REST/JSON → API Gateway → SOAP/RPC → Legacy
(Kong / APIM transform)
Best for: Legacy SOAP, RPC, or proprietary protocol systems.
PATTERN 3: EVENT-DRIVEN (Async)
════════════════════════════════
Legacy System → Kafka / SQS / Pub/Sub → AI Agent
(event stream) (reacts async)
Best for: Mainframes, batch ERP systems, systems that can't
tolerate synchronous AI latency.
PATTERN 4: BROWSER AUTOMATION (Last Resort)
════════════════════════════════════════════
AI Agent → Playwright → Web UI → Legacy System
(browser automation)
Best for: Absolutely no API available.
Warning: Fragile. Replace with Pattern 1 as soon as possible.
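A sketch of Pattern 3's consuming side with boto3 and SQS; the queue URL and the handle_event entry point are illustrative:

```python
# Pattern 3 sketch: an agent consuming legacy-system events from SQS.
# The legacy system (or a change-data-capture bridge) publishes events;
# the agent reacts asynchronously, never blocking the legacy workload.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/erp-events"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        handle_event(event)                  # hypothetical agent entry point
        sqs.delete_message(                  # acknowledge only after success
            QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
        )
```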
Why Open Standards Are Non-Negotiable for Enterprise
Legacy integration is a long-term commitment. Any architecture that ties your integration layer to a single AI vendor creates existential risk. MCP solves this: your MCP servers are vendor-agnostic. You can replace Claude with GPT-5 or Gemini without touching a single integration. This is the architectural argument for open standards, not an ideological one.
13. LLM Costs and Deployment: The 2026 Economics
Cloud API Pricing (April 2026)
LLM prices have fallen ~80% from 2024 to 2026. Current landscape:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| OpenAI GPT-5 Flagship | $1.75 | $14.00 | 128K |
| OpenAI GPT-5 Mini | $0.25 | $2.00 | 128K |
| Anthropic Claude Sonnet | $3.00 | $15.00 | 200K |
| Anthropic Claude Haiku | $0.25 | $1.25 | 200K |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | 1M |
| Google Gemini 2.0 Flash-Lite | $0.075 | $0.30 | 1M |
| DeepSeek V3 (API) | $0.27 | $1.10 | 64K |
Verify against provider pricing pages — this market moves fast.
Self-Hosted vs. Cloud Architecture
DEPLOYMENT ARCHITECTURE DECISION TREE
══════════════════════════════════════
Is your data regulated (HIPAA, GDPR, FCA, SOC 2)?
YES → Self-hosted is mandatory regardless of cost
NO → Continue...
Token volume > 2–3M/day sustained?
YES → Self-hosted hybrid (keep cloud APIs for frontier tasks)
NO → Cloud API is cheaper when engineering overhead is included
Do you need frontier model capability (GPT-5, Claude Sonnet)?
YES → Cloud API (open-weight models not at frontier parity)
NO → Llama 3.3 70B INT4 (self-hosted) is competitive and cheap
RECOMMENDED HYBRID PATTERN (2026):
High-volume / low-complexity tasks → Gemini Flash-Lite / Haiku ($0.075–$0.25/1M)
Complex reasoning / planning tasks → GPT-5 Flagship / Claude Sonnet
Regulated data / on-prem required → vLLM + Llama 3.3 70B INT4
Routing layer → LiteLLM or OpenRouter (model-agnostic)
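That routing layer can be a few lines. A sketch using LiteLLM's provider-agnostic completion() call, where the model strings and the complex_task flag stand in for your own routing rules:

```python
# Model routing sketch with LiteLLM's provider-agnostic completion().
from litellm import completion

def route(task_text: str, complex_task: bool) -> str:
    model = (
        "anthropic/claude-sonnet-4-5"        # reasoning tier (illustrative)
        if complex_task
        else "gemini/gemini-2.0-flash-lite"  # cheap tier for high volume
    )
    resp = completion(
        model=model, messages=[{"role": "user", "content": task_text}]
    )
    return resp.choices[0].message.content
```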
Self-hosting toolchain:
| Tool | Throughput | Best For |
|---|---|---|
| vLLM | 793 tokens/sec (H100) | Production multi-user GPU serving |
| Ollama | 41 tokens/sec | Local development, prototyping |
| llama.cpp | Variable | CPU inference, edge deployment |
Break-even against cloud frontier APIs typically occurs at 2–3M tokens/day over a 12-month hardware amortisation window. Companies running hybrid AI system architectures report 60–80% inference spend reduction without quality loss. For a deeper cost analysis, see our self-hosted LLMs vs cloud APIs guide.
14. The ValueStreamAI 5-Pillar Agentic Architecture
Every AI system we build is evaluated against this framework. It is our engineering checklist, not a marketing slide.
- Autonomy — The system acts from triggers (webhooks, schedules, events), not just user prompts. It decides whether to act, not just how to respond.
- Tool Use — The agent connects to external systems via MCP-standard interfaces. Not just retrieval — writes, creates, triggers.
- Planning — For multi-step goals, the agent decomposes tasks, sequences sub-steps, handles failures, and replans when tool results deviate from expectations.
- Memory — Four-layer memory architecture: working, episodic (vector RAG), semantic (knowledge graph), procedural (tool definitions).
- Multi-Step Reasoning — The agent handles conditional logic, retries, edge cases, and self-correction loops before committing to irreversible actions.
15. Architecture Comparison: Open vs. Locked-In
| Factor | ValueStreamAI (Open Standards) | Single-Vendor Lock-in | DIY Custom Stack |
|---|---|---|---|
| Standards compliance | MCP + A2A + AGENTS.md | Proprietary | None |
| Model portability | Swap any LLM, zero rework | Re-architect to switch | Variable |
| Legacy integration | MCP adapter pattern | Vendor-specific connectors | Bespoke per system |
| Multi-cloud | Native | Single cloud | Variable |
| Memory architecture | 4-layer | Context window only | None or basic |
| Streaming | SSE + WebSocket (routed by use case) | SSE only | Variable |
16. Project Scope and Investment Tiers
- Architecture Audit & Roadmap (2 Weeks): £3,500 – £7,500
  - For: Teams already building but unsure if their architecture will scale. We review your current stack and deliver a prioritised remediation plan.
- RAG Knowledge Pipeline (4–6 Weeks): £8,000 – £20,000
  - For: Internal knowledge bases, document Q&A, compliance research. Includes embedding pipeline, hybrid retrieval, and evaluation harness on your cloud of choice.
- Function-Calling Agent with MCP (6–8 Weeks): £15,000 – £35,000
  - For: Agents that take real actions in your business systems. Includes MCP server setup for each integration target (CRM, ERP, databases).
- Multi-Agent Microservices System (10–16 Weeks): £35,000 – £90,000
  - For: Full agentic workflows across departments. Includes orchestration layer, A2A agent registry, 4-layer memory architecture, and observability on AWS / Azure / GCP.
- Self-Hosted AI Infrastructure (8–12 Weeks): £20,000 – £60,000
  - For: Regulated industries requiring data sovereignty. Includes vLLM deployment, quantized model selection, GPU provisioning, and security hardening.
Frequently Asked Questions
What is the difference between a monolith AI architecture and an AI microservices architecture?
In a monolith AI architecture, all AI capabilities — prompt construction, retrieval, LLM calls, tool execution — live in a single application. It is fast to build but becomes a bottleneck at scale: you cannot independently scale, update, or replace individual components. In a microservices AI architecture, each capability is an independently deployable service with a defined API. The LLM gateway, RAG service, each agent, and the memory store are separate containers. This follows the same principles as traditional microservices, with the difference that agent services are driven by LLM reasoning rather than deterministic code.
What is MCP and why does it matter for enterprise AI architecture?
The Model Context Protocol (MCP) is an open standard (now under the Linux Foundation) for connecting AI models to tools, data sources, and external systems. Instead of each AI team writing bespoke integrations for each system, you write a standardised MCP server once and every agent — regardless of which LLM provider powers it — can use it. For enterprise architecture, MCP decouples the AI intelligence layer from the integration layer, which means you can swap models without rewriting integrations, and add new integrations without modifying agents.
When should AI agents be built as microservices vs. embedded in a monolith?
Build monolith-first when you are at proof-of-concept stage, have a single team, and the AI feature is not mission-critical. Migrate to microservices when: (1) you have more than one agent or AI service, (2) you need to independently scale components under load, (3) multiple teams own different agents, or (4) compliance requires isolated data processing. The migration is straightforward if you designed your monolith with clean service interfaces from the start.
What is the Agent2Agent (A2A) protocol and how does it differ from MCP?
MCP connects AI agents to tools and data. A2A connects AI agents to other AI agents. They are complementary layers. An orchestrating agent uses MCP to query a database and A2A to hand off a task to a specialist sub-agent. Both are Linux Foundation projects supported by all major AI providers. Together they form the interoperability foundation for multi-agent enterprise systems — enabling agents built by different vendors, on different models, to collaborate without bespoke integration work.
How do I choose between AWS, Azure, and GCP for an AI system architecture?
Choose AWS if you need the broadest ecosystem of managed services, particularly for event-driven and serverless architectures (Lambda, SQS, EventBridge), and if you want access to multiple models via Bedrock. Choose Azure if your organisation already uses Microsoft 365, Entra ID (formerly Azure Active Directory), or Dynamics — Azure OpenAI Service and native Entra ID integration reduce enterprise friction significantly. Choose GCP if you want access to Gemini 2.5 Pro's 1M-token context window natively, or if you already use BigQuery for data infrastructure (Vertex AI integrates directly). All three cloud platforms now have first-class support for containerised AI workloads on Kubernetes and serverless container hosting.
What is AGENTS.md and how does it make a codebase AI-agent-ready?
AGENTS.md is a markdown file (contributed to the Linux Foundation by OpenAI) that lives at the root of a repository and tells AI coding agents how to work in that project. It covers: how to run tests, how to build and deploy, coding conventions, what directories contain what, and any constraints agents must respect. It has been adopted by over 60,000 open-source projects. The analogy is simple: README.md tells humans how a project works; AGENTS.md tells AI agents how to work within it.
Building on the Right Foundation
The AI system architecture landscape in 2026 is defined by convergence: all major providers building toward the same open standards (MCP, A2A, AGENTS.md), the same structured output patterns, and the same hybrid deployment models. The winners will not be the teams with access to the most powerful models — those are increasingly commoditised. They will be the teams that design their systems as composable microservices, built on open standards, deployable on any cloud, with memory and integration architectures that survive model upgrades and vendor changes.
If you are starting a new AI system project or modernising an existing one, the practical checklist:
- Start with four questions: knowledge problem, action problem, latency problem, integration problem.
- Design each AI capability as an independently deployable microservice from day one.
- Use MCP for every tool and data integration — no bespoke connectors.
- Use A2A-compatible agent interfaces even if you only have one agent today.
- Use structured outputs for every inter-agent and inter-service data exchange.
- Design your memory architecture explicitly across all four layers.
- Choose cloud deployment (AWS, Azure, GCP) based on existing infrastructure and compliance requirements.
- Add AGENTS.md to every repository that AI coding agents will work in.
Talk to the ValueStreamAI team about architecting your system on this foundation.
