Most enterprise AI adoption conversations start the same way: your team wants AI capabilities, your security team says "not in someone else's cloud," and IT asks "so what exactly do we need?"
This guide answers that question. No hype, no vendor pitch — just the practical architecture you need to run AI workloads on infrastructure you control.
Why On-Premise AI Is Back
The cloud AI wave peaked in 2024. By 2025, a counter-trend emerged: enterprises pulling AI workloads back on-premise. The reasons are predictable:
- Data residency laws (GDPR, DORA, NIS2) make it legally risky to send customer data to US-hosted APIs
- Cost at scale — API calls that seem cheap at prototype scale become expensive at production volumes
- Latency — round-trips to cloud APIs add 200-500ms per request, which compounds in multi-agent workflows
- Vendor lock-in — switching providers means rewriting prompts, pipelines, and integrations
The question is no longer should you run AI locally, but how.
This shift is well documented. Gartner's 2025 Infrastructure Trends report identifies on-premise and hybrid AI deployment as the fastest-growing segment of enterprise AI investment, driven primarily by compliance pressure in European and regulated-industry contexts. McKinsey's 2025 AI Adoption Survey similarly finds that data sovereignty concerns are the primary driver of on-premise AI deployment among European enterprises, cited by 73% of respondents who had moved AI workloads in-house.
Understanding the Full Stack
Before diving into hardware specs and cost models, it helps to understand what "deploying AI" actually means architecturally. Many organizations focus entirely on the model — which model to run, how much memory it needs, how to optimize throughput. The model is one layer of a five-layer stack, and it is rarely the bottleneck once deployed.
Layer 5: Application layer ← Where 80% of your value is
Layer 4: Orchestration layer ← Routing, context, chaining
Layer 3: Model serving layer ← vLLM, Ollama, TGI
Layer 2: Data layer ← PostgreSQL + pgvector, Redis
Layer 1: Infrastructure layer ← Hardware, networking, TLS
Organizations that deploy only Layers 1-3 get raw inference capability: you can send a prompt and get a completion. That is technically impressive but organizationally low-value. The value comes from Layers 4-5 — the orchestration and application layers that turn raw inference into organizational capabilities.
This framing matters for budget allocation. Organizations consistently underinvest in Layers 4-5 because they are invisible ("it's just software") and overinvest in Layer 3 (the GPU hardware is tangible). The right allocation for most organizations is roughly 30% hardware/infrastructure, 70% application and orchestration development — the inverse of what most first deployments look like.
The Minimum Viable AI Infrastructure
Here's what a production-grade on-premise AI deployment actually requires:
Hardware
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 1x NVIDIA A100 40GB | 2x A100 80GB or H100 |
| CPU | 16 cores | 32+ cores |
| RAM | 64GB | 128GB+ |
| Storage | 1TB NVMe | 2TB+ NVMe |
| Network | 1Gbps | 10Gbps internal |
For smaller models (7B-13B parameters), you can start with consumer GPUs like the RTX 4090 (24GB VRAM). A Llama 3.2 3B model runs in approximately 2GB VRAM; Llama 3.1 8B requires approximately 8GB (at 4-bit quantization) or 16GB (at 8-bit). But for production workloads with multiple concurrent users, professional GPUs are worth the investment — the A100's architecture is optimized for sustained throughput, while consumer GPUs thermal-throttle under continuous load.
For organizations deploying Ollama specifically, the minimum viable setup for a team of 10-20 users with moderate AI usage is a single A100 40GB, which can serve Llama 3.1 70B (4-bit quantized) at approximately 40-60 tokens per second — sufficient for conversational latency.
GPU procurement options vary significantly by budget and tolerance for operational overhead:
- Cloud rental (Lambda Labs, CoreWeave, Vast.ai): $1.5-3/hour per A100. Zero CapEx, but recurring cost and data still leaves your network.
- Colocation (Hetzner, OVH Dedicated): Dedicated GPU server, monthly rate, your hardware or theirs, EU data center options available.
- On-premise hardware: Full control, highest CapEx, lowest per-unit cost at scale. Appropriate for organizations committed to long-term AI deployment.
Software Stack
A complete on-premise AI stack needs these layers:
- Model serving — vLLM, Ollama, or TGI (Text Generation Inference) to serve LLM inference
- Embedding service — for semantic search and RAG pipelines (nomic-embed-text, sentence-transformers)
- Vector database — pgvector (PostgreSQL extension) for persistent memory, or dedicated like Qdrant for high-volume search
- Orchestration — something to route requests, manage context, and chain operations
- Reverse proxy — Traefik or Nginx for TLS termination and routing
- Monitoring — GPU utilization, inference latency, token throughput
Each component has well-maintained open-source implementations. The integration work — making them work together coherently, with proper authentication, audit logging, and failover — is where the engineering effort concentrates.
The Architecture
[Users] → [Reverse Proxy (TLS)] → [API Gateway / Router]
↓
┌───────────────────┼───────────────────┐
↓ ↓ ↓
[LLM Service] [Embedding Service] [Application Layer]
(vLLM/Ollama) (sentence-transformers) ↓
↓ ↓ [PostgreSQL + pgvector]
[GPU Pool] [CPU/GPU] [Redis Cache]
The key insight: don't try to build this from scratch. The model serving layer is a solved problem — vLLM handles batching, quantization, and multi-GPU automatically. Your engineering effort should go into the application layer — the part that turns raw model output into useful organizational capabilities. For a detailed comparison of when self-hosted makes financial sense, see our on-premise vs cloud AI comparison.
Model Serving: vLLM vs Ollama vs TGI
Three mature options exist for the model serving layer, each with different optimization targets:
vLLM is optimized for throughput in multi-user production environments. Its PagedAttention algorithm significantly improves GPU memory efficiency, allowing more concurrent requests. vLLM is appropriate for high-traffic production deployments where throughput is the primary metric.
Ollama is optimized for simplicity and ease of use. It handles model downloading, quantization selection, and serving with minimal configuration. For organizations that want to run multiple models on the same hardware and switch between them easily, Ollama is the right choice. Its CPU fallback capability also makes it suitable for organizations that don't yet have GPU hardware.
TGI (Text Generation Inference) from Hugging Face bridges simplicity and production performance, with strong support for the Hugging Face model ecosystem and enterprise features like model sharding across multiple GPUs.
For most organizations starting their on-premise AI journey, Ollama is the right entry point. Migrating to vLLM for production is straightforward once you understand your actual throughput requirements.
Cost Comparison: Cloud API vs On-Premise
Let's do real math for a 25-person team using AI daily:
Cloud API (e.g., GPT-4o via OpenAI)
- ~500 requests/day × 2,000 tokens avg = 1M tokens/day
- At $2.50/1M input + $10/1M output ≈ $375/day = $11,250/month
On-Premise (Llama 3.1 70B on 2x A100)
- Server lease: $2,000-3,000/month (Hetzner, OVH, colocation)
- Electricity: ~$200/month (approximately 400W per A100 under load)
- Maintenance/DevOps: ~$500/month (amortized time for a single server)
- Total: $2,700-3,700/month
The on-premise option is 3-4x cheaper at this scale, and the gap only widens as usage grows. The tradeoff: you need someone who can manage the infrastructure.
The Break-Even Analysis
For smaller teams (5-10 people with light AI usage), cloud APIs are typically more cost-effective — the per-request cost is low, and the fixed cost of on-premise infrastructure is hard to amortize. The break-even point for most organizations is approximately 15-20 users with regular AI usage.
Beyond break-even, cost scales linearly with cloud APIs (more users = proportionally more cost) but has a step function with on-premise (add a GPU when you hit capacity, then flat again). At 100+ users, the on-premise advantage is typically 5-10x.
Important caveat: this analysis assumes the open-source models are adequate for your use cases. For tasks requiring frontier model capabilities — complex multi-step reasoning, nuanced legal analysis, sophisticated code generation — you may need a hybrid approach: on-premise for routine tasks, cloud for complex ones. This hybrid approach is what most mature enterprise deployments use.
Common Pitfalls
After deploying AI infrastructure for multiple organizations, these are the mistakes we see repeatedly:
1. Starting Too Big
Don't buy 8x H100s on day one. Start with a single GPU server, deploy one model, and validate that your team actually uses it. Scale based on measured demand. We have seen organizations invest €100K+ in hardware before validating the use case, then struggle to justify the investment when adoption is lower than projected.
The right approach: start with cloud APIs or a single rented GPU to validate use cases and measure actual token consumption. Buy hardware when you have real usage data to size against.
2. Ignoring the Application Layer
Raw model access is not useful to most employees. A chat interface that sends prompts directly to an LLM is barely better than what users already have with public cloud AI tools — and it does not provide the organizational memory, governance, or specialized capabilities that justify the infrastructure investment.
The application layer is where the ROI is. This means: routing requests to the right model for the right task, attaching organizational context before sending to the model, capturing outputs in a governed memory store, applying approval workflows for high-stakes actions, and providing role-appropriate interfaces for different users. This is where 80% of the engineering effort should go.
3. No Audit Trail
Regulated industries need to know who asked what, when, and what the AI answered. The EU AI Act (Article 12) requires that high-risk AI systems maintain logs automatically. Build audit logging from day one — retrofitting it is painful and often incomplete. See our AI governance framework for what a complete audit architecture looks like.
The audit trail also has operational value: when AI outputs are used in decisions that later turn out to be wrong, you need to know what the AI said and what context it had. Without logs, you cannot learn from failures.
4. Treating Models as Static
Models improve quarterly. Your infrastructure needs to support model updates without downtime. Use container orchestration (Docker Compose or Kubernetes) so you can swap models by changing a tag. Design your application layer to be model-agnostic — never hardcode model-specific behavior or output format assumptions that will break when you upgrade.
Quantization formats also evolve. Models that were only available in FP16 are now available in GPTQ, AWQ, and GGUF formats with different quality/speed tradeoffs. Your serving infrastructure should support format flexibility.
5. Forgetting About Embeddings
Semantic search — finding relevant context before asking the LLM — is often more valuable than the LLM itself. A model that receives relevant organizational context before answering a question consistently outperforms a smarter model answering without context. This is the foundation of Retrieval-Augmented Generation (RAG).
Budget GPU memory for both inference and embedding generation. If your GPU is maxed serving inference, your embedding generation will compete for resources, degrading the quality of context retrieval. In high-usage deployments, a dedicated embedding service on a separate GPU (or on CPU for smaller embedding models) is worth the additional cost.
6. Single Point of Failure
Production AI infrastructure needs the same resilience design as any other production system. Plan for: what happens if the model server crashes? What happens during a model update? What happens if the embedding database becomes inconsistent?
For organizations under DORA (Digital Operational Resilience Act), this is not optional — you must demonstrate a resilience plan for AI system failures. See the DORA requirements overview for financial sector obligations.
Security Considerations
On-premise AI deployment eliminates some security concerns (data leaving the network) while introducing others (running GPU infrastructure with internet-accessible endpoints).
Key security requirements for on-premise AI:
- TLS on all endpoints — never expose inference APIs over plain HTTP, even internally
- Authentication on all APIs — API keys at minimum, JWT for user-facing interfaces
- Network segmentation — AI infrastructure should be on a separate network segment from production application servers
- Model integrity verification — cryptographic verification that models haven't been tampered with
- Prompt injection monitoring — log and review prompts for injection attempts, especially for user-facing AI interfaces
For organizations under European regulations, the ENISA AI Security guidelines provide a useful framework for AI-specific security requirements. See our security page for details on how Odin addresses these requirements.
What We Built
At Odin Labs, we built exactly this infrastructure — not as a thought exercise, but because our customers demanded it. The Odin platform deploys entirely on your servers:
- Docker Compose for the full stack (no Kubernetes required for organizations up to several hundred users)
- vLLM for model serving with automatic GPU management
- PostgreSQL + pgvector for persistent memory and semantic search
- Traefik for routing with TLS termination
- BrainDB for organizational knowledge governance
- Six specialized hubs for different organizational functions: Academy (learning), Compass (decisions), LUNA/Assistant (interface), Sales Engine, Coding, and Legal
Every byte of data stays on your network. We've deployed this for organizations in the Netherlands and Germany where GDPR compliance isn't optional — it's table stakes.
The architecture is intentionally modular. You can deploy the full stack or individual components. Organizations that already have model serving infrastructure can connect Odin's application layer to their existing Ollama or vLLM deployment. Organizations starting fresh can use the full stack deployment, which handles all components.
For the governance and compliance perspective on why European companies are making this move, see our AI data sovereignty guide.
Getting Started
If you're evaluating on-premise AI for your organization:
- Audit your use cases — not everything needs a GPU. Many tasks work fine with smaller models on CPU. Start with a clear picture of what you want to do before sizing hardware.
- Start with inference — get a model serving and answering questions before building complex pipelines. Validate quality before investing in application layer development.
- Measure before scaling — track token usage, latency, and GPU utilization for 2 weeks before buying more hardware. The actual usage pattern is almost always different from the projected one.
- Plan for the application layer — the model is 20% of the value. The other 80% is context, routing, governance, and interfaces. Budget accordingly.
- Design for GDPR and AI Act from day one — retrofitting compliance documentation is significantly harder than building it in. Audit logging, data flow documentation, and governance records should be requirements from the first deployment.
We're happy to walk through what this looks like for your specific situation. Visit our product overview or reach out — we'll share our architecture diagrams and deployment playbooks.