Running DeepSeek, Llama 4, and Mistral On-Premise: A Technical Guide
Running large language models on your own hardware is no longer a research exercise — it is a production-ready strategy. With Llama 4 Scout, DeepSeek-V3, and Mistral Large 2 all available under permissive licences, European businesses can deploy state-of-the-art AI without sending a single byte to a US cloud provider. This guide covers the hardware you need, the software stacks that work, real performance benchmarks, the true 12-month cost comparison, and the security hardening steps that make on-premise deployment GDPR-compliant from day one.
Table of Contents
1. Hardware Requirements Per Model
The hardware question is the first and most critical decision in any on-premise LLM deployment. Get it right and you have a system that runs inference at production speed for years. Get it wrong and you have an expensive paperweight that takes 30 seconds to generate a sentence. The fundamental constraint is GPU memory (VRAM). Large language models must fit their parameters into GPU memory to run at acceptable speed. CPU-only inference is possible but 10–50x slower and generally unsuitable for production workloads.
The following table provides hardware requirements for the most capable open-source models available as of March 2026. All figures assume production deployment with adequate headroom for the KV cache (the memory used to store conversation context during inference). Development and testing can run with less.
| Model | Parameters | FP16 VRAM | Q4 VRAM | Min GPU Config | Recommended GPU |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B active (109B MoE) | ~55 GB | ~32 GB | 2x RTX 4090 (48 GB) | 2x A100 80GB |
| Llama 4 Maverick | 17B active (400B MoE) | ~160 GB | ~90 GB | 2x A100 80GB | 4x A100 80GB |
| DeepSeek-V3 | 37B active (671B MoE) | ~320 GB | ~170 GB | 4x A100 80GB | 8x A100 80GB / 4x H100 |
| DeepSeek-R1 | 37B active (671B MoE) | ~320 GB | ~170 GB | 4x A100 80GB | 8x A100 80GB / 4x H100 |
| Mistral Large 2 | 123B dense | ~246 GB | ~70 GB | 4x A100 80GB (FP16) / 2x RTX 4090 (Q4) | 4x A100 80GB |
| Mistral Small 3.1 | 24B dense | ~48 GB | ~14 GB | 1x RTX 4090 | 1x A100 80GB |
| Qwen 2.5 72B | 72B dense | ~144 GB | ~42 GB | 2x A100 80GB | 4x A100 80GB |
| Llama 3.3 70B | 70B dense | ~140 GB | ~40 GB | 2x A100 80GB | 4x A100 80GB |
Understanding MoE vs Dense models: Mixture of Experts (MoE) models like DeepSeek-V3 and Llama 4 Scout have large total parameter counts but only activate a fraction of them for each token. Llama 4 Scout has 109 billion total parameters but only 17 billion are active per inference pass. This means MoE models deliver the quality of a much larger model with the inference speed of a smaller one. However, the full model still needs to fit in VRAM — you cannot load only the active parameters. The VRAM savings from quantisation are therefore proportionally more impactful for MoE models.
Quantisation explained: FP16 (16-bit floating point) is the native precision for most models. Q4 (4-bit quantisation) reduces memory requirements by roughly 4x with a small quality loss (typically 1–3% on benchmarks). For most production use cases — chatbots, document analysis, code generation, email drafting — the quality difference between FP16 and Q4 is imperceptible to end users. We recommend Q4 quantisation (specifically GPTQ or AWQ formats) for all production deployments where VRAM is constrained. FP16 is preferable when you have the VRAM headroom and need maximum quality for tasks like reasoning, legal analysis, or medical applications.
GPU Hardware Pricing (March 2026)
| GPU | VRAM | New Price (approx.) | Used/Refurb Price | Best For |
|---|---|---|---|---|
| NVIDIA RTX 4090 | 24 GB GDDR6X | €1,800 – €2,200 | €1,400 – €1,700 | Small models, development |
| NVIDIA RTX 5090 | 32 GB GDDR7 | €2,400 – €2,800 | N/A (new) | Mid-size models, production SME |
| NVIDIA A100 80GB | 80 GB HBM2e | €8,000 – €12,000 | €5,000 – €7,000 | Large models, production |
| NVIDIA H100 80GB | 80 GB HBM3 | €25,000 – €32,000 | €18,000 – €22,000 | Maximum performance |
| NVIDIA H200 | 141 GB HBM3e | €30,000 – €38,000 | N/A (new) | Largest models, high throughput |
| AMD MI300X | 192 GB HBM3 | €12,000 – €16,000 | N/A | Large models, cost-effective |
Complete server builds for Austrian businesses: A production-ready server for running Llama 4 Scout or Mistral Small 3.1 (the sweet spot for most Austrian SMEs) requires: 2x RTX 4090 GPUs, a compatible motherboard with two PCIe 4.0 x16 slots, 128 GB DDR5 RAM, a 2 TB NVMe SSD for model storage, a 1,600W power supply, and a server chassis with adequate cooling. Total cost for a turnkey system from an Austrian IT supplier: €7,000 to €10,000. For companies that need to run DeepSeek-V3 or other full-scale models, a 4x A100 server from a European cloud provider like Hetzner or OVHcloud as a dedicated server costs €2,500 to €4,000 per month — less than the equivalent cloud API costs at moderate usage.
Power and cooling considerations: A dual-4090 server draws approximately 700–900W under full inference load. At Austrian electricity prices (approximately €0.25/kWh as of early 2026), running 24/7 costs roughly €155–€200 per month. A 4x A100 server draws 1,600–2,000W, costing €350–€440 per month in electricity. These costs are included in our 12-month comparison later in this guide. Cooling is managed by the server’s internal fans in most office environments, but if you are placing hardware in a server room, ensure ambient temperature stays below 30 degrees Celsius for reliable operation.
2. Software Stack Comparison: vLLM vs TGI vs Ollama
The inference server is the software layer between your GPU hardware and your application. It loads the model, manages memory, handles concurrent requests, and optimises throughput. Choosing the right inference server is as important as choosing the right GPU. The three leading options as of March 2026 are vLLM, Hugging Face Text Generation Inference (TGI), and Ollama. Each serves a different use case.
| Feature | vLLM | TGI | Ollama |
|---|---|---|---|
| Primary use case | High-throughput production | Production with HF ecosystem | Development & simple deploys |
| Throughput (relative) | Highest | High | Moderate |
| Concurrent requests | Excellent (continuous batching) | Good (dynamic batching) | Limited |
| Multi-GPU support | Tensor parallelism, pipeline parallelism | Tensor parallelism | Basic multi-GPU |
| Quantisation support | GPTQ, AWQ, FP8, INT8 | GPTQ, AWQ, BitsAndBytes | GGUF (llama.cpp) |
| API compatibility | OpenAI-compatible | Custom + OpenAI-compatible | OpenAI-compatible |
| Setup difficulty | Moderate | Moderate | Very easy |
| Docker support | Official images | Official images | Official images |
| Streaming | Yes | Yes | Yes |
| Model format | HF Transformers, GPTQ, AWQ | HF Transformers | GGUF |
| Memory efficiency | PagedAttention (very efficient) | Good | Good (llama.cpp) |
| Community & docs | Large, active | Large (Hugging Face) | Large, beginner-friendly |
| Licence | Apache 2.0 | Apache 2.0 | MIT |
vLLM is our recommendation for production deployments serving more than a handful of concurrent users. Its PagedAttention memory management system is the most efficient available, allowing you to serve more concurrent requests per GPU than any alternative. It supports continuous batching, which means new requests can be added to an active batch without waiting for the current batch to finish. For an Austrian business deploying AI to serve customers, internal teams, or API consumers, vLLM provides the best performance per euro of hardware.
TGI (Text Generation Inference) by Hugging Face is the best choice if your workflow is deeply integrated with the Hugging Face ecosystem. It handles model downloads, weight management, and format conversion seamlessly. Its performance is close to vLLM for most workloads and it has the advantage of being developed by the same team that maintains the model hub where you download the weights. For teams that are already using Hugging Face for model evaluation and fine-tuning, TGI reduces friction.
Ollama is the right choice for development, testing, and low-concurrency production deployments. Its setup process is remarkably simple — a single command downloads and runs a model. It uses the GGUF format (via llama.cpp), which is optimised for consumer hardware and runs well on systems with limited VRAM by offloading layers to CPU memory. For an Austrian SME that needs a single model serving a handful of internal users, Ollama can be the entire stack. Its OpenAI-compatible API means your application code does not need to change if you later migrate to vLLM or TGI.
Our recommendation: Start with Ollama for prototyping and development. Move to vLLM when you need to serve concurrent users in production. Use TGI if your team is already deep in the Hugging Face ecosystem. All three serve an OpenAI-compatible API, so switching between them requires changing only the endpoint URL in your application configuration.
3. Performance Benchmarks: Tokens Per Second
Raw performance numbers are essential for capacity planning. How many tokens per second can each model generate on a given hardware configuration? This determines how many concurrent users your system can serve, how responsive your chatbot feels, and whether batch processing jobs complete in minutes or hours. The following benchmarks were measured on standardised hardware using vLLM with default settings and a context length of 4,096 tokens.
Single-User Generation Speed (tokens/second)
| Model | Precision | 1x RTX 4090 | 2x RTX 4090 | 1x A100 80GB | 4x A100 80GB |
|---|---|---|---|---|---|
| Mistral Small 3.1 (24B) | Q4 | 52 | — | 68 | — |
| Mistral Small 3.1 (24B) | FP16 | — | 38 | 55 | — |
| Llama 4 Scout (109B MoE) | Q4 | — | 35 | — | 62 |
| Llama 4 Scout (109B MoE) | FP16 | — | — | — | 48 |
| Llama 3.3 70B | Q4 | — | 22 | 32 | 55 |
| Qwen 2.5 72B | Q4 | — | 20 | 30 | 52 |
| Mistral Large 2 (123B) | Q4 | — | — | — | 38 |
| DeepSeek-V3 (671B MoE) | Q4 | — | — | — | 25 |
| DeepSeek-R1 (671B MoE) | Q4 | — | — | — | 22 |
What these numbers mean in practice: A comfortable reading speed for a human is about 4–5 tokens per second (roughly 250 words per minute). A chatbot generating at 25+ tokens/second feels instantaneous — the text appears faster than a user can read it. For real-time conversational AI, anything above 15 tokens/second is acceptable. For batch processing (summarising documents, generating reports), throughput matters more than single-request latency, and vLLM’s continuous batching can push total throughput to 3–5x the single-user figures when serving multiple concurrent requests.
Concurrent user capacity: A Llama 4 Scout deployment on 2x RTX 4090 with Q4 quantisation can comfortably serve 8–12 concurrent users with sub-2-second time to first token. Mistral Small 3.1 on a single RTX 4090 can serve 5–8 concurrent users. For most Austrian SMEs, these concurrency levels are more than sufficient — even a busy customer support chatbot rarely exceeds 5 simultaneous conversations.
Time to first token (TTFT): This is the latency between sending a request and receiving the first token of the response. For chat applications, TTFT is the most perceptible performance metric. On vLLM with a 4K context window: Mistral Small 3.1 achieves 150–300ms TTFT on a single RTX 4090. Llama 4 Scout achieves 300–600ms on 2x RTX 4090. DeepSeek-V3 achieves 500–1,200ms on 4x A100. All of these are well within acceptable ranges for interactive applications. For comparison, OpenAI’s API typically has 200–800ms TTFT depending on load, so on-premise performance is competitive.
4. 12-Month Cost: On-Premise vs Cloud APIs
The cost comparison between on-premise and cloud AI is the analysis that convinces most Austrian businesses to make the switch. Cloud APIs have an attractive starting cost (no hardware purchase, pay per token) but the economics invert quickly as usage grows. On-premise has a higher upfront cost but near-zero marginal cost per request. The crossover point — where on-premise becomes cheaper — arrives sooner than most people expect.
The following comparison models three usage scenarios for an Austrian SME over 12 months. We compare the total cost of ownership for running Llama 4 Scout on-premise (2x RTX 4090 server) against using GPT-4o via OpenAI’s API and Claude 3.5 Sonnet via Anthropic’s API. All prices are as of March 2026.
| Cost Category | On-Premise (Llama 4 Scout) | Cloud (GPT-4o API) | Cloud (Claude 3.5 Sonnet) |
|---|---|---|---|
| Hardware (one-time) | €9,500 | €0 | €0 |
| Setup & configuration | €2,000 | €500 | €500 |
| Monthly electricity (12 mo) | €2,160 | €0 | €0 |
| Monthly maintenance | €600 | €0 | €0 |
Scenario Comparison: 12-Month Total Cost
| Usage Level | Monthly Tokens | On-Premise (12 mo) | GPT-4o API (12 mo) | Claude 3.5 Sonnet (12 mo) |
|---|---|---|---|---|
| Light | 2M tokens/month | €14,260 | €1,620 | €1,380 |
| Moderate | 15M tokens/month | €14,260 | €11,700 | €9,900 |
| Heavy | 50M tokens/month | €14,260 | €38,500 | €32,400 |
| Very Heavy | 150M tokens/month | €14,260 | €115,000 | €97,200 |
The crossover point: At light usage (2 million tokens per month — roughly 30 chatbot conversations per day), cloud APIs are significantly cheaper. At moderate usage (15 million tokens per month — roughly 200 conversations per day plus document processing), the costs are approximately equal. At heavy usage and beyond, on-premise is dramatically cheaper. The key insight is that on-premise cost is fixed regardless of usage. Once you have the hardware, generating 50 million tokens costs the same as generating 5 million.
Year 2 and beyond: The on-premise advantage compounds over time. In year 2, the hardware is already paid for. Your annual running cost drops to approximately €2,760 (electricity and maintenance) regardless of usage volume. Cloud API costs remain proportional to usage. By month 18, even the moderate usage scenario shows a clear on-premise advantage, and by month 24 the cumulative savings at heavy usage exceed €60,000.
Hidden cloud costs to consider: Cloud API pricing does not include the cost of rate limit management, retry logic for failed requests, the engineering time spent on API outage workarounds, or the business cost of being dependent on a single provider’s uptime. OpenAI and Anthropic have each experienced multiple significant outages in the past 12 months. For an Austrian business where AI is mission-critical (customer support, lead processing, automated workflows), these outages translate directly to lost revenue and customer frustration. On-premise eliminates this dependency entirely.
5. Security Hardening for Production Deployment
Running an LLM on your own infrastructure means you are responsible for its security. Unlike cloud APIs where the provider handles infrastructure security, on-premise deployment requires you to secure the full stack: the operating system, the inference server, the API layer, and the model itself. The following hardening steps are essential for any production deployment, and several are directly required for GDPR compliance.
Network isolation: The inference server should never be directly exposed to the public internet. Place it behind a reverse proxy (nginx or Caddy) with TLS termination. Restrict access to the inference API to specific internal IP ranges or use a VPN. If the model serves an external-facing application (like a chatbot), the application server should be the only system that communicates with the inference server. Implement rate limiting at the reverse proxy level to prevent abuse: 10–20 requests per minute per user is a reasonable starting point for conversational AI.
Authentication and authorisation: Every request to the inference server must be authenticated. Use API keys at minimum, and OAuth2/JWT tokens for multi-user deployments. Implement role-based access control: not every user needs access to every model or every system prompt. Log all access attempts, successful and failed. Rotate API keys regularly (monthly minimum) and immediately revoke keys for departed employees.
Input sanitisation and prompt injection defence: LLMs are vulnerable to prompt injection attacks where malicious input manipulates the model into ignoring its system instructions. Implement input filtering to detect and block common injection patterns. Use a separate validation layer that checks model outputs for sensitive data leakage (personal data, internal system prompts, credentials) before returning them to the user. Several open-source tools exist for this purpose, including LLM Guard and Rebuff.
Data encryption: Encrypt data at rest (full disk encryption on the server, encrypted model weight storage) and in transit (TLS 1.3 for all API communications). If your model processes personal data, encryption is not optional — it is a GDPR technical measure under Article 32. Use LUKS for disk encryption on Linux and ensure TLS certificates are automatically renewed (Let’s Encrypt via Certbot is sufficient).
Logging and monitoring: Implement comprehensive logging that captures: every request to the inference server (timestamp, user, input length, output length, latency), system health metrics (GPU utilisation, VRAM usage, temperature, throughput), security events (failed authentication, rate limit hits, suspicious input patterns), and model performance metrics (tokens per second, error rates). Send logs to a centralised logging system (Grafana Loki, Elasticsearch, or even a simple syslog server). Set up alerts for anomalies: sudden spikes in usage, GPU temperature exceeding 85 degrees Celsius, or repeated authentication failures.
GDPR-specific requirements: If the model processes personal data (and in most business applications, it will), you must implement: data minimisation (do not log full prompts if they contain personal data unless necessary), purpose limitation (document what the model is used for and enforce it technically), storage limitation (automatically purge logs containing personal data after the retention period), and right of erasure (have a process to identify and delete data related to a specific data subject from logs and any fine-tuning datasets). On-premise deployment actually simplifies GDPR compliance because you have full control over data flows and can guarantee that personal data never leaves your infrastructure.
Update and patch management: Keep the operating system, NVIDIA drivers, CUDA toolkit, inference server, and model weights up to date. Subscribe to security advisories for vLLM, TGI, or Ollama (whichever you use). NVIDIA publishes driver security bulletins monthly. Implement a maintenance window (we recommend Sunday early morning for Austrian businesses) for updates. Use a staging environment to test updates before applying them to production — a misconfigured CUDA update can take your inference server offline for hours.
6. When Each Model Excels
Not every model is best for every task. The open-source LLM landscape has matured to the point where different models have distinct strengths, and choosing the right model for your specific use case can mean the difference between a system that impresses users and one that frustrates them. Here is our assessment based on extensive testing across real business workloads.
| Use Case | Best Model | Runner-Up | Notes |
|---|---|---|---|
| German-language tasks | Llama 4 Scout | Qwen 2.5 72B | Scout excellent across European languages including Austrian German |
| Code generation | DeepSeek-V3 | Qwen 2.5 72B | DeepSeek excels at code across all common languages |
| Complex reasoning | DeepSeek-R1 | Llama 4 Scout | R1 chain-of-thought reasoning is best-in-class for open models |
| Customer support | Mistral Small 3.1 | Llama 4 Scout | Small 3.1 is fast and handles conversational tone well |
| Document summarisation | Llama 4 Scout | Mistral Large 2 | Scout handles long contexts (10M tokens) natively |
| Legal / compliance | Mistral Large 2 | Llama 3.3 70B | Large 2 is strongest on precision-critical tasks |
| Multilingual (EU) | Llama 4 Scout | Mistral Large 2 | Both strong across EU languages; Scout has edge |
| Email drafting | Mistral Small 3.1 | Llama 4 Scout | Fast, tonally appropriate, handles formal German well |
| Data extraction / OCR | Qwen 2.5 72B | DeepSeek-V3 | Qwen excels at structured data extraction |
| Cost-sensitive deployment | Mistral Small 3.1 | Llama 3.3 70B (Q4) | Small 3.1 runs on a single consumer GPU |
DeepSeek-V3 and DeepSeek-R1: DeepSeek has emerged as a serious contender in the open-source LLM space. V3 is a general-purpose model with particular strength in code generation and mathematical reasoning. R1 is a reasoning-focused variant that uses chain-of-thought prompting internally to work through complex problems step by step. The downside is size: at 671 billion total parameters, even with MoE efficiency, these models require substantial hardware. For Austrian businesses with specific needs in coding, data analysis, or complex reasoning tasks, the investment in hardware to run DeepSeek can be justified. For general-purpose business applications, Llama 4 Scout or Mistral provide better performance per euro.
Llama 4 Scout: Meta’s latest model is the best all-rounder for European business use cases. Its MoE architecture (109B total, 17B active) provides excellent quality at reasonable hardware requirements. Its standout feature is the 10 million token context window, which means it can process entire books, legal contracts, or months of email history in a single prompt. For Austrian businesses, its multilingual capabilities are excellent — it handles German (including Austrian conventions), English, French, Italian, and other EU languages natively. This is our default recommendation for most Austrian SME deployments.
Mistral models: The French AI company Mistral produces models that punch above their weight class. Mistral Small 3.1 at 24 billion parameters is the best model you can run on a single RTX 4090, making it the most cost-effective option for Austrian SMEs that want to start with minimal hardware investment. It excels at conversational tasks, email drafting, and customer support — the bread and butter of SME AI use cases. Mistral Large 2 at 123 billion parameters is the premium option for businesses that need maximum accuracy on precision-critical tasks like legal analysis, compliance documentation, or medical applications. Its European origin also means it is developed with EU regulatory awareness built in.
Multi-model strategy: There is no rule that says you must run only one model. Many of our Austrian clients run Mistral Small 3.1 for fast, routine tasks (email responses, simple queries, content generation) and Llama 4 Scout or DeepSeek-R1 for complex tasks (document analysis, reasoning, code generation). The inference server routes requests to the appropriate model based on task classification. This maximises both speed and quality while keeping hardware costs manageable. With vLLM, you can serve multiple models from the same server and dynamically allocate GPU resources between them.
Ready to get started?
We design, build, and deploy on-premise LLM infrastructure for Austrian and European businesses. From hardware selection to production hardening — we handle the full stack so you can focus on what the AI does for your business.
Get a Custom Hardware Recommendation