
On-Prem LLM Cost Calculator

Estimate the monthly cost of self-hosting an open model on your own GPUs, then compare it against a per-token cloud API. See the break-even token volume and the 1-year and 3-year total cost of ownership. Every assumption below is editable; these are estimates, not quotes.

Your Workload

Monthly Requests / Calls

1 000 000 requests / month

Avg Input Tokens / Request

1 500 tokens in

Avg Output Tokens / Request

500 tokens out

Total Monthly Token Volume

2.00B tokens

1.50B in + 500.0M out

Open Model Size

70B (e.g. Llama 3.1 70B)

Output Tokens / Sec (for the minimum cluster)

Sustained throughput at healthy batching. Edit to match your benchmark.

Min GPUs to Host the Model

Floor for the cluster size, driven by weights + KV cache memory.

Hardware & Economics

GPU Type

Costing Model

GPU Capex (each)

Amortise Over (yrs)

Board Power (W)

Energy (EUR/kWh)

Datacentre PUE: 1.40

Cooling + power overhead multiplier (1.0 = none).

GPU Utilisation: 40%

Share of wall-clock time the GPUs are actually serving tokens.

Ops Overhead: 30%

Engineering, monitoring, networking and hosting on top of hardware + power.

Cloud API Price (EUR / 1M tokens)

Blended input + output price of a comparable hosted open model. Default is a mid-market figure - replace with your provider quote.

The Comparison

Sized for 2x NVIDIA H100 80GB (hosting-bound)

 On-Prem / Self-Hosted

€ 2.487

/ month - € 1,24 per 1M tokens

 Cloud API

€ 1.200

/ month - € 0,60 per 1M tokens

Cloud API is cheaper

€ 1.287/mo difference

Break-Even Volume

4.15B

tokens/mo at this cloud price

Break-Even Cloud Price

€ 1,24

per 1M to match on-prem

1-Year TCO (On-Prem)

€ 29.847

vs € 14.400 cloud

3-Year TCO (On-Prem)

€ 89.540

vs € 43.200 cloud

On-Prem Monthly Breakdown
Hardware (amortised capex): € 1.556
Power (1 431 kWh/mo): € 358
Ops overhead (30%): € 574
Total: € 2.487 / month

Estimates only. Break-even compares this fixed on-prem sizing against cloud cost that scales linearly with tokens, at the cloud price you entered. Real on-prem cost steps up as you add GPUs, and cloud discounts apply at volume.

How the Math Works

Every figure traces back to inputs you can edit. Here is each step.

1. Token Volume

Monthly Tokens = Requests x (Avg Input + Avg Output)
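As a quick check of the formula above, here is a minimal Python sketch using the workload shown on this page (1 000 000 requests, 1 500 tokens in, 500 tokens out). The function name `monthly_tokens` is hypothetical and not part of the calculator.

```python
def monthly_tokens(requests: int, avg_in: int, avg_out: int) -> dict:
    """Split the monthly token volume into input, output and total."""
    tokens_in = requests * avg_in
    tokens_out = requests * avg_out
    return {"in": tokens_in, "out": tokens_out, "total": tokens_in + tokens_out}


volume = monthly_tokens(1_000_000, 1_500, 500)
print(volume)
```

This reproduces the 1.50B in + 500.0M out = 2.00B tokens shown under Total Monthly Token Volume.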

2. Cluster Sizing

Throughput is quoted as sustained output tokens per second for the minimum cluster. We size on output tokens because decoding is the generation bottleneck, and we never go below the number of GPUs needed to hold the weights and KV cache.

Output Tokens / GPU-month = Tokens/sec x 3600 x 730 x Utilisation
GPUs = max( ceil(Monthly Output Tokens / Output Tokens per GPU-month), Min GPUs to Host )
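The sizing rule can be sketched in Python as below. Two things here are assumptions rather than statements from this page: the throughput value of 1 500 tokens/sec is purely illustrative, and because throughput is quoted for the whole minimum cluster, the sketch divides it by the minimum GPU count to get a per-GPU figure. The calculator may handle that conversion differently.

```python
import math

HOURS_PER_MONTH = 730  # same month length the formulas on this page use


def gpus_needed(output_tokens_per_month: float,
                cluster_tokens_per_sec: float,
                min_gpus: int,
                utilisation: float) -> int:
    """Return the larger of the throughput-bound and hosting-bound GPU counts."""
    # Assumption: per-GPU throughput = minimum-cluster throughput / min GPUs.
    per_gpu_tokens_per_sec = cluster_tokens_per_sec / min_gpus
    per_gpu_month = per_gpu_tokens_per_sec * 3600 * HOURS_PER_MONTH * utilisation
    throughput_bound = math.ceil(output_tokens_per_month / per_gpu_month)
    return max(throughput_bound, min_gpus)


# 500M output tokens/month, 40% utilisation, 2 GPUs needed to host the model.
print(gpus_needed(500_000_000, 1_500, 2, 0.40))    # hosting-bound case
print(gpus_needed(5_000_000_000, 1_500, 2, 0.40))  # throughput-bound case
```

With these illustrative inputs the first call stays at the hosting floor of 2 GPUs, which is what "hosting-bound" means in the comparison header; the second call needs more GPUs than the floor because the output volume is ten times larger.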

3. On-Prem Monthly Cost

Buy mode amortises capex over your chosen lifespan and adds metered power. Rent mode uses an all-in GPU-hour rate, so power is not added twice. Ops overhead covers engineering, monitoring and hosting.

Capex/mo = (GPUs x GPU Price) / (Amort Years x 12)
Rental/mo = GPUs x Rate/hr x 730
Power/mo = (GPUs x Board W x PUE / 1000) x 730 x Energy Price
On-Prem/mo = (Hardware + Power) x (1 + Ops%)
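A minimal Python sketch of these cost formulas follows. The input values in the example (28 000 EUR per GPU, 3-year amortisation, 700 W board power, 0.25 EUR/kWh) are assumptions chosen because they reproduce the monthly breakdown shown on this page; they are not confirmed defaults of the calculator. The function name, the `mode` parameter and the rental rate of 2.00 EUR per GPU-hour are likewise hypothetical.

```python
HOURS_PER_MONTH = 730


def on_prem_monthly(gpus: int, mode: str, ops_overhead: float,
                    gpu_price: float = 0.0, amort_years: float = 1.0,
                    board_watts: float = 0.0, pue: float = 1.0,
                    energy_price: float = 0.0,
                    rate_per_hour: float = 0.0) -> dict:
    """Monthly on-prem cost split into hardware, power and ops overhead."""
    if mode == "buy":
        hardware = gpus * gpu_price / (amort_years * 12)
        kwh = gpus * board_watts * pue / 1000 * HOURS_PER_MONTH
        power = kwh * energy_price
    elif mode == "rent":
        # The rental rate is all-in, so power is not added a second time.
        hardware = gpus * rate_per_hour * HOURS_PER_MONTH
        power = 0.0
    else:
        raise ValueError("mode must be 'buy' or 'rent'")
    ops = (hardware + power) * ops_overhead
    return {"hardware": hardware, "power": power, "ops": ops,
            "total": hardware + power + ops}


# Assumed inputs that reproduce the breakdown shown on this page.
buy = on_prem_monthly(2, "buy", 0.30, gpu_price=28_000, amort_years=3,
                      board_watts=700, pue=1.40, energy_price=0.25)
print({name: round(value) for name, value in buy.items()})

# Hypothetical rental rate, for comparison only.
rent = on_prem_monthly(2, "rent", 0.30, rate_per_hour=2.00)
print(round(rent["total"]))
```

Under those assumed inputs the buy-mode result rounds to 1556 hardware, 358 power, 574 ops and 2487 total, matching the On-Prem Monthly Breakdown.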

4. Cloud API Cost

Cloud/mo = (Monthly Tokens / 1,000,000) x Price per 1M
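For illustration, the same formula in Python, using the 2.00B monthly tokens and the 0.60 EUR per 1M tokens shown in the comparison. The function name `cloud_monthly` is hypothetical.

```python
def cloud_monthly(monthly_tokens: float, price_per_million: float) -> float:
    """Cloud API cost for one month at a blended per-1M-token price."""
    return monthly_tokens / 1_000_000 * price_per_million


cloud_cost = cloud_monthly(2_000_000_000, 0.60)
print(round(cloud_cost, 2))
```

The result matches the 1200 EUR per month shown for the Cloud API.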

5. Break-Even & TCO

Once provisioned, on-prem cost is largely fixed while cloud scales per token. Break-even is the volume where cloud cost equals your on-prem monthly cost, at the price you entered.

Break-Even Volume = (On-Prem/mo / Cloud Price per 1M) x 1,000,000
Break-Even Cloud Price = On-Prem/mo / (Monthly Tokens / 1M)
1-Year TCO = Monthly x 12
3-Year TCO = Monthly x 36
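The break-even and TCO formulas can be sketched in Python as below, using the rounded monthly figures from the comparison (2487 EUR on-prem, 1200 EUR cloud, 0.60 EUR per 1M tokens). The function names are hypothetical. Because the sketch starts from the rounded on-prem total, its on-prem TCO can differ from the displayed figures by a few euros.

```python
def break_even_volume(on_prem_per_month: float, cloud_price_per_million: float) -> float:
    """Monthly tokens at which cloud cost equals the fixed on-prem cost."""
    return on_prem_per_month / cloud_price_per_million * 1_000_000


def break_even_cloud_price(on_prem_per_month: float, monthly_tokens: float) -> float:
    """Cloud price per 1M tokens that would match on-prem at this volume."""
    return on_prem_per_month / (monthly_tokens / 1_000_000)


def tco(monthly_cost: float, years: int) -> float:
    """Total cost over a number of years, assuming a constant monthly cost."""
    return monthly_cost * 12 * years


print(break_even_volume(2487, 0.60))                 # about 4.145 billion tokens
print(round(break_even_cloud_price(2487, 2_000_000_000), 2))
print(tco(2487, 1), tco(2487, 3))                    # on-prem, from rounded monthly
print(tco(1200, 1), tco(1200, 3))                    # cloud
```

Note that the break-even volume assumes the on-prem sizing stays fixed; if the extra volume requires more GPUs, the on-prem cost steps up and the true break-even moves.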

What this does not model

This calculator does not model redundancy and failover GPUs, networking and storage capex, one-off fine-tuning, software licences, model quality differences, cold-start latency, or cloud volume discounts. Real deployments add a redundant node and headroom for traffic spikes. Use this as a directional model, not a quote. Throughput and prices shift quickly, so confirm against your own benchmarks and supplier figures.

Thinking about hosting your own LLM?

Data residency, GDPR and cost control are why many Austrian and European teams move inference on-prem. We scope the model, hardware and rollout so the numbers above turn into a real plan.
