Estimate the monthly cost of self-hosting an open model on your own GPUs, then compare it against a per-token cloud API. See the break-even token volume and the 1-year and 3-year total cost of ownership. Every assumption below is editable; these are estimates, not quotes.
Sized for 2x NVIDIA H100 80GB (hosting-bound)
Estimates only. Break-even compares this fixed on-prem sizing against cloud cost that scales linearly with tokens, at the cloud price you entered. Real on-prem cost steps up as you add GPUs, and cloud discounts apply at volume.
Every figure traces back to inputs you can edit. Here is each step.
Monthly Tokens = Requests x (Avg Input + Avg Output)
Throughput is quoted as sustained output tokens per second for the minimum cluster. We size on output tokens because decoding is the generation bottleneck, then never go below the GPUs needed to hold the weights.
Output Tokens / GPU-month = Tokens/sec x 3600 x 730 x Utilisation
GPUs = max( ceil(Output Tokens / per-GPU-month), Min GPUs to Host )
Buy mode amortises capex over your chosen lifespan and adds metered power. Rent mode uses an all-in GPU-hour rate, so power is not added twice. Ops overhead covers engineering, monitoring and hosting.
Capex/mo = (GPUs x GPU Price) / (Amort Years x 12)
Rental/mo = GPUs x Rate/hr x 730
Power/mo = (GPUs x Board W x PUE / 1000) x 730 x Energy Price
On-Prem/mo = (Hardware + Power) x (1 + Ops%)
Cloud/mo = (Monthly Tokens / 1,000,000) x Price per 1M
Once provisioned, on-prem cost is largely fixed while cloud scales per token. Break-even is the volume where cloud cost equals your on-prem monthly cost, at the price you entered.
Break-Even Volume = (On-Prem/mo / Cloud Price per 1M) x 1,000,000
Break-Even Cloud Price = On-Prem/mo / (Monthly Tokens / 1M)
1-Year TCO = Monthly x 12 ; 3-Year TCO = Monthly x 36
Not modelled: redundancy and failover GPUs, networking and storage capex, one-off fine-tuning, software licences, model quality differences, cold-start latency, and cloud volume discounts. Real deployments add a redundant node and headroom for traffic spikes. Use this as a directional model, not a quote. Throughput and prices shift quickly, so confirm against your own benchmarks and supplier figures.
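The formulas above can be chained into one small calculation. The Python sketch below is a minimal illustration, not the calculator's actual code: the `Inputs` class, the `estimate` function and every default value are hypothetical placeholders. It also makes two assumptions the text leaves open: that Tokens/sec is a per-GPU figure, and that the output tokens used for sizing are Requests x Avg Output.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # same constant the formulas use


@dataclass
class Inputs:
    # All defaults are illustrative placeholders, not benchmarks or quotes.
    requests_per_month: float = 1_000_000
    avg_input_tokens: float = 1_000
    avg_output_tokens: float = 500
    tokens_per_sec: float = 1_000        # assumed: sustained output tokens/sec per GPU
    utilisation: float = 0.5
    min_gpus_to_host: int = 2
    buy: bool = True                     # True = buy mode, False = rent mode
    gpu_price: float = 30_000
    amort_years: float = 3
    rate_per_gpu_hour: float = 2.5       # all-in rental rate, power included
    board_watts: float = 700
    pue: float = 1.3
    energy_price_per_kwh: float = 0.25
    ops_overhead: float = 0.2            # Ops% as a fraction
    cloud_price_per_1m: float = 2.0


def estimate(i: Inputs) -> dict:
    # Monthly Tokens = Requests x (Avg Input + Avg Output)
    monthly_tokens = i.requests_per_month * (i.avg_input_tokens + i.avg_output_tokens)

    # Size on output tokens, never below the GPUs needed to hold the weights.
    # Assumption: sizing uses Requests x Avg Output.
    output_tokens = i.requests_per_month * i.avg_output_tokens
    per_gpu_month = i.tokens_per_sec * 3600 * HOURS_PER_MONTH * i.utilisation
    gpus = max(math.ceil(output_tokens / per_gpu_month), i.min_gpus_to_host)

    if i.buy:
        # Buy mode: amortised capex plus metered power.
        hardware = (gpus * i.gpu_price) / (i.amort_years * 12)
        power = (gpus * i.board_watts * i.pue / 1000) * HOURS_PER_MONTH * i.energy_price_per_kwh
    else:
        # Rent mode: all-in GPU-hour rate, so power is not added twice.
        hardware = gpus * i.rate_per_gpu_hour * HOURS_PER_MONTH
        power = 0.0

    on_prem = (hardware + power) * (1 + i.ops_overhead)
    cloud = (monthly_tokens / 1_000_000) * i.cloud_price_per_1m

    return {
        "gpus": gpus,
        "on_prem_per_month": on_prem,
        "cloud_per_month": cloud,
        "break_even_tokens": (on_prem / i.cloud_price_per_1m) * 1_000_000,
        "break_even_cloud_price": on_prem / (monthly_tokens / 1_000_000),
        "tco_1y": on_prem * 12,
        "tco_3y": on_prem * 36,
    }


if __name__ == "__main__":
    for key, value in estimate(Inputs()).items():
        print(f"{key}: {value:,.2f}")
```

With these placeholder inputs, one GPU would cover the output throughput, so the minimum needed to host the weights sets the GPU count at 2 (the hosting-bound case). Calling `estimate(Inputs(buy=False))` switches to rent mode and drops the separate power term.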
Data residency, GDPR and cost control are why many Austrian and European teams move inference on-prem. We scope the model, hardware and rollout so the numbers above turn into a real plan.