Research Report — 2026

How to Deploy an LLM On-Premise in 2026

A step-by-step engineering guide to deploying a large language model on hardware you own and control: VRAM and GPU sizing math, model selection, vLLM vs NVIDIA NIM, multi-GPU scaling, quantization, air-gapped setup, CPU inference on Intel Xeon, and total cost of ownership versus the cloud.

On-Prem LLM
VRAM Sizing
vLLM
NVIDIA NIM
Air-Gapped
~14 GBVRAM for a 7B model at FP16
8–18xCheaper per token vs cloud (at scale)
~3.7 mo8x H100 break-even vs Azure on-demand
71%AI infra outside public cloud by 2025

On-Premise LLM Deployment in 2026: What This Guide Covers

Deploying a large language model on-premise means running the full inference stack — model weights, serving engine, and API — on hardware you own and control, inside your own network boundary, with no dependency on a third-party API. For enterprises in regulated industries, that control is increasingly non-negotiable: by 2025, roughly 71% of AI infrastructure ran outside the public cloud, a shift driven heavily by financial-services data-residency requirements and the arrival of enforceable AI regulation.

The good news for platform teams is that the open-weight model ecosystem has matured to the point where self-hosted models rival frontier hosted APIs on most enterprise tasks, and the serving software — vLLM, NVIDIA NIM, SGLang, TensorRT-LLM — is production-hardened. The hard part is no longer "can we run it" but "how do we size it correctly and operate it reliably." This guide walks the full decision path in order: when on-prem makes sense, how to choose a model, how to compute exact VRAM requirements, how to select GPUs, how to pick and configure a serving stack, how to scale across GPUs and nodes, how to quantize, how to plan capacity from real demand, how to deploy in an air-gapped enclave, what it actually costs versus cloud, and how to run it in production.

A note on numbers: hardware specs and formulas in this guide are stable, but model versions and software defaults drift monthly. Where we name models, we use families and tiers rather than chasing point releases; where benchmarks are version-specific, we say so. For deeper dives, see our companion Hardware Sizing Guide, LLM Selection Guide, and Best AI Tools for Air-Gapped Environments.

On-Prem vs Cloud: When to Run LLMs in Your Own Data Center

The decision between on-premise and a hosted API turns on four axes: utilization, volume, data sovereignty, and latency.

On-premise wins when inference demand is sustained, predictable, and high-volume. The economics are unforgiving of idle GPUs but generous to busy ones — against hyperscaler on-demand pricing, an owned cluster typically breaks even somewhere above roughly 50–83% sustained GPU utilization, and a fully-utilized owned cluster delivers token costs 8–18x lower than equivalent cloud over a multi-year horizon. It also wins outright when data sovereignty is a legal mandate: under GDPR Article 46, EU financial institutions cannot freely route customer data through US-hosted LLM APIs, and the EU AI Act's general-purpose-AI obligations — enforceable since August 2025 — carry fines up to €35 million or 7% of global turnover. For regulated finance, healthcare, government, and defense, the deployment location is decided before any cost spreadsheet is opened.

Cloud and hosted APIs win when demand is spiky or unpredictable, when volume is low (small and mid-size workloads below ~10M tokens/month outside small-model cases), when you need frontier closed models, or when you must scale fast without capital expenditure. Token prices on hosted APIs also fell roughly 80% from 2025 to 2026, which structurally erodes the on-prem cost advantage over time and should be modeled, not assumed away.

Rule of Thumb
If you can keep GPUs busy more than ~4–5 hours per day equivalent over a multi-year horizon, or if regulation forces your hand, on-prem is the default; otherwise start in the cloud and revisit at volume. For a full treatment of the crossover math, see Edge AI vs Cloud Economics and the dedicated TCO section later in this guide.

Step 1: Choose the Right Open-Weight Model

License first: what you can legally productize

For enterprise on-prem, the license gates everything — pick the model your legal team can clear before you benchmark quality. Three tiers matter:

Apache 2.0 / MIT

Fully permissive: no monthly-active-user caps, no naming obligations, explicit patent grant (Apache). Covers gpt-oss-120b/20b, all Qwen3 models, Mistral Small 3.x and Mixtral (Apache 2.0), and DeepSeek V3/R1 plus Phi-4 (MIT). MIT is the most permissive — DeepSeek even permits downstream distillation.

Custom community (Llama 4)

The Llama 4 Community License adds a clause requiring a separate Meta license if your products exceeded 700 million monthly active users in the calendar month before the model's release date, plus "Built with Llama" attribution and a "Llama-" model-name prefix. It is not OSI-approved open source.

Use-restricted (Gemma)

Google's Gemma license permits commercial use after accepting its Terms of Use and Prohibited Use Policy, but is not Apache/MIT and carries redistribution restrictions.

MoE vs dense: the distinction that decides serveability

Mixture-of-Experts (MoE) models — gpt-oss, Qwen 3.5 / 3.6, DeepSeek V3.2 / V4, Kimi K2 Thinking, GLM-5, MiniMax M2.7, Gemma 4 26B-A4B, Llama 4 — activate only a fraction of their parameters per token, which lowers per-token compute and raises throughput. But the critical sizing insight is that VRAM must hold all total parameters (every expert is resident in GPU memory), while only the active parameters drive compute. A 671B MoE with 37B active still needs roughly 700 GB at FP8, and the 2026 frontier open-weights (DeepSeek V4, Kimi K2, GLM-5) are 700B–1.6T total — they require multi-GPU and usually multi-node serving (see Step 6: Scaling). Dense models (Qwen3.6-27B, Phi-4, Gemma 3, Mistral Small) are simpler to serve, have more predictable latency, and are easier to fully fit and quantize on a single GPU — which is why they are often the better choice for constrained single-node on-prem. MoE models are also especially CPU-friendly because so few parameters activate per token — see Option B: Run Inference on Intel Xeon CPUs.

Table 1 — Open-Weight Model Landscape for Enterprise On-Prem (May 2026)

ModelTotal ParamsActive ParamsArchContextLicenseOn-prem note
Frontier open-weights — multi-GPU / multi-node
DeepSeek V4-Pro~1.6T~49BMoE1MMITMost permissive frontier; needs a multi-node cluster (8+ GPUs at FP8/INT4)
Kimi K2 Thinking~1T~32BMoE (reasoning)256KModified MITTop agentic / coding scores (SWE-bench Pro leader); multi-node
GLM-5~744B~40BMoE200KMITStrong permissive frontier; multi-GPU
DeepSeek V3.2671B~37BMoE (MLA, 256+1 shared)128KMITMost permissive; distillation allowed; MLA shrinks KV cache ~28x
DeepSeek R1671B37BMoE (reasoning)128KMITDistilled 1.5–70B variants run on a single GPU
MiniMax M2.7~230B~10BMoE200K+Modified MITLong-context agentic; open weights
Qwen3.5-397B-A17B397B17BMoE (GDN + sparse)262K (→1M)Apache 2.0Largest open Qwen flagship; fully permissive
Llama 4 Maverick400B17BMoE (128 exp), multimodal1MLlama 4 Community700M-MAU clause; "Built with Llama"
Deployable single-node (1–2x 80GB GPUs)
gpt-oss-120b116.8B5.1BMoE (128 exp/4 active)131KApache 2.0Fully permissive; single 80GB GPU via MXFP4
Llama 4 Scout109B17BMoE (16 exp), multimodalup to 10MLlama 4 CommunityFits 1–2x 80GB; 700M-MAU clause
Qwen3-Coder-Next80B3BMoE256KApache 2.0Coding / agents; very low active-param footprint
Devstral 2 (Mistral)123B123BDense256KApache 2.0Coding-tuned dense; predictable latency
Small / single-GPU / edge & CPU
Qwen3.6-27B (dense)27B27BDense262KApache 2.0Single-GPU general / RAG; long context
Gemma 3 27B27B27BDense, multimodal128KGemma (use-restricted)Commercial OK after terms; not OSI
Mistral Small 3.x 24B24B24BDense, multimodal128KApache 2.0Strong single-GPU mid-size pick
gpt-oss-20b20.9B3.6BMoE (32 exp/4 active)131KApache 2.0Runs in 16GB; ideal for edge / Xeon CPU (Option B)
Phi-4 14B14B14BDense128KMITStrong math; synthetic-data trained

Total/active parameter counts marked with "~" are approximate where providers have not published exact figures for the newest frontier releases — verify specs and license terms against the model card before sizing. Benchmark scores and rankings for these models are tracked live on our LLM Benchmark Repository.

Table 1b — Full Variant Lineups: Qwen 3.5, Qwen 3.6 & Gemma 4 (May 2026)

These three families ship a complete size ladder from sub-1B edge models to 397B-parameter MoE flagships, all under permissive licenses — making them the most common starting point for on-prem standardization. The small Gemma 4 and Qwen variants are also the best fit for Intel Xeon CPU inference (Option B); the Gemma 4 26B-A4B MoE is the exact model benchmarked there.

VariantTotal ParamsActive ParamsArch / ModalityContextLicenseBest-fit deployment
Qwen 3.6 — open weights (Apr 2026; multimodal, hybrid-thinking)
Qwen3.6-35B-A3B35B~3BMoE / text+vision+code262K (→1M)Apache 2.0Flagship open MoE; ~21 GB at Q4, ~120 tok/s on one RTX 4090
Qwen3.6-27B27B27BDense / text+vision+code262K (→1M)Apache 2.0Flagship-level coding; ~16.8 GB at Q4 on a single consumer GPU
Qwen3-Coder-Next80B3BMoE / code+agents256KApache 2.0Coding/agent specialist; very low active footprint
Qwen 3.5 — full family, 0.8B–397B (Feb 2026; multimodal, GDN + MoE, 262K native)
Qwen3.5-397B-A17B397B17BMoE / multimodal262K (→1M)Apache 2.0Frontier flagship; multi-node cluster
Qwen3.5-122B-A10B122B10BMoE / multimodal262KApache 2.0High-end; 2–4x 80 GB GPUs
Qwen3.5-35B-A3B35B3BMoE / multimodal262KApache 2.0Single-node; throughput-friendly (low active params)
Qwen3.5-27B27B27BDense / multimodal262KApache 2.0Single 48–80 GB GPU; predictable latency
Qwen3.5-9B9B9BDense / multimodal262KApache 2.0Single 24 GB GPU; punches above its size
Qwen3.5-4B4B4BDense / multimodal262KApache 2.0Lightweight agents; edge / Xeon CPU
Qwen3.5-2B2B2BDense / multimodal262KApache 2.0Phones, tablets, embedded
Qwen3.5-0.8B0.8B0.8BDense / multimodal262KApache 2.0<2 GB VRAM at full precision; micro-edge
Gemma 4 — four variants (Apr 2026; multimodal text+image, audio on small)
Gemma 4 31B30.7B30.7BDense / multimodal256KApache 2.0Flagship dense; reportedly rivals far larger models
Gemma 4 26B-A4B26B3.8BMoE (8 of 128 exp) / multimodal256KApache 2.0MoE; 3.69x faster than 31B dense on Intel Xeon CPU (Option B)
Gemma 4 E4B~4.5B eff.~4.5B eff.Dense (edge) / multimodal128KApache 2.0Edge-optimized; laptops, workstations, Xeon CPU
Gemma 4 E2B~2.3B eff. (~5.1B w/ PLE)~2.3B eff.Dense (edge) / multimodal128KApache 2.0Fits ~2 GB at Q4; runs on a Raspberry Pi

Qwen "Plus" / "Max" tiers (e.g. Qwen3.5-Plus, Qwen 3.7 Max) are hosted, closed-weight Alibaba Cloud endpoints and are not deployable on-prem — only the numbered open-weight variants above ship downloadable weights. Gemma 4 ships under the permissive Apache 2.0 license — a notable change from the use-restricted custom Gemma license used through Gemma 3. Gemma 4 "E" sizes (E2B / E4B) use effective-parameter counts (per-layer embeddings / MatFormer), so on-disk size differs from the effective figure.

Model selection by use case

Table 2 — Use-Case Model Selection (on-prem)

Use caseRecommended modelsWhy
General chat / assistantQwen3.6-27B, Gemma 4 31B, Mistral Small 3.x 24B, Llama 4 Scout (if MAU < 700M)Strong general quality, single-node serveable, permissive (except Llama)
RAG / grounded enterpriseQwen3.6-27B, Gemma 4 31B / 26B-A4B, Phi-4 14B, DeepSeek V3.2 (if cluster available)Dense, predictable latency, long context, easy to fully fit/quantize
CodingKimi K2 Thinking, Qwen3-Coder-Next, gpt-oss-120b, DeepSeek V3.2, Devstral 2Leading SWE-bench Pro / agentic-coding scores, strong tool use
Reasoning / agenticDeepSeek V4 / R1, Kimi K2 Thinking, GLM-5, Qwen 3.6 (thinking mode), gpt-oss-120bRL-trained chain-of-thought, configurable reasoning effort
Edge / CPU-constrainedGemma 4 E2B / E4B, gpt-oss-20b (16GB), Qwen3.5-2B / 4B, Phi-4Small footprint, on-device / Intel Xeon CPU inference (Option B)

For RAG specifically, model choice is only half the equation — retrieval quality dominates grounded accuracy. Pair a dense long-context model with a disciplined ingestion pipeline; see Blockify Data Ingestion for how to structure source data before it reaches the model. For a fuller decision tree across every family, see the LLM Selection Guide.

Step 2: Do the VRAM Math (Weights + KV Cache + Overhead)

GPU memory for inference splits into four buckets: model weights, KV cache, activations, and framework/CUDA overhead. Weights and KV cache dominate. Get this math right and the rest of the deployment falls into place; get it wrong and you will either over-buy hardware or hit out-of-memory failures in production.

Model weights

The weights formula is exact:

Model WeightsVRAM_weights = num_params × bytes_per_param

Table 3 — Bytes per parameter by precision

PrecisionBytes/paramVRAM per 1B params (weights)Notes
FP324~4 GBFull precision; rarely used for inference
FP16 / BF162~2 GBStandard inference precision
FP81~1 GBNative DeepSeek-V3 training/inference precision
INT81~1 GB8-bit quantization
INT4 / 4-bit0.5~0.5 GBAggressive quantization (GPTQ/AWQ/GGUF Q4)
Canonical Anchor
A 7B model in FP16 needs about 7B × 2 bytes = ~14 GB of VRAM for weights alone.

KV cache (the long-context tax)

During decoding the model caches the Key and Value tensors of every prior token so it does not recompute attention each step. NVIDIA's formulas are:

KV CacheKV bytes per token = 2 × num_layers × (num_heads × head_dim) × precision_bytes KV bytes total = batch_size × seq_len × 2 × num_layers × hidden_size × precision_bytes

The leading 2 accounts for the separate Key and Value tensors, and hidden_size = num_heads × head_dim. KV cache scales linearly with both context length and batch size while weights stay fixed — so at long context or high concurrency the KV cache can rival or exceed weight memory and becomes the binding constraint. Two corrections keep modern models from matching the naive formula's worst case:

  • GQA (Grouped-Query Attention): Replace num_heads with the smaller num_kv_heads. Llama 3 70B has 64 query heads but only 8 KV heads — an 8x KV-cache reduction versus full multi-head attention.
  • MLA (Multi-head Latent Attention), DeepSeek-V3: Stores a 512-dim latent per token instead of the full KV, roughly 28x smaller, cutting a ~213.5 GB max cache down to ~7.6 GB.
Worked KV Example
NVIDIA, Llama 2 7B, FP16, batch 1, seq 4096, 32 layers, hidden 4096: 1 × 4096 × 2 × 32 × 4096 × 2 bytes ≈ 2 GB.

Activations and framework overhead

Add a runtime multiplier on top of weights. A practical rule of thumb: total VRAM ≈ weights × 1.3–1.5 for moderate concurrency and context, rising to × 1.5–2.0 for long context or high concurrency. Modal's compact sizing formula folds this in:

Compact Sizing (Modal)M (GB) = P (billions) × (Q / 8) × 1.2 (Q = bit precision, 1.2 = ~20% overhead) Example: 70B at INT4 = 70 × (4/8) × 1.2 = 42 GB

Worked per-model VRAM tables

Table 4 — Worked VRAM examples (weights + ~15–20% overhead unless noted)

ModelParams (total / active)Config (layers / hidden / KV heads)FP16 totalINT8INT4 / 4-bit
Mistral 7B7B / 7B32 / 4096 / 8 (GQA)~18 GB~9 GB~5 GB
Llama 3.1 8B8B / 8B32 / 4096 / 8 (GQA)~20 GB~10 GB~6 GB
Llama 2 13B13B / 13B40 / 5120 / 40 (MHA)~26 GB~14 GB~8 GB
Llama 3.3 70B70B / 70B80 / 8192 / 8 (GQA)~168 GB~84 GB~46 GB
DeepSeek V3.2 (MoE)671B / 37B61 / 7168 / MLA (d_c=512)~1,543 GB~671 GB (FP8)~386 GB
MoE Sizing Gotcha
DeepSeek-V3's weights bill the full 671B parameters (all experts resident), but per-token compute bills only the 37B active. This is the single most common point of confusion in MoE sizing. For the full per-tier mapping including which GPU each cell requires, see the GPU section below and the Hardware Sizing Guide.

Step 3: Select Your GPUs (H100 / H200 / A100 / L40S / Blackwell / RTX)

The two axes that decide inference: capacity and bandwidth

Two GPU properties govern LLM serving. VRAM capacity gates which model and context length fit at all. Memory bandwidth governs decode/token-generation latency, because decode is memory-bandwidth-bound: every new token streams all model weights from HBM once per forward pass. This is why the H200 — which has compute identical to the H100 but 43% more bandwidth (4.8 TB/s vs 3.35 TB/s) — generates tokens roughly 43% faster in the small-batch (memory-bound) regime, despite no compute uplift.

Reading Note
NVIDIA datasheets usually headline the "with sparsity" (2:4) tensor numbers. The dense throughput is half. The table below reports dense / sparse explicitly so you do not double-count.

Data-center and workstation GPU comparison

Table 5 — NVIDIA Data-Center / Pro GPU Specs for LLM Inference (2025–2026)

GPU (variant)Arch / Tensor GenVRAMMem BandwidthFP8 (dense / sparse) TFLOPSFP16/BF16 (dense / sparse)FP4 (dense / sparse)NVLink/GPUTDP
A100 SXM (40GB)Ampere / 3rd40GB HBM2e1,555 GB/sN/A (no FP8)312 / 624N/ANVLink3 600 GB/s400W
A100 SXM (80GB)Ampere / 3rd80GB HBM2e~2,039 GB/sN/A (no FP8)312 / 624N/ANVLink3 600 GB/s400W
H100 SXM5 (80GB)Hopper / 4th80GB HBM33,350 GB/s1,979 / 3,958989 / 1,979N/ANVLink4 900 GB/s700W
H100 PCIe (80GB)Hopper / 4th80GB HBM2e2,000 GB/s~1,513 / ~3,026~756 / ~1,513N/ABridge 600 GB/s350W
H200 SXM (141GB)Hopper / 4th141GB HBM3e4,800 GB/s1,979 / 3,958989 / 1,979N/ANVLink4 900 GB/s700W
L4 (24GB)Ada / 4th24GB GDDR6~300 GB/s~242 / ~485~121 / ~242N/ANone (PCIe)72W
L40S (48GB)Ada / 4th48GB GDDR6 ECC864 GB/s733 / 1,466366 / 733N/ANone (PCIe)300W
RTX 6000 Ada (48GB)Ada / 4th48GB GDDR6 ECC960 GB/s~728 / ~1,457~364 / ~728N/ANone300W
B200 SXM (192GB)Blackwell / 5th192GB HBM3e8,000 GB/s4,500 / 9,0002,250 / 4,5009,000 / 18,000NVLink5 1,800 GB/s1,000W
GB200 (= 2x B200 + Grace)Blackwell / 5th2x192GB HBM3e2x 8,000 GB/s2x 4,500 dense2x 2,250 dense2x 9,000 denseNVLink5 1,800 GB/s~2,700W
RTX PRO 6000 Blackwell (96GB)Blackwell / 5th96GB GDDR7 ECC1,800 GB/s~2,000 (AI TOPS class)~4,000 AI TOPSNone600W (WS) / 300W (Server)

Organize procurement by generation, because precision support tracks it: Ampere (A100) tops out at INT8/FP16 with no FP8; Hopper (H100/H200) adds FP8; Blackwell (B200/GB200, RTX PRO 6000) adds native FP4/NVFP4, which roughly doubles throughput and halves memory versus FP8 and is the 2025–2026 cost-per-token frontier. Note that L40S, L4, and the RTX cards have no NVLink — they scale only over PCIe, which makes them better suited to pipeline parallelism than tensor parallelism (see Step 6).

Table 6 — Approximate Model-Size Fit by VRAM (weights-only, +20–40% for KV/runtime)

Model sizeFP16 weights (~2GB/1B)INT4 (~0.5GB/1B)Single-GPU fit (FP16)Single-GPU fit (INT4)
7B~14 GB~4 GBAny 24GB+ (L40S/A100/H100 easily)Any 8GB+
13B~26 GB~7 GB48GB+ (L40S / RTX 6000 Ada / A100-80 / H100)24GB (L4 tight)
34B~68 GB~17 GB80GB+ (A100-80 / H100); tight48GB (L40S / RTX 6000 Ada)
70B~140 GB~35–40 GB141GB H200 single GPU; else 2x H100 (TP)48GB tight / 80GB comfortably
180B (Falcon-class)~360 GB~90 GBMulti-GPU only96GB RTX PRO 6000 / B200 192GB
Trillion-param (MoE)Rack-scaleRack-scaleGB200 NVL72 (72-GPU NVLink domain)GB200 NVL72
Production Caveat (70B row)
NVIDIA NIM's supported production minimum for a 70B at BF16 is 4x 80GB GPUs, not 2 — because 2x 80GB technically holds the ~140 GB of weights but leaves too little room for KV cache at realistic context and concurrency.

Consumer GPUs (RTX 4090 / 5090): where they fit and where they stop

For 7B–13B single-GPU inference, consumer cards are genuinely competitive and cost a fraction of data-center GPUs — an RTX 4090 can match or beat an A100 on small models. The ceilings are firm: a single 24GB 4090 tops out around 32B at Q4; a 32GB 5090 fits 32B at Q8 and 70B only at aggressive Q2/Q3 with tiny context; comfortable 70B-Q4 needs dual GPUs (48GB combined). Neither consumer card has NVLink, so multi-GPU communication runs over PCIe, achieving roughly 85–90% of NVLink-linked throughput with about a 30% loss versus a monolithic 80GB card. For any model needing 40–80GB+ of VRAM there is no consumer alternative — data-center cards are required.

Table 7 — Consumer vs Data-Center Spec Comparison

GPUVRAMBandwidthNVLinkNative FP4TDPMSRP
RTX 409024GB GDDR6X~1.0 TB/sNoNo450W$1,599
RTX 509032GB GDDR71.79 TB/sNoYes (MXFP4)575W$1,999
A10040/80GB HBM2e~2.0 TB/sYesNo400Wdata-center
H100 SXM80GB HBM3~3.35 TB/sYes (NVLink/NVSwitch)No (FP8)700Wdata-center
H200141GB HBM3e~4.8 TB/sYesNo (FP8)700Wdata-center

Option B: Run Inference on Intel Xeon CPUs with AirgapAI Edge (No GPU)

GPUs are the default path, but they are not the only one. AirgapAI Edge runs LLM inference entirely on Intel Xeon CPUs — no GPU required — using Intel AMX (Advanced Matrix Extensions) acceleration with the OpenVINO Model Server (and llama.cpp built with AMX kernels). For teams with no GPUs, constrained power and cooling, or an existing Xeon fleet to reuse, this turns on-prem LLM serving into a software problem rather than a hardware procurement project. Learn more on the AirgapAI product page, and see the crossover math in Edge AI vs Cloud Economics.

2026 Internal Benchmark (verified)
A Gemma-class 26B-A4B Mixture-of-Experts model (~4B active of 26B total) at INT8 (Q8_0), on a single half-socket Intel Xeon 6 (Granite Rapids, 48 physical cores, 768 GiB RAM), delivered ~32 tokens/sec single-stream decode — about 3x reading speed and fully interactive — and scaled near-linearly to ~105 tokens/sec aggregate at 16 concurrent requests (16/16 success, zero failures).

Why MoE makes CPU inference viable

On the same box, the 26B-A4B MoE model ran 3.69x faster than a 31B dense model (8.77 tok/s at INT4) — because the MoE activates only ~4B parameters per token, dramatically easing the CPU memory-bandwidth bottleneck that throttles dense models. CPU inference is fundamentally memory-bandwidth bound: streaming fewer active weights per token is exactly what a CPU needs to stay interactive.

Table B1 — AirgapAI Edge on a Single Half-Socket Intel Xeon 6 (Granite Rapids, 48 cores, 768 GiB)

Model / PrecisionActive ParamsSingle-stream decodeAggregate @ 16 concurrentCost per page (16-way)Throughput per box
Gemma-class 26B-A4B MoE (INT8 / Q8_0)~4B of 26B~32 tok/s (~3x reading speed)~105 tok/s (16/16 success)~$0.044 per pageup to ~4,100 pages/day
31B dense model (INT4)31B (all)8.77 tok/s— (MoE 3.69x faster)

Economics, fully on-prem / in-VPC: at 16-way concurrency on a 600-token-in to 2,000-token-out workload, AirgapAI Edge costs roughly $0.044 per page, about $181/day per box, processing up to ~4,100 pages/day per box — with no GPU, no data egress, and no per-token API fee.

Why CPU inference is fast enough in 2026

Intel AMX-INT8

AMX-INT8 kernels deliver roughly 2x the throughput of AMX-BF16 on Granite Rapids, turning the tile-matrix unit into the inference workhorse.

INT8 / INT4 + u8 KV cache

INT8/INT4 weight quantization plus an 8-bit (u8) KV cache shrink the memory footprint and the bandwidth the CPU must stream per token.

Prompt-lookup decoding

A free 1.4–2x speedup on RAG and grounded tasks where output echoes input — no draft model required.

Multi-token prediction

Multi-token-prediction / speculative decoding yields up to 2–3x on MoE models, compounding the AMX and quantization gains.

The Sweet Spot (and the Limits)
CPU inference is memory-bandwidth bound: large dense 70B-class models are slow on CPU. The sweet spot is small-to-mid dense models, MoE models, and throughput/batch workloads — extraction, RAG, summarization, classification — where aggregate tok/s and concurrency beat single-stream latency. GPUs still win for very large dense models and high-concurrency, low-latency interactive chat at scale.

When to choose AirgapAI Edge over GPUs

Choose Xeon CPU
AirgapAI Edge
No or constrained GPUs; edge / branch / forward-deployed / air-gapped sites where GPU power, cooling, and physical security are impractical; an existing Xeon fleet to reuse; cost-sensitive batch pipelines; SCIF / classified enclaves.
Choose GPUs
NVIDIA / Data-Center
Very large dense models (70B+ dense), and high-concurrency, low-latency interactive chat at scale where single-stream latency dominates the user experience.

AirgapAI Edge is fully offline / air-gapped — OpenVINO runs from a local model IR with no telemetry — and pairs with Blockify for on-prem RAG ingestion, so the entire retrieval-and-generation pipeline stays inside your boundary on CPU hardware you already own.

Step 4: Pick a Serving Stack — vLLM vs NVIDIA NIM

The serving engine is what turns model weights into a production API. The two leading choices for on-prem are vLLM (open-source, maximum flexibility) and NVIDIA NIM (enterprise-packaged, vendor-supported). They share an OpenAI-compatible API surface, so application code rarely changes when you switch.

vLLM: the open-source default

vLLM is a high-throughput, memory-efficient inference engine originally from UC Berkeley (2023), built around two innovations:

  • PagedAttention applies OS-style virtual-memory paging to the KV cache. Each sequence's KV cache is addressed through a logical block table mapping to non-contiguous physical blocks (default block size 16 tokens), eliminating the contiguous-allocation fragmentation that wasted 60–80% of KV memory in naive serving — reducing waste to under 4%.
  • Continuous (in-flight) batching schedules at the per-token level: when a request finishes it immediately frees its KV blocks and the next queued request is admitted on the following step, keeping the GPU near 100% utilized. vLLM cites up to ~4x more tokens/sec versus naive Hugging Face generation.

You launch it with vllm serve <model>, which listens on 0.0.0.0:8000 and exposes /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, plus /health and a Prometheus /metrics endpoint. Connect with the standard OpenAI Python client by setting base_url='http://localhost:8000/v1'; require auth with --api-key or the VLLM_API_KEY environment variable.

Minimal single-GPU vLLM launch# Minimal single-GPU vLLM launch vllm serve /models/qwen3-32b --gpu-memory-utilization 0.9 --max-model-len 32768 --api-key "$VLLM_API_KEY"

Table 8 — Most-used vllm serve engine arguments

FlagPurposeDefault / typical
--tensor-parallel-sizeShard model across GPUs in a node= GPUs per node
--pipeline-parallel-sizeSplit layers across nodes= number of nodes
--gpu-memory-utilizationFraction of VRAM for weights+activations+KV0.9
--max-model-lenMax context lengthmodel default
--max-num-batched-tokensPer-step token budget (controls chunked prefill)version/model dependent
--max-num-seqsMax concurrent sequencesversion dependent
--block-sizeKV cache block size (tokens)16
--kv-cache-dtypeKV cache precision (e.g. fp8)auto
--quantizationWeight quantization methodnone
--host / --portBind address / port0.0.0.0 / 8000

The current V1 engine (default since vLLM v0.8.0) is a core rewrite delivering up to 1.7x higher throughput than V0, with FlashAttention 3, piecewise CUDA graphs, and near-zero-overhead prefix caching (under 1% throughput drop even at a 0% cache-hit rate, so it is on by default).

NVIDIA NIM: the enterprise-packaged option

NVIDIA NIM (NVIDIA Inference Microservices) packages a model, an optimized inference engine, and an OpenAI-compatible API server into a single prebuilt Docker container that runs on NVIDIA GPUs anywhere. It auto-selects among TensorRT-LLM, vLLM, and SGLang backends and applies performance-tuned settings; the NIM LLM 2.0 line moved to a "one container, one backend" philosophy built on vLLM for predictable behavior. The default serving port is 8000, with native OpenAI endpoints plus a /metrics observability endpoint.

Prerequisites for the latest NIM LLM (2026): NVIDIA driver 580+ with CUDA 13.0+ (older NIMs accept CUDA 12.1+), Docker ≥ 19.03, and the NVIDIA Container Toolkit; the CUDA Toolkit does not need to be on the host, only the driver. A typical single-node run:

Single-node NIM containerdocker run --runtime=nvidia --gpus all --shm-size=16GB -v ~/.cache/nim:/opt/nim/.cache -u $(id -u) -p 8000:8000 <nim-llm-container>

Table 9 — NIM offering tiers

TierPurposeNotable attributes
NIM Day 0Rapid access to newly released modelsEarliest availability, less hardening
NIM TurboValidated performancePerformance-optimized, validated profiles
NIM CertifiedEnterprise productionCVE patching, OSRB open-source review compliance, AI Enterprise support

Table 10 — NIM licensing / access tiers

TierCostLimits / terms
Developer Program (free)$0Up to 2 nodes / 16 GPUs; 1,000 inference credits at signup (up to 5,000 on request); research/dev/test only
AI Enterprise 90-day eval$0 for 90 daysFree evaluation license for production validation
AI Enterprise (production)~$4,500 per GPU/year or ~$1 per GPU/hour (cloud)Per-GPU pricing (not per-NIM); same price regardless of GPU size; includes support + Certified NIMs

The AI Enterprise list price (~$4,500/GPU/year) is a starting figure subject to volume and term discounts — confirm with NVIDIA sales.

vLLM vs NIM vs the rest of the field

The honest tradeoff: vLLM gives maximum flexibility, zero license cost, and the fastest access to new open models, at the price of you owning integration, hardening, and support. NIM gives a turnkey container with vendor SLAs, proactive security patching, and validated performance profiles, at the price of NVIDIA AI Enterprise licensing and tighter version coupling. Raw throughput between the top GPU engines is narrow — within roughly 15% — and flips by workload.

Table 11 — On-Prem LLM Serving Engine Comparison (2026)

EngineCore TechOpenAI-CompatibleQuantizationThroughput TierEase of SetupEnterprise SupportBest-Fit Use Case
vLLMPagedAttention + continuous batchingYesGPTQ, AWQ, FP8Highest (100+ QPS)ModerateCommunity / commercial via vendorsGeneral-purpose production multi-user GPU serving
NVIDIA NIMPrebuilt optimized containersYesFP8 + TRT-LLM quantHighEasy (turnkey)Yes — NVIDIA AI Enterprise (SLAs, security patches)Enterprises needing vendor support, stability, security SLAs
TensorRT-LLMCompiled CUDA kernels + KV reuseYes (via Triton/serve)FP8, paged+quantized KVHighest latency-optimized (NVIDIA-only)Hard (long compile)Via NVIDIA AI EnterpriseLatency-sensitive, high-volume, NVIDIA-standardized fleets
SGLangRadixAttention (radix-tree KV reuse)YesFP8, AWQVery high on shared-contextModerateCommunityAgents, RAG, structured generation, high prefix reuse
Hugging Face TGI v3Chunked prefill + prefix cachingYesGPTQ, AWQ, EETQHighModerateCommunity (upstream in maintenance mode 2026)HF-ecosystem teams, long chat histories
OllamaWraps llama.cpp; auto model mgmtYesGGUF (Q2–Q8)Medium (10–50 QPS)Easiest (one command)CommunityLocal dev, prototyping
llama.cppC/C++ GGUF runtimeYes (server mode)GGUF (Q2–Q8)Low-medium (5–30 QPS)Easy (binary + GGUF)CommunityCPU-only servers, edge, embedded

Table 12 — Single H100 SXM5 80GB Benchmark, Llama-3.3-70B-Instruct FP8 (~512 in / ~256 out)

MetricConcurrencyvLLM v0.18.0TensorRT-LLM v1.2.0SGLang v0.5.9
Throughput (output tok/s)1 req120130125
Throughput (output tok/s)10 req650710680
Throughput (output tok/s)50 req1,8502,1001,920
Throughput (output tok/s)100 req2,4002,7802,460
TTFT p50 (ms)100 req740680710
TTFT p95 (ms)100 req1,4501,2801,380
Peak VRAM @100 req (GB)100 req787978
Cold startfirst load~62 s~28 min (compile)~58 s

The decisive operational figure is the cold start: TensorRT-LLM's ~28-minute first-time engine compile (subsequent reloads ~90s) makes it painful for rapid model iteration, whereas vLLM and SGLang start in about a minute. A common, sound pattern is to develop and prototype on Ollama or llama.cpp, then serve production on vLLM or NIM. For the broader tool landscape, see Best Local AI Tools for Enterprise.

The AI Strategy Blueprint Book Cover
Recommended Reading

The AI Strategy Blueprint

The executive playbook for aligning AI strategy with infrastructure decisions — covering model selection, deployment architecture, security, and the ROI frameworks behind on-premise and edge AI investments.

5.0 on Amazon
$24.95
Get it on Amazon
Infrastructure Chapters
Deployment Playbooks
Security Architecture
ROI Frameworks

Step 5: Understand Throughput and Latency (Tokens/sec, TTFT, ITL)

Four metrics define serving performance:

  • Throughput — total output tokens/sec across all concurrent requests.
  • TTFT (Time To First Token) — latency from request to first token, dominated by prefill of the input prompt.
  • ITL (Inter-Token Latency), a.k.a. TPOT — time between successive output tokens during decode. Per-request decode speed = 1000 / ITL tokens/sec.
  • Goodput — throughput that meets your SLOs.

The mechanism that explains everything: prefill is compute-bound, decode is memory-bandwidth-bound. Aggregate system throughput and per-request latency move in opposite directions as concurrency rises — continuous batching keeps the GPU busy and lifts total tokens/sec, but each individual request's ITL grows because the GPU time-slices decode across more sequences.

Concrete anchors: a single H100 running Llama 3.1 8B in vLLM peaks around 12,500 tokens/sec aggregate, with sub-80ms TTFT at low concurrency and ITL of ~11–21ms. For 70B, a single H200 reaches >3,800 tok/s/GPU at FP8 (up to 6.7x faster than A100), and 8x H100 in MLPerf delivered 24,525 tok/s total (~3,066 per GPU). The H100-vs-A100 gap widens sharply with concurrency: at 16 concurrent requests the H100 produced the first token roughly 16x faster than the A100.

Table 13 — Batch size / concurrency effect on throughput (Llama, H200 FP8, TP=1)

ModelBatch sizeInput/Output tokensThroughput (tok/s)Takeaway
Llama-13B1024128/12811,819Large batch maximizes aggregate throughput
Llama-13B128128/20484,750Long output lowers per-batch tok/s
Llama-70B512128/1283,014Peak 70B aggregate at large batch
Llama-70B642048/128341Long input (prefill) crushes throughput
Llama-70B322048/128303Smaller batch + long prompt = lowest tok/s

Long prompts collapse throughput because prefill cost dominates — note how 70B falls from 3,014 tok/s (128 input) to ~303 tok/s (2,048 input).

Table 14 — Latency SLA targets (MLPerf Inference v5.1, Llama 3.1 8B scenarios)

ScenarioTTFT limitTPOT / ITL limitApprox reading speed
Server≤ 2 s≤ 100 ms~480 words/min
Interactive≤ 0.5 s≤ 30 ms~1,600 words/min
Practical interactive bar (vLLM/H100, up to 70B)< 200 ms< 30 ms (8B ITL ~11–21 ms observed)Fluid streaming
Pin Your Engine Version
Always pin the engine version when citing numbers: vLLM v0.6.0 alone delivered 2.7x higher throughput and 5x lower TPOT on Llama 8B versus v0.5.3.

Step 6: Scale Across GPUs — Tensor, Pipeline, and Expert Parallelism

When a model exceeds one GPU, you shard it. There are three primary strategies, and matching them to your interconnect is what separates a fast cluster from a slow one.

Tensor parallelism (TP) shards each layer's weight matrices across GPUs (the Megatron column-parallel to row-parallel pattern), producing exactly two all-reduce collectives per transformer layer in the forward pass. Llama-3-70B's 80 layers means 160 all-reduce synchronization points per forward pass — so TP is bandwidth-bound and effectively requires NVLink/NVSwitch. On 4x L40 without NVLink, communication can exceed 50% of prefill cost.

Pipeline parallelism (PP) splits the model by layers across stages and only passes activations at stage boundaries, so it tolerates slower inter-node links (InfiniBand or even Ethernet) far better than TP. Expert parallelism (EP) shards MoE experts across GPUs, using all-to-all dispatch/combine; it pairs with data-parallel attention for large MoE models like DeepSeek-V3/R1.

The vLLM decision rule is clean: TP inside a node, PP across nodes, with tensor_parallel_size = GPUs per node and pipeline_parallel_size = number of nodes. For 2 nodes x 8 GPUs: --tensor-parallel-size 8 --pipeline-parallel-size 2. The critical exception: if GPUs lack NVLink (e.g. L40S) or the GPU count does not evenly divide the model, use pipeline parallelism instead of tensor parallelism.

Table 15 — vLLM Parallelism Strategy Selection

ScenarioRecommended configExample flags
Model fits on 1 GPUSingle GPU, no distribution(none)
Single node, multiple GPUs, NVLink presentTensor parallel = GPU count--tensor-parallel-size 4
Multi-node, multiple GPUsTP = GPUs per node, PP = number of nodes--tensor-parallel-size 8 --pipeline-parallel-size 2
Single node, no NVLink (e.g. L40S) or uneven splitTP=1, PP = GPU count--tensor-parallel-size 1 --pipeline-parallel-size 8
Large MoE (DeepSeek-V3/R1, Mixtral)DP attention + EP experts--tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel

Table 16 — NVLink / NVSwitch bandwidth by GPU generation

GPU / GenerationNVLink genPer-GPU bandwidth (bidirectional)
A100 (Ampere)NVLink 3600 GB/s
H100 (Hopper)NVLink 4900 GB/s
Blackwell (B200/GB200)NVLink 51,800 GB/s
Rubin (announced)NVLink 63,600 GB/s
Interconnect Notes
NVLink 4 is more than 14x the bandwidth of a PCIe Gen4 x16 bus; an 8-GPU 20GB all-reduce takes ~22ms with NVSwitch versus ~150ms without (~7x). Multi-node TP needs InfiniBand/RoCE with ≥100 Gbps and GPUDirect RDMA; verify it is engaged by checking NCCL logs for [send] via NET/IB/GDRDMA (good) versus [send] via NET/Socket (slow fallback). Container requirements for TP: run with --ipc=host --shm-size=16G -v /dev/shm:/dev/shm; on Kubernetes mount a /dev/shm emptyDir and grant IPC_LOCK — a missing /dev/shm is a common cause of hangs and OOMKilled pods. Single-node multi-GPU uses native multiprocessing; multi-node currently requires Ray.

Step 7: Quantize for Memory and Throughput (FP8 / INT8 / INT4 / FP4)

Quantization is the highest-leverage lever for fitting a model on fewer GPUs. The precision ladder runs FP32 to FP16/BF16 to FP8 to INT8 to INT4/FP4, halving memory roughly at each step beyond FP16.

The accuracy results are encouraging. Red Hat/Neural Magic's study spanning over 500,000 evaluations on the Llama-3.1 family found FP8 (W8A8-FP) effectively lossless across all model scales, INT8 (W8A8-INT) showing a surprisingly low 1–3% degradation per task, and even INT4 weight-only (W4A16) "more competitive than expected, rivaling 8-bit." The reason FP8 is near-lossless while INT8 needs calibration: FP8's exponential value spacing handles outlier activations gracefully, whereas INT8's uniform spacing needs SmoothQuant-style calibration.

The Deployment Rule
Use W4A16 (INT4 weight-only) for latency-bound, low-batch, synchronous serving where weight-loading dominates; use W8A8 (FP8 preferred) for high-throughput continuous batching where you are compute-bound. On Intel Xeon, AMX-INT8 is the high-throughput CPU path — see Option B.

Table 17 — Quantization format comparison (vs FP16/BF16 baseline)

FormatBits (W/A)Memory vs FP16Accuracy vs BF16Throughput noteBest for
FP16 / BF1616 / 161x (baseline)BaselineBaselineMax accuracy, fine-tune
FP8 W8A8 (E4M3)8 / 8~2x smallerEffectively lossless (all scales)~33% faster tok/s on H100High-throughput continuous batching
INT8 W8A8 (SmoothQuant)8 / 8~2x smaller1–3% drop per taskStrong on Ampere/Turing (no FP8 HW)High-throughput on pre-Ada GPUs
INT4 W4A16 (AWQ)4 / 16~4x smallerCompetitive, rivals 8-bitMarlin kernel ~741 tok/s (~10.9x vs no-Marlin)Latency / low-batch sync serving
INT4 W4A16 (GPTQ)4 / 16~4x smallerSlightly below AWQMarlin-accelerated on Ampere+Latency / low-batch sync serving
GGUF Q4_K_M (llama.cpp)~4.5 / mixed~4x smaller~6.74 ppl vs 6.56 BF16CPU/mixedCPU / Apple Silicon / edge
bitsandbytes NF4 / INT84 or 8 / 16~4x / ~2xNF4 ~6.66 pplOn-the-fly (no prequant)Experimentation, QLoRA
NVFP4 (Blackwell)4 / 4~4x smallerNear-FP8 with calibration~2x math throughput vs FP8Blackwell high-throughput serving

AWQ (activation-aware) slightly edges GPTQ (Hessian-based) on perplexity, and the Marlin kernel makes both fast on Ampere+. Hardware support is the practical constraint:

Table 18 — Engine x GPU-architecture support matrix (2026, version-sensitive)

FormatAmpere SM8.0/8.6Ada SM8.9Hopper SM9.0Blackwell SM100/103/120vLLMTensorRT-LLM
FP8 W8A8NoYesYesYesYes (Ada/Hopper+)Yes
INT8 W8A8YesYesYesNo (CC≥10.0 unsupported in vLLM)YesYes (SmoothQuant)
INT4 W4A16 AWQYesYesYesYesYes (AutoAWQ + Marlin)Yes
INT4 W4A16 GPTQYesYesYesYesYes (GPTQModel + Marlin)Yes
NVFP4 / MXFP4NoNoNoYesYes (NVIDIA ModelOpt)Yes (Blackwell only)
GGUFYesYesYesYesYesNo (llama.cpp)
FP8 KV cacheYesYesYesYesYesYes
Two Gotchas
FP8 needs Ada/Hopper or newer (not Ampere), and INT8 W8A8 is currently unsupported in vLLM on Blackwell (compute capability ≥ 10.0) — use FP8 there instead.
Free Download

Get Chapter 1 Free + AI Academy Access

Download the first chapter of The AI Strategy Blueprint and get instant access to our AI Academy — covering infrastructure planning, model selection, and on-premise deployment frameworks.

AI Strategy Blueprint Preview

Step 8: Manage the KV Cache and Long Context

The KV cache caches Keys and Values from prior tokens to avoid O(n²) recomputation each decode step, and at long context it is the primary memory bottleneck, frequently exceeding weight memory. Per-token cost = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element, scaled by tokens × batch.

Table 19 — KV Cache per-token cost and VRAM (Llama 3.1 70B, single sequence)

PrecisionBytes/elementPer-token KV costKV cache at 32K ctxKV cache at 128K ctx
BF16/FP162 bytes~0.31 MB (310 KB)~10 GB~42.9 GB
FP8 (e4m3/e5m2)1 byte~0.155 MB~5 GB~21.5 GB
NVFP4 / 4-bit (Blackwell)0.5 bytes~0.078 MB~2.7 GB~10.7 GB

Every additional 1,000 tokens of context adds ~310 MB for a 70B-class model at BF16, and FP8 KV-cache quantization halves the footprint. Two techniques tame this:

  • GQA/MQA shrink the cache by the ratio of query heads to KV heads. Llama 3.1 70B's 8 KV heads (versus 64 query heads) give an 8x reduction — which can mean 2 GPUs instead of 4 at 128K context.
  • Automatic prefix caching (vLLM --enable-prefix-caching, on by default) hashes complete 16-token KV blocks (SHA-256) and reuses them across requests sharing a prefix — system prompts, tool definitions, few-shot examples — with LRU eviction and a cache_salt for multi-tenant isolation.

Table 20 — PagedAttention vs prior serving systems (vLLM paper)

MetricPrior systemsvLLM PagedAttention
KV-cache memory waste60%–80% (fragmentation + over-reservation)under 4% (last partial block only)
Throughput vs HF Transformers1x14x–24x
Throughput vs TGI (1 completion)1x2.2x–2.5x
Practical Sizing
Budget KV cache as (GPU memory − weights − activations) / per-token cost to derive the maximum total tokens (the sum of all concurrent sequence lengths) the GPU can hold. If kv_cache_usage_perc approaches 100% in production, new requests queue and risk preemption — lower --max-num-seqs or enable --kv-cache-dtype fp8, which roughly doubles effective capacity.

Step 9: Tune Batching and Speculative Decoding

Continuous (in-flight) batching is the single biggest throughput lever: rather than padding to a fixed batch, the engine evicts finished requests and admits queued ones every step. The vLLM V1 scheduler can mix prefill and decode in the same step, prioritizing decode then filling the remaining token budget with (chunked) prefill.

Chunked prefill splits a long prompt's prefill across steps so one long request cannot stall all others — the technique introduced by Sarathi-Serve. The tuning tradeoff: a smaller max_num_batched_tokens (e.g. 2048) gives better ITL because fewer prefill tokens stall decodes; a higher value gives better TTFT and throughput.

Speculative decoding drafts k tokens cheaply, then verifies them in one target-model forward pass, accepting the longest valid prefix. vLLM supports n-gram/prompt-lookup, draft-model, EAGLE/EAGLE-3, and Medusa/MTP.

Table 21 — Speculative decoding methods in vLLM

MethodProposerKey configNotes
n-gram / prompt-lookupMatch trailing n-gram, propose following k tokensmethod=ngram, num_speculative_tokens, prompt_lookup_maxBest when output echoes input (RAG, code edit)
Draft modelSmall separate LLMmodel=<draft>, num_speculative_tokens=5Needs a quality draft sharing the target vocab
EAGLE / EAGLE-3Lightweight MLP replacing target transformer stackmethod=eagle3, draft_tensor_parallel_size=1Top performer; draft runs without TP even if target uses TP
Medusa / MTPAuxiliary heads predict next k tokensdraft_tensor_parallel_size=1No separate draft model
Load-Dependence Caveat
EAGLE-3 delivers up to 2.5x speedup at low load — on MT-Bench with Llama-3.1-8B, 4.40x at acceptance length 6.13 tokens — but the gain erodes under high concurrency: SGLang measured EAGLE-3 at 1.81x throughput at batch 2 but only 1.38x at batch 64. vLLM's own docs warn that speculative decoding "is not yet optimized and does not usually yield inter-token latency reductions for all prompt datasets" — under high concurrency, rejected draft tokens waste target FLOPs that would otherwise serve other requests. Enable it for low-concurrency, latency-sensitive workloads; benchmark before enabling it under heavy load.

Capacity Planning: A Worked Sizing Example, End to End

This is where the math becomes a purchase order. The flow: define demand to compute memory to compute per-token timing to convert to GPU count to apply SLO-driven utilization ceilings and headroom.

Memory formulas (VMware/Lenovo)Weights: M = P × Z × 1.2 (P in billions, Z = bytes/param, 1.2 = ~20% overhead) Llama 3.3 70B FP16 = 70 × 2 × 1.2 = 168 GB KV/token: 2 × precision_bytes × num_layers × num_kv_heads × head_dim Llama-3-8B FP16 = ~128 KB/token ; Llama-3-70B = ~0.000305 GiB/token Max concurrent = max_kv_cache_tokens / max_context_window where max_kv_cache_tokens = (GPU_mem − weights) / kv_per_token
Latency formulasPrefill (compute-bound): weights_per_GPU × 2 FLOP / GPU_TFLOPS Decode (bandwidth-bound): weights_per_GPU × 2 bytes / GPU_bandwidth Example (Llama-3-8B on L40, 181 TFLOPS, 864 GB/s): prefill 0.088 ms/token ; decode 18.5 ms/token 4,000-token prompt + 256-token response = (4000 × 0.088) + (256 × 18.5) ≈ 5.1 seconds

Worked GPU-count example

Suppose peak demand is 1,000 requests/sec, average service time 40 ms, target GPU utilization 70%. Per-H100 service rate = 0.70 / 0.040 = 17.5 RPS. GPU count = ceil(1000 / 17.5) = 60 H100 instances. But the SLO sets the utilization ceiling, because P99 TTFT degrades nonlinearly with concurrency:

Table 22 — P99 TTFT degradation vs concurrency (70B FP8 on H100 SXM5, 512-token prompts)

Concurrent RequestsP50 TTFTP99 TTFTP99/P50
845ms90ms2.0x
1652ms160ms3.1x
3268ms280ms4.1x
6495ms480ms5.1x

Table 23 — SLO target to max GPU utilization ceiling, and resulting fleet (1,000 RPS, 40ms service, H100 spot)

TTFT P99 targetMax GPU utilizationInstances (ceil)Monthly cost (spot)
200ms55%73$88,826
300ms63%64$77,875
400ms70%60$73,008
500ms75%54$65,707
The Lesson
A tighter SLO buys you fewer requests per GPU and roughly 15% more cost per tightening step. Scale on the concurrency trigger (~24–28 concurrent per H100 for 70B FP8), not on raw utilization — by the time CPU-based autoscaling fires, the queue is already deep. Add peak-to-average headroom on top. For models ≥ 70B, factor in tensor parallelism (TP ≥ 2) and use per-GPU weight count in the latency formulas. To model your own numbers interactively, use the LLM Pricing Calculator; for hardware specifics see the Hardware Sizing Guide.

Air-Gapped and Secure Deployment

For classified, defense, and the most sensitive regulated workloads, air-gapping is the deployment model — and it is an architecture, not a configuration flag. Every runtime dependency must be pre-staged inside the enclave: a signed model registry, GPU inference workers, a local vector DB with a local embedding model, a container-registry mirror, OS/language package mirrors, on-prem observability, and internal PKI. True air-gap means no NAT, no DNS to external hostnames, no public CA chain, and no route by which a packet can leave. The single most common way an "air-gapped" RAG stack secretly breaks the gap is calling a remote embedding API — the embedding model must run inside the enclave alongside the LLM. For a fuller treatment, see Best AI for Air-Gapped Environments.

The workflow is two-phase. On a connected staging host, pre-download models and containers; verify SHA-256/signatures; physically transfer across the gap; then run isolated. For NVIDIA NIM, the connected host sets NGC_API_KEY and LOCAL_NIM_CACHE, runs download-to-cache -p <profile-hash>, copies the cache to AIR_GAP_NIM_CACHE, and the disconnected host mounts it at /opt/nim/.cache and runs the container without NGC_API_KEY or HF_TOKEN — omitting the keys prevents any model-download, registry, or telemetry call. For open-source vLLM, use huggingface-cli/snapshot_download on the connected host, serve a local directory path (not a hub repo ID), and set HF_HUB_OFFLINE=1 so the tokenizer resolves locally.

Table 24 — Telemetry / phone-home kill switches by component (air-gap hardening)

ComponentVariable / mechanismEffect
Hugging Face HubHF_HUB_OFFLINE=1No HTTP to the Hub; cache-only; skips cached-file version check
TransformersTRANSFORMERS_OFFLINE=1Loads strictly from local cache
HF ecosystemHF_HUB_DISABLE_TELEMETRY=1 (or DO_NOT_TRACK=1)Disables usage telemetry across transformers/datasets/diffusers/gradio
HF authHF_HUB_DISABLE_IMPLICIT_TOKEN=1Stops auto-attaching token to read requests
vLLMVLLM_NO_USAGE_STATS=1 / VLLM_DO_NOT_TRACK=1 / ~/.config/vllm/do_not_trackDisables default-on anonymous usage stats
NVIDIA NIM (air-gap run)Omit NGC_API_KEY and HF_TOKENRuns from mounted cache with no registry/Hub callouts

Mirror every container image through a frozen local registry (Harbor, or oc-mirror on OpenShift) and version-pin scanned PyPI/npm/apt snapshots. Updates arrive as signed tarballs (manifests + images + Helm charts) physically walked across the gap, integrity- and signature-verified before staging, on a slow cadence — monthly (healthcare) to quarterly (defense). Use the customer's internal PKI with mTLS between gateway and workers; there is no route to a public CA.

Table 25 — Compliance frameworks for on-prem / air-gapped LLM

FrameworkKey figure / control setAir-gap relevance
FedRAMP High421 controlsEliminates boundary-defense & external-monitoring control categories (no boundary)
DoD Impact LevelsIL4 = CUI, IL5 = CUI+mission-critical, IL6 = classified to SECRETAir-gap required/expected at IL5–IL6
CMMC 2.0 Level 2NIST SP 800-171 (110 controls)Eases MP, SC, AC families; avoids 32 CFR Part 170 FedRAMP-Moderate cloud rule on-prem
CMMC 2.0 Level 3NIST 800-171 + 800-172 enhancedHighest CUI tier; air-gap simplifies enhanced SC/AC
HIPAANot required; BAA + "minimum necessary"Air-gap + HITRUST CSF attestation common for PHI
SCIF / classifiedEncrypted drives, cleared installers, cross-domain media updatesNo external connectivity; physical update channel only
Strategic Point
Air-gapping does not merely satisfy controls, it eliminates entire control categories — there is no network boundary to defend or continuously monitor. Pair the deployment with a written AI Governance Framework so the model-update, access, and audit processes are documented before an assessor asks. AirgapAI Edge (see Option B) runs fully offline on Intel Xeon CPUs from a local model IR — a natural fit for SCIF and classified enclaves where GPU power and cooling are impractical.

Total Cost of Ownership: On-Prem vs Cloud

The GPU sticker is only about 35% of five-year TCO — power, cooling, networking, redundancy, and staff make up the rest.

Table 26 — On-Prem GPU Server CAPEX (full system, Lenovo Press 2026, priced Jan 15 2026)

ConfigGPU SetupGPU MemoryPrice (USD)
A8x H10080 GB$250,141.80
B8x H200141 GB$277,897.75
C8x B200192 GB$338,495.75
D8x B300288 GB$461,567.50
E4x L40S48 GB$52,390.50

An 8x H100 server pulls ~10 kW at full load (~$10,500/yr electricity at $0.12/kWh), with cooling adding ~30%. Staff is typically the single largest line item, exceeding hardware depreciation over three years:

Table 27 — 3-Year TCO of One 8x H100 SXM5 Server (Spheron cost model, 2026)

Cost CategoryAnnual3-Year Total
Hardware depreciation$116,000–150,000$350,000–450,000
Power (~10 kW @ $0.12/kWh)$10,500–10,700$31,500–32,100
Cooling (~30% of power)$3,150–3,210$9,450–9,630
Datacenter / colocation$12,000–24,000$36,000–72,000
Networking (InfiniBand)~$10,000~$30,000
Storage (NVMe, object)$5,000–8,000$15,000–24,000
Staff (0.5 FTE engineer)$75,000–100,000$225,000–300,000
Maintenance / spares$5,000–10,000$15,000–30,000
TOTAL~$236,650–315,910~$711,950–947,730

Table 28 — Break-Even Time, On-Prem 8x H100 vs Azure (Lenovo 2026)

Cloud Pricing TierRate ($/hr, 8-GPU server)On-Prem Break-Even
Azure on-demand$98.32~3.7 months
Azure 1-year reserved$62.92~6 months
Azure 5-year reserved$39.32~10.4 months

Table 29 — Per-Token Cost: On-Prem vs Cloud/API (Lenovo 2026)

Model / ConfigThroughputOn-Prem $/1M tokensCloud/API $/1M tokensOn-Prem advantage
Llama-70B, 8x H10030,576 tok/s$0.11$0.89 (Azure H100)8x
Llama-3.1-405B, 8x B3001,360 tok/s$4.74$29.09 (AWS)84% cheaper
GPT-5-mini-equivalent open model, 8x H100n/a$0.11~$2.00 (GPT-5 mini API)~18x
Two Honest Counterweights
First, independent academic analysis (arXiv 2509.18101) finds break-even is sharply model-size-dependent: small ~30B models pay back in 0.3–3 months, medium ~70B in 2.3–34 months, and large 235B+ models in 4.3–69.3 months. Second, against ultra-cheap specialist clouds (e.g. ~$2.90/hr H100), cloud can beat on-prem even at 100% utilization — and real production teams run only 40–65% utilization, well below the 80–90% optimistic vendor models assume. The break-even that pays back in 3.7 months at 90% utilization may never pay back at 40%. Model your own utilization honestly; see Edge AI vs Cloud Economics for the full crossover analysis and the LLM Pricing Calculator to plug in your token volume.

Production Operations: Observability, Autoscaling, and Go-Live

Four pillars carry an on-prem LLM from "it runs" to "it runs reliably": observability, autoscaling, health/lifecycle, and go-live readiness.

Observability

vLLM exposes Prometheus metrics at /metrics. Monitor the golden signals: latency histograms (time_to_first_token, inter_token_latency, e2e_request_latency, request_queue_time), saturation gauges (num_requests_running, num_requests_waiting, kv_cache_usage_perc), and throughput/health counters (generation_tokens, num_preemptions, prefix-cache hit rate). Triage rule: if num_requests_waiting > 0 consistently, requests are queuing and TTFT is rising — add capacity; if num_requests_waiting == 0 but TTFT is still high, the bottleneck is prefill compute, not scaling. Healthy steady state is zero requests waiting with KV cache below 90%.

Autoscaling

Standard Kubernetes HPA on CPU/memory is wrong for GPU inference — the GPU saturates while CPU stays low. Use KEDA scaling on queue depth (num_requests_waiting) per replica via a Prometheus trigger. A reference ScaledObject: threshold ~5 pending, minReplicaCount 1, maxReplicaCount 3, pollingInterval 15s, cooldownPeriod 360s. Model-weight load is the dominant pod-startup cost; a shared weights cache on an NFS-backed PVC cuts startup "from minutes to seconds," making reactive autoscaling feasible.

Health & Lifecycle

vLLM's /health confirms only that the engine process is alive — it does not verify the GPU can run a forward pass. Set Kubernetes readinessProbe (initialDelaySeconds 120) and livenessProbe (initialDelaySeconds 180) with high initial delays because model load takes minutes, and drain active streams gracefully on deploy. Version model weights, tokenizer, prompt templates, and inference config together with commit hashes; ship via stable deployment IDs with shadow traffic and canary rollout that auto-rolls-back on TTFT/TPS regression.

Go-Live

Before launch, run a saturation sweep with GuideLLM or genai-perf across realistic input/output lengths to find the knee and set P95/P99 SLOs from observed data. Token-aware rate limits, client retries with jitter, and idempotency keys round out the production posture. The full pre-launch checklist follows below.

Printable On-Prem LLM Requirements Checklist

Model & Licensing
  • Model license cleared by legal (Apache 2.0 / MIT preferred; verify Llama 700M-MAU clause; review Gemma terms)
  • Model selected by use case (chat / RAG / coding / reasoning / edge)
  • MoE vs dense decision recorded (VRAM bills total params, compute bills active)
Sizing
  • Weights VRAM computed (params × bytes/param × 1.2)
  • KV cache budgeted at target context AND concurrency (GQA/MLA-aware)
  • Quantization chosen (W4A16 for latency/low-batch; W8A8/FP8 for throughput)
  • Max concurrent requests per GPU derived from leftover VRAM
Hardware
  • GPU model selected on capacity AND bandwidth (not just VRAM) — or Intel Xeon + AirgapAI Edge for no-GPU CPU inference
  • Precision support verified (FP8 needs Ada/Hopper+; FP4 needs Blackwell; AMX-INT8 on Xeon)
  • NVLink present if using tensor parallelism; else plan pipeline parallelism
  • InfiniBand/RoCE ≥100 Gbps + GPUDirect RDMA for multi-node TP
Serving Stack
  • Engine chosen (vLLM / NIM / SGLang / TensorRT-LLM) with rationale
  • OpenAI-compatible endpoint + API-key auth configured
  • --gpu-memory-utilization, --max-model-len, --max-num-seqs tuned
  • Continuous batching + prefix caching confirmed on; speculative decoding benchmarked under real load
  • --ipc=host --shm-size=16G / /dev/shm + IPC_LOCK set for multi-GPU
Capacity & SLO
  • Demand model built (concurrent users, RPS, in/out tokens)
  • GPU count derived two ways (tokens/sec and queueing)
  • SLO-driven utilization ceiling applied; scale trigger = concurrency, not CPU
  • Peak-to-average headroom added
Air-Gap & Security (if applicable)
  • All dependencies pre-staged inside enclave (incl. local embedding model)
  • Two-phase download/verify/transfer workflow documented; SHA-256 verified
  • Telemetry kill switches set (HF_HUB_OFFLINE, VLLM_NO_USAGE_STATS, NIM keys omitted)
  • Private registry mirror frozen; packages version-pinned and scanned
  • Internal PKI + mTLS; on-prem observability; signed-bundle update cadence defined
  • Compliance mapping documented (FedRAMP / CMMC / HIPAA / IL level)
Production Ops
  • Prometheus /metrics scraped; Grafana dashboards on golden signals
  • Alerts on P95 TTFT regression, queue depth, KV%, preemptions, error rate
  • KEDA autoscaling on queue depth validated under load
  • Liveness + GPU-level readiness probes; graceful drain on deploy
  • Load tested with GuideLLM/genai-perf; P95/P99 SLOs set from data
  • Token-aware rate limits; client retries with jitter; idempotency keys
  • Model artifacts versioned together; canary + auto-rollback; DR runbooks drilled

Put the Sizing Math to Work

An on-prem deployment is one chapter of a defensible enterprise AI program. Build the strategy behind the infrastructure, then turn this guide into a tailored deployment roadmap.

Build the Strategy Behind the Infrastructure
Get the full playbook in the AI Strategy Blueprint — the executive guide to deploying AI with the right infrastructure, security, and ROI built in from day one. $24.95 on Amazon, rated 5 stars.
Get the AI Strategy Blueprint
Turn This Guide Into Your Roadmap
Use the AI Blueprint Builder to generate a tailored on-premise deployment plan mapped to your models, hardware, concurrency targets, and compliance requirements.
Launch the AI Blueprint Builder

Frequently Asked Questions

A 70B model needs ~140 GB at FP16 (70B x 2 bytes), ~70 GB at FP8/INT8, and ~35-46 GB at INT4 -- before KV cache and activations. In practice, FP16 requires 2x H100 80GB (tensor-parallel) or a single H200 141GB, while INT4 fits comfortably on one 80GB GPU. For production at realistic context and concurrency, NVIDIA NIM's supported minimum for a 70B at BF16 is 4x 80GB GPUs, because 2x 80GB leaves too little headroom for the KV cache.
Use vLLM when you want maximum flexibility, no license cost, and the fastest access to new open models, and you have the platform team to own integration and support. Use NIM when you need a turnkey, vendor-supported container with SLAs, proactive CVE patching, and validated performance -- and you are licensing NVIDIA AI Enterprise (~$4,500/GPU/year). Raw throughput between the top engines is within ~15% and flips by workload, so the decision is about support model and operational fit, not speed.
Yes. Air-gapping is an architecture: you pre-stage the model, container, embedding model, and all dependencies on a connected host, verify signatures, physically transfer them across the gap, and run isolated with telemetry disabled. For NIM, run the container without NGC_API_KEY/HF_TOKEN; for vLLM, serve a local model path with HF_HUB_OFFLINE=1. The most common mistake is leaving a remote embedding-API call in a RAG pipeline, which silently breaks the air gap.
KV cache per token = 2 x num_layers x num_kv_heads x head_dim x bytes_per_element, multiplied by tokens x batch size. For Llama 3.1 70B at BF16 that is ~0.31 MB/token, so 128K context for one stream is ~42.9 GB; FP8 halves it to ~21.5 GB. Because it scales linearly with both context length and concurrency, the KV cache often exceeds weight memory at long context -- budget it explicitly as (GPU memory - weights - activations) / per-token cost.
FP8 is effectively lossless and is the production standard for high-throughput continuous batching on Hopper/Blackwell; INT8 shows only 1-3% degradation and is the right choice on Ampere GPUs that lack FP8; INT4 weight-only (AWQ/GPTQ) is competitive and best for latency-bound, low-batch serving where weight loading dominates. Rule of thumb: W4A16 for latency/cost-efficiency, W8A8 (FP8 preferred) for throughput.
For sustained, high-volume, predictable inference, yes -- self-hosting an open model runs roughly 8-18x cheaper per token over a multi-year horizon, and an 8x H100 cluster can break even versus Azure on-demand in about 3.7 months. But the GPU sticker is only ~35% of true TCO (staff is often the largest line item), break-even is sharply model-size-dependent, and at the 40-65% utilization real teams actually achieve, cheap specialist clouds can win even so. Model your real utilization before committing capital.
Decode -- the token-by-token generation phase -- is memory-bandwidth-bound, not compute-bound, because every new token streams all model weights from HBM once per forward pass. That is why the H200, with compute identical to the H100 but 43% more bandwidth (4.8 vs 3.35 TB/s), generates tokens ~43% faster at small batch sizes. Prioritize HBM bandwidth and capacity over raw TFLOPS for inference workloads.
Use a single GPU if the model fits; tensor parallelism (TP = GPU count) within a node when it does not, provided NVLink is present; and TP-per-node plus pipeline parallelism (PP = node count) across nodes. If GPUs lack NVLink (e.g. L40S) or do not evenly divide the model, prefer pipeline parallelism. Multi-node TP needs InfiniBand/RoCE ≥100 Gbps with GPUDirect RDMA -- verify with NCCL logs showing NET/IB/GDRDMA rather than NET/Socket.
Do not use CPU-based Kubernetes HPA -- the GPU saturates while CPU stays idle, so the queue is already deep by the time it triggers. Use KEDA scaling on queue depth (vllm:num_requests_waiting) per replica via Prometheus, with a threshold around 5 pending requests and a cooldown of ~360s. Mitigate cold starts with a shared NFS-backed PVC weights cache, which drops pod startup from minutes to seconds.

Sources & References

Serving Engines (vLLM, NIM, TensorRT-LLM, SGLang)

Sizing, VRAM & KV Cache

GPUs, Quantization & Parallelism

Air-Gap, TCO & Operations

This guide synthesizes publicly available vendor documentation, academic research, and benchmarks as of 2026-05-30. Hardware specs and formulas are stable, but model versions, software defaults, and pricing are version-sensitive and drift monthly — always verify against the authoritative source before relying on a specific figure in a procurement or capacity decision. The Intel Xeon / AirgapAI Edge figures are from Iternal internal benchmarks (2026); run a proof-of-concept on your own workload before finalizing hardware.