LLM Model Capacity Calculator
Calculate GPU VRAM, model parameters, context windows, and quantization for Large Language Models
📋 How it works:
• Leave exactly ONE field empty to calculate its value
• Decoder-only Transformer scaling: with the standard parameter estimate P ≈ 12 × L × d_model² and the heuristic L ≈ d_model/2, this reduces to P ≈ 6 × d_model³
• Model memory = Parameters × quantization multiplier (bytes per parameter)
• KV cache memory = 2 × Context × Layers × d_model × quantization multiplier (the factor 2 covers keys and values)
• Total memory = Model + KV cache + 0.5 GB overhead (see the sketch below)
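A minimal sketch of these formulas in Python, assuming the quantization multipliers are bytes per element, parameter counts are in billions, and the L ≈ d_model/2 heuristic above; the function names, GiB conversion, and separate model-weight multiplier are illustrative assumptions, not the tool's actual implementation:

```python
# Sketch of the memory model above. Assumptions: quant multipliers are
# bytes per element, params are in billions, and the scaling heuristic
# L ≈ d_model / 2 (so P ≈ 6 · d_model³) holds.
GIB = 1024 ** 3
OVERHEAD_GB = 0.5  # fixed overhead term from the formula above

def d_model_from_params(params_b: float) -> float:
    """Invert P ≈ 6 · d_model³ to recover the hidden size."""
    return (params_b * 1e9 / 6) ** (1 / 3)

def total_memory_gb(params_b: float, context: int,
                    model_quant: float, kv_quant: float) -> float:
    """Model weights + KV cache + fixed overhead, in GB."""
    d_model = d_model_from_params(params_b)
    layers = d_model / 2  # heuristic L ≈ d_model / 2
    model_gb = params_b * 1e9 * model_quant / GIB
    kv_gb = 2 * context * layers * d_model * kv_quant / GIB
    return model_gb + kv_gb + OVERHEAD_GB
```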
GPU VRAM (GB)
KV Cache Quantization (multiplier = bytes per cached element; see the lookup sketch below)
• f32 (4.00× memory, full precision)
• f16 (2.00× memory, half precision)
• bf16 (2.00× memory, bfloat16)
• q8_0 (1.00× memory, 8-bit quant)
• q5_1 (0.625× memory, 5-bit + scale)
• q5_0 (0.5625× memory, 5-bit quant)
• q4_1 (0.5625× memory, 4-bit + scale)
• q4_0 (0.50× memory, 4-bit quant)
• iq4_nl (0.55× memory, 4-bit improved)
• Custom (user-supplied multiplier)
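The dropdown maps naturally to a lookup table. A sketch, with names and values copied from the list above and `Custom` handled as direct user input rather than a table entry:

```python
# Bytes per cached element for each KV-cache quantization option above.
KV_QUANT_MULTIPLIER = {
    "f32":    4.0,     # full precision
    "f16":    2.0,     # half precision
    "bf16":   2.0,     # bfloat16
    "q8_0":   1.0,     # 8-bit quant
    "q5_1":   0.625,   # 5-bit + scale
    "q5_0":   0.5625,  # 5-bit quant
    "q4_1":   0.5625,  # 4-bit + scale
    "q4_0":   0.50,    # 4-bit quant
    "iq4_nl": 0.55,    # 4-bit improved
}
# "Custom" accepts a user-supplied multiplier instead of a table entry.
```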
Context Window Range (tokens)
Minimum
Maximum
💡 Results are computed for every context size in this range (see the sweep sketch below)
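A sketch of the range behavior, reusing `total_memory_gb` from above; the `step` granularity is an assumption, and the tool may sample the range differently:

```python
def sweep_context(min_ctx: int, max_ctx: int, params_b: float,
                  model_quant: float, kv_quant: float, step: int = 1024):
    """Yield (context, total GB) for each context size in the range."""
    for ctx in range(min_ctx, max_ctx + 1, step):
        yield ctx, total_memory_gb(params_b, ctx, model_quant, kv_quant)
```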
Model Parameters (billions)
🚀 Compute Missing Value
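When the empty field is, say, model parameters, the missing value can be recovered by inverting the memory formula numerically. A sketch via bisection, solving for parameters given a VRAM budget (the other fields would invert the same formula for their own variable; this relies on total memory growing monotonically with parameter count):

```python
def solve_params_for_vram(vram_gb: float, context: int,
                          model_quant: float, kv_quant: float) -> float:
    """Largest parameter count (billions) whose total memory fits in VRAM."""
    lo, hi = 0.0, 10_000.0  # bracket in billions of parameters
    for _ in range(60):     # 60 bisections: well past float precision
        mid = (lo + hi) / 2
        if total_memory_gb(mid, context, model_quant, kv_quant) <= vram_gb:
            lo = mid
        else:
            hi = mid
    return lo

# Example: largest f16 model (f16 KV cache) fitting in 24 GB at 8K context.
print(round(solve_params_for_vram(24.0, 8192, 2.0, 2.0), 2))
```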
📊 Calculation Results