Model Quantization

Quantization reduces the numerical precision of a model's weights (and sometimes activations) to lower memory usage and increase inference speed, at some cost in output quality.
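
To make this concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names are illustrative; production tools such as TensorRT typically use per-channel scales and calibration data rather than this naive max-based scheme.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 values to INT8 codes in [-127, 127] with one shared scale."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize_int8(q, scale)).max())
```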

Quantization Formats

| Format | Bits | Use Case          | Tools     |
|--------|------|-------------------|-----------|
| FP16   | 16   | Training          | Native    |
| BF16   | 16   | Training          | Native    |
| INT8   | 8    | Inference         | TensorRT  |
| GPTQ   | 4    | GPU inference     | AutoGPTQ  |
| AWQ    | 4    | GPU inference     | AutoAWQ   |
| GGUF   | 2-8  | CPU/Apple M-series | llama.cpp |
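
As one way to use these formats, the sketch below loads a prequantized GPTQ checkpoint through transformers, which reads the quantization config stored in the checkpoint. This assumes the optimum and auto-gptq packages are installed; the model id is just one example of a public GPTQ checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A prequantized GPTQ checkpoint ships its quantization config, so
# transformers loads it directly (requires optimum + auto-gptq).
model_id = "TheBloke/Llama-2-7B-GPTQ"  # example public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization trades accuracy for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```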

VRAM Savings

Approximate VRAM required to hold the weights of a 7B-parameter model:

| Precision | VRAM  |
|-----------|-------|
| FP32      | 28 GB |
| FP16      | 14 GB |
| INT8      | 7 GB  |
| 4-bit     | 4 GB  |
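
The arithmetic behind these numbers is simply parameters × bits per weight ÷ 8. A tiny helper (the function name is mine, not from any library) reproduces the table; the pure-weights figure for 4-bit is ~3.5 GB, and the 4 GB above reflects the extra overhead (quantization scales, KV cache) seen in practice.

```python
def approx_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only estimate: params * bits / 8 bytes, ignoring KV cache and activations."""
    return params_billion * bits_per_weight / 8

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: {approx_vram_gb(7, bits):.1f} GB")
```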

GGUF Quantization Levels

| Quant  | Bits/Weight | Quality    |
|--------|-------------|------------|
| Q8_0   | 8.5         | Best       |
| Q6_K   | 6.6         | Excellent  |
| Q5_K_M | 5.7         | Great      |
| Q4_K_M | 4.8         | Good       |
| Q3_K_M | 3.9         | Acceptable |
| Q2_K   | 2.6         | Lossy      |
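
A GGUF file at any of these levels can be run through llama.cpp's Python bindings. A minimal sketch, assuming llama-cpp-python is installed and the path points at a GGUF file you have actually downloaded:

```python
from llama_cpp import Llama

# Path is illustrative; point it at a GGUF file on your machine.
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: What does quantization trade off? A:", max_tokens=64)
print(out["choices"][0]["text"])
```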

When to Use

  • Q8_0: maximum quality, when VRAM allows
  • Q5_K_M: balanced quality and size (recommended default)
  • Q4_K_M: memory-constrained setups
  • Q2-Q3: extreme memory limits; expect noticeable quality loss
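
Putting the two tables together, here is an illustrative helper (the thresholds and names are my own, not from any tool) that picks the highest-quality GGUF level whose weights fit in a given VRAM budget, leaving headroom for the KV cache:

```python
# Bits/weight figures from the GGUF table above, ordered best quality first.
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
          ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 2.6)]

def pick_quant(params_billion: float, vram_gb: float, headroom_gb: float = 1.0):
    """Return the best GGUF level whose weights plus headroom fit in vram_gb."""
    for name, bpw in QUANTS:
        weights_gb = params_billion * bpw / 8
        if weights_gb + headroom_gb <= vram_gb:
            return name
    return None  # even Q2_K will not fit

print(pick_quant(7, 8))  # -> Q6_K for a 7B model on an 8 GB card
```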