## Definition

**Quantization** is the process of representing a model's numerical values (weights, activations, gradients) in a lower-precision format, that is, with fewer bits per value. It reduces memory footprint, increases throughput, and lowers inference cost, at the expense of some numerical precision.

## Numerical Formats

Modern neural networks are typically trained in floating-point formats; quantized models may also use integer or custom 4-bit formats. The most common formats:

| Format | Bits | Notes |
|---|---|---|
| FP32 | 32 | Single precision; baseline for training |
| FP16 | 16 | Half precision; common for inference |
| BF16 | 16 | Wider range than FP16 (more exponent bits), less precision; default for many LLMs (e.g., Llama 2) |
| TF32 | 19 | NVIDIA GPU format; FP32 range, FP16 precision |
| INT8 | 8 | Integer; increasingly common |
| INT4 | 4 | Very low precision; requires careful calibration |
| NF4 | 4 | NormalFloat-4; used in QLoRA; calibrated for near-Gaussian weight distributions |

**Warning**: loading a model in a format other than the one it was trained in can degrade quality significantly. For example, loading a BF16-trained model (Llama 2) in FP16 caused widespread confusion and perceived quality drops.

## Memory Impact

A 10B-parameter model requires the following memory for its weights alone:

- FP32 (4 bytes/param): 40 GB
- FP16 or BF16 (2 bytes/param): 20 GB
- INT8 (1 byte/param): 10 GB
- INT4 (~0.5 bytes/param): ~5 GB

Halving the bit width halves the weight memory. Reduced precision also allows larger batch sizes and faster computation (fewer bit operations per addition or multiplication).

## When to Quantize

**Post-Training Quantization (PTQ)**: quantize a fully trained model. This is the most common approach. Major frameworks (PyTorch, TensorFlow, Hugging Face transformers) provide PTQ in a few lines of code. PTQ is especially relevant to application developers who do not train models from scratch.

**Quantization-Aware Training (QAT)**: simulate low-precision behaviour during training so the model learns to produce high-quality outputs in low precision.
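
As a rough illustration of what "simulating low-precision behaviour" can mean, the sketch below rounds a list of weights onto a symmetric signed-integer grid and maps them straight back to floats (often called fake quantization). It is a minimal pure-Python sketch, not any framework's API: `fake_quantize`, `num_bits`, and the sample `weights` are hypothetical names chosen for this example, and per-tensor symmetric scaling is an assumption.

```python
def fake_quantize(values, num_bits=8):
    """Round-trip floats through a symmetric signed-integer grid.

    Assumption: one scale for the whole list (per-tensor, symmetric).
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                   # nothing to scale
    scale = max_abs / qmax                    # float step between grid points
    simulated = []
    for v in values:
        q = round(v / scale)                  # nearest integer level
        q = max(-qmax, min(qmax, q))          # clamp to the representable range
        simulated.append(q * scale)           # back to float ("dequantize")
    return simulated


# Hypothetical sample weights, for illustration only.
weights = [0.5, -1.0, 0.25, 0.003]
simulated = fake_quantize(weights, num_bits=8)
max_error = max(abs(w - s) for w, s in zip(weights, simulated))
```

Running the forward pass on `simulated` instead of `weights` exposes the model to the rounding error it will see after quantization, while the arithmetic itself still runs in floating point. In practice, QAT implementations usually also need a rule for passing gradients through the rounding step (commonly the straight-through estimator), which this sketch leaves out.
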
QAT does not reduce training time (computations still run at high precision), but it improves inference quality compared to PTQ.

**Training in lower precision**: training directly in INT8 or in mixed precision (some values kept at low precision, others at high precision) can both reduce training time and avoid the train/serve precision mismatch. It is harder to do because backpropagation is sensitive to rounding errors that accumulate across steps.

## Limits of Quantization

Precision cannot go below 1 bit per value. Research has explored 1-bit representations (BinaryConnect, XNOR-Net, BitNet). In 2024, Microsoft introduced BitNet b1.58, whose weights take one of three values (-1, 0, 1), which works out to log2(3) ≈ 1.58 bits/param. It reported performance comparable to 16-bit Llama 2 at up to 3.9B parameters.

Each precision reduction introduces small changes in values, and many small changes can compound into significant quality degradation. The effect is worse when values fall outside the representable range: in FP16, for example, they overflow to infinity.

## Quantization in PEFT Context

QLoRA combines PTQ with LoRA: the base model is stored in 4-bit NF4 and dequantised to BF16 during forward and backward passes, while only the LoRA adapter matrices, kept in 16-bit, are updated. This enables finetuning a 65B-parameter model on a single 48 GB GPU. See [[LoRA]].

## Related

- [[Parameter-Efficient Finetuning]]
- [[LoRA]]
- [[Inference Latency]]
- [[KV Cache]]

## Sources

- [[AI Engineering - Chip Huyen]]