## Definition
**Quantization** is the process of representing a model's numerical values (weights, activations, gradients) in a lower-precision format — fewer bits per value — to reduce memory footprint, increase throughput, and lower inference cost, at the expense of some numerical precision.
## Numerical Formats
Neural networks are usually trained in floating-point formats, while quantized models may also store values as low-bit integers. The most common formats:
| Format | Bits | Notes |
|---|---|---|
| FP32 | 32 | Standard precision; baseline for training |
| FP16 | 16 | Half precision; common for inference |
| BF16 | 16 | Same 8 exponent bits as FP32, so wider range than FP16, but fewer mantissa bits (7 vs. 10), so less precision; default for many LLMs (e.g., Llama 2) |
| TF32 | 19 | NVIDIA GPU format; FP32 range, FP16 precision |
| INT8 | 8 | Integer; increasingly common |
| INT4 | 4 | Very low precision; requires careful calibration |
| NF4 | 4 | NormalFloat-4; used in QLoRA; quantization levels chosen for approximately normally distributed weights |

**Warning**: loading a model in a numerical format other than the one it was trained in can degrade quality noticeably. For example, Llama 2 was trained in BF16, and loading its weights in FP16 caused widespread confusion and perceived quality drops.
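The range/precision trade-off between FP16 and BF16 can be illustrated with a minimal, standard-library sketch. The helper names `to_fp16` and `to_bf16` are hypothetical, and the BF16 conversion here simply truncates the FP32 bit pattern (real hardware usually rounds to nearest), so treat it as an approximation rather than a reference implementation.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range; model that as inf.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Simulate BF16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: truncation instead of round-to-nearest, for simplicity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# A large value: outside FP16's range, inside BF16's.
print(to_fp16(70000.0), to_bf16(70000.0))   # inf 69632.0

# A value close to 1: FP16 keeps more of the fraction than BF16.
print(to_fp16(1.001), to_bf16(1.001))       # 1.0009765625 1.0
```

The first line shows why values produced by a BF16-trained model may not survive a cast to FP16; the second shows what BF16 gives up in exchange.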
## Memory Impact
The weights of a 10B-parameter model require:
- FP32 (4 bytes/param): 40 GB
- FP16 or BF16 (2 bytes/param): 20 GB
- INT8 (1 byte/param): 10 GB
- INT4 (~0.5 bytes/param): ~5 GB
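
Halving the bit width halves the weight memory. These figures cover weights only; activations, the KV cache, and (during training) gradients and optimizer states need additional memory. Lower precision also leaves room for larger batch sizes and generally speeds up computation, because lower-precision arithmetic and data movement are cheaper on hardware that supports the format.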
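The arithmetic above can be reproduced with a short sketch. The function name `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate ignores quantization metadata such as scales and zero points, which is why the INT4 figure is only approximate in practice.

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9

# Weight memory of a 10B-parameter model in each format.
for fmt in BITS_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(10_000_000_000, fmt):.1f} GB")
```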
## When to Quantize
**Post-Training Quantization (PTQ)**: quantize a fully trained model. This is the most common approach. Major frameworks (PyTorch, TensorFlow, Hugging Face transformers) provide PTQ in a few lines of code, which makes it especially relevant to application developers who do not train models from scratch.

**Quantization-Aware Training (QAT)**: simulate low-precision behaviour during training so the model learns to produce high-quality outputs in low precision. QAT does not reduce training time, because computations still run at high precision, and the extra simulation work can even add to it. The payoff is better low-precision inference quality than PTQ typically gives.

**Training in lower precision**: train directly in a low-precision format such as INT8, or in mixed precision (typically a higher-precision copy of the weights alongside lower-precision activations and gradients, or low precision only for the less sensitive weights). This can reduce training time and avoids a train/serve precision mismatch when the model is served at the same precision. It is harder to do because backpropagation is sensitive to rounding errors that accumulate across steps.
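As an illustration of the PTQ idea, here is a minimal sketch of symmetric absmax INT8 quantization in plain Python. This is one simple scheme chosen for illustration, not the method any particular framework uses; `quantize_int8` and `dequantize_int8` are hypothetical names, and the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from INT8 codes and their scale."""
    return [c * scale for c in codes]

weights = [0.02, -1.27, 0.5, 0.333]          # illustrative values only
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The quantize-then-dequantize round trip is also roughly what QAT simulates in the forward pass, so that the model sees the rounding error during training.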
## Limits of Quantization
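A value cannot be represented in fewer than 1 bit. Research has explored 1-bit representations (BinaryConnect, XNOR-Net, BitNet). In 2024, Microsoft researchers introduced BitNet b1.58, which needs about 1.58 bits per parameter (each weight takes one of three values, -1, 0, or 1, and log2(3) ≈ 1.58), and reported performance comparable to 16-bit Llama 2 at up to 3.9B parameters.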
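Each precision reduction introduces small value changes, and many small changes can compound into significant quality degradation. The damage is worse when values fall outside the target format's representable range: in FP16, for example, such values overflow to infinity, or saturate at the largest representable value if they are clamped.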
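How small rounding errors compound can be seen in a toy accumulation, sketched below with the standard library only. The helper names are hypothetical and the setup (adding 0.001 ten thousand times) is an illustrative assumption, not a model of any real training loop.

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(step: float, n: int, low_precision: bool) -> float:
    """Add `step` to a running total n times, optionally storing it in FP16."""
    total = 0.0
    for _ in range(n):
        total += step
        if low_precision:
            total = round_to_fp16(total)  # running total is kept in FP16
    return total

exact = accumulate(0.001, 10_000, low_precision=False)      # close to 10.0
fp16_total = accumulate(0.001, 10_000, low_precision=True)  # stalls well below 10
print(exact, fp16_total)
```

The FP16 total stops growing once the step is smaller than half the gap between adjacent FP16 values, so every later addition rounds away to nothing.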
## Quantization in PEFT Context
QLoRA combines quantization with LoRA: the base model is stored in 4-bit NF4 and kept frozen, its weights are dequantised to BF16 for the forward and backward passes, and only the LoRA adapter matrices, held in 16-bit, are updated. The QLoRA authors report finetuning a 65B-parameter model on a single 48 GB GPU. See [[LoRA]].
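The data flow described above can be sketched in plain Python. This is a toy under stated assumptions: it uses uniform 4-bit absmax quantization as a stand-in for NF4, ordinary Python floats instead of BF16, a rank-1 adapter, and made-up weights. All names are hypothetical and nothing here reflects the actual QLoRA implementation.

```python
def quantize_4bit(weights):
    """Absmax-quantize floats to signed 4-bit codes in [-7, 7] (stand-in for NF4)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [max(-7, min(7, round(w / scale))) for w in weights], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Frozen base weight (2x3): only the 4-bit codes and one scale per row are stored.
W = [[0.7, -0.33, 0.1], [0.2, 0.52, -0.7]]
W_q = [quantize_4bit(row) for row in W]

# Trainable LoRA factors kept in higher precision; the update is B @ A.
A = [[0.1, 0.0, -0.1]]  # shape (1, 3)
B = [[0.0], [0.0]]      # shape (2, 1), zero-initialised so the update starts at 0

def forward(x):
    base = matvec([dequantize(c, s) for c, s in W_q], x)  # dequantize on the fly
    update = matvec(B, matvec(A, x))                      # B @ (A @ x)
    return [b + u for b, u in zip(base, update)]

print(forward([1.0, 1.0, 1.0]))  # base model output only, since B is all zeros
```

During finetuning only `A` and `B` would receive gradient updates, while `W_q` stays fixed, which is where the memory saving comes from.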
## Related
- [[Parameter-Efficient Finetuning]]
- [[LoRA]]
- [[Inference Latency]]
- [[KV Cache]]
## Sources
- [[AI Engineering - Chip Huyen]]