## Definition
**LoRA (Low-Rank Adaptation)** is a parameter-efficient finetuning technique (Hu et al., 2021) that adapts a pretrained model by decomposing each target weight matrix into the product of two small trainable matrices, then merging them back into the original weights after training — adding zero inference latency.
## Mechanism
Given a frozen weight matrix $W \in \mathbb{R}^{n \times m}$, LoRA introduces two trainable matrices $A \in \mathbb{R}^{n \times r}$ and $B \in \mathbb{R}^{r \times m}$ where the rank $r \ll \min(n, m)$. The adapted weight is:
$W' = W + \frac{\alpha}{r} \cdot AB$
- Only $A$ and $B$ are updated during finetuning; $W$ is frozen.
- $\alpha$ is a scaling hyperparameter controlling how much the adapter contributes.
- After training, $AB$ can be merged into $W$ to produce $W'$ — no extra computation at inference.
For GPT-3 (175B), Hu et al. achieved comparable or better performance than full finetuning on several tasks using only ~4.7M trainable parameters (0.0027% of total).
## Why Low-Rank Works
Pre-training implicitly compresses a model's intrinsic dimension. The weight updates needed for a downstream task lie in a low-dimensional subspace, so a low-rank factorisation is sufficient to capture them. Larger, better-pretrained models tend to have lower intrinsic dimensions — making LoRA more effective on stronger base models.
## Key Hyperparameters
**Rank $r$** — typically between 4 and 64; surprisingly, a small $r$ is usually sufficient. Increasing $r$ beyond a threshold yields no further gain and can cause overfitting.
**Scaling factor $\alpha$** — the ratio $\alpha : r$ usually lies between 1:8 and 8:1. A smaller $r$ generally benefits from a larger $\alpha$.
**Which matrices to target** — most commonly applied to attention query ($W_q$), key ($W_k$), value ($W_v$), and output projection ($W_o$) matrices; extending LoRA to feedforward layers can yield additional gains within memory budgets.
## Multi-LoRA Serving
Because LoRA adapters are small and modular, a single base model can serve many LoRA-finetuned variants simultaneously:
- **Merged mode**: merge $AB$ into $W'$ before serving — zero overhead, but requires a separate copy of the full model per adapter.
- **Separate mode**: keep $W$, $A$, $B$ distinct; merge at inference time — adds latency but allows one shared base model and 100× smaller storage per customer/task variant.
At rank $r = 8$ on a 4096×4096 weight matrix, the LoRA adapter holds only ~65K parameters versus 16.8M for the full matrix — a 256× reduction per weight. Apple used multi-LoRA with a single 3B-parameter base model to serve multiple on-device iPhone features (2024).
## QLoRA
QLoRA (Dettmers et al., 2023) quantises the frozen base model weights to 4-bit (NF4 format, exploiting the near-Gaussian distribution of pretrained weights), dequantising to BF16 only during the forward and backward pass. LoRA adapters are trained in 16-bit. This allows finetuning a 65B-parameter model on a single 48 GB GPU. See [[Quantization]].
## Limitations
- Does not match the absolute ceiling of full finetuning on sufficiently large datasets.
- Requires understanding the target model's architecture to select matrices correctly.
- NF4 dequantisation (QLoRA) adds time overhead, increasing training duration.
## Related
- [[Parameter-Efficient Finetuning]]
- [[Fine-Tuning]]
- [[Quantization]]
- [[Model Merging]]
- [[Embedding]]
## Sources
- [[AI Engineering - Chip Huyen]]