## Definition **LoRA (Low-Rank Adaptation)** is a parameter-efficient finetuning technique (Hu et al., 2021) that adapts a pretrained model by decomposing each target weight matrix into the product of two small trainable matrices, then merging them back into the original weights after training — adding zero inference latency. ## Mechanism Given a frozen weight matrix $W \in \mathbb{R}^{n \times m}$, LoRA introduces two trainable matrices $A \in \mathbb{R}^{n \times r}$ and $B \in \mathbb{R}^{r \times m}$ where the rank $r \ll \min(n, m)$. The adapted weight is: $W' = W + \frac{\alpha}{r} \cdot AB$ - Only $A$ and $B$ are updated during finetuning; $W$ is frozen. - $\alpha$ is a scaling hyperparameter controlling how much the adapter contributes. - After training, $AB$ can be merged into $W$ to produce $W'$ — no extra computation at inference. For GPT-3 (175B), Hu et al. achieved comparable or better performance than full finetuning on several tasks using only ~4.7M trainable parameters (0.0027% of total). ## Why Low-Rank Works Pre-training implicitly compresses a model's intrinsic dimension. The weight updates needed for a downstream task lie in a low-dimensional subspace, so a low-rank factorisation is sufficient to capture them. Larger, better-pretrained models tend to have lower intrinsic dimensions — making LoRA more effective on stronger base models. ## Key Hyperparameters **Rank $r$** — typically between 4 and 64; surprisingly, a small $r$ is usually sufficient. Increasing $r$ beyond a threshold yields no further gain and can cause overfitting. **Scaling factor $\alpha$** — the ratio $\alpha : r$ usually lies between 1:8 and 8:1. A smaller $r$ generally benefits from a larger $\alpha$. **Which matrices to target** — most commonly applied to attention query ($W_q$), key ($W_k$), value ($W_v$), and output projection ($W_o$) matrices; extending LoRA to feedforward layers can yield additional gains within memory budgets. ## Multi-LoRA Serving Because LoRA adapters are small and modular, a single base model can serve many LoRA-finetuned variants simultaneously: - **Merged mode**: merge $AB$ into $W'$ before serving — zero overhead, but requires a separate copy of the full model per adapter. - **Separate mode**: keep $W$, $A$, $B$ distinct; merge at inference time — adds latency but allows one shared base model and 100× smaller storage per customer/task variant. At rank $r = 8$ on a 4096×4096 weight matrix, the LoRA adapter holds only ~65K parameters versus 16.8M for the full matrix — a 256× reduction per weight. Apple used multi-LoRA with a single 3B-parameter base model to serve multiple on-device iPhone features (2024). ## QLoRA QLoRA (Dettmers et al., 2023) quantises the frozen base model weights to 4-bit (NF4 format, exploiting the near-Gaussian distribution of pretrained weights), dequantising to BF16 only during the forward and backward pass. LoRA adapters are trained in 16-bit. This allows finetuning a 65B-parameter model on a single 48 GB GPU. See [[Quantization]]. ## Limitations - Does not match the absolute ceiling of full finetuning on sufficiently large datasets. - Requires understanding the target model's architecture to select matrices correctly. - NF4 dequantisation (QLoRA) adds time overhead, increasing training duration. ## Related - [[Parameter-Efficient Finetuning]] - [[Fine-Tuning]] - [[Quantization]] - [[Model Merging]] - [[Embedding]] ## Sources - [[AI Engineering - Chip Huyen]]