Parameter-Efficient Finetuning - Albert Masoliver's learning site

## Definition

**Parameter-efficient finetuning (PEFT)** is a family of techniques that adapt a pretrained model to a specific task by updating only a small subset of its parameters, or a small set of newly added ones. The goal is performance close to full finetuning with orders of magnitude fewer trainable parameters and far less memory.

## Why Full Finetuning Is Impractical at Scale

During full finetuning every model parameter is trainable. For a 7B-parameter model in FP16, the weights alone take ~14 GB. Training with Adam stores three further values per trainable parameter (the gradient plus two optimizer states), which adds ~42 GB at the same precision and brings the total to ~56 GB before activations are counted. That is beyond most consumer and mid-tier GPUs. PEFT attacks this bottleneck by shrinking the number of trainable parameters.

## The Core Insight: Low Intrinsic Dimension

Pre-training implicitly minimises a model's intrinsic dimension: the number of degrees of freedom actually needed to represent a task. Aghajanyan et al. (2020) showed empirically that larger, better-trained LLMs tend to have lower intrinsic dimensions after pretraining, an observation that Hu et al. (2021) built on for LoRA. It follows that the changes made by finetuning can be captured in a low-dimensional subspace, so a small number of trainable parameters can steer the full model effectively.

## Two Main Families

### Adapter-Based (Additive) Methods

Additional trainable modules are inserted into (or alongside) the base model. During finetuning only these modules are updated; the original weights stay frozen. Examples:

- **Original adapters** (Houlsby et al., 2019): two bottleneck modules per transformer block. On BERT-large they came within 0.4% of full finetuning on GLUE while training only about 3% of the parameters, though they add inference latency because the adapters are extra layers.
- **LoRA** (Hu et al., 2021): the dominant method; uses low-rank matrices that can be merged into the base weights, so there is zero inference overhead after merging. See [[LoRA]].
- **IA3** (Liu et al., 2022): rescales activations rather than adding layers; strong for multi-task batching.
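
The memory arithmetic and the mergeable low-rank update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any library's implementation: the names `training_memory_gb` and `LoRALinear` are hypothetical, the layer sizes, rank and `alpha` are arbitrary choices, and the `alpha / rank` scaling and zero initialisation of `B` follow the usual LoRA convention rather than anything stated in this note.

```python
import numpy as np

BYTES_FP16 = 2  # bytes per value, assuming everything is kept in 16-bit precision


def training_memory_gb(total_params, trainable_params, bytes_per_value=BYTES_FP16):
    """Rough training memory: weights plus three values (gradient and two Adam
    states) per trainable parameter. Activations are ignored."""
    weights = total_params * bytes_per_value
    grads_and_states = 3 * trainable_params * bytes_per_value
    return (weights + grads_and_states) / 1e9


class LoRALinear:
    """Hypothetical linear layer with a frozen weight W and a low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus the low-rank path; W itself is never modified.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Folding the update into W gives a plain linear layer for inference.
        return self.W + self.scale * self.B @ self.A

    def trainable_fraction(self):
        # Trainable LoRA parameters relative to the frozen base weight.
        return (self.A.size + self.B.size) / self.W.size


rng = np.random.default_rng(0)
layer = LoRALinear(d_in=1024, d_out=1024, rank=8, alpha=16.0, rng=rng)
layer.B = rng.standard_normal(layer.B.shape) * 0.01  # stand-in for a trained B

x = rng.standard_normal((4, 1024))
same_output = np.allclose(layer.forward(x), x @ layer.merged_weight().T)

print("merged layer matches adapter forward pass:", same_output)
print("trainable fraction for this layer:", layer.trainable_fraction())
print("full finetuning, 7B params (GB):", training_memory_gb(7e9, 7e9))
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer. After training, `merged_weight()` replaces `W` with a matrix of the same shape, which is why a merged LoRA adds no extra layers at inference time.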

### Soft Prompt-Based Methods

Trainable continuous token vectors (soft prompts) are prepended to the input at one or more layers. Unlike hard prompts they are not human-readable and are optimised via backpropagation.

- **Prefix tuning** (Li and Liang, 2021): prepends soft tokens at every transformer layer.
- **Prompt tuning** (Lester et al., 2021): prepends only at the embedded input.
- **P-Tuning** (Liu et al., 2021): similar prepend strategy with slight differences in placement.

These methods are a cross between prompt engineering and finetuning: the model weights stay unchanged and only the soft-prompt vectors are learned.

## Practical Properties

| Property | Full Finetuning | PEFT (e.g., LoRA) |
|---|---|---|
| Trainable params | 100% of model | 0.001%–1% of model |
| Memory overhead | Very high | Low |
| Data needed | Thousands to millions of examples | Hundreds to thousands of examples |
| Inference overhead | None | None (after merge) |
| Multi-model serving | Costly (full copies) | Efficient (one base + adapters) |

PEFT methods are also generally sample-efficient: whereas full finetuning may require thousands to millions of examples, LoRA-based methods often deliver strong performance with a few hundred to a few thousand.

## Related

- [[Fine-Tuning]]
- [[LoRA]]
- [[Quantization]]
- [[Model Merging]]

## Sources

- [[AI Engineering - Chip Huyen]]