## Definition
**Parameter-efficient finetuning (PEFT)** is a family of techniques that adapt a pretrained model to a specific task by updating only a small subset of the model's parameters, or a small set of newly added ones, achieving performance close to full finetuning while using orders of magnitude fewer trainable parameters and far less memory.
## Why Full Finetuning Is Impractical at Scale
During full finetuning every model parameter is trainable. For a 7B-parameter model in FP16, the weights alone take ~14 GB (7B × 2 bytes). Training with Adam then stores three more values per trainable parameter (the gradient plus two optimizer states), adding ~42 GB and pushing the total to ~56 GB before activations are counted, which is beyond most consumer and mid-tier GPUs. PEFT attacks this bottleneck by shrinking the number of trainable parameters, and with it the gradient and optimizer-state memory.
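
The arithmetic above can be reproduced with a short back-of-the-envelope calculation. This is a rough sketch, not a profiler: the function name `finetune_memory_gb` is made up for illustration, it assumes every stored value takes 2 bytes and 1 GB = 1e9 bytes, and it ignores activations and framework overhead.

```python
def finetune_memory_gb(n_params, bytes_per_value=2, trainable_fraction=1.0):
    """Rough memory estimate in GB: (weights, training state, total).

    Assumes gradient + two Adam states per trainable parameter,
    each stored at bytes_per_value. Activations are ignored.
    """
    weights = n_params * bytes_per_value
    training_state = n_params * trainable_fraction * 3 * bytes_per_value
    return weights / 1e9, training_state / 1e9, (weights + training_state) / 1e9


# Full finetuning of a 7B model: every parameter is trainable.
full = finetune_memory_gb(7e9)

# Hypothetical PEFT setup where only 0.1% of the parameter count is trainable.
peft = finetune_memory_gb(7e9, trainable_fraction=0.001)

print(f"full: weights={full[0]:.1f} GB, state={full[1]:.1f} GB, total={full[2]:.1f} GB")
print(f"peft: weights={peft[0]:.1f} GB, state={peft[1]:.3f} GB, total={peft[2]:.2f} GB")
```

Under these assumptions the weights still have to be loaded in both cases; what PEFT removes is almost all of the gradient and optimizer-state memory.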
## The Core Insight: Low Intrinsic Dimension
Pretraining implicitly minimises a model's intrinsic dimension: the number of free parameters actually needed to reach good performance on a downstream task. Aghajanyan et al. (2020) showed empirically that larger, better-trained LLMs have lower intrinsic dimensions after pretraining, and Hu et al. (2021) built LoRA on the related hypothesis that the weight updates made during adaptation have low intrinsic rank. This means that the changes made by finetuning can be captured in a low-dimensional subspace, so a small number of trainable parameters can steer the full model effectively.
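
As a toy illustration of the subspace idea, the sketch below trains only a small vector `z` and maps it into the full parameter space through a fixed random projection `P`. Everything here is an assumption made for illustration: the sizes, the quadratic objective, and the fact that the target is constructed to lie exactly in the subspace (real tasks would at best be approximately low-dimensional).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 20                          # full vs. subspace dimension (toy values)

theta0 = rng.normal(size=D)                # stands in for frozen pretrained weights
P = rng.normal(size=(D, d)) / np.sqrt(D)   # fixed random projection, never trained
z_true = rng.normal(size=d)
target = theta0 + P @ z_true               # toy optimum, placed inside the subspace

z = np.zeros(d)                            # the only trainable parameters
for _ in range(100):
    theta = theta0 + P @ z                 # full parameters, reconstructed from z
    grad_z = P.T @ (theta - target)        # gradient of 0.5 * ||theta - target||^2 w.r.t. z
    z -= 0.5 * grad_z                      # plain gradient descent step

final_error = np.linalg.norm(theta0 + P @ z - target)
trainable_fraction = d / D

print(f"trainable fraction: {trainable_fraction:.2%}")
print(f"distance to target after training: {final_error:.2e}")
```

With `z = 0` the model is exactly the pretrained one, and all 10,000 parameters move while only 20 numbers are learned.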
## Two Main Families
### Adapter-Based (Additive) Methods
Additional trainable modules are inserted into (or alongside) the base model. During finetuning only these modules are updated; the original weights stay frozen. Examples:
- **Original adapters** (Houlsby et al., 2019): two bottleneck modules per transformer block. On GLUE with BERT-large they came within 0.4% of full finetuning while adding only about 3.6% parameters per task, though they add inference latency because the adapters are extra layers in the forward pass.
- **LoRA** (Hu et al., 2021): the dominant method; trains low-rank matrices that can be merged into the base weights, so it adds no inference overhead after merging. See [[LoRA]].
- **IA3** (Liu et al., 2022): rescales activations with learned vectors rather than adding layers; well suited to mixed-task batching.
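
To make the "mergeable low-rank matrices" idea concrete, here is a minimal NumPy sketch of a single LoRA-style linear layer. The dimensions, the `alpha / r` scaling, and the zero initialisation of `B` follow the usual description of LoRA, but the variable names and sizes are illustrative assumptions, and no training loop is shown (the "trained" `B` is just random values).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 4, 8      # toy sizes; r is the LoRA rank
scale = alpha / r

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init so the update starts at 0


def lora_forward(x, W, A, B):
    """Base output plus the scaled low-rank update, computed without forming B @ A."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(5, d_in))

# Before any training the layer behaves exactly like the frozen base layer.
starts_as_base = np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Pretend training has changed B (random values stand in for learned ones).
B = rng.normal(size=(d_out, r))

# Merging folds the update into W, so inference needs a single matmul again.
W_merged = W + scale * B @ A
merge_matches = np.allclose(lora_forward(x, W, A, B), x @ W_merged.T)

lora_params = A.size + B.size              # r * (d_in + d_out)
full_params = W.size                       # d_out * d_in
print(starts_as_base, merge_matches, lora_params, full_params)
```

Because `W_merged` has the same shape as `W`, the merged model is indistinguishable from a fully finetuned one at inference time, which is why no latency is added.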
### Soft Prompt-Based Methods
Trainable continuous token vectors (soft prompts) are prepended to the input at one or more layers. Unlike hard prompts they are not human-readable and are optimised via backpropagation.
- **Prefix tuning** (Li and Liang, 2021): prepends soft tokens at every transformer layer.
- **Prompt tuning** (Lester et al., 2021): prepends soft tokens only at the embedded input.
- **P-Tuning** (Liu et al., 2021): also works at the embedded input, but generates the soft tokens with a small trainable prompt encoder and can place them at positions other than the start.
These methods are a cross between prompt engineering and finetuning: the model weights are left unchanged, and only the soft-prompt vectors are learned.
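
The simplest variant, prepending soft tokens at the embedded input only, can be sketched in a few lines of NumPy. The sizes and the helper name `embed_with_soft_prompt` are hypothetical, and the sketch shows only how the input is assembled, not how the soft prompt is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_soft = 1000, 32, 8                  # toy sizes

embedding = rng.normal(size=(vocab_size, d_model))         # frozen embedding table
soft_prompt = rng.normal(size=(n_soft, d_model)) * 0.02    # the only trainable tensor


def embed_with_soft_prompt(token_ids):
    """Look up token embeddings and prepend the shared soft prompt to each sequence."""
    tok = embedding[token_ids]                             # (batch, seq, d_model)
    batch = tok.shape[0]
    prompt = np.broadcast_to(soft_prompt, (batch, n_soft, d_model))
    return np.concatenate([prompt, tok], axis=1)           # (batch, n_soft + seq, d_model)


token_ids = rng.integers(0, vocab_size, size=(2, 5))
inputs = embed_with_soft_prompt(token_ids)

print(inputs.shape)                 # sequence length grows by n_soft
print(soft_prompt.size, "trainable values")
```

The soft-prompt rows do not correspond to any vocabulary entry, which is why they are not human-readable the way a hard prompt is.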
## Practical Properties
| Property | Full Finetuning | PEFT (e.g., LoRA) |
|---|---|---|
| Trainable params | 100% of model | 0.001%–1% of model |
| Memory overhead | Very high | Low |
| Data needed (examples) | Thousands–millions | Hundreds–thousands |
| Inference overhead | None | None (after merge) |
| Multi-model serving | Costly (full copies) | Efficient (one base + adapters) |
PEFT methods are also generally sample-efficient: whereas full finetuning may require millions of examples, LoRA-based methods often deliver strong performance with a few hundred to a few thousand examples.
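
The "Trainable params" and "Multi-model serving" rows can be checked with rough arithmetic. The configuration below is a hypothetical one chosen for illustration (a 7B model with hidden size 4096 and 32 layers, LoRA of rank 8 on two square projection matrices per layer, 2 bytes per value, 100 finetuned variants); none of these numbers come from the source.

```python
def lora_param_count(d_model, n_layers, rank, matrices_per_layer):
    """LoRA parameters when each adapted d_model x d_model matrix gets
    A (rank x d_model) and B (d_model x rank)."""
    return n_layers * matrices_per_layer * rank * (d_model + d_model)


base_params = 7_000_000_000
adapter_params = lora_param_count(d_model=4096, n_layers=32, rank=8, matrices_per_layer=2)
trainable_pct = 100 * adapter_params / base_params

bytes_per_value = 2
n_variants = 100

# Serving 100 fully finetuned models means storing 100 full copies.
full_copies_gb = n_variants * base_params * bytes_per_value / 1e9

# With adapters, one shared base plus 100 small adapter files is enough.
shared_base_gb = (base_params + n_variants * adapter_params) * bytes_per_value / 1e9

print(f"adapter params: {adapter_params:,} ({trainable_pct:.3f}% of base)")
print(f"storage: {full_copies_gb:.0f} GB (full copies) vs {shared_base_gb:.2f} GB (base + adapters)")
```

Under these assumptions each adapter is only a few megabytes, so the cost of adding another task-specific variant is negligible next to the base model.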
## Related
- [[Fine-Tuning]]
- [[LoRA]]
- [[Quantization]]
- [[Model Merging]]
## Sources
- [[AI Engineering - Chip Huyen]]