Model Merging - Albert Masoliver's learning site

## Definition

**Model merging** is the process of combining the weights of multiple trained models into a single model that performs as well as, or better than, the individual constituent models. It often enables multi-task capability, a reduced memory footprint, or federated learning, frequently without requiring access to GPUs or training infrastructure.

## Motivation

Finetuning a model on multiple tasks sequentially risks **catastrophic forgetting**: the model forgets earlier tasks when trained on later ones. Simultaneous multi-task finetuning demands more data and compute. Model merging offers a third path: finetune each task independently in parallel, then merge the resulting models, combining their strengths without sequential degradation.

Other use cases include:

- on-device deployment (merging multiple task-specific models into one to fit limited device memory)
- model upscaling (creating larger models without training from scratch)
- federated learning (merging model copies trained on separate private datasets)

## Three Main Approaches

### 1. Summing

Add or interpolate the weight values of constituent models.

**Linear combination**: a weighted average of parameters:

$\text{Merge}(A, B) = \frac{w_A A + w_B B}{w_A + w_B}$

At $w_A = w_B = 1$ this is a simple average. It works surprisingly well and is most effective when the models share the same base model.

Linear combination can also be viewed through **task vectors**: subtracting the base model from a finetuned model yields a vector encoding the task. **Task arithmetic** (Ilharco et al., 2022) allows adding or subtracting task vectors to combine or suppress capabilities.

Before summing, methods such as TIES (Yadav et al., 2023) and DARE (Yu et al., 2023) prune redundant task-vector parameters, for example those with small magnitudes or with signs that conflict across tasks. Pruning up to 80% of task-vector parameters often has minimal performance impact on individual models but substantially improves the merged result.
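The linear combination and task-vector arithmetic described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any particular merging tool: models are stood in for by plain dicts of arrays, and the function names (`linear_merge`, `task_vector`, `apply_task_vectors`, `prune_by_magnitude`) are hypothetical. The pruning step is a simplified magnitude-only trim, assumed here for illustration; it is not the full TIES or DARE procedure.

```python
import numpy as np


def linear_merge(a, b, w_a=1.0, w_b=1.0):
    """Weighted average of two models with identical keys and shapes."""
    return {k: (w_a * a[k] + w_b * b[k]) / (w_a + w_b) for k in a}


def task_vector(finetuned, base):
    """Task vector = finetuned weights minus base weights."""
    return {k: finetuned[k] - base[k] for k in base}


def apply_task_vectors(base, vectors, scale=1.0):
    """Add scaled task vectors to the base; a negative scale suppresses a task."""
    merged = {k: v.copy() for k, v in base.items()}
    for tv in vectors:
        for k in merged:
            merged[k] += scale * tv[k]
    return merged


def prune_by_magnitude(tv, keep=0.2):
    """Keep roughly the top `keep` fraction of entries by |value| per tensor.

    Simplified illustration only: zeroes every entry below the magnitude
    threshold. keep=0.2 corresponds to pruning about 80% of the entries.
    """
    pruned = {}
    for k, v in tv.items():
        n_keep = max(1, int(round(keep * v.size)))
        threshold = np.sort(np.abs(v).ravel())[-n_keep]
        pruned[k] = np.where(np.abs(v) >= threshold, v, 0.0)
    return pruned


# Toy "models": one weight tensor each, all derived from the same base.
base = {"w": np.array([1.0, 2.0, 3.0, 4.0])}
model_a = {"w": np.array([1.5, 2.0, 3.0, 4.0])}  # finetuned on task A
model_b = {"w": np.array([1.0, 2.0, 2.0, 4.0])}  # finetuned on task B

# Simple average (w_A = w_B = 1).
averaged = linear_merge(model_a, model_b)

# Task arithmetic: add both task vectors back onto the base.
tv_a = task_vector(model_a, base)
tv_b = task_vector(model_b, base)
combined = apply_task_vectors(base, [tv_a, tv_b])

# Magnitude trim on a noisier task vector, keeping 40% of the entries.
noisy_tv = {"w": np.array([0.5, 0.01, -1.0, 0.02, 0.0])}
trimmed = prune_by_magnitude(noisy_tv, keep=0.4)
```

In this toy case the two task vectors touch different coordinates, so adding them to the base keeps both changes intact, whereas the simple average halves each one. With real checkpoints the same dict-of-tensors pattern would apply to a framework's saved weights, which is an assumption about the storage format rather than something the notes specify.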
**SLERP (Spherical Linear Interpolation)**: treats each weight vector as a point on a hypersphere and interpolates along the shortest arc. It is defined pairwise; to merge more than two models, apply it sequentially. Common in practice; implemented by standard merging tools.

### 2. Layer Stacking (Frankenmerging)

Take layers from different models and stack them to form a new model. The resulting model typically has a non-standard architecture and requires further finetuning. Notable example: Goliath-120B (2023), merged from two finetuned Llama 2-70B models.

Layer stacking can create **Mixture-of-Experts** models from dense checkpoints (Komatsuzaki et al., 2022): copy transformer layers, add a router, then train the merged model. It can also be used for **model upscaling**: creating a larger model from a smaller one via depthwise scaling (e.g., SOLAR 10.7B from a 7B base).

### 3. Concatenation

Concatenate the parameter matrices of constituent components (e.g., two LoRA adapters of ranks $r_1$ and $r_2$ become one of rank $r_1 + r_2$). This does not reduce memory relative to serving the models separately, so it is not recommended for most use cases.

## Merging vs Ensembling

Ensembling combines only model *outputs* (e.g., majority vote across three model responses), keeping each model intact. Model merging combines *parameters* into a single model. Ensembling generally achieves better accuracy but at higher inference cost (multiple forward passes per request). Many top models on public LLM leaderboards are merged models.

## Related

- [[Fine-Tuning]]
- [[Parameter-Efficient Finetuning]]
- [[LoRA]]
- [[Mixture-of-Experts]]

## Sources

- [[AI Engineering - Chip Huyen]]