Model Parallelism - Albert Masoliver's learning site

## Definition

**Model parallelism** refers to the family of strategies that split a single large model across multiple hardware devices so that no single device must hold or compute the entire model. It is the counterpart to *replica parallelism* (which creates independent copies of the model) and is necessary when a model's memory footprint exceeds what one accelerator can hold.

## Why It Is Needed

A model's weights must fit in accelerator memory for it to run. For a 70B-parameter model in FP16 (2 bytes/param), the weights alone take ~140 GB, which is more than any single consumer GPU and many data-centre GPUs can hold. Model parallelism distributes these weights across devices and parallelises the computation.

## Tensor Parallelism (Intra-Operator)

The dominant strategy for inference. Individual tensor operations, primarily matrix multiplications, are partitioned across devices. For example, a weight matrix can be split column-wise: each device computes its share of the product, and the results are combined across devices (typically with an all-reduce) at the end of each layer.

- Reduces per-device memory roughly in proportion to the number of devices.
- Reduces latency: devices compute in parallel on sub-matrices.
- Incurs all-reduce communication overhead per layer, so works best with high-bandwidth interconnects (NVLink).

## Pipeline Parallelism (Inter-Operator)

The model's layers are split into *stages*, each assigned to a different device. Data flows sequentially from one stage to the next, with micro-batches used to keep all stages occupied simultaneously.

- Also reduces per-device memory.
- Increases per-request latency due to inter-stage communication and pipeline fill/drain overhead (the *pipeline bubble*).
- More suited to training (where throughput matters) than inference (where latency matters).

## Replica Parallelism (Data Parallelism at Serving Time)

Technically not model parallelism, but often combined with it. Multiple full or sharded model copies run in parallel to serve independent requests, increasing throughput without changing per-request latency.

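
As a rough illustration of the column-wise split described under Tensor Parallelism and the memory arithmetic under Why It Is Needed, here is a minimal pure-Python sketch. It simulates devices as list entries in a single process; the helper names (`split_columns`, `column_parallel_matmul`, `weight_memory_gb`) are hypothetical and not taken from any library. The combining step here is a plain concatenation of output columns, whereas real systems use collective operations over an interconnect.

```python
"""Toy simulation of column-wise tensor parallelism (no real devices involved)."""


def matmul(x, w):
    """Multiply matrix x (m x k) by matrix w (k x n), both given as lists of rows."""
    return [
        [sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]


def split_columns(w, num_devices):
    """Split w column-wise into num_devices equal shards (hypothetical helper)."""
    n = len(w[0])
    if n % num_devices != 0:
        raise ValueError("column count must be divisible by num_devices")
    width = n // num_devices
    return [
        [row[d * width:(d + 1) * width] for row in w]
        for d in range(num_devices)
    ]


def column_parallel_matmul(x, w, num_devices):
    """Each simulated device multiplies x by its own shard of w."""
    shards = split_columns(w, num_devices)
    # On real hardware these products would run concurrently, one per device.
    partials = [matmul(x, shard) for shard in shards]
    # Combine step: stitch each device's output columns back together, row by row.
    return [
        [value for partial in partials for value in partial[r]]
        for r in range(len(x))
    ]


def weight_memory_gb(num_params, bytes_per_param, num_devices=1):
    """Per-device weight memory in GB (1 GB = 1e9 bytes); counts weights only."""
    return num_params * bytes_per_param / num_devices / 1e9


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

# The sharded result matches the unsharded product.
assert column_parallel_matmul(x, w, num_devices=2) == matmul(x, w)

print(weight_memory_gb(70_000_000_000, 2))                 # 140.0
print(weight_memory_gb(70_000_000_000, 2, num_devices=4))  # 35.0
```

Each simulated device only ever holds a quarter or half of `w`, which is where the per-device memory saving comes from. The figures count weights only; activations and the KV cache need additional memory.
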
## Context and Sequence Parallelism

For very long inputs, the input sequence itself can be partitioned across devices (context parallelism) or individual operators required for the full sequence can be split (sequence parallelism). Both aim to make long-context inference tractable.

## Practical Guidance (Inference)

| Goal | Preferred strategy |
|---|---|
| Serve a model too large for one GPU | Tensor parallelism |
| Maximise throughput, tolerate latency | Pipeline + tensor |
| Low-latency serving, model fits on one node | Replica parallelism |
| Very long context | Context / sequence parallelism |

Across use cases, tensor parallelism and replica parallelism are the most impactful inference strategies (Huyen, 2024).

## Related

- [[KV Cache]]
- [[Inference Latency]]
- [[Prefill-Decode Disaggregation]]
- [[Continuous Batching]]
- [[Mixture-of-Experts]]

## Sources

- [[AI Engineering - Chip Huyen]]