Mixture-of-Experts - Albert Masoliver's learning site

## Definition

**Mixture-of-Experts (MoE)** is a sparse architecture in which the feed-forward layer is split into many "expert" subnetworks, and a small router activates only a few of them per [[Token]]. The model stores a very large number of weights but uses only a fraction of them on any given forward pass.

## Total vs active parameters

This is the whole point, so it is worth stating plainly:

- **Total parameters**: every weight the model stores. Determines memory footprint.
- **Active parameters**: the weights actually used to process one token. Determines per-token compute, and with it cost and latency.

In a dense model these two numbers are equal. In MoE they come apart.

## The Mixtral example

**Mixtral 8x7B** has 8 experts in each feed-forward layer. Naively that sounds like 8 × 7B = 56B, but only the feed-forward blocks are replicated; attention and the other layers are shared, which brings the *total* to **46.7B** parameters. The router picks only 2 experts per token, so the *active* count is about **12.9B**. You get knowledge capacity closer to that of a 47B model at roughly the per-token compute of a ~13B model, while still needing memory for all 46.7B weights.

```
total  ≈ 46.7B   (lives in memory)
active ≈ 12.9B   (runs per token → cost/speed)
```

## Decoupling size from cost

MoE breaks the old assumption that a bigger model is necessarily a slower, pricier model. By routing, you can scale capacity (total params) far faster than you scale the compute bill (active params). This is one reason frontier vendors increasingly ship MoE under the hood.

## It complicates scaling laws

Classic [[Scaling Laws]] relate capability to *the* parameter count, but which one? For MoE, total params track knowledge capacity while active params track the FLOPs budget, and the two no longer move together. Reasoning about an MoE model means tracking both numbers separately; see [[Parameter]].

## Related

- [[Parameter]]
- [[Scaling Laws]]
- [[Token]]
- [[Transformer Architecture]]
- [[Foundation Model]]
- [[Model Card]]
- [[Hands-On Large Language Models - Alammar, Grootendorst]]
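
## Worked sketch: parameter accounting and top-2 routing

The total/active split described in this note can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated in the note: that the parameter counts follow the simple linear model `total = shared + E * expert` and `active = shared + k * expert`, and that the router keeps the `k` largest logits and softmax-normalises them. The function names `split_params` and `route_top_k` and the example logits are hypothetical.

```python
import math


def split_params(total_b, active_b, num_experts, top_k):
    """Back-solve shared and per-expert sizes (in billions of parameters).

    Assumed simplified model (not an exact layer-by-layer count):
        total  = shared + num_experts * expert
        active = shared + top_k       * expert
    """
    expert = (total_b - active_b) / (num_experts - top_k)
    shared = total_b - num_experts * expert
    return shared, expert


def route_top_k(router_logits, top_k):
    """Pick the top_k experts for one token and normalise their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(chosen, exps)]


# Figures taken from the Mixtral example in this note.
TOTAL_B, ACTIVE_B, NUM_EXPERTS, TOP_K = 46.7, 12.9, 8, 2

shared_b, expert_b = split_params(TOTAL_B, ACTIVE_B, NUM_EXPERTS, TOP_K)
active_fraction = ACTIVE_B / TOTAL_B

# Hypothetical router logits for a single token, one per expert.
picks = route_top_k([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.9], TOP_K)

print(f"shared ~ {shared_b:.2f}B, per expert ~ {expert_b:.2f}B")
print(f"active fraction ~ {active_fraction:.1%}")
print("experts used for this token:", picks)
```

Under that simplified model, the note's two figures imply roughly 5.6B parameters per expert and about 1.6B shared, which is why 8 experts total well under 56B. Only the two selected experts run for the example token; the other six stay in memory but contribute no compute.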