Parameter - Albert Masoliver's learning site

## Definition

A **parameter** is a single learned weight inside a neural network, adjusted during [[Pretraining]] to minimize prediction error. The count of parameters is the headline number on a model's spec sheet and a rough proxy for its raw capacity.

## The scale-up

Parameter counts exploded across the GPT line, and the jumps were order-of-magnitude:

| Model | Parameters |
| --- | --- |
| GPT-1 | 117M |
| GPT-2 | 1.5B |
| GPT-3 | 175B |

GPT-2 was roughly 13 times the size of GPT-1, and GPT-3 more than 100 times the size of GPT-2. The leap in capability that came with that growth is what kicked off the modern [[Large Language Model]] era.

## More is not free

A larger parameter count only helps when it is *matched by enough training data*. Hoffmann et al. (DeepMind, 2022), the Chinchilla paper, showed that the GPT-3-era giants were badly *undertrained*: for compute-optimal training you want roughly **20 tokens per parameter**. Pour parameters in without the data to feed them and you waste capacity. This is the heart of [[Scaling Laws]].

## Bigger is not better

Because data quality, training recipe, and architecture improve over time, a newer small model can match or beat an older, much larger one. A well-trained **Llama 3-8B** is competitive with the older **Llama 2-70B**, and ahead of it on many benchmarks, despite having roughly a ninth of the parameters. Treat the parameter count as one input, never the verdict; see the [[Model Card]] for what actually matters.

## You pick, you don't tune

As a practitioner you almost never touch individual parameters. You *select* a pre-trained model whose weights are already fixed, then steer it with prompts, [[Fine-Tuning]], or retrieval. Even fine-tuning updates weights in bulk through further training, never one by one. The weights are the vendor's artifact; your job is orchestration.

## Active vs total

The single number is also getting slippery. In a [[Mixture-of-Experts]] model, only a fraction of parameters fire per [[Token]], so "total parameters" and "active parameters" diverge. The active count drives per-token compute and speed, while the total count still determines how much memory the model needs.
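The arithmetic behind two of the points above can be sketched in a few lines of Python. This is a rough illustration, not a planning tool: it applies the approximate 20-tokens-per-parameter rule to the GPT parameter counts from the table, and computes the active share of a Mixture-of-Experts model. The function names and the MoE numbers (64B total, 8B active) are hypothetical, chosen only to make the example concrete.

```python
# Rule-of-thumb arithmetic for parameter counts (illustrative only).

TOKENS_PER_PARAM = 20  # approximate Chinchilla ratio; a rule of thumb, not an exact law


def compute_optimal_tokens(params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate number of training tokens for a compute-optimal run."""
    return params * tokens_per_param


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a Mixture-of-Experts model's parameters used per token."""
    if not 0 < active_params <= total_params:
        raise ValueError("active_params must be in (0, total_params]")
    return active_params / total_params


# Parameter counts taken from the GPT table in this note.
GPT_PARAMS = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}

if __name__ == "__main__":
    for name, params in GPT_PARAMS.items():
        tokens = compute_optimal_tokens(params)
        print(f"{name}: {params:.3g} params -> ~{tokens:.3g} tokens")

    # Hypothetical MoE model with made-up sizes: 64B total, 8B active per token.
    print(f"active share: {active_fraction(64e9, 8e9):.1%}")
```

Under this rule, a 175B-parameter model would call for roughly 3.5 trillion training tokens, which gives a sense of how much data "enough" means at that scale.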
## Related

- [[Scaling Laws]]
- [[Mixture-of-Experts]]
- [[Model Card]]
- [[Pretraining]]
- [[Large Language Model]]
- [[Foundation Model]]
- [[AI Engineering - Chip Huyen]]