## Definition
**Cross-entropy loss** is the standard loss function for classification with neural networks. For a true distribution $p$ and predicted distribution $q$, the cross-entropy is:
$
H(p, q) = -\sum_c p(c) \log q(c)
$
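Minimising cross-entropy is equivalent to maximising the log-likelihood of the observed labels under the model's distribution: when $p$ is the empirical (one-hot) label distribution, $H(p, q)$ is the negative log-probability the model assigns to the correct label.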
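The definition can be sketched directly in plain Python. This is a minimal illustration, assuming natural logarithms (so the result is in nats) and a made-up 3-class example; the function name `cross_entropy` and the distributions are hypothetical.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c), in nats (natural log)."""
    # Terms with p(c) == 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0)

# Hypothetical example: the true label is class 0, so p is one-hot.
p = [1.0, 0.0, 0.0]
q_good = [0.9, 0.05, 0.05]   # most mass on the correct class
q_bad = [0.1, 0.45, 0.45]    # most mass on the wrong classes

print(cross_entropy(p, q_good))  # about 0.105, i.e. -log(0.9)
print(cross_entropy(p, q_bad))   # about 2.303, i.e. -log(0.1)
```

With a one-hot $p$, the loss is just the negative log-probability of the correct class, which is why lower cross-entropy means higher likelihood.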
## Binary Cross-Entropy
For binary labels $y \in \{0, 1\}$ and predicted probability $\hat p$ of the positive class ($y = 1$):
$
L = -[y \log \hat p + (1 - y) \log(1 - \hat p)]
$
The standard pairing: **sigmoid output + binary cross-entropy**.
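A minimal sketch of the sigmoid + binary cross-entropy pairing, in plain Python. The clamping constant `eps` and the example logit are illustrative assumptions, not part of the definition.

```python
import math

def sigmoid(z):
    # Maps a real-valued logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, p_hat, eps=1e-12):
    # Clamp p_hat away from 0 and 1 so log never receives 0
    # (eps is an illustrative choice).
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat))

p_hat = sigmoid(2.0)                    # about 0.881
print(binary_cross_entropy(1, p_hat))   # about 0.127 (prediction agrees with label)
print(binary_cross_entropy(0, p_hat))   # about 2.127 (prediction disagrees with label)
```

Only one of the two terms is active for any given label: the first when $y = 1$, the second when $y = 0$.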
## Categorical Cross-Entropy
For one-hot labels $y_c$ across $K$ classes:
$
L = -\sum_{c=1}^K y_c \log \hat p_c
$
Because $y_c$ is 1 for the true class and 0 elsewhere, the sum collapses to a single term: $-\log \hat p_{c_{\text{true}}}$. The standard pairing: **softmax output + categorical cross-entropy**.
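The collapse to a single term can be checked with a small sketch. The logits and the choice of true class below are hypothetical, and the helper names are for illustration only.

```python
import math

def softmax(logits):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_onehot, p_hat):
    # Full sum over classes; classes with y_c == 0 contribute nothing.
    return -sum(y * math.log(p) for y, p in zip(y_onehot, p_hat) if y > 0)

logits = [2.0, 1.0, 0.1]   # hypothetical network outputs for K = 3 classes
true_class = 0
p_hat = softmax(logits)
y = [1.0 if c == true_class else 0.0 for c in range(len(logits))]

full_sum = categorical_cross_entropy(y, p_hat)
single_term = -math.log(p_hat[true_class])
print(full_sum, single_term)  # both about 0.417
```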
## Logits Form (Numerically Stable)
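Computing softmax and then taking its log as two separate steps is numerically unstable: for large-magnitude logits, $e^{z}$ can overflow, and a probability that underflows to 0 gives $\log 0 = -\infty$. Frameworks provide a combined operator that works on logits directly:
- PyTorch: `nn.CrossEntropyLoss` (takes raw logits and integer class labels).
- TensorFlow: `tf.nn.softmax_cross_entropy_with_logits`.
Internally these use the **log-sum-exp trick** for numerical stability.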
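The log-sum-exp trick can be sketched in plain Python. This is an illustration of the idea under simple assumptions, not the frameworks' actual implementation; the function names and the extreme logits are hypothetical. It uses the identity $-\log \text{softmax}(z)_c = \log \sum_j e^{z_j} - z_c$ and shifts by $\max_j z_j$ before exponentiating.

```python
import math

def log_sum_exp(logits):
    # log(sum(exp(z))) computed as m + log(sum(exp(z - m))) with m = max(z),
    # so the largest exponent is exp(0) = 1 and nothing overflows.
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_logits(logits, true_class):
    # -log softmax(z)[c] = logsumexp(z) - z[c]
    return log_sum_exp(logits) - logits[true_class]

def naive_cross_entropy(logits, true_class):
    # Softmax first, log second: exp overflows for large logits.
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[true_class] / sum(exps))

big = [1000.0, 0.0, -1000.0]   # deliberately extreme logits

print(cross_entropy_from_logits(big, 0))  # 0.0
print(cross_entropy_from_logits(big, 1))  # 1000.0

try:
    naive_cross_entropy(big, 0)
except OverflowError as err:
    print("naive version failed:", err)
```

For moderate logits the two versions agree; the stable form only matters once `exp` would overflow or a probability would underflow to zero.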
## Why Cross-Entropy, Not MSE, for Classification
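With sigmoid + MSE, the gradient with respect to the logit $z$ carries a factor $\hat p (1 - \hat p)$, which vanishes when the sigmoid saturates. The gradient is therefore small even when the model is confidently wrong, and learning is slow. With sigmoid + cross-entropy that factor cancels, so the gradient is *proportional to the error* regardless of confidence, and a confidently wrong prediction is corrected quickly.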
Mathematically, with $z$ the logit (pre-activation) and $\hat p = \sigma(z)$:
$
\frac{\partial L_{\text{CE}}}{\partial z} = \hat p - y
$
The gradient is linear in the error $\hat p - y$. The same expression holds per class for softmax + categorical cross-entropy.
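The difference in gradients can be made concrete with a small sketch. It assumes the MSE loss is written as $\tfrac{1}{2}(\hat p - y)^2$ and uses a hypothetical, confidently wrong logit; a finite-difference check confirms the analytic cross-entropy gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    # Binary cross-entropy as a function of the logit z.
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def grad_ce(z, y):
    # Analytic d(BCE)/dz: the sigmoid derivative cancels, leaving p - y.
    return sigmoid(z) - y

def grad_mse(z, y):
    # Analytic d/dz of 0.5 * (p - y)^2 (assumed MSE form):
    # the p * (1 - p) factor comes from the sigmoid derivative.
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

def numeric_grad(f, z, h=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(z + h) - f(z - h)) / (2 * h)

z, y = -6.0, 1   # hypothetical case: label is 1, model is confident it is 0

print(grad_ce(z, y))                        # about -0.9975
print(grad_mse(z, y))                       # about -0.0025
print(numeric_grad(lambda t: bce(t, y), z)) # close to grad_ce(z, y)
```

Here the cross-entropy gradient is roughly 400 times larger in magnitude than the MSE gradient, which is the slow-learning effect described above.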
## Connection to KL Divergence
$
H(p, q) = H(p) + D_{\text{KL}}(p \| q)
$
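Since the entropy $H(p)$ of the true distribution does not depend on the model, minimising cross-entropy over $q$ is equivalent to minimising the KL divergence $D_{\text{KL}}(p \| q)$ between the true and predicted distributions. For one-hot labels $H(p) = 0$, so the two quantities coincide.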
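The decomposition can be verified numerically with a short sketch. The two distributions below are arbitrary illustrative values, and the helper names are hypothetical.

```python
import math

def entropy(p):
    # H(p) = -sum_c p(c) log p(c), skipping zero-probability classes.
    return -sum(pc * math.log(pc) for pc in p if pc > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) log q(c)
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0)

def kl_divergence(p, q):
    # D_KL(p || q) = sum_c p(c) log(p(c) / q(c))
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

p = [0.7, 0.2, 0.1]   # illustrative "true" distribution
q = [0.5, 0.3, 0.2]   # illustrative model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(abs(lhs - rhs) < 1e-12)  # True

# With a one-hot p, H(p) = 0 and cross-entropy equals the KL divergence.
one_hot = [1.0, 0.0, 0.0]
print(entropy(one_hot) == 0)  # True
```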
## Smoothed Labels
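**Label smoothing** replaces one-hot targets with a soft distribution: assign $1 - \epsilon$ to the true class and $\epsilon / (K-1)$ to each of the other classes (typically $\epsilon = 0.1$). This regularises the model by discouraging overconfident predictions. A common variant, used in the Inception-v3 paper, instead mixes the one-hot target with a uniform distribution, giving $1 - \epsilon + \epsilon / K$ to the true class and $\epsilon / K$ to the rest.
Label smoothing is used in many modern training recipes (Inception-v3, transformers, recent vision models).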
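A minimal sketch of the $1 - \epsilon$ and $\epsilon / (K-1)$ scheme described above, in plain Python. The function name `smooth_labels` and the example predictions are hypothetical; library implementations may use a different smoothing convention.

```python
import math

def smooth_labels(true_class, num_classes, epsilon=0.1):
    # 1 - epsilon on the true class, epsilon / (K - 1) on each other class,
    # so the targets still sum to 1.
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if c == true_class else off_value
            for c in range(num_classes)]

def cross_entropy(targets, p_hat):
    # Cross-entropy against soft targets: every class can now contribute.
    return -sum(t * math.log(p) for t, p in zip(targets, p_hat) if t > 0)

targets = smooth_labels(true_class=2, num_classes=5)
print(targets)  # roughly [0.025, 0.025, 0.9, 0.025, 0.025]

# Hypothetical predictions: one nearly one-hot, one matching the soft targets.
overconfident = [0.0005, 0.0005, 0.998, 0.0005, 0.0005]
matched = [0.025, 0.025, 0.9, 0.025, 0.025]

print(cross_entropy(targets, overconfident) > cross_entropy(targets, matched))  # True
```

Under smoothed targets the loss is minimised when the prediction equals the soft distribution, so pushing the true-class probability towards 1 is penalised rather than rewarded.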
## Modern Context
Cross-entropy is the loss behind:
- Image classification.
- LLM next-token prediction (autoregressive language modelling).
- Contrastive learning (InfoNCE).
- Policy-gradient losses in reinforcement learning (a return-weighted negative log-likelihood of the actions taken).
Almost every deep learning classifier uses some form of cross-entropy.
## Related
- [[Loss Functions]]
- [[Logistic Regression]]
- [[Multilayer Perceptron]]