Cross-Entropy Loss - Albert Masoliver's learning site

## Definition

**Cross-entropy loss** is the standard loss function for classification with neural networks. For a true distribution $p$ and predicted distribution $q$, the cross-entropy is:

$$ H(p, q) = -\sum_c p(c) \log q(c) $$

Minimising cross-entropy is equivalent to maximising the log-likelihood of the observed labels under the model's distribution.

## Binary Cross-Entropy

For a binary label $y \in \{0, 1\}$ and predicted probability $\hat p$ of the positive class:

$$ L = -[y \log \hat p + (1 - y) \log(1 - \hat p)] $$

The standard pairing: **sigmoid output + binary cross-entropy**.

## Categorical Cross-Entropy

For one-hot labels $y_c$ across $K$ classes:

$$ L = -\sum_{c=1}^K y_c \log \hat p_c $$

Because the label is one-hot, the sum collapses to a single term: $-\log \hat p_{c_{\text{true}}}$.

The standard pairing: **softmax output + categorical cross-entropy**.

## Logits Form (Numerically Stable)

Naively computing softmax and then taking the log is numerically unstable for large-magnitude logits: the exponentials can overflow, or a probability can underflow to zero so that its log is $-\infty$. Frameworks provide a combined operator that takes logits directly:

- PyTorch: `nn.CrossEntropyLoss` (logits in, integer class labels in).
- TensorFlow: `tf.nn.softmax_cross_entropy_with_logits` (logits in, one-hot or soft labels in).

Internally these use the **log-sum-exp trick** for numerical stability, subtracting the maximum logit $m = \max_c z_c$ before exponentiating:

$$ \log \sum_c e^{z_c} = m + \log \sum_c e^{z_c - m} $$

## Why Cross-Entropy, Not MSE, for Classification

Let $z$ be the logit, so that $\hat p = \sigma(z)$. For sigmoid + MSE with $L_{\text{MSE}} = \tfrac{1}{2}(\hat p - y)^2$, the gradient is

$$ \frac{\partial L_{\text{MSE}}}{\partial z} = (\hat p - y)\, \hat p (1 - \hat p) $$

The factor $\hat p (1 - \hat p)$ vanishes when the sigmoid saturates, so the gradient is small when the model is confidently wrong: slow learning. For sigmoid + cross-entropy, that factor cancels and the gradient is *proportional to the error*, regardless of confidence: fast learning. Mathematically:

$$ \frac{\partial L_{\text{CE}}}{\partial z} = \hat p - y $$

Beautifully linear in the error. The same form holds per class for softmax + categorical cross-entropy.

## Connection to KL Divergence

$$ H(p, q) = H(p) + D_{\text{KL}}(p \| q) $$

Since $H(p)$ doesn't depend on the model, minimising cross-entropy is equivalent to minimising the KL divergence $D_{\text{KL}}(p \| q)$ between the true and predicted distributions.

## Smoothed Labels

**Label smoothing** replaces one-hot targets with a soft distribution: assign $1 - \epsilon$ to the true class and $\epsilon / (K-1)$ to each of the other classes (typically $\epsilon = 0.1$).
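
The soft-target construction above can be sketched in plain Python for a single example. This is a minimal illustration under stated assumptions, not any framework's implementation: the function names (`log_softmax`, `smooth_labels`, `cross_entropy`) and the example logits are made up for this page, and `log_softmax` uses the log-sum-exp trick so that the deliberately large logits do not overflow.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed with the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps every exp() argument <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Soft target: 1 - epsilon on the true class, epsilon / (K - 1) on each other class."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if c == true_class else off_value
            for c in range(num_classes)]

def cross_entropy(target, logits):
    """H(target, softmax(logits)) = -sum_c target[c] * log q[c]."""
    log_q = log_softmax(logits)
    return -sum(t * lq for t, lq in zip(target, log_q))

# Illustrative logits, large enough that a naive math.exp(1000.0) would overflow.
logits = [1000.0, 998.0, 995.0]
hard = [1.0, 0.0, 0.0]           # one-hot target for class 0
soft = smooth_labels(0, 3)       # smoothed target for class 0

loss_hard = cross_entropy(hard, logits)
loss_soft = cross_entropy(soft, logits)
print(f"one-hot loss:  {loss_hard:.4f}")
print(f"smoothed loss: {loss_soft:.4f}")
```

With these logits the smoothed loss comes out larger than the one-hot loss, because part of the target mass now sits on classes the model rates as unlikely. Some libraries instead mix the one-hot target with a uniform distribution over all $K$ classes, which is a slightly different convention from the one sketched here.
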
Label smoothing regularises the model by preventing it from becoming overconfident. It is used in many modern training recipes (Inception-v3, transformers, recent vision models).

## Modern Context

Cross-entropy is the loss behind:

- Image classification.
- LLM next-token prediction (autoregressive language modelling).
- Contrastive learning (InfoNCE).
- Policy losses in reinforcement learning.

Almost every deep learning classifier uses some form of cross-entropy.

## Related

- [[Loss Functions]]
- [[Logistic Regression]]
- [[Multilayer Perceptron]]