## Definition

**Long Short-Term Memory (LSTM)** (Hochreiter & Schmidhuber, 1997) is a recurrent neural network architecture designed to handle long-range dependencies. It introduces a separate **cell state** that flows through time with minimal modification, controlled by *gates* that decide what to keep, what to write, and what to output.

## Architecture

An LSTM cell maintains two states: the hidden state $h_t$ and the cell state $c_t$. Below, $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state with the current input. At each step:

### Forget gate

$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

Decides what fraction of the previous cell state to keep: $f_t \approx 1$ retains it, $f_t \approx 0$ erases it.

### Input gate

$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$

$$
\tilde c_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)
$$

The candidate $\tilde c_t$ is the new information on offer; $i_t$ decides how much of it to write.

### Cell state update

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t
$$

The cell state is updated additively with element-wise gating. The direct path from $c_{t-1}$ to $c_t$ passes through no weight matrix and no squashing nonlinearity, which is what protects gradients along the cell-state path.

### Output gate

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$

$$
h_t = o_t \odot \tanh(c_t)
$$

Decides how much of the (squashed) cell state to expose as the hidden state.

## Why It Solves Vanishing Gradients

The cell state has an *additive* update path. Along that direct path (ignoring the indirect dependence of the gates on $h_{t-1}$), $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$, so the gradient across $k$ steps is scaled element-wise by the product of the forget gates over those steps. When the forget gate stays open ($f_t \approx 1$), that product stays close to 1 and the gradient does not decay exponentially. When $f_t$ sits well below 1 the gradient still shrinks, so the LSTM mitigates the problem rather than eliminating it.

In practice, LSTMs can learn dependencies spanning hundreds of steps, far beyond what vanilla RNNs manage, although training becomes harder as sequences grow longer.

## Historical Impact

LSTM was the workhorse of:

- **Machine translation** before transformers (~2014-2017).
- **Speech recognition**, before transformer-based models took over.
- **Time-series forecasting.**
- **Handwriting generation, text generation.**

Andrej Karpathy's 2015 blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" popularised LSTM-based text generation.

## Variants

- **Peephole connections**: let the gates see the cell state directly.
- **Coupled gates**: tie the forget and input gates ($i_t = 1 - f_t$).
- **[[Gated Recurrent Unit]] (GRU)**: a simpler alternative that merges the cell and hidden states.
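To make the gate equations from the Architecture section concrete, here is a minimal NumPy sketch of a single LSTM step, with an optional switch for the coupled-gate variant listed above. It is an illustration under stated assumptions, not a reference implementation: the function names `lstm_step` and `init_params`, the `coupled` flag, and the dictionary layout of the parameters are all hypothetical choices made for this note.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (hypothetical layout for this sketch).

    Each W_* has shape (hidden_size, hidden_size + input_size) because it acts
    on the concatenation [h_{t-1}, x_t].
    """
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "c", "o"):
        params[f"W_{gate}"] = rng.normal(
            0.0, 0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


def lstm_step(x_t, h_prev, c_prev, params, coupled=False):
    """One time step of an LSTM cell; returns (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]

    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    if coupled:
        i_t = 1.0 - f_t                                    # coupled-gate variant
    else:
        i_t = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate values

    c_t = f_t * c_prev + i_t * c_tilde                     # additive cell update

    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                               # exposed hidden state
    return h_t, c_t
```

One way to see the additive path at work: set `b_f` to a large positive value and `b_i` to a large negative value, so the forget gate is nearly open and the input gate nearly closed. Repeatedly calling `lstm_step` then leaves `c_t` almost unchanged over hundreds of steps, which is the "memory" behaviour the gating is meant to provide.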
## Modern Status

Transformers (2017+) largely replaced LSTMs for natural language processing. LSTMs remain useful for:

- **Streaming / online inference** with strict latency budgets, since each step needs only the previous state rather than the whole history.
- **Memory-constrained edge devices.**
- **Time-series with strong temporal structure.**

## Why It's Pedagogically Important

Even if you'll deploy transformers, understanding LSTM teaches the role of gating, the importance of additive paths, and the construction of memory mechanisms. Many modern architectures, such as state-space models, revisit these ideas.

## Related

- [[Recurrent Neural Network]]
- [[Gated Recurrent Unit]]
- [[Vanishing and Exploding Gradients]]
- [[Transformer Architecture]]