## Definition
**Long Short-Term Memory (LSTM)** (Hochreiter & Schmidhuber, 1997) is a recurrent neural network architecture designed to learn long-range dependencies in sequences. It introduces a separate **cell state** that flows through time with minimal modification, controlled by *gates* that decide what to forget, what to add, and what to output.
## Architecture
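An LSTM cell maintains two states: the hidden state $h_t$ and the cell state $c_t$. Here $\sigma$ is the logistic sigmoid, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ is the element-wise product. Each step computes: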
### Forget gate
$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$
Decides what fraction of each component of the previous cell state to keep: values of $f_t$ near 1 retain the component, values near 0 forget it.
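To make the concatenation $[h_{t-1}, x_t]$ and the shapes concrete, here is a minimal NumPy sketch of the forget gate. The sizes, the random initialisation, and the variable names are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 2            # illustrative sizes (assumed)
rng = np.random.default_rng(0)

# W_f maps the concatenated vector (hidden_size + input_size) to hidden_size.
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)     # h_{t-1}
x_t = rng.normal(size=input_size)         # x_t

concat = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ concat + b_f)         # one value in (0, 1) per cell-state component
```

The gate has the same length as the cell state, so each component of $c_{t-1}$ gets its own keep fraction.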
### Input gate
$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$
$
\tilde c_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)
$
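The candidate $\tilde c_t$ proposes new values for the cell state, and the input gate $i_t$ decides how much of each candidate component to add.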
### Cell state update
$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t
$
The cell state is updated by element-wise gating and addition rather than by repeated multiplication with a weight matrix, so the gradient along the cell-state path is scaled only by the forget gate at each step.
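A small worked example may help show how the two gates interact in the update. The numbers below are made up for illustration: each position shows one extreme or a mixed case.

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])     # c_{t-1}
f_t = np.array([1.0, 0.0, 0.5])         # keep all, keep nothing, keep half
i_t = np.array([0.0, 1.0, 0.5])         # write nothing, write all, write half
c_tilde = np.array([0.9, 0.3, -0.4])    # candidate values

# c_t = f_t ⊙ c_{t-1} + i_t ⊙ c~_t, with ⊙ as NumPy's element-wise "*"
c_t = f_t * c_prev + i_t * c_tilde
# c_t is approximately [2.0, 0.3, 0.05]
```

The first component is carried over unchanged, the second is overwritten by the candidate, and the third blends the two.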
### Output gate
$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$
$
h_t = o_t \odot \tanh(c_t)
$
The output gate decides how much of the squashed cell state $\tanh(c_t)$ to expose as the hidden state $h_t$.
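Putting the four equations together, one time step can be sketched in NumPy as follows. This is a minimal, unoptimised sketch: the `params` dictionary layout, the `init_params` helper, and the initialisation scale are assumptions made for illustration, and production libraries typically fuse the four matrix products into one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params holds W_f, W_i, W_c, W_o of shape (H, H + D)
    and b_f, b_i, b_c, b_o of shape (H,).
    """
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate
    c_t = f_t * c_prev + i_t * c_tilde                     # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                               # hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params[f"W_{name}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

# Run a 5-step sequence of 4-dimensional inputs through a 3-unit cell.
params = init_params(input_size=4, hidden_size=3)
h, c = np.zeros(3), np.zeros(3)
xs = np.random.default_rng(1).normal(size=(5, 4))
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)
```

Because $h_t$ is a sigmoid output times a $\tanh$, every component of the hidden state stays strictly inside $(-1, 1)$, while the cell state itself is not bounded in this way.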
## Why It Solves Vanishing Gradients
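The cell state has an *additive* update path. Treating the gates as constants, $\partial c_t / \partial c_{t-1} = \mathrm{diag}(f_t)$, so the gradient across many steps along this path is scaled by the element-wise product of the intervening forget gates. When the forget gate stays close to 1 this product shrinks slowly, instead of decaying exponentially through repeated weight-matrix and activation-derivative factors as in a vanilla RNN.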
In practice, LSTMs can learn dependencies spanning hundreds of steps, and in favourable cases more, which is far beyond what vanilla RNNs manage.
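The effect of the forget gate on the gradient can be checked numerically. The sketch below looks only at the direct cell-state path and holds the forget gate at a fixed value for all steps, which ignores the indirect dependence of the gates on $h_{t-1}$. The gate values 0.99 and 0.5, the step count, and the helper `run_cell_path` are assumptions chosen for illustration.

```python
import numpy as np

def run_cell_path(c0, forget, writes):
    """Roll c_t = f_t * c_{t-1} + w_t forward; w_t stands in for i_t * c~_t."""
    c = c0
    for f_t, w_t in zip(forget, writes):
        c = f_t * c + w_t
    return c

steps = 100
rng = np.random.default_rng(0)
writes = rng.normal(scale=0.1, size=steps)

# Analytic gradient of c_T with respect to c_0: the product of the forget gates.
grad_open = np.prod(np.full(steps, 0.99))   # roughly 0.37
grad_half = np.prod(np.full(steps, 0.5))    # roughly 8e-31

# Central finite difference as a sanity check for the open-gate case.
eps = 1e-4
open_gate = np.full(steps, 0.99)
fd_open = (
    run_cell_path(eps, open_gate, writes) - run_cell_path(-eps, open_gate, writes)
) / (2 * eps)
```

With the gate near 1, about a third of the gradient signal survives 100 steps. With the gate at 0.5 it is effectively zero, so the additive path helps only to the extent that the network learns to keep the forget gate open.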
## Historical Impact
LSTM was the workhorse of:
- **Machine translation** before transformers (~2014-2017).
- **Speech recognition** (including early end-to-end models, until attention-based architectures took over).
- **Time-series forecasting.**
- **Handwriting generation, text generation.**
Andrej Karpathy's 2015 blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" popularised LSTM-based text generation.
## Variants
- **Peephole connections** — let gates see cell state directly.
- **Coupled gates** — tie forget and input gates ($i_t = 1 - f_t$).
- **[[Gated Recurrent Unit]] (GRU)** — simpler alternative.
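As an illustration of the coupled-gate variant, the sketch below replaces the separate input gate with $1 - f_t$. The function name and argument layout are hypothetical. The rest of the step follows the standard equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coupled_lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_c, b_c, W_o, b_o):
    """LSTM step with tied gates: i_t = 1 - f_t, so no W_i or b_i is needed."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)
    i_t = 1.0 - f_t                          # coupled input gate
    c_tilde = np.tanh(W_c @ z + b_c)
    c_t = f_t * c_prev + i_t * c_tilde       # convex combination of old and new
    o_t = sigmoid(W_o @ z + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

H, D = 3, 4                                  # illustrative sizes (assumed)
rng = np.random.default_rng(0)
W_f, W_c, W_o = (rng.normal(size=(H, H + D)) for _ in range(3))
b_f, b_c, b_o = (np.zeros(H) for _ in range(3))

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(20, D)):
    h, c = coupled_lstm_step(x_t, h, c, W_f, b_f, W_c, b_c, W_o, b_o)
```

One consequence of the coupling is that $c_t$ becomes a convex combination of $c_{t-1}$ and $\tilde c_t$, so a cell state that starts inside $[-1, 1]$ stays there. The variant also drops one weight matrix and one bias vector.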
## Modern Status
Transformers (2017+) largely replaced LSTMs for natural language. LSTMs remain useful for:
- **Streaming / online inference** with strict latency.
- **Memory-constrained edge devices.**
- **Time-series with strong temporal structure.**
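One reason recurrent cells suit streaming is that everything needed from the past is held in the fixed-size pair $(h_t, c_t)$, so each new input costs the same amount of work and memory regardless of how long the stream has run. The sketch below illustrates this with a hypothetical `StreamingLSTM` class. Stacking the four gate matrices into a single `W` is an implementation convenience assumed here, not something the equations require.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StreamingLSTM:
    """Hypothetical helper: keeps (h, c) between calls so input can arrive piecemeal."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # Gate weights stacked row-wise in the order f, i, candidate, o.
        self.W = rng.normal(
            scale=0.1, size=(4 * hidden_size, hidden_size + input_size)
        )
        self.b = np.zeros(4 * hidden_size)
        self.h = np.zeros(hidden_size)
        self.c = np.zeros(hidden_size)

    def push(self, x_t):
        """Consume one input vector and return the new hidden state."""
        z = self.W @ np.concatenate([self.h, x_t]) + self.b
        f, i, g, o = np.split(z, 4)
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        return self.h

xs = np.random.default_rng(1).normal(size=(6, 4))

# Whole sequence available at once.
offline = StreamingLSTM(4, 3)
for x_t in xs:
    h_offline = offline.push(x_t)

# Same data arriving in three chunks; only (h, c) is carried between chunks.
online = StreamingLSTM(4, 3)
for chunk in (xs[:2], xs[2:5], xs[5:]):
    for x_t in chunk:
        h_online = online.push(x_t)
```

Both runs end in the same hidden state, and neither needs to store earlier inputs.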
## Why It's Pedagogically Important
Even if you only deploy transformers, understanding the LSTM teaches the role of gating, the importance of additive paths, and the construction of memory mechanisms. Several modern architectures, such as state-space models, revisit these ideas.
## Related
- [[Recurrent Neural Network]]
- [[Gated Recurrent Unit]]
- [[Vanishing and Exploding Gradients]]
- [[Transformer Architecture]]