## Definition
A **Recurrent Neural Network (RNN)** processes a sequence by maintaining a hidden state that is updated at each time step. The same weights are applied at every step, which amounts to weight sharing across time. RNNs and their gated variants were the dominant neural sequence architecture from ~1995 to ~2017, before transformers took over.
## The Core Equation
At each step $t$, with $\sigma$ an elementwise nonlinearity (typically $\tanh$):
$$
h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b)
$$
$$
y_t = W_{hy} h_t + b_y
$$
The hidden state $h_t$ is a function of the previous hidden state and the current input — that's the *recurrence*.
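The two equations can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes $\sigma = \tanh$, and the function names, layer sizes, and random weights are hypothetical choices made only for the example.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b, W_hy, b_y):
    """One recurrence step; returns (h_t, y_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b)  # assumption: sigma = tanh
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

def rnn_forward(xs, h0, W_hh, W_xh, b, W_hy, b_y):
    """Apply the same weights at every time step of xs (shape: T x input_dim)."""
    h, ys = h0, []
    for x_t in xs:
        h, y_t = rnn_step(h, x_t, W_hh, W_xh, b, W_hy, b_y)
        ys.append(y_t)
    return np.stack(ys), h

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
hidden, inp, out, T = 4, 3, 2, 5
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_xh = rng.normal(scale=0.5, size=(hidden, inp))
W_hy = rng.normal(scale=0.5, size=(out, hidden))
b, b_y = np.zeros(hidden), np.zeros(out)

xs = rng.normal(size=(T, inp))
ys, h_T = rnn_forward(xs, np.zeros(hidden), W_hh, W_xh, b, W_hy, b_y)
print(ys.shape, h_T.shape)
```

Note that the loop cannot be vectorised over time: each `h` depends on the previous one, which is exactly the recurrence.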
## Unrolling Through Time
For training, "unroll" the RNN over $T$ time steps and run [[Backpropagation]] through the unrolled graph — **Backpropagation Through Time (BPTT)**. The whole sequence becomes one very deep feedforward network with shared weights.
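A small sketch of BPTT for the gradient with respect to $W_{hh}$, checked against finite differences. To keep it short it makes several simplifying assumptions that are not part of the note: $\sigma = \tanh$, no output layer, and a squared-error loss placed directly on each hidden state. All function names are hypothetical.

```python
import numpy as np

def forward(xs, h0, W_hh, W_xh, b):
    """Unroll the RNN; hs[t] is h_t and hs[0] is the initial state."""
    hs = [h0]
    for x_t in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x_t + b))
    return hs

def loss(hs, targets):
    # Assumption: squared error on every hidden state (no output layer).
    return 0.5 * sum(np.sum((h - y) ** 2) for h, y in zip(hs[1:], targets))

def bptt_grad_W_hh(xs, targets, h0, W_hh, W_xh, b):
    """Walk the unrolled graph backwards, accumulating dL/dW_hh."""
    hs = forward(xs, h0, W_hh, W_xh, b)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros_like(h0)                 # gradient arriving from step t+1
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + dh_next  # local loss term + future steps
        da = dh * (1.0 - hs[t] ** 2)             # back through tanh
        dW_hh += np.outer(da, hs[t - 1])         # shared weights: contributions add up
        dh_next = W_hh.T @ da                    # hand the gradient to h_{t-1}
    return dW_hh

rng = np.random.default_rng(0)
H, D, T = 3, 2, 6
W_hh = rng.normal(scale=0.5, size=(H, H))
W_xh = rng.normal(scale=0.5, size=(H, D))
b, h0 = np.zeros(H), np.zeros(H)
xs, targets = rng.normal(size=(T, D)), rng.normal(size=(T, H))

analytic = bptt_grad_W_hh(xs, targets, h0, W_hh, W_xh, b)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(W_hh)
for i in range(H):
    for j in range(H):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss(forward(xs, h0, W_plus, W_xh, b), targets)
                         - loss(forward(xs, h0, W_minus, W_xh, b), targets)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

The line `dW_hh += ...` is where weight sharing shows up: every time step contributes to the gradient of the same matrix.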
## The Vanishing Gradient Problem
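Because the same $W_{hh}$ multiplies the hidden state at every step, the gradient that BPTT sends back across $k$ time steps contains a product of $k$ Jacobians, each of which involves $W_{hh}$. Over long sequences that product shrinks toward zero or blows up, so gradients vanish or explode (see [[Vanishing and Exploding Gradients]]). This severely limits the span of dependencies a plain RNN can learn.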
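The effect is easy to see numerically. The sketch below assumes a linear recurrence ($\sigma$ = identity), so the Jacobian across $k$ steps is simply $W_{hh}^k$; with $\tanh$ the derivative is at most 1, which only adds further shrinkage. The helper name and the scale factors 0.9 and 1.1 are illustrative choices.

```python
import numpy as np

def jacobian_norms(W_hh, steps):
    """Spectral norm of W_hh^k for k = 1..steps (assumes sigma = identity)."""
    J = np.eye(W_hh.shape[0])
    norms = []
    for _ in range(steps):
        J = W_hh @ J
        norms.append(np.linalg.norm(J, 2))
    return np.array(norms)

rng = np.random.default_rng(0)
# QR of a random matrix gives an orthogonal Q: every singular value equals 1.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

shrinking = jacobian_norms(0.9 * Q, 50)  # norm after k steps is 0.9**k
stable = jacobian_norms(Q, 50)           # norm stays at 1
growing = jacobian_norms(1.1 * Q, 50)    # norm after k steps is 1.1**k

print(shrinking[-1], stable[-1], growing[-1])
```

Scaling the singular values slightly below or above 1 is enough to move the gradient by orders of magnitude over 50 steps.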
## Solutions
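- **[[Long Short-Term Memory]] (LSTM):** a gated cell state that is updated additively, so information and gradients can persist without repeated multiplication by $W_{hh}$.
- **[[Gated Recurrent Unit]] (GRU):** simpler gating with fewer parameters and comparable performance on many tasks.
- **Gradient clipping:** rescales gradients whose norm exceeds a threshold. It bounds explosions but does nothing for vanishing gradients.
- **Orthogonal initialisation:** starts $W_{hh}$ with all singular values equal to 1, so that early in training repeated multiplication neither shrinks nor amplifies the signal.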
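The last two items are simple enough to sketch directly. The code below shows one common form of each, clipping by global norm and orthogonal initialisation via QR decomposition; the function names and the threshold are hypothetical, and deep-learning frameworks ship their own versions of both.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # never scales gradients up
    return [g * scale for g in grads], total

def orthogonal_init(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))  # per-column sign fix; Q stays orthogonal

# Two toy gradients with combined norm sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

untouched, _ = clip_by_global_norm(grads, max_norm=10.0)  # already under the threshold

W_hh = orthogonal_init(4, np.random.default_rng(0))
print(norm_before, norm_after)
```

Clipping the global norm, rather than each array separately, preserves the direction of the overall update.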
## Sequence-to-Sequence
RNNs can be combined into encoder-decoder architectures:
- **Encoder RNN** consumes the input sequence and produces a final hidden state.
- **Decoder RNN** generates the output sequence conditioned on that state.
These models were used for machine translation, summarisation, and dialogue from roughly 2014 to 2017, before transformers replaced them.
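A minimal encoder-decoder sketch with greedy decoding is shown below. It is untrained, so the generated token ids are arbitrary; the point is the data flow. Several things here are assumptions made for the example: $\sigma = \tanh$, a single embedding table shared by source and target, the token ids used for start and end, and all function names.

```python
import numpy as np

def init_rnn(rng, hidden, inp):
    """Random (untrained) recurrent weights: (W_hh, W_xh, b)."""
    return (rng.normal(scale=0.5, size=(hidden, hidden)),
            rng.normal(scale=0.5, size=(hidden, inp)),
            np.zeros(hidden))

def step(h, x, params):
    W_hh, W_xh, b = params
    return np.tanh(W_hh @ h + W_xh @ x + b)

def encode(src_ids, embed, params):
    """Consume the input sequence; the final hidden state summarises it."""
    h = np.zeros(params[0].shape[0])
    for i in src_ids:
        h = step(h, embed[i], params)
    return h

def decode(h, embed, params, W_hy, b_y, start_id, end_id, max_len):
    """Generate greedily, feeding each predicted token back in as the next input."""
    token, out = start_id, []
    for _ in range(max_len):
        h = step(h, embed[token], params)
        token = int(np.argmax(W_hy @ h + b_y))
        if token == end_id:
            break
        out.append(token)
    return out

rng = np.random.default_rng(0)
vocab, emb_dim, hidden = 6, 4, 8
embed = rng.normal(size=(vocab, emb_dim))  # assumption: shared source/target embeddings
enc_params = init_rnn(rng, hidden, emb_dim)
dec_params = init_rnn(rng, hidden, emb_dim)
W_hy, b_y = rng.normal(size=(vocab, hidden)), np.zeros(vocab)

context = encode([2, 3, 4], embed, enc_params)
output_ids = decode(context, embed, dec_params, W_hy, b_y,
                    start_id=0, end_id=1, max_len=10)
print(output_ids)
```

The only link between the two networks is `context`, a single fixed-size vector, regardless of how long the input was.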
## Why Transformers Replaced RNNs
- **Parallelisation.** RNNs compute step-by-step; transformers process all positions in parallel.
- **Long dependencies.** Attention provides direct paths between any two positions; RNNs propagate through every intermediate step.
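- **Scaling.** Transformers train more reliably at large model and dataset sizes than RNNs, and parallel training makes that scale affordable, although self-attention cost grows quadratically with sequence length.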
## Modern Status (2026)
- **RNNs are largely legacy** for new NLP work.
- **Niche uses remain:** streaming inference (real-time speech, low-memory edge devices), problems where causal recurrence is naturally appropriate.
- **2023+ revival:** state-space and linear-recurrence models (Mamba, RWKV, RetNet) revisit RNN-style recurrence at scale, competing with transformers on long sequences.
## Variants and Successors
- **Bidirectional RNNs.** Process the sequence in both directions; combine the hidden states.
- **Deep RNNs.** Stack multiple RNN layers.
- **LSTM, GRU.** Gated variants that mitigate vanishing gradients.
- **Mamba (2023+).** Selective state-space model that runs as a recurrence at inference time but trains in parallel across the sequence.
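As an illustration of the first variant, a bidirectional layer can be sketched by running two independent RNNs, one over the sequence and one over its reverse, and concatenating their states per position. The code assumes $\sigma = \tanh$ and concatenation as the combining rule; function names and sizes are hypothetical.

```python
import numpy as np

def init_params(rng, hidden, inp):
    """Random (untrained) weights: (W_hh, W_xh, b)."""
    return (rng.normal(scale=0.5, size=(hidden, hidden)),
            rng.normal(scale=0.5, size=(hidden, inp)),
            np.zeros(hidden))

def run_rnn(xs, W_hh, W_xh, b):
    """Return the hidden state at every time step, shape (T, hidden)."""
    h, hs = np.zeros(W_hh.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional(xs, fwd, bwd):
    h_f = run_rnn(xs, *fwd)              # left-to-right pass
    h_b = run_rnn(xs[::-1], *bwd)[::-1]  # right-to-left pass, re-aligned to input order
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
T, D, H = 4, 3, 5
fwd, bwd = init_params(rng, H, D), init_params(rng, H, D)
xs = rng.normal(size=(T, D))

out = bidirectional(xs, fwd, bwd)
print(out.shape)
```

After the pass, the state at position 0 already reflects the last input through its backward half, which is why bidirectional RNNs suit tagging and encoding but not left-to-right generation.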
## Related
- [[Long Short-Term Memory]]
- [[Gated Recurrent Unit]]
- [[Vanishing and Exploding Gradients]]
- [[Transformer Architecture]]