Recurrent Neural Network - Albert Masoliver's learning site

## Definition

A **Recurrent Neural Network (RNN)** processes sequences by maintaining a hidden state that is updated at each time step. The same weights are applied at every step (weight sharing across time). RNNs were the dominant sequence architecture from ~1995 to ~2017, before transformers took over.

## The Core Equation

At each step $t$:

$$
h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b)
$$

$$
y_t = W_{hy} h_t + b_y
$$

Here $\sigma$ is an elementwise nonlinearity (typically $\tanh$). The hidden state $h_t$ is a function of the previous hidden state and the current input: that's the *recurrence*.

## Unrolling Through Time

For training, "unroll" the RNN over $T$ time steps and run [[Backpropagation]] through the unrolled graph. This is **Backpropagation Through Time (BPTT)**. The whole sequence becomes one very deep feedforward network with shared weights.

## The Vanishing Gradient Problem

Because the same $W_{hh}$ multiplies the hidden state at every step, BPTT accumulates products of these matrices. For long sequences, gradients vanish or explode (see [[Vanishing and Exploding Gradients]]). This severely limits the effective sequence length RNNs can handle.

## Solutions

- **[[Long Short-Term Memory]] (LSTM):** a gated, additively updated cell state preserves information without repeated multiplication by $W_{hh}$.
- **[[Gated Recurrent Unit]] (GRU):** simpler gating, comparable performance.
- **Gradient clipping** to bound explosions.
- **Better initialisation** (orthogonal initialisation).

## Sequence-to-Sequence

RNNs can be combined into encoder-decoder architectures:

- **Encoder RNN** consumes the input sequence and produces a final hidden state.
- **Decoder RNN** generates the output sequence conditioned on that state.

This design was used for machine translation, summarisation, and dialogue (~2014-2017) before transformers replaced it.

## Why Transformers Replaced RNNs

- **Parallelisation.** RNNs compute step-by-step; transformers process all positions in parallel.
- **Long dependencies.** Attention provides direct paths between any two positions; RNNs propagate through every intermediate step.
- **Scaling.** Transformers scale to long sequences and large models more reliably than RNNs.

## Modern Status (2026)

- **RNNs are largely legacy** for new NLP work.
- **Niche uses remain:** streaming inference (real-time speech, low-memory edge devices) and problems where causal recurrence is naturally appropriate.
- **2024+ revival:** state-space models such as Mamba, along with related recurrent architectures (RWKV, RetNet), revisit RNN-style recurrence at scale, competing with transformers on long sequences.

## Variants and Successors

- **Bidirectional RNNs.** Process the sequence in both directions; combine the hidden states.
- **Deep RNNs.** Stack multiple RNN layers.
- **LSTM, GRU.** Address vanishing gradients.
- **Mamba (2023+).** Selective state-space model with RNN-like recurrence and transformer-like throughput.

## Related

- [[Long Short-Term Memory]]
- [[Gated Recurrent Unit]]
- [[Vanishing and Exploding Gradients]]
- [[Transformer Architecture]]
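
## Example: The Core Equation in Code

As a rough illustration of the core equation, the sketch below runs the recurrence over a short sequence. It is a minimal sketch under stated assumptions, not a reference implementation: the nonlinearity is taken to be $\tanh$, plain Python lists stand in for tensors, and the helper names (`matvec`, `rnn_step`, `rnn_forward`) and the toy weight values are all made up for this example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh, b):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b).

    tanh is an assumed choice for the nonlinearity sigma.
    """
    rec = matvec(W_hh, h_prev)   # contribution of the previous hidden state
    inp = matvec(W_xh, x)        # contribution of the current input
    return [math.tanh(r + i + b_k) for r, i, b_k in zip(rec, inp, b)]

def rnn_forward(xs, h0, W_hh, W_xh, b, W_hy, b_y):
    """Run the RNN over a sequence, reusing the same weights at every step."""
    h = h0
    hs, ys = [], []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh, b)
        # Readout: y_t = W_hy h_t + b_y
        y = [o + c for o, c in zip(matvec(W_hy, h), b_y)]
        hs.append(h)
        ys.append(y)
    return hs, ys

# Hypothetical toy sizes: 2 hidden units, 1 input feature, 1 output.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b = [0.0, 0.0]
W_hy = [[1.0, 1.0]]
b_y = [0.0]

xs = [[1.0], [0.0], [0.0]]   # an impulse followed by two zero inputs
hs, ys = rnn_forward(xs, [0.0, 0.0], W_hh, W_xh, b, W_hy, b_y)

for t, h in enumerate(hs, start=1):
    print(f"h_{t} = {h}")
```

With these toy weights the only nonzero input arrives at the first step, and each later hidden state is smaller in magnitude than the one before it. That is a forward-pass view of how repeated application of the same $W_{hh}$ can wash out early information.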