Gated Recurrent Unit - Albert Masoliver's learning site

## Definition

The **Gated Recurrent Unit (GRU)** (Cho et al., 2014) is a simplified [[Long Short-Term Memory|LSTM]] with two gates instead of three and a single hidden state instead of separate hidden and cell states. It achieves performance comparable to LSTM with fewer parameters and faster computation.

## Architecture

Here $[a, b]$ denotes vector concatenation, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. Bias terms are omitted.

### Update gate

$ z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) $

Decides how much of the previous hidden state to retain versus replace with the new candidate. With the hidden state update below, $z_t \approx 0$ keeps $h_{t-1}$ and $z_t \approx 1$ overwrites it with $\tilde h_t$. It plays the role of LSTM's forget and input gates combined.

### Reset gate

$ r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) $

Controls how much of the previous hidden state is used when computing the candidate.

### Candidate hidden state

$ \tilde h_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) $

### Hidden state update

$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t $

## GRU vs LSTM

| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| Internal states | 2 (hidden + cell) | 1 (hidden) |
| Parameters | More | ~25% fewer |
| Training speed | Slower per step | Faster |
| Long sequences | Slightly better on extremely long sequences | Comparable on most |
| Empirical performance | Slightly better on some tasks | Slightly better on others |

In practice, the two are interchangeable for most applications. Pick based on:

- **Parameter budget:** GRU.
- **Very long sequences (>1000 steps):** LSTM is often slightly better.
- **Small datasets:** GRU is often slightly better (fewer parameters → less overfitting).
- **Default starting point:** GRU for simplicity; switch to LSTM if GRU underperforms.

## Why GRU Works

The mechanism is the same as in LSTM: the gates create an additive path through time. When $z_t \approx 0$, $h_t \approx h_{t-1}$, so information (and its gradient) crosses a time step without being multiplied by the recurrent weights and squashed by a nonlinearity again. The single update gate is enough to control both information addition and decay; the second gate (reset) controls how much past context informs the new candidate.

## Empirical Studies

Greff et al. (2017) and others have extensively compared LSTM, its variants, and GRU; the overall conclusion is **no significant difference** on most tasks.
Architecture details matter less than the gating mechanism itself.

## Modern Status

As with LSTM, GRUs have been largely displaced by transformers for natural language. They are still seen in:

- Sequence-to-sequence models for small or resource-constrained settings.
- Time-series forecasting.
- Speech and audio processing with streaming constraints.

## Related

- [[Long Short-Term Memory]]
- [[Recurrent Neural Network]]
- [[Vanishing and Exploding Gradients]]
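
## Worked Example

The four equations in the Architecture section can be traced with a minimal pure-Python sketch. The function names (`gru_step`, `matvec`, `weight_count`), the layer sizes, and the random weights are illustrative assumptions rather than part of any library, and biases are omitted to match the formulas as written. The last lines also check the "~25% fewer parameters" figure by counting three weight blocks for the GRU against four for an LSTM of the same size.

```python
import math
import random


def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def gru_step(h_prev, x, W_z, W_r, W):
    """One GRU step following the note's equations (no bias terms)."""
    hx = h_prev + x                                   # concatenation [h_{t-1}, x_t]
    z = sigmoid(matvec(W_z, hx))                      # update gate z_t
    r = sigmoid(matvec(W_r, hx))                      # reset gate r_t
    gated = [ri * hi for ri, hi in zip(r, h_prev)]    # r_t * h_{t-1}, element-wise
    h_cand = [math.tanh(a) for a in matvec(W, gated + x)]  # candidate state
    # Interpolate: z_t near 0 keeps h_{t-1}, z_t near 1 takes the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


def random_matrix(rows, cols, rng):
    """Small random weights, purely for illustration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def weight_count(n_blocks, hidden, inputs):
    """Weights in n_blocks matrices of shape hidden x (hidden + inputs)."""
    return n_blocks * hidden * (hidden + inputs)


rng = random.Random(0)
hidden, inputs = 4, 3  # hypothetical sizes
W_z, W_r, W = (random_matrix(hidden, hidden + inputs, rng) for _ in range(3))

# Run the cell over a two-step toy sequence, starting from a zero state.
h = [0.0] * hidden
for x in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h = gru_step(h, x, W_z, W_r, W)

gru_weights = weight_count(3, hidden, inputs)   # W_z, W_r, W
lstm_weights = weight_count(4, hidden, inputs)  # forget, input, output, candidate
print(h)
print(gru_weights, lstm_weights, 1 - gru_weights / lstm_weights)
```

With these sizes the count is 84 weights for the GRU against 112 for the LSTM, which is the 25% saving listed in the comparison table. Some libraries (PyTorch's `nn.GRU`, for instance) write the final interpolation with the roles of $z_t$ and $1 - z_t$ swapped; the two forms are equivalent up to relabeling the gate.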