## Definition
The **Gated Recurrent Unit (GRU)** (Cho et al., 2014) is a simplified [[Long Short-Term Memory|LSTM]] with two gates instead of three and a single hidden state instead of separate hidden and cell states. It achieves performance comparable to LSTM on many tasks, with fewer parameters and faster computation.
## Architecture
### Update gate
$
z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
$
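Decides how much of the previous hidden state is replaced by the new candidate. With the update rule used in this note, $z_t$ near 1 overwrites the state with the candidate and $z_t$ near 0 retains the previous state. It combines the roles of LSTM's forget and input gates, since retaining ($1 - z_t$) and writing ($z_t$) are tied to one value.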
### Reset gate
$
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
$
Controls how much of the previous hidden state to use when computing the candidate.
### Candidate hidden state
$
\tilde h_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])
$
### Hidden state update
$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t
$
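The four equations above can be combined into a single forward step. The following is a minimal NumPy sketch, not a reference implementation: the function names (`gru_step`, `init_weights`, `run_gru`), the uniform initialisation, and the zero initial state are assumptions made for illustration, and bias terms are omitted to match the equations as written.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step. Each weight matrix has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                       # update gate
    r_t = sigmoid(W_r @ concat)                       # reset gate
    gated = np.concatenate([r_t * h_prev, x_t])       # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(W @ gated)                      # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde       # interpolate old and new

def init_weights(input_size, hidden_size, seed=0):
    """Hypothetical initialiser: three uniformly sampled weight matrices."""
    rng = np.random.default_rng(seed)
    shape = (hidden_size, hidden_size + input_size)
    scale = 1.0 / np.sqrt(hidden_size)
    return tuple(rng.uniform(-scale, scale, size=shape) for _ in range(3))

def run_gru(xs, hidden_size, weights):
    """Run the cell over a sequence of shape (steps, input_size)."""
    h = np.zeros(hidden_size)                         # assumed zero initial state
    for x_t in xs:
        h = gru_step(x_t, h, *weights)
    return h

weights = init_weights(input_size=3, hidden_size=4)
xs = np.random.default_rng(1).normal(size=(5, 3))     # 5 steps of 3 features
h_final = run_gru(xs, hidden_size=4, weights=weights)
print(h_final.shape)  # (4,)
```

Because each step is a convex combination of the previous state and a `tanh` output, the hidden state stays inside $(-1, 1)$ when it starts at zero.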
## GRU vs LSTM
| Property | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| Internal states | 2 (hidden + cell) | 1 (hidden) |
| Parameters | More (4 weight blocks) | ~25% fewer (3 weight blocks at the same hidden size) |
| Training speed | Slower per step | Faster |
| Long sequences | Slightly better on extremely long sequences | Comparable on most |
| Empirical performance | Slightly better on some tasks | Slightly better on others |
In practice, the two are interchangeable for most applications. Pick based on:
- **Parameter budget:** GRU.
- **Very long sequences (>1000 steps):** LSTM often slightly better.
- **Small datasets:** GRU often slightly better (fewer parameters → less overfitting).
- **Default starting point:** GRU for simplicity; try LSTM if GRU underperforms.
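The "~25% fewer" figure in the table follows from counting weight blocks: an LSTM layer has four (three gates plus the cell candidate), a GRU layer has three (two gates plus the candidate). The sketch below makes that arithmetic explicit. The function names are hypothetical, and it assumes one bias vector per block; frameworks may count biases differently, which changes the absolute numbers but not the 3:4 ratio.

```python
def block_params(input_size, hidden_size, bias=True):
    """Parameters in one weight block acting on [h_{t-1}, x_t]."""
    weights = hidden_size * (hidden_size + input_size)
    return weights + (hidden_size if bias else 0)

def lstm_params(input_size, hidden_size, bias=True):
    # Four blocks: forget, input, and output gates plus the cell candidate.
    return 4 * block_params(input_size, hidden_size, bias)

def gru_params(input_size, hidden_size, bias=True):
    # Three blocks: update and reset gates plus the candidate hidden state.
    return 3 * block_params(input_size, hidden_size, bias)

lstm = lstm_params(input_size=128, hidden_size=256)
gru = gru_params(input_size=128, hidden_size=256)
print(lstm, gru, 1 - gru / lstm)  # 394240 295680 0.25
```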
## Why GRU Works
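GRU works for the same reason as LSTM: the gated, additive state update lets information (and gradients) pass across time steps without being repeatedly multiplied by the recurrent weights and squashed by a nonlinearity. When $z_t$ is near 0, $h_t \approx h_{t-1}$, so the state is carried forward almost unchanged. The single update gate is enough to control both how much new information is added and how much old state decays; the second gate (reset) controls how much past context informs the new candidate.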
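A scalar toy calculation illustrates the difference. It is a deliberate simplification, not a full gradient derivation: the update gate is held at a constant value, only the direct $(1 - z)$ path from $h_0$ to $h_T$ is considered (gradients through the gates and the candidate are ignored), and the comparison is a one-dimensional plain RNN $h_t = \tanh(w \, h_{t-1})$. The function names are hypothetical.

```python
import math

def gated_direct_path(z, steps):
    """Factor multiplying h_0 in h_T along the (1 - z) path, with z constant."""
    return (1.0 - z) ** steps

def vanilla_rnn_gradient(w, h0, steps):
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule: w * tanh'(w * h_prev)
    return grad

steps = 100
print(gated_direct_path(z=0.001, steps=steps))
print(vanilla_rnn_gradient(w=0.9, h0=1.0, steps=steps))
```

With a nearly closed update gate, the direct path keeps roughly 90% of the signal after 100 steps ($0.999^{100} \approx 0.905$). The plain RNN gradient is bounded above by $0.9^{100} \approx 2.7 \times 10^{-5}$, because every step multiplies by $w$ and by a `tanh` derivative no larger than 1.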
## Empirical Studies
Chung et al. (2014) and Jozefowicz et al. (2015) compared LSTM and GRU directly, and Greff et al. (2017) compared a range of LSTM variants; the common conclusion is **no consistent, significant difference** on most tasks. Architecture details matter less than the gating mechanism itself.
## Modern Status
Like LSTM, GRU has been largely displaced by transformers for natural language processing. It is still seen in:
- Sequence-to-sequence models for small / resource-constrained settings.
- Time-series forecasting.
- Speech / audio processing with streaming constraints.
## Related
- [[Long Short-Term Memory]]
- [[Recurrent Neural Network]]
- [[Vanishing and Exploding Gradients]]