Monte Carlo Tree Search - Albert Masoliver's learning site

## Definition **Monte Carlo Tree Search (MCTS)** is a heuristic search algorithm for decision processes — particularly game playing. It builds a search tree asymmetrically, biased toward promising moves, by combining tree search with **random rollouts**. Underlies AlphaGo, AlphaZero, and many modern game-playing systems. ## The Four-Phase Loop Each iteration: 1. **Selection.** From the root, traverse the tree using a tree policy (typically UCT — see below) until reaching a leaf node. 2. **Expansion.** Add one or more new child nodes to the tree from the leaf. 3. **Simulation (rollout).** Play a random (or weakly heuristic) game from the new node to a terminal state. Record the outcome. 4. **Backpropagation.** Update statistics — visits, wins — along the path from the new node back to the root. After many iterations, choose the move with the most visits (or best win rate) at the root. ## UCT — Upper Confidence Bounds for Trees The tree policy uses UCT to balance exploration and exploitation: $ \text{UCT}(s) = \frac{w_s}{n_s} + c \sqrt{\frac{\ln N}{n_s}} $ - $w_s$ — wins through this node. - $n_s$ — visits to this node. - $N$ — visits to the parent. - $c$ — exploration constant (typically $\sqrt{2}$). The first term favours nodes with high empirical win rate; the second term favours under-explored nodes. ## Properties - **Anytime.** Stop after any number of iterations; the best move so far is always available. - **No evaluation function needed.** Random rollouts substitute. (Modern variants — AlphaZero — replace rollouts with a learned value network.) - **Asymmetric tree.** Promising branches grow deeper; weak branches stay shallow. - **Converges to minimax** in the limit of infinite simulations. ## Where MCTS Won - **Go.** Game tree too large for alpha-beta ($b \approx 250$). MCTS (Coulom 2006, MoGo, Crazy Stone) became state-of-the-art. - **General Game Playing.** No game-specific evaluation function known. - **Real-time strategy games.** Anytime property is critical. - **AlphaGo** (2016) and **AlphaZero** (2017) combine MCTS with deep neural networks: policy network biases selection, value network replaces rollouts. The combination defeated world champions and exceeded human-tuned engines in chess and shogi. ## Limits - **Slow on tactical positions.** A single critical move buried in random rollouts may be missed. Hybrid systems combine MCTS with alpha-beta tactical search. - **High variance** of rollouts adds noise. - **Doesn't naturally handle imperfect-information games.** Variants exist (Information Set MCTS) but are subtler. ## Beyond Games MCTS has been applied to planning under uncertainty, scheduling, automated theorem proving, and as a search component in LLM-driven reasoning systems. ## Related - [[Minimax Algorithm]] - [[Alpha-Beta Pruning]] - [[Evaluation Function]] - [[Multi-Armed Bandit]]