My main research topics include reinforcement learning, computer vision, deep learning, and machine learning. I want to share my notes and experiences from the learning process. I look forward to your attention; you are welcome to learn and make progress together with me!

This article is a rough translation of the Monte Carlo Tree Search – Beginners Guide, together with an explanation of its code. It is divided into two parts: [detailed principles] and [code practice].

1 Introduction

Monte Carlo tree search was first proposed in 2006 by Rémi Coulom for the Go-playing program Crazy Stone.

  • Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search

The general idea of Monte Carlo tree search is to select the best strategy/action for a given game state.

1.1 Finite two-person zero-sum sequential game

Monte Carlo tree search is actually a very widely applicable game framework. Here we apply it to finite two-person sequential zero-sum games. Games like Go, chess, and Tic-Tac-Toe are finite two-player sequential zero-sum games.

1.2 How to represent a game?

We use a game tree to represent a game: each node represents a state, and making a move from a node leads to one of its child nodes. The number of child nodes is called the branching factor. The root node represents the initial state, and terminal nodes have no children.

For the Tic-Tac-Toe game, this looks as follows:

  • Every game starts from the initial state, the root node of the tree. In Tic-Tac-Toe, the initial state is an empty board.
  • Moving from one node to one of its children is called a move.
  • The branching factor is the number of children of a node; in Tic-Tac-Toe, the deeper you go in the tree, the fewer empty cells remain, so the branching factor decreases.
  • Terminal nodes represent the end of the game.
  • A single game, played from the root node down to a terminal node, is called a playout.

From the perspective of play, you don’t have to worry about how you got to a node; you only care about making the best moves from there on.
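To make the representation concrete, here is a minimal sketch (not part of the original guide) of a Tic-Tac-Toe state and its child states in Python; the names initial_state, legal_moves, make_move and children are illustrative.

EMPTY, X, O = ".", "X", "O"

def initial_state():
    # root node of the game tree: an empty 3x3 board, with X to move
    return (tuple(EMPTY for _ in range(9)), X)

def legal_moves(state):
    board, _player = state
    return [i for i, cell in enumerate(board) if cell == EMPTY]

def make_move(state, move):
    board, player = state
    new_board = list(board)
    new_board[move] = player
    return (tuple(new_board), O if player == X else X)

def children(state):
    # branching factor = number of child nodes = number of empty cells left
    return [make_move(state, m) for m in legal_moves(state)]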

1.3 What is the best strategy? Minimax and alpha-beta pruning

What we want to find is the most promising next move. If you knew your opponent’s strategy you could solve for the best response directly, but most of the time you don’t, so we use minimax, which assumes the opponent is smart enough to play the best strategy every time.

Suppose A plays a game against B, in which A wants to maximize its own payoff. Since it is a zero-sum game, B wants A’s payoff to be as small as possible. The minimax algorithm can be described as follows:

$$v_A(s_i) = \max_{a_i} v_B(move(s_i, a_i)) \qquad v_B(s_i) = \min_{a_i} v_A(move(s_i, a_i)) \qquad v_A(\hat{s}) = eval(\hat{s})$$

  • $v_A$ and $v_B$ are the payoff functions of players A and B respectively.
  • $move$ maps the current state $s_i$ and the action taken $a_i$ to the next state.
  • $eval$ evaluates the final game state.
  • $\hat{s}$ is a final (terminal) game state.

Simply put, given a state $s_i$, we want to find the action $a_i$ that maximizes our own payoff, while the opponent acts to minimize it.
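As a quick illustration (not from the original guide), here is a minimal Python sketch of that recursion; it reuses the hypothetical Tic-Tac-Toe helpers sketched in section 1.2 plus assumed is_terminal and eval_state functions.

def minimax(state, maximizing):
    # direct translation of the recursion above; is_terminal and eval_state
    # are assumed game-specific helpers (eval_state scores from A's perspective)
    if is_terminal(state):
        return eval_state(state)
    values = [minimax(make_move(state, a), not maximizing)
              for a in legal_moves(state)]
    return max(values) if maximizing else min(values)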

The biggest weakness of the minimax algorithm is that it needs to expand the entire game tree, which makes it difficult to handle games with a high branching factor, such as Go and chess.

One solution to this problem is to expand the game tree only up to a certain depth threshold d. We then need an evaluation function that scores non-terminal nodes. This is fairly easy for humans, who can judge who is winning or losing from the board position. Computer approaches are described in:

  • Chess position evaluation with convolutional neural network in Julia

Another way to avoid expanding the whole tree is the alpha-beta pruning algorithm. It cuts off branches that cannot influence the final decision, so its result is the same as minimax while it searches a smaller space.
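Both fixes can be combined. Below is a minimal sketch of depth-limited minimax with alpha-beta pruning, again using the assumed helpers (is_terminal, eval_state, legal_moves, make_move); it is an illustration of the idea, not a reference implementation.

def alphabeta(state, depth, alpha, beta, maximizing):
    if depth == 0 or is_terminal(state):
        return eval_state(state)              # heuristic evaluation at the cutoff
    if maximizing:
        value = float("-inf")
        for a in legal_moves(state):
            value = max(value, alphabeta(make_move(state, a), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                 # prune: B will never allow this branch
                break
        return value
    else:
        value = float("inf")
        for a in legal_moves(state):
            value = min(value, alphabeta(make_move(state, a), depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:                 # prune: A will never allow this branch
                break
        return value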

2 Basic concepts of Monte Carlo tree search (MCTS)

Monte Carlo tree search estimates the optimal strategy through many simulations. The core is the search, which is a set of traversals down the game tree. A single traversal is a path from the root node to a node that is not fully expanded. A node is not fully expanded when it has at least one unvisited (unexplored) child. When such a node is reached, one of its unvisited children is selected and a single playout/simulation is run from it. The simulation result is then propagated back up to the root of the current tree, updating the statistics stored in the nodes along the path. When the search budget is exhausted, the move is chosen based on the gathered statistics.

Let’s take a look at the following key concepts:

  • What does it mean for a game tree node to be visited or not fully expanded?
  • What does traversing down mean during a search? How is the next child node selected?
  • What is a simulation?
  • What is backpropagation?
  • Which statistics of the expanded tree nodes are updated during backpropagation?
  • How is the final action chosen from the resulting strategy (game tree)?

2.1 Playout / Simulation

Playout/simulation is a single interaction with the game, moving from the current node all the way to a terminal node. Move selection during a simulation is based on the rollout policy function:

$$RolloutPolicy:\; s_i \to a_i$$

A rollout policy, also known as a fast rollout policy, takes a state $s_i$ and chooses an action $a_i$. To keep the simulation fast, moves are usually chosen from a uniform random distribution, as shown in the figure below.

Playout/Simulation in Alpha Zero

In AlphaGo Zero, DeepMind replaces the rollout with a residual CNN that directly outputs the position evaluation and the move probabilities.
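For the classic MCTS rollout described above (as opposed to the AlphaGo Zero network evaluation), a simulation can be as simple as playing uniformly random moves to the end of the game. A minimal sketch, assuming the state helpers from section 1.2 and an eval_state function returning e.g. +1/0/-1:

import random

def rollout_policy(state):
    # "fast rollout": choose a legal move uniformly at random
    return random.choice(legal_moves(state))

def playout(state):
    # play random moves from the given state until a terminal node is reached
    while not is_terminal(state):
        state = make_move(state, rollout_policy(state))
    return eval_state(state)   # e.g. +1 / 0 / -1 for win / draw / loss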


2.2 Expansion of game tree nodes – fully expanded and visited nodes

A node is visited if at least one simulation has started from it. A node is said to be fully expanded if all of its children have been visited; otherwise it is not fully expanded. As shown in the figure below:

In practice, at the start of the search none of the root’s children have been visited yet; one of them is selected and the first simulation starts from it.

Note that nodes chosen by the rollout policy during a simulation are not marked as visited; only the node from which the simulation started is marked as visited.

2.3 Backpropagation of simulation results

A simulation is run from a newly visited node (sometimes called a leaf node). When the simulation is complete, its result is propagated back up to the root of the current tree, and the node where the simulation started is marked as visited.

Backpropagation goes from the leaf (the node where the simulation started) up to the root. The statistics of every node on this path are updated.

2.4 Nodes’ statistics

Two values are updated for every node $v$ on the backpropagation path (including the node where the simulation started): the total simulation reward $Q(v)$ and the total number of visits $N(v)$.

  • $Q(v)$ – the total simulation reward of node $v$; in its simplest form it is the sum of the results of all simulations that passed through the node.
  • $N(v)$ – the total number of visits of node $v$, i.e. how many times it has been on a backpropagation path (and therefore how many times it contributed to the total simulation reward).
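As an illustration only (the field names are mine, not from the original code), the two statistics and their update during backpropagation might be stored like this:

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.q = 0.0   # Q(v): total simulation reward accumulated at this node
        self.n = 0     # N(v): how many times v was on a backpropagation path

def backpropagate(node, result):
    # walk from the node where the simulation started up to the root,
    # updating Q(v) and N(v) for every node on the path
    while node is not None:
        node.n += 1
        node.q += result
        node = node.parent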

2.5 Traversing the game tree

At the beginning of the search, the root’s unvisited children are selected first and simulated. The results are propagated back to the root, after which the root can be considered fully expanded.

To pick the node from which the next simulation will start, we traverse down from the root, considering at each step all children of the current node together with the statistics stored in the node itself, as shown in the figure below:

In the figure above, the current state (marked blue) is fully expanded: it has been visited and stores its statistics (total simulation reward and number of visits), and so do its children. These values are the ingredients of our last piece: the Upper Confidence Bound applied to Trees (UCT).

2.6 Upper confidence bound

UCT is the core function of Monte Carlo tree search; it is used to select the next node to traverse into:

$$UCT(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c \sqrt{\frac{\ln N(v)}{N(v_i)}}$$

where $v_i$ is a child of node $v$ and $c$ is the exploration constant.


During Monte Carlo tree search, the child with the largest UCT value is always chosen for traversal.

  The first term of UCT, $\frac{Q(v_i)}{N(v_i)}$, also known as the exploitation component, can be read as a win-ratio estimate for the child node (total reward / total visits = average reward per simulation).

This term alone seems convincing enough, since it simply favors moves with a high estimated win rate, so why not use it on its own? Because such a greedy search quickly concentrates on a handful of nodes, explores the rest of the tree inadequately, and can miss the optimal move. That is why UCT has a second term, called the exploration component. It favors nodes that have been explored little (those with a small $N(v_i)$). During Monte Carlo tree search, its value roughly follows the curve below, decreasing as the number of visits increases:
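A minimal sketch of the UCT computation and the resulting child selection, assuming the Node fields (q, n, children) from the sketch in section 2.4 and the exploration constant c = sqrt(2):

import math

def uct(child, parent, c=math.sqrt(2)):
    exploitation = child.q / child.n                           # average reward so far
    exploration = c * math.sqrt(math.log(parent.n) / child.n)  # favors rarely visited children
    return exploitation + exploration

def best_uct_child(parent):
    # during traversal of a fully expanded node, descend into the child
    # with the largest UCT value
    return max(parent.children, key=lambda ch: uct(ch, parent))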

2.6.1 UCT in Alpha Go and Alpha Zero

In the AlphaGo Lee and AlphaZero algorithms, the UCT formula becomes:

$$UCT(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c \, P(v_i \mid v) \frac{\sqrt{N(v)}}{1 + N(v_i)}$$

  $P(v_i \mid v)$ is the prior probability of the move, taken from the move distribution that the policy network outputs for the state; it is included to improve the efficiency of exploration.

Note that the value of the exploitation component changes with the perspective of the game: it is always evaluated from the point of view of the player who is to move at that node.
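A sketch of this AlphaGo/AlphaZero-style score, assuming each child node additionally stores its prior probability in a field p (an illustrative name, not DeepMind’s code):

import math

def puct(child, parent, c_puct=1.0):
    # exploitation: average value of the child from the current mover's perspective
    value = child.q / child.n if child.n > 0 else 0.0
    # exploration: scaled by the prior move probability P(v_i | v) from the policy network
    prior_bonus = c_puct * child.p * math.sqrt(parent.n) / (1 + child.n)
    return value + prior_bonus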

2.6.2 Policy networks in Alpha Go and Alpha Zero

In the AlphaGo algorithm, there are two policy networks:

  • SL Policy Network: a network trained with supervised learning on human expert game data.
  • RL Policy Network: an improved version of the SL Policy Network, trained with reinforcement learning through self-play.

Interestingly, in DeepMind’s Monte Carlo tree search variant, the SL Policy Network output is chosen for prior move probability estimation, as it performs better in practice (the authors suggest that human-based data is richer in exploratory moves). What is the purpose of the RL Policy Network then? The stronger RL Policy Network is used to generate a dataset of 30 million positions for training the Value Network (the one used for game state evaluation).

In AlphaZero there is only one network; it serves as both the value network and the policy network. It is trained entirely through self-play, starting from random initialization. A number of networks are trained in parallel, and at every checkpoint the best one, selected by evaluation against the current best network, is chosen to generate the training data.

2.7 Termination of MCTS

When should the MCTS procedure end? If you are actually playing, your “thinking time” is limited and your computational capacity has its boundaries too, so the safest approach is to run as many traversals as your resources allow.

When the MCTS procedure ends, the best move is usually the one leading to the child with the highest visit count, since a node can only be visited many times if its estimated value $\frac{Q(v)}{N(v)}$ is high.

When you pick an action with Monte Carlo tree search, that choice becomes part of the state your opponent sees, and the reverse holds for you. When your opponent has moved and it is your turn again, you can start your Monte Carlo tree search from the resulting state and reuse the statistics already gathered in that subtree, as sketched below.
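A small sketch of that reuse, based on the Node class sketched in section 2.4: after a move is played, the matching child becomes the new root and its statistics carry over.

def advance_root(root, new_state):
    # after a move is played (by either side), keep the matching subtree:
    # its Q/N statistics give the next search a warm start
    for child in root.children:
        if child.state == new_state:
            child.parent = None
            return child
    return Node(new_state)   # position never explored: start a fresh tree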

3 MCTS summary

def monte_carlo_tree_search(root):
    # keep searching while the time / computation budget allows
    while resources_left(time, computational_power):
        leaf = traverse(root)                     # leaf = unvisited (or terminal) node
        simulation_result = rollout(leaf)
        backpropagate(leaf, simulation_result)
    return best_child(root)

def traverse(node):
    # follow UCT through fully expanded nodes, then pick an unvisited child
    while fully_expanded(node):
        node = best_uct(node)
    return pick_unvisited(node.children) or node  # in case no children are present / node is terminal

def rollout(node):
    # simulate a full playout from the node down to a terminal state
    while non_terminal(node):
        node = rollout_policy(node)
    return result(node)

def rollout_policy(node):
    # fast rollout policy: pick a child uniformly at random
    return pick_random(node.children)

def backpropagate(node, result):
    # update statistics from the leaf up to the root
    if is_root(node):
        return
    node.stats = update_stats(node, result)
    backpropagate(node.parent, result)

def best_child(node):
    # the recommended move is the child with the highest visit count
    return max(node.children, key=lambda child: child.visits)
  • Python implementation of Monte Carlo tree search, based on the Tic-Tac-Toe game: Github.com/int8/monte-…

Reference:

  • int8.io/monte-carlo…
  • www.bilibili.com/video/av291…
  • blog.csdn.net/qq_16137569…
  • blog.csdn.net/guotong1988…