Companion code: https://github.com/int8/monte-carlo-tree-search
For a long time, the academic consensus was that a machine reaching the level of a human master at Go was far from realistic. It was considered a “holy grail” of AI, a milestone we were nowhere near reaching within the next decade. Deep Blue had its famous moment more than 20 years ago, yet since then no Go engine had come close to human grandmasters. The idea of Go’s “numerical chaos” became so popular that it even made its way into movies.
Amazingly, in March 2016 an algorithm called Alpha Go, developed by Google Deepmind, proved the doubters wrong by beating the South Korean world Go champion 4-1. Almost a year later, Alpha Go Zero, the successor to Alpha Go Lee, was reported to have beaten its predecessor 100-0, putting it far beyond the reach of human players.
The Alpha Go/Zero system is a great piece of engineering, assembled from a mixture of methods. The core components are:
- Monte Carlo tree search (with a variant using the PUCT function for tree traversal)
- Residual convolutional neural networks: policy and value networks used for game state evaluation and for estimating prior probabilities of moves
- Reinforcement learning, used to train the networks through self-play
In this article, we will focus only on Monte Carlo tree search.
Monte Carlo tree search was introduced by Remi Coulom in 2006 as part of Crazy Stone, an excellent Go game engine.
Monte Carlo tree search has one main purpose: given a game state, choose the most promising next move. In the rest of this article we will walk through the details of Monte Carlo tree search to understand it better. From time to time we will also refer to Alpha Go/Zero and explain which Monte Carlo tree search variants were applied in Deepmind’s AI.
A game, which is the framework or environment that Monte Carlo tree search operates in, is itself a very abstract, general term, so we focus here on a single type of game: finite two-person zero-sum sequential games. This sounds complicated at first, but it is actually quite simple. We can break it down as follows:
- A game means we are dealing with an interactive situation: there is interaction between the participants (one or more players)
- Finite implies that at any point in time there are only finitely many ways for the players to interact
- Two-person means the game has exactly two players
- Sequential means the players move in turn, each side making its moves alternately
- Finally, zero-sum means the two players have opposite goals; in other words, at any terminal state of the game, the sum of all players’ gains equals zero. Such games are sometimes called strictly competitive
Go, chess, and tic-tac-toe are obviously finite two-person zero-sum sequential games: there are two players involved, the number of moves available to each is always finite, and the game is strictly competitive, since the players have opposite goals (the sum of all payoffs in the game is zero).
Note that to simplify this tutorial we focus on this specific subset of scenarios. Monte Carlo tree search is a much broader tool and can be applied to scenarios beyond finite two-person zero-sum games; see the Monte Carlo tree search survey for the full picture.
Formally (check out this free book for an overview of game theory), a game can be represented by a set of basic mathematical entities, and a doctoral-level game theory textbook would define it exactly in those terms.
Such a definition is complicated, and a mathematician might enjoy digging into it.
From a computer programmer’s point of view, however, the formal definition is a bit confusing and hard to use in practice. Fortunately, we can use a well-known data structure to represent a game in a simplified form: a game tree.
A game tree is a tree data structure in which each node represents a particular state of the game. A transition from a node to one of its children (if it has any) corresponds to a move. The number of children of a node is called the branching factor. The root node of the tree represents the initial state of the game, while leaf nodes (nodes with no children) represent terminal states of the game.
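To make this concrete, here is a minimal sketch of a game tree node in Python (the language of the companion repository). The state object and its get_legal_moves / apply_move / is_game_over methods are hypothetical placeholders for whatever game representation you use; this is an illustrative sketch, not code from the linked repository.

```python
class GameTreeNode:
    """A node of a game tree: one particular state of the game."""

    def __init__(self, state, parent=None):
        self.state = state      # the game position this node represents
        self.parent = parent    # None for the root node
        self.children = []      # child nodes, one per legal move

    def expand_all(self):
        """Create a child for every legal move; the branching factor is len(self.children)."""
        # get_legal_moves() and apply_move() are assumed methods of the state object.
        for move in self.state.get_legal_moves():
            self.children.append(GameTreeNode(self.state.apply_move(move), parent=self))
        return self.children

    def is_terminal(self):
        """Leaf nodes of the full game tree are terminal game states."""
        return self.state.is_game_over()   # assumed method of the state object
```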
Let’s visualize part of the tic-tac-toe game tree:
- At the top is the root node of the tree, representing the initial state of the tic-tac-toe game: the empty board (marked green)
- Any transition from one node to another represents a move
- The branching factor of tic-tac-toe varies: it depends on the depth of the tree
- The game ends at a leaf node (marked red)
- Traversing the tree from the root node to a leaf node represents a single complete game
To limit the size of the game tree, only visited states are shown expanded; unexpanded nodes are marked gray.
A game tree is a recursive data structure, so after choosing the best move you end up in a child node, which is itself the root of its own subtree. You can therefore think of a game as a sequence of “most promising next move” problems, each represented by a game tree (each time with a different root node). In practice you usually do not need to remember the path that led to the current state, because it does not affect how the game continues.
So, once again: given a game state and the game tree it implies, we are trying to find the most promising next move. But what does that really mean?
There is no straightforward answer. For one thing, you cannot know your opponent’s strategy in advance; they may play well or badly. Take chess as an example: if you know your opponent is an amateur (mathematicians would say you know your opponent’s “mixed strategy”), you can choose simple strategies that lure them into traps and win quickly. But if you use the same strategy against a strong opponent, it can obviously backfire on you.
If you know nothing at all about your opponent, there is a worst-case strategy called minimax. It assumes your opponent plays their best possible game and then chooses your moves so as to maximize your payoff under that assumption. In a two-person finite zero-sum sequential game between A and B (where A wants to maximize their utility and B wants to minimize A’s utility), the minimax algorithm can be described by the following recursive formulas:
$$v_A(s_i) = \max_{a_i} v_B(\tau(s_i, a_i)), \qquad v_B(s_i) = \min_{a_i} v_A(\tau(s_i, a_i))$$
$$v_A(\hat{s}) = \mathrm{eval}(\hat{s}), \qquad v_B(\hat{s}) = -\mathrm{eval}(\hat{s})$$
where:
- $v_A$ and $v_B$ are the utility functions of player A and player B, respectively (utility = gain = payoff)
- $\tau$ is the move function, which produces the next game state (one of the current node’s children) from the current state and the action taken in it
- $\mathrm{eval}$ is the function that evaluates a terminal game state (at a leaf node)
- $\hat{s}$ is any terminal game state (a leaf node)
- the minus sign in the terminal-state equation reflects the fact that the game is zero-sum; don’t worry about it!
Simply put: given a state, you want to find the move that yields the highest payoff, assuming your opponent wants to minimize your payoff (and maximize their own); hence the name minimax. It sounds like a good idea: all we need to do is expand the entire game tree and back up the node values according to the rules given by our recursive formulas.
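As a rough sketch of that recursion in Python (assuming a hypothetical state interface with is_game_over, get_legal_moves and apply_move, and a hypothetical evaluate function returning player A’s utility at a terminal state), minimax might look like this:

```python
def minimax(state, a_to_move):
    """Return the minimax value of `state` from player A's point of view.

    `a_to_move` is True when it is A's turn (A maximizes A's utility) and
    False when it is B's turn (B minimizes A's utility, which in a zero-sum
    game is the same as maximizing B's own utility).
    """
    if state.is_game_over():                 # leaf node: eval(s-hat)
        return evaluate(state)               # hypothetical terminal evaluation

    child_values = (minimax(state.apply_move(move), not a_to_move)
                    for move in state.get_legal_moves())
    return max(child_values) if a_to_move else min(child_values)
```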
The game tree above illustrates the best move choice under the minimax algorithm. The white queen wants the game to end as dark (cold) as possible (payoff = pixel intensity), while the dark queen wants it to end as bright (warm) as possible. Every choice at every level is optimal in the minimax sense. The values of the bottom leaf nodes are known, so we can work backwards from there: the dark queen always picks the brightest color; then the white queen, pursuing her own best reward, picks the path leading to the darkest color; and so on, all the way up to the node representing the current state of the game. That is how the minimax algorithm works.
The biggest weakness of the minimax algorithm is that it requires expanding the entire game tree. For games with a high branching factor (such as Go or chess), this leads to an enormous game tree and the approach fails.
Is there a remedy for this?
There are. One approach is to expand the game tree only up to a certain depth threshold. But then we can no longer guarantee that the nodes at the threshold depth are leaf nodes, so we need a function that evaluates non-terminal game states. This is natural for humans: looking at a chess or Go position, you can often predict the winner even though the game is not over. For example, the following game could be declared decided well before it ends.
Another way of tackling the game tree size problem is to prune the tree with the alpha-beta pruning algorithm. Alpha-beta pruning is minimax strengthened: it traverses the tree in minimax fashion but avoids expanding some branches of the tree. In the best case its result is identical to that of plain minimax; alpha-beta pruning improves performance by shrinking the search space.
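For completeness, here is a minimal sketch of the same recursion with alpha-beta pruning (same hypothetical state interface and evaluate function as above); it returns the same value as plain minimax while skipping branches that cannot influence the result:

```python
def alphabeta(state, a_to_move, alpha=float("-inf"), beta=float("inf")):
    """Minimax with alpha-beta pruning: identical result, smaller search space."""
    if state.is_game_over():
        return evaluate(state)               # hypothetical terminal evaluation

    if a_to_move:                            # A maximizes
        value = float("-inf")
        for move in state.get_legal_moves():
            value = max(value, alphabeta(state.apply_move(move), False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                # B would never let the game reach this branch
                break
        return value
    else:                                    # B minimizes
        value = float("inf")
        for move in state.get_legal_moves():
            value = min(value, alphabeta(state.apply_move(move), True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value
```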
An intuitive introduction to the minimax and alpha-beta pruning algorithms can be found here. Minimax with alpha-beta pruning is a mature, well-established solution used in a variety of game engines today, such as Stockfish, which was (arguably) Alpha Zero’s main rival in a series of chess games held in late 2017.
In the Monte Carlo tree search algorithm (the subject of this article), the best move is computed in a different way. As the name suggests, particularly its Monte Carlo component, Monte Carlo tree search simulates games many times and then tries to predict the best move based on the simulation results.
The main concept in Monte Carlo tree search is the search itself. A search is a set of traversals down the game tree. A single traversal runs from the root node (the current game state) down to a node that is not fully expanded. A node is not fully expanded when at least one of its children is unvisited. When such a node is encountered, one of its unvisited children is chosen as the starting point of a single simulation. The simulation result is then propagated back up to the root of the current tree, updating the statistics stored in the game tree nodes. Once the search terminates (limited by time or computing power), the move is chosen based on the gathered statistics.
There are a lot of terms packed into this definition, and they probably don’t all connect yet, do they?
Let’s simplify the description above by asking a few key questions that will let us digest it piece by piece:
- What are expanded and not fully expanded game tree nodes?
- What does “walking down” during a search mean? How is the next (child) node selected?
- What is a simulation?
- What is backpropagation?
- What statistics are backpropagated and updated in the expanded game tree nodes?
- How is the final move chosen?
Pay attention, and we’ll figure it all out bit by bit.
Let’s start with the concept of simulation, since it does not depend much on the other terms. A simulation (or rollout) is a single act of playing the game: a sequence of moves that starts at the current node (representing a game state) and ends at a terminal node where the game result can be computed. A simulation is a randomly played-out game that yields an approximate evaluation of the node it starts from.
How are moves chosen during a simulation?
During a simulation, moves are chosen by a function called the rollout policy. This function takes a game state and returns the next move to play. In practice it is designed to be very fast, so that many simulations can be played out quickly; the default rollout policy is a uniformly random choice among the legal moves.
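A uniformly random rollout policy is deliberately cheap. As a sketch, again assuming the hypothetical get_legal_moves interface, it can be as small as:

```python
import random

def rollout_policy(state):
    """Default rollout policy: pick one of the legal moves uniformly at random."""
    return random.choice(list(state.get_legal_moves()))
```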
In Alpha Go Lee, the leaf node evaluation is a weighted sum of two components:
- a standard rollout evaluation using a custom fast rollout policy, which is a shallow softmax neural network with handcrafted features
- a position evaluation from a 13-layer convolutional neural network, called the value network, trained on 30 million distinct positions drawn from Alpha Go self-play games (no two positions coming from the same game)
In Alpha Zero, Deepmind’s engineers went a step further: they do not perform rollouts at all, but evaluate the current node directly with a 19-layer residual convolutional neural network. (Alpha Zero has a single network that outputs both the position evaluation and the move probability vector.)
In its simplest form, a simulation starts from a given game state, plays a sequence of random moves, and terminates. A simulation always produces an evaluation: for the games we consider here that is a win, a loss, or a draw, but in general any value is a legitimate simulation result.
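Putting the two together, a simulation is just a loop that applies the rollout policy until a terminal state is reached and then reports the outcome. A hedged sketch (the game_result attribute is an assumed part of the hypothetical state interface, e.g. +1 / 0 / -1 for win / draw / loss):

```python
def simulate(state):
    """Play one random game from `state` to the end and return its result."""
    while not state.is_game_over():
        state = state.apply_move(rollout_policy(state))
    return state.game_result   # assumed attribute: e.g. +1 win, 0 draw, -1 loss
```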
In Monte Carlo tree search, a simulation always starts at a node that has not been visited before; we will explain what a visited node means shortly.
How do humans think in games like Go or chess?
Given a root node and the rules of the game, the rest of the game tree is implied. We do not need to keep the whole tree in memory to traverse it; in its initial form it is not expanded at all. In the initial game state we stand at the root of the game tree, and the remaining nodes are unvisited. Whenever we consider a move, we imagine where it would take us in the game tree, thereby visiting the node that corresponds to it. Game states you never considered remain unvisited, and their potential remains undiscovered.
The nodes of a Monte Carlo tree search game tree can be divided in the same way: visited and unvisited. When is a node considered visited? A node is visited if at least one simulation has started from it, meaning it has been evaluated at least once. If all children of a node have been visited, the node is fully expanded; otherwise it is only partially expanded and can be expanded further.
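In code, the visited / fully expanded distinction usually comes down to a little bookkeeping on each node. Here is a sketch of such a node (the attribute names are illustrative, not taken from the original post or repository; the two statistics n and q are explained in the backpropagation section below):

```python
class MCTSNode:
    """Game tree node carrying the statistics Monte Carlo tree search needs."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried_moves = list(state.get_legal_moves())  # moves without a child node yet
        self.n = 0      # total number of visits
        self.q = 0.0    # total simulation reward propagated through this node

    def is_fully_expanded(self):
        """Fully expanded: every legal move already leads to a visited child."""
        return not self.untried_moves

    def expand(self):
        """Turn one untried move into a new (not yet visited) child node."""
        move = self.untried_moves.pop()
        child = MCTSNode(self.state.apply_move(move), parent=self)
        self.children.append(child)
        return child
```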
In practice, a search starts with all children of the root node unvisited; one of them is selected, and the first simulation begins immediately.
Note that the nodes the rollout policy passes through during a simulation are not marked as visited. They remain unvisited even though a rollout ran through them; only the node where the simulation starts is marked as visited.
Once the simulation of a freshly visited node (sometimes called a leaf node) is complete, its result is ready to be propagated back up to the root node of the current game tree. The node where the simulation started is then marked as visited.
Backpropagation is the traversal from the leaf node (where the simulation started) back up to the root node. The simulation result is carried along this path, and specific statistics are computed or updated for every node on it. Backpropagation guarantees that each node’s statistics reflect the results of all simulations started in its descendants (since results are propagated all the way up to the root of the game tree).
The purpose of backpropagating the simulation result is to update the total simulation reward and the total number of visits for every node on the backpropagation path, including the node where the simulation started.
- Total simulation reward, an attribute of a node: the sum of the results of all simulations that passed through that node.
- Total number of visits, another attribute of a node: the number of backpropagation paths that have passed through it (and therefore how many times it has contributed to the total simulation reward).
Every visited node maintains these two values, and once a certain number of simulations have been completed, the visited nodes store information about how explored and how exploited they are.
In other words, looking at the statistics of any node, these two values reflect how promising the node is (total simulation reward) and how intensively it has been explored (visit count). Nodes with high rewards are good candidates for exploitation, but nodes with few visits may be good candidates too, because they have not been explored well yet.
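A backpropagation step is then just a walk up the parent pointers, updating those two statistics. A minimal sketch using the node fields introduced above (the sign flip is one common way of handling the alternating player perspective discussed further below):

```python
def backpropagate(node, result):
    """Propagate a simulation result from the node where the simulation started
    back up to the root, updating visit counts and total rewards on the way."""
    while node is not None:
        node.n += 1        # one more backpropagation path passes through this node
        node.q += result   # accumulate the simulation outcome
        result = -result   # zero-sum game: the result flips sign for the other player's nodes
        node = node.parent
```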
One difficulty remains. How do we get from the root node down to an unvisited node where a simulation can start?
At the very beginning of the search, no simulations have been run yet, so unvisited nodes are chosen first: a simulation is started from each of them in turn, the results are propagated back to the root node, and the root node is then considered fully expanded.
But what do we do next? How do we now get from a fully expanded node down to unvisited nodes? We have to pass through a layer of already-visited nodes, and so far we have no good way of choosing how to proceed.
To pick the next node on our path through fully expanded nodes and start the next simulation, we need to consider the information stored in all of the node’s children, as well as in the node itself. Let’s think about what information is available:
Our current node (marked blue) is fully expanded, so it has been visited and stores its node statistics, the total simulation reward and the total number of visits, and so do its children. These values are the components of our last missing piece: the Upper Confidence Bound applied to trees, or UCT.
UCT is a function that lets us select the next node to traverse from a list of already visited nodes. It is the core function of Monte Carlo tree search.
During Monte Carlo tree search traversal, the node maximizing UCT is chosen as the next node. Let’s see what the UCT function looks like:
$$\mathrm{UCT}(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c \sqrt{\frac{\ln N(v)}{N(v_i)}}$$
First, the function is defined for $v_i$, a child of node $v$. It is the sum of two components. The first component, $\frac{Q(v_i)}{N(v_i)}$, also called the exploitation component, can be read as an estimated win rate: the total simulation reward divided by the total number of visits, which estimates how promising the node is. That alone already looks reasonable: eventually we would traverse towards the children with the highest win rate.
So why can’t we use the exploitation component alone? Because we would then only keep exploring the nodes that happened to produce a winning simulation early in the search, ignoring the winning potential of nodes we never gave a chance.
An example:
Suppose we run a Monte Carlo tree search using only the exploitation component of UCT. Starting from the root, we run one simulation for every child, and in all subsequent steps we visit only the children whose simulation ended in at least one win. The nodes whose first simulation happened to end in a loss (remember that in practice the default rollout policy is random) are immediately abandoned, with no chance to improve.
Therefore we need the second UCT component, called the exploration component. It favors unexplored nodes, those that have been visited relatively rarely (those with low $N(v_i)$). Looking at the shape of the exploration component, we see that it decreases as the number of visits to a node grows, giving rarely visited nodes a higher chance of being selected and steering the search towards exploring them.
Finally, the parameter $c$ in the UCT formula controls the balance between the exploitation and exploration components in the Monte Carlo tree search.
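In code, picking the next node during traversal is a single argmax over the children’s UCT values. Here is a sketch using the n and q statistics from the node sketch above (c is the exploration parameter; sqrt(2) is a common default, used here as an assumption rather than a value taken from the original):

```python
import math

def best_child(node, c=1.41):
    """Select the child with the highest UCT value.

    Exploitation term: q / n, the child's average simulation reward.
    Exploration term:  c * sqrt(ln(parent visits) / child visits).
    """
    def uct(child):
        return child.q / child.n + c * math.sqrt(math.log(node.n) / child.n)

    return max(node.children, key=uct)
```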
An important remark about the UCT function: in a competitive game, its exploitation component is always computed relative to the player who moves into the node. This means that while traversing the game tree, the perspective changes with the node being traversed: for any two consecutive nodes, the perspectives are opposite.
In both Alpha Go Lee and Alpha Zero, tree traversal follows the nodes maximizing the following UCT variant:
$$\mathrm{UCT}(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c \, P(v_i, v) \frac{\sqrt{N(v)}}{1 + N(v_i)}$$
Here $P(v_i, v)$ is the prior probability of the move (the transition from $v$ to $v_i$), and its value comes from the output of a deep neural network called the policy network. The policy network computes a probability distribution over possible moves given a game state (note that this is distinct from the fast rollout policy). The purpose of this prior is to narrow the search down to reasonable moves; adding it to the exploration component steers exploration towards sensible moves.
Alpha Go uses two kinds of policy networks:
- an SL (supervised learning) policy network, trained in a supervised fashion on a dataset of human games
- an RL (reinforcement learning) policy network, a strengthened version of the SL policy network with the same architecture, trained further through reinforcement learning (self-play)
Interestingly, in Deepmind’s Monte Carlo tree search variant it is the supervised learning policy network that performs better in practice, so its output is used to estimate the prior probabilities of moves (the authors argue that human-derived data is richer in exploratory moves). What, then, is the reinforcement learning policy network for? The stronger RL policy network was used to generate the 30-million-position dataset on which the value network (used for game state evaluation) was trained.
In Alpha Zero, on the other hand, there is only one network, serving as both the value network and the policy network. It is trained entirely through self-play, starting from a random initialization. Several networks are trained in parallel, and at each checkpoint the best one, evaluated against the current best network, is selected to generate the training data.
Now that we know nearly all the pieces of a successful Monte Carlo tree search, a few questions remain. First of all, when do we stop the procedure? The answer is: it depends on the context. If you are building a game engine, your “thinking time” is likely limited, and so is your computing power. The safest bet is therefore to run Monte Carlo tree search for as long as your resources allow.
Once the Monte Carlo tree search loop finishes, the best move is usually the one with the highest number of visits, since its value has been estimated most reliably (a move whose value is estimated to be high must also have been visited often).
After you have chosen a move using Monte Carlo tree search, the node you picked becomes the game state from which your opponent moves. Once they have chosen their move, you start Monte Carlo tree search again from the game state represented by the node your opponent chose. Some statistics from previous rounds of Monte Carlo tree search may already exist in the branch you are now considering, which leads to the idea of reusing them rather than building a new game tree from scratch. In fact, that is exactly what the creators of Alpha Go and Alpha Zero do.
With all the details covered, let’s revisit our initial simplified description of Monte Carlo tree search and express it as pseudocode:
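The pseudocode from the original post is not reproduced here; instead, here is a rough sketch of the loop that reuses the hypothetical helpers from the earlier snippets (MCTSNode, best_child, simulate, backpropagate). The resources object standing in for the time or iteration budget is also an assumption:

```python
def monte_carlo_tree_search(root, resources):
    """Run MCTS from `root` until the resource budget (time / iterations) is spent."""
    while resources.left():                  # hypothetical budget object
        leaf = traverse(root)                # selection + expansion
        result = simulate(leaf.state)        # rollout from the newly visited node
        backpropagate(leaf, result)          # update statistics up to the root
    # final move choice: the child of the root that was visited most often
    return max(root.children, key=lambda child: child.n)

def traverse(node):
    """Walk down through fully expanded nodes, then expand one unvisited child."""
    while node.is_fully_expanded() and node.children:
        node = best_child(node)              # pick the child maximizing UCT
    if not node.state.is_game_over():
        return node.expand()                 # an unvisited child to start the simulation from
    return node                              # terminal node: evaluate it directly
```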
As you can see, this reduces to a very small set of functions, and these functions apply to any game, not just Go or chess. You can find an example Monte Carlo tree search implementation for tic-tac-toe here.
The code is open to everyone. I hope you enjoyed reading this and that it helps you understand Monte Carlo tree search; writing it certainly helped me understand it better.
The next step (probably material for a future blog post) is to try implementing a simple chess engine from Alpha-Go-like components: value and policy networks plus Monte Carlo tree search (some of this is still a work in progress; you can check it out here).