AlphaGo main logic
AI = Policy Value Network + MCTS
Policy Value Network
Policy network: given the current state as input, the neural network outputs the probability of taking each possible action in that state
Value network: the value of the current position = an estimate of the final game outcome
MCTS, Monte Carlo tree search
The input is a raw (prior) policy; through MCTS we obtain an improved policy as output
Self-play games are played out through MCTS, and the resulting data is used to update the policy network
MCTS node definition
Node definition (class TreeNode, used to build the tree)
Each tree node records its own Q value, its prior probability P, and u, the visit-count-adjusted exploration term (the second term of the UCT/PUCT score); a minimal sketch follows
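A minimal sketch of such a node, assuming the usual PUCT form u = c_puct * P * sqrt(N_parent) / (1 + N); the attribute names (_parent, _children, _n_visits) and the c_puct argument are illustrative assumptions, not taken from the original notes:

import numpy as np

class TreeNode(object):
    """A node in the MCTS tree: records Q, prior P, visit count, and u."""

    def __init__(self, parent, prior_p):
        self._parent = parent
        self._children = {}        # map: action -> TreeNode
        self._n_visits = 0
        self._Q = 0.0              # mean action value
        self._u = 0.0              # exploration bonus (second term of the UCT/PUCT score)
        self._P = prior_p          # prior probability from the policy network

    def get_value(self, c_puct):
        # Score = Q + u; u grows with the parent's visit count and
        # shrinks as this node itself is visited more often.
        self._u = (c_puct * self._P *
                   np.sqrt(self._parent._n_visits) / (1 + self._n_visits))
        return self._Q + self._u

    def update(self, leaf_value):
        # Keep Q as the running mean of the evaluations seen so far.
        self._n_visits += 1
        self._Q += (leaf_value - self._Q) / self._n_visits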
MCTS tree creation and use
Tree structure: the tree defines the solution space; each path from the root node to a leaf node corresponds to one candidate solution
Monte Carlo method: MCTS does not need labeled samples prepared in advance; random sampling drives the search, and observations are obtained through repeated random simulation experiments
Evaluation (loss) function: provides quantifiable, deterministic feedback on how good or bad a solution is => through random simulation, MCTS approximates the “real function” that the loss function represents
Backpropagation (backup): each time the evaluation of one path is obtained, it is propagated back (backup) to update all nodes along that path
Heuristic search strategy: the algorithm searches the whole solution space heuristically, following the principle of loss minimization, until it finds an optimal set of solutions or terminates early (a minimal playout sketch follows)
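A minimal sketch of one playout following these steps (select, expand/evaluate, backup), reusing the TreeNode sketch above; policy_value_fn returning (action, prob) pairs plus a leaf value, and the board's do_move method, are assumptions, and end-of-game handling is omitted:

import copy
import numpy as np

class MCTS(object):
    """A minimal Monte Carlo Tree Search driven by a policy-value function."""

    def __init__(self, policy_value_fn, c_puct=5, n_playout=2000):
        self._root = TreeNode(None, 1.0)
        self._policy = policy_value_fn     # state -> ([(action, prob), ...], leaf_value)
        self._c_puct = c_puct
        self._n_playout = n_playout

    def _playout(self, state):
        node = self._root
        # 1. Selection: walk down the tree, always taking the child with max Q + u
        while node._children:
            action, node = max(node._children.items(),
                               key=lambda item: item[1].get_value(self._c_puct))
            state.do_move(action)
        # 2. Expansion + evaluation: query the policy-value function at the leaf
        action_probs, leaf_value = self._policy(state)
        for action, prob in action_probs:
            node._children[action] = TreeNode(node, prob)
        # 3. Backup: propagate the evaluation back along the path,
        #    flipping the sign at each level (two-player zero-sum game)
        while node is not None:
            node.update(-leaf_value)
            leaf_value = -leaf_value
            node = node._parent

    def get_move_probs(self, state, temp=1e-3):
        # Run n_playout simulations on copies of the state, then turn the root
        # children's visit counts into move probabilities (softmax with temperature)
        for _ in range(self._n_playout):
            self._playout(copy.deepcopy(state))
        acts, visits = zip(*[(a, n._n_visits)
                             for a, n in self._root._children.items()])
        logits = np.log(np.array(visits) + 1e-10) / temp
        probs = np.exp(logits - logits.max())
        return acts, probs / probs.sum()

    def update_with_move(self, last_move):
        # Reuse the subtree below the chosen move, or rebuild the tree (-1)
        if last_move in self._root._children:
            self._root = self._root._children[last_move]
            self._root._parent = None
        else:
            self._root = TreeNode(None, 1.0)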
MCTS based AI Player
import numpy as np

class MCTSPlayer(object):
    """AI player based on MCTS."""

    def __init__(self, policy_value_function,
                 c_puct=5, n_playout=2000, is_selfplay=0):
        # Use MCTS for search
        self.mcts = MCTS(policy_value_function, c_puct, n_playout)
        self._is_selfplay = is_selfplay

    # Set player index
    def set_player_ind(self, p):
        self.player = p

    # Get the AI's move
    def get_action(self, board, temp=1e-3, return_prob=0):
        # Get all available positions
        sensible_moves = board.availables
        # MCTS returns the pi vector, as in the AlphaGo Zero paper
        move_probs = np.zeros(board.width * board.height)
        if len(sensible_moves) > 0:
            acts, probs = self.mcts.get_move_probs(board, temp)
            move_probs[list(acts)] = probs
            if self._is_selfplay:
                # Add Dirichlet noise for exploration (needed for self-play training)
                move = np.random.choice(
                    acts,
                    p=0.75*probs + 0.25*np.random.dirichlet(0.3*np.ones(len(probs))))
                # Update the root node and reuse the search tree
                self.mcts.update_with_move(move)
            else:
                # With the default temp=1e-3, this is almost equivalent to
                # choosing the move with the highest probability
                move = np.random.choice(acts, p=probs)
                # Reset the root node
                self.mcts.update_with_move(-1)
            if return_prob:
                return move, move_probs
            else:
                return move
        else:
            print("WARNING: the board is full")
Policy Value Network Implementation
Implementation details of Policy Value Network:
Definition of Neural Network Architecture (PyTorch)
In the training step, gradients must be cleared with zero_grad() before backpropagation
Definition of the loss (loss = value_loss + policy_loss)
Getting, saving, and loading the neural network parameters (a PyTorch sketch of these points follows)
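A minimal PyTorch sketch covering these points; the layer sizes, the 4 input feature planes, and the helper name train_step are illustrative assumptions rather than the exact architecture from the notes:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Shared conv body with a policy head and a value head."""

    def __init__(self, board_width, board_height):
        super().__init__()
        self.board_size = board_width * board_height
        # Common body: the input planes encode the board state
        self.conv1 = nn.Conv2d(4, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # Policy head: one probability per board position
        self.policy_conv = nn.Conv2d(64, 4, kernel_size=1)
        self.policy_fc = nn.Linear(4 * self.board_size, self.board_size)
        # Value head: a scalar evaluation of the position in [-1, 1]
        self.value_conv = nn.Conv2d(64, 2, kernel_size=1)
        self.value_fc1 = nn.Linear(2 * self.board_size, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        p = F.relu(self.policy_conv(x)).flatten(1)
        log_p = F.log_softmax(self.policy_fc(p), dim=1)
        v = F.relu(self.value_conv(x)).flatten(1)
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(v))))
        return log_p, v


def train_step(net, optimizer, state_batch, mcts_probs, winner_batch):
    # Clear accumulated gradients before backpropagation
    optimizer.zero_grad()
    log_p, v = net(state_batch)
    # loss = value_loss + policy_loss
    value_loss = F.mse_loss(v.view(-1), winner_batch)
    policy_loss = -torch.mean(torch.sum(mcts_probs * log_p, dim=1))
    loss = value_loss + policy_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Saving / loading parameters (path is an example):
# torch.save(net.state_dict(), 'current_policy.model')
# net.load_state_dict(torch.load('current_policy.model'))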
AI main process:
Collect self-play data through MCTS
Update the Policy Value Network with the self-play data
Evaluate the win rate of the current Policy Value Network
Judge the performance of the current model and save the best model so far (a sketch of this loop follows)
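A hedged sketch of this main loop; the pipeline object and the method names collect_selfplay_data, policy_update, policy_evaluate, and save_model are assumed placeholders for the steps listed above:

def run_training(pipeline, game_batch_num=1500, check_freq=50):
    """Outer training loop: self-play -> network update -> evaluation."""
    best_win_ratio = 0.0
    for i in range(game_batch_num):
        # 1. Collect self-play data through MCTS
        pipeline.collect_selfplay_data()
        # 2. Update the policy value network with the self-play data
        if len(pipeline.data_buffer) > pipeline.batch_size:
            pipeline.policy_update()
        # 3. Periodically evaluate the current network's win rate
        if (i + 1) % check_freq == 0:
            win_ratio = pipeline.policy_evaluate()
            # 4. Keep the best model seen so far
            if win_ratio > best_win_ratio:
                best_win_ratio = win_ratio
                pipeline.policy_value_net.save_model('best_policy.model')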
Reinforcement learning
Reinforcement learning does not need labels on the training data, but it does require feedback (reward or penalty) from the environment for every action => through this feedback, the agent's behavior is continuously adjusted
Two families of reinforcement learning methods: policy-based (e.g. Policy Gradients) and value-based (e.g. Q-Learning)
Policy-based: directly predicts the action to take in the current environment state
Value-based: predicts the expected value (Q value) of every action in the current state, and executes the action with the highest Q value
Value-based methods suit small, discrete action spaces; policy-based methods suit environments with many action types or continuous action values (a minimal Q-Learning update is sketched below)
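As a value-based example, a minimal tabular Q-Learning update; the table size, learning rate alpha, and discount factor gamma are illustrative assumptions:

import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-Learning step: move Q(s, a) toward the bootstrapped target."""
    target = reward + gamma * np.max(Q[next_state])   # best achievable value in next state
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

# Example: a tiny table with 5 states and 2 actions
Q = np.zeros((5, 2))
Q = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=2)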
Man-machine game
Reinforcement learning and recommendation systems
Reinforcement learning:
It is a branch of machine learning (alongside supervised and unsupervised learning).
Unlike other learning methods, reinforcement learning learns the agent's mapping from environment states to actions, with the goal of maximizing reward.
If a decision made by the agent receives a positive reward, the agent's tendency to choose that behavior again in the future is strengthened
Reinforcement learning is the form of learning closest to learning in nature
Combined with deep learning, reinforcement learning can generalize over massive data (e.g. DeepMind's AlphaGo)
Build feedback mechanisms between agents and the environment
Earlier approaches were mostly based on supervised learning; lacking an effective ability to explore, such systems tend to keep pushing items (goods, shops, questions, etc.) that consumers have already been shown
Reinforcement learning can effectively model the interaction process between the consumer and the system and maximize the accumulated benefit over that process, which works well in business scenarios
Search scenario:
In e-commerce, a user's browsing and purchasing behavior can be regarded as a Markov process; modeling it as a Markov decision process enables a reinforcement-learning-based ranking decision model, which makes search more intelligent
On Tmall Double 11, search ranking metrics improved by 20% through reinforcement learning
Ctrip introduced reinforcement learning into hotel search ranking; predicting unknown situations requires a certain amount of “random exploration”, since only then can actual user feedback be observed
The short-term cost of random exploration cannot be completely avoided, but the ultimate goal is for the gains to more than compensate for that cost
Recommended scenarios:
Reinforcement learning and adaptive online learning are used to build a decision engine; through continuous learning and model optimization, it analyzes massive user behavior and item features in real time and helps users quickly find the items they like
An item can be an article, a product, etc.
Taobao's “Guess You Like” introduced reinforcement learning to help each user quickly find products they like; it improves the matching efficiency between people and products and lifts the effectiveness metrics by 10%-20%
Intelligent customer service:
Taking the intelligent customer service bot as the agent, the agent's decisions are driven not by the immediate gain at a single node but by a relatively long-term interaction process
The interaction between the consumer and the platform is treated as a Markov decision process, and reinforcement learning is used to build a feedback system for the interaction between the consumer and the system
System decisions are based on maximizing the benefit of the whole process => enabling dynamic interaction between the system and its users
How to define the episode, reward, state, and action is the key:
Episode: for example, in the flight ticket booking scenario, when the user talks with the system and the system determines for the first time that the user's intent is “buy a flight ticket”, an episode begins; the episode ends when the user buys the ticket or quits the session
Reward: collected from user feedback, such as placing an order or exiting the session
State: the user question embedding, the slot state, and historical slot information are extracted and fed into a fully connected neural network, which ends with a Softmax layer over the actions
Action: in the flight ticket booking scenario, the action space is discrete, consisting mainly of asking the user back about each slot and placing the order, e.g. asking about the time, origin, or destination, and ordering (a minimal network sketch follows)
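A minimal sketch of such a dialogue policy network; the embedding size, slot count, and action count are purely illustrative assumptions:

import torch
import torch.nn as nn

class DialoguePolicyNet(nn.Module):
    """Maps the dialogue state to a distribution over actions (ask a slot / place the order)."""

    def __init__(self, embed_dim=128, num_slots=3, num_actions=4):
        super().__init__()
        # State = user question embedding + current/historical slot fill status
        self.fc = nn.Sequential(
            nn.Linear(embed_dim + num_slots, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, question_embedding, slot_state):
        x = torch.cat([question_embedding, slot_state], dim=-1)
        # Softmax over actions: ask about time / origin / destination, or place the order
        return torch.softmax(self.fc(x), dim=-1)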
Advertising system:
If advertisers can bid separately based on the value of each piece of traffic, they can bid higher on high-value traffic and lower on ordinary traffic, achieving a better ROI
At the same time, the platform can also improve the efficiency of matching between advertising and visitors
Through reinforcement learning, intelligent bidding and pricing can be applied: for each visiting user, the system decides how to adjust the price according to the user's current state, shows them specific ads, and guides their state in the desired direction
On Tmall Double 11, CTR, CPM and GMV have all increased significantly