What is reinforcement learning
Reinforcement Learning:
One of the three main branches of machine learning: supervised learning, unsupervised learning, and reinforcement learning
The idea of reinforcement learning is similar to how people learn: through practice
For example, when learning to walk: if we fall, the brain gives a negative reward value => the walking posture was bad; if the next step succeeds, the brain gives a positive reward value => that was a good step
Different from supervised learning: there are no labeled output values prepared with the training data. Reinforcement learning only has a reward value, and unlike a supervised output label it is not given in advance but arrives afterwards (e.g., only after walking do we learn whether we fell)
Different from unsupervised learning: unsupervised learning has neither output values nor reward values, only data features, while reinforcement learning has a reward value (a negative value means punishment). In addition, in unsupervised and supervised learning the data samples are independent, whereas in reinforcement learning the data are sequential and dependent
Reinforcement Learning:
It can be applied to different fields: neuroscience, psychology, computer science, engineering, mathematics, economics and so on
Characteristics of reinforcement learning:
No supervised data, only a reward signal
The reward signal does not have to arrive in real time; it can be delayed, even by a long while
Time (sequence) is an important factor
Current behavior affects subsequent received data
Reinforcement learning has a wide range of applications: game AI, recommendation systems, robot simulation, investment management, power plant control
Basic concepts (agent, environment, action, state, reward, policy, state transition probability)
Basic Concepts:
Agent, the role of the learner or decision-maker
Environment, everything outside the Agent, which the Agent interacts with
Action, the Agent's behavior
State, the information the Agent obtains from the environment
Reward, the feedback given by the environment for an action
Policy, the function by which the Agent chooses its next action based on the state
State transition probability, the probability that the Agent enters the next state after taking an action
Four important elements: state, action, policy, reward
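A minimal sketch in Python of how these concepts fit together, assuming a toy 1-D environment (the goal position, the reward of 1, and the helper names reset/step/random_policy are all illustrative, not from the original notes):

```python
import random

def reset():
    """Environment: returns the initial state."""
    return 0

def step(state, action):
    """Environment: state transition plus reward; the goal is position 5."""
    next_state = max(0, min(5, state + action))
    reward = 1 if next_state == 5 else 0          # reward signal from the environment
    done = next_state == 5
    return next_state, reward, done

def random_policy(state):
    """Policy: maps the current state to an action (here chosen at random)."""
    return random.choice([-1, +1])

state, total_reward = reset(), 0
for t in range(100):                              # one episode of Agent-Environment interaction
    action = random_policy(state)                 # Agent acts according to its policy
    state, reward, done = step(state, action)     # Environment returns next state and reward
    total_reward += reward
    if done:
        break
print("episode return:", total_reward)
```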
What is the goal of RL
RL considers the interaction between Agent and Environment
The goal is to find an optimal policy so that the Agent obtains as much reward from the environment as possible
For example, in a racing game, the scene is the environment, the car is the Agent, the position of the car is the state, the operation of the car is the action, how the car is operated is the policy, and the score of the race is the reward
In many cases the Agent cannot obtain complete information about the environment, and instead represents it through an Observation, i.e., only the information in its immediate surroundings
Markov decision process MDP
Markov Decision Process (MDP): MDP is a kind of search problem under uncertainty. The goal is to find a feasible state-transition path (a sequence of states and actions) that, combined with the reward function, maximizes the reward
It is called a Markov decision process because state transitions depend only on the current state and action:
Decisions are made whenever an action must be selected based on the state and the potential rewards
The uncertainty is reflected in the transition function: when action A is performed in state S, the resulting state S' is uncertain, and the reward R is also uncertain
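A small sketch of this uncertainty, using a hypothetical two-state problem (the states s0/s1, the action a, and all probabilities and rewards are made up for illustration):

```python
import random

# P[(s, a)] = list of (next_state, probability): the transition function
P = {
    ("s0", "a"): [("s0", 0.3), ("s1", 0.7)],
    ("s1", "a"): [("s0", 0.6), ("s1", 0.4)],
}
# R[(s, a, s')] = reward received for that particular transition
R = {
    ("s0", "a", "s0"): 0.0, ("s0", "a", "s1"): 1.0,
    ("s1", "a", "s0"): -1.0, ("s1", "a", "s1"): 0.5,
}

def sample_transition(state, action):
    """Performing action A in state S gives an uncertain S' and R."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[(state, action, next_state)]

print(sample_transition("s0", "a"))   # e.g. ('s1', 1.0) -- the outcome is random
```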
The Markov state
Consider a sequence of random variables X1, X2, ..., Xn with a current state, past states, and future states
Given the current state, the future and the past are independent of each other; that is, the probability distribution of the system state at time t+1 depends only on the state at time t and has nothing to do with the states before t
The state transition from time t to time t+1 is independent of the value of t
A Markov chain model can be expressed as a triple (S, P, Q)
S is the set of all possible states of the system (also known as the state space)
P is the state transition matrix
Q is the initial probability distribution of the system
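A sketch of the triple (S, P, Q) in code, using a made-up three-state weather chain purely for illustration:

```python
import random

S = ["sunny", "cloudy", "rainy"]          # state space
P = [                                     # P[i][j] = Pr(next = S[j] | current = S[i])
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]
Q = [0.5, 0.3, 0.2]                       # initial probability distribution

def simulate(n_steps):
    """Sample a path; each step depends only on the current state (Markov property)."""
    i = random.choices(range(len(S)), weights=Q)[0]
    path = [S[i]]
    for _ in range(n_steps):
        i = random.choices(range(len(S)), weights=P[i])[0]
        path.append(S[i])
    return path

print(simulate(5))
```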
Agent classification (value-based, policy-based, actor-critic)
Reinforcement learning Agent:
Value-based reinforcement learning
Guides policy formation by learning a value function (e.g., the ε-greedy selection rule; see the sketch after this list)
Policy-based reinforcement learning
No value function; the policy is learned directly
Actor-Critic reinforcement learning, which combines the policy gradient with a value function
A method that learns both a value function and a policy
Actor-Critic is analogous to an actor performing while a critic evaluates the performance and gives feedback, so the actor keeps getting better
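A minimal sketch of the ε-greedy rule mentioned for value-based agents, assuming a hypothetical table of action-value estimates for a single state:

```python
import random

Q = {"left": 0.2, "right": 0.8, "stay": 0.5}   # estimated action values (made up)

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore a random action with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # exploration
    return max(q_values, key=q_values.get)      # exploitation: highest estimated value

counts = {a: 0 for a in Q}
for _ in range(1000):
    counts[epsilon_greedy(Q)] += 1
print(counts)   # "right" dominates, but every action is still tried occasionally
```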
What is a policy network
Policy network:
In any game, the player's input is considered an action A, and each input (action) leads to a different output, called the state S of the game
This gives a list of different state-action pairs
The mapping from states to actions defined by these pairs is what the policy network represents
For example, entering A1 in a game results in state S1 (move up), and entering A2 results in state S2 (move down)
S: state set
A: Action set
R: reward distribution, given a (state, action) pair
P: state transition probability, the probability distribution over the next state for a given (state, action) pair
γ: discount factor, a precaution to keep the cumulative reward R from growing to infinity => an infinite reward would ignore the difference between the different actions the agent takes
π*: the optimal policy
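A minimal sketch of a policy as a parameterised mapping from state S to a probability distribution over actions A, using a hand-rolled linear-softmax model rather than a real neural network (the two actions, the features, and the weights are all invented for illustration):

```python
import math
import random

ACTIONS = ["up", "down"]
# one weight vector per action; a state is a 2-dimensional feature vector
weights = {"up": [0.5, -0.2], "down": [-0.3, 0.4]}

def policy(state):
    """pi(a | s): softmax over one linear score per action."""
    scores = {a: sum(w * x for w, x in zip(weights[a], state)) for a in ACTIONS}
    max_score = max(scores.values())                      # for numerical stability
    exps = {a: math.exp(s - max_score) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def sample_action(state):
    """Draw action A1 or A2 from pi(a | s)."""
    probs = policy(state)
    return random.choices(list(probs), weights=list(probs.values()))[0]

state = [1.0, 0.5]
print(policy(state))         # e.g. {'up': 0.62, 'down': 0.38}
print(sample_action(state))  # one action sampled from that distribution
```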
What is the value network
Value network (numerical network):
The value network assigns a score (numerical value) to the states in the game by calculating the expected cumulative reward for the current state S; every state is passed through the value network
States that lead to more reward receive a higher value in the value network
The score here is the expected value of the reward, and the best state is chosen from the set of states
V: the value expectation (expected cumulative reward)
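A short sketch of what this score means, assuming the usual discounted-return definition (the reward sequences below are invented for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ...  (gamma < 1 keeps the sum finite)."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Monte Carlo estimate of V(s): average the returns of several episodes
# that all start from the same state s
episodes_from_s = [
    [0, 0, 1],        # reward sequences observed after visiting s
    [0, 1, 0, 1],
    [1],
]
V_s = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)
print(round(V_s, 3))   # the expected cumulative reward assigned to state s
```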
Principles of MCTS (selection, extension, simulation, return)
MCTS principle:
Each node represents a board position, labelled A/B: the node has been visited B times, and Black has won A of those visits
The following process is repeated over and over:
Step 1, Selection: go down from the root node, each time selecting the "most valuable child node", until reaching a node that still has unexpanded child nodes, i.e., a position with unexplored follow-up moves, such as the 3/3 node (see the selection-rule sketch after these steps)
Step 2, Expansion: add a 0/0 child node to this node, corresponding to one of the "unexpanded child nodes" mentioned above
Step 3, Simulation: play out to the end of the game with a rollout policy to obtain a result (think: why not use AlphaGo's policy-value network to play the game out instead?)
Step 4, Backup: add the simulation result to all of its ancestors; for example, if the simulation result is 0/1, add 0/1 to every ancestor node
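The notes do not spell out how the "most valuable child node" is chosen; a common choice, shown here as an assumption, is the UCT rule, which balances the win rate A/B against how rarely a child has been visited:

```python
import math

def uct_score(wins, visits, parent_visits, c=1.4):
    """Upper Confidence Bound for Trees: exploitation term + exploration term."""
    if visits == 0:
        return float("inf")            # unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# children recorded as (wins A, visits B); the parent has been visited 10 times
children = [(3, 3), (1, 4), (0, 3)]
parent_visits = 10
best = max(children, key=lambda ab: uct_score(ab[0], ab[1], parent_visits))
print(best)   # (3, 3): high win rate with a moderate visit count wins out
```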
MCTS (Monte Carlo Tree Search)
Monte Carlo tree search combines the generality of random simulation with the accuracy of tree search
MCTS is a search algorithm that uses various techniques to effectively reduce the search space. Each iteration of MCTS starts from a partially expanded search tree and ends with the same tree plus one more expanded node
The role of MCTS is to predict the outcome through simulation; in theory it can be used in any domain defined by {state, action}
Main steps:
Selection: start from the root node and, following a certain strategy, descend to a leaf node
Expansion: add one or more legal child nodes to the leaf node
Simulation: from the child node, play out in a random manner (which is why it is called Monte Carlo); simulating to a final state yields the score of the current simulation
Backup: update the visit count and score of the current child node according to the simulation results, and propagate the visit count and score back to all of its ancestor nodes, updating them as well (see the sketch below)
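A compact sketch of all four steps on a toy subtraction game (remove 1 or 2 stones; whoever takes the last stone wins). The game, the Node class, and the constants are illustrative, not taken from the notes:

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None):
        self.stones, self.player = stones, player     # player = side to move next
        self.parent, self.children = parent, []
        self.wins, self.visits = 0, 0                 # the A/B counters described above

    def untried_moves(self):
        tried = {c.stones for c in self.children}
        return [m for m in (1, 2) if m <= self.stones and self.stones - m not in tried]

def uct(node, c=1.4):
    return node.wins / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def rollout(stones, player):
    """Random playout to the end; returns the winning player (0 or 1)."""
    while stones > 0:
        stones -= random.choice([m for m in (1, 2) if m <= stones])
        player = 1 - player
    return 1 - player                                 # the player who took the last stone

def mcts(root_stones, root_player, n_iter=2000):
    root = Node(root_stones, root_player)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal
        while node.stones > 0 and not node.untried_moves():
            node = max(node.children, key=uct)
        # 2. Expansion: add one unexpanded child (a fresh 0/0 node)
        if node.stones > 0:
            move = random.choice(node.untried_moves())
            node = Node(node.stones - move, 1 - node.player, parent=node)
            node.parent.children.append(node)
        # 3. Simulation: random rollout from the new node to the end of the game
        winner = rollout(node.stones, node.player) if node.stones > 0 else 1 - node.player
        # 4. Backup: propagate visits and wins to every ancestor
        while node is not None:
            node.visits += 1
            if winner != node.player:   # a win from the perspective of the player who moved here
                node.wins += 1
            node = node.parent
    best = max(root.children, key=lambda c: c.visits)
    return root_stones - best.stones                  # recommended number of stones to take

print(mcts(5, 0))   # usually prints 2: leaving a multiple of 3 is the winning move
```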
AlphaGo main logic
AI = Policy Value Network + MCTS
Policy Value Network
Policy network: given the current state as input, the neural network outputs the probability of taking each action in that state
Value network: the value of the current position = an estimate of the final outcome of the game
MCTS, Monte Carlo tree search
Given an ordinary policy as input, MCTS produces an improved policy as output
Self-play is carried out through MCTS, and the results are used to update the policy network (see the sketch below)
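A very rough sketch of this loop, heavily stubbed and not the real AlphaGo implementation: an MCTS stand-in turns the network's prior into visit counts, and the normalized counts become the improved-policy targets that a real system would train the network toward:

```python
import random

ACTIONS = [0, 1, 2, 3]

def policy_value_net(state):
    """Stand-in network: a uniform move prior and a neutral value estimate."""
    prior = [1.0 / len(ACTIONS)] * len(ACTIONS)
    return prior, 0.0

def mcts_policy(state, n_sim=100):
    """Stand-in for network-guided MCTS: returns normalized visit counts.
    In the real system these counts come from tree search and form a
    stronger policy than the raw network prior."""
    prior, _ = policy_value_net(state)
    visits = [0] * len(ACTIONS)
    for _ in range(n_sim):
        a = random.choices(ACTIONS, weights=prior)[0]   # placeholder for the search
        visits[a] += 1
    total = sum(visits)
    return [v / total for v in visits]

def self_play(n_moves=5):
    """Play one game against itself; record (state, improved policy) pairs."""
    examples, state = [], 0
    for _ in range(n_moves):
        pi = mcts_policy(state)                  # MCTS output: the improved policy
        examples.append((state, pi))             # training target for the network
        state += random.choices(ACTIONS, weights=pi)[0]   # toy "game" dynamics
    return examples

data = self_play()
print(len(data), "(state, pi) examples; a real system would now take gradient",
      "steps so the network output moves toward these MCTS probabilities")
```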