Abstract: This post attempts to explain reinforcement learning in an easy-to-understand form, without a single formula.
This article was shared by Yanghuaili in the Huawei Cloud community.
Machine learning can be roughly divided into three research fields: supervised learning, unsupervised learning, and reinforcement learning (RL). Supervised learning is the most familiar of the three; tasks we encounter all the time, such as image classification, face recognition, and regression prediction, belong to it. In short, supervised learning builds a model from given input-label pairs in order to predict the labels of new inputs. Unsupervised learning, as the name implies, needs no labels during training (often because labeling is too costly or the labeling criteria are ambiguous, so labels simply cannot be obtained); the most typical scenario is clustering. Reinforcement learning is quite different from both. It learns a policy through interaction between an agent and an environment, so that acting according to the policy yields the maximum expected return from that interaction. If that sentence has not yet given you a clear picture of reinforcement learning, the next section tries to explain it in an easy-to-understand way, and this article will not include a single formula.
Suppose you travel back to the Three Kingdoms period and become one of the generals serving under Zhuge Liang, the prime minister of the State of Shu. The prime minister gives you an army of 10,000 and sets a goal: attack the cities of the State of Wei and seize as many of them as possible, ideally the Wei capital. Then you wonder: how can you win a battle when you have never really fought one? Use supervised learning? Then you would need plenty of rich examples of real combat to learn from, as well as military treatises to read, but the battlefield changes so quickly that no treatise can cover every situation. If the enemy refuses to play by the book, is all that study not useless? What about unsupervised learning? At this thought you give a wry smile; you might as well go back to reading the treatises. So you frown and scratch your head, your boundless patriotism with nowhere to go. Then it hits you: the only way is reinforcement learning! So you unroll a scroll and organize your thoughts…
- Environment: the situation you face on the battlefield, such as the surrounding terrain, the enemy's location, the size of the enemy army, who commands it, your own position, and so on; this is the information you use to make decisions. Conversely, your decisions also change the environment: if you decide to advance 1 km, the enemy will react and change their position accordingly; if you decide to burn the forest blocking your view, the terrain information changes, and so on;
- Agent: that is, you;
- Actions: the decisions you take in a given situation, such as advancing or burning the trees that block your view;
- Action Space: the space of all the actions you can take, which can be continuous, discrete, or a mix of both. Continuous actions include how far to advance, in which direction, and so on; discrete actions include whether to attack, camp, defend, or retreat, how many groups to split the army into, and whether to charge head-on, outflank from both sides, or lay an ambush. In short, the action space is the set of all possible decisions you can make in the environment;
- Policy: the probabilities with which you take each action in a given situation. This one is a little harder to grasp. For example, Sima Yi's policy in war was quite different from Zhang Fei's: the prime minister's Empty City Stratagem succeeded because Sima Yi's policy was cautious, whereas Zhang Fei, facing the same empty city, might well have charged straight in to capture the prime minister; that is a different policy. Another example: if Zhang San studies Sun Tzu's Art of War while Li Si studies the Book of Wumu, they will act differently in the same battlefield situation. A policy is usually denoted by π;
- State: the specific situation of the environment at a given moment or stage. In the Empty City Stratagem, the prime minister faced the enemy general Sima Yi leading 150,000 troops while he himself held a city with only 2,500 soldiers; it is in this state that the prime minister took the Empty City Stratagem action. A state is usually denoted by S;
- State Transition Probability: the probability that, after a given action is taken in a given state, the environment moves to each possible next state. When the prime minister, facing Sima Yi's offensive, took the Empty City Stratagem, which state came next depended mainly on the enemy general Sima Yi (in this setting he is part of the environment). Sima Yi's possible responses included attacking, sending scouts, besieging without attacking, and retreating; in the end he retreated, and the situation became "Sima Yi withdraws". Sima Yi's cautious character made retreat highly likely, but it did not rule out the other actions; had he besieged the city without attacking, the prime minister would have faced a different state;
- Reward: a quantitative measure of the benefit obtained by taking a certain action in a certain state. In the Empty City Stratagem our side was heavily outnumbered, so the more of our troops saved, the greater the reward. The actions the prime minister could have taken included going out of the city to meet the enemy, closing the gates and defending, and the Empty City Stratagem. Going out to meet the enemy would likely have ended in total annihilation, a reward of nearly zero; closing the gates and defending might have held out for a while, a slightly higher reward; the Empty City Stratagem had a high probability of saving the whole army, so that is the action the prime minister took;
- Sequential Decision Problems: problems that care about the total return over many rounds of decisions, in which the return of any single decision is diluted. The Empty City Stratagem is a special case where a single round of decision settles everything, but on a real battlefield you must decide in real time according to the enemy's movements in order to reach the ultimate goal of defeating them. An example of accepting a small single-step return for a larger long-term one is using part of the army as a decoy and sacrificing it for the ultimate payoff of wiping out the enemy. The Chinese Workers' and Peasants' Red Army's sixteen-character formula, "when the enemy advances, we retreat; when the enemy camps, we harass; when the enemy tires, we attack; when the enemy retreats, we pursue", likewise guided sequential decisions in war. A small sketch of how all these pieces fit together follows this list.
Having summed up these concepts, you grow excited at the thought that war really can be handled with reinforcement learning. But these are only the surrounding concepts; how do you actually do reinforcement learning? This brings us to two important quantities: the Q value and the V value.
- V value: the expected sum of rewards an agent collects from a given state all the way to the final state. For example, in war both sides always fight over strategic positions that are easy to defend and hard to attack; occupying such a position puts you in a state favorable to the whole campaign, so the cumulative reward you can expect by the end of the war is larger. In other words, the V value of the state "holding the strategic position" is large, while the V value of other states is relatively small. How do both sides know that the expected sum of rewards is larger in this state? In a game we could replay the scenario over and over, running countless rollouts from that state, each unfolding with different probabilities, until the war ends, and average the results to obtain the V value of that state (see the sketch after this list). In reality such experiments are not allowed; both sides know it only because history offers so many similar cases that no further trials are needed. Of course, the V value also depends on the agent: in the same state, different agents follow different policies and therefore have different V values (you and the prime minister, starting from the same state, would end up with very different results).
- Q value: the expected sum of rewards an agent collects after taking a certain action in a certain state, all the way to the final state. For example, when the prime minister adopts the Empty City Stratagem in the face of the current situation, its Q value is the expected sum of rewards from the moment he uses the stratagem until the end of the war.
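Here is a minimal sketch of the "replay the scenario countless times" idea mentioned above, i.e. estimating V by averaging simulated rollouts. The simulator with reset_to and step methods and the policy function are hypothetical stand-ins, not part of any real library.

```python
def estimate_v(simulator, policy, state, num_rollouts=10_000):
    """Estimate V(state) by averaging the return of many simulated rollouts."""
    total = 0.0
    for _ in range(num_rollouts):
        s = simulator.reset_to(state)            # replay the war from this exact state
        episode_return, done = 0.0, False
        while not done:
            a = policy(s)                        # act according to the policy
            s, reward, done = simulator.step(a)  # the environment responds
            episode_return += reward
        total += episode_return
    return total / num_rollouts                  # the average return approximates V(state)
```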
The Q value and the V value can be computed from each other. If the V value of every state is known, then to compute the Q value of action a in state S we also need the state transition probabilities. Suppose that when the prime minister adopts the Empty City Stratagem, the possible next states and their probabilities are: (1) Sima Yi attacks, probability 0.1; (2) Sima Yi besieges without attacking, probability 0.2; (3) Sima Yi withdraws, probability 0.7. Then the prime minister's Q value for the Empty City Stratagem is the reward of the stratagem plus the probability-weighted sum of the V values of those three states. Conversely, if the Q value of every state-action pair is known, then to compute the V value we also need the probabilities with which the policy chooses each action in that state. For example, to compute the V value of the state just before the Empty City Stratagem, suppose the prime minister could take three actions: (1) go out of the city to meet the enemy, probability 0.1; (2) defend the city against the enemy, probability 0.4; (3) use the Empty City Stratagem, probability 0.5. Since the Q values are already known, the V value is the probability-weighted sum of the Q values of these three actions.
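Both conversions are just weighted sums, so they fit in a few lines. The transition and policy probabilities below are the ones quoted above; the reward and the V and Q numbers are invented placeholders, purely for illustration.

```python
# Q(s, a) = r(s, a) + sum over next states s' of P(s' | s, a) * V(s')
transition_probs = {"Sima Yi attacks": 0.1,
                    "besieges without attacking": 0.2,
                    "withdraws": 0.7}
v_next = {"Sima Yi attacks": -50.0,           # placeholder V values
          "besieges without attacking": 10.0,
          "withdraws": 100.0}
reward_empty_city = 0.0                       # placeholder immediate reward

q_empty_city = reward_empty_city + sum(
    p * v_next[s] for s, p in transition_probs.items())

# V(s) = sum over actions a of pi(a | s) * Q(s, a)
policy_probs = {"meet the enemy": 0.1, "defend the city": 0.4, "empty city": 0.5}
q_values = {"meet the enemy": -80.0,          # placeholder Q values
            "defend the city": 20.0,
            "empty city": q_empty_city}

v_state = sum(policy_probs[a] * q_values[a] for a in policy_probs)
print("Q(empty city):", q_empty_city, " V(state):", v_state)
```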
This blog post outlines some of the concepts associated with reinforcement learning. Feel free to point out any mistakes in the comments section.