Author: Joshua Greaves | Translation: Liu Chang, paulristelhueber Mian
This article distills the most important content of the classic book Reinforcement Learning: An Introduction. It introduces the basic concepts and principles of reinforcement learning so that readers can move on to implementing the latest models as quickly as possible. After all, reinforcement learning (RL) is a very useful tool for any machine learning practitioner, especially given AlphaGo’s reputation.
In the first part, we take a detailed look at Markov decision processes (MDPs) and the main components of the reinforcement learning framework. In the second part, we build up the theory of value functions and the Bellman equation, the most important formula in reinforcement learning, deriving and explaining it step by step to take the mystery out of reinforcement learning.
Of course, this article only aims to give you the quickest and most intuitive path to the theory behind reinforcement learning. To deepen your understanding of the topic, Reinforcement Learning: An Introduction by Sutton and Barto is definitely worth reading. In addition, the ten-lecture reinforcement learning course on YouTube taught by David Silver, the researcher behind AlphaGo, is also worth your careful study.
Supervised learning vs. evaluative learning
For many problems of interest, the supervised learning paradigm does not give us the flexibility we need. The main difference between supervised learning and reinforcement learning is whether the feedback received is instructive or evaluative. Instructive feedback tells you how to achieve your goal, while evaluative feedback only tells you how well you did. Supervised learning solves problems based on instructive feedback, while reinforcement learning solves problems based on evaluative feedback. Image classification is a practical example of supervised learning with instructive feedback: when the algorithm tries to classify a particular piece of data, the instructive feedback tells it the true category. Evaluative feedback, by contrast, only tells you how far you are from your goal. If you trained a classifier with evaluative feedback, your classifier might say “I think this is a hamster” and receive 50 points. Without any context, however, we have no idea what those 50 points mean. We would need to try other classifications and see whether 50 points indicates an accurate or an inaccurate guess; perhaps 10,000 is a better score, so we cannot judge how well we did until we have tried classifying other examples.
Guessing a hamster earns two gold stars and a smiley face, while guessing a gerbil earns a silver star and a thumbs-up
For many of the problems we are interested in, evaluative feedback is far more intuitive and readily available. For example, imagine a system that controls the temperature of a data center. Instructive feedback doesn’t seem to be of any use here: how would you tell your algorithm the correct setting for every component at every time step? This is where evaluative feedback comes in handy. You can easily measure how much electricity is being used at a particular time, what the average temperature is, or even how many machines are overheating. This is in fact how Google uses reinforcement learning to tackle these problems. Let’s dive in.
The Markov decision process
A state S has the Markov property if, given S, the future is conditionally independent of the past. In other words, S summarizes everything about the history that matters for what comes next. If that sounds abstract, an example makes it easier. Consider a ball flying through the air. If its state consists of its position and velocity, that is enough to describe where it is now and where it will be next (ignoring detailed physical models and outside influences), so the state is Markovian. But if we only knew the ball’s position and not its velocity, the state would no longer be Markov: the current state would not summarize all previous states, and we would need information from earlier points in time to build a suitable model of the ball.
Reinforcement learning problems can be modeled as a Markov decision process, or MDP. An MDP can be drawn as a directed graph whose nodes are states and whose edges describe the transitions between Markov states. Here is a simple example:
A simple Markov decision process
This MDP models the process of learning about MDPs. You start in the “don’t understand” state, from which two actions are available: study or don’t study. If you choose not to study, there is a 100% chance that you return to the “don’t understand” state. If you choose to study, there is only a 20% chance that you end up back where you started, and an 80% chance that you reach the “understand” state.
(In reality, I’d like to think the chance of reaching the “understand” state is higher than 80%.) The core of an MDP really is that simple: in a state you can choose among a set of actions, and each action gives you a probability distribution over the states you may transition to. In the case of the “don’t study” action, the transition happens to be deterministic. The sketch below shows one way to represent this MDP in code.
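Here is a minimal sketch of the study MDP as a plain Python data structure. The state and action names are illustrative and not tied to any library:

```python
import random

# transitions[state][action] -> list of (probability, next_state)
transitions = {
    "dont_understand": {
        "dont_study": [(1.0, "dont_understand")],
        "study":      [(0.2, "dont_understand"), (0.8, "understand")],
    },
    "understand": {
        "study": [(1.0, "understand")],
    },
}

def step(state, action):
    """Sample the next state from the transition distribution."""
    r, cumulative = random.random(), 0.0
    for prob, next_state in transitions[state][action]:
        cumulative += prob
        if r < cumulative:
            return next_state
    return next_state  # guard against floating-point rounding

print(step("dont_understand", "study"))  # prints "understand" about 80% of the time
```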
The goal of reinforcement learning is to learn how to spend more time in more valuable states. To know which states are more valuable, the MDP needs to provide more information.
You don’t need an MDP to tell you to eat when you’re hungry, but a reinforcement learning agent does
This MDP adds a reward mechanism: each transition into a state yields a reward. In this example, you receive a negative reward if the next state is hungry, and an even more negative reward if the next state is starving. If you end up full, you receive a positive reward. Now that our MDP is fully specified, we can start thinking about how to act so as to collect the highest achievable reward.
Since this MDP is so simple, it is easy to see how to stay in the higher-reward region: eat whenever we are hungry. There is not much else to choose when we are full in this model, but we will inevitably become hungry again, and the best choice is then to eat right away. The problems reinforcement learning actually cares about involve much larger and more complex MDPs, and we often do not know the best strategy until we actually start exploring. The snippet below extends the earlier sketch with rewards.
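As a rough sketch (the state, action, and reward values below are made up and only mirror the spirit of the figure), rewards can be attached to each transition alongside its probability:

```python
# transitions[state][action] -> list of (probability, next_state, reward)
# The numeric rewards are illustrative, not taken from the article.
transitions = {
    "hungry": {
        "eat":      [(1.0, "full", +1.0)],
        "dont_eat": [(1.0, "starving", -10.0)],
    },
    "full": {
        "dont_eat": [(1.0, "hungry", -1.0)],
    },
    "starving": {
        "eat":      [(1.0, "hungry", -1.0)],
    },
}
```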
Formalizing the reinforcement learning problem
Now that we have most of the building blocks we need, we should turn to the terminology of reinforcement learning. The most important components are the agent and the environment. The agent is the thing we control indirectly, and it exists within the environment. Looking back at our Markov decision diagrams, the agent chooses which action to take in a given state, which has a significant effect on the states it sees. However, the agent does not have complete control over the dynamics of the environment; the environment receives these actions and returns the new states and rewards.
This diagram from Sutton and Barto’s book Reinforcement Learning: An Introduction (highly recommended) is a great illustration of the interaction between the agent and the environment. At some time step t, the agent is in state s_t and takes action a_t. The environment then returns a new state s_{t+1} and a reward r_{t+1}. The reward carries the subscript t+1 because it is returned by the environment together with the state s_{t+1} at time t+1, so it makes sense to keep the two indices in step (see the figure above). The loop below sketches this interaction in code.
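The agent-environment loop can be sketched as follows. The `env` and `agent` objects are hypothetical placeholders (not any specific library’s API); they simply mirror the diagram: the agent picks an action from the current state, and the environment returns the next state and a reward:

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of agent-environment interaction, mirroring the diagram."""
    state = env.reset()                          # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                # a_t, chosen from s_t
        state, reward, done = env.step(action)   # s_{t+1}, r_{t+1}
        total_reward += reward
        if done:                                 # a terminal state ends the episode
            break
    return total_reward
```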
Now that we have a framework for the reinforcement learning problem, we are ready to learn how to maximize the return. In the next section, we will learn about state value functions and action value functions, as well as the Bellman equation, which lays the foundation for reinforcement learning algorithms; after that we will explore some simple and effective dynamic programming solutions.
Rewards and returns
As mentioned earlier, a reinforcement learning agent learns how to maximize its cumulative future reward. The term for the cumulative future reward is the return, usually denoted G, with a subscript t indicating the return from time step t. In mathematical notation it is written as follows:
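In Sutton and Barto’s notation, with rewards R and a final time step T, the (undiscounted) return is:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$$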
If we let this series go on forever, the return might be infinite, which would make the problem ill-defined. So this equation only makes sense if we expect the series of rewards to end. Tasks that always terminate are called episodic tasks. Card games are a good example of an episodic problem: an episode begins with the cards being dealt and inevitably ends according to the rules of the game. Then another episode begins with the next round, and the cards are dealt again.
Rather than the cumulative future reward, it is more common to use the cumulative discounted future reward:
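In standard notation, with discount factor γ, the discounted return is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$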
where 0 < γ < 1. Defining the return this way has two advantages: the infinite sum can converge to a finite value, and nearer rewards are weighted more heavily, meaning we care more about imminent rewards than about rewards far in the future. The smaller γ is, the more pronounced this is. Consider the special cases where γ equals 0 or 1. When γ = 1, we are back to the first equation, where we care about all rewards equally, no matter how far in the future they arrive. When γ = 0, we care only about the immediate reward and ignore every future reward. This would make our algorithm extremely short-sighted: it would learn to take the action that is best right now without considering its effect on the future. The small helper below makes the effect of γ concrete.
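As an illustration (the reward sequence is made up), a few lines of Python show how γ trades off near rewards against distant ones:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 10.0]          # a large reward arriving three steps later
print(discounted_return(rewards, 1.0))   # 10.0 -> the future counts fully
print(discounted_return(rewards, 0.9))   # 7.29 -> the future is discounted
print(discounted_return(rewards, 0.0))   # 0.0  -> only the immediate reward matters
```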
Policies
A policy, written π(s, a), describes a way of acting. It is a function that takes a state and an action and returns the probability of taking that action in that state. Therefore, for a given state s, it must satisfy $\sum_a \pi(s, a) = 1$. In the example below, when we are hungry we can choose between two actions, eating or not eating.
Our policy should describe how to act in every state, so an equal-probability random policy would look like the one below, where E stands for eating and Ē for not eating. It means that if you are hungry, you are equally likely to choose to eat or not to eat.
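Written out explicitly (one way to express the random policy just described), this is simply:

$$\pi(\text{hungry}, E) = 0.5, \qquad \pi(\text{hungry}, \bar{E}) = 0.5$$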
Our goal in reinforcement learning is to learn an optimal policy, written π*, which tells us how to act in each state so as to maximize the return. This is only a simple example, and it is easy to see that the optimal policy here is to eat whenever you are hungry. In this example, as in many MDPs (Markov decision processes), the optimal policy is deterministic: there is one best action for every state. Sometimes this is written as
π*(s) = a, a mapping from states to the optimal action in those states.
The value function
We use value functions to find the optimal policy. There are two types of value function in reinforcement learning: the state value function, written V(s), and the action value function, written Q(s, a).
The state value function describes the value of a state when following a policy. It is the expected return when starting from state s and acting according to our policy π:
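In the standard notation this is (numbered (1), since the derivation below refers back to it):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right] \tag{1}$$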
It is important to note that even in the same environment, the value function changes depending on the policy. This is because the value of a state depends on how you act, since the way you act in a particular state affects the return you can expect to see. Also note the importance of the expectation (an expectation is like an average: it is the return you expect to see). The reason we use an expectation is that there is randomness in what happens once you reach a state. You may have a stochastic policy, which means we need to combine the results of all the different actions we might take. The transition function can also be stochastic, that is, we may not end up in any particular state with 100% probability. Remember the example above: when you select an action, the environment returns the next state, and there may be several possible next states even for a single action. We will see more of this in the Bellman equation. The expectation takes all of this randomness into account.
The other value function we will use is the action value function. It describes the value of taking a particular action in a particular state while following a policy: the expected return given the state and the action, under policy π:
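Again in the standard notation (numbered (2), since the appendix derivation starts from it):

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\; A_t = a \,\right] \tag{2}$$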
The same remarks about the state value function apply to the action value function: the expectation accounts for the randomness of future actions under the policy, as well as the randomness of the states the environment returns.
The Bellman equation
Richard Bellman was an American applied mathematician who derived the following equations, which allow us to begin solving MDPs. The Bellman equation is everywhere in reinforcement learning, and understanding it is necessary for understanding how reinforcement learning algorithms work. But before we get to the Bellman equation, we need a couple more useful pieces of notation. We define P and R as follows:
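In Sutton and Barto’s notation, these are the transition probability and the expected reward:

$$\mathcal{P}^{a}_{s s'} = \Pr\!\left( S_{t+1} = s' \mid S_t = s,\; A_t = a \right)$$

$$\mathcal{R}^{a}_{s s'} = \mathbb{E}\!\left[\, R_{t+1} \mid S_t = s,\; A_t = a,\; S_{t+1} = s' \,\right]$$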
P is the transition probability: if we start in state s and take action a, it gives the probability of ending up in state s'. R is another way of saying that if we start in state s, take action a, and move into state s', we receive this expected (or average) reward.
Finally, armed with this knowledge, we are ready to derive the Bellman equation. We will start with the Bellman equation for the state value function. Using the definition of the return, we can rewrite formula (1) as follows:
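Substituting the discounted return into (1) gives:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right] = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \,\right]$$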
If we pull the first reward out of the sum, the formula can be rewritten as follows:
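Taking R_{t+1} out of the sum and factoring γ from the remaining terms:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \,\middle|\, S_t = s \,\right]$$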
The expectation here describes the return we expect to receive if we continue from state s while following policy π. It can be written out by summing over all possible actions and all possible next states. The next two equations help us take that step.
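One standard way to write out these two pieces, using the notation defined above (the first line handles the immediate reward, the second the remaining discounted sum):

$$\mathbb{E}_{\pi}\!\left[\, R_{t+1} \mid S_t = s \,\right] = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{s s'} \mathcal{R}^{a}_{s s'}$$

$$\mathbb{E}_{\pi}\!\left[\, \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \,\middle|\, S_t = s \,\right] = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{s s'} \, \gamma \, \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \,\middle|\, S_{t+1} = s' \,\right]$$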
By splitting the expectation into these two parts, we can rewrite our equation in the following form:
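Combining the two pieces:

$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{s s'} \left[ \mathcal{R}^{a}_{s s'} + \gamma \, \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \,\middle|\, S_{t+1} = s' \,\right] \right]$$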
Now notice that the expectation at the end of this equation is exactly formula (1), evaluated at the next state s'. We can therefore substitute it back in, which gives the Bellman equation for the state value function:
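$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{s s'} \left[ \mathcal{R}^{a}_{s s'} + \gamma V^{\pi}(s') \right]$$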
The Bellman equation for the action value function can be derived in a similar way; the detailed steps are at the end of the article for those who are interested. The end result is as follows:
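$$Q^{\pi}(s, a) = \sum_{s'} \mathcal{P}^{a}_{s s'} \left[ \mathcal{R}^{a}_{s s'} + \gamma \sum_{a'} \pi(s', a') \, Q^{\pi}(s', a') \right]$$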
The Bellman equation is important because it lets us express the value of one state in terms of the values of other states. That means that if we know the value of state s_{t+1}, we can easily compute the value of state s_t. This opens the door to iterative approaches that compute the value of every state: if we know the value of the next state, we can know the value of the current state. Keep the equation numbers in mind; we will refer to them again in the appendix. Finally, with the Bellman equation in hand, we can start to investigate how to compute optimal policies and write our first reinforcement learning agent. The snippet below shows a single Bellman backup in code.
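As a minimal sketch of the idea (using the same illustrative table formats as the earlier snippets, not any library API), a single Bellman backup computes V(s) from the current estimates of V(s'):

```python
def bellman_backup(state, policy, transitions, V, gamma=0.9):
    """One application of the Bellman equation for V under a fixed policy.

    policy[state][action]      -> probability pi(s, a)
    transitions[state][action] -> list of (probability, next_state, reward)
    V                          -> dict of current value estimates, keyed by state
    """
    value = 0.0
    for action, action_prob in policy[state].items():
        for prob, next_state, reward in transitions[state][action]:
            value += action_prob * prob * (reward + gamma * V[next_state])
    return value
```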
Next step: dynamic programming
In the next article, we will look at using dynamic programming to compute optimal policies, which lays the foundation for more advanced algorithms. It will also be our first chance to actually write a reinforcement learning algorithm. We will look at policy iteration and value iteration, along with their strengths and weaknesses. Until then, thank you for reading.
As promised: deriving the Bellman equation for the action value function
Following the same steps we used to derive the Bellman equation for the state value function, we obtain an analogous chain of equations. This time we start from equation (2):
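A reconstruction of the steps, mirroring the state value derivation above:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\; A_t = a \,\right] = \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \,\middle|\, S_t = s,\; A_t = a \,\right]$$

Because the action is fixed, the expectation is taken only over the next states and the policy’s choice of the next action:

$$Q^{\pi}(s, a) = \sum_{s'} \mathcal{P}^{a}_{s s'} \left[ \mathcal{R}^{a}_{s s'} + \gamma \, \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+2} \,\middle|\, S_{t+1} = s' \,\right] \right] = \sum_{s'} \mathcal{P}^{a}_{s s'} \left[ \mathcal{R}^{a}_{s s'} + \gamma \sum_{a'} \pi(s', a') \, Q^{\pi}(s', a') \right]$$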
Related links:
Reinforcement Learning: An Introduction — incompleteideas.net/sutton/book…
RL Course by David Silver (videos) — www.youtube.com/watch?v=2pW…
RL Course by David Silver (slides) — www0.cs.ucl.ac.uk/staff/d.sil…
Everything You Need to Know to Get Started in Reinforcement Learning — joshgreaves.com/reinforceme…
Understanding RL: The Bellman Equations — joshgreaves.com/reinforceme…