The previous article, Reinforcement Learning 1 — Markov Decision Process (MDP), introduced Markov decision processes. This article explains how to solve them with dynamic programming.

There are two key points of dynamic programming:

  • First, optimal substructure: the optimal solution of the overall problem can be composed from the optimal solutions of its subproblems, so it can be obtained by finding the optimal solutions of those subproblems.
  • Second, a recurrence between subproblems: the state of a larger subproblem can be derived from the states of smaller subproblems.

In the previous article we introduced the Bellman expectation equation for the state value function:


$$v_{\pi}(s) = \sum_{a \in A}\pi(a|s)\left( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v_\pi(s')\right)$$

From this formula we can see that the state value function of each state defines a subproblem. At the same time, the formula is recursive: the state value of a state $s$ in the current iteration can be computed from the state values of the previous iteration. So reinforcement learning satisfies both conditions for dynamic programming.

Planning in an MDP can be divided into two problems:

  • Prediction (policy evaluation)
    • Input: an MDP $\langle S, A, P, R, \gamma \rangle$ and a policy $\pi$
    • Output: the value function $v_\pi$
  • Control (policy control, i.e. finding the optimal policy)
    • Input: an MDP $\langle S, A, P, R, \gamma \rangle$
    • Output: the optimal value function $v^*$ and the optimal policy $\pi^*$

I. Policy Evaluation

First, let us look at how dynamic programming solves the prediction problem of reinforcement learning, that is, computing the state value function of a given policy. This process is commonly called policy evaluation.

The basic idea of policy evaluation is to start from an arbitrary state value function and, following the given policy, repeatedly apply the Bellman expectation equation together with the state transition probabilities and rewards to update all state values synchronously, until the values converge to the state value function of that policy.

Suppose we have computed the state values $v_t(s')$ of all states in iteration $t$. Then in iteration $t+1$ we can use those values to compute $v_{t+1}(s)$. This is done with the Bellman expectation equation, i.e., by iterating the Bellman expectation backup:


$$v_{t+1}(s) = \sum_{a \in A}\pi(a|s) \left(R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v_t(s')\right)$$
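
To make the backup concrete, here is a minimal NumPy sketch of one synchronous sweep over all states; the array layout (pi of shape (nS, nA), P of shape (nS, nA, nS), R of shape (nS, nA)) is my own assumption for illustration.

import numpy as np

def bellman_expectation_sweep(v, pi, P, R, gamma):
    """One synchronous Bellman expectation backup over all states.

    v:  (nS,)        state values from the previous iteration
    pi: (nS, nA)     pi[s, a] = probability of choosing action a in state s
    P:  (nS, nA, nS) P[s, a, s'] = transition probability
    R:  (nS, nA)     expected immediate reward for (s, a)
    """
    q = R + gamma * P @ v          # q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * v(s')
    return (pi * q).sum(axis=1)    # v_new[s] = sum_a pi(a|s) * q(s, a)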

1. A policy evaluation example

This is the classic Grid World example. We have a 4×4 grid of 16 cells. Only the upper-left and lower-right cells are terminal cells; their value is fixed at 0, and once an individual reaches one of them it stops moving and receives a reward of 0 in every subsequent step. Each move made in the other 14 cells yields an immediate reward $R$ of -1. The individual can only move one cell at a time, and only up, down, left or right; it cannot move diagonally. A move that would leave the grid leaves the individual in its current cell. The discount factor is $\gamma = 1$. The given policy is the uniform random policy: from each cell the individual moves in each of the four directions with probability 25%.

First we initialize the state value of every cell to 0, as shown in the figure for $k=0$. Now we begin the policy evaluation iterations. Since the value of the terminal cells is fixed at 0, we can leave them out of the iterative process. For $k=1$, we use the Bellman equation above to compute the value of the first cell in the second row:


$$v_1(21) = \frac{1}{4}\left[(-1 + 0) + (-1 + 0) + (-1 + 0) + (-1 + 0)\right] = -1$$

Value of the second cell in the second row:


$$v_1(22) = \frac{1}{4}\left[(-1 + 0) + (-1 + 0) + (-1 + 0) + (-1 + 0)\right] = -1$$

The other cells are computed in the same way. The state values after the first iteration are shown in the figure for $k=1$. That completes the first iteration; now for the second. Again, the value of the first cell in the second row:


$$v_2(21) = \frac{1}{4}\left[(-1 + 0) + (-1 - 1) + (-1 - 1) + (-1 - 1)\right] = -1.75$$

The value of the second cell in the second row is:


$$v_2(22) = \frac{1}{4}\left[(-1 - 1) + (-1 - 1) + (-1 - 1) + (-1 - 1)\right] = -2$$

The result is shown in the figure for $k=2$. The third iteration proceeds in the same way:


$$v_3(21) = \frac{1}{4}\left[(-1 - 1.7) + (-1 - 2) + (-1 - 2) + (-1 + 0)\right] = -2.425$$

The result is shown in the figure for $k=3$. We continue like this until the value of every cell changes very little between iterations. At that point we have the state values of all cells under the random policy.
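
As a quick sanity check on these numbers, the sketch below runs this iterative policy evaluation on the 4×4 grid world under the uniform random policy (my own code, assuming states numbered 0-15 row by row with terminals at 0 and 15, reward -1 per move and $\gamma = 1$):

import numpy as np

N = 4
TERMINALS = {0, N * N - 1}                      # upper-left and lower-right cells
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

def step(s, move):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    row, col = divmod(s, N)
    r, c = row + move[0], col + move[1]
    return r * N + c if 0 <= r < N and 0 <= c < N else s

v = np.zeros(N * N)
for k in range(1, 4):
    v_new = np.zeros_like(v)
    for s in range(N * N):
        if s in TERMINALS:
            continue
        # Bellman expectation backup under the 0.25-each random policy, reward -1 per move
        v_new[s] = sum(0.25 * (-1 + v[step(s, m)]) for m in MOVES)
    v = v_new
    print(f"k={k}: v(21)={v[4]:.3f}, v(22)={v[5]:.3f}")
# Prints -1.000/-1.000, -1.750/-2.000 and -2.438/-2.875, matching the worked
# values above up to the rounding used in the figure.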

II. Policy Control

1. Policy Iteration

For the control problem, one feasible approach is to adjust our policy in time according to the state values evaluated under the current policy. This method is called Policy Iteration.

How do we adjust? The simplest way is to act greedily. Consider the following greedy policy: in a given state, the individual chooses the action leading to the successor state with the highest state value, as shown on the right of the figure above. Looking at the final converged state values, the neighbours of the first cell in the second row have values 0, -18 and -20, so the greedy rule changes the policy there from moving randomly to moving towards the state with value 0: the arrow points up. The neighbours of the second cell in the second row have values -14, -14, -20 and -20, so the policy there becomes: move towards -14, i.e. up or to the left.

  • Policy Iteration is therefore divided into two alternating steps:
    • Policy evaluation: evaluate the current policy $\pi$ to obtain $v_\pi$
    • Policy improvement: improve the policy greedily with respect to $v_\pi$:

$$\pi' = \text{greedy}(v_{\pi})$$

If we have a policy $\pi$, we can evaluate its state value function $v_\pi(s)$; from that evaluation we can greedily extract a better policy $\pi'$, then compute $v_{\pi'}(s)$ and obtain an even better policy $\pi''$, and so on, until the policy no longer changes and the termination condition is satisfied.
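
This alternation is often pictured as a chain of evaluation (E) and improvement (I) steps (the standard diagram, not one from the original post):

$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$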

The calculation formulas are as follows:

1. Policy evaluation


$$v_{i}(s) = \sum_{a\in A} \pi(a|s) \left( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v_{i-1}(s') \right)$$

2. Policy improvement


$$q_i(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \, v_i(s')$$

$$\pi_{i+1}(s) = \arg\max_a \; q_i(s,a)$$

Each iteration here is carried out under a fixed policy, so the policy $\pi$ is dropped from the notation and the iteration subscript $i$ is used instead.

$q_i(s,a)$ can be pictured as a table indexed by state along one axis and action along the other, each entry being the value produced by taking that action in that state. The improvement step simply picks, for each state, the action with the largest value.

The full Policy Iteration algorithm simply alternates these two steps until the policy no longer changes; a compact sketch follows.
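
Written directly from the two formulas above, a minimal NumPy sketch of the loop looks like this; the array shapes (P of shape (nS, nA, nS), R of shape (nS, nA)) and the fixed number of evaluation sweeps are simplifying assumptions of mine, not from the original post.

import numpy as np

def policy_iteration_sketch(P, R, gamma, n_eval_sweeps=100):
    """Policy iteration on a tabular MDP given as arrays P (nS, nA, nS) and R (nS, nA)."""
    nS, nA, _ = P.shape
    policy = np.zeros(nS, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: sweep the Bellman expectation backup under the current policy
        v = np.zeros(nS)
        for _ in range(n_eval_sweeps):
            v = R[np.arange(nS), policy] + gamma * P[np.arange(nS), policy] @ v
        # 2. Policy improvement: compute q_i(s, a) and act greedily
        q = R + gamma * P @ v                     # shape (nS, nA)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # policy is stable, so it is optimal
            return policy, v
        policy = new_policy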

Bellman Optimality Equation

Using the relationship between the state value function and the action value function, we get:


$$v_*(s) = \max_a q_*(s,a)$$

$$q_*(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \, v_*(s')$$

At the optimum, the value of a state equals the largest action value achievable in that state.

Combining these two equations gives the Bellman Optimality Equation:


$$v_*(s) = \max_a \left(R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \, v_*(s')\right)$$

$$q_*(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \, \max_{a'} q_*(s', a')$$
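
As a toy check of these equations (my own example, not from the original post): suppose there is a single non-terminal state $A$ with two actions, stay (reward 0, remain in $A$) and exit (reward 1, move to a terminal state whose value is 0), and $\gamma = 0.5$. Then

$$v_*(A) = \max\left(0 + 0.5\,v_*(A),\; 1 + 0.5 \cdot 0\right)$$

If exit is optimal, $v_*(A) = 1$; checking the stay branch gives $0 + 0.5 \cdot 1 = 0.5 < 1$, so the guess is consistent and exit is indeed the optimal action.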

2. Value Iteration

Looking back at the grid world figures above, we notice that if we adjust the policy greedily, then by $k=3$ we have already obtained the optimal policy; there is no need to keep iterating until the state values converge. This observation turns policy iteration into value iteration.

For example, at $k=2$ the neighbours of the first cell in the second row have values 0, -2 and -2, so the greedy rule changes the action there from moving randomly to moving towards the state with value 0: the arrow points up. The neighbours of the second cell in the second row have values -1.7, -1.7, -2 and -2, so the policy there becomes: move towards -1.7, i.e. up or to the left.

Instead of waiting for the state values to converge, we adjust the policy while the values are still being iterated, which greatly reduces the number of iterations. The state value update is also different from policy iteration: it now iterates the Bellman optimality equation directly:


$$v_*(s) = \max_a q_*(s,a)$$

$$\Downarrow$$

$$v_{i+1}(s) \leftarrow \max_{a \in A} \left(R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \, v_i(s')\right)$$

At the end, the optimal policy $\pi^*$ is extracted directly:


$$\pi^*(s) = \arg\max_a \; q^{\pi}(s,a)$$

$$\Downarrow$$

$$\pi^*(s) \leftarrow \arg\max_a \left(R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \, v_{end}(s')\right)$$

The Value Iteration algorithm therefore repeats the backup above until the values stop changing and then extracts the greedy policy once at the end; a compact sketch follows.
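
A minimal sketch in the same style as the policy iteration sketch above (same illustrative array layout, again my own simplification rather than the original listing):

import numpy as np

def value_iteration_sketch(P, R, gamma, eps=1e-8):
    """Value iteration on a tabular MDP given as arrays P (nS, nA, nS) and R (nS, nA)."""
    nS, nA, _ = P.shape
    v = np.zeros(nS)
    while True:
        # v_{i+1}(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * v_i(s') ]
        v_new = (R + gamma * P @ v).max(axis=1)
        if np.abs(v_new - v).max() < eps:         # values have (approximately) stopped changing
            v = v_new
            break
        v = v_new
    # extract the greedy policy once from the final values
    policy = (R + gamma * P @ v).argmax(axis=1)
    return policy, v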

3. Interactive demo

The following interactive page from Stanford University helps build intuition for the two kinds of iteration, policy iteration and value iteration. At the start, as shown in the upper left of the figure below, each grid cell is a state with four actions: up, down, left and right. Each state displays a number representing its current value; some states also display a second number in their lower half, the reward obtained for entering that state. The initial policy is uniform over the four directions, i.e. probability 0.25 for each. The task is to find the optimal policy for every state.

3.1 Policy Iteration

Clicking Policy Evaluation performs one round of policy evaluation, as shown in the upper right: the value $V(s)$ of some states is updated. Clicking Policy Update then improves the policy based on the current state values, as shown in the figure below: the policy arrows of some states change. Repeating these two steps eventually gives every state its optimal value and optimal policy, as shown in the bottom right.

3.2 Value Iteration

Value iteration keeps iterating the state values $V$ directly: it computes the action values $Q$ and takes the maximum as the new state value. Click Toggle Value Iteration and wait a few seconds to see the result:

4. Code walkthrough

Gym is a toolkit open-sourced by OpenAI for developing and comparing reinforcement learning algorithms. It provides many built-in environments and is a good platform for learning reinforcement learning. Here we use one of the simplest environments it provides, FrozenLake-v0. As shown below, the agent starts at position S and has to reach the goal G, collecting as much reward as possible along the way. As in the grid world example, the agent has four actions in each state (up, down, left and right); F marks a frozen (safe) cell and H marks a hole. The full description is at Gym.openai.com/envs/Frozen…
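
Before looking at the two scripts, it may help to peek at the transition table they rely on. The snippet below assumes an older gym release in which FrozenLake-v0 is still registered and the unwrapped environment exposes nS, nA and P; each entry P[s][a] is a list of (probability, next_state, reward, done) tuples.

import gym

env = gym.make('FrozenLake-v0')
print(env.env.nS, env.env.nA)          # 16 states, 4 actions
# Transitions for action 0 (move left) in the start state. The lake is slippery,
# so an intended move can end up in one of several neighbouring cells.
for prob, next_state, reward, done in env.env.P[0][0]:
    print(prob, next_state, reward, done)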

4.1 Policy Iteration

import numpy as np
import gym

def extract_policy(v, gamma=1.0):
    """Extract the greedy policy from a value function (uses the global env)."""
    policy = np.zeros(env.env.nS)
    for s in range(env.env.nS):
        q_sa = np.zeros(env.env.nA)
        for a in range(env.env.nA):
            # expected return of taking action a in state s
            q_sa[a] = sum([p * (r + gamma * v[s_]) for p, s_, r, _ in env.env.P[s][a]])
        policy[s] = np.argmax(q_sa)
    return policy

def compute_policy_v(env, policy, gamma=1.0):
    """Iteratively evaluate the state value function of the given policy."""
    v = np.zeros(env.env.nS)
    eps = 1e-10
    while True:
        prev_v = np.copy(v)
        for s in range(env.env.nS):
            policy_a = policy[s]
            v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in 
                        env.env.P[s][policy_a]])
        if (np.sum((np.fabs(prev_v - v))) <= eps):
            break
    return v

def policy_iteration(env, gamma=1.0):
    """ Policy-Iteration algorithm """
    # env.env.nS: 16
    # env.env.nA: 4
    # env.env.ncol / env.env.nrow: 4
    policy = np.random.choice(env.env.nA, size=(env.env.nS))  # random initial policy
    max_iterations = 200000
    for i in range(max_iterations):
        old_policy_v = compute_policy_v(env, policy, gamma)
        new_policy = extract_policy(old_policy_v, gamma)
        if (np.all(policy == new_policy)):
            print ('Policy-Iteration converged at step %d.' %(i+1))
            break
        policy = new_policy
    return policy

if __name__ == '__main__':

    env_name  = 'FrozenLake-v0' 
    env = gym.make(env_name)

    optimal_policy = policy_iteration(env, gamma = 1.0)
    print(optimal_policy)
    # Policy-Iteration converged at step 6.
    # [0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]

4.2 Value Iteration

import numpy as np
import gym

def extract_policy(v, gamma=1.0):
    """Extract the greedy policy from the state value function (uses the global env)."""
    policy = np.zeros(env.env.nS)
    for s in range(env.env.nS):
        # q_sa: values of all actions in state s
        q_sa = np.zeros(env.action_space.n)
        for a in range(env.action_space.n):
            for next_sr in env.env.P[s][a]:
                # next_sr is a tuple of (probability, next state, reward, done)
                p, s_, r, _ = next_sr
                q_sa[a] += (p * (r + gamma * v[s_]))
        # print("q_sa: ", q_sa)
        policy[s] = np.argmax(q_sa)
        # print('np.argmax(q_sa): ', policy[s])
    return policy

def value_iteration(env, gamma=1.0):
    """ Value-iteration algorithm """
    # env.env.nS: 16
    # env.env.nA: 4
    # env.env.ncol / env.env.nrow: 4
    v = np.zeros(env.env.nS)  
    max_iterations = 100000
    eps = 1e-20
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.env.nS):
            # env.env.P[s][a]: list of (probability, next_state, reward, done) transitions
            q_sa = [sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in
                         env.env.P[s][a]]) for a in range(env.env.nA)]
            v[s] = max(q_sa)
        if (np.sum(np.fabs(prev_v - v)) <= eps):
            print ('Value-iteration converged at iteration# %d.' %(i+1))
            break
    return v

if __name__ == '__main__':
    
    env_name  = 'FrozenLake-v0' 
    # Env is the environment information of this problem provided by GYM
    env = gym.make(env_name)
    gamma = 1.0

    optimal_v = value_iteration(env, gamma)
    policy = extract_policy(optimal_v, gamma)
    print('policy:', policy)
    # Value-iteration converged at iteration# 1373.
    # policy: [0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]

V. Summary

This post has introduced how to solve MDPs with dynamic programming and presented two iterative algorithms. Note that both share a prerequisite: the reward function R and the state transition matrix P must be known; only then can policy iteration and value iteration be applied. This setting is called model-based. Conversely, when the rewards and transition probabilities of the environment are unknown, the setting is called model-free. How do we solve an MDP in the model-free case? That is the topic of the next article: Monte Carlo (MC) sampling.
