This article was first published on: Walker AI

Policy optimization is a family of reinforcement learning algorithms whose basic idea differs from that of value-based algorithms, which is why many textbooks divide model-free RL into two categories: policy optimization and value-based methods. This blog series follows OpenAI's tutorial Spinning Up [1], a great resource for getting started with policy optimization, especially for beginners. The Policy Gradient (PG) algorithm is the core concept of policy optimization. In this chapter, we start from the simplest PG derivation and demystify policy optimization algorithms step by step.

1. Intuitive understanding

An intuitive explanation of the policy gradient can be expressed in one sentence: "If an action increases the final return, increase the probability of that action occurring; conversely, decrease its probability." This sentence conveys two points:

  • We consider the effect of the action on the return, not on the state or anything else.
  • We adjust the probability of an action occurring rather than assigning a score to the action, which is what distinguishes this from value-based algorithms.

2. Policy gradient derivation

In this section, we derive the basic formula of the policy gradient step by step. This section is very important: if you understand the derivation, you will have basically grasped the core idea of the policy gradient. So be patient with this section, ideally to the point where you can reproduce the derivation yourself.

  • Maximizing the return function

We use a parameterized neural network to represent our policy \pi_\theta, so our goal can be expressed as adjusting \theta to maximize the expected return, written as:


J(\pi_\theta)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}[R(\tau)] —(1)

In formula (1), \tau denotes a complete trajectory from start to finish. In general, for a maximization problem, we can use the gradient ascent algorithm to find the maximum:


\theta^{*}=\theta+\alpha \nabla_{\theta} J(\pi_\theta) —(2)

To improve the parameters step by step, we need \nabla_{\theta} J\left(\pi_{\theta}\right) and can then apply gradient ascent; the core idea is that simple.
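To make the update in formula (2) concrete, here is a minimal gradient-ascent sketch in PyTorch on a toy objective; the objective, learning rate, and initial parameters are illustrative assumptions (in PG, J would be the expected return).

```python
import torch

theta = torch.tensor([0.5, -0.3], requires_grad=True)  # parameters to optimize
alpha = 1e-2                                            # learning rate

for _ in range(100):
    J = -(theta ** 2).sum()          # toy concave objective, maximized at theta = 0
    J.backward()                     # compute the gradient of J w.r.t. theta
    with torch.no_grad():
        theta += alpha * theta.grad  # ascent step: theta <- theta + alpha * grad J
        theta.grad.zero_()
```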

  • Policy gradient

The key is to find the gradient of the return function J(\pi_\theta) with respect to \theta. This is the policy gradient. Algorithms that solve RL problems by following the policy gradient are called policy gradient algorithms; TRPO, among others, belongs to this family. Deriving \nabla_{\theta} J\left(\pi_{\theta}\right) is the core of this post.


\nabla_{\theta} J\left(\pi_{\theta}\right)=\nabla_{\theta} \underset{\tau \sim \pi_{\theta}}{\mathrm{E}}[R(\tau)] —(3)

=\nabla_{\theta} \int_{\tau} P(\tau \mid \theta) R(\tau) —(4)

=\int_{\tau} \nabla_{\theta} P(\tau \mid \theta) R(\tau) —(5)

=\int_{\tau} P(\tau \mid \theta) \nabla_{\theta} \log P(\tau \mid \theta) R(\tau) —(6)

=\underset{\tau \sim \pi_{\theta}}{\mathrm{E}}\left[\nabla_{\theta} \log P(\tau \mid \theta) R(\tau)\right] —(7)

The derivation above uses the log-derivative trick: the derivative of \log x with respect to x is \frac{1}{x}. Therefore, we can write:


\nabla_{\theta} P(\tau \mid \theta)=P(\tau \mid \theta) \nabla_{\theta} \log P(\tau \mid \theta) —(8)
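As a quick sanity check, identity (8) can be verified numerically with automatic differentiation. The sketch below parameterizes a single probability by a softmax over logits, which is an illustrative assumption rather than anything required by the derivation.

```python
import torch

theta = torch.randn(4, requires_grad=True)  # logits playing the role of the parameters
x = 2                                       # an arbitrary outcome index

# Left-hand side of (8): gradient of P(x | theta)
P = torch.softmax(theta, dim=0)[x]
grad_P = torch.autograd.grad(P, theta)[0]

# Right-hand side of (8): P(x | theta) times the gradient of log P(x | theta)
logP = torch.log_softmax(theta, dim=0)[x]
grad_logP = torch.autograd.grad(logP, theta)[0]

print(torch.allclose(grad_P, P.detach() * grad_logP, atol=1e-6))  # expected: True
```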

This identity takes us from formula (5) to formula (6). Next we expand formula (7) further, focusing on \nabla_{\theta} \log P(\tau \mid \theta). Let's first look at P(\tau \mid \theta):


P(\tau \mid \theta)=\rho_{0}\left(s_{0}\right) \prod_{t=0}^{T} P\left(s_{t+1} \mid s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} \mid s_{t}\right) —(8-1)

Taking the log turns the product into a sum:


\log P(\tau \mid \theta)=\log \rho_{0}\left(s_{0}\right)+\sum_{t=0}^{T}\left(\log P\left(s_{t+1} \mid s_{t}, a_{t}\right)+\log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right) —(8-2)

Taking the gradient with respect to \theta, the terms that do not depend on \theta drop out:


\nabla_{\theta} \log P(\tau \mid \theta)=\cancel{\nabla_{\theta} \log \rho_{0}\left(s_{0}\right)}+\sum_{t=0}^{T}\left(\cancel{\nabla_{\theta} \log P\left(s_{t+1} \mid s_{t}, a_{t}\right)}+\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right)

=\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) —(9)

Therefore, combining formula (7) with formula (9), we get the final expression


\nabla_{\theta} J\left(\pi_{\theta}\right)=\underset{\tau \sim \pi_{\theta}}{\mathrm{E}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) R(\tau)\right] —(10)

Formula (10) is the core expression of the PG algorithm. It shows that the policy gradient we want is actually an expectation, and in practice we can use the Monte Carlo idea: sample trajectories and approximate the expectation with their mean. We collect a set of trajectories \mathcal{D}=\left\{\tau_{i}\right\}_{i=1, \ldots, N}, where each trajectory is obtained by the agent interacting with the environment using the policy \pi_{\theta}. The policy gradient can then be estimated as:


\hat{g}=\frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) R(\tau) —(11)

where |\mathcal{D}| denotes the number of sampled trajectories. Now that we have completed the detailed derivation of the policy gradient, we can breathe a sigh of relief and make some modifications on the basis of formula (10).
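To make the Monte Carlo estimate concrete, formula (11) is typically implemented as a "pseudo-loss" whose gradient equals \hat{g}. The PyTorch sketch below assumes trajectories of equal length collected into tensors; the names logp and returns are illustrative assumptions, not from any particular library.

```python
import torch

def pg_pseudo_loss(logp, returns):
    """Pseudo-loss for estimator (11).

    logp:    [N, T] tensor of log pi_theta(a_t | s_t) along N sampled trajectories
    returns: [N]    tensor of total returns R(tau), one per trajectory
    """
    weighted = logp.sum(dim=1) * returns  # sum_t log pi(a_t | s_t) * R(tau) per trajectory
    return -weighted.mean()               # negate so a standard optimizer (descent) ascends J
```

Calling .backward() on this loss leaves -\hat{g} in the parameters' .grad fields, so one optimizer step performs gradient ascent on the expected return.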

Before making these changes, let's summarize a few points about formula (10), since it is the core formula of the PG algorithm:

  • Compared with common supervised learning, where we define a loss function, differentiate it with respect to the parameters, and use gradient descent to keep minimizing the loss, the PG algorithm's "loss function" is built from the log-probabilities of the sampled actions weighted by the return. Our goal is to maximize the expected return, so gradient ascent is used here.
  • In general supervised learning, the training and test samples come from the same distribution; the loss function is computed on samples from a fixed distribution and does not depend on the parameters we optimize. For the PG algorithm, however, there is a sampling process: different policies produce different samples, so the computed loss can vary greatly, which makes the network prone to overfitting. We will return to this when we discuss the more advanced Actor-Critic framework, which uses an adversarial-style idea to address this problem.
  • In general supervised learning, a smaller loss is better, and the loss is a very effective indicator of whether training has converged. For the PG algorithm, however, this "loss function" carries little meaning, mainly because the expected return here only applies to the data generated by the current policy; a decreasing loss therefore does not mean the model performs better.
  • We can think of R(\tau) as the weight on \log\pi_\theta(a_t \mid s_t). A small return indicates that taking action a_t in state s_t did not work well, so the probability of a_t in state s_t is reduced; conversely, a larger return increases the probability of the action, so that the most appropriate actions are gradually selected.

3. Improving the return function

Looking at formula (10), R(\tau) is the return of the entire trajectory, which is not entirely reasonable: applying the same return to every action in a trajectory assigns each action the same weight. Obviously, an action sequence contains both good and bad actions, and giving them all the same reward cannot reward the good ones and punish the bad ones. So how should we represent the reward for performing an action in a given state?

A more intuitive idea is that the current action affects only the subsequent states and the rewards received from then on. We can therefore use the return accumulated from the current step onward to represent the return attributed to the current action, which can be written as:


\hat{R}_{t} \doteq \sum_{t^{\prime}=t}^{T} R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right) —(12)

Spinning Up calls this the reward-to-go, and with it formula (10) can be rewritten as:


\nabla_{\theta} J\left(\pi_{\theta}\right)=\underset{\tau \sim \pi_{\theta}}{\mathrm{E}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \sum_{t^{\prime}=t}^{T} R\left(s_{t^{\prime}}, a_{t^{\prime}}, s_{t^{\prime}+1}\right)\right] —(13)

Of course, weighting by the reward-to-go is still fairly crude; more advanced weighting schemes can further reduce the variance of the gradient estimate. For reasons of space, we will cover them later.
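As a concrete illustration, the reward-to-go weights in formula (12) can be computed by a backward accumulation over one trajectory's per-step rewards. The sketch below is a minimal Python example; the function name and input format are assumptions for illustration, not from Spinning Up.

```python
def reward_to_go(rewards):
    """Compute hat{R}_t = sum_{t'=t}^{T} r_{t'} for each step of one trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]  # accumulate rewards from the end of the trajectory backward
        rtg[t] = running
    return rtg

# Example: per-step rewards [1.0, 0.0, 2.0] give reward-to-go [3.0, 2.0, 2.0].
print(reward_to_go([1.0, 0.0, 2.0]))
```

In the pseudo-loss sketched earlier, these per-step weights would replace the single trajectory return R(\tau), matching formula (13).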

4. Summary

In this chapter, we spent considerable effort deriving the core formula of the policy gradient (PG) and obtained the key expression (10). Understanding this formula is very helpful for understanding the whole PG family of algorithms, so we hope you work through its derivation carefully.


PS: For more technical content, follow the WeChat official account [xingzhe_ai] and join the discussion with Walker AI!


  1. OpenAI Spinning Up spinningup.openai.com/ ↩