
This article is the eighth in the introduction to reinforcement learning series. DDPG was mentioned earlier when we talked about Actor-Critic. DDPG is an algorithm proposed by the Google DeepMind team for outputting deterministic actions. It addresses a shortcoming of Actor-Critic: consecutive parameter updates are correlated with one another, which gives the neural network a biased, partial view of the problem. It also solves the problem that DQN cannot be applied to continuous actions.

Deep Deterministic Policy Gradient

Introduction

DDPG stands for Deep Deterministic Policy Gradient. It is an algorithm for solving continuous control problems, and it is model-free, off-policy, and policy-based. Original papers:

  • Silver, David, et al. “Deterministic Policy Gradient Algorithms.” ICML. 2014.
  • Lillicrap, Timothy P., et al. “Continuous control with deep reinforcement learning.” arXiv preprint arXiv:1509.02971 (2015).

We can break DDPG down. Deep means a neural network is used. Deterministic means the policy ultimately outputs a single, deterministic action. Policy Gradient, as we already know, refers to the policy gradient algorithm. DDPG can be viewed as an extended version of DQN. The difference is that DQN outputs a vector of Q-values, one for each discrete action, whereas DDPG directly outputs a single action. This is what allows DDPG to extend the ideas of DQN to continuous action spaces.

The network structure

The structure of DDPG is similar to Actor-Critic: it can be divided into two networks, a policy network and a value network. DDPG also keeps the idea of fixed target networks from DQN, so each of these two networks is further split into a target network and an online (estimation) network. The way the target networks are updated, however, is a little different. Here is a detailed analysis.

Let's start with the policy network, i.e. the Actor. The Actor outputs a deterministic action; the network that produces this action is written as $a = \mu_{\theta}(s)$. Earlier policy gradient methods used a stochastic policy, so obtaining an action required sampling from the distribution given by the current policy, whereas DDPG uses a deterministic policy: the action is determined directly by the function $\mu$. The Actor's estimation network is $\mu_{\theta}(s)$, where $\theta$ denotes the neural network parameters; this network outputs actions in real time. In addition, the Actor has a target network with the same structure but different parameters, which is used when updating the value network (Critic). Both networks output actions.
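To make the Actor concrete, here is a minimal PyTorch sketch of such a deterministic policy network. The hidden-layer sizes, the tanh squashing, and the `max_action` rescaling are illustrative assumptions, not details taken from the text:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy: maps a state s to a single action a = mu_theta(s)."""

    def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash the output to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # One deterministic continuous action, rescaled to the environment's action bounds.
        return self.max_action * self.net(state)
```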

Now let's look at the value network, i.e. the Critic. Its job is to fit the value function $Q_{\omega}(s, a)$. It also has an estimation network and a target network. Both output a Q-value, but they differ on the input side: the Critic's target network takes two inputs, the observation of the next state and the action produced by the Actor's target network for that state, while the Critic's estimation network takes the current state and the action produced by the Actor's estimation network. The target network is used to compute $Q_{\text{target}}$.
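A matching Critic sketch follows, again with assumed layer sizes. Concatenating the state and the action at the input is one common way to realize the "(s, a) in, Q-value out" structure described above (the original DDPG paper instead feeds the action in at a later layer):

```python
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Value network: maps a (state, action) pair to a scalar Q_omega(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # a single Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # The action enters on the input side, concatenated with the state.
        return self.net(torch.cat([state, action], dim=1))
```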

Here’s a diagram to visualize the whole process:

As we can see, the value network is updated by gradient descent on the TD-error. As the judge, the Critic does not know at first whether the action output by the Actor is good enough; it too has to learn, step by step, to give accurate scores. Using the target networks to estimate the value at the next moment, together with the actual reward $r$, we get $Q_{\text{target}} = r + \gamma Q_{\omega'}(s', \mu_{\theta'}(s'))$. Subtracting the current estimate $Q_{\omega}(s, a)$ from $Q_{\text{target}}$ and taking the mean squared error gives the loss. The update is essentially the same as in DQN; the only difference is that in DDPG the target network parameters are updated slowly (a soft update), instead of directly copying the online network's parameters every N steps as in DQN.
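Here is a hedged sketch of this Critic update, assuming the Actor/Critic modules sketched earlier, their target copies, an optimizer `critic_opt`, a sampled minibatch, and a discount factor `gamma`; the `done` mask is a common practical addition rather than something spelled out above:

```python
import torch
import torch.nn.functional as F


def critic_update(batch, actor_target, critic, critic_target, critic_opt, gamma=0.99):
    """One Critic step: pull Q_omega(s, a) toward r + gamma * Q_omega'(s', mu_theta'(s'))."""
    state, action, reward, next_state, done = batch
    with torch.no_grad():  # the target is treated as a constant
        next_action = actor_target(next_state)            # action from the Actor's target network
        q_target = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)
    q_value = critic(state, action)                       # current estimate Q_omega(s, a)
    loss = F.mse_loss(q_value, q_target)                  # mean squared TD-error
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```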

The policy network (Actor) is updated by gradient ascent: the Actor's goal is to find an action $a$ that maximizes the value $Q$, so the policy network's gradient step tries to maximize the Q-value output by the value network. In practice the loss carries a minus sign, $L = -Q_{\omega}(s, \mu_{\theta}(s))$, so that minimizing the loss is equivalent to maximizing $Q$.
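A minimal sketch of that Actor step, assuming the same networks and an optimizer `actor_opt`:

```python
def actor_update(state, actor, critic, actor_opt):
    """One Actor step: maximize Q_omega(s, mu_theta(s)) by minimizing its negative."""
    loss = -critic(state, actor(state)).mean()  # the minus sign turns gradient ascent into descent
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```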

It is worth noting that DDPG also borrows DQN's experience replay technique. DDPG stores transitions $(s, a, r, s')$ in a replay pool, and at every training step it simply samples a minibatch from the experience pool.
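A simple replay pool could look like the following sketch; the capacity, the batch size, and the extra `done` flag are assumptions made for convenience:

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions and samples random minibatches."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        def to_tensor(x):
            return torch.as_tensor(np.array(x), dtype=torch.float32)

        return (to_tensor(states), to_tensor(actions),
                to_tensor(rewards).unsqueeze(1),
                to_tensor(next_states),
                to_tensor(dones).unsqueeze(1))

    def __len__(self):
        return len(self.buffer)
```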

Algorithm process

The pseudocode is as follows:

First, initialize the Actor, the Critic, and their respective target networks (four networks in total), as well as the experience replay buffer R.
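Putting the earlier sketches together, the initialization step might look like this; the dimensions and learning rates are assumptions, and `Actor`, `Critic`, and `ReplayBuffer` are the classes sketched above:

```python
import copy

import torch.optim as optim

# Illustrative dimensions, e.g. for a Pendulum-like task (assumed values).
state_dim, action_dim = 3, 1

actor = Actor(state_dim, action_dim)
critic = Critic(state_dim, action_dim)
actor_target = copy.deepcopy(actor)      # target networks start as exact copies
critic_target = copy.deepcopy(critic)

actor_opt = optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)
buffer = ReplayBuffer()
```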

When the Actor network outputs an action, DDPG adds random noise to it to achieve exploration, so that the agent can better discover potentially optimal strategies. Next comes experience replay: the agent stores its interaction data with the environment, $(s_t, a_t, r_t, s_{t+1})$, in R. Then, at each training step, a minibatch is randomly sampled from R.
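A sketch of action selection with exploration noise. Plain Gaussian noise is used here for simplicity; the original DDPG paper uses Ornstein-Uhlenbeck noise, and the noise scale below is an assumption:

```python
import torch


def select_action(actor, state, noise_std=0.1, max_action=1.0):
    """Deterministic action plus exploration noise, clipped to the valid action range."""
    with torch.no_grad():
        action = actor(state)
    action = action + noise_std * torch.randn_like(action)  # simple Gaussian exploration noise
    return action.clamp(-max_action, max_action)
```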

For the parameter updates, the target value $y_i$ is computed using the target networks $Q'$ and $\mu'$, and the mean squared error between $y_i$ and the current value $Q$ forms the loss used for the Critic's gradient update. For the Actor's policy network, the deterministic action function $\mu_{\theta}(s)$ is substituted for $a$ in the Q-function, the gradient is taken with respect to $\theta$, and finally the target networks are softly updated.
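Finally, a sketch of the soft target-network update mentioned above, i.e. Polyak averaging $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$; the value of $\tau$ is an assumption (a small constant):

```python
def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)
```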

Summary

Put simply: DQN + Actor-Critic => Deep Deterministic Policy Gradient (DDPG). In fact, DDPG is closer to DQN, but it uses a structure similar to Actor-Critic. DDPG absorbs the single-step-update advantage of the policy gradient in Actor-Critic, as well as DQN's technique for estimating the Q-value. DDPG's biggest advantage is its ability to learn more effectively over continuous actions.
