The previous article, Reinforcement Learning 13 – the DDPG Algorithm in Detail, introduced the DDPG algorithm; this article introduces TD3. The full name of TD3 is Twin Delayed Deep Deterministic Policy Gradient. As the name suggests, TD3 is an upgraded version of DDPG, so if you understand DDPG, the TD3 algorithm will not be a problem.
TD3 makes three main improvements to DDPG, which are explained one by one below. The code of the two algorithms is also very similar, so only the improved parts are shown here; if you are not familiar with DDPG, please refer to the previous post, Reinforcement Learning 13 – the DDPG Algorithm in Detail and in Practice.
The complete TD3 code can be found at the Reinforcement Learning – TD3 algorithm code repository; if you find it useful, a star would be greatly appreciated.
1. Double Critic network
As we know, DDPG originates from DQN, and DQN originates from Q-learning. These algorithms seek the optimal policy by estimating the Q value. In reinforcement learning, the target value for updating the Q network is $y = r + \gamma \max_{a'} Q(s', a')$. Because the estimate contains noise $\epsilon$, in practice taking the maximum over noisy action-value estimates usually yields a value larger than the true one:
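As noted in the TD3 paper, this can be written as

$$
\mathbb{E}_{\epsilon}\Big[\max_{a'}\big(Q(s',a') + \epsilon\big)\Big] \;\ge\; \max_{a'} Q(s',a').
$$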
This inevitably reduces the accuracy of the value estimate. Since the estimate is computed from the Bellman equation, i.e. the value of the subsequent state is used in the update, this property makes the loss of accuracy worse. Using an inaccurate estimate in every policy update causes the errors to accumulate. These accumulated errors lead to a bad state being overestimated, so that in the end the policy cannot be optimized to the optimum and the algorithm fails to converge.
In the DQN family, the problem of Q-value overestimation is alleviated by using two networks to select and evaluate actions separately, namely the DDQN algorithm. In TD3 we likewise use two Critic networks to estimate the Q value and then take the smaller of the two for the update, which mitigates the overestimation. This may cause a slight underestimation, which makes training somewhat slower, but that is better than overestimation.
Note: here we use two Critic networks, and each Critic network has its own Target network. They can be understood as two independent Critic networks that each evaluate the input action; the min() function then takes the smaller of the two values as the update target. So the TD3 algorithm uses six networks in total.
Code implementation:
self.q_net1 = QNetwork(state_dim, action_dim, hidden_dim)                                 # Critic 1
self.q_net2 = QNetwork(state_dim, action_dim, hidden_dim)                                 # Critic 2
self.target_q_net1 = QNetwork(state_dim, action_dim, hidden_dim)                          # Target Critic 1
self.target_q_net2 = QNetwork(state_dim, action_dim, hidden_dim)                          # Target Critic 2
self.policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim, action_range)          # Actor
self.target_policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim, action_range)   # Target Actor
As shown above, there are two sets of Q networks for estimating the Q value and one set of policy networks. The update process is the same as in DDPG; the only difference is that after the two Critic networks compute their Q values, the minimum of the two is used to compute the target value:
target_q_min = tf.minimum(self.target_q_net1(target_q_input), self.target_q_net2(target_q_input))
target_q_value = reward + (1 - done) * gamma * target_q_min
Next, update the Critic network and Policy network respectively.
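For reference, a minimal sketch of the critic update with two Q networks might look like this (the function and optimizer names are illustrative assumptions, not the repository's exact code):

import tensorflow as tf

def update_critics(q_net1, q_net2, q_opt1, q_opt2, state, action, target_q_value):
    # Both critics regress onto the same min-based target value.
    q_input = tf.concat([state, action], axis=1)   # assumed (state, action) input layout
    for q_net, opt in ((q_net1, q_opt1), (q_net2, q_opt2)):
        with tf.GradientTape() as tape:
            q_loss = tf.reduce_mean(tf.square(q_net(q_input) - target_q_value))
        grads = tape.gradient(q_loss, q_net.trainable_weights)
        opt.apply_gradients(zip(grads, q_net.trainable_weights))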
2. Delayed Actor network updates
The second trick used in TD3 is delaying the Policy update. With target networks, we keep the target network out of sync with the online network: the target network parameters are only copied over after the online network has been updated d times. This reduces the accumulated error and therefore the variance. In the same way we can delay the update of the policy network, since parameter updates in the Actor-Critic method propagate slowly; delaying the actor update avoids unnecessary repeated updates on the one hand and reduces the error accumulated over multiple updates on the other. While reducing the update frequency, soft updates should also be used for the target networks: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$.
The delayed update of the policy network is also simple to implement, requiring only an if statement:
if self.update_cnt % self.policy_target_update_interval == 0:
Here update_cnt counts the updates and policy_target_update_interval is the update interval of the policy network: the policy network is updated only after the critics have been updated a certain number of times.
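To make the delayed update concrete, here is a rough sketch of how the counter and the soft updates could fit together (the soft_update helper and the tau value are illustrative assumptions, not the repository's exact code):

def soft_update(target_net, net, tau=0.005):
    # Polyak averaging: the target weights slowly track the online weights.
    for t, w in zip(target_net.trainable_weights, net.trainable_weights):
        t.assign(tau * w + (1.0 - tau) * t)

self.update_cnt += 1
if self.update_cnt % self.policy_target_update_interval == 0:
    # ... update policy_net here by maximizing q_net1(state, policy_net(state)) ...
    soft_update(self.target_policy_net, self.policy_net)
    soft_update(self.target_q_net1, self.q_net1)
    soft_update(self.target_q_net2, self.q_net2)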
3. Target Policy Smoothing Regularization
Above, we delayed the policy update to avoid excessive error accumulation. Next, consider whether the error itself can be reduced; the first step is to figure out where it comes from.
The root of the error is the bias produced by the value-function estimate. Knowing the cause, we can address it. In machine learning, a common way to reduce estimation error is regularization, and we can bring the same idea into reinforcement learning:
A natural idea in reinforcement learning is that similar actions should have similar values.
So we want to smooth the value estimate over a small area around the target action in action space to reduce the error. Concretely, a noise $\epsilon$ is added to the target action when computing the target Q value:
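Written out (following the TD3 paper), the smoothed target value becomes

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde a), \qquad
\tilde a = \pi_{\phi'}(s') + \epsilon, \qquad
\epsilon \sim \mathrm{clip}\big(\mathcal{N}(0,\sigma), -c, c\big).
$$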
The noise here can be thought of as a regularization, which makes the value function update smoother.
Code implementation:
# Assumes: import numpy as np, import tensorflow as tf, and a Normal distribution
# class such as tensorflow_probability's tfp.distributions.Normal
def evaluate(self, state, eval_noise_scale):
    """Compute the target action with smoothing noise for the target Q value."""
    state = state.astype(np.float32)
    action = self.forward(state)            # forward pass of the policy network
    action = self.action_range * action     # scale to the environment's action range
    # add clipped noise to the action (target policy smoothing)
    normal = Normal(0, 1)                   # standard normal distribution
    noise = normal.sample(action.shape) * eval_noise_scale
    eval_noise_clip = 2 * eval_noise_scale
    noise = tf.clip_by_value(noise, -eval_noise_clip, eval_noise_clip)
    action = action + noise
    return action
As shown in the code, adding noise to the action is implemented in the evaluation part of the policy network. The evaluate() function takes two arguments: state is the input state and eval_noise_scale scales the noise. The action is first computed by a forward pass. Noise is then added: it is sampled from a standard normal distribution with normal.sample(action.shape) and multiplied by eval_noise_scale to scale it. To prevent the noise from pulling the action too far, it is clipped to a range of twice eval_noise_scale. Finally the noise is added to the action and the result is returned.
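Tying this back to section 1, the smoothed action produced by evaluate() of the target policy network is what feeds the two target critics when the target value is computed, roughly as follows (variable names here are illustrative):

new_next_action = self.target_policy_net.evaluate(next_state, eval_noise_scale)  # smoothed target action
target_q_input = tf.concat([next_state, new_next_action], axis=1)                # assumed input layout
target_q_min = tf.minimum(self.target_q_net1(target_q_input), self.target_q_net2(target_q_input))
target_q_value = reward + (1 - done) * gamma * target_q_min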
Algorithm pseudocode:
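In outline, TD3 as described in the original paper (Fujimoto et al., 2018) proceeds as follows:
1. Initialize the two critics Q_θ1, Q_θ2 and the actor π_φ, their target networks, and a replay buffer.
2. Interact with the environment using the actor plus exploration noise and store the transitions in the replay buffer.
3. Sample a mini-batch, compute the smoothed target action (target policy output plus clipped noise), and update both critics towards y = r + γ · min(Q_θ'1, Q_θ'2) evaluated at that action.
4. Every d steps, update the actor with the deterministic policy gradient through Q_θ1 and softly update the three target networks.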
Reference:
zhuanlan.zhihu.com/p/55307499
zhuanlan.zhihu.com/p/86297106?…