We previously introduced the DQN algorithm, but DQN still has some problems. This article introduces algorithms that improve on DQN.
Double DQN algorithm
1. Algorithm introduction
The problems with DQN include: is the calculation of the Q target accurate? Is it reasonable to always obtain it through $\max_a Q(s', a)$? Obviously there is a problem, and it comes from an inherent flaw of Q-Learning: overestimation.
Overestimation means that the estimated value function is larger than the true value function, and it mainly stems from the maximization operation in Q-learning. Consider the TD target:

$y = r + \gamma \max_{a'} Q(s', a'; w')$, where $w'$ denotes the parameters of the target network.
The $\max$ operation makes the estimated value function larger than the true value function, because DQN is an off-policy method: instead of using the action actually taken at the next step, it updates the target value with the action that currently has the highest estimated value. (Note: for a real policy in a given state, the action with the largest Q value is not always selected, so directly taking the largest Q value as the target often makes the target higher than the true value.) The improvement of Double DQN is to use different value functions for action selection and action evaluation. Nature DQN already provides two Q networks (the estimation network and the target network), so the calculation of the TD target can be split into the following two steps, with a small numerical sketch afterwards:
- 1) Use the current Q estimation network to select the action $a_{max}$ with the largest value function:
- $a_{max}(s', w) = \arg\max_{a'} Q_{\text{estimate}}(s', a'; w)$
- 2) Evaluate this action with the target Q network; the TD target in Double DQN is then calculated as:
- $y = r + \gamma Q_{\text{target}}(s', a_{max}(s', w); w')$
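The effect can be seen with a small numerical sketch (not from the original article): suppose every action in $s'$ truly has value 0, and the two networks only see noisy estimates of it. Taking the max of a single noisy estimate is biased upward, while selecting with one estimate and evaluating with the other is not:

import numpy as np

np.random.seed(0)
n_actions, n_trials, noise = 4, 100000, 0.5

# Two independent, equally noisy estimates of Q(s', a); the true value of every action is 0.
q_estimate = np.random.randn(n_trials, n_actions) * noise   # "estimation" network
q_target = np.random.randn(n_trials, n_actions) * noise     # "target" network

# DQN-style target: select AND evaluate with the same estimate -> biased above the true value 0.
dqn_value = q_target.max(axis=1).mean()

# Double DQN-style target: select with the estimation network, evaluate with the target network.
a_max = q_estimate.argmax(axis=1)
ddqn_value = q_target[np.arange(n_trials), a_max].mean()

print(dqn_value, ddqn_value)   # roughly 0.5 vs 0.0 with these settings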
The DDQN and DQN processes are identical except for calculating the Target Q value.
2. Code presentation
As can be seen from the above, the only difference between Double DQN and DQN is the estimation of the target Q value; the rest of the process is the same. Here's the relevant code:
# target: start from the target network's Q values for the current states, shape [batch_size, action_dim]
target = self.target_model(states).numpy()
# next_target: Q values of the next states from the target network, shape [batch_size, action_dim]
next_target = self.target_model(next_states).numpy()
# next_q_value: the target network's value of the action chosen by the online network, shape [batch_size]
next_q_value = next_target[
    range(args.batch_size), np.argmax(self.model(next_states), axis=1)]
# (vanilla DQN would instead use: next_q_value = tf.reduce_max(next_target, axis=1))
target[range(args.batch_size), actions] = rewards + (1 - done) * args.gamma * next_q_value
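For completeness, here is a minimal sketch of how the constructed `target` is typically used to update the online network. This is not the original article's code; it assumes TensorFlow 2.x eager mode, a simple mean-squared-error loss, and an optimizer attribute named `self.model_optim` (name assumed):

# Sketch: regress the online network's Q values toward `target` (optimizer name assumed).
with tf.GradientTape() as tape:
    q_pred = self.model(states)
    loss = tf.reduce_mean(tf.square(q_pred - target))
grads = tape.gradient(loss, self.model.trainable_weights)
self.model_optim.apply_gradients(zip(grads, self.model.trainable_weights))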
Complete code: Reinforcement Learning – Double DQN code address; if it helps, please give it a star. Thank you.
Dueling DQN algorithm
1. Introduction to the algorithm
In the DQN algorithm, the Q value output by the neural network represents the action value. Could evaluating the action value alone be inaccurate? As we know, $Q(s, a)$ depends on both the state and the action, but the two do not influence it to the same degree, and we would like the network to reflect this difference.
The Dueling DQN algorithm improves DQN through the network structure: the action value function output by the neural network is decomposed into a state value function and an advantage function, namely:

$Q(s, a) = V(s) + A(s, a)$
These two functions are then approximated using a neural network.
To recap, we introduced the definition of the state value function $V(s)$ in the earlier MDP section:

$V_\pi(s) = \sum_{a} \pi(a \mid s) \, q_\pi(s, a)$
The state value function sums, over all possible actions in the state, the action value weighted by the probability of taking that action. In other words, the value function $V(s)$ is the average of all the action value functions in this state, weighted by the action probabilities, while the action value function $q(s, a)$ represents the value obtained by selecting action $a$ in state $s$.
So what is the advantage function? The advantage function is $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$, i.e. the value of the current action relative to the average. "Advantage" here refers to the advantage of the action value over the value of the current state: if the advantage is greater than zero, the action is better than average; if it is less than zero, the action is worse than average. In this way, actions that are better than the average action produce larger outputs, which speeds up network convergence.
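In practice, as in the model code below, the advantage stream is centred by subtracting its mean before being added to the state value, which keeps $V$ and $A$ identifiable. The network output is therefore aggregated as:

$Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big)$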
2. Code presentation
Similarly, Dueling DQN differs from DQN only in the network structure; the rest of the flow is exactly the same. Without further explanation, the code for creating the model is attached below:
def create_model(input_state_shape):
    input_layer = tl.layers.Input(input_state_shape)
    layer_1 = tl.layers.Dense(n_units=32, act=tf.nn.relu)(input_layer)
    layer_2 = tl.layers.Dense(n_units=16, act=tf.nn.relu)(layer_1)
    # state value stream: V(s), shape [batch_size, 1]
    state_value = tl.layers.Dense(n_units=1)(layer_2)
    # advantage stream: A(s, a), shape [batch_size, action_dim]
    q_value = tl.layers.Dense(n_units=self.action_dim)(layer_2)
    # subtract the mean advantage so that V and A are identifiable
    mean = tl.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1, keepdims=True))(q_value)
    advantage = tl.layers.ElementwiseLambda(lambda x, y: x - y)([q_value, mean])
    # output: Q(s, a) = V(s) + (A(s, a) - mean(A(s, ·)))
    output_layer = tl.layers.ElementwiseLambda(lambda x, y: x + y)([state_value, advantage])
    return tl.models.Model(inputs=input_layer, outputs=output_layer)
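Finally, a brief usage sketch (not from the original article) of how such a model is typically wired up inside the agent class; the state shape [None, 4] and the dummy batch are illustrative assumptions, and numpy is assumed to be imported as np as in the training code above:

# Sketch: build the online and target networks from create_model (assumed 4-dimensional state).
self.model = self.create_model([None, 4])
self.target_model = self.create_model([None, 4])
self.model.train()           # TensorLayer 2.x models must be set to train/eval mode before use
self.target_model.eval()

# A forward pass returns Q values of shape [batch_size, action_dim],
# already aggregated as V + (A - mean(A)).
q_values = self.model(np.random.rand(8, 4).astype(np.float32))

# Periodically copy the online weights into the target network.
for dst, src in zip(self.target_model.trainable_weights, self.model.trainable_weights):
    dst.assign(src)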
Complete code: Reinforcement Learning – Dueling DQN code address; if it helps, please give it a star. Thank you.