
Learn more about DQN

This article is the fifth in a series introducing reinforcement learning. We covered Q-learning earlier; today we introduce a deep version of Q-learning. Learning objectives for this section: What is DQN? What is its relationship with Q-learning? What is value function approximation? How is the neural network trained?

Introduction

DQN stands for Deep Q-Network. It is a model-free, value-based, off-policy method, proposed by DeepMind in the 2013 paper Playing Atari with Deep Reinforcement Learning. DeepMind's 2015 paper, Human-Level Control Through Deep Reinforcement Learning, introduced a new version of DQN built on a CNN.

Reviewing Q-learning, we know it is an off-policy algorithm. Its updates are based on a Q-table, which stores the values for the entire state-action space; each learning step updates this table. The update rule is as follows:


$$Q(s,a)\leftarrow Q(s,a)+\alpha[r+\gamma\max_{a'}Q(s',a')-Q(s,a)]$$
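As a point of reference before moving on to DQN, here is a minimal sketch of this tabular update in Python (the state/action counts, `alpha`, and `gamma` are illustrative values, not taken from the article):

```python
import numpy as np

n_states, n_actions = 16, 4          # illustrative sizes
alpha, gamma = 0.1, 0.99             # learning rate and discount factor (example values)
Q = np.zeros((n_states, n_actions))  # the Q-table holds a value for every (state, action) pair

def q_learning_update(s, a, r, s_next):
    """One tabular Q-learning step: move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```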

However, in practice an agent may have a huge number of states or actions. In that case it is clearly infeasible to hold a Q-table in memory, and the computation also becomes very expensive. To deal with an overly large state space (also known as the curse of dimensionality), the DQN algorithm was proposed: it uses a neural network to compute the Q value.

Value function approximation

In simple terms, the idea is to use a function $f(s,a)$ to approximate $Q(s,a)$. The function can be linear or non-linear.


$$\hat{Q}(s,a|\theta)\approx Q(s,a)$$

$\theta$ denotes the parameters of the function. How do we fit such a function? DQN uses a convolutional neural network (CNN): the CNN takes the state $s$ as input and outputs a vector containing the value of every action, and its parameters are trained until convergence. Below is the CNN structure from the paper.
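As a rough sketch, a Q-network with this input/output shape could look as follows in PyTorch (the layer sizes roughly follow the commonly cited 2015 architecture, and the 7x7 flatten assumes stacks of four 84x84 frames; treat all of these as illustrative assumptions):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of frames (the state s) to a vector of Q-values, one per action."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 assumes 84x84 input frames
            nn.Linear(512, n_actions),              # one Q-value per action
        )

    def forward(self, state):
        return self.head(self.features(state))
```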

How do you train?

As we know, training a neural network requires labeled data, and enough samples are needed to fit the parameters. DQN constructs an experience pool to solve this problem.

In simple terms, after initialization the agent runs for a while just as Q-learning does, using the $\epsilon$-greedy strategy. The state $s$, action $a$, reward $r$, and next state $s'$ are stored in the experience pool.
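A minimal sketch of $\epsilon$-greedy action selection with such a network (assuming the hypothetical `QNetwork` above and a state tensor of shape `(4, 84, 84)`):

```python
import torch

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise take argmax_a Q(s, a)."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(n_actions, (1,)).item()   # random exploratory action
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))           # shape (1, n_actions)
    return int(q_values.argmax(dim=1).item())          # greedy action
```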

This is also a key step in DQN. With the experience pool, samples can be reused. During training, a batch of samples is drawn at random from the experience pool; this method is called Experience Replay. One of its benefits is that it weakens the correlation between samples, which makes the neural network training more stable.
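A minimal replay-buffer sketch (the capacity and batch size are arbitrary example values, not from the papers):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform sampling weakens correlation
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```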

With samples, we can construct the following Loss Function:


$$L(\theta)=E[(Q_{target}-Q(s,a|\theta))^2]=E[(r+\gamma\max_{a'}Q(s',a'|\theta)-Q(s,a|\theta))^2]$$

The gradient is computed from the loss function, and the parameters are updated with SGD until they converge to the desired $\theta$.
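A sketch of one such gradient step (it assumes the hypothetical `QNetwork` and `ReplayBuffer` above, with the sampled batch already stacked into tensors, and adds a `done` mask to cut the bootstrap term at terminal states; plain SGD is used here for simplicity):

```python
import torch
import torch.nn.functional as F

def sgd_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on L(theta) = E[(r + gamma * max_a' Q(s',a'|theta) - Q(s,a|theta))^2]."""
    s, a, r, s_next, done = batch                            # tensors sampled from the experience pool
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a | theta) for the actions taken
    with torch.no_grad():                                    # the target is treated as a constant
        q_target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
```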

Algorithm process

Here is the pseudocode of the CNN version from the 2015 paper.

Looking carefully at the pseudocode above, we find that there are two sets of network parameters, $\theta$ and $\theta^-$. In fact, DQN uses two CNNs: one computes $Q_{target}$ (the network with parameters $\theta^-$), and the other computes the current $Q(s,a)$ (the network with parameters $\theta$). So why add a target network?

This technique is also called Fixed Q-targets: a target network with the same structure but different parameters is added, so that training of the original Q-network stays stable. Imagine we had only one Q-network: every time $\theta$ is updated, the target changes as well, meaning the goal the network is chasing keeps moving, and training may become unstable. So we introduce a target network and keep its target fixed, refreshing it only every so often, so that the original Q-network is chasing a fixed target while $\theta$ is updated.
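A minimal sketch of Fixed Q-targets under these assumptions (the hypothetical `QNetwork` from earlier and an arbitrary sync period `SYNC_EVERY`):

```python
import copy

q_net = QNetwork(n_actions=4)        # theta: the network being trained
target_net = copy.deepcopy(q_net)    # theta^-: same structure, a frozen copy of theta

SYNC_EVERY = 1000                    # refresh theta^- every C steps (example value)

def maybe_sync_target(step):
    """Keep the target network fixed between syncs; refresh it from the online network every C steps."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```

In the gradient step sketched earlier, the bootstrap term would then use `target_net(s_next)` instead of `q_net(s_next)`, so the target no longer moves with every update of $\theta$.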

Summary

To put it simply, DQN plugs in a neural network: it takes the state as input, outputs the values of all actions, and then, following the Q-learning principle, directly selects the action with the maximum value as the next action. In other words, the neural network is used to compute the Q value, turning the Q-table into a Q-network, which makes it possible to handle high-dimensional state or action spaces. DQN also has drawbacks: for example, it cannot be used for continuous control, because the max over Q operation means DQN can only handle discrete action spaces.
