In the introduction to reinforcement learning: DQN, we covered the origin of DQN in detail, as well as the two techniques the DQN algorithm proposes to address the difficulty of getting reinforcement learning to converge: experience replay and fixed Q-targets (a separate target network). In this article we will implement the DQN algorithm in code.
I. Introduction to the environment
1. Gym introduction
Gym is a simulation platform for researching and developing reinforcement learning algorithms. It provides interfaces to many problems and environments (or games), and users can run tests and simulations simply by calling these interfaces, without needing to understand the internal implementation of each game. It is also compatible with common numerical libraries such as TensorFlow.
import gym
env = gym.make('CartPole-v1')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
Running this code opens a window that renders the CartPole environment while the cart takes random actions.
As can be seen from the above code, the core interface of Gym is Env. As a unified environment interface, Env provides the following core methods:
- reset(self): resets the state of the environment and returns the initial observation. When an episode ends, this function is called to reset the environment.
- step(self, action): executes the action action, advances the environment by one time step, and returns observation, reward, done and info:
  - observation: the environment's observation, i.e. the state
  - reward: the reward received
  - done: whether the current episode has ended
  - info: diagnostic information that is usually not needed
- render(self, mode="human", close=False): redraws one frame of the environment.
- close(self): closes the environment and frees its memory.
The above code first imports the gym library; line 2 creates the CartPole-v1 environment and line 3 resets the environment state. The for loop runs for 1000 time steps: line 5 refreshes the rendered frame at each time step and line 6 takes a random action (0 or 1) in the current environment state; line 7 closes the simulation environment after the loop ends.
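The snippet above throws away the values returned by step(). A slightly expanded sketch that actually uses the observation, reward and done values described above, and resets the environment when an episode ends, could look like this:

```python
import gym

env = gym.make('CartPole-v1')
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()                  # still a random action
    observation, reward, done, info = env.step(action)  # unpack the four return values
    if done:                                            # the episode ended, start a new one
        observation = env.reset()
env.close()
```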
2. CartPole-v1 environment introduction
CartPole is a basic environment provided by Gym: the cart-pole game. In the game there is a cart with a pole standing on it, and the initial state is slightly different after every reset. The cart needs to move left and right to keep the pole upright. To keep the game going, the following two conditions must be satisfied:
- The tilt angle θ of the pole must not exceed 12°
- The cart's position x must stay within a certain range (2.4 units from the center to either side)
For the CartPole-v1 environment, the action space consists of two discrete actions: move left (0) and move right (1). The observation contains four variables: cart position, cart velocity, pole angle, and pole angular velocity. The following code shows this:
import gym
env = gym.make('CartPole-v1')
print(env.action_space) # Discrete(2)
observation = env.reset()
print(observation) # [-0.0390601 -0.04725411 0.0466889 0.02129675]
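If you also want to see the valid range of these four state variables, the observation space itself can be inspected (this check is not part of the original example):

```python
print(env.observation_space)       # a Box space with 4 dimensions
print(env.observation_space.high)  # upper bound of each state variable
print(env.observation_space.low)   # lower bound of each state variable
```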
The following uses the CartPole-v1 environment as an example to introduce the implementation of DQN.
II. Code implementation
1. Implementation of the experience replay pool
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        # store a transition; overwrite the oldest entry once the buffer is full
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = int((self.position + 1) % self.capacity)

    def sample(self, batch_size=args.batch_size):  # args holds the script's hyperparameters
        # draw a random batch and regroup it field by field
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done
First we define an experience replay pool with a capacity of 10000. The push function adds the information from the agent's interactions with the environment to the pool. It is implemented as a circular queue, so note how the position pointer wraps around. When data is needed to update the algorithm, sample randomly draws batch_size transitions from the pool and uses the zip function to group the corresponding fields together:
zip: a=[1,2], b=[2,3], zip(a,b) => [(1,2), (2,3)]
Each field is then converted into a NumPy array with np.stack and returned.
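As a quick illustration of what sample() produces, here is a small standalone example with three made-up transitions (the numbers are arbitrary):

```python
import numpy as np

# three fake transitions: (state, action, reward, next_state, done)
batch = [
    (np.array([0.1, 0.2]), 0, 1.0, np.array([0.2, 0.3]), False),
    (np.array([0.2, 0.3]), 1, 1.0, np.array([0.3, 0.4]), False),
    (np.array([0.3, 0.4]), 0, 0.0, np.array([0.4, 0.5]), True),
]

state, action, reward, next_state, done = map(np.stack, zip(*batch))
print(state.shape)  # (3, 2) -- one row per transition
print(action)       # [0 1 0]
print(done)         # [False False  True]
```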
2. Network structure
The code in this reinforcement learning series uses TensorLayer, a wrapper around TensorFlow that makes it easier to use and, most importantly, provides interfaces built specifically for reinforcement learning. The description from the official site:
TensorLayer is a deep learning and reinforcement learning library for researchers and engineers based on Google TensorFlow. It provides a higher-level deep learning API that not only speeds up experiments for researchers, but also reduces rework for engineers in actual development. TensorLayer is very easy to modify and extend, which makes it suitable for both machine learning research and applications.
Define the network model:
# inside the agent's constructor; tf is TensorFlow and tl is TensorLayer
def create_model(input_state_shape):
    input_layer = tl.layers.Input(input_state_shape)
    layer_1 = tl.layers.Dense(n_units=32, act=tf.nn.relu)(input_layer)
    layer_2 = tl.layers.Dense(n_units=16, act=tf.nn.relu)(layer_1)
    output_layer = tl.layers.Dense(n_units=self.action_dim)(layer_2)
    return tl.models.Model(inputs=input_layer, outputs=output_layer)

self.model = create_model([None, self.state_dim])
self.target_model = create_model([None, self.state_dim])
self.model.train()
self.target_model.eval()
As you can see, using TensorLayer is similar to using TensorFlow; with basic TensorFlow knowledge you can understand it at a glance. In the code above, we define a function that builds the network model, and then create a current network model and a target network target_model. We know that the target network in DQN is only used as a "target" to evaluate the target value, so we set it to evaluation mode by calling eval(). The model network is the one we train, so we call train() to set it to training mode.
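One thing the snippets above do not show is the optimizer used later in replay() as self.model_optim. How it is created is not in the original excerpt; a typical choice (an assumption here, with a placeholder learning rate) would be:

```python
# assumed optimizer; the original code may use a different one or a different learning rate
self.model_optim = tf.optimizers.Adam(learning_rate=1e-3)
```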
3. Algorithm control process
for episode in range(train_episodes):
    total_reward, done = 0, False
    state = self.env.reset()  # start each episode from a fresh environment state
    while not done:
        action = self.choose_action(state)
        next_state, reward, done, _ = self.env.step(action)
        self.buffer.push(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        # self.render()
    if len(self.buffer.buffer) > args.batch_size:
        self.replay()
        self.target_update()
After each episode, once the length of the experience pool exceeds batch_size, replay() is called to update the parameters of the model network, and target_update() is then called to copy the model network's parameters into the target_model network.
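choose_action() and target_update() are called above but not listed in the article; a minimal sketch of both, assuming a standard ε-greedy policy (self.epsilon is my own placeholder) and a hard parameter copy, could look like this:

```python
def choose_action(self, state):
    # epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.uniform() < self.epsilon:
        return self.env.action_space.sample()
    q_values = self.model(np.array([state], dtype=np.float32))[0]
    return int(np.argmax(q_values))

def target_update(self):
    # hard update: copy every weight of the current network into the target network
    for weight, target_weight in zip(self.model.trainable_weights,
                                     self.target_model.trainable_weights):
        target_weight.assign(weight)
```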
4. Update network parameters
def replay(self):
    for _ in range(10):
        states, actions, rewards, next_states, done = self.buffer.sample()
        # compute the target value for the sample tuple
        # target [batch_size, action_dim]
        # target represents the current fitting level
        target = self.target_model(states).numpy()
        next_q_value = tf.reduce_max(self.target_model(next_states), axis=1)
        target_q = rewards + (1 - done) * args.gamma * next_q_value
        target[range(args.batch_size), actions] = target_q
        with tf.GradientTape() as tape:
            q_pred = self.model(states)
            loss = tf.losses.mean_squared_error(target, q_pred)
        grads = tape.gradient(loss, self.model.trainable_weights)
        self.model_optim.apply_gradients(zip(grads, self.model.trainable_weights))
This part is the core code of DQN. In replay(), we update the current network ten times in a row, in order to change the relative update frequency of the two networks, which helps the network converge.
For the update itself: as we know, DQN replaces the Q table in Q-Learning with a neural network, and the two have a lot in common, so we can compare with the Q-Learning update. In the Q-table form, the action values Q of a given state are looked up directly by index; with a neural network, the state must be fed into the network and the action values obtained by a forward pass.
Line 3 first samples a batch of batch_size transitions. Line 7 then gets the current action values: target holds the action values computed by the target network with its current parameters. Line 8 retrieves the action values of the next states under the same parameters and uses reduce_max() to take the maximum over actions. Lines 9 and 10 use this maximum next-state action value to compute target_q, i.e. $r + \gamma \max_{a'} \hat{Q}(s', a', w)$, and write it into target. Note that all of the target calculation above uses the target_model network; the target network is only used to evaluate the target value.
Lines 11 to 15 then compute the current network's prediction $\hat{Q}(s, a, w)$, use the MSE function to compute the loss against target, and finally update the model network with the resulting gradients.
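The only slightly tricky step is the indexing in line 10, target[range(args.batch_size), actions] = target_q, which overwrites the Q value of the action actually taken in each transition and leaves the other actions untouched. A tiny standalone illustration with made-up numbers and a batch of three:

```python
import numpy as np

target = np.array([[0.5, 1.0],   # predicted Q values for two actions, three states
                   [0.2, 0.8],
                   [0.9, 0.1]])
actions = np.array([1, 0, 1])    # action actually taken in each transition
target_q = np.array([2.0, 3.0, 4.0])

target[range(3), actions] = target_q
print(target)
# [[0.5 2. ]
#  [3.  0.8]
#  [0.9 4. ]]
```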
DQN code address also please give a star, thank you
III. DQN summary
Although the two techniques proposed by DQN work well, some problems remain, for example:
- Is the Q target computed correctly? Is it reasonable to always obtain it through $\max Q$?
- The Q value represents the action value; is estimating the action value directly in this simple way accurate enough?
The improvement addressing the first problem is Double DQN, and the one addressing the second is Dueling DQN. Both are improved versions of DQN, and we will introduce them in the next article.