This article was originally published in: Walker AI
The first breakthrough of DRL (Deep Reinforcement Learning) was arguably the DQN (Deep Q-Network) algorithm proposed by DeepMind in 2013, when it was applied to Atari games for the first time. Today (2017), DRL remains one of the most cutting-edge research areas. In just four years, DRL has gone from playing Atari to AlphaGo, Dota AI, and StarCraft AI, repeatedly exceeding people's expectations.
1. What is Q-learning
Q-learning is a temporal-difference method for solving reinforcement learning control problems. Given the current state $S$, action $A$, immediate reward $R$, discount factor $\gamma$, and exploration rate $\epsilon$, it seeks the optimal action-value function $Q$ and the optimal policy $\pi$.
- $S$: the environment state; $S_t$ denotes the environment state at time $t$.
- $A$: the action taken by the agent; $A_t$ denotes the action at time $t$.
- $R$: the reward from the environment; the reward $R_{t+1}$ for taking action $A_t$ in state $S_t$ at time $t$ is received at time $t+1$.
- $\gamma$: the discount factor, i.e. the weight given to delayed rewards.
- $\epsilon$: the exploration rate. Q-learning normally selects the action with the largest Q value in the current state, which could leave some actions never tried again. So when the agent selects an action, there is a small probability $\epsilon$ that it does not pick the greedy (maximum-value) action but explores a random one instead.
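As a small illustration of the $\epsilon$-greedy rule described in the last bullet, here is a minimal sketch (the Q-table layout and the value of $\epsilon$ are hypothetical, not from the article):

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise pick the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: a random action
    return int(np.argmax(Q[state]))           # exploit: the action with the largest Q value
```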
1.1 Introduction to the Q-learning algorithm
First, based on state $S$, we select action $A$ with the $\epsilon$-greedy method, then perform action $A$ to get reward $R$ and enter state $S'$. The $Q$ value is updated as follows:
$$Q(S,A) = Q(S,A) + \alpha\left(R + \gamma \max_{a} Q(S',a) - Q(S,A)\right)$$
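For example (with made-up numbers purely for illustration): if $\alpha = 0.1$, $\gamma = 0.9$, $Q(S,A) = 0.5$, $R = 1$ and $\max_a Q(S',a) = 0.8$, then the update gives $Q(S,A) = 0.5 + 0.1 \times (1 + 0.9 \times 0.8 - 0.5) = 0.622$.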
1.2 Algorithm flow of Q-learning
- Randomly initialize the values of all state-action pairs (initialize the $Q$ table).
- For $i$ from 1 to $T$ ($T$: total number of iterations):
A) Initialize $S$ as the first state of the current episode
B) Select action $A$ in the current state $S$ using the $\epsilon$-greedy method
C) Perform action $A$ in state $S$ to get the new state $S'$ and the reward $R$
D) Update the value function $Q(S,A)$: $Q(S,A) = Q(S,A) + \alpha\left(R + \gamma \max_{a} Q(S',a) - Q(S,A)\right)$
E) $S = S'$
F) If $S'$ is a terminal state (done), the current episode is finished
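To make the flow above concrete, here is a minimal tabular Q-learning sketch. The one-dimensional corridor environment is a hypothetical stand-in for the maze example below, chosen only so the code is self-contained:

```python
import numpy as np

# A tiny 1-D corridor: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 gives reward 1 and ends the episode (hypothetical environment).
N_STATES, N_ACTIONS = 5, 2

def step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))                  # initialize the Q table

for episode in range(500):                           # i from 1 to T
    S, done = 0, False                               # A) initial state
    while not done:
        # B) epsilon-greedy action selection
        A = np.random.randint(N_ACTIONS) if np.random.rand() < epsilon else int(np.argmax(Q[S]))
        S_next, R, done = step(S, A)                 # C) perform A, observe S' and R
        # D) update Q(S, A)
        Q[S, A] += alpha * (R + gamma * np.max(Q[S_next]) - Q[S, A])
        S = S_next                                   # E) move on; F) stop when done

print(np.round(Q, 2))
```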
1.3 An example of a Q table
(1) Game map
- The black boxes are traps
- The yellow boxes are exits (bonus points)
(2) This is the Q table after training the model
(3) Take a simple example
- If the agent enters the maze at position "1", it looks up its Q table: the largest Q value there is 0.59 for moving down, so the agent moves to position "5".
- At position "5", the largest Q value is 0.66 for moving down, so the agent moves to position "9".
- At position "9", the largest Q value is 0.73 for moving right, so the agent moves to position "10".
- At position "10", the largest Q value is 0.81 for moving down, so the agent moves to position "14".
- At position "14", the largest Q value is 0.9 for moving right, so the agent moves to position "15".
- At position "15", the largest Q value is 1 for moving right, so the agent moves to position "16" and reaches the exit.
Finally, the agent's path is 1 → 5 → 9 → 10 → 14 → 15 → 16.
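A side note on where these numbers plausibly come from: if we assume the exit reward is 1, all other rewards are 0, and the discount factor is $\gamma = 0.9$ (values the article does not state explicitly), then along the optimal path the converged Q value of the best action $k$ steps before the exit is $\gamma^{k-1}$: $0.9^5 \approx 0.59$, $0.9^4 \approx 0.66$, $0.9^3 \approx 0.73$, $0.9^2 = 0.81$, $0.9^1 = 0.9$, $0.9^0 = 1$, which matches the Q values above.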
The $Q$-table values will differ from run to run, but the principle remains the same.
If you want to see it more visually, click here
2. DQN (Deep Q-Network)
As mentioned above, Q-learning makes decisions based on the values in the Q table: it selects the action that leads to the larger expected reward. The state space and action space we have discussed so far are very small, but if they become very large, can we still represent them with a Q table? Obviously not, so we introduce value function approximation.
2.1 Value function approximation
Because the state space in practical problems is very large, a feasible solution is value function approximation. We introduce a state value function $\hat v$, parameterized by a weight vector $\omega$, which takes state $S$ as input and approximates the value of that state: $\hat v(S,\omega) \approx v_\pi(S)$
The $\omega$ above is equivalent to the parameters of a neural network. Given an input state $S$, MC (Monte Carlo) or TD (temporal difference) methods are used to compute the value function as the training target, and the weights $\omega$ are trained until convergence. In fact, DQN is simply the combination of a neural network with Q-learning: it replaces the Q table with a Q network.
2.2 Deep Q-learning algorithm ideas
DQN is an off-policy algorithm. In Professor Hung-yi Lee's words, it can "learn by watching others play". Why can DQN learn this way? Because DQN learns through experience replay: the reward obtained from each interaction with the environment, the current state, the next state, and other data are all saved and later used to update the Q network.
In fact, Nature DQN is the second generation of DQN; the NIPS DQN is the original version, and there are many later variants such as Double DQN, Dueling DQN, and so on. The reason I introduce Nature DQN here is that, personally, I consider this version the most classic. Let's take a look at how DQN does reinforcement learning.
2.3 Algorithm flow
Input: total number of iterations $T$, state feature dimension $N$, action dimension $A$, step size $\alpha$, discount factor $\gamma$, exploration rate $\epsilon$, current Q network $Q$, target Q network $Q'$, number of samples $M$ per batch gradient descent step, and target Q network update frequency $P$.
Output: Q network parameters
- Randomly initialize all state-action values $Q$, randomly initialize all parameters $\omega$ of the current $Q$ network, initialize the target network's parameters as $\omega' = \omega$, and empty the experience pool $D$.
- For $i$ from 1 to $T$:
A) Initialize the environment, obtain the first state $S$, and compute its feature vector $\phi(S)$
B) Use $\phi(S)$ as input to the $Q$ network to get the Q values of all actions, and use the $\epsilon$-greedy method to select an action $A$ from this output
C) Perform action $A$ in state $S$ to get the new state $S'$ with its feature vector $\phi(S')$, the reward $R$, and the termination flag $is\_done$
D) Store the five-tuple $\{\phi(S), A, R, \phi(S'), is\_done\}$ in the experience pool $D$
E) $S = S'$
F) Sample $M$ transitions $\{\phi(S_j), A_j, R_j, \phi(S'_j), is\_done_j\}$, $j = 1, 2, \ldots, M$, from the experience pool $D$ and compute the current target Q value $y_j$: $y_j = R_j$ if $is\_done_j$ is true, otherwise $y_j = R_j + \gamma \max_{a'} Q'\big(\phi(S'_j), a', \omega'\big)$
G) Use the mean squared error loss $\frac{1}{M}\sum_{j=1}^{M}\big(y_j - Q(\phi(S_j), A_j, \omega)\big)^2$ and update the parameters $\omega$ of the $Q$ network by gradient descent (backpropagation)
H) If $i \% P = 0$, update the target $Q$ network parameters: $\omega' = \omega$
I) If $S'$ is a terminal state, the current episode is finished; otherwise, go back to step B)
2.4 DQN implementation code
(1) Network structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(N_STATES, 50)           # hidden layer
        self.fc1.weight.data.normal_(0, 0.1)         # initialization
        self.out = nn.Linear(50, N_ACTIONS)          # output layer: one Q value per action
        self.out.weight.data.normal_(0, 0.1)         # initialization

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        actions_value = self.out(x)
        return actions_value
- The evaluation network and the target network share the same structure.
- Their weights differ: one is updated at every learning step, the other only at fixed intervals.
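The snippets in this section refer to constants (N_STATES, N_ACTIONS, EPSILON, GAMMA, ...) and a wrapper class that the excerpt does not show. Below is a hedged sketch of that missing setup, assuming the classic CartPole environment from gym and illustrative hyperparameter values; the methods in (2), (3) and (4) below would live inside this DQN class:

```python
import numpy as np
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative hyperparameters (assumed values, not taken from the article)
BATCH_SIZE = 32
LR = 0.01                  # learning rate
EPSILON = 0.9              # probability of acting greedily
GAMMA = 0.9                # reward discount factor
TARGET_REPLACE_ITER = 100  # target network update frequency
MEMORY_CAPACITY = 2000     # experience pool size

env = gym.make('CartPole-v0')              # assumed environment: 4-dimensional state, 2 actions
N_STATES = env.observation_space.shape[0]
N_ACTIONS = env.action_space.n
ENV_A_SHAPE = 0                            # discrete action space, so actions are plain integers

class DQN(object):
    def __init__(self):
        self.eval_net, self.target_net = Net(), Net()   # two networks with identical structure
        self.learn_step_counter = 0                     # counts learning steps for target updates
        self.memory_counter = 0                         # counts stored transitions
        self.memory = np.zeros((MEMORY_CAPACITY, N_STATES * 2 + 2))  # each row: s, a, r, s_
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()
```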
(2) Selection of actions
def choose_action(self, x):                          # x holds the 4 values of the current state
    x = torch.unsqueeze(torch.FloatTensor(x), 0)     # add a batch dimension at dim 0
    # input only one sample
    if np.random.uniform() < EPSILON:                # greedy
        actions_value = self.eval_net.forward(x)     # query eval_net for the Q values of all actions
        action = torch.max(actions_value, 1)[1].data.numpy()   # index of the maximum value in the row
        action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)  # return the argmax index
    else:                                            # random
        action = np.random.randint(0, N_ACTIONS)
        action = action if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
    return action
An exploration rate $\epsilon$ is built in: with a small probability the agent picks a random action instead of the greedy one.
(3) Experience pool
def store_transition(self, s, a, r, s_):
    # s and s_ each have 4 values: position, velocity, pole angle, pole angular velocity
    transition = np.hstack((s, [a, r], s_))          # flatten the transition into one row
    # replace the oldest memory with the new memory
    index = self.memory_counter % MEMORY_CAPACITY
    self.memory[index, :] = transition               # overwrite the slot at this index
    self.memory_counter += 1
(4) Update network parameters
def learn(self):
    # target parameter update: copy eval_net's weights to target_net every TARGET_REPLACE_ITER steps
    if self.learn_step_counter % TARGET_REPLACE_ITER == 0:
        self.target_net.load_state_dict(self.eval_net.state_dict())
    self.learn_step_counter += 1

    # sample a batch of transitions
    sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)   # pick BATCH_SIZE random indices
    b_memory = self.memory[sample_index, :]
    b_s = torch.FloatTensor(b_memory[:, :N_STATES])                        # current states
    b_a = torch.LongTensor(b_memory[:, N_STATES:N_STATES+1].astype(int))   # actions
    b_r = torch.FloatTensor(b_memory[:, N_STATES+1:N_STATES+2])            # rewards
    b_s_ = torch.FloatTensor(b_memory[:, -N_STATES:])                      # next states

    # q_eval: Q value of the action actually taken, computed by eval_net
    q_eval = self.eval_net(b_s).gather(1, b_a)                             # shape (batch, 1)
    q_next = self.target_net(b_s_).detach()      # detach from the graph so target_net is not backpropagated
    q_target = b_r + GAMMA * q_next.max(1)[0].view(BATCH_SIZE, 1)          # shape (batch, 1)
    loss = self.loss_func(q_eval, q_target)

    self.optimizer.zero_grad()   # reset all gradients to zero
    loss.backward()              # backpropagation
    self.optimizer.step()        # apply the parameter update
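Finally, for context, here is a hedged sketch of the main training loop that typically drives these methods, assuming the classic (pre-0.26) gym API where env.step returns four values; the episode count and logging are illustrative:

```python
dqn = DQN()

for i_episode in range(400):
    s = env.reset()                                  # initial state of the episode
    ep_r = 0
    while True:
        a = dqn.choose_action(s)                     # epsilon-greedy action from eval_net
        s_, r, done, info = env.step(a)              # interact with the environment
        dqn.store_transition(s, a, r, s_)            # save the transition in the experience pool
        ep_r += r
        if dqn.memory_counter > MEMORY_CAPACITY:     # start learning once the pool is full
            dqn.learn()
        if done:
            print('Episode %d | reward: %.2f' % (i_episode, ep_r))
            break
        s = s_
```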
DQN is the gateway to deep reinforcement learning. Once you step through this door, the rest becomes much easier to learn.
PS: For more technical content, follow our WeChat official account [xingzhe_ai] and join the discussion with Walker AI!