This article was first published on: Walker AI
I believe that reinforcement learning is, in a sense, the future of artificial intelligence. — Richard Sutton, father of reinforcement learning
Put simply, reinforcement learning lets an agent learn which action to take in each state in order to obtain the maximum reward. Reinforcement learning algorithms can be divided into off-policy and on-policy methods. This article starts from Q-learning (off-policy) and Sarsa (on-policy) and discusses the similarities and differences between the two.
1. A brief introduction to Q-learning
Q-learning is a value-based reinforcement learning algorithm. Q stands for Q(S, A): the expected return obtained by taking action A (from the set of available actions) in state S (from the set of states) at a given moment. The environment responds to the agent's action with a corresponding reward R. The main idea of the algorithm is therefore to build a Q-table, indexed by state and action, that stores the Q values, and then to select the action with the highest Q value.
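As a minimal sketch of such a Q-table, assuming pandas is used to store it (as the code in this article does) and that states are numbered from 0. The number of states and the action names below are illustrative assumptions, not values taken from the original code:

```python
import pandas as pd

# Hypothetical sizes and action names, chosen only for illustration.
N_STATES = 16
ACTIONS = ["up", "down", "left", "right"]


def build_q_table(n_states, actions):
    """Return a Q-table with one row per state and one column per action, all zeros."""
    return pd.DataFrame(0.0, index=range(n_states), columns=actions)


q_table = build_q_table(N_STATES, ACTIONS)
print(q_table.head())
```

Every entry starts at 0 because the agent knows nothing about the environment yet; the values only become meaningful as updates accumulate.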
Put more simply, we use the ε-greedy strategy to select action A in state S, execute A to obtain the next state S' and the reward R, and then update the Q value as follows:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a \in A^*} Q(S', a) - Q(S, A) \right]$$

Where: S is the current state, A is the current action, S' is the next state, α is the learning rate, R is the reward, A* is the set of actions available in S', and γ is the discount rate, that is, the weight given to future value relative to the current reward.
The main code is as follows:
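import numpy as np

# Choose an action for state s.
# EPSILON (probability of acting greedily) and ACTIONS (list of action names) are defined elsewhere.
def chose_direction(s, q_table):
    # Act at random with probability 1 - EPSILON, or when every Q value in this state is still 0
    if np.random.uniform() > EPSILON or (q_table.iloc[s, :] == 0).all():
        direction = np.random.choice(ACTIONS)
    else:
        # Otherwise take the action with the largest Q value
        direction = ACTIONS[q_table.iloc[s, :].argmax()]
    return direction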
# Learning process
s = 0
is_terminal = False
step_count = 0
while not is_terminal:
    a = chose_direction(s, q_table)
    s_, r = update(s, a)  # environment step: returns the next state and the reward
    q_predict = q_table.loc[s, a]
    if s_ != "terminal":
        # Bootstrap from the largest Q value in the next state
        q_target = r + GAMA * q_table.iloc[s_, :].max()
    else:
        q_target = r
        is_terminal = True
    q_table.loc[s, a] += ALPHA * (q_target - q_predict)
    s = s_  # move to the next state
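To make the update rule concrete, here is a small worked example of a single Q-learning step on plain numbers. It is only a sketch: the function name `q_learning_update`, the values of `ALPHA` and `GAMMA`, and the sample Q values are all assumptions made for illustration.

```python
ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount rate (assumed value)


def q_learning_update(q_sa, reward, next_q_values, alpha=ALPHA, gamma=GAMMA):
    """Return the new Q(S, A) after one Q-learning step.

    next_q_values holds Q(S', a) for every action a in S';
    pass an empty list when S' is terminal.
    """
    best_next = max(next_q_values) if next_q_values else 0.0
    q_target = reward + gamma * best_next
    return q_sa + alpha * (q_target - q_sa)


# Current estimate 0.5, no immediate reward, best next-state value 1.0.
new_q = q_learning_update(q_sa=0.5, reward=0.0, next_q_values=[0.2, 1.0, -0.3, 0.0])
print(new_q)
```

With these numbers the target is 0.9, so the estimate moves one tenth of the way from 0.5 toward 0.9 and lands at roughly 0.54.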
2. A brief introduction to Sarsa
Sarsa's decision-making part is exactly the same as Q-learning's: both use a Q-table to make decisions, selecting the action with the higher value in the Q-table and applying it to the environment in exchange for a reward. The update, however, is different. The five letters of "Sarsa" stand for S (current state), A (current action), R (reward), S' (next state), and A' (next action), which means that when we update the value of A in the current state S, we have already decided on the next state S' and the next action A'. The formula for Sarsa is as follows:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$$
Here is the main code for the Sarsa update:
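s = 0
is_terminal = False
step_count = 0
a = chose_direction(s, q_table)  # the first action is chosen before the loop
while not is_terminal:
    s_, r = update(s, a)  # environment step: returns the next state and the reward
    q_predict = q_table.loc[s, a]
    if s_ != "terminal":
        # Choose the next action now and bootstrap from its Q value
        a_ = chose_direction(s_, q_table)
        q_target = r + GAMA * q_table.loc[s_, a_]
    else:
        a_ = None  # no action is taken from a terminal state
        q_target = r
        is_terminal = True
    q_table.loc[s, a] += ALPHA * (q_target - q_predict)
    s = s_
    a = a_  # the action used in the update is the one actually taken next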
3. Comparing the two algorithms with an example
3.1 Game Map
- Grey cells are traps
- Yellow cells are bonus points
The game starts at position 1 and ends when the agent reaches a trap or a bonus point.
3.2 Comparison and analysis of results
- Grey marks a state in which the given action leads into a trap
- Yellow marks a state in which the given action leads to the bonus point
Q-learning chooses the shortest path, while Sarsa tends to steer away from traps early. For example, at position 13, Q-learning only chooses to go up, whereas the table shows that Sarsa chooses to go left to avoid danger.
Q-learning and Sarsa are exactly the same in the decision-making part: both use a Q-table and select the action with the largest value to apply to the environment in exchange for a reward. They differ in the update. Q-learning updates with the value of the action in S' that would bring the greatest benefit, even though that action may not be the one actually chosen when the next decision is made. Sarsa drops the max over Q and instead uses the Q value of the action A' actually selected in S'. Like Q-learning, it then computes the difference between the target and the current estimate and updates the value in the Q-table.
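The difference between the two targets can be illustrated on a single transition. The sketch below is not taken from the original code: the function names, the value of `GAMMA`, and the Q values for S' are hypothetical and chosen only to make the contrast visible.

```python
GAMMA = 0.9  # discount rate (assumed value)


def q_learning_target(reward, next_q, gamma=GAMMA):
    """Target built from the best action value in S', whether or not that action is taken."""
    return reward + gamma * max(next_q.values())


def sarsa_target(reward, next_q, next_action, gamma=GAMMA):
    """Target built from the value of the action actually chosen in S'."""
    return reward + gamma * next_q[next_action]


# Hypothetical Q values for the next state S'.
next_q = {"up": 1.0, "down": 0.0, "left": 0.4, "right": -1.0}

# Suppose exploration happened to pick "right" (toward a trap) in S'.
ql = q_learning_target(0.0, next_q)
sa = sarsa_target(0.0, next_q, "right")
print(ql, sa)
```

Under these assumed values Q-learning still sees a target of 0.9, because it assumes the best action will follow, while Sarsa sees -0.9, because it accounts for the risky action it is actually about to take. This is why exploratory missteps near traps lower Sarsa's estimates for nearby states but not Q-learning's.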
4. Summary
Both Q-learning and Sarsa start from state S, select an action A from the current Q-table using some strategy (ε-greedy), observe the next state S', and then select an action A' from the Q-table again. They differ only in how A' is chosen. According to the algorithm descriptions, when selecting the action A' for the new state S', Q-learning uses the greedy strategy, i.e., it picks the A' with the maximum value. At that point it only computes which A' maximizes Q(S', A'); it does not necessarily take this action. Sarsa, on the other hand, selects A' with the ε-greedy strategy and actually takes it. That is, the two update targets are:

$$\text{Q-learning: } R + \gamma \max_{a} Q(S', a) \qquad \text{Sarsa: } R + \gamma Q(S', A')$$

Because of this difference in how they update, Q-learning is a greedy, bold algorithm that does not worry about traps, while Sarsa is a conservative algorithm that is very sensitive to mistakes and death. Each algorithm has its own advantages in different scenarios.
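For readers who want to experiment with this comparison, the following is a self-contained sketch that trains both algorithms on a small grid. The layout is hypothetical: the original map image is not reproduced here, so the 4x4 grid, the trap and bonus positions, the reward values, and all hyperparameters are assumptions. Note also that in this sketch `EPSILON` is the probability of exploring, and NumPy arrays are used for the Q-table instead of pandas.

```python
import numpy as np

# Hypothetical 4x4 grid: states 0..15, row by row. Not the article's actual map.
N_ROWS, N_COLS = 4, 4
START, BONUS, TRAPS = 0, 15, {5, 7, 11}
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # assumed values; EPSILON = exploration probability


def step(s, a):
    """Move from state s with action a; return (next_state, reward, done)."""
    row, col = divmod(s, N_COLS)
    d_row, d_col = MOVES[a]
    row = min(max(row + d_row, 0), N_ROWS - 1)  # walls keep the agent on the grid
    col = min(max(col + d_col, 0), N_COLS - 1)
    s_next = row * N_COLS + col
    if s_next in TRAPS:
        return s_next, -1.0, True
    if s_next == BONUS:
        return s_next, 1.0, True
    return s_next, 0.0, False


def choose(q, s, rng):
    """Epsilon-greedy choice; also random while all Q values in s are still 0."""
    if rng.random() < EPSILON or not q[s].any():
        return int(rng.integers(len(ACTIONS)))
    return int(q[s].argmax())


def train(method, episodes=500, max_steps=200, seed=0):
    """Train with method 'sarsa' or 'q_learning' and return the Q-table."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N_ROWS * N_COLS, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        a = choose(q, s, rng)
        for _ in range(max_steps):
            s_next, r, done = step(s, ACTIONS[a])
            if done:
                target = r
            elif method == "sarsa":
                a_next = choose(q, s_next, rng)         # action that will really be taken
                target = r + GAMMA * q[s_next, a_next]
            else:
                target = r + GAMMA * q[s_next].max()    # best action, taken or not
            q[s, a] += ALPHA * (target - q[s, a])
            if done:
                break
            if method != "sarsa":
                a_next = choose(q, s_next, rng)         # Q-learning chooses after the update
            s, a = s_next, a_next
    return q


q_learning_q = train("q_learning")
sarsa_q = train("sarsa")
for name, q in [("Q-learning", q_learning_q), ("Sarsa", sarsa_q)]:
    print(name, "greedy action at start:", ACTIONS[int(q[START].argmax())])
```

Whether the two learned policies visibly differ on this particular layout depends on the assumed rewards, the exploration rate, and the random seed, so it is worth varying `EPSILON` and the trap positions when trying it out.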