This article was first published on Walker AI.
Introduction
The Monte Carlo method is not a single algorithm but a general term for a class of randomized algorithms based on the idea of using the "frequency" of events in place of their "probability". In reinforcement learning, this approach can be applied when the model is unknown: all that is needed is experience, i.e., sample sequences of states, actions, and rewards. After enough experience has been collected, the reinforcement learning task is solved by averaging the returns of the samples.
1. Policy evaluation
Given a policy, we can use the Monte Carlo method to learn the state value function, that is, the expected return starting from a state: the expected cumulative discounted future reward. The formula is as follows:

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big]$$
That is, to estimate $v_\pi(s)$, the value of state $s$ under policy $\pi$, we can average the returns following the first visit to $s$ over all episodes and use this average as the estimate of $v_\pi(s)$; this is the first-visit MC method. Alternatively, averaging the returns following every visit to $s$ over all episodes gives the every-visit MC method.
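To make the difference between the two estimators concrete, here is a small made-up example. Suppose a single episode visits state $s$ at two time steps, with returns $G_{t_1} = 2$ and $G_{t_2} = 0$. Then

$$V_{\text{first-visit}}(s) = G_{t_1} = 2, \qquad V_{\text{every-visit}}(s) = \frac{G_{t_1} + G_{t_2}}{2} = 1.$$

As more episodes are collected, both averages converge to $v_\pi(s)$.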
Estimating $q_\pi(s,a)$ is analogous to estimating $v_\pi(s)$. When the model is unknown, the state value function alone cannot determine a policy, so we use the action value function $q_\pi(s,a)$ to determine the policy instead. The two methods above estimate $q_\pi(s,a)$ as the average of the returns following visits to the pair $(s,a)$ over all episodes.
First-visit MC prediction, algorithm flow:
Input: the policy $\pi$ to be evaluated
Initialization:
    $V(s) \in \mathbb{R}$, arbitrarily, for all $s \in \mathcal{S}$
    $Returns(s) \leftarrow$ an empty list, for all $s \in \mathcal{S}$
Loop forever (for each episode):
    Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T$
    $G \leftarrow 0$
    Loop for each step of the episode, $t = T-1, T-2, \dots, 0$:
        $G \leftarrow \gamma G + R_{t+1}$
        Unless $S_t$ appears in $S_0, S_1, \dots, S_{t-1}$:
            Append $G$ to $Returns(S_t)$
            $V(S_t) \leftarrow \mathrm{average}(Returns(S_t))$
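The following is a minimal Python sketch of this prediction procedure (not from the original article). It assumes a gym-style environment whose reset() returns an observation and whose step(action) returns (observation, reward, done, info), plus a policy(state) callable that returns an action; the names here are illustrative.

import numpy as np
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes=10000, gamma=1.0):
    # Estimate V(s) under `policy` by averaging first-visit returns.
    returns = defaultdict(list)       # Returns(s): observed returns for each state
    V = defaultdict(float)            # V(s)
    for _ in range(num_episodes):
        # Generate an episode following the policy
        episode = []                  # (state, reward) pairs; reward is R_{t+1}
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Index of the first visit to each state in this episode
        first_visit = {}
        for idx, (s, _) in enumerate(episode):
            first_visit.setdefault(s, idx)
        # Backward pass: G <- gamma * G + R_{t+1}
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:   # update only on the first visit to s
                returns[s].append(G)
                V[s] = np.mean(returns[s])
    return V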
2. Policy control
2.1 Discussion
Once we have the value function, the next step is policy improvement, so that we gradually approximate the optimal value function and the optimal policy.
Policy improvement is carried out by making the policy greedy with respect to the current value function. For each $s \in \mathcal{S}$, we simply select the action that maximizes the action value function:

$$\pi(s) \doteq \arg\max_a q(s, a)$$
We then construct each $\pi_{k+1}$ by acting greedily with respect to $q_{\pi_k}$, so that:

$$q_{\pi_k}\big(s, \pi_{k+1}(s)\big) = q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s,a)\big) = \max_a q_{\pi_k}(s,a) \ge q_{\pi_k}\big(s, \pi_k(s)\big) \ge v_{\pi_k}(s)$$
That is, each $\pi_{k+1}$ is at least as good as $\pi_k$, and over enough episodes the process converges to the optimal policy and the optimal value function. For this process to converge, the following two assumptions must hold:
(a) Policy evaluation is carried out over an infinite number of episodes.
(b) Episodes are generated with exploring starts.
2.2 Removing the first assumption
It is obviously impossible to run an infinite number of episodes. Instead, the practical approach is to approximate $q_{\pi_k}$. There are two main ways to do this:
Method 1: make each policy evaluation approach $q_{\pi_k}$ as closely as possible. Under suitable methods and assumptions, and with enough steps, a certain degree of convergence can be guaranteed.
Method 2: give up completing policy evaluation before moving on to policy improvement. At each evaluation step, the value function is simply moved toward $q_{\pi_k}$. An extreme example is value iteration, in which only one sweep of iterative policy evaluation is performed between each step of policy improvement.
2.3 Removing the second assumption
Specifically, there are two ways to remove the second assumption, known as the on-policy approach and the off-policy approach. The on-policy approach evaluates and improves the same policy that is used to make decisions; the off-policy approach evaluates and improves a policy different from the one used to generate the data.
2.3.1 On-policy methods
We use the $\epsilon$-greedy policy, which trades off exploitation against exploration: most of the time the action with the highest estimated action value is selected, but with probability $\epsilon$ a random action is chosen instead.
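A minimal sketch of $\epsilon$-greedy action selection (illustrative, not code from the article), where Q[state] is assumed to be a NumPy array of estimated action values:

import numpy as np

def epsilon_greedy_action(Q, state, epsilon):
    # With probability epsilon, explore with a uniformly random action;
    # otherwise exploit the action with the highest estimated value.
    num_actions = len(Q[state])
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(Q[state]))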
On-policy first-visit MC control, algorithm flow:
Initialization:
    $\pi \leftarrow$ an arbitrary $\epsilon$-soft policy
    $Q(s, a) \in \mathbb{R}$, arbitrarily, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
    $Returns(s, a) \leftarrow$ an empty list, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
Loop forever (for each episode):
    Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T$
    $G \leftarrow 0$
    Loop for each step of the episode, $t = T-1, T-2, \dots, 0$:
        $G \leftarrow \gamma G + R_{t+1}$
        Unless the pair $(S_t, A_t)$ appears in $(S_0, A_0), (S_1, A_1), \dots, (S_{t-1}, A_{t-1})$:
            Append $G$ to $Returns(S_t, A_t)$
            $Q(S_t, A_t) \leftarrow \mathrm{average}(Returns(S_t, A_t))$
            $A^* \leftarrow \arg\max_a Q(S_t, a)$
            For all $a \in \mathcal{A}(S_t)$:
                $\pi(a \mid S_t) \leftarrow \begin{cases} 1 - \epsilon + \epsilon / |\mathcal{A}(S_t)| & \text{if } a = A^* \\ \epsilon / |\mathcal{A}(S_t)| & \text{if } a \neq A^* \end{cases}$
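Below is a minimal Python sketch of this on-policy control loop (not from the original article), again assuming a gym-style environment where step() returns (observation, reward, done, info); the helper names are illustrative.

import numpy as np
from collections import defaultdict

def on_policy_mc_control(env, num_actions, num_episodes=100000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(num_actions))    # Q(s, a)
    returns = defaultdict(list)                       # Returns(s, a)

    def policy(state):
        # epsilon-greedy with respect to the current Q
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # Generate an episode following the current epsilon-greedy policy
        episode = []                                  # (state, action, reward) triples
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Backward pass with first-visit updates to Q
        first_visit = {}
        for idx, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), idx)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r                         # G <- gamma * G + R_{t+1}
            if first_visit[(s, a)] == t:
                returns[(s, a)].append(G)
                Q[s][a] = np.mean(returns[(s, a)])    # improvement is implicit: the epsilon-greedy
                                                      # policy always acts on the latest Q
    return Q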
2.3.2 Off-policy methods
Off-policy methods use two different policies for policy evaluation and policy improvement. One is the behavior policy $\mu$, an exploratory policy used specifically to generate episodes and accumulate experience. The other is the target policy $\pi$, which learns from the experience generated by the behavior policy $\mu$ so as to maximize reward and become the optimal policy.
2.3.2.1 Importance sampling in off-policy methods
Importance sampling estimates the expected value of a random variable under one distribution using samples drawn from another distribution. In off-policy methods, importance sampling is used to weight the returns according to the relative probability of the observed trajectory under the target policy and under the behavior policy; this ratio is called the importance sampling ratio. Given a starting state $S_t$, the probability under policy $\pi$ of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ is:

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$
Here $p$ denotes the state transition probability function. The importance sampling ratio between the target policy and the behavior policy is therefore:

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1}\pi(A_k \mid S_k)\, p(S_{k+1}\mid S_k, A_k)}{\prod_{k=t}^{T-1}\mu(A_k \mid S_k)\, p(S_{k+1}\mid S_k, A_k)} = \prod_{k=t}^{T-1}\frac{\pi(A_k\mid S_k)}{\mu(A_k\mid S_k)}$$
It can be seen from the equation above that the importance sampling ratio ultimately depends only on the two policies and the sequence, and not on the dynamics of the MDP. The off-policy evaluation formulas are as follows:

(1) Ordinary importance sampling

$$V(s) = \frac{\sum_{t\in\tau(s)}\rho_{t:T(t)-1}\,G_t}{|\tau(s)|}$$

(2) Weighted importance sampling

$$V(s) = \frac{\sum_{t\in\tau(s)}\rho_{t:T(t)-1}\,G_t}{\sum_{t\in\tau(s)}\rho_{t:T(t)-1}}$$
$\tau(s)$: the set of time steps at which state $s$ is first visited within each episode, across all episodes
$T(t)$: the termination time of the episode containing time step $t$, so $G_t$ is the return from $t$ up through $T(t)$
$\{G_t\}_{t\in\tau(s)}$: the returns associated with state $s$
$\{\rho_{t:T(t)-1}\}_{t\in\tau(s)}$: the corresponding importance sampling ratios
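As a short illustrative sketch (not from the original article): given an array G of returns and an array rho of the corresponding importance sampling ratios for one state, the two estimators can be computed as follows.

import numpy as np

def ordinary_importance_sampling(G, rho):
    # Ratio-weighted returns averaged over the NUMBER of samples (unbiased, higher variance)
    G, rho = np.asarray(G, dtype=float), np.asarray(rho, dtype=float)
    return (rho * G).sum() / len(G)

def weighted_importance_sampling(G, rho):
    # Ratio-weighted returns normalized by the SUM of the ratios (biased, lower variance)
    G, rho = np.asarray(G, dtype=float), np.asarray(rho, dtype=float)
    total = rho.sum()
    return (rho * G).sum() / total if total > 0 else 0.0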
2.3.2.2 Incremental mean calculation
Instead of computing the mean directly, we can compute it incrementally. Suppose we have a sequence of returns $G_1, G_2, \dots, G_{n-1}$, each with a corresponding weight $W_k$ (for off-policy learning, $W_k$ is the importance sampling ratio), and we want to maintain the weighted average

$$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2.$$

This gives the incremental update rules

$$V_{n+1} = V_n + \frac{W_n}{C_n}\big[G_n - V_n\big], \qquad C_{n+1} = C_n + W_{n+1},$$

where $C_0 = 0$ and $C_n$ is the cumulative sum of the weights.
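A quick numerical check of the incremental update (illustrative, with made-up returns and weights): after processing all samples, the incrementally maintained value equals the batch weighted average.

import numpy as np

G = [1.0, -1.0, 1.0, 1.0]       # example returns
W = [1.0, 0.5, 2.0, 0.25]       # example importance-sampling weights

V, C = 0.0, 0.0
for g, w in zip(G, W):
    C += w                      # C_n <- C_{n-1} + W_n
    V += (w / C) * (g - V)      # V_{n+1} <- V_n + (W_n / C_n) * (G_n - V_n)

assert np.isclose(V, np.dot(W, G) / np.sum(W))   # matches the batch weighted average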
2.3.2.3 Off-policy MC control algorithm
Combining importance sampling with the incremental mean computation, we obtain the off-policy MC control algorithm. Here the behavior policy is $\epsilon$-soft and the target policy is greedy.
Off-policy MC control, algorithm flow:
Initialization: for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$:
    $Q(s, a) \in \mathbb{R}$ (arbitrarily)
    $C(s, a) \leftarrow 0$
Loop forever (for each episode):
    $b \leftarrow$ any policy with coverage of $\pi$ (e.g., an $\epsilon$-soft policy)
    Generate an episode following $b$: $S_0, A_0, R_1, S_1, \dots, S_{T-1}, A_{T-1}, R_T$
    $G \leftarrow 0$
    $W \leftarrow 1$
    Loop for each step of the episode, $t = T-1, T-2, \dots, 0$:
        $G \leftarrow \gamma G + R_{t+1}$
        $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$
        $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}\big[G - Q(S_t, A_t)\big]$
        $\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$
        If $A_t \neq \pi(S_t)$, exit the inner loop (proceed to the next episode)
        $W \leftarrow W \cdot \frac{1}{b(A_t \mid S_t)}$
3. Example: blackjack
3.1 Rules of the Game
The rules of blackjack are as follows. There is a player and a dealer, and each episode can end with a win for the player, a win for the dealer, or a draw. The episode begins with two cards dealt to the player and two to the dealer; the player can see both of their own cards but only one of the dealer's. The player then chooses whether to take another card. If the player takes a card (called a "hit"), the total points of the cards in hand are recalculated, with an ace counting as either 1 or 11. If the total exceeds 21, the player loses the episode and the dealer wins. If the total is 21 or less, the player may again decide whether to hit, and so on until the player stops hitting. When the player stops with a total no greater than 21, that total becomes the player's final score. The dealer then reveals the hidden card and keeps drawing cards while the dealer's total is less than 17. If the dealer's total exceeds 21 during this process, the dealer loses and the player wins. Otherwise, the player's total is compared with the dealer's: if the player's total is higher, the player wins; if the totals are equal, it is a draw; if the player's total is lower, the dealer wins.
3.2 Code Implementation
Here I use the off-policy approach. For the Monte Carlo method, the key is to solve policy evaluation and policy control. The experimental code is given below:
import numpy as np

gamma = 1.  # discount factor (assumed to be 1 here, the usual undiscounted setting for blackjack)

def obs2state(obs):
    # Convert the observation (player sum, dealer's showing card, usable ace) into a state tuple
    return (obs[0], obs[1], int(obs[2]))
# Policy evaluation
def evaluate(env, target_policy, behavior_policy, episode_num=500000):
    # Initialization
    q = np.zeros_like(target_policy)   # q(s, a)
    c = np.zeros_like(target_policy)   # cumulative importance-sampling weights for each (s, a)
    for _ in range(episode_num):
        state_actions = []             # (state, action) pairs of the episode
        rewards = []                   # reward received after each step
        observation = env.reset()      # get the initial observation
        # Generate an episode using the behavior policy
        while True:
            state = obs2state(observation)
            action = np.random.choice(env.action_space.n, p=behavior_policy[state])
            state_actions.append((state, action))
            observation, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                break
        g = 0.    # return
        rho = 1.  # importance sampling ratio (the weight W)
        for (state, action), reward in zip(reversed(state_actions), reversed(rewards)):
            g = gamma * g + reward      # G <- gamma * G + R_{t+1}
            c[state][action] += rho     # C(St, At) <- C(St, At) + W
            q[state][action] += (rho / c[state][action] * (g - q[state][action]))
            # Q(St, At) <- Q(St, At) + W / C(St, At) * [G - Q(St, At)]
            rho *= (target_policy[state][action] / behavior_policy[state][action])
            # W <- W * pi(At|St) / b(At|St)
            if rho == 0:
                break
    return q
# Policy control
def off_policy(env, target_policy, behavior_policy, episode_num=500):
    q = np.zeros_like(target_policy)
    c = np.zeros_like(target_policy)
    for _ in range(episode_num):
        state_actions = []
        rewards = []
        observation = env.reset()
        # Generate an episode using the behavior policy
        while True:
            state = obs2state(observation)
            action = np.random.choice(env.action_space.n, p=behavior_policy[state])
            state_actions.append((state, action))
            observation, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                break
        # An episode is complete
        g = 0.    # return
        rho = 1.  # importance sampling ratio (the weight W)
        for (state, action), reward in zip(reversed(state_actions), reversed(rewards)):
            g = gamma * g + reward      # G <- gamma * G + R_{t+1}
            c[state][action] += rho     # C(St, At) <- C(St, At) + W
            q[state][action] += (rho / c[state][action] * (g - q[state][action]))
            # Q(St, At) <- Q(St, At) + W / C(St, At) * [G - Q(St, At)]
            # Policy improvement: pi(St) <- argmax_a Q(St, a)
            a = q[state].argmax()
            target_policy[state] = 0.
            target_policy[state][a] = 1.
            if a != action:
                break                   # At differs from pi(St): exit the inner loop
            rho /= behavior_policy[state][action]   # W <- W * pi(At|St) / b(At|St), with pi(At|St) = 1
    return target_policy, q
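A possible way to run this code (a sketch, assuming an older gym API in which env.step returns a 4-tuple and the Blackjack-v0 environment is available; the array shapes follow the state tuple produced by obs2state):

import gym

env = gym.make('Blackjack-v0')

# State is (player sum, dealer's showing card, usable ace); actions are 0 = stick, 1 = hit.
behavior_policy = np.ones((22, 11, 2, 2)) * 0.5   # behavior policy: hit or stick with equal probability
target_policy = np.zeros((22, 11, 2, 2))
target_policy[:, :, :, 0] = 1.                    # initial target policy: always stick

target_policy, q = off_policy(env, target_policy, behavior_policy)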
3.3 Experimental Results
In one sample episode, the player is first dealt [1, 4] while the dealer shows a 5. The policy decides to hit (action 1), after which the player's cards are [1, 4, 9] and the reward is 0. The policy decides to hit again, and the next observation shows a player sum of 23, which exceeds 21, so the episode ends with reward -1 and the dealer wins.
The optimal policy diagram is as follows:
Here "with ace" means the player holds a usable ace counted as 11, and "without ace" means no usable ace, i.e., any ace counts as 1. From the optimal policy diagram it can be seen that without a usable ace the player mostly does not take another card, while with a usable ace the player mostly hits when the total is below 18.
4. Summary
(a) The Monte Carlo method is a learning method for estimating value functions and discovering optimal policies. Unlike DP, it does not require complete knowledge of the environment; it only requires experience, i.e., sample sequences of states, actions, and rewards from actual or simulated interaction with the environment.
(b) The Monte Carlo method samples and averages the returns of each state-action pair.
(c) Policy evaluation estimates the action value function through averaged returns; policy improvement uses the greedy policy $\pi(s) = \arg\max_a q(s, a)$.
(d) On-policy vs. off-policy: on-policy methods evaluate and improve the same policy used to make decisions; off-policy methods evaluate and improve a policy different from the one used to generate the data.