This section collects some notes on the relevant material from Sutton's book Reinforcement Learning and briefly introduces the main ideas for solving the multi-armed bandit problem.
Problem description
The multi-armed slot machine problem, also known as the multi-armed bandit problem, is a classic reinforcement learning problem.
The multi-armed bandit problem comes from gambling, and it is described as follows:
A gambler goes to a casino to play the slot machines. He finds a row of machines that look identical, but each machine pays out with a different probability, and he does not know the payout distribution of any of them. If the gambler wants to make a fortune, which machine should he choose each time so as to maximize his reward?
We treat the choice of which slot machine to play as an action $a$, and associate each action with a value $q_*(a)$, defined as the expected reward produced by that action. The problem can then be written in mathematical form:

$$q_*(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$
where $A_t$ denotes the action taken at time $t$ and $R_t$ denotes the corresponding reward. So how do we estimate this expectation? The simplest way is to average the rewards actually received:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$
This estimation method is called the sample-average method, because each estimate is an average of the relevant rewards. To simplify the notation, consider a single action: let $R_i$ denote the reward received after the $i$-th selection of this action, and let $Q_n$ denote the estimate of its value after it has been selected $n-1$ times. Then

$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$
Here the step size $\alpha$ is $\frac{1}{n}$. With this decreasing step size the update is exactly the sample-average method, which eliminates the sampling bias. Alternatively, the step size can be set to any other constant in the interval $(0, 1]$.
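As a minimal sketch in Python (the Gaussian reward model and the numbers are made up for illustration), the incremental form of the update can be written as:

```python
import numpy as np

# Toy illustration of the incremental sample-average update
# Q_{n+1} = Q_n + (1/n) * (R_n - Q_n) for a single action.
rng = np.random.default_rng(0)

true_value = 1.5   # assumed expected reward of this action (made up)
Q = 0.0            # current estimate
for n in range(1, 1001):
    reward = rng.normal(true_value, 1.0)  # draw a noisy reward
    Q += (1.0 / n) * (reward - Q)         # incremental update with step size 1/n

print(Q)  # approaches true_value as n grows
```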
Action selection
There are two important concepts in action selection: exploration and exploitation. When we choose an action, we usually maintain a value estimate for each action so that we can judge how good it is; the action with the highest estimated value is called the greedy action. Exploitation means using what we currently know about the action values, i.e. choosing the greedy action, while exploration means choosing a non-greedy action.
During exploration the short-term reward may be lower, but in the long run, once the better actions have been discovered, the total payoff is greater than that of always exploiting under the current estimates. The downside of exploration is that it takes time. Exploration and exploitation cannot both happen in a single action selection, so how to balance them is an important issue in action selection.
One commonly used action-selection method is the near-greedy algorithm called $\epsilon$-greedy.
Specifically, it behaves greedily most of the time, choosing the action that maximizes the current estimated reward, but with a small probability $\epsilon$ it breaks out of the greedy choice and selects an action at random.
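A minimal sketch of $\epsilon$-greedy on a toy 10-armed testbed (the Gaussian reward model and $\epsilon = 0.1$ are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Greedy with probability 1 - epsilon, uniformly random with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # explore: random arm
    return int(np.argmax(Q))              # exploit: greedy arm

rng = np.random.default_rng(0)
k = 10
Q = np.zeros(k)                           # value estimates
N = np.zeros(k)                           # selection counts
true_values = rng.normal(0.0, 1.0, k)     # made-up true action values

for t in range(1000):
    a = epsilon_greedy(Q, epsilon=0.1, rng=rng)
    reward = rng.normal(true_values[a], 1.0)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]        # incremental sample-average update
```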
UCB algorithm
Another action-selection method is the UCB (Upper Confidence Bound) algorithm. Greedy selection always focuses on the largest immediate reward and never explores, so it may miss the best action in the long run, while $\epsilon$-greedy explores blindly at random. In fact, some actions have already been selected many times and carry relatively little uncertainty, while the uncertain actions are the ones that still need to be explored. UCB takes this into account by combining the estimated reward of an action with its uncertainty, selecting actions according to

$$A_t = \underset{a}{\arg\max}\left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\,\right]$$
where $c$ is a constant greater than 0 that controls the degree of exploration, and $N_t(a)$ is the number of times action $a$ has been selected before time $t$. If $N_t(a) = 0$, then $a$ is considered a maximizing action. The basic idea of UCB is to introduce a confidence interval: the parameter $c$ determines the confidence level, i.e. the degree of uncertainty, and the square-root term measures the uncertainty in the estimate of action $a$. As $a$ is selected more often, $N_t(a)$ increases, the square-root term shrinks, the confidence interval narrows, and the uncertainty decreases; $a$ then becomes less likely to be selected, and actions with large estimated values tend to be chosen instead. Conversely, if $a$ is not selected, then as $t$ increases the numerator $\ln t$ of the square-root term grows, the confidence interval widens, and the uncertainty of $a$ increases, so $a$ may be selected again at a later time.
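A small sketch of the selection rule (the handling of untried arms follows the convention above; $c$ is left as a parameter):

```python
import numpy as np

def ucb_select(Q, N, t, c):
    """UCB rule: argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])            # an arm with N_t(a) = 0 is maximizing
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```

In the loop of the previous sketch this would replace the $\epsilon$-greedy call, e.g. `a = ucb_select(Q, N, t + 1, c=2.0)` (using `t + 1` to avoid $\ln 0$; $c = 2$ is just an illustrative choice).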
Gradient bandit algorithm
The algorithms above select an action by considering the value of each action. The gradient bandit algorithm takes a different approach: instead of estimating action values directly, it introduces a numerical preference $H_t(a)$ for each action $a$, which expresses the relative preference of one action over another. The larger $H_t(a)$ is, the more likely the action is to be selected. The preference has no interpretation in terms of reward; only the preference of one action relative to another matters. The probability of selecting action $a$ at time $t$ is given by a softmax distribution:

$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$$

where $k$ is the number of actions and $\pi_t(a)$ denotes the probability of selecting action $a$ at time $t$.
The preference function is updated by stochastic gradient ascent:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \overline{R}_t)\,(1 - \pi_t(A_t))$$

$$H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \overline{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t$$
Here $\overline{R}_t$ is the average of the rewards up to time $t$, which serves as a baseline. If the reward received after selecting $A_t$ is greater than this baseline, the probability of selecting $A_t$ at the next step increases, while the probabilities of the other, unselected actions are updated in the opposite direction; if the reward is below the baseline, the opposite happens. A detailed derivation can be found in the book.
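A compact sketch of the gradient bandit update on the same toy Gaussian testbed (the step size $\alpha = 0.1$ and the reward model are illustrative assumptions):

```python
import numpy as np

def softmax(H):
    """Turn preferences H into action probabilities pi."""
    z = np.exp(H - H.max())               # subtract max for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
k = 10
H = np.zeros(k)                           # preferences H_t(a)
baseline = 0.0                            # running average reward (the baseline)
alpha = 0.1                               # step size (illustrative choice)
true_values = rng.normal(0.0, 1.0, k)     # made-up true action values

for t in range(1, 1001):
    pi = softmax(H)
    a = int(rng.choice(k, p=pi))
    reward = rng.normal(true_values[a], 1.0)
    baseline += (reward - baseline) / t   # incremental mean of all rewards
    one_hot = np.zeros(k)
    one_hot[a] = 1.0
    # Selected action moves by (1 - pi_a), the others by -pi_a, as in the update above.
    H += alpha * (reward - baseline) * (one_hot - pi)
```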
Another approach is the Bayesian method (also known as Thompson sampling or posterior sampling), which assumes a known prior distribution over the action values and updates this distribution after each action selection.
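As a hedged sketch of this idea, here is Thompson sampling with Bernoulli rewards and Beta priors (the Bernoulli/Beta model is an illustrative assumption, not a setting prescribed by the book):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
true_probs = rng.uniform(0.0, 1.0, k)     # made-up win probability of each arm
alpha_post = np.ones(k)                   # Beta posterior parameters (successes + 1)
beta_post = np.ones(k)                    # Beta posterior parameters (failures + 1)

for t in range(1000):
    samples = rng.beta(alpha_post, beta_post)  # draw one value from each posterior
    a = int(np.argmax(samples))                # act greedily w.r.t. the sampled values
    reward = float(rng.random() < true_probs[a])
    alpha_post[a] += reward                    # posterior update after a success
    beta_post[a] += 1.0 - reward               # posterior update after a failure
```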
Summary
How to balance exploration and exploitation is an important question. The $\epsilon$-greedy algorithm explores by selecting actions at random, without any preference in its exploration; the UCB algorithm takes the uncertainty of each action into account and explores by giving priority at each step to actions that have been sampled less often; the gradient bandit algorithm changes the perspective entirely: it does not estimate action values but introduces a preference function and selects actions according to a softmax distribution over the preferences.
References
- Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction (Second Edition)