
BCQ

Paper: Off-Policy Deep Reinforcement Learning without Exploration (arxiv.org)

This article assumes a general understanding of the DDPG algorithm; see my previous article: Introduction to Reinforcement Learning 8 – In-depth Understanding of DDPG (juejin.cn)

Introduction

BCQ stands for Batch-Constrained deep Q-learning. The paper adds a batch constraint to an off-policy algorithm in order to avoid extrapolation error.

Motivation

What are the challenges of Batch RL?

Extrapolation error.

What causes an Extrapolation error?

  • Absent data. The pair $(s, a)$ simply does not appear in the dataset, so there is no data from which to estimate its value.
  • Model bias. The dataset contains $(s, a)$, but with too few samples, so the transition distribution $\hat{p}(s' \mid s, a)$ estimated from those samples deviates from the true $p(s' \mid s, a)$, which biases the value estimate.
  • Training mismatch. The data distributions do not match: the distribution of the batch data used while updating the policy differs from the distribution the current policy would produce by sampling in the real environment, which introduces error into the optimization.

What problems does extrapolation error cause?

If the policy samples an action that does not appear in the dataset, the value of the next state cannot be learned from data. For example, suppose the current state is $s$ and the dataset contains $s \to a_1$ and $s \to a_2$ but not $s \to a_3$. If the current policy $\pi$ actually favors $a_3$, the dataset provides no guidance on whether $a_3$ is better, yet the Q-network still extrapolates a value for it, and that estimate can be badly wrong.

To address this, the paper proposes the batch-constrained deep Q-learning algorithm. Batch-constrained means the policy is restricted to choosing actions within the constraints of the batch.

Algorithm process

General idea: when the agent selects an action in a state, restrict the action to a limited range. In the earlier example, the state is $s$ and the dataset contains $s \to a_1$ and $s \to a_2$ but not $s \to a_3$. Since the value of $a_3$ is hard to estimate, BCQ constrains the policy $\pi$ so that it does not consider $a_3$ at all and chooses only between $a_1$ and $a_2$, whose values are easy to estimate.

So how is the action restricted to this range?

The paper proposes using a VAE as a generative model to produce candidate actions. States and actions are sampled from the batch, and the VAE is trained to generate actions similar to those that appear in the batch; its training data is the real batch data.
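As a rough illustration, a conditional VAE for this purpose could look like the following PyTorch sketch (the layer sizes, latent clipping, and `max_action` scaling are illustrative choices, not necessarily the paper's reference implementation):

```python
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    """Conditional VAE that learns to reproduce the actions found in the batch."""
    def __init__(self, state_dim, action_dim, latent_dim, max_action):
        super().__init__()
        # Encoder q(z | s, a)
        self.enc = nn.Sequential(
            nn.Linear(state_dim + action_dim, 750), nn.ReLU(),
            nn.Linear(750, 750), nn.ReLU())
        self.mean = nn.Linear(750, latent_dim)
        self.log_std = nn.Linear(750, latent_dim)
        # Decoder p(a | s, z)
        self.dec = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 750), nn.ReLU(),
            nn.Linear(750, 750), nn.ReLU(),
            nn.Linear(750, action_dim), nn.Tanh())
        self.latent_dim = latent_dim
        self.max_action = max_action

    def forward(self, state, action):
        h = self.enc(torch.cat([state, action], dim=1))
        mean, log_std = self.mean(h), self.log_std(h).clamp(-4, 15)
        z = mean + log_std.exp() * torch.randn_like(mean)  # reparameterization trick
        return self.decode(state, z), mean, log_std

    def decode(self, state, z=None):
        # At generation time, sample z from a clipped standard normal prior.
        if z is None:
            z = torch.randn(state.shape[0], self.latent_dim,
                            device=state.device).clamp(-0.5, 0.5)
        return self.max_action * self.dec(torch.cat([state, z], dim=1))
```

Training minimizes the reconstruction error between the decoded action and the batch action, plus a KL term toward the standard normal prior.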

Actions sampled from the generator are then adjusted by a perturbation network $\xi_\phi(s, a, \Phi)$, which perturbs the action value and keeps it within the range $[a - \Phi, a + \Phi]$. The perturbation network is introduced to increase the diversity of the candidate actions.
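A minimal sketch of such a perturbation network, under the same assumptions (layer sizes and the default scale $\Phi$ are illustrative):

```python
import torch
import torch.nn as nn

class Perturbation(nn.Module):
    """Adds a bounded adjustment in [-phi, phi] to a candidate action."""
    def __init__(self, state_dim, action_dim, max_action, phi=0.05):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh())
        self.max_action = max_action
        self.phi = phi

    def forward(self, state, action):
        # tanh output scaled to [-phi * max_action, phi * max_action];
        # the perturbed action is clipped back into the valid action range.
        xi = self.phi * self.max_action * self.net(torch.cat([state, action], dim=1))
        return (action + xi).clamp(-self.max_action, self.max_action)
```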

Finally, the Q networks and the target networks are updated.

In fact, the algorithm is built on DDPG, and the overall procedure is almost the same. The generative model $G$ and the perturbation model $\xi$ together can be viewed as the policy network, composed of two parts: one part is responsible for keeping the generated action from straying too far from the actions in the dataset, and the other is responsible for maximizing the cumulative reward. The perturbation network plays the role of the Actor in DDPG, and the policy $\pi$ takes the max over actions within the constrained region, which (in the paper's notation) can be expressed as:
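$$
\pi(s) = \underset{a_i + \xi_\phi(s,\, a_i,\, \Phi)}{\operatorname{argmax}} \; Q_\theta\big(s,\, a_i + \xi_\phi(s, a_i, \Phi)\big), \qquad \big\{a_i \sim G_\omega(s)\big\}_{i=1}^{n}
$$

That is, $n$ candidate actions $a_i$ are sampled from the generator $G_\omega(s)$, each is perturbed by $\xi_\phi$, and the policy picks the perturbed candidate with the highest Q value.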

The perturbation network is updated with a DDPG-style deterministic policy gradient, which in the paper's notation is:
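$$
\phi \leftarrow \underset{\phi}{\operatorname{argmax}} \sum_{(s, a) \in \mathcal{B}} Q_{\theta_1}\big(s,\, a + \xi_\phi(s, a, \Phi)\big), \qquad a \sim G_\omega(s)
$$

i.e., $\phi$ is adjusted so that the perturbed actions achieve a higher value under $Q_{\theta_1}$.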

The value-network part can likewise be regarded as the Critic in DDPG. The difference is that BCQ uses two Q networks and computes the target with a Clipped Double Q-learning style estimate: instead of simply taking the minimum of the two Q estimates, it takes a convex combination of the minimum and the maximum, with the larger weight $\lambda$ on the minimum (setting $\lambda = 1$ recovers Clipped Double Q-learning).
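Concretely, following the paper, the target value used to update both Q networks is:

$$
y = r + \gamma \max_{a_i} \Big[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \Big]
$$

where the $a_i$ are the perturbed candidate actions generated for the next state $s'$, and a value of $\lambda$ close to 1 puts most of the weight on the minimum.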

So the final structure contains 7 neural networks: the VAE generator $G_\omega$, two Q networks $Q_{\theta_1}$ and $Q_{\theta_2}$, the perturbation network $\xi_\phi$, and the corresponding target networks ($Q_{\theta'_1}$, $Q_{\theta'_2}$, $\xi_{\phi'}$).

The full procedure is given as pseudocode in the paper.
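A rough Python sketch of one training iteration, under some assumptions: the replay-buffer helper `sample_batch`, the critic modules `q1`/`q2` and their targets, and the optimizer objects are all illustrative names, and `ActionVAE`/`Perturbation` refer to the sketches above.

```python
import torch
import torch.nn.functional as F

def bcq_train_step(buffer, vae, actor, q1, q2, q1_t, q2_t, actor_t,
                   vae_opt, actor_opt, critic_opt,
                   gamma=0.99, lmbda=0.75, n_candidates=10, tau=0.005):
    state, action, reward, next_state, not_done = buffer.sample_batch()

    # 1) Train the generative model (VAE) to reconstruct batch actions.
    recon, mean, log_std = vae(state, action)
    recon_loss = F.mse_loss(recon, action)
    kl_loss = -0.5 * (1 + 2 * log_std - mean.pow(2) - (2 * log_std).exp()).mean()
    vae_opt.zero_grad()
    (recon_loss + 0.5 * kl_loss).backward()
    vae_opt.step()

    # 2) Critic update with the convex combination of the two target Q networks.
    with torch.no_grad():
        # Duplicate each next state n_candidates times, sample candidate actions
        # from the VAE and perturb them with the target perturbation network.
        next_rep = next_state.repeat_interleave(n_candidates, dim=0)
        cand = actor_t(next_rep, vae.decode(next_rep))
        t1, t2 = q1_t(next_rep, cand), q2_t(next_rep, cand)
        target_q = lmbda * torch.min(t1, t2) + (1 - lmbda) * torch.max(t1, t2)
        # Take the best candidate action for each next state.
        target_q = target_q.view(-1, n_candidates).max(dim=1, keepdim=True)[0]
        y = reward + not_done * gamma * target_q
    critic_loss = F.mse_loss(q1(state, action), y) + F.mse_loss(q2(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 3) Perturbation (actor) update: maximize Q1 over perturbed VAE actions.
    perturbed = actor(state, vae.decode(state))
    actor_loss = -q1(state, perturbed).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # 4) Soft update of all target networks.
    for net, target in ((q1, q1_t), (q2, q2_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```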

Let’s focus on parameter updates:

  • The VAE parameters $\omega$ are updated with real batch data (a behavior-cloning-style VAE loss: reconstruction plus KL).
  • The parameters of the two Q networks are updated with real batch data (i.e., by optimizing the Bellman equation).
  • The perturbation network parameters $\phi$ are updated with actions generated by the VAE (by maximizing $Q_{\theta_1}$).

One more question: how is convergence guaranteed?

Four theorems are given in this paper.

First, Theorem 1 proves that performing Q-learning on an offline dataset $B$ is, in effect, finding the optimal policy of the MDP corresponding to that dataset.

Theorem 2 shows that the extrapolation error can be eliminated by constraining the policy to state-action pairs contained in the batch.

Theorem 3 guarantees the convergence of BCQL (the tabular batch-constrained Q-learning variant).

Theorem 4 states that BCQL converges to the optimal policy of the MDP corresponding to the dataset $B$.

Summary

BCQ can be seen as an improvement on DDPG.

The essence of the constraint is to keep the batch RL agent from selecting actions the dataset does not cover, so that it only chooses among actions whose Q values are well estimated.

Because BCQ accounts for extrapolation error, it can learn from arbitrary data, whether expert demonstrations or generic suboptimal data.

BCQ applies the constraint mainly to the policy. Another method, CQL (Conservative Q-learning), applies it to the Q function instead. Since extrapolation error can lead to overestimated Q values, CQL adds a regularization term so that the learned Q function is a lower bound of the true Q function, and it proves that iterating on this lower-bounded Q value can continually improve the policy.
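For comparison, the CQL objective (written here in a simplified form, so the exact weighting and estimator may differ from the CQL paper) penalizes Q values of actions drawn from the learned policy $\mu$ while pushing up Q values of actions in the dataset, on top of the usual Bellman error:

$$
\min_Q \; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\big[Q(s, a)\big] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[Q(s, a)\big] \Big) + \frac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s, a)\big)^2\Big]
$$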
