Excerpted from arXiv; paper by Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Sepp Hochreiter; compiled by Heart of the Machine.

In reinforcement learning, delayed rewards can severely degrade performance: both the time needed to correct the bias of temporal difference (TD) estimates and the variance of Monte Carlo (MC) estimates grow exponentially with the number of delay steps. To address this problem, researchers from the LIT AI Lab at Johannes Kepler University Linz in Austria propose a new method based on return decomposition. Experiments show that RUDDER is exponentially faster than TD, MC, and Monte Carlo tree search (MCTS), and that on specific Atari games it quickly surpasses the performance of well-known reinforcement learning models such as Rainbow, A3C, and DDQN.

The paper also sparked a lively discussion on Reddit, with users noting that the 50-page appendix is remarkable because it spares readers from having to guess how the authors ran their experiments. Others remarked that although the main experimental results cover only two fairly simple games, the method is promising and worth further exploration.

Paper: RUDDER: Return Decomposition for Delayed Rewards

Paper link: arxiv.org/abs/1806.07…

Abstract: We propose a novel reinforcement learning approach for finite Markov decision processes (MDPs) with delayed rewards. In this work, we show that the bias correction of temporal difference (TD) estimates is exponentially slow in the number of delay steps. Furthermore, we show that the variance of Monte Carlo (MC) estimates can increase the variances of other estimates by an amount that can grow exponentially with the number of delay steps. We introduce RUDDER, a return decomposition method that constructs a new MDP with the same optimal policies as the original MDP but with much smaller reward delays, via reward redistribution. If the return decomposition is optimal, the new MDP has no delayed rewards and TD estimates are unbiased. In this case, the rewards track the Q-values, so that the expected future rewards remain zero. We confirm the theoretical results on TD estimation bias and MC estimation variance experimentally. On tasks with reward delays of different lengths, we show that RUDDER is exponentially faster than TD, MC, and Monte Carlo tree search (MCTS). On the delayed-reward Atari game Venture, RUDDER outperforms Rainbow, A3C, DDQN, Distributional DQN, Dueling DDQN, Noisy DQN, and Prioritized DDQN after only a short training time. On the delayed-reward Atari game Bowling, RUDDER substantially improves the state of the art while requiring less training time.
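As a rough illustration of the reward-redistribution idea described in the abstract (a sketch based on the paper's description; g denotes a learned return predictor and the exact construction is given in the paper), the redistributed reward at step t can be written as the difference of two consecutive return predictions, so that the redistributed rewards telescope back to the predicted episode return:

```latex
% Sketch of reward redistribution via return decomposition (notation assumed here).
% g predicts the episode return from the state-action sequence observed up to time t,
% with g of the empty sequence defined as 0.
\tilde{R}_{t+1} \;=\; g\!\left(s_{0:t}, a_{0:t}\right) \;-\; g\!\left(s_{0:t-1}, a_{0:t-1}\right),
\qquad
\sum_{t=0}^{T-1} \tilde{R}_{t+1} \;=\; g\!\left(s_{0:T-1}, a_{0:T-1}\right) \;\approx\; G_{0},
```

where G_0 is the return of the episode. Because the redistributed rewards sum to (approximately) the original return, the optimal policies of the new MDP coincide with those of the original MDP.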


Introduction

In reinforcement learning, assigning credit for received rewards to the actions that were performed is one of the central tasks [107]. Long-term credit assignment is widely recognized as one of the biggest challenges in reinforcement learning [91]. Current reinforcement learning methods become markedly slower when rewards are delayed for a long time [84, 86]. To learn from delayed rewards, three phases have to be considered: (1) discovering the delayed reward; (2) tracking information related to the delayed reward; (3) learning to obtain the delayed reward and storing this knowledge for later use. Recent successful reinforcement learning methods provide solutions to one or more of these three phases. The best-known method is the Deep Q-Network (DQN) [71, 72], which combines Q-learning with convolutional neural networks for visual reinforcement learning [58].

The success of DQN is attributed to experience replay [64], which stores observed state-reward transitions and then samples from them. Prioritized experience replay [93, 50] improves how replay memories are sampled. In Ape-X DQN training, different policies explore in parallel and share a prioritized experience replay memory [50]. DQN has been extended to Double DQN (DDQN) [113, 114], which reduces overestimation bias and thereby aids exploration. Noisy DQN [21] explores via a stochastic layer in the policy network (see [40, 94]). Distributional Q-learning [8] can profit from noise, because means with high variance are more likely to be selected. Dueling network architectures [117, 118] estimate state values and action advantages separately, which helps exploration in unknown states. Policy gradient methods [124] also explore via parallel policies. A2C is improved by IMPALA through parallel actors and a correction of the policy lag between actors and learners. A3C [70] and Ape-X DPG [50], which use asynchronous gradient descent, also rely on parallel policies. Proximal policy optimization (PPO) extends A3C with a surrogate objective and trust-region optimization realized through clipping or a Kullback-Leibler penalty [96].
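To make the overestimation-bias point concrete, here is a minimal sketch (plain NumPy; the q_online and q_target value vectors are hypothetical placeholders, not code from any of the cited papers) contrasting the standard DQN target with the Double DQN target, where action selection uses the online network and action evaluation uses the target network:

```python
import numpy as np

def dqn_target(reward, next_q_target, done, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action,
    # which tends to overestimate Q-values.
    return reward + gamma * (1.0 - done) * np.max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it, reducing overestimation bias.
    best_action = np.argmax(next_q_online)
    return reward + gamma * (1.0 - done) * next_q_target[best_action]

# Usage with hypothetical Q-value vectors for a single transition:
q_online = np.array([1.2, 0.7, 0.9])   # online-network Q-values for the next state
q_target = np.array([1.0, 0.8, 1.1])   # target-network Q-values for the next state
print(dqn_target(0.0, q_target, done=0.0))
print(double_dqn_target(0.0, q_online, q_target, done=0.0))
```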

More recent approaches aim to solve the learning problems caused by delayed rewards. If the state associated with a reward is similar to states encountered many steps earlier, function approximation of the value function or the critic [72, 70] can bridge the time gap. For example, suppose a function learns to predict a large reward at the end of an episode whenever the state has certain features. The function can generalize this correlation to the beginning of the episode and predict high reward values already for early states that share the same features. Multi-step temporal difference (TD) learning [105, 107] improves both DQN and policy gradients [39, 70]. AlphaGo and AlphaZero use Monte Carlo tree search (MCTS) to surpass human professionals in Go and chess. MCTS simulates games from a point in time until the game ends or an evaluation point is reached, so MCTS can capture long-delayed rewards. Recently, world models using evolution strategies have also been successful [36]. Such forward-looking methods are not feasible in probabilistic environments with a high branching factor of state transitions. Backward-looking methods trace back from known goal states [18] or from high-reward states [30]. However, a backward model has to be learned step by step.
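As a quick illustration of how multi-step TD targets propagate delayed-reward information over longer spans than one-step TD (a generic sketch; the variable names and the bootstrap value are placeholders, not the cited papers' code):

```python
import numpy as np

def n_step_td_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: discounted sum of the next n rewards
    plus a discounted bootstrap from the value estimate at step n."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return np.dot(discounts, rewards) + (gamma ** n) * bootstrap_value

# Usage: a reward delayed by 4 steps is seen directly by a 4-step target,
# whereas a 1-step target only sees it indirectly via the bootstrapped value.
rewards = [0.0, 0.0, 0.0, 1.0]                              # delayed reward
print(n_step_td_target(rewards, bootstrap_value=0.0))       # 4-step target
print(n_step_td_target(rewards[:1], bootstrap_value=0.0))   # 1-step target
```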

We propose a backward analysis that relies on a forward model. The forward model predicts the return, and the backward analysis identifies the states and actions that lead to that return. We use a long short-term memory (LSTM) network to predict the return of an episode. LSTMs have already been used in reinforcement learning for advantage learning [4] and for learning policies [37, 70, 38]. However, sensitivity analysis in the sense of "backpropagation through a model" [75, 87, 88, 5] has serious drawbacks: local minima, instabilities, exploding or vanishing gradients in the world model, issues with proper exploration, and actions being analyzed only by their sensitivity rather than by their contribution (relevance) [40, 94].
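A minimal sketch of such a return predictor is shown below (using PyTorch as an assumed framework; the layer sizes, loss, and the choice to regress every per-step prediction onto the episode return are illustrative, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """LSTM that, at every time step, predicts the final return of the episode
    from the state-action sequence observed so far."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, state_action_seq):
        # state_action_seq: (batch, time, input_dim)
        hidden, _ = self.lstm(state_action_seq)
        return self.head(hidden).squeeze(-1)   # (batch, time) return predictions

# Training sketch: regress each per-step prediction onto the episode return,
# so intermediate predictions already carry information about the final outcome.
model = ReturnPredictor(input_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = torch.randn(4, 50, 8)                       # 4 episodes, 50 steps each
episode_return = torch.randn(4, 1).expand(4, 50)  # one return per episode
loss = nn.functional.mse_loss(model(seq), episode_return)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```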

Since sensitivity analysis is a considerable obstacle to learning, we use contribution analysis for the backward analysis, e.g. contribution propagation [60], contribution approaches [81], excitation backpropagation [127], layer-wise relevance propagation (LRP) [3], Taylor decomposition [3, 73], or integrated gradients [103]. With contribution analysis, the predicted return can be decomposed into contributions along the state-action sequence. By replacing the prediction with the actual return, we obtain redistributed rewards and a new MDP with the same optimal policies as the original MDP. Reward redistribution [76, 122] differs fundamentally from reward shaping, which expresses the reward as a function of states rather than actions. Both reward shaping and look-back advice [123] keep the original reward, so long delays remain possible and learning is still exponentially slowed down. We introduce RUDDER, which performs reward redistribution via return decomposition and thereby overcomes the problems that TD and MC have with delayed rewards. RUDDER drastically reduces the variance of MC and largely avoids TD's exponentially slow bias correction (for an optimal return decomposition, TD is even unbiased).
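To make the return-decomposition step more tangible, here is a minimal sketch (plain NumPy; the per-step return predictions are assumed to come from a model such as the LSTM sketch above, and decomposing via differences of consecutive predictions plus a rescaling to the actual return is one simple form of contribution analysis, not necessarily the authors' exact procedure):

```python
import numpy as np

def redistribute_reward(per_step_predictions, actual_return):
    """Decompose a predicted episode return into per-step contributions
    and rescale them so that they sum to the actual return."""
    preds = np.asarray(per_step_predictions, dtype=float)
    # Contribution of step t = change in the predicted return caused by step t.
    contributions = np.diff(np.concatenate(([0.0], preds)))
    # Replace the prediction with the actual return: rescale the contributions
    # so the redistributed rewards sum to the observed return.
    total = contributions.sum()
    if not np.isclose(total, 0.0):
        contributions *= actual_return / total
    return contributions

# Usage: the predictor "realizes" most of the return at step 2,
# so most of the delayed reward is redistributed to that step.
preds = [0.1, 0.2, 0.9, 1.0, 1.0]   # running return predictions along the episode
print(redistribute_reward(preds, actual_return=1.0))
```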

Figure 1: Experimental evaluation of the bias and variance of different Q-value estimators on Grid World. Left: The number of states affected by high variance grows exponentially with the number of delay steps. Right: The number of samples needed to reduce the bias by 80%, for different delays.

Figure 2: Number of observed states (log scale) that different methods need to learn a policy achieving 90% of the optimal return, for different delays. We compare a Q-learning implementation of temporal difference (TD), Monte Carlo (MC), Monte Carlo tree search (MCTS), and RUDDER. Left: Grid World environment. Right: Charge-Discharge environment. The gap between RUDDER and the other methods grows exponentially with the number of delay steps.

Table 1: Results of RUDDER and other methods on the Atari games Bowling and Venture. Human-normalized scores and raw scores of each method over 200 test games under the no-op starting condition; the A3C score is not reported because it is not available for the no-op starting condition. Scores of the other methods are taken from previous studies [8, 39]. The RUDDER model was selected based solely on its training loss at 12M frames.

Figure 3: RUDDER learns the delayed-reward Atari games Bowling and Venture faster than other methods. Left: human-normalized scores of each method during training on Bowling; right: human-normalized scores of each method during training on Venture. The learning curves are taken from previous studies [39, 50]. RUDDER achieves a new state of the art on Bowling.

Figure 4: RUDDER's return decomposition on two Atari games with long-delayed rewards. Left: In Bowling, the reward is given only at the end of a turn (which consists of multiple rolls). RUDDER identifies the actions that steer the ball in the right direction to hit the pins. Once the ball hits the pins, RUDDER detects the delayed reward associated with the hit. Only 100 frames are shown, but the whole turn is more than 200 frames long. In the original game, the reward is given at the end of the turn. Right: In Venture, the reward is given only after the treasure is collected. Through reward redistribution, RUDDER guides the agent (red) to the treasure (gold). Reward is redistributed to entering a room that contains treasure. In addition, the redistributed reward increases as the agent approaches the treasure. For better visibility, the green curve shows the redistributed reward before applying lambda. The environment gives a reward only when the treasure is collected (blue).

Source code: github.com/ml-jku/base…