Source: Deep Reinforcement Learning Lab

Hou Yuqing, Chen Yurong

Introduction:

Deep reinforcement learning combines deep learning and reinforcement learning: it integrates deep learning's strong ability to understand visual and other perceptual inputs with reinforcement learning's decision-making ability, enabling end-to-end learning. The emergence of deep reinforcement learning has made reinforcement learning practical and able to tackle complex problems in real-world scenarios. Since the appearance of DQN (Deep Q-Network) in 2013, a large number of algorithms have emerged in the field, along with papers applying them to practical problems. This article surveys the current state of deep reinforcement learning and looks toward its future.

I am writing the AI Basics series; the installments released so far are:

  • AI Basics: An easy introduction to math
  • AI Basics: Python development environment setup and tips
  • AI Basics: An easy introduction to Python
  • AI Basics: Regular expressions
  • AI Basics: An easy introduction to NumPy
  • AI Basics: Pandas
  • AI Basics: An easy introduction to SciPy (scientific computing library)
  • AI Basics: An easy introduction to data visualization (Matplotlib and Seaborn)
  • AI Basics: Using the machine learning library scikit-learn
  • AI Basics: An easy introduction to machine learning
  • AI Basics: Loss functions for machine learning
  • AI Basics: Machine learning and deep learning practice data
  • AI Basics: Feature engineering – categorical features
  • AI Basics: Feature engineering – numerical feature processing
  • AI Basics: Sequence models for natural language processing
  • AI Basics: Feature engineering – text feature processing
  • AI Basics: Word embedding basics and Word2Vec
  • AI Basics: The Illustrated Transformer
  • AI Basics: Understanding BERT
  • AI Basics: Must-read papers for getting started in artificial intelligence
  • AI Basics: Getting into deep learning
  • AI Basics: Optimization algorithms
  • AI Basics: Convolutional neural networks
  • AI Basics: Classical convolutional neural networks
  • AI Basics: Deep learning
  • AI Basics: An overview of data augmentation methods
  • AI Basics: Paper-writing tools

More installments will follow.

Main text begins here.

1. The deep reinforcement learning bubble

In 2015, Volodymyr Mnih and colleagues at DeepMind published the paper Human-level Control through Deep Reinforcement Learning in Nature [1]. The paper proposed the Deep Q-Network (DQN) model, which combines deep learning (DL) techniques with reinforcement learning (RL) ideas and achieved performance beyond human level on the Atari game platform. Since then, deep reinforcement learning (DRL), the combination of DL and RL, has rapidly become a focus of artificial intelligence research.

In the past three years, DRL algorithms have shown great success in many areas: beating top human players in video games [1] and board games [2,3]; controlling complex machinery [4]; allocating network resources [5]; delivering significant energy savings in data centers [6]; and even automatically tuning the parameters of machine learning algorithms [7]. Universities and companies have rushed into the field, producing a dazzling array of DRL algorithms and applications. The past three years have been a red-hot period for DRL. David Silver, the DeepMind researcher in charge of the AlphaGo project, proclaimed that "AI = RL + DL", believing that DRL, which combines DL's representation ability with RL's reasoning ability, would be the ultimate answer to artificial intelligence.

1.1 DRL reproducibility crisis

In the last six months, however, researchers have begun to rethink DRL. Because the details of important parameter settings and engineering tricks are often not provided in the published literature, many algorithms are difficult to reproduce. In September 2017, the research group led by the well-known RL experts Doina Precup and Joelle Pineau published the paper Deep Reinforcement Learning that Matters [8], pointing out that although there are many papers in the DRL field, the experiments are hard to reproduce. The article aroused a strong response in academia and industry: many agreed, and began to seriously doubt DRL's actual capabilities.

This was not the first time Precup and Pineau had taken aim at DRL. Two months earlier, the team had run extensive experiments to investigate the factors that make DRL algorithms hard to reproduce and published Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control [9]. In August of that year, they presented "Reproducibility of Policy Gradient Methods for Continuous Control" at ICML 2017 [10], showing in detail, through examples, the difficulties caused by various sources of uncertainty when reproducing several policy-gradient-based algorithms. In December, Joelle Pineau was invited to give a talk entitled "Reproducibility of DRL and Beyond" at the highly anticipated NIPS 2017 DRL Symposium [11]. Pineau began by describing the current "reproducibility crisis" in science: in a survey by Nature, 90 percent of respondents agreed that reproducibility is a crisis in science, with 52 percent calling it a serious problem; in other studies, researchers in almost every field reported high rates of failure to reproduce other people's, or even their own, past experiments. This shows how serious the reproducibility crisis is. A survey of machine learning researchers conducted by Pineau showed that roughly 90 percent of them likewise recognize this crisis.

There is a serious “reproducibility crisis” in machine learning [11]

Pineau then presented her group's extensive reproducibility experiments on a range of current DRL algorithms. The results show that different DRL algorithms behave very differently across tasks, hyperparameter settings, and even random seeds. In the second half of the report, she issued a call to study the "reproducibility crisis", proposed, on the basis of her survey results, a 12-point checklist for testing the reproducibility of algorithms, and announced a plan to organize an ICLR Reproducibility Challenge starting in 2018 (the reproducibility crisis has also drawn attention in other areas of machine learning; ICML 2017 held a Reproducibility in Machine Learning Workshop and will hold a second one this year), aiming to encourage researchers to do truly solid work and curb the bubble in machine learning. This series of studies by Pineau and Precup has attracted a great deal of attention.

Pineau's checklist for testing the reproducibility of algorithms, based on extensive surveys [11]

1.2 How many pits are there in DRL research?

Also in December, there was a heated discussion on Reddit about questionable practices in machine learning [12]. It was pointed out that some representative DRL algorithms achieved excellent but hard-to-reproduce performance in simulators, allegedly because the authors modified the simulator's physics model during the experiments while avoiding any discussion of this in the paper.

Criticism of existing DRL algorithms kept coming. On Valentine's Day 2018, Alex Irpan (known online as Alexirpan), who once studied in the Berkeley Artificial Intelligence Research lab (BAIR), delivered a bitter gift to the DRL community with the blog post Deep Reinforcement Learning Doesn't Work Yet [13]. Drawing on multiple examples, he summarized, from an experimental point of view, several major problems with current DRL algorithms:

  • Sample efficiency is very low;
  • Final performance is often not as good as model-based approaches;
  • Good reward functions are hard to design;
  • It is difficult to balance "exploration" and "exploitation", so algorithms get stuck in local optima;
  • Overfitting to the environment;
  • Catastrophic instability...

Although the author ends the article by laying out a series of issues that DRL should address next, many people read the post as a declaration of retreat for DRL. A few days later, Georgia Tech doctoral student Himanshu Sahni published the blog post Reinforcement Learning never worked, and 'deep' only helped a bit [14], echoing these views.

Matthew Rahtz, another DRL researcher, responded to Alexirpan by recounting his own bumpy experience of trying to reproduce a DRL algorithm, which shows vividly how difficult reproduction can be [15]. Half a year earlier, out of research interest, Rahtz had chosen to reproduce OpenAI's paper Deep Reinforcement Learning from Human Preferences. In the process he stepped into nearly every pit Alexirpan had summarized. He found reproducing a DRL algorithm to be more of a mathematical problem than an engineering one: "It's more like you're solving a puzzle, and there's no pattern to it, and the only way is to keep trying until something hits you and you figure it out. ... A lot of little details that seemed insignificant became the only clue... Be prepared to be stuck for weeks at a time." Rahtz gained valuable engineering experience along the way, but the difficulty of the whole process cost him a great deal of money and time. He mobilized various computing resources, including the university's machine room, Google's cloud compute engine, and FloydHub, at a total cost of $850. Even so, a project that was supposed to take three months ended up taking eight, much of which was spent on debugging.

The actual time of reproducing DRL algorithm is much longer than the estimated time [15]

Rahtz eventually achieved his goal of reproducing the paper. His blog post not only gives readers a detailed summary of the valuable engineering lessons learned along the way, but also shows, through a concrete case, how much froth there is in DRL research and how many pitfalls lie within. As one commenter put it: "DRL may be successful not because it works, but because people put a lot of effort into it."

Many prominent scholars have also weighed in, and the prevailing view is that DRL may be the biggest bubble in AI. Machine learning expert Jacob Andreas tweeted, tongue in cheek:

Jacob Andreas’s joke about DRL

DRL’s success is attributed to its being the only method in machine learning that allows training on test sets.

In the year or so since Pineau and Precup fired the first shot, DRL has been hammered again and again. Just as I was preparing this article, Pineau was invited to give a talk entitled Reproducibility, Reusability, and Robustness in DRL at ICLR 2018 [16] and officially launched the Reproducibility Challenge. It seems the academic ridicule of DRL will continue, and negative commentary will keep fermenting. So what exactly is wrong with DRL? Is the outlook really so bleak? If RL does not combine with deep learning, what is its future?

While everyone is making fun of DRL, renowned optimization expert Ben Recht gives an analysis from another perspective.

2. The inherent defects of model-free reinforcement learning

RL algorithms can be divided into model-based and model-free methods. The former were mainly developed in the field of optimal control: typically one first builds a model of the specific problem using tools such as Gaussian processes (GP) or Bayesian networks (BN), and then solves it with machine learning methods or optimal control methods such as model predictive control (MPC), the linear quadratic regulator (LQR), the linear quadratic Gaussian (LQG), or iterative learning control (ILC). The latter are data-driven methods developed mostly in the machine learning field: the algorithm estimates the agent's state-action value function or return function from a large number of samples and uses it to optimize the action policy.

Model-based vs. model-free [17]
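To make the contrast concrete, here is a minimal sketch of the model-free idea in its simplest form, tabular Q-learning, which estimates action values purely from sampled transitions and never builds a model of the environment. The `env` interface (`reset()`/`step()`) and the discrete setting are illustrative assumptions, not details taken from the works cited here.

```python
import numpy as np

# Minimal tabular Q-learning sketch (model-free): values are estimated purely from samples.
# Assumes a hypothetical discrete environment with env.reset() -> state and
# env.step(action) -> (next_state, reward, done).
def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))  # action-value table; no dynamics model anywhere
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over current value estimates
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step temporal-difference update from the sampled transition
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```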

Since the beginning of this year, Ben Recht has published a series of 13 blog posts examining model-free RL from the perspective of control and optimization [18]. Recht points out that model-free methods have several inherent drawbacks:

Model-free methods cannot learn from samples without feedback signals, and the feedback itself is often sparse, so their sample efficiency is very low; yet as data-driven methods they require enormous amounts of sampling. In Atari games such as Space Invaders and Seaquest, for example, the agent's score grows with the amount of training data, and it can take up to 200 million frames for model-free DRL to learn a reasonably good policy. The version of AlphaGo first published in Nature likewise required 30 million games for training. For problems such as mechanical control, however, training data is far harder to obtain than video frames, so training can only be carried out in simulators, and the Reality Gap between simulator and real world then directly limits how well the trained policies generalize. In addition, data scarcity also hampers the integration of RL with DL techniques.

Model-free approaches do not model the specific problem; they try to solve every problem with one general algorithm. Model-based approaches, by building a model for the specific problem, make full use of the problem's inherent information; model-free approaches discard this valuable information in the pursuit of generality.

  • Model-based approaches build a dynamic model of the problem and are therefore interpretable; model-free methods have no model, are weaker in interpretability, and are hard to debug.
  • Compared with model-based methods, especially those built on simple linear models, model-free methods are less stable and more likely to diverge during training.

To confirm this, Recht compared a simple random search method, developed from his LQR analysis, with the best model-free methods in MuJoCo environments. With comparable sampling requirements, the random search algorithm's computational efficiency was at least 15 times higher than that of the model-free methods [19].

The random search method ARS outperforms many model-free methods [19]
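For reference, the sketch below shows the basic random-search idea behind ARS [19]: perturb the weights of a linear policy in random directions, roll out both perturbations, and move the weights toward the better-performing direction. The environment interface and hyperparameters are illustrative assumptions; the published algorithm additionally normalizes states and scales the step size by the standard deviation of the collected returns.

```python
import numpy as np

def rollout(env, W, horizon=1000):
    # Return of one episode under the linear policy a = W s (env interface is assumed).
    s, total = env.reset(), 0.0
    for _ in range(horizon):
        s, r, done = env.step(W @ s)
        total += r
        if done:
            break
    return total

def random_search(env, obs_dim, act_dim, iters=100, n_dirs=8, step=0.02, noise=0.03):
    W = np.zeros((act_dim, obs_dim))
    for _ in range(iters):
        grad = np.zeros_like(W)
        for _ in range(n_dirs):
            delta = np.random.randn(act_dim, obs_dim)   # random search direction
            r_plus = rollout(env, W + noise * delta)
            r_minus = rollout(env, W - noise * delta)
            grad += (r_plus - r_minus) * delta          # favor the better perturbation
        W += step / (n_dirs * noise) * grad
    return W
```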

Recht's analysis seems to reveal the root of DRL's problems: the DRL algorithms that have been popular in machine learning over the past three years mostly combine model-free methods with DL, and the inherent defects of model-free algorithms correspond exactly to the major DRL problems summarized by Alexirpan (see above).

It appears that the root of DRL's troubles lies largely in the model-free approach. Why, then, is most DRL work based on model-free methods? I see several reasons. First, model-free methods are relatively simple and intuitive, with abundant open-source implementations; they are easy to get started with, which attracts more researchers and makes breakthrough work such as DQN and the AlphaGo series more likely. Second, RL is still at an early stage of development, and academic research focuses on problems where the environment is deterministic and static, the state is mostly discrete, static and fully observable, and the feedback is deterministic (such as Atari games); for such relatively "simple", basic, generic problems, model-free methods are themselves appropriate. Finally, encouraged by the slogan "AI = RL + DL", DRL's power has been overestimated: the exciting capabilities demonstrated by DQN led many researchers to build on it, producing a long line of work that is likewise model-free.

Most DRL methods are extensions of DQN and belong to model-free methods [20]

So should DRL abandon model-free approaches in favor of model-based approaches?

3. Model-based or model-free? The question is not so simple

3.1 Model-based approach has great potential in the future

Model-based approaches typically learn a model from data and then optimize the policy based on the learned model. The model-learning process is similar to system identification in control theory. Because a model exists, model-based methods can make full use of every sample, greatly improving data efficiency: on some control problems they achieve roughly a 10^2 improvement in sample efficiency over model-free approaches. In addition, the learned model is often robust to changes in the environment; when facing a new environment, the algorithm can rely on the learned model for reasoning, giving good generalization.

Model-based methods have higher sample efficiency [22]
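As an illustration of this loop, the sketch below fits a crude linear dynamics model from logged transitions (a simple form of system identification) and then plans on the learned model with random-shooting MPC. The linear model class, the `reward_fn`, and the environment interface are assumptions made for the example, not details of the cited methods.

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    # Least-squares fit of s' ≈ [s, a] @ theta, i.e. simple system identification.
    X = np.hstack([states, actions])                     # shape (N, obs_dim + act_dim)
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return theta                                         # shape (obs_dim + act_dim, obs_dim)

def mpc_action(theta, s, act_dim, reward_fn, horizon=10, n_candidates=256):
    # Random-shooting MPC: sample action sequences, simulate them with the learned
    # model, keep the first action of the best sequence, then replan at the next step.
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, act_dim))
        s_sim, ret = np.array(s, dtype=float), 0.0
        for a in seq:
            s_sim = np.hstack([s_sim, a]) @ theta        # predicted next state
            ret += reward_fn(s_sim, a)                   # assumed known reward function
        if ret > best_return:
            best_return, best_action = ret, seq[0]
    return best_action
```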

In addition, the model-based approach is closely related to Predictive Learning, which has great potential: a model-based method can use its model to predict the future, which coincides exactly with what Predictive Learning requires. In fact, Yann LeCun used model-based methods as an example when introducing Predictive Learning in his widely watched NIPS 2016 keynote [21]. The author believes model-based RL may be one of the key technologies for realizing Predictive Learning.

Seen this way, the model-based approach looks more promising. But there is no free lunch: the presence of a model brings problems of its own.

3.2 Model-free methods are still the first choice

Model-based DRL methods are not so simple and intuitive: combining RL with DL here is more complex and harder to design. Current model-based DRL methods usually build the model with Gaussian processes, Bayesian networks or probabilistic neural networks (PNN), as in the Predictron model proposed by David Silver's group in 2016 [23]. Other work, such as Probabilistic Inference for Learning COntrol (PILCO) [24], is not itself based on neural networks, though an extended version combines it with BNs. Guided Policy Search (GPS) uses neural networks for the policy, but its optimization of the local controllers does not rely on them [25]. There is also work that combines neural networks with models in other ways [26]. None of these designs is as intuitive and natural as model-free DRL, and DL plays a different role in each.

In addition, the model-based approach also has several drawbacks of its own:

  • They are helpless on problems that cannot be modeled. Some domains, such as NLP, contain many tasks that are hard to abstract into models. In that case the only option is to first interact with the environment, using methods such as the R-Max algorithm, to estimate a model for later use; but the complexity of such approaches is generally high. In recent years, work that builds models with the help of predictive learning has partially alleviated the modeling difficulty, and this idea is gradually becoming a research hotspot.
  • Modeling introduces errors, and these errors tend to accumulate as the algorithm iteratively interacts with the environment, making it difficult to guarantee convergence to the optimal solution.
  • Models lack generality: every new problem requires building a new model.

Given these points, model-free methods have relative advantages: for the many real-world problems that cannot be modeled, and for imitation learning problems, model-free algorithms remain the best choice. Moreover, model-free methods are asymptotically convergent in theory: after unlimited interaction with the environment they are guaranteed to reach the optimal solution, a guarantee that is hard to obtain with model-based methods. Finally, the biggest advantage of model-free methods is their excellent generality, and in practice they often work better on genuinely hard problems. Recht also pointed out in his blog posts that MPC, which is highly effective in control, is in fact closely related to model-free methods such as Q-learning [18].

The difference between model-based and model-free approaches can also be seen as the difference between knowledge-based and statistics-based approaches. In general, both have their merits, and it is hard to say one is superior to the other. In the RL field as a whole, model-free algorithms actually account for only a small part, but for historical reasons model-free DRL methods are currently developing rapidly and in large numbers, while model-based DRL methods remain relatively few. In the author's view, more work on model-based DRL is worth doing to overcome the many problems DRL faces. Semi-model methods that combine the model-based and model-free approaches, and so enjoy the advantages of both, also deserve study; classic work in this direction includes the Dyna framework proposed by RL pioneer Rich Sutton [27] and the Dyna-2 framework proposed by his student David Silver [28].

From the discussion above, we seem to have found a way out of DRL's current predicament. But the DRL dilemma involves more than this.

3.3 It is not just a question of models

As mentioned above, Recht's use of a random-search approach to mock the model-free approach seems to condemn the latter to death. But the comparison is not entirely fair.

In March 2017, machine learning expert Sham Kakade's group published Towards Generalization and Simplicity in Continuous Control, which tries to find a simple, general solution to continuous control problems [29]. They found that current simulators have serious problems: a well-tuned linear policy already works very well. With simulators this crude, it is no wonder that random-search approaches can beat model-free methods on them!

This shows that the experimental platforms in the RL field are still immature, and results obtained in such test environments are not convincing enough. Many findings are questionable, since good performance may come simply from exploiting simulator bugs. Some scholars have also pointed out that the performance evaluation criteria for current RL algorithms are unscientific. Both Ben Recht and Sham Kakade have made a number of concrete suggestions for the development of RL, covering test environments, benchmark algorithms and evaluation metrics [18,29]. Clearly, much in the RL field remains to be improved and standardized.

So what’s next for RL?

4. Reinforcement learning in a new light

The questioning and discussion of DRL and model-free RL allow us to re-examine RL itself, which will greatly benefit its future development.

4.1 Re-examining DRL research and applications

The DQN and AlphaGo series are impressive, but both tasks are in essence relatively "simple": the environments are deterministic and static, the states are mostly discrete, static and fully observable, the feedback is deterministic, and there is only a single agent. So far, DRL has made no comparably stunning breakthrough on tasks with partially observable states (such as StarCraft), continuous states (such as mechanical control), dynamic feedback, or multiple agents.

The tasks on which DRL has succeeded are relatively simple in nature [30]

At present, a large number of DRL studies, especially those applied to computer vision, forcibly recast a DL-based computer vision task as an RL problem, and the results are often worse than those of traditional methods. This practice has inflated the number of papers in the DRL field. As DRL researchers, we should not take a DL task and force RL onto it; instead, for tasks naturally suited to RL, we should introduce DL to improve the capability of existing methods in target recognition or function approximation.

In computer vision tasks, combining DL naturally yields good feature representations or function approximation. In some other fields, however, DL may not play such a powerful role in feature extraction or function approximation: in robotics, for example, DL has so far mostly been used for perception rather than to replace methods based on mechanical analysis. Although there are successful cases of DRL applied to real-world mechanical control tasks such as object grasping, e.g. QT-Opt [70], they typically require a great deal of tuning and training time. It should be recognized that, because of the randomness in its outputs, current DRL is used far more in simulators than in real environments. At present, the tasks where DRL is genuinely useful and only needs to run in simulation fall into three main categories: video games, board games, and automated machine learning (such as Google's AutoML Vision).

This is not to say that DRL applications are trapped in simulators: if the gap between simulator and reality can be bridged for a specific problem, DRL can be powerful. Recently, Google researchers put great effort into improving the simulator for quadruped robot locomotion, so that movement policies trained in simulation transfer almost perfectly to the real world, with impressive results [71]. Still, given the instability of RL algorithms, practical applications need not blindly pursue end-to-end solutions; separating feature extraction (DL) from decision making (RL) yields better interpretability and stability. In addition, modularizing RL (encapsulating the RL algorithm as a module) and integrating it with other models has broad prospects in practice, and it is also worth investigating how DL can learn representations suitable as inputs to an RL module.

4.2 Re-examining RL research

Machine learning is an interdisciplinary research field, and RL is a branch with a particularly marked interdisciplinary character. The development of RL theory was inspired by physiology, neuroscience and optimal control, and RL is still being studied in many related fields: in control theory, robotics, operations research, economics and elsewhere, many scholars remain devoted to RL research. Similar concepts or algorithms are often reinvented in different fields under different names.

The development of RL is influenced by multiple disciplines [31]

Warren Powell, a well-known operations research expert at Princeton University, once wrote an article entitled AI, OR and Control Theory: A Rosetta Stone for Stochastic Optimization, which catalogues the names that the same RL concepts and algorithms go by in AI, OR (operations research) and control theory, helping to break down the barriers between these fields [32]. Because each discipline has its own character, RL research in different fields has its own flavor, which allows RL research to draw fully on the best ideas of each field.

Here, based on my own understanding of RL, the author tries to summarize some research directions:

  • Model-based methods. As discussed above, model-based methods not only greatly reduce sampling requirements, but by learning the dynamics of the task they also lay a foundation for predictive learning.
  • Improving the data efficiency and scalability of model-free methods. These are the two main weaknesses of model-free learning and Rich Sutton's ultimate research goals. This is a hard area, but any meaningful breakthrough will bring great value.
  • More efficient exploration strategies. Balancing "exploration" and "exploitation" is the essence of RL and calls for more efficient exploration strategies. Beyond classic algorithms such as Softmax, ϵ-greedy [1], UCB [72] and Thompson sampling [73], many new methods have been proposed recently, such as intrinsic motivation [74], curiosity-driven exploration [75] and count-based exploration [76]. In fact, many of the ideas behind these "new" algorithms appeared as early as the 1980s [77]; combining them organically with DL has brought them renewed attention. In addition, OpenAI and DeepMind have proposed enhancing exploration by injecting noise into policy parameters [78] and network weights [79] respectively, opening a new direction (a minimal sketch of two classic strategies follows this list).
  • Integration with Imitation Learning (IL). ALVINN [33], the earliest success at the intersection of machine learning and autonomous driving, was based on IL. Pieter Abbeel, now a leading scholar in RL, designed helicopter control algorithms through IL while a doctoral student under Andrew Ng [34], producing representative work in the field. The end-to-end autonomous driving system proposed by Nvidia in 2016 also learns through IL [68], and so does AlphaGo. IL sits between RL and supervised learning and enjoys the advantages of both: it obtains feedback and converges faster while retaining reasoning ability, and is therefore of great research value. For an introduction to IL, see the review [35].
  • Reward shaping. Reward is the feedback signal, and it has a huge influence on the performance of RL algorithms. Alexirpan's blog post has already shown how badly RL algorithms can behave without a well-designed reward. Designing feedback signals has always been a hotspot in RL research. In recent years many "curiosity"-based RL algorithms and hierarchical RL algorithms have emerged; both insert feedback signals during training to partially overcome the problem of overly sparse feedback. Another idea is to learn the reward function, which is one of the main approaches of Inverse RL (IRL). The recently popular GAN is based on the same idea for generative modeling, and Ian Goodfellow, who proposed GAN, also regards it as a form of RL [36]. GAIL [37], which combines GAN with traditional IRL, has attracted the attention of many scholars.
  • Transfer learning and multi-task learning in RL. RL's current sample efficiency is extremely low, and the knowledge it learns does not generalize. Transfer learning and multi-task learning address these problems effectively: by transferring policies learned on an original task to a new task, they avoid learning from scratch, greatly reducing data requirements and improving the algorithm's adaptability. One difficulty of using RL in real environments is its instability; a natural idea is to transfer stable policies trained in simulators to the real world, where only a little exploration is needed to adapt. The major obstacle here is the Reality Gap: the simulated environment differs too much from the real one. A good simulator not only narrows the reality gap but also satisfies RL's huge appetite for samples, and can therefore greatly advance RL research and development, as in the sim-to-real work mentioned above [71]; this is also where RL meets VR technology. Academia and industry have recently been investing heavily in this area. In autonomous driving, simulators such as Gazebo, Euro Truck Simulator, TORCS, Unity, Apollo, Prescan, Panosim and CarSim each have their own strengths, and the CARLA simulator [38] developed by Intel Research Institute has gradually become a research standard in the community. Simulator development in other areas is blossoming as well: for home environments, MIT and the University of Toronto jointly developed the feature-rich VirtualHome simulator, and MIT has also developed the FlightGoggles simulator for drone training.
  • Improving the generalization ability of RL. Generalization is the most important goal of machine learning, yet most existing RL methods perform poorly on this measure [8]. It is no surprise that Jacob Andreas criticized the success of RL as amounting to "training on the test set".
  • Hierarchical RL (HRL). Professor Zhou Zhihua summarized three conditions for the success of DL: layer-by-layer processing, internal transformation of features, and sufficient model complexity [39]. HRL not only satisfies these three conditions but also has stronger reasoning ability, making it a promising research direction. HRL has already shown strong learning ability on tasks requiring complex reasoning, such as Montezuma's Revenge on Atari [40].
  • Combination with sequence prediction. Sequence prediction, RL and IL solve similar yet different problems, and there are many ideas they can borrow from one another. Some methods based on RL and IL have already achieved good results on sequence prediction tasks [41,42,43]. A breakthrough in this direction would have a broad impact on many tasks in video prediction and NLP.
  • Safe exploration for (model-free) methods (Safe RL). Compared with model-based methods, model-free methods lack predictive ability, which makes their exploratory behavior riskier. One approach is to use Bayesian methods to model the uncertainty of the agent's behavior and thereby avoid overly dangerous exploration. In addition, to apply RL safely in real environments, danger zones can be marked out in the simulator with the help of mixed reality technology, and the agent's behavior can be constrained by limiting its activity space.
  • Relational RL. "Relational learning", which studies the relationships between objects in order to reason and predict, has recently attracted wide attention. Relational learning typically builds a chain of states during training, and the intermediate states are disconnected from the final feedback; RL can propagate the final feedback back to those intermediate states, enabling effective learning, and has therefore become a natural way to realize relational learning. VIN [44] and the Predictron [23] are representative works in this direction. In June 2018, DeepMind published a series of high-profile work on relational learning topics, including relational inductive biases [45], relational RL [46], relational RNNs [47], graph networks [48], and the Generative Query Network (GQN) published in Science [49]. This series of work is likely to lead a wave of relational RL.
  • Adversarial examples in RL. RL is widely used in mechanical control and other fields that demand higher robustness and safety than image or speech recognition, so defending RL against adversarial attacks is a very important problem. Recent studies show that classic algorithms such as DQN cannot withstand adversarial perturbations and can be manipulated by adversarial examples [50,51].
  • Processing inputs from other modalities. In NLP, RL has been applied to data of many modalities, such as sentences, discourse and knowledge bases. In computer vision, however, RL algorithms mainly extract features from images and videos through neural networks and rarely touch other modalities. It is worth exploring how to apply RL to other modal data such as RGB-D and LiDAR; once feature extraction for some type of data becomes much easier, its organic combination with RL may yield AlphaGo-level breakthroughs. Intel Research Institute has carried out a series of work in this area based on the CARLA simulator.
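As mentioned in the exploration item above, here is a minimal sketch contrasting two classic exploration strategies, ϵ-greedy and UCB, on a toy multi-armed bandit. The bandit setup and hyperparameters are illustrative assumptions, intended only to show how the two strategies trade off exploration against exploitation.

```python
import numpy as np

def eps_greedy_choice(values, eps=0.1):
    # Explore uniformly with probability eps, otherwise exploit the best estimate.
    if np.random.rand() < eps:
        return np.random.randint(len(values))
    return int(np.argmax(values))

def ucb_choice(counts, values, t, c=2.0):
    # Pull every arm once, then pick the arm with the highest upper confidence bound.
    if np.any(counts == 0):
        return int(np.argmin(counts))
    return int(np.argmax(values + c * np.sqrt(np.log(t) / counts)))

def run_bandit(true_means, steps=1000, strategy="ucb"):
    k = len(true_means)
    counts, values, total = np.zeros(k), np.zeros(k), 0.0
    for t in range(1, steps + 1):
        a = ucb_choice(counts, values, t) if strategy == "ucb" else eps_greedy_choice(values)
        r = true_means[a] + np.random.randn()        # noisy reward from the chosen arm
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]     # incremental mean estimate
        total += r
    return total

# Example: compare the two strategies on the same 5-armed bandit.
# print(run_bandit([0.1, 0.3, 0.5, 0.2, 0.4], strategy="ucb"))
# print(run_bandit([0.1, 0.3, 0.5, 0.2, 0.4], strategy="eps"))
```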

4.3 Re-examining RL applications

A common view at present is that "RL can only play video games and board games, and nothing else." The author thinks we should not be so pessimistic about RL. That RL can surpass humans in video games and board games already demonstrates its powerful reasoning ability; with reasonable improvements it can be expected to find wide application. The path from research to application is rarely direct. IBM Watson, for example, is known for its ability to understand and answer natural language questions and beat human contestants to win the Jeopardy! championship in 2011. One of its supporting technologies is the RL technique that Gerald Tesauro used when developing TD-Gammon [52][53]: what was once "only good for playing board games" became an integral part of one of the best question-answering systems. Today's RL is far more advanced than it was then; why should we lack confidence?

RL also plays a central role behind the powerful IBM Watson®

A look across the literature shows that RL algorithms have already been applied in a wide range of fields:

  • The control field. This is one of the birthplaces of RL ideas and the most mature area of RL application. Control and machine learning have developed similar ideas, concepts and techniques that can borrow from one another; the widely used MPC algorithm, for example, is a special case of RL. In robotics, where DL is mainly used for perception, RL has its own advantages over traditional methods: traditional approaches such as LQR generally learn a trajectory-level policy based on graph search or probabilistic search, which is computationally heavy and hard to replan, whereas RL learns a policy over the state-action space and adapts better.
  • Autonomous driving. Driving is a sequential decision process and thus a natural fit for RL. From ALVINN in the late 1980s and TORCS to CARLA today, researchers have been trying to use RL to solve both single-vehicle autonomous driving and multi-vehicle traffic scheduling. Similar ideas are widely used for aerial and underwater unmanned vehicles.
  • NLP. Compared with computer vision tasks, many NLP tasks are multi-turn, i.e. the optimal solution must be sought through repeated interaction (as in dialogue systems), and the feedback signal often arrives only after a series of decisions (as in machine writing). RL has therefore been applied in recent years to many NLP tasks, such as text generation, text summarization, sequence labeling, dialogue bots (text/speech), machine translation, relation extraction and knowledge graph reasoning, with many successful cases, including the MILABOT model from Yoshua Bengio's group [54], Facebook's chatbot [55] and Microsoft Translator [56]. RL has also been used in tasks spanning NLP and computer vision, such as VQA, image/video captioning, image grounding and video summarization.
  • Recommendation and retrieval systems. A number of algorithms native to RL have long been widely applied in product recommendation, news recommendation and online advertising, and in recent years a series of works has applied RL to information retrieval and ranking tasks [57].
  • Finance. RL's powerful sequential decision-making ability has caught the attention of the financial industry: both Wall Street giant JPMorgan Chase and startups such as Kensho have incorporated RL techniques into their trading systems.
  • Data selection. When data is plentiful, choosing which data to learn from, so as to learn "faster, better and cheaper", has great practical value. A series of efforts has recently emerged here, such as the reinforced co-training proposed by Jiawei Wu of UCSB [58].
  • Operations research areas such as communications, production scheduling, planning and resource access control. Tasks in these areas often involve "selection" actions, and labeled data is hard to obtain, so RL is widely used to solve them.

For a more comprehensive review of the applications of RL, see reference [59,60].

Despite the successes listed above, it must be recognized that RL is still in its infancy: there is no general-purpose RL solution as mature and plug-and-play as DL, and different RL algorithms lead in their respective domains. Until a universal method is found, we should design specialized algorithms for specific problems; in robotics, for example, methods based on Bayesian RL and evolutionary algorithms (such as CMA-ES [61]) are often more suitable than DRL. Of course, different fields should learn from and promote one another. Randomness in the output is intrinsic to RL's "exploration" philosophy, so we should neither blindly go all in on RL nor force RL into everything, but instead find the problems RL is genuinely suited to solve.

Different RL methods should be used for different problems [22]

4.4 Re-examining the value of RL

At NIPS 2016, Yann LeCun argued that the most valuable problem is "Predictive Learning", a concept close to unsupervised learning; his talk represents the recent mainstream view in academia. Ben Recht, however, believes RL is more valuable than supervised learning (SL) and unsupervised learning (UL): he maps the three learning paradigms onto the three levels of business analytics, namely descriptive analytics (UL), predictive analytics (SL) and prescriptive analytics (RL) [18].

Descriptive analytics, which summarizes existing data into a more robust and clearer representation, is the easiest problem and also the least valuable, because its value is more aesthetic than practical. For example, "rendering a picture of a room with a GAN" is far less important than "predicting the price of the room from its picture". The latter is predictive analytics: making predictions about current data based on historical data. In both descriptive and predictive analytics, however, the system is not affected by the algorithm, whereas prescriptive analytics goes a step further, modeling the interaction between algorithm and system and actively influencing the system to maximize value.

Continuing the two examples above, prescriptive analytics answers the question "how to maximize the price of the room by making a series of modifications to it". Such problems are the hardest, because they involve complex interaction between the algorithm and the system, but they are also the most valuable, because maximizing value is the natural goal of prescriptive analytics (RL) and is how humans actually solve problems. Moreover, descriptive and predictive analytics both assume a static, unchanging environment, which is not true for most practical problems; prescriptive analytics deals with dynamically changing environments and even allows for cooperation or competition with other adversaries, which is far closer to the problems humans actually face.

Prescriptive analytics problems are the most difficult and the most valuable [18]

In the final section, I will try to give readers a new way of looking at RL by discussing RL-style "learning from feedback" in a broader context.

5. RL in the broad sense: learning from feedback

In this section, the term "RL in the broad sense" refers to the multidisciplinary study of "learning from feedback". It covers a wider range of disciplines than the RL of machine learning, control theory and economics discussed above: any system that learns from feedback is tentatively called broad-sense RL here.

5.1 RL in the broad sense is the ultimate goal of artificial intelligence research

In 1950, in his landmark paper Computing Machinery and Intelligence [62], Turing introduced the "Turing test": a human judge (call them C) poses an arbitrary series of questions, in a language all parties understand, to two respondents C cannot see, one a normally thinking human (B) and the other a machine (A). If, after a number of inquiries, C cannot find any substantial difference between A and B, machine A passes the Turing test.

Note that the Turing test already contains the idea of "feedback": the human relies on the program's responses to make a judgment, and the artificial intelligence program must learn from that feedback in order to deceive the human. In the same paper, Turing also suggested that instead of trying to build a program that directly simulates the adult brain, why not build one that simulates a child's brain and give it a proper education, so that it grows into an adult brain? Is this not exactly how RL gradually improves its ability through feedback? Evidently, when the concept of artificial intelligence was first proposed, its ultimate goal was already to build a system good enough at learning from feedback.

In 1959, artificial intelligence pioneer Arthur Samuel formally defined "machine learning". It was Samuel who, in the 1950s, developed an RL-based checkers program that became one of the earliest successes of artificial intelligence [63]. Why did the AI pioneers gravitate toward RL-related tasks? The discussion of RL in the classic textbook Artificial Intelligence: A Modern Approach may answer the question: it can be argued that RL encompasses all of AI, since an agent is placed in an environment and must learn to behave successfully therein [64].

The significance of action and feedback for the formation of intelligence is emphasized not only in artificial intelligence but also in philosophy. Enactivism holds that action is the basis of cognition: action and perception reinforce each other, with the agent obtaining feedback on its actions through perception, while action brings the agent real, meaningful experience of the environment [65].

Behavior and feedback are the building blocks of intelligence [65]

It seems that learning from feedback is indeed a core element of achieving intelligence.

Back to artificial intelligence. After DL succeeded, it was combined with RL to form DRL. After the success of research on knowledge bases, memory mechanisms were gradually added to RL algorithms, and variational inference has likewise found its point of contact with RL. More recently, as academia has begun to reflect on DL, interest in causal reasoning and symbolic learning has been rekindled, giving rise to work on relational RL and symbolic RL [66]. Reviewing this history reveals a recurring pattern in the development of artificial intelligence: whenever a breakthrough is made in a related direction, it eventually returns to the RL problem and seeks to combine with RL. Rather than seeing DRL as an extension of DL, we can see it as a return to RL. So there is no need to worry too much about the DRL bubble: RL is the ultimate goal of artificial intelligence, it has great vitality, and it will see wave after wave of development in the future.

5.2 RL in the broad sense is the form of all machine learning systems in the future

In the final post of his blog series [67], Recht emphasized that any machine learning system that improves itself by receiving external feedback is in effect an RL system, and that A/B testing, now ubiquitous on the Internet, is the simplest form of RL. Future machine learning systems will deal with distributed, dynamically changing data and learn from feedback, so it can be said that we are about to enter an era in which "all machine learning is RL", and academia and industry urgently need to step up research on RL. Recht discussed this issue in detail from social and ethical perspectives [67] and summarized his thoughts on RL from the viewpoint of control and optimization in a review article for readers to ponder [69].

5.3 RL in a broad sense is the common goal of many fields of research

As mentioned in Section 4.2, RL has been independently invented and studied in fields beyond machine learning. In fact, the idea of learning from feedback has long been studied in many other disciplines as well. To name just a few:

In psychology, the contrast between classical conditioning and operant conditioning parallels that between SL and RL; the famous psychologist Albert Bandura's theory of "observational learning" is very similar to IL; and the "projective identification" proposed by psychoanalyst Melanie Klein can also be viewed as an RL process. Of all the schools of psychology, behaviorism is the closest to RL. Its representative figure John Broadus Watson applied behaviorist psychology to advertising and greatly advanced that industry; it is hard not to notice that one of the most mature applications of RL today is Internet advertising. Cognitive behavioral therapy, developed from behaviorism under the influence of cognitive science, is analogous to policy transfer in RL. Behaviorism is deeply related to RL and can even be regarded as another source of RL thought; space does not permit details here, and interested readers may consult the psychological literature, e.g. [53].

In pedagogy, "active learning" and "passive learning" have long been compared and studied; the Cone of Experience is a representative study, and its conclusions closely resemble the comparison between RL and SL in machine learning. The "inquiry-based learning" advocated by the educationalist Dewey refers to learning by actively exploring and seeking feedback.

  • In the field of organizational behavior, scholars explore the difference between “proactive personality” and “passive personality” and their impact on organizations.
  • In the field of enterprise management, “exploratory behavior” and “exploitative behavior” have always been a research hotspot.

It is fair to say that almost every field involving choice, feedback, and learning from feedback contains the idea of RL in one form or another, which is why I call it RL in the broad sense. These disciplines provide rich research material for the development of RL and have accumulated a wealth of ideas and methods. In turn, the development of RL will affect not only artificial intelligence but also the many disciplines encompassed by broad-sense RL, promoting their common progress.

6. Learning materials

1. Video (from getting started to giving up)

1.1 Tencent _ Morvan Zhou _ Reinforcement learning tutorial, code

  • www.bilibili.com/video/av169…
  • morvanzhou.github.io/
  • Github.com/AndyYue1893…

1.2 DeepMind _ David Silver _ UCL deep reinforcement learning course (2015), PPT, notes and code

  • www.bilibili.com/video/av453…
  • Blog.csdn.net/u_say2what/…
  • Zhuanlan.zhihu.com/p/37690204

1.3 NTU _ Li Hongyi _ Deep reinforcement learning course (in Chinese, 2018), PPT, notes

  • www.bilibili.com/video/av247…
  • Speech.ee.ntu.edu.tw/~tlkagk/cou…
  • Blog.csdn.net/cindy_1102/…

1.4 UC Berkeley _ Sergey Levine _ CS285 (294) deep reinforcement learning course (2019), PPT, code

  • www.bilibili.com/video/av694…
  • Rail.eecs.berkeley.edu/deeprlcours…
  • Github.com/berkeleydee…

2. Books

2.1 The reinforcement learning "bible" _ Rich Sutton _ Chinese edition, English e-book, code ★★★★★ (essential reading, helpful for understanding the essence of reinforcement learning)

  • Item.jd.com/12696004.ht…
  • Incompleteideas.net/book/the-bo…
  • Github.com/AndyYue1893…

2.2 ★★★★★

  • Item.jd.com/12506442.ht…
  • Github.com/AndyYue1893…

2.3 ★★★★ (from basics to the frontier, with code)

  • Item.jd.com/12344157.ht…

2.4 Reinforcement Learning With OpenAI, TensorFlow and Keras Using Python _ OpenAI

  • Pan.baidu.com/share/init?… (extraction code: av5p)

3. Tutorials

3.1 OpenAI Spinning Up (Online learning platform, including principles, Algorithms, Papers, codes)

  • spinningup.openai.com/en/latest/
  • Spinningup. Readthedocs. IO/zh_CN/lates…
  • zhuanlan.zhihu.com/p/49087870

3.2 Morvan Python (easy to understand)

  • morvanzhou.github.io/

4. PPT

4.1 Reinforcement learning_Nando de Freitas_DeepMind_2019

  • Pan.baidu.com/s/1KF10W9Gi…

4.2 Policy Optimization_Pieter Abbeel_OpenAI/UC Berkeley/Gradescope

  • Pan.baidu.com/s/1zOOZjvTA…

5. Algorithms

What are the specific differences between the two RL genres behind DeepMind and OpenAI?

  • www.zhihu.com/question/31…

Three classic algorithms:

5.1 DQN

Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529. (Nature version)

  • Storage.googleapis.com/deepmind-da…

5.2 DDPG

Silver, David, et al. "Deterministic policy gradient algorithms." ICML 2014.

  • Proceedings. MLR. Press/v32 / silver1…

5.3 A3C

Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." ICML 2016.

  • www.researchgate.net/publication…

6. Environments

6.1 OpenAI Gym

  • gym.openai.com/

6.2 Google Dopamine 2.0

  • Github.com/google/dopa…

6.3 Emo Todorov Mujoco

  • www.mujoco.org/

6.4 General grid world environment class

  • zhuanlan.zhihu.com/p/28109312
  • Cs.stanford.edu/people/karp…

7. Frameworks

7.1 OpenAI Baselines

  • Github.com/openai/base…

7.2 Baidu PARL(strong scalability, good reproducibility, friendly)

  • Github.com/paddlepaddl…

7.3 DeepMind OpenSpiel (Debian and Ubuntu only, 28 board games and 24 algorithms)

  • Github.com/deepmind/op…

8. Papers

8.1 Dr. Zhang Chuheng, Tsinghua University ★★★★★ ★[2]

  • Zhuanlan.zhihu.com/p/46600521 (Zhang Chuheng: A summary of reinforcement learning)

8.2 NeuronDance ★★★★

  • Github.com/AndyYue1893…

8.3 paperswithcode ★★★★

  • www.paperswithcode.com/area/playin…
  • Github.com/AndyYue1893…

8.4 Spinning Up ★★★★★

  • zhuanlan.zhihu.com/p/50343077

9. Conferences & journals

9.1 Conference: AAAI, NIPS, ICML, ICLR, IJCAI, AAMAS, IROS, etc

9.2 Journals: AI, JMLR, JAIR, Machine Learning, JAAMAS, etc

9.3 Computer and Artificial Intelligence Conference (Journal) Ranking

  • www.ccf.org.cn/xspj/rgzn/
  • Mp.weixin.qq.com/s?__biz=Mzg…
  • www.aminer.cn/ranks/conf/…

10. WeChat public accounts

10.1 Deep Reinforcement Learning Lab ★★★★★

10.2 Heart of machine ★★★★★

10.3 AI Technology Review ★★★★

10.4 New Jiwon ★★★

11. Zhihu

11.1 Users

  • Xu Tie – Cruiser Technology (WeChat public account of the same name), Flood Sung (GitHub of the same name)
  • Tian Yuandong, Zhou Bolei, Yu Yang, Zhang Chuheng, Tianjin Baozi Stuffing, JQWang2048, and the experts they follow, etc.

11.2 Columns

  • David Silver's reinforcement learning open course: Chinese explanation and practice (by Ye Qiang, very classic)
  • Reinforcement learning knowledge lecture hall (by Tianjin Baozi Stuffing, author of the introductory book on reinforcement learning principles)
  • Smart Units (Duke, Floodsung, WXAM; focuses on general AI; Flood Sung's Deep Learning Papers Reading Roadmap is great)
  • Landing methodology of deep reinforcement learning (an expert from Xi'an Jiaotong University with rich hands-on experience)
  • JQWang2048 (GitHub: NeuronDance, CSDN: J. Q. Wang)
  • Reinforcement Learning and Neural Networks
  • Notes on David Silver's course (Chen Xionghui, NTU, DiDi AI Labs)

12. Blogs

12.1 hat BOY

  • Blog.csdn.net/u013236946/…

12.2 J. Q. Wang

  • blog.csdn.net/gsww404

12.3 Keavnn

  • stepneverstop.github.io/

12.4 big mouth

  • blog.otoro.net/

13. Websites

13.1 OpenAI

  • www.openai.com/

13.2 DeepMind

  • www.deepmind.com/

13.3 Berkeley (BAIR blog)

  • Bair.berkeley.edu/blog/?refre…

Conclusion

While many questions in the RL field remain to be answered and there is plenty of froth in the DRL direction, we should also see the great progress that RL research and applications have made. The field deserves continued research, but it must be applied rationally. Research on learning from feedback not only promises to reach the ultimate goal of artificial intelligence, but is also of great significance to machine learning and many other fields. This is truly the best path toward artificial intelligence; the road is hard, but there is light at the end of the tunnel.

About the authors

Hou Yuqing, Ph.D., is currently a joint postdoctoral researcher at the Cognitive Computing Laboratory of Intel China Research Institute and the State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University. Research interests include the theory and application of reinforcement learning, with a focus on deep-reinforcement-learning-based visual information processing and meta learning. Hou received a Ph.D. from Peking University in 2016 with research on multi-modal learning, and has published 7 academic papers and filed 5 US/international patents and patent applications.

Yurong Chen, Ph.D., is currently a chief researcher at Intel and director of the Cognitive Computing Laboratory of Intel China Research Institute. He leads research on visual cognition and machine learning, driving innovation in intelligent visual data processing based on Intel platforms. He has published more than 50 academic papers and holds more than 50 US/international patents and patent applications.

Acknowledgements

During the writing of this article, I received positive feedback from researchers Yiwen Kuo, Zhongxuan Liu and Xuesong Shi of Intel Research Institute. Dr. Shane Gu of the University of Cambridge, Professor Zhang Chongjie of the Institute for Interdisciplinary Information Sciences at Tsinghua University, and Professor Lin Zhouchen of the Department of Intelligent Science, School of Information Science, Peking University provided valuable guidance on model-based methods, RL generalization performance, and RL model optimization methods, respectively. Special thanks also to reinforcement learning researcher Flood Sung, who introduced the author to several cutting-edge research applications in RL and provided this platform for research exchange.

References

[1] Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529.

[2] Silver, David, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016): 484-489.

[3] Silver, David, et al. “Mastering the game of Go without human knowledge.” Nature 550.7676 (2017): 354.

[4] Levine, Sergey, et al. “End-to-end training of deep visuomotor policies.” arXiv preprint arXiv:1504.00702 (2015).

[5] Mao, Hongzi, et al. “Resource management with deep reinforcement learning.” Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016.

[6] deepmind.com/blog/deepm

[7] Jaques, Natasha, et al. “Tuning recurrent neural networks with reinforcement learning.” (2017).

[8] Henderson, Peter, et al. “Deep reinforcement learning that matters.” arXiv preprint arXiv:1709.06560 (2017).

[9] Islam, Riashat, et al. “Reproducibility of benchmarked deep reinforcement learning tasks for continuous control.” arXiv preprint arXiv:1708.04133 (2017).

[10] riashatislam.files.wordpress.com

[11] sites.google.com/view/d

[12] reddit.com/r/MachineLea

[13] alexirpan.com/2018/02/1

[14] himanshusahni.github.io

[15] amid.fish/reproducing-d

[16] rodeo.ai/2018/05/06/rep

[17] Dayan, Peter, and Yael Niv. “Reinforcement learning: The good, the bad and the ugly.” Current Opinion in Neurobiology 18.2 (2008): 185-196.

[18] argmin.net/2018/05/11/o

[19] Mania, Horia, Aurelia Guy, and Benjamin Recht. “Simple random search provides a competitive approach to reinforcement learning.” arXiv preprint arXiv:1803.07055 (2018).

[20] Justesen, Niels, et al. “Deep Learning for Video Game Playing.” arXiv preprint arXiv:1708.07902 (2017).

[21] youtube.com/watch?

[22] sites.google.com/view/i

[23] Liu, J., et al. “Predictron: End-to-end learning and planning.” arXiv preprint arXiv:1612.08810 (2016).

[24] Deisenroth, Marc, and Carl E. Rasmussen. “PILCO: A model-based and data-efficient approach to policy search.” Proceedings of the 28th International Conference on machine learning (ICML-11). 2011.

[25] Levine, Sergey, and Vladlen Koltun. “Guided policy search.” International Conference on Machine Learning. 2013.

[26] Weber, Théophane, et al. “Imagination-augmented agents for deep reinforcement learning.” arXiv preprint arXiv:1707.06203 (2017).

[27] Sutton, Richard S. “Dyna, an integrated architecture for learning, planning, and reacting.” ACM SIGART Bulletin 2.4 (1991): 160-163.

[28] Silver, David, Richard S. Sutton, and Martin Müller. “Sample-based learning and search with permanent and transient memories.” Proceedings of the 25th international conference on Machine learning. ACM, 2008.

[29] Rajeswaran, Aravind, et al. “Towards generalization and simplicity in continuous control.” Advances in Neural Information Processing Systems. 2017.

[30] andreykurenkov.com/writ

[31] UCL Course on RL: www0.cs.ucl.ac.uk/staff

[32] Powell, Warren B. “AI, OR and control theory: A rosetta stone for stochastic optimization.” Princeton University (2012).

[33] Pomerleau, Dean A. “Alvinn: An autonomous land vehicle in a neural network.” Advances in neural information processing systems. 1989.

[34] Abbeel, Pieter, and Andrew Y. Ng. “Apprenticeship learning via inverse reinforcement learning.” Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.

[35] Osa, Takayuki, et al. “An algorithmic perspective on imitation learning.” Foundations and Trends® in Robotics 7.1-2 (2018): 1-179.

[36] fermatslibrary.com/arxi url=https%3A%2F%2Farxiv.org%2Fpdf%2F1406.2661.pdf

[37] Ho, Jonathan, and Stefano Ermon. “Generative adversarial imitation learning.” Advances in Neural Information Processing Systems. 2016.

[38] github.com/carla-simula

[39] 36kr.com/p/5129474.html

[40] Vezhnevets, Alexander Sasha, et al. “Feudal networks for hierarchical reinforcement learning.” arXiv preprint arXiv:1703.01161 (2017).

[41] Ranzato, Marc’Aurelio, et al. “Sequence level training with recurrent neural networks.” arXiv preprint arXiv:1511.06732 (2015).

[42] Bahdanau, Dzmitry, et al. “An actor-critic algorithm for sequence prediction.” arXiv preprint arXiv:1607.07086 (2016).

[43] Keneshloo, Yaser, et al. “Deep Reinforcement Learning For Sequence to Sequence Models.” arXiv preprint arXiv:1805.09461 (2018).

[44] Watters, Nicholas, et al. “Visual interaction networks.” arXiv preprint arXiv:1706.01433 (2017).

[45] Hamrick, Jessica B., et al. “Relational inductive bias for physical construction in humans and machines.” arXiv preprint arXiv:1806.01203 (2018).

[46] Zambaldi, Vinicius, et al. “Relational Deep Reinforcement Learning.” arXiv preprint arXiv:1806.01830 (2018).

[47] Santoro, Adam, et al. “Relational recurrent neural networks.” arXiv preprint arXiv:1806.01822 (2018).

[48] Battaglia, Peter W., et al. “Relational inductive biases, deep learning, and graph networks.” arXiv preprint arXiv:1806.01261 (2018).

[49] Eslami, SM Ali, et al. “Neural scene representation and rendering.” Science 360.6394 (2018): 1204-1210.

[50] Huang, Sandy, et al. “Adversarial attacks on neural network policies.” arXiv preprint arXiv:1702.02284 (2017).

[51] Behzadan, Vahid, and Arslan Munir. “Vulnerability of deep reinforcement learning to policy induction attacks.” International Conference on Machine Learning and Data Mining in Pattern Recognition. Springer, Cham, 2017.

[52] Tesauro, Gerald. “Temporal difference learning and TD-Gammon.” Communications of the ACM 38.3 (1995): 58-68.

[53] Jones, Rebecca M., et al. “Behavioral and neural properties of social reinforcement learning.” Journal of Neuroscience 31.37 (2011): 13039-13045.

[54] github.com/YBIGTA/DeepN

[55] Lewis, Mike, et al. “Deal or no deal? end-to-end learning for negotiation dialogues.” arXiv preprint arXiv:1706.05125 (2017).

[56] microsoft.com/zh-cn/tra

[57] Derhami, Vali, et al. “Applying reinforcement learning for web pages ranking algorithms.” Applied Soft Computing 13.4 (2013): 1686-1692.

[58] Wu, Jiawei, Lei Li, and William Yang Wang. “Reinforced Co-Training.” arXiv preprint arXiv:1804.06035 (2018).

[59] Li, Yuxi. “Deep reinforcement learning: An overview.” arXiv preprint arXiv:1701.07274 (2017).

[60] Feinberg, Eugene A., and Adam Shwartz, eds. Handbook of Markov decision processes: methods and applications. Vol. 40. Springer Science & Business Media, 2012.

[61] en.wikipedia.org/wiki/C

[62] Turing, Alan M. “Computing machinery and intelligence.” Parsing the Turing Test. Springer, Dordrecht, 2009. 23-65.

[63] en.wikipedia.org/wiki/A

[64] Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.

[65] Noë, Alva. Action in perception. MIT press, 2004.

[66] Garnelo, Marta, Kai Arulkumaran, and Murray Shanahan. “Towards deep symbolic reinforcement learning.” arXiv preprint arXiv:1609.05518 (2016).

[67] argmin.net/2018/04/16/e

[68] Bojarski, Mariusz, et al. “End to end learning for self-driving cars.” arXiv preprint arXiv:1604.07316 (2016).

[69] Recht, Benjamin. “A Tour of Reinforcement Learning: The View from Continuous Control.” arXiv preprint arXiv:1806.09460 (2018).

[70] Kalashnikov, Dmitry, et al. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation.” arXiv preprint arXiv:1806.10293 (2018).

[71] Tan, Jie, et al. “Sim-to-Real: Learning Agile Locomotion For Quadruped Robots.” arXiv preprint arXiv:1804.10332 (2018).

[72] Auer, Peter. “Using confidence bounds for exploitation-exploration trade-offs.” Journal of Machine Learning Research 3.Nov (2002): 397-422.

[73] Agrawal, Shipra, and Navin Goyal. “Thompson sampling for contextual bandits with linear payoffs.” International Conference on Machine Learning. 2013.

[74] Mohamed, Shakir, and Danilo Jimenez Rezende. “Variational information maximisation for intrinsically motivated reinforcement learning.” Advances in Neural Information Processing Systems. 2015.

[75] Pathak, Deepak, et al. “Curiosity-driven exploration by self-supervised prediction.” International Conference on Machine Learning (ICML). Vol. 2017. 2017.

[76] Tang, Haoran, et al. “# Exploration: A study of count-based exploration for deep reinforcement learning.” Advances in Neural Information Processing Systems. 2017.

[77] McFarlane, Roger. “A Survey of Exploration Strategies in Reinforcement Learning.” McGill University, www.cs.mcgill.ca/~cs526/roger.pdf, accessed: April 2018.

[78] Plappert, Matthias, et al. “Parameter space noise for exploration.” arXiv preprint arXiv:1706.01905 (2017).

[79] Fortunato, Meire, et al. “Noisy Networks for Exploration.” arXiv preprint arXiv:1706.10295 (2017).

[80] Kansky, Ken, et al. “Schema networks: Zero-shot transfer with a generative causal model of intuitive physics.” arXiv preprint arXiv:1706.04317 (2017).

[81] Li, Da, et al. “Learning to generalize: Meta-learning for domain generalization.” arXiv preprint arXiv:1710.03463 (2017).

[82] berkeleyautomation.github.io

[83] contest.openai.com/2018
