Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning

Model-free reinforcement learning has been demonstrated successfully in a range of fields, including robotics, control, games, and self-driving cars. These systems learn by pure trial and error, and therefore require a large amount of environment interaction before they can solve a given task. In contrast, model-based reinforcement learning (MBRL) methods learn models of the environment (often referred to as world models or dynamics models) that enable the agent to predict the outcomes of potential actions, which reduces the amount of environment interaction needed to solve a task.

In principle, all that is needed for planning is a prediction of future rewards, which can then be used to select near-optimal future actions. Nevertheless, many recent methods, such as Dreamer, PlaNet, and SimPLe, also take a training signal from predicting future images. But are predicted future images really necessary, or even helpful? What benefit do visual MBRL algorithms actually derive from also predicting future images? The computational and representational cost of predicting entire images is considerable, so understanding whether doing so is truly useful matters for MBRL research.
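To make the planning principle concrete, below is a minimal sketch of cross-entropy-method (CEM) planning driven purely by predicted rewards. The `predicted_return` function is a hypothetical stand-in for a learned model, not the implementation of any method mentioned above.

```python
import numpy as np

# Minimal sketch of reward-only planning with the cross-entropy method (CEM).
# predicted_return is a placeholder for a learned model: anything that maps an
# action sequence (plus the current observation) to a predicted total reward.

def predicted_return(obs, action_seq):
    """Hypothetical learned model: predicts the return of an action sequence."""
    # Placeholder: rewards are higher for small, smooth actions.
    return -np.sum(action_seq ** 2)

def cem_plan(obs, horizon=10, action_dim=2, iterations=5, samples=64, elites=8):
    """Refit a Gaussian over action sequences to the highest-scoring candidates."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        candidates = mean + std * np.random.randn(samples, horizon, action_dim)
        returns = np.array([predicted_return(obs, c) for c in candidates])
        elite_set = candidates[np.argsort(returns)[-elites:]]
        mean, std = elite_set.mean(axis=0), elite_set.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan

print(cem_plan(obs=None))
```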

In “Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning,” we demonstrate that predicting future images provides a substantial benefit, and is in fact a key ingredient in training successful visual MBRL agents. We developed a new open-source library, the World Models Library, which allowed us to rigorously evaluate various world model designs and determine the relative impact of image prediction on the returns achieved by each model.

World Models Library

Designed for visual MBRL training and evaluation, the World Models Library enables empirical studies of how each design decision affects the final performance of agents at scale, across multiple tasks. The library provides a platform-agnostic visual MBRL simulation loop and an API for seamlessly defining new world models, planners, and tasks, or for picking from an existing catalog, which includes agents (e.g., PlaNet), video models (e.g., SV2P), and a variety of DeepMind Control tasks and planners (e.g., CEM and MPPI).

Using the library, developers can study how varying design factors in MBRL, such as the model architecture or the representation space, affect agent performance on a set of tasks. The library supports training agents from scratch or from a pre-collected set of trajectories, as well as evaluating pre-trained agents on a given task. Models, planning algorithms, and tasks can be easily mixed and matched into any desired combination, as sketched below.
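As an illustration of this mix-and-match structure, here is a toy, self-contained sketch of a visual MBRL simulation loop. All class and function names here are assumptions made for exposition; they are not the World Models Library's actual API.

```python
import numpy as np

# Toy components, named for illustration only. The point is that one simulation
# loop accepts any (task, model, planner) combination.

class ToyImageTask:
    """Stand-in for a task with image observations."""
    def reset(self):
        self.t = 0
        return np.zeros((8, 8))
    def step(self, action):
        self.t += 1
        obs = np.random.rand(8, 8)
        reward = -float(np.abs(action).sum())
        return obs, reward, self.t >= 5, {}

class ToyWorldModel:
    """Stand-in world model: scores action sequences by predicted return."""
    def predict_return(self, obs, action_seq):
        return -float(np.abs(action_seq).sum())
    def train(self, trajectory):
        pass  # a real model would fit image and reward predictions here

class ToyPlanner:
    """Stand-in planner: sample candidate actions, keep the best under the model."""
    def plan(self, model, obs):
        candidates = np.random.randn(16, 1)
        scores = [model.predict_return(obs, c) for c in candidates]
        return candidates[int(np.argmax(scores))]

def simulation_loop(task, model, planner, episodes=2):
    for _ in range(episodes):
        obs, done, traj = task.reset(), False, []
        while not done:
            action = planner.plan(model, obs)
            next_obs, reward, done, _ = task.step(action)
            traj.append((obs, action, reward, next_obs))
            obs = next_obs
        model.train(traj)  # retrain the world model on the collected data

simulation_loop(ToyImageTask(), ToyWorldModel(), ToyPlanner())
```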

To give users maximum flexibility, the library uses a NumPy interface between components, which allows different components to be implemented in TensorFlow, PyTorch, or JAX. Check out this Colab for a quick introduction.
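For example, a component written in JAX can sit behind a NumPy-in, NumPy-out boundary so that its neighbors need not know which framework it uses internally. The following is a minimal sketch of that idea, not code from the library.

```python
import numpy as np
import jax.numpy as jnp

def jax_reward_head(features: np.ndarray) -> np.ndarray:
    """A JAX-implemented component hidden behind a NumPy-in / NumPy-out interface."""
    x = jnp.asarray(features)          # NumPy -> JAX on the way in
    prediction = jnp.tanh(x).mean(axis=-1)
    return np.asarray(prediction)      # JAX -> NumPy on the way out

print(jax_reward_head(np.random.randn(4, 16)))
```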

The Effects of Image Prediction

Using the World Models Library, we trained multiple world models with different levels of image prediction. All of these models used the same input (previously observed images) to predict images and rewards, but they differed in the fraction of the image they predicted. As the number of image pixels the agent predicts increases, agent performance, as measured by the true reward, generally improves.
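The sketch below illustrates the kind of knob being varied: a combined training loss in which image reconstruction is supervised on only a fraction of the pixels, alongside reward prediction. This is an assumed, simplified form, not the exact loss used in the paper.

```python
import numpy as np

def mbrl_loss(pred_image, true_image, pred_reward, true_reward,
              pixel_fraction=1.0, seed=0):
    """Reward-prediction error plus reconstruction error on a random pixel subset."""
    rng = np.random.default_rng(seed)
    mask = rng.random(true_image.shape) < pixel_fraction  # which pixels to supervise
    image_term = 0.0
    if mask.any():
        image_term = np.mean((pred_image[mask] - true_image[mask]) ** 2)
    reward_term = np.mean((pred_reward - true_reward) ** 2)
    return image_term + reward_term

img = np.random.rand(64, 64, 3)
print(mbrl_loss(img * 0.9, img, pred_reward=1.2, true_reward=1.0, pixel_fraction=1.0))
print(mbrl_loss(img * 0.9, img, pred_reward=1.2, true_reward=1.0, pixel_fraction=0.0))
```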

Interestingly, the correlation between reward-prediction accuracy and agent performance was much weaker, and in some cases more accurate reward prediction even led to worse agent performance. At the same time, there is a strong correlation between image reconstruction error and agent performance.
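A small sketch of how such a comparison can be quantified: given each model's image reconstruction error, reward-prediction error, and achieved return (the numbers below are invented purely for illustration), compute how strongly each error correlates with performance.

```python
import numpy as np

image_error  = np.array([0.9, 0.7, 0.5, 0.3, 0.2])    # per-model reconstruction error
reward_error = np.array([0.4, 0.5, 0.3, 0.45, 0.35])  # per-model reward-prediction error
returns      = np.array([120, 260, 400, 610, 700])    # per-model achieved task reward

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

print("image error vs return:", pearson(image_error, returns))
print("reward error vs return:", pearson(reward_error, returns))
```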

This phenomenon is directly related to exploration: the agent tries riskier and potentially less rewarding actions in order to gather more information about unknown options in the environment. This can be shown by training and comparing models in an offline setting (i.e., learning policies from a pre-collected dataset rather than through online RL, which learns by interacting with the environment). The offline setting guarantees that there is no exploration and that all models are trained on the same data. We observed that models that fit the data better generally perform better in the offline setting, and, surprisingly, these may differ from the models that perform best when learning and exploring from scratch.
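A minimal sketch of this offline protocol, under assumed names rather than the paper's exact setup: every model is fit to the same frozen dataset of transitions, so no exploration can influence the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
offline_dataset = [  # pre-collected (image, action, reward) transitions
    (rng.random((8, 8)), rng.standard_normal(2), float(rng.random()))
    for _ in range(256)
]

def fit_offline(train_step, dataset, epochs=3):
    """Run training passes over the fixed dataset; no environment interaction occurs."""
    for _ in range(epochs):
        for image, action, reward in dataset:
            train_step(image, action, reward)

def dummy_train_step(image, action, reward):
    # Stand-in for one optimization step on image reconstruction and reward prediction.
    pass

fit_offline(dummy_train_step, offline_dataset)
```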

Conclusion

We have shown empirically that predicting images can substantially improve task performance over models that only predict the expected reward. We have also shown that image-prediction accuracy strongly correlates with the final task performance of these models. These findings can be used to design better models, and they are particularly useful for any future setting in which the input space is high-dimensional and collecting data is expensive.

If you want to develop your own models and experiments, head over to our repository and Colabs, where you will find instructions on how to reproduce this work and how to use or extend the World Models Library.
