Qin Haoran is a native of Shenyang, a graduate of Northeastern University, and an enthusiast of reinforcement learning. Traditional software development is the first wave; AI is the last wave.

Speaking of getting started with reinforcement learning, I wonder whether you, too, stepped into this world through Sarsa, Q-Learning, DQN, Policy Gradient, and DDPG. With those basic algorithms behind us, today we will look at a more advanced algorithm, SAC, and see how easily it can be applied to the GYM Box2D LunarLanderContinuous-v2 task using PaddlePaddle's PARL reinforcement learning framework, so that our lunar lander can adapt to all kinds of conditions and touch down smoothly.

This article mainly covers the following three parts:

  • An introduction to the SAC paper

  • A walkthrough of the SAC sample code

  • How to use the SAC algorithm to fly the lunar lander

SAC algorithm paper

SAC, short for Soft Actor-Critic, was proposed in 2018 by Tuomas Haarnoja et al. at the Berkeley Artificial Intelligence Research lab (BAIR). Original paper link:

Arxiv.org/abs/1801.01…

If you have read the paper, you will notice that SAC can be seen as an enhanced version of DDPG. So why does the paper want to enhance DDPG in the first place?

Question 1: Why enhance DDPG?

According to the paper, two factors make the practical application of deep reinforcement learning difficult:

  • Very high sample complexity

  • Brittle convergence properties

To overcome these two difficulties, the paper proposes SAC (Soft Actor-Critic), an off-policy actor-critic deep reinforcement learning algorithm based on the maximum entropy reinforcement learning framework.

Question 2: How is DDPG enhanced?

Before answering that, what is maximum entropy? In DDPG, the objective is to maximize reward. In SAC, the objective is to maximize not only the reward but also the entropy of the policy distribution. This is what maximum entropy means.

Some of you might ask: well, what is entropy? Setting aside thermodynamic entropy and information entropy, entropy, as I understand it, is a measure of how random a distribution is: the more random the distribution, the higher its entropy. When the actor takes entropy maximization as one of its optimization goals, it is also maximizing the randomness of the policy distribution. The more random the policy distribution, the more exploratory and the more stable the algorithm becomes.
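To put this into a formula (using the notation of the SAC paper, where $\alpha$ is the temperature coefficient that weights the entropy term against the reward):

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big], \qquad \mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big]$$

With $\alpha = 0$ this reduces to the usual "maximize reward only" objective that DDPG optimizes.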

Question 3: What does the enhancement bring?

So what does adding entropy maximization actually enhance? My personal understanding is that it keeps the decision distribution from concentrating on a single optimal solution the way DDPG tends to; instead, several equally good strategies are kept alive at the same time. This greatly increases robustness and helps the agent adapt to a wide variety of environments.

Having answered these three questions, we now have a basic understanding of SAC. Next, let's look at the experimental data from the paper to get a sense of its power. The paper benchmarks SAC against several other mainstream deep reinforcement learning algorithms and plots the training curves on six reinforcement learning tasks; SAC is drawn in yellow in the figure.

As the training curves show, SAC is stable across tasks of varying difficulty (the yellow shaded area is narrow and stays close to the solid line). On Hopper-v1, HalfCheetah-v1, Ant-v1, and Humanoid (rllab), SAC's final return is significantly higher than that of the other algorithms, and it shows a particularly clear advantage on Humanoid (rllab), the most complex task with an action space of up to 21 dimensions.

SAC algorithm example

If SAC is so good, is it difficult to implement? Let's look at the SAC example in PARL, the open-source reinforcement learning framework built on PaddlePaddle.

Example code link (the PARL-based SAC example on GitHub):

Github.com/PaddlePaddl…

(If you have difficulty accessing GitHub, you can search Gitee or find it in the example directory.)

The structure of the PARL framework is roughly as shown in the figure above: it adopts a nested, layer-upon-layer design that fits most reinforcement learning algorithms. For SAC specifically, the innermost Model encapsulates the network structures of the Q network and the policy network, outputting Q values and action values through value() and policy(). One layer out, Algorithm mainly encapsulates the loss functions: it outputs action values through predict() and updates the weights of the inner Model through learn(). The outermost Agent is responsible for interacting with the environment and feeding the data obtained from the environment to Algorithm.
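To make the layering concrete, the call flow can be pictured roughly as follows (an illustrative sketch of the responsibilities, not the framework source):

# Outermost -> innermost call flow in PARL's layered design (illustrative only):
#
#   Agent.predict(obs)  ->  Algorithm.predict(obs)  ->  Model.policy(obs)
#   Agent.learn(batch)  ->  Algorithm.learn(batch)  ->  gradients update Model weights
#
# Agent talks to the environment; Algorithm owns the losses; Model owns the networks.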

Thanks to the structure of the PARL framework and the algorithm it already encapsulates, the whole implementation is very clean. It is split across three files:

  • mujoco_agent.py

  • mujoco_model.py

  • train.py

First, take a look at mujoco_model.py, which defines two Model classes implementing the policy network and the Q network, the actor and the critic in SAC. Both classes inherit from PARL's Model. ActorModel defines the actor (policy) network structure and implements the policy method, which feeds the input obs, i.e. the current state of the environment, through the internal policy network and outputs the mean and log standard deviation of the action distribution.

import parl
from parl import layers  # PARL's wrappers around paddle.fluid.layers

LOG_SIG_MAX = 2.0    # upper clip for the log standard deviation (typical SAC setting)
LOG_SIG_MIN = -20.0  # lower clip for the log standard deviation (typical SAC setting)

class ActorModel(parl.Model):
    def __init__(self, act_dim):
        hid1_size = 400
        hid2_size = 300
        self.fc1 = layers.fc(size=hid1_size, act='relu')
        self.fc2 = layers.fc(size=hid2_size, act='relu')
        self.mean_linear = layers.fc(size=act_dim)     # mean of the action distribution
        self.log_std_linear = layers.fc(size=act_dim)  # log std of the action distribution
    def policy(self, obs):
        hid1 = self.fc1(obs)
        hid2 = self.fc2(hid1)
        means = self.mean_linear(hid2)
        log_std = self.log_std_linear(hid2)
        log_std = layers.clip(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX)
        return means, log_std
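The means and log_std returned by policy() are not yet an action; inside PARL's SAC algorithm they parameterize a Gaussian that is sampled and then squashed with tanh (the reparameterization trick). The snippet below is only a minimal NumPy sketch of that idea to aid intuition, not the framework's actual implementation; sample_action and max_action are illustrative names:

import numpy as np

def sample_action(means, log_std, max_action=1.0):
    # Illustrative only: turn the policy outputs into a bounded action.
    std = np.exp(log_std)                       # log std-dev -> std-dev
    noise = np.random.normal(size=means.shape)  # reparameterization noise
    raw_action = means + std * noise            # sample from N(means, std)
    return max_action * np.tanh(raw_action)     # squash into [-max_action, max_action]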

CriticModel defines the critic (Q network) structure and implements the value method, which takes the input obs together with the action values produced by the policy network and outputs two Q values through two internal Q networks.

class CriticModel(parl.Model):
    def __init__(self):
        hid1_size = 400
        hid2_size = 300
        # first Q head
        self.fc1 = layers.fc(size=hid1_size, act='relu')
        self.fc2 = layers.fc(size=hid2_size, act='relu')
        self.fc3 = layers.fc(size=1, act=None)
        # second Q head, same structure but independent weights
        self.fc4 = layers.fc(size=hid1_size, act='relu')
        self.fc5 = layers.fc(size=hid2_size, act='relu')
        self.fc6 = layers.fc(size=1, act=None)
    def value(self, obs, act):
        # Q1: embed obs, concatenate the action, then regress a scalar Q value
        hid1 = self.fc1(obs)
        concat1 = layers.concat([hid1, act], axis=1)
        Q1 = self.fc2(concat1)
        Q1 = self.fc3(Q1)
        Q1 = layers.squeeze(Q1, axes=[1])
        # Q2: a second, independent estimate for the same (obs, act) pair
        hid2 = self.fc4(obs)
        concat2 = layers.concat([hid2, act], axis=1)
        Q2 = self.fc5(concat2)
        Q2 = self.fc6(Q2)
        Q2 = layers.squeeze(Q2, axes=[1])
        return Q1, Q2
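Why two Q outputs? SAC takes the minimum of the two estimates when building its learning target, which suppresses Q-value overestimation (the clipped double-Q trick also used in TD3). The snippet below is a rough NumPy sketch of how such a soft target is typically formed; soft_q_target, gamma, and alpha are illustrative names, and the real update in PARL also involves target networks:

import numpy as np

def soft_q_target(reward, terminal, next_q1, next_q2, next_log_pi,
                  gamma=0.99, alpha=0.2):
    # Illustrative sketch of a SAC-style target, not PARL's actual code.
    min_next_q = np.minimum(next_q1, next_q2)      # clipped double-Q
    soft_value = min_next_q - alpha * next_log_pi  # entropy-augmented value
    return reward + gamma * (1.0 - terminal) * soft_value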

Next, take a look at mujoco_agent.py, which encapsulates the MujocoAgent class inheriting from PARL's Agent. It mainly implements three methods: predict, sample, and learn. predict outputs an action value based on the data obtained from the environment, namely the environment state. sample also outputs an action value from the environment state; the difference is that it injects randomness into the action so that new actions can be explored. learn optimizes the network parameters inside the model according to the environment state, action, reward, and other data.

class MujocoAgent(parl.Agent):
    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act = self.fluid_executor.run(
            self.pred_program, feed={'obs': obs},
            fetch_list=[self.pred_act])[0]
        return act

    def sample(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act = self.fluid_executor.run(
            self.sample_program,
            feed={'obs': obs},
            fetch_list=[self.sample_act])[0]
        return act

    def learn(self, obs, act, reward, next_obs, terminal):
        feed = {
            'obs': obs,
            'act': act,
            'reward': reward,
            'next_obs': next_obs,
            'terminal': terminal
        }
        [critic_cost, actor_cost] = self.fluid_executor.run(
            self.learn_program,
            feed=feed,
            fetch_list=[self.critic_cost, self.actor_cost])
        self.alg.sync_target()  # synchronize the target network after each learning step
        return critic_cost[0], actor_cost[0]

Finally, take a look at train.py, the training script. After the actor and critic models are instantiated, they are wrapped layer by layer with the SAC algorithm encapsulated by PARL, finally yielding an agent instance that can interact with the environment:

    actor = ActorModel(act_dim)
    critic = CriticModel()
    algorithm = parl.algorithms.SAC(
        actor,
        critic,
        max_action=max_action,
        gamma=GAMMA,
        tau=TAU,
        actor_lr=ACTOR_LR,
        critic_lr=CRITIC_LR)
    agent = MujocoAgent(algorithm, obs_dim, act_dim)

Experience replay can be implemented directly with PARL's built-in ReplayMemory class: import it from parl.utils and pass in the required parameters.

    from parl.utils import ReplayMemory
    rpm = ReplayMemory(MEMORY_SIZE, obs_dim, act_dim)
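Putting the agent and the replay memory together, the training loop in train.py follows the usual off-policy pattern: roll out with sample(), store transitions, and call learn() on random minibatches once enough data has accumulated. Below is a simplified sketch of that loop; WARMUP_SIZE and BATCH_SIZE are illustrative values, and the details (reward scaling, episode bookkeeping, evaluation) live in the actual train.py:

WARMUP_SIZE = 10000   # illustrative: transitions collected before learning starts
BATCH_SIZE = 256      # illustrative minibatch size

obs = env.reset()
while True:
    action = agent.sample(obs)                       # exploratory (stochastic) action
    next_obs, reward, done, _ = env.step(action)
    rpm.append(obs, action, reward, next_obs, done)  # store the transition
    if rpm.size() > WARMUP_SIZE:
        batch_obs, batch_act, batch_reward, batch_next_obs, batch_terminal = \
            rpm.sample_batch(BATCH_SIZE)
        agent.learn(batch_obs, batch_act, batch_reward,
                    batch_next_obs, batch_terminal)
    obs = next_obs if not done else env.reset()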

Once the above setup is in place, run python train.py and training happily begins.

Space is limited, so this is only a brief sketch of the idea; interested readers can go straight to the source code.

SAC algorithm practice

Finally, let's see through practice how the SAC algorithm performs on the GYM Box2D LunarLanderContinuous-v2 task.

The code, again based on the PARL framework, is also very simple. Compared with the sample, there are two major changes in the training script.

First, import the gym library:

import gym

Then, create the lunar lander environment in main:

env = gym.make('LunarLanderContinuous-v2')
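The environment also tells us the dimensions that the training script needs when building the model and the replay memory. For LunarLanderContinuous-v2 the observation is 8-dimensional, the action is 2-dimensional, and actions are bounded in [-1, 1]:

obs_dim = env.observation_space.shape[0]      # 8 for LunarLanderContinuous-v2
act_dim = env.action_space.shape[0]           # 2: main engine and lateral engines
max_action = float(env.action_space.high[0])  # 1.0, since actions live in [-1, 1]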

In addition, to visualize the training process, we also add code for saving and loading the rewards and for plotting:

np.save('train_step_list', train_step_list)
np.save('train_reward_list', train_reward_list)
np.save('evaluate_step_list', evaluate_step_list)
np.save('evaluate_reward_list', evaluate_reward_list)

Add the following statements to the notebook to visualize the output:

import numpy as np
import matplotlib.pyplot as plt

train_step_list = np.load('train_step_list.npy')
train_reward_list = np.load('train_reward_list.npy')

plt.figure()
plt.title('train reward')
plt.xlabel('step')
plt.ylabel('reward')
plt.plot(train_step_list, train_reward_list)
plt.grid()
plt.show()

After a few hours of training on the GPU, we can get the visual output shown below.

Judging from the training and evaluation output, convergence is good, and the advantages of the SAC algorithm are well reflected in this exercise.

The full review

First, we summarized the main content of the paper and analyzed the purpose, principle, and effect of the SAC algorithm: by introducing maximum entropy, it keeps the decision distribution diversified, so it can adapt to more complex practical applications and achieve better results.

Second, we walked through the SAC sample provided by the PARL framework. Thanks to the power of PARL, the sample code is concise, clear, and easy to read and reuse.

Finally, in the practice section, we applied the framework to the LunarLanderContinuous-v2 environment. With the help of the SAC algorithm, our lunar lander landed smoothly and accurately, and the training process showed fast convergence and stable output.

Practice code link:

Aistudio.baidu.com/aistudio/pr…

Want to see how to implement the SAC algorithm succinctly? Want to train and fly the lunar lander yourself? Click the link above to check out the open-source code, fork it, and run it.