1. What is reinforcement learning

In most other machine learning algorithms, the learner is told what to do; in Reinforcement Learning (RL) the learner discovers, by trying things out, which actions yield the largest reward in a given situation. In many scenarios, the current action affects not only the immediate reward but also future states and, through them, a whole series of later rewards. The three most important characteristics of RL are:

  1. It is fundamentally a closed-loop process;
  2. The learner is not told directly which actions to take;
  3. A sequence of actions and reward signals can have effects that extend far into the future.

Reinforcement Learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning a policy that maximizes its return, or achieves a specific goal, while interacting with its environment [1].

In the figure above, the agent represents the learner itself. In autonomous driving the agent is the car; in a game it is the character you currently control, such as Mario. As Mario moves forward the environment changes all the time: small monsters or obstacles appear and he must jump to avoid them. These choices are actions (walking forward, jumping); for a self-driving car the actions are turning left, turning right, braking, and so on. The agent interacts with the environment continuously: each action feeds back into the environment and changes it. If the self-driving car's target is 100 meters away and it drives forward 10 meters, the environment has changed; every action changes the environment, and that change is fed back to the agent. This cycle repeats over and over. Feedback comes in two forms:

  1. Positive feedback (a reward) when the agent does well;
  2. Negative feedback (a penalty) when it does badly.

An agent may do well or badly, and the environment always gives it feedback. The agent tries its best to make decisions that work in its favor. By repeating this cycle, the agent does better and better, just as a child gradually learns to tell right from wrong while growing up. This is reinforcement learning.

2. Reinforcement learning model

As shown on the left side of the figure above, an agent (e.g. a player) takes an action that affects the environment, i.e. changes its state; the environment feeds back to the agent, the agent receives a reward (e.g. points or a score), and the cycle continues until it ends.

This process is equivalent to a Markov decision process. Why is it called that? Because it satisfies the Markov assumption:

  • The current state St is determined only by the previous state St-1 and the action taken there, and is independent of any earlier states.

As shown on the right-hand side of the figure above, state s0 transitions to s1 under action a0 and yields reward r1, then transitions to s2 under a1 and yields reward r2, and so on until the episode ends.
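
To make this loop concrete, here is a minimal sketch in Python. The `LineWorld` environment and the random action choice are toys invented for this illustration (they are not from any library); the agent simply acts, the environment changes and returns a reward, and the cycle repeats until the episode ends.

```python
import random

class LineWorld:
    """Toy environment: the agent starts at position 0 and tries to reach position 10."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        """Apply an action ("forward" or "back") and return (new_state, reward, done)."""
        self.position = max(0, self.position + (1 if action == "forward" else -1))
        done = self.position >= 10
        reward = 1.0 if done else -0.1   # positive feedback at the goal, a small penalty otherwise
        return self.position, reward, done

env = LineWorld()
state, done, total_reward = 0, False, 0.0
while not done:
    action = random.choice(["forward", "back"])   # a real agent would use past feedback here
    state, reward, done = env.step(action)        # the action changes the environment...
    total_reward += reward                        # ...and the environment feeds a reward back
print("episode finished, total reward:", total_reward)
```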

2.1 Discounted future rewards

From the description above we have established one principle: the agent's current decision should maximize its future return. The sum of rewards for a Markov decision process is then:

R = r1 + r2 + r3 + … + rn
The future reward from time t (the present) onward counts only the rewards that come after t, since earlier rewards can no longer be changed:

Rt = rt + rt+1 + rt+2 + … + rn
Next, **the action taken in the current situation yields an immediate result, but its impact on the future is uncertain.** This matches the real world: nobody knows whether a single flap of a butterfly's wings will end up causing a hurricane (the butterfly effect). Because the current action's future consequences are uncertain, future rewards are discounted by a coefficient gamma, a value between 0 and 1:

Rt = rt + γ·rt+1 + γ²·rt+2 + … + γ^(n−t)·rn
The further a reward lies in the future, the larger the power of gamma that multiplies it, so the more heavily it is discounted to reflect its uncertainty. The goal is to strike a balance between current and future decisions. If gamma is 0 we consider only the present and ignore the future; if gamma is 1 we weight the future as heavily as the present, which gives it too much influence. So gamma normally takes a value strictly between 0 and 1.

Rt can be expressed in terms of Rt+1 and written recursively:

Rt = rt + γ·Rt+1
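As a quick illustration (the reward values and gamma below are made up for the example), the direct sum and the recursive form give the same discounted return:

```python
# A toy reward sequence r_t, r_{t+1}, ..., r_n and a discount factor gamma in (0, 1).
rewards = [1.0, 0.0, 2.0, 3.0]
gamma = 0.9

# Direct definition: Rt = rt + gamma*rt+1 + gamma^2*rt+2 + ...
direct = sum(gamma ** i * r for i, r in enumerate(rewards))

# Recursive definition: Rt = rt + gamma*Rt+1, computed backwards from the end of the episode.
recursive = 0.0
for r in reversed(rewards):
    recursive = r + gamma * recursive

print(direct, recursive)   # both print the same value (4.807)
```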
2.2 Q-Learning algorithm

The Q(s, a) function (Q for Quality) represents the maximum discounted future reward when the agent takes action a in state s and then continues to act optimally afterwards:

Q(st, at) = max Rt+1
Assuming we have this Q function, at the current time t we can compute, for each candidate decision, the maximum return it can lead to; by comparing these values we find which decision at time t yields the highest return. The policy π(s) therefore simply picks the action with the largest Q value:

π(s) = argmax_a Q(s, a)
Therefore, applying the recursive formula for the return to the Q function, we obtain:

Q(s, a) = r + γ·max_a′ Q(s′, a′)
**This is the well-known Bellman equation.** The Bellman equation is quite intuitive: the maximum future reward for a state is the immediate reward plus the maximum future reward of the next state.

The core idea of **Q-learning** is that we can iteratively approximate the Q function using the Bellman equation.
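
Here is a minimal sketch of tabular Q-learning that applies this update iteratively. The toy environment (states 0-10, goal at state 10) and the hyperparameters alpha, gamma and epsilon are illustrative choices for the sketch, not values from the article:

```python
import random
from collections import defaultdict

# Toy environment: states 0..10, actions 0 (back) and 1 (forward), episode ends at state 10.
def step(state, action):
    next_state = min(10, max(0, state + (1 if action == 1 else -1)))
    done = next_state == 10
    reward = 1.0 if done else -0.1
    return next_state, reward, done

Q = defaultdict(lambda: [0.0, 0.0])     # Q table: state -> [Q(s, 0), Q(s, 1)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, occasionally explore.
        if random.random() < epsilon:
            action = random.randint(0, 1)
        else:
            action = max((0, 1), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# The learned greedy policy pi(s) = argmax_a Q(s, a) should pick "forward" (1) in every state.
print([max((0, 1), key=lambda a: Q[s][a]) for s in range(10)])
```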

2.3 Deep Q Learning (DQN)

Deep Q Learning (DQN) combines a neural network with the Q-learning method.

2.3.1 The role of the neural network

Tabular Q-learning stores, for every state, the Q value of every action in that state. The problem is that realistic tasks are so complex that there can be more states than stars in the sky (as in the game of Go). Storing them all in a table would exhaust our computers' memory, and looking up the current state in such a huge table every time would be slow. But one machine learning approach is good at exactly this kind of problem: neural networks.

We can take the state and action as the input of a neural network and let the network output the Q value of that action. This way we no longer record Q values in a table; the neural network produces them directly.

In another form, we input only the state and the network outputs a Q value for every action; following the Q-learning principle, we then directly select the action with the maximum value as the next action.

We can picture the neural network as taking in information from the outside world, like eyes, nose and ears, then outputting the value of each action through the brain, after which reinforcement learning selects the action.

2.3.2 Computing Q values with the neural network

This part works just like a supervised learning neural network: the input is the state and the output is the Q values. Using a large amount of data, we train the network's parameters and finally obtain a model that computes Q values for Q-learning.
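
As a hedged sketch of that idea (assuming PyTorch; the state size, action count, network width and learning rate are arbitrary choices for the example, not values from the article), the network takes a state as input, outputs one Q value per action, and is trained supervised-style to match the Bellman target:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.9   # illustrative sizes for the sketch

# The network replaces the Q table: state in, one Q value per action out.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(states, actions, rewards, next_states, dones):
    """One update: fit Q(s, a) to the Bellman target r + gamma * max_a' Q(s', a')."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q of the actions taken
    with torch.no_grad():                                             # target carries no gradient
        targets = rewards + GAMMA * q_net(next_states).max(1).values * (1 - dones)
    loss = nn.functional.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Acting: feed the current state in and take the action with the largest predicted Q value.
state = torch.randn(1, STATE_DIM)          # placeholder state for the sketch
action = q_net(state).argmax(dim=1).item()
```

A full DQN also uses an experience-replay buffer and a separate target network to stabilize training; the sketch keeps only the core idea of the network standing in for the Q table.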

3. Differences between reinforcement learning, supervised learning and unsupervised learning

  1. Supervised learning is like studying with a tutor at your side who knows what is right and what is wrong.

    Reinforcement learning has no labels: it first tries some behaviors, obtains a result, and then adjusts its earlier behavior based on feedback about whether the result was right or wrong. In this way the algorithm learns which behavior yields the best result in which situation.

  2. Supervised learning tells the algorithm which output corresponds to each input, and a bad prediction is fed back to the algorithm immediately.

    Reinforcement learning instead gives the machine a reward function, which is used to judge whether its behavior is good or bad. In addition, the feedback in reinforcement learning is delayed: sometimes many steps must be taken before you know whether a choice made at some earlier step was good or bad.

  3. The input of supervised learning is independent and identically distributed.

    Reinforcement learning always faces changing inputs: every action the algorithm takes influences the input for its next decision.

  4. Supervised learning algorithms ignore the exploration-exploitation trade-off and are purely exploitative.

    In reinforcement learning, the agent must trade off exploration against exploitation to obtain the greatest reward.

  5. Unsupervised learning does not learn an input-to-output mapping; it learns patterns in the data (it discovers the mapping on its own).

    Reinforcement learning obtains a mapping from states to actions by learning from training instances that carry no concept labels but are associated with a delayed reward or utility, which can be regarded as a delayed concept label.

The essential difference between reinforcement learning and the other two is that it has no clear notion of a dataset: it does not know the outcome in advance, only the goal. By "dataset" we mean a large amount of data; supervised and unsupervised learning require large amounts of data to train and optimize the model.

|  | Supervised learning | Unsupervised learning | Reinforcement learning |
| --- | --- | --- | --- |
| Labels | Correct, strict labels | No labels | No labels; adjusts through feedback on results |
| Input | Independent and identically distributed | Independent and identically distributed | Always changing; every action the algorithm takes affects the input for the next decision |
| Output | Mapping from input to output | Self-learned mapping (patterns) | A reward function judges whether an action is good or bad |

4. What is multi-task learning

In machine learning we are usually concerned with optimizing a specific metric, whether a standard benchmark score or a business KPI. To achieve this we train a single model, or an ensemble of models, to perform the given task, and then refine the model by fine-tuning its parameters until performance stops improving. While this can yield acceptable performance on that task, we may be ignoring information that could help us do better on the metric we care about: the supervision signals of related tasks. By sharing representations across related tasks, our model generalizes better on the original task. This approach is called multi-task learning.

Different tasks share some commonalities, and these commonalities are the connection point of multi-task learning: each task obtains its result through the shared part. For example, click-through rate and conversion rate prediction depend on the same input data and the same lower layers of a neural network; multilingual speech recognition is another example.
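
As a rough sketch of such a shared-bottom model (assuming PyTorch; the layer sizes and the CTR/CVR task names are illustrative choices, not a prescribed architecture), two tasks share the lower layers and differ only in their output heads:

```python
import torch
import torch.nn as nn

class SharedBottomModel(nn.Module):
    """Two tasks (e.g. click-through rate and conversion rate) share the lower layers."""
    def __init__(self, input_dim=32):
        super().__init__()
        self.shared = nn.Sequential(          # the commonality: a shared representation
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.ctr_head = nn.Linear(32, 1)      # task-specific head for task A
        self.cvr_head = nn.Linear(32, 1)      # task-specific head for task B

    def forward(self, x):
        h = self.shared(x)
        return torch.sigmoid(self.ctr_head(h)), torch.sigmoid(self.cvr_head(h))

model = SharedBottomModel()
x = torch.randn(8, 32)                                        # a fake batch of feature vectors
ctr_labels, cvr_labels = torch.rand(8, 1), torch.rand(8, 1)   # fake supervision for the sketch

ctr_pred, cvr_pred = model(x)
# Summing both losses lets gradients from both tasks flow into the shared layers,
# which is how the shared representation benefits from the related supervision signals.
loss = nn.functional.binary_cross_entropy(ctr_pred, ctr_labels) + \
       nn.functional.binary_cross_entropy(cvr_pred, cvr_labels)
loss.backward()
```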

5. References

  • GitHub
  • Reinforcement learning

Author: @mantchs

GitHub: github.com/NLP-LOVE/ML…

You are welcome to join the discussion and help improve this project! Group number: [541954936]