Reinforcement learning powers many of the most advanced AI applications, from AlphaGo to self-driving cars. How does this technology learn tasks from scratch and grow into an expert that performs "beyond human level"? This article gives a brief introduction. Adapted from DeepLearning4j, compiled by Heart of the Machine.
Neural networks have been responsible for recent breakthroughs in areas such as computer vision, machine translation and time series prediction, and they can be combined with reinforcement learning algorithms to achieve astonishing results such as AlphaGo (see: DeepMind's successor Go program, AlphaGo Zero, which learned without human knowledge and was published in Nature).
Reinforcement learning refers to goal-oriented algorithms that learn how to reach a goal or maximize some quantity over many steps; for example, maximizing the points earned in a game across a sequence of moves. These algorithms can start from a blank slate and, under the right conditions, reach human-level performance. Like a child spurred on by candy and punishment, they are penalized when they make wrong decisions and rewarded when they make right ones; that is where the reinforcement comes in.
Reinforcement learning algorithms combined with deep learning can beat human champions at Go and at Atari games. That may not sound impressive on its own, but it is a vast improvement over what these algorithms could do before, and the state of the art is advancing rapidly.
Two reinforcement learning algorithms, Deep Q-learning and A3C, have been implemented on Deeplearning4j and can already be used to play Doom.
Reinforcement learning addresses the problem of connecting immediate actions with the delayed returns they produce. Like humans, reinforcement learning algorithms sometimes have to wait a while to see the fruit of their decisions. They operate in a delayed-return environment, where it can be difficult to understand which action leads to which outcome over many time steps.
We can expect reinforcement learning algorithms to perform better and better in more ambiguous, real-world environments, where they may choose from an arbitrary number of possible actions rather than from the limited options of a video game. That is, over time we expect them to become valuable for achieving goals in the real world.
Introduction to reinforcement learning (docs.skymind.ai/docs?__hstc…).
Definition of reinforcement learning
Reinforcement learning can be understood through the concepts of agents, environments, states, actions, and rewards, which we explain in the following sections. Uppercase letters denote sets of things and lowercase letters denote instances of those things; for example, A is the set of all possible actions, and a is a specific action contained in that set.
- Agent: an entity capable of taking actions; for example, a drone making a delivery, or a Super Mario moving toward a goal in a video game. The reinforcement learning algorithm itself is the agent. In real life, the agent is you.
- Action (A): A is the set of all possible moves the agent can make. An action is almost self-explanatory, but note that the agent chooses from a list of possible actions. In a video game, that list might include running right or left, jumping high or low, crouching or standing still. In the stock market, it might include buying, selling or holding any of an array of securities and their derivatives. For a drone flying through the air, the action options consist of many possible velocities and accelerations in three dimensions.
- Environment: the world through which the agent moves. The environment takes the agent's current state and action as input, and returns the agent's reward and next state as output. If you are the agent, the environment is the set of physical laws and social rules that process your actions and determine the consequences of them.
- State (S): a state is the concrete, immediate situation in which the agent finds itself; that is, a specific place and moment, an immediate configuration that puts the agent in relation to other significant things such as tools, obstacles, enemies or rewards. It is the current situation returned by the environment. Have you ever been in the wrong place at the wrong time? That is a state.
- Reward (R): a reward is the feedback by which we measure the success or failure of an agent's actions. For example, in a video game, when Mario touches a coin he earns points. From any given state, the agent sends output to the environment in the form of an action, and the environment returns the agent's new state (which resulted from acting on the previous state) as well as a reward, if there is one. Rewards can be immediate or delayed. They effectively evaluate the agent's actions.
- Policy (π): the policy is the strategy the agent employs to determine its next action based on the current state.
- Value (V): the expected long-term return with discounting, as opposed to the short-term reward R. We define Vπ(s) as the expected long-term return of the current state s under policy π.
- Q-value or action value (Q): the Q-value is similar to the value above, except that it takes an extra parameter, the current action a. Qπ(s, a) refers to the long-term return of taking action a in the current state s under policy π.
So the environment is a function that transforms an action taken in the current state into the next state and a reward; the agent is a function that transforms the new state and reward into the next action. We can know the agent's function, but we cannot know the function of the environment. The environment is a black box of which we can only see the inputs and outputs. Reinforcement learning is the agent's attempt to approximate the environment's function, so that it can send actions into the black-box environment that maximize the rewards it returns.
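To make this loop concrete, here is a minimal sketch in Java. Everything in it is invented for illustration: the class names (Environment, Agent, RLLoopSketch), the toy random-walk task and its reward rule are not part of RL4J or any other library; they only show how (state, action) goes into the environment and (next state, reward) comes back to the agent.

```java
import java.util.Random;

// A minimal, hypothetical agent-environment loop; all names and the task are illustrative.
public class RLLoopSketch {

    // The environment: takes the current state and an action,
    // returns the next state and a reward (here: a toy one-dimensional walk with a goal at 5).
    static class Environment {
        int step(int state, int action) { return state + (action == 1 ? 1 : -1); }
        double reward(int nextState) { return nextState == 5 ? 1.0 : 0.0; }
    }

    // The agent: maps the observed state to the next action (here: a random policy).
    static class Agent {
        private final Random rng = new Random(42);
        int act(int state) { return rng.nextInt(2); } // 0 = move left, 1 = move right
    }

    public static void main(String[] args) {
        Environment env = new Environment();
        Agent agent = new Agent();
        int state = 0;
        for (int t = 0; t < 20; t++) {                    // each iteration is one time step t -> t+1
            int action = agent.act(state);                // agent: state -> action
            int nextState = env.step(state, action);      // environment: (state, action) -> next state
            double reward = env.reward(nextState);        // ... and a reward
            System.out.printf("t=%d state=%d action=%d reward=%.1f%n", t, state, action, reward);
            state = nextState;
        }
    }
}
```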
In the feedback loop above, the subscripts t and t+1 denote time steps and refer to different states: the state at time t and the state at time t+1. Unlike other forms of learning, such as supervised and unsupervised learning, reinforcement learning can only be thought of as a sequence of state-action pairs occurring one after the other.
Reinforcement learning judges actions by their consequences. It is goal-oriented and its goal is to learn the sequence of actions that will enable an agent to achieve its goal. Here are some examples.
- In video games, the goal is to finish the game with the highest possible score, so each additional point gained during the game affects the agent's subsequent behavior; in other words, the agent may learn that it should shoot battleships, touch coins or dodge meteors in order to maximize its score.
- In the real world, a robot's goal might be to travel from point A to point B, and every inch the robot moves closer to point B counts as points.
Reinforcement learning can be distinguished from supervised and unsupervised learning by how it interprets its inputs. We can illustrate the differences by describing the "thing" each of them learns.
- Unsupervised learning: "That thing looks like this other thing." (Unsupervised learning algorithms learn similarities between things that have no names, and by extension they can detect opposites, or perform anomaly detection by identifying unusual or dissimilar instances.)
- Supervised learning: "That thing is a double cheeseburger." (Labels, putting names to faces…) Supervised learning algorithms learn the correlations between data instances and their labels; in other words, they require a labeled dataset. Those labels are used to "supervise" and correct the algorithm when it makes wrong guesses while predicting labels.
- Reinforcement learning: "Eat that thing because it tastes good and will keep you alive longer." (Actions based on short-term and long-term rewards, such as the calories you ingest or the length of time you survive.) Reinforcement learning can be thought of as supervised learning in an environment with sparse feedback.
Domain selection in reinforcement learning
Think of a reinforcement learning agent as a blind person trying to navigate the world with only their ears and a white cane. Agents have small windows that allow them to perceive their environment, and those windows may not even be the most suitable way for them to perceive what is around them.
In fact, deciding which types of input and feedback your agent should pay attention to is a hard problem to solve. This is known as the domain selection problem. Algorithms that learn to play video games can largely ignore this problem, since their environments are man-made and strictly limited. Video games thus provide a sterile laboratory environment in which ideas about reinforcement learning can be tested. Domain selection requires human decisions, usually based on knowledge or theories about the problem to be solved; for example, the input domain for an algorithm in a self-driving car might include information from radar sensors, cameras and GPS data.
State-action pairs and complex probability distributions of reward
The goal of reinforcement learning algorithms is to learn the best action for any given state, which means actions have to be ranked and assigned values relative to one another. Since those actions are state-dependent, what we are really gauging is the value of state-action pairs; that is, an action taken in a state is something you do in some place. Here are a few examples showing that the value and meaning of an action depend on the state in which the agent takes it.
- If the action is marrying someone, then marrying a 35-year-old when you are 18 probably means something very different from marrying a 35-year-old when you are 90. The two outcomes may have different motivations and lead to different consequences.
- If the action is shouting "Fire!", it means something different in a crowded theater than it does next to a squad of men with rifles. If we do not know the context, we cannot predict the outcome of an action.
We use the Q function described above to map state-action pairs to the values we expect them to produce. The Q function takes the agent's state and action as input and maps them to probable rewards.
Reinforcement learning is the process of running the agent through sequences of state-action pairs, observing the rewards that result, and adapting the predictions of the Q function until it accurately predicts the best action for the agent to take. That prediction is known as the policy.
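As a rough illustration of that process, here is a toy tabular version (not the neural-network approach used later in this article; the class name, states, actions and numbers are all invented): each state-action pair gets its own stored value, which is nudged toward the observed outcome after every transition.

```java
// A toy tabular Q-function update; a sketch only, with illustrative names and parameters.
public class QTableSketch {
    public static void main(String[] args) {
        int numStates = 6, numActions = 2;
        double[][] q = new double[numStates][numActions]; // Q(s, a), initialized to 0
        double alpha = 0.1;   // learning rate: how far to nudge each estimate
        double gamma = 0.99;  // discount factor for delayed rewards

        // One observed transition: in state 3, action 1 led to state 4 and a reward of 1.0.
        int s = 3, a = 1, sNext = 4;
        double r = 1.0;

        // Best value currently predicted for the next state.
        double maxNext = Math.max(q[sNext][0], q[sNext][1]);

        // Move Q(s, a) toward the observed reward plus the discounted future estimate.
        q[s][a] += alpha * (r + gamma * maxNext - q[s][a]);

        System.out.printf("Updated Q(%d,%d) = %.3f%n", s, a, q[s][a]);
    }
}
```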
Reinforcement learning attempts to model a complex probability distribution over a large number of state-action pairs and the rewards associated with them. This is one reason reinforcement learning is paired with Markov decision processes (https://deeplearning4j.org/markovchainmontecarlo), a method of sampling from a complex distribution in order to infer its properties. It closely resembles the problem that inspired Stan Ulam to invent the Monte Carlo method: trying to infer from a given hand the chances of winning a card game.
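In that same spirit, here is a minimal Monte Carlo sketch (purely illustrative, not tied to RL4J; the deck, the event and the threshold are made up): instead of deriving a probability from first principles, we sample many random deals and count how often the event occurs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A tiny Monte Carlo estimate: sample many random deals and count how often
// an event of interest occurs, rather than computing its probability analytically.
public class MonteCarloSketch {
    public static void main(String[] args) {
        List<Integer> deck = new ArrayList<>();
        for (int rank = 1; rank <= 13; rank++)            // ranks 1..13
            for (int suit = 0; suit < 4; suit++)          // four suits each
                deck.add(rank);

        int trials = 100_000, hits = 0;
        for (int i = 0; i < trials; i++) {
            Collections.shuffle(deck);
            // Event of interest (arbitrary): the top two cards total 15 or more.
            if (deck.get(0) + deck.get(1) >= 15) hits++;
        }
        System.out.printf("Estimated probability: %.4f%n", (double) hits / trials);
    }
}
```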
Any statistical method is essentially an admission of ignorance. The immense complexity of some phenomena (biological, political, or related to board games) makes it impossible to reason about them from first principles. The only way to study them is statistically: measuring events superficially and trying to correlate them, even if we do not understand the mechanism by which they are related. Reinforcement learning, like deep neural networks, is one such strategy, relying on sampling to extract information from data.
Reinforcement learning is iterative. In its most interesting applications, it does not start out knowing which rewards state-action pairs will produce. Reinforcement learning algorithms learn these relations by running through states again and again, just as athletes or musicians iterate through practice to improve their performance.
The relationship between machine learning and time
You might think that reinforcement learning algorithms have a different relationship to time than humans do. We can run the same algorithm through the same states over and over, taking different actions, until we can infer which actions are best in which states. In effect, we give algorithms their own Groundhog Day (http://www.imdb.com/title/tt0107048/0), where they start out as fools and slowly gain wisdom.
Since humans never experience Groundhog Day outside the movie, reinforcement learning algorithms have the potential to learn more, and better, than humans. You could argue that the real advantage of these algorithms over humans lies not in their inherent nature but in their ability to live in parallel on many chips at once and to train tirelessly day and night, and thus to learn more. An algorithm trained on Go, such as AlphaGo, will have played far more games of Go than any human could hope to play in 100 lifetimes.
Deep neural networks and deep reinforcement learning
Where do neural networks fit in? Neural networks are the agent that learns to map state-action pairs to rewards. Like all neural networks, they use parameters to approximate the function relating inputs to outputs, and they learn by iteratively adjusting those parameters, or weights, in the direction that reduces the error.
In reinforcement learning, convolutional networks can be used to recognize an agent's state; for example, the screen that Mario is on, or the terrain in front of a drone. In other words, they perform their typical role of image recognition.
However, convolutional networks derive different interpretations from images in reinforcement learning than in supervised learning. In supervised learning, the network applies a label to an image; that is, it matches names to pixels.
More precisely, a convolutional network ranks the labels that best fit the image according to their probability. Shown a picture of a donkey, it might decide the picture is 80% likely to be a donkey, 50% likely to be a horse, and 30% likely to be a dog.
In reinforcement learning, given an image that represents a state, a convolutional network can rank the actions that can be taken in that state; for example, it might predict that running right will return 5 points, jumping 7 points, and running left none.
Having assigned values to the expected rewards, the Q function simply selects the state-action pair with the highest Q-value.
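Concretely, choosing the action then amounts to an argmax over the values predicted for the current state. The sketch below uses the illustrative numbers from the paragraph above; the class name and arrays are made up for this example.

```java
// Given per-action value estimates for the current state, pick the highest-valued action.
public class GreedyActionSketch {
    public static void main(String[] args) {
        String[] actions = {"run left", "run right", "jump"};
        double[] qValues = {0.0, 5.0, 7.0}; // predicted value of each action in this state

        int best = 0;
        for (int a = 1; a < qValues.length; a++)
            if (qValues[a] > qValues[best]) best = a;

        System.out.println("Chosen action: " + actions[best]); // prints "jump"
    }
}
```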
At the start of reinforcement learning, the neural network's parameters may be initialized randomly. Using feedback from the environment, the network can use the gap between its expected reward and the reward actually received to adjust its parameters and improve its interpretation of state-action pairs.
This feedback loop is analogous to error backpropagation in supervised learning. However, supervised learning begins with knowledge of the ground-truth labels the neural network is trying to predict. Its goal is to create a model that maps different images to their respective names.
Reinforcement learning relies on the environment to send the algorithm a scalar number in response to each new action. The rewards returned by the environment can be varied, delayed or affected by unknown variables, which introduces noise into the feedback loop.
This leads to a more complete expression of the Q function, one that takes into account not only the immediate reward produced by an action, but also the delayed rewards that may be returned several time steps deeper in the sequence.
Like humans, the Q function is recursive. Just as calling the wetware method human() contains within it another method human(), of which we are all the fruit, calling the Q function on a given state-action pair requires us to call a nested Q function to predict the value of the next state, which in turn depends on the Q function of the state after that, and so on.
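In standard notation, where γ is the discount factor that down-weights delayed rewards, this recursion is commonly written as a Bellman-style equation: the value of a state-action pair is the immediate reward plus the discounted value of the best action available in the next state.

```latex
% Bellman-style recursion for the Q function (standard notation):
Q(s_t, a_t) \;=\; \mathbb{E}\!\left[\, r_{t+1} \;+\; \gamma \max_{a'} Q(s_{t+1}, a') \,\right]
```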
Code
Examples for RL4J are available here (github.com/deeplearnin…).
```java
package org.deeplearning4j.examples.rl4j;

import java.io.IOException;
import org.deeplearning4j.rl4j.learning.HistoryProcessor;
import org.deeplearning4j.rl4j.learning.async.a3c.discrete.A3CDiscrete;
import org.deeplearning4j.rl4j.learning.async.a3c.discrete.A3CDiscreteConv;
import org.deeplearning4j.rl4j.mdp.ale.ALEMDP;
import org.deeplearning4j.rl4j.network.ac.ActorCriticFactoryCompGraphStdConv;
import org.deeplearning4j.rl4j.util.DataManager;

/**
 * @author saudet
 *
 * Main example for A3C with The Arcade Learning Environment (ALE)
 */
public class A3CALE {

    public static HistoryProcessor.Configuration ALE_HP =
            new HistoryProcessor.Configuration(
                    4,       //History length
                    84,      //resize width
                    110,     //resize height
                    84,      //crop width
                    84,      //crop height
                    0,       //cropping x offset
                    0,       //cropping y offset
                    4        //skip mod (one frame is picked every x frames)
            );

    public static A3CDiscrete.A3CConfiguration ALE_A3C =
            new A3CDiscrete.A3CConfiguration(
                    123,      //Random seed
                    10000,    //Max step by epoch
                    8000000,  //Max step
                    8,        //Number of threads
                    32,       //t_max
                    500,      //num step noop warmup
                    0.1,      //reward scaling
                    0.99,     //gamma
                    10.0      //td-error clipping
            );

    public static final ActorCriticFactoryCompGraphStdConv.Configuration ALE_NET_A3C =
            new ActorCriticFactoryCompGraphStdConv.Configuration(
                    0.00025,  //learning rate
                    0.000,    //l2 regularization
                    null, null, false
            );

    public static void main(String[] args) throws IOException {

        //record the training data in rl4j-data in a new folder
        DataManager manager = new DataManager(true);

        //setup the emulation environment through ALE, you will need a ROM file
        ALEMDP mdp = null;
        try {
            mdp = new ALEMDP("pong.bin");
        } catch (UnsatisfiedLinkError e) {
            System.out.println("To run this example, uncomment the \"ale-platform\" dependency in the pom.xml file.");
        }

        //setup the training
        A3CDiscreteConv<ALEMDP.GameScreen> a3c = new A3CDiscreteConv(mdp, ALE_NET_A3C, ALE_HP, ALE_A3C, manager);

        //start the training
        a3c.train();

        //save the model at the end
        a3c.getPolicy().save("ale-a3c.model");

        //close the ALE env
        mdp.close();
    }
}
```