Introduction
Nowadays, "how to learn a new skill" has become a fundamental research question for scientists all over the world. It is easy to see why the problem is worth solving: if we understand how skill learning works, we can get machines to do things we might never have thought to program explicitly, and, in an era of genuine artificial intelligence, train them to take on more of the "human" jobs.
While we don't yet have a complete answer to these questions, a few things are clear. Whatever the skill, we first have to interact with an environment to learn it. Whether we are learning to drive a car or a baby is learning to walk, learning is grounded in interaction with the environment, and learning from interaction is a fundamental concept underlying nearly every theory of learning and intellectual development.
Reinforcement learning
Today we are going to talk about reinforcement learning, a class of learning algorithms based on interaction with the environment. Some believe that reinforcement learning is the real hope for strong artificial intelligence, and that is not unreasonable, because the potential of reinforcement learning is truly enormous.
Research on reinforcement learning is currently growing rapidly, and a wide variety of learning algorithms have been developed for different applications, so it is especially worthwhile to become familiar with reinforcement learning techniques. If you are not yet familiar with reinforcement learning, I suggest you check out my previous articles on reinforcement learning and on some open-source reinforcement learning platforms.
Once you have mastered and understood the basics of reinforcement learning, read on. By the end of this article, you should have a thorough understanding of reinforcement learning and be able to implement it in actual code.
Note: In the code implementation section, we assume you already have a basic knowledge of Python. If you don’t already know Python, you should check out this tutorial first.
1. Identify a reinforcement learning problem
Reinforcement learning is learning what to do, that is, how to map situations to actions through interaction with the environment, with the end goal of maximizing a numerical reward signal. Instead of being told which actions to take, the learner has to discover for itself which actions yield the greatest reward. Let's explain this with a simple example:
Let’s take a child learning to walk as an example.
Here are the steps your child takes to learn to walk:
1. The first thing the child notices is how you walk: you use both legs, taking one step at a time. The child latches onto this concept and tries to imitate you.
2. But soon he or she learns that before walking, one must stand up! This is a real challenge when learning to walk, so the child tries to stand up on his or her own, keeps falling down, and keeps getting back up.
3. Then there is another challenge to deal with. Standing up is relatively easy, but staying upright is another matter entirely. Grasping around for support, the child manages to stay standing.
4. Now the child's real job begins: learning to walk. But learning to walk is easier said than done. The child's brain has a lot to manage, such as balancing the body and deciding which foot to put down next and where.
That sounds like a very difficult task, right? It really was a challenge to learn to stand before learning to walk, but now that we have learned to walk, it no longer troubles us. Looking back at it, though, you can see why it is so difficult for a child.
Let's formalize the example above. The problem stated in the example is the "walking problem", in which the child is an agent trying to manipulate the environment (the ground it walks on) by taking actions (steps), and he or she tries to move from one state (each step taken) to the next. The child receives a reward (say, some chocolate) when he or she completes a sub-module of the task (taking a few steps), and receives no chocolate when he or she cannot walk (negative feedback). That is a simplified description of a reinforcement learning problem.
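To make the agent-environment vocabulary concrete, here is a minimal sketch of the interaction loop in Python. The Environment interface and the choose_action function are hypothetical placeholders for illustration only; the Gym-based code later in this article follows a similar pattern.

def run_episode(env, choose_action):
    """Run one episode: the agent acts, the (hypothetical) environment returns states and rewards."""
    state = env.reset()                            # start from the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(state)              # the agent picks an action
        state, reward, done = env.step(action)     # the environment responds with a new state and a reward
        total_reward += reward                     # accumulate the reward signal
    return total_reward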
This is a great introductory video on reinforcement learning.
2. Comparison with other machine learning methods
Reinforcement learning is one of the most talked-about families of machine learning algorithms. Below is a description of the main types of machine learning algorithms.
Let’s compare the differences between reinforcement learning algorithms and other types of algorithms:
- Supervised learning vs. reinforcement learning: In supervised learning, there is an external "supervisor" who has knowledge of the entire environment and shares that knowledge with the agent to help it complete its task. But there are problems with this: a single task can involve so many combinations of sub-tasks that the agent should perform to achieve the goal that creating such a "supervisor" is almost impractical. In a chess game, for example, there is an enormous number of moves that can be played, so building a knowledge base of winning play is a tedious task. In such problems, it is far more reasonable and feasible to learn from one's own experience and acquire knowledge that way. This is the main difference between reinforcement learning and supervised learning. In both supervised and reinforcement learning there is a mapping between inputs and outputs, but in reinforcement learning the agent receives a reward signal as feedback instead of being told the correct answer directly, as in supervised learning.
- Unsupervised learning vs. reinforcement learning: In reinforcement learning there is a mapping from input to output; this mapping does not exist in unsupervised learning, where the main task is to find underlying patterns rather than a mapping. For example, if the task is to recommend news articles to a user, an unsupervised learning algorithm looks at articles similar to those the person has read before and recommends more of the same. A reinforcement learning algorithm, on the other hand, gets continuous feedback from the user by recommending a few articles first, and then builds up a "knowledge graph" of which articles the user likes.
There is also a fourth type of machine learning called semi-supervised learning, which is essentially a combination of supervised and unsupervised learning. It differs from reinforcement learning in that, like supervised learning, semi-supervised learning works with direct reference answers (labels) for at least part of the data, whereas reinforcement learning does not have them.
3. Framework for solving reinforcement learning problems
To understand how to solve a reinforcement learning problem, let's walk through a classic example, the multi-armed bandit problem (a row of slot machines). It will first help us understand the fundamental trade-off between exploration and exploitation, and then we will define a framework for solving reinforcement learning problems.
The slot machines look like the picture above. Suppose you have played on slot machines many times before.
Now what you want to do is win as much money from the slot machines as possible, as quickly as possible. What would you do?
A naive idea would be to pick a single slot machine and keep pulling its lever all day. It sounds very boring, but the machine may give you some "reward", that is, let you win money. With this approach, your chance of hitting the jackpot is roughly 0.00000...1; most of the time you will simply be sitting in front of the slot machine losing money. Formally, this can be defined as a pure exploitation approach. But is it the best option? The answer, of course, is no.
Let's look at another way. We could pull the lever of every slot machine and pray that at least one of them pays off. This is another naive approach: you would be pulling levers all day long, and it would pay you very little. Formally, this approach is pure exploration.
Neither approach is optimal; we must find the right balance between them to get the maximum return. This is called the exploration-exploitation dilemma of reinforcement learning.
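One standard way to strike that balance is the epsilon-greedy rule: with a small probability epsilon you pick a random machine (explore), and otherwise you pick the machine with the best estimated payout so far (exploit). Here is a minimal sketch; the win probabilities in true_probs are invented purely for illustration.

import random

# Epsilon-greedy on a simple multi-armed bandit (sketch).
# true_probs holds made-up win probabilities for each machine.
true_probs = [0.1, 0.3, 0.7, 0.2]
n_arms = len(true_probs)
estimates = [0.0] * n_arms          # running estimate of each machine's value
counts = [0] * n_arms
epsilon = 0.1

for step in range(10000):
    if random.random() < epsilon:
        arm = random.randrange(n_arms)                          # explore: random machine
    else:
        arm = max(range(n_arms), key=lambda a: estimates[a])    # exploit: best machine so far
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    # incremental average update of the value estimate
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)    # should roughly recover true_probs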
First, we formally define a framework for solving reinforcement learning problems and then list possible approaches to solving the problem.
Markov decision process:
In reinforcement learning, the mathematical framework we use to define a problem is called a Markov decision process (MDP). It consists of:
- Set of states: S
- Set of actions: A
- Reward function: R
- Policy: π
- Value: V
We take certain actions (A) to move from a starting state toward an end state (S). In return for every action we take, we receive a reward. The nature of that reward (positive or negative) is, of course, determined by our actions.
Our policy (π) is defined by the actions we choose in each state, and the rewards we collect determine our value (V). Our task here is to maximize the value by choosing the right policy. In other words, we must maximize the expected reward E(r_t | π, s_t) for all possible states s_t at each time step t.
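As a concrete illustration of these five ingredients, here is how a tiny, hand-made MDP might be written down in Python. The state names, actions, and reward numbers are invented for the example.

# A tiny, hand-made MDP (sketch): the names and numbers are invented.
states = ["start", "middle", "goal"]                      # S
actions = ["left", "right"]                               # A

# R: reward received for taking an action in a state
rewards = {("start", "right"): -1, ("middle", "right"): 10,
           ("start", "left"): -5,  ("middle", "left"): -1}

# A deterministic policy pi: state -> action
policy = {"start": "right", "middle": "right"}

# Value of following the policy from "start": the sum of rewards collected along the way
value = rewards[("start", policy["start"])] + rewards[("middle", policy["middle"])]
print(value)    # -1 + 10 = 9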
Traveling salesman problem
Let’s take another example to illustrate.
This problem belongs to the family of travelling salesman problems (TSP). The task is to get from point A to point F at the lowest possible cost. The number on each edge between two letters is the reward for travelling between those two places; negative numbers represent the cost you pay for going that way. We define value as the total reward you accumulate when you complete the route under your chosen policy.
Here’s a notation:
- Set of state nodes: {A,B,C,D,E,F}
- The action set is from one location to another: {A->B, C->D, etc}
- The reward function is the edge value
- The policy function refers to the complete path planning, such as: {A -> C -> F}
Now suppose you are at location A, and the only things you can see are the roads to your possible next destinations (i.e., you can only see B, C, D, E); you know nothing about the locations beyond them.
You can take the greedy approach and choose the most favorable step from the current state, i.e., choose {A -> D} from {A -> (B, C, D, E)}. Similarly, now you are at D and want to reach F; you choose {D -> F} from {D -> (B, C, F)} because it gives you the best immediate payoff. So we take this path.
So far, our policy is {A -> D -> F}, and we get a return of -120.
Congratulations! You have just implemented a reinforcement learning algorithm. This is the epsilon-greedy algorithm, a greedy approach applied step by step to solve the problem. Now, if you (the salesman) want to go from point A to point F again, you will always choose this same route.
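Below is a minimal sketch of that step-by-step greedy choice on a graph. The edge values are invented placeholders (chosen so that the A -> D -> F route sums to the -120 mentioned above), since the figure with the actual numbers is not reproduced here.

# Greedy next-step selection on a reward graph (sketch).
# Edge values are invented placeholders; A -> D -> F sums to -120 as in the text.
graph = {
    "A": {"B": -70, "C": -60, "D": -50, "E": -80},
    "D": {"B": -90, "C": -80, "F": -70},
    "F": {},                                   # destination
}

def greedy_path(graph, start, goal):
    path, total = [start], 0
    node = start
    while node != goal:
        # pick the neighbour with the highest (least negative) edge reward
        node, reward = max(graph[node].items(), key=lambda kv: kv[1])
        path.append(node)
        total += reward
    return path, total

print(greedy_path(graph, "A", "F"))    # (['A', 'D', 'F'], -120)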
Other ways to travel?
Can you guess which category our policy falls into (pure exploration or pure exploitation)? Note that the policy we adopted is not an optimal one; we would have to "explore" a little in order to find the best policy. The approach we took here is a form of policy learning, and our task is to find the best policy among all possible policies. There are many ways to approach this; here, we briefly outline the main categories:
- Policy-based: the focus is on finding the optimal policy
- Value-based: the focus is on finding the optimal value, i.e., the cumulative reward
- Action-based: the focus is on taking the optimal action at each step
In future articles, I will discuss reinforcement learning algorithms in more depth. Until then, you can refer to this survey paper on reinforcement learning algorithms.
4. Implementing reinforcement learning
Next, we will use the deep Q-learning algorithm. Q-learning is a value-based learning algorithm, and in deep Q-learning the value function is approximated by a neural network. This is the algorithm Google DeepMind used to play Atari games at a superhuman level.
Let’s look at the pseudo-code for Q learning:
1. Initialize the value table Q(s, a).
2. Observe the current state s.
3. Select an action a for this state using an action-selection policy (for example, epsilon-greedy).
4. Take the action, and observe the reward r and the next state s'.
5. Update the value of Q(s, a) using the observed reward and the maximum value attainable from the next state, following the standard Q-learning update: Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ].
6. Set the state to the new state and repeat the process until a terminal state is reached.
In brief, Q-learning repeatedly updates a table of state-action values from experience until those values converge.
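Before moving on to the deep version, here is a minimal tabular Q-learning sketch that follows the pseudo-code above. It assumes the older gym API used elsewhere in this article and uses the discrete FrozenLake-v0 environment (CartPole's states are continuous, so a plain table would not apply there); the hyperparameters are illustrative.

import gym
import numpy as np

# Tabular Q-learning following the pseudo-code above (sketch).
env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # 1. initialize Q(s, a)
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s = env.reset()                                            # 2. observe the current state
    done = False
    while not done:
        # 3. epsilon-greedy action selection
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s])
        # 4. take the action, observe reward r and next state s'
        s_next, r, done, _ = env.step(a)
        # 5. update Q(s, a) towards r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        # 6. move to the new state
        s = s_next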
Let’s first take a look at the Cartpole problem and then continue writing our solution.
When I was a child, I remember picking up a stick and trying to balance it on one finger. My friends and I used to have a competition: whoever could keep the stick balanced for the longest time would get a chocolate bar as a reward.
Here is a simple video describing a real Cart-Pole system.
Let’s start coding!
Before we can start writing, we need to install some software.
Step 1: Install the Keras-RL package
From the terminal, you can run the following commands:
git clone https://github.com/matthiasplappert/keras-rl.git
cd keras-rl
python setup.py install
Step 2: Install the CartPole environment dependencies
We assume that you already have pip installed, so you just need to install the dependencies with the following commands:
pip install h5py
pip install gym
Step 3: Start writing code
First we need to import some modules that we need
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory
Then, set the relevant variables
ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions available in the CartPole problem
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n
Then, we construct a very simple single-layer neural network model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())
Next, we configure and compile our agent. We set the policy to epsilon-greedy and the memory to sequential memory, because we need to store the results of the actions we perform and the reward we receive for each action.
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
# Okay, now it's time to learn something! We visualize the training here for show, but this slows down training quite a lot.
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)
Now, let’s test our reinforcement learning model
dqn.test(env, nb_episodes=5, visualize=True)
Here is the output of the model:
Look, you just built a reinforcement learning robot!
5. Adding complexity
Now that you have seen a basic implementation of reinforcement learning, let's look at a few more problems, adding a little more complexity each time.
The Tower of Hanoi problem
For those of you who don't know the game: the Tower of Hanoi puzzle was invented in 1883. It consists of three rods and a number of disks of different sizes (three, in the image above), which start stacked on the leftmost rod. The goal is to move all the disks from the leftmost rod to the rightmost rod in the smallest number of moves.
If we are going to frame this as a reinforcement learning problem, let's start with the states:
- Initial state: all three disks are on the leftmost rod (numbered 1, 2 and 3 from top to bottom)
- End state: all three disks are on the rightmost rod (numbered 1, 2 and 3 from top to bottom)
All possible states:
Here are 27 possible states:
Here, (12)3* indicates that disks 1 and 2 are on the leftmost rod (ordered from top to bottom), disk 3 is on the middle rod, and the rightmost rod holds no disk.
Numerical rewards:
Since we want to solve the problem in the minimum number of moves, we can attach a reward of -1 to each move.
Policy:
Now, ignoring the technical details, there can be several possible next states from any given state. For example, from state (123)** we can move to state (23)1* or to state (23)*1, and in each case the reward is -1. If you now picture all of the states together, each of the 27 states mentioned above can be drawn as a node in a graph, much like the travelling salesman problem, and we can find the optimal solution by experimenting with the different states and paths.
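As a small sketch, here is one way the 27 states and the -1-per-move reward could be enumerated in Python. The tuple encoding and the legal_moves helper are my own choices for illustration, not part of the original article.

from itertools import product

# Enumerate the 27 Tower of Hanoi states for 3 disks (sketch).
# A state records which rod (0 = left, 1 = middle, 2 = right) each disk sits on;
# the stacking order on a rod is forced (bigger disks always lie below smaller ones),
# so every assignment of disks to rods corresponds to exactly one legal state.
states = list(product(range(3), repeat=3))      # (rod_of_disk1, rod_of_disk2, rod_of_disk3)
print(len(states))                              # 27

def legal_moves(state):
    """Yield (next_state, reward) pairs; every move costs -1."""
    for disk, rod in enumerate(state):
        # a disk can move only if no smaller disk sits on the same rod
        if any(state[d] == rod for d in range(disk)):
            continue
        for target in range(3):
            if target == rod:
                continue
            # the target rod must not hold a smaller disk
            if any(state[d] == target for d in range(disk)):
                continue
            yield state[:disk] + (target,) + state[disk + 1:], -1

start = (0, 0, 0)        # all disks on the leftmost rod, i.e. (123)**
goal = (2, 2, 2)         # all disks on the rightmost rod, i.e. **(123)
print(list(legal_moves(start)))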
The 3 x 3 Rubik's Cube problem
Although I could solve this problem for you, I want you to solve it yourself. Follow the same train of thought as above and you should be able to manage it.
Start by defining the start and end states. Next, define all possible states and their transitions, along with the rewards and the policy. Finally, you should be able to build your own solution using the same approach.
6. Keep abreast of the latest developments in reinforcement learning
As you will have realized, a Rubik's Cube is many times more complex than the Tower of Hanoi. Now imagine the number of states and possible policies in a board game like Go. Recently, Google DeepMind created a deep reinforcement learning algorithm (AlphaGo) that defeated Lee Sedol.
More recently, with the success of deep learning, the emphasis has been slowly shifting toward applying deep learning to reinforcement learning problems. In the latest flood of news, the deep reinforcement learning algorithm created by Google DeepMind defeated Lee Sedol at Go. Something similar has happened in video games, where deep reinforcement learning algorithms have reached human-level accuracy and, in some cases, surpassed it. Research and practice still need to advance together, with industry and academia planning joint efforts to build better adaptive, learning robots.
Here are a few of the main areas where reinforcement learning has been applied:
- Game theory and multi-agent interaction
- Robotics
- Computer networking
- Vehicle navigation
- Medicine
- Industrial logistics
There are still many unexplored areas, and combined with the current craze for applying deep learning to reinforcement learning, I believe there will be breakthroughs in the future!
Here’s one of the latest:
7. Other resources
I hope you now have a good understanding of how reinforcement learning works. Here are some additional resources to help you learn more about reinforcement learning. In later articles I will take you further on this reinforcement learning journey, so stay tuned!
- Videos on reinforcement learning
- Books on reinforcement learning
- GitHub’s repository for reinforcement learning
- David Silver’s Reinforcement learning course