
Project Description:

In this project, you will use reinforcement learning algorithms to build an autonomous maze-walking robot.

  • As shown in the figure above, the robot starts in the upper right corner of the maze. The maze contains traps (red bombs) and a destination (blue target point). The robot must avoid the traps and reach the destination as quickly as possible.

  • The actions the robot can perform are: up (u), right (r), down (d), and left (l).

  • After performing an action, the robot receives a reward that depends on the outcome. Specifically, there are the following cases:

    • Hitting a wall: -10

    • Reaching the destination: 50

    • Falling into a trap: -30

    • Any other move: -0.1

  • We need to implement a Q-learning robot by modifying the code in robot.py.

Section 1 Algorithm understanding

1.1 Overview of reinforcement learning

Reinforcement learning, like other machine learning approaches, lets an agent accumulate “experience” through “training” in order to accomplish a given task. Unlike supervised and unsupervised learning, however, the reinforcement learning framework focuses on learning through the interaction between the agent and its environment. In supervised and unsupervised learning, an agent typically learns from a given training set, a given training objective (such as minimizing a loss function), and a given learning algorithm. In reinforcement learning, the agent instead learns from the rewards it receives by interacting with the environment. The environment can be virtual (a virtual maze, for example) or real (a self-driving car collecting data on a real road).

There are five core components in reinforcement learning: Environment, Agent, State, Action, and Reward. At any time step t:

  • The agent perceives its state from the environment

  • The agent chooses an action according to some criterion

  • The environment feeds a reward back to the agent based on the chosen action

With a reasonable learning algorithm, an agent can successfully learn to accomplish its task in such a problem setting.
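
To make these components concrete, here is a tiny, self-contained toy loop. It is only an illustration, not this project's Maze/Robot API: the environment is a line of cells 0 to 5, the goal is cell 5, and the agent simply acts at random, receiving rewards in the style of the table above.

import random

position = 0                                 # the environment's internal state
for t in range(20):                          # time steps
    state = position                         # 1. the agent perceives its state
    action = random.choice(['l', 'r'])       # 2. the agent chooses an action
    position = max(0, min(5, position + (1 if action == 'r' else -1)))
    reward = 50 if position == 5 else -0.1   # 3. the environment feeds back a reward
    print(t, state, action, reward)
    if position == 5:                        # the episode ends at the goal
        break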

1.2 Calculating the Q value

In our project, we implement a reinforcement learning algorithm based on Q-learning. Q-learning is a value iteration algorithm. Unlike policy iteration algorithms, value iteration algorithms compute the value (utility) of each “state” or “state-action” pair and then try to maximize that value when choosing an action. An accurate estimate of each value is therefore the core of a value iteration algorithm. Usually we aim to maximize the long-term reward of an action: not only its immediate reward, but also the rewards it will lead to later.

In the Q-learning algorithm, we record this long-term reward as the Q value, and we maintain a Q value for every “state-action” pair. Specifically, it is calculated with the following formula:

Q(s, a) = R + γ · max_a' Q(s', a')

where s is the current state, a the action taken, R the immediate reward, s' the resulting next state, and γ the discount factor. So for the current state-action pair, the Q value is the immediate reward plus the discounted best Q value attainable from the next state.

Generally, however, we use a more conservative method to update the Q table: we introduce a relaxation factor α (alpha) and update according to

Q(s, a) ← (1 − α) · Q(s, a) + α · (R + γ · max_a' Q(s', a'))

so that the Q table changes more gently from iteration to iteration.

Let's work through an example with what we know.

Given: As shown in the figure above, the robot is located at S1 and takes action U; the rewards for actions are the default ones described above. The Q values of the actions available in S2 are: U: -24, R: -13, D: -0.29, L: +40, and γ is 0.9.
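
As a quick check, here is a sketch of the calculation (the figure is not reproduced here, so we assume moving up from S1 into S2 is an ordinary move with the default reward of -0.1):

gamma = 0.9
q_s2 = {'U': -24, 'R': -13, 'D': -0.29, 'L': 40}   # Q values of the actions in S2
reward = -0.1                                      # assumed: an ordinary move into S2
q_s1_u = reward + gamma * max(q_s2.values())       # -0.1 + 0.9 * 40
print(q_s1_u)                                      # 35.9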

1.3 How to select actions

Exploration versus exploitation is a very important trade-off in reinforcement learning. Specifically, according to the definition above, we try to maximize the long-term reward by having the robot choose the best decision every time. But this has the following disadvantages:

  • In the early stage of learning, our Q values are inaccurate. Choosing actions purely by Q value at this point will lead to errors.

  • After a period of learning, the robot's route becomes relatively fixed, and the robot can no longer explore the environment effectively.

So we need a way to address these problems and increase the robot's exploration. We therefore use the epsilon-greedy algorithm: when the robot selects an action, with some probability it picks a random action, and otherwise it picks the action with the best Q value. At the same time, the probability of choosing a random action should decrease gradually as training progresses.

In the following code block, implement the logic of the Epsilon-greedy algorithm and run the test code.

import random  
import operator  

actions = ['u', 'r', 'd', 'l']  
qline = {'u': 1.2, 'r': -2.1, 'd': -24.5, 'l': 27}  
epsilon = 0.3  # choose a random action with probability 0.3  

def choose_action(epsilon):  
    action = None  
    if random.uniform(0, 1.0) <= epsilon:  # with probability epsilon...  
        action = random.choice(actions)    # ...choose a random action  
    else:  
        # otherwise choose the action with the highest Q value  
        action = max(qline.items(), key=operator.itemgetter(1))[0]  
    return action  

res = ''  
for i in range(100):  
    res += choose_action(epsilon)  
print(res)  
ldllrrllllrlldlldllllllllllddulldlllllldllllludlldllllluudllllllulllllllllllullullllllllldlulllllrlr
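
The code above keeps epsilon fixed at 0.3. As noted, the probability of random exploration should shrink as training progresses. Below is a minimal sketch of two common decay schedules; the rates 0.99 and 0.005 are just illustrative values, to be tuned when you implement the decay in robot.py.

epsilon0 = 0.5

def exponential_decay(t, rate=0.99):
    # multiply epsilon by a constant factor at every step
    return epsilon0 * (rate ** t)

def linear_decay(t, rate=0.005):
    # subtract a constant at every step, never going below zero
    return max(0.0, epsilon0 - rate * t)

for t in (0, 50, 100, 200):
    print(t, round(exponential_decay(t), 3), round(linear_decay(t), 3))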

Section 2 Code implementation

2.1 Understanding the Maze class

We start with the Maze class. It is quite powerful: it can randomly create a maze according to your requirements, or read in maze map information from a given file.

  • Use Maze("file_name") to create a maze from the specified file, or use Maze(maze_size=(height, width)) to generate a maze randomly.

  • Use the trap_number parameter to set the number of traps when creating a maze.

  • Type the name of the maze variable and press Enter to display the maze image (e.g. g = Maze("xx.txt"), then g).

  • It is recommended that the maze be between 6 and 12 in height and between 10 and 12 in width.

In the following code block, create your maze and display it.


from Maze import Maze  
%matplotlib inline  
%config InlineBackend.figure_format = 'retina'  

## TODO: create the maze and display it  
g = Maze(maze_size=(6, 8), trap_number=1)  
g  
Maze of size (12, 12)

You may have noticed that we have placed a robot in the maze by default. In fact, we have configured an API for the maze to help the robot move and sense its surroundings. The two APIs you will use later are maze.sense_robot() and maze.move_robot().

  • maze.sense_robot() is a parameterless function that returns the robot's current position in the maze.

  • maze.move_robot(direction) moves the robot in the given direction and returns the reward value for that action.

In the following code block, move the robot randomly, record the rewards, and show the robot's final position.

rewards = []  
## Move the robot randomly 10 times in a loop and record the rewards  
for i in range(10):  
    res = g.move_robot(random.choice(actions))  
    rewards.append(res)  

## Output the robot's final position  
print(g.sense_robot())  

## Print the maze to observe the robot's position  
g  
(0, 9)

2.2 Implementing the Robot class

The Robot class is the one we need to focus on. In this class we need to implement a number of functions in order to obtain a working reinforcement learning agent. Until now we have moved the robot around the environment manually; by implementing the Robot class, the robot will move on its own. By implementing its learning functions, the Robot class will learn how to select the optimal action and how to update the corresponding reinforcement learning parameters.

First, the Robot constructor takes several inputs: alpha=0.5, gamma=0.9, and epsilon0=0.5 are the default values of the reinforcement learning parameters, and maze, as you have already seen, is the Maze object the robot moves in.

Then look at the robot.update function, which specifies the steps the robot goes through every time it takes an action. From these steps, the purpose of each of the other functions also becomes clear.

Check the effect by running the following code (remember to change the maze variable to the name of the variable you created the maze with).

import random  
import operator  

class Robot(object):  

    def __init__(self, maze, alpha=0.5, gamma=0.9, epsilon0=0.5):  
        self.maze = maze  
        self.valid_actions = self.maze.valid_actions  
        self.state = None  
        self.action = None  

        # Set parameters of the learning robot  
        self.alpha = alpha  
        self.gamma = gamma  

        self.epsilon0 = epsilon0  
        self.epsilon = epsilon0  
        self.t = 0  

        self.Qtable = {}  
        self.reset()  

    def reset(self):  
        """ Reset the robot """  
        self.state = self.sense_state()  
        self.create_Qtable_line(self.state)  

    def set_status(self, learning=False, testing=False):  
        """ Determine whether the robot is learning its q table, or
            executing the testing procedure. """  
        self.learning = learning  
        self.testing = testing  

    def update_parameter(self):  
        """ Some of the parameters of the q learning robot can be altered,
            update these parameters when necessary. """  
        if self.testing:  
            # TODO 1. No random choice when testing  
            self.epsilon = 0  
        else:  
            # TODO 2. Update parameters when learning  
            self.epsilon *= 0.95  

        return self.epsilon  

    def sense_state(self):  
        """ Get the current state of the robot. """  
        # TODO 3. Return robot's current state  
        return self.maze.sense_robot()  

    def create_Qtable_line(self, state):  
        """ Create the qtable with the current state """  
        # TODO 4. Create qtable with current state  
        # Our qtable should be a two level dict,  
        # Qtable[state] = {'u': xx, 'd': xx, ...}  
        # If Qtable[state] already exists, then do  
        # not change it.  
        self.Qtable.setdefault(state, {a: 0.0 for a in self.valid_actions})  

    def choose_action(self):  
        """ Return an action according to given rules """  

        def is_random_exploration():  
            # TODO 5. Return whether do random choice  
            # hint: generate a random number, and compare  
            # it with epsilon  
            return random.uniform(0, 1.0) <= self.epsilon  

        if self.learning:  
            if is_random_exploration():  
                # TODO 6. Return randomly chosen action  
                return random.choice(self.valid_actions)  
            else:  
                # TODO 7. Return action with highest q value  
                return max(self.Qtable[self.state].items(), key=operator.itemgetter(1))[0]  
        elif self.testing:  
            # TODO 7. Choose action with highest q value  
            return max(self.Qtable[self.state].items(), key=operator.itemgetter(1))[0]  
        else:  
            # TODO 6. Return randomly chosen action  
            return random.choice(self.valid_actions)  

    def update_Qtable(self, r, action, next_state):  
        """ Update the qtable according to the given rule. """  
        if self.learning:  
            # TODO 8. When learning, update the q table according  
            # to the given rules  
            self.Qtable[self.state][action] = (1 - self.alpha) * self.Qtable[self.state][action] + self.alpha * (  
                r + self.gamma * max(self.Qtable[next_state].values()))  

    def update(self):  
        """ Describe the procedure what to do when update the robot.
            Called every time in every epoch in training or testing.
            Return current action and reward. """  
        self.state = self.sense_state()  # Get the current state  
        self.create_Qtable_line(self.state)  # For the state, create q table line  

        action = self.choose_action()  # choose action for this state  
        reward = self.maze.move_robot(action)  # move robot for given action  

        next_state = self.sense_state()  # get next state  
        self.create_Qtable_line(next_state)  # create q table line for next state  

        if self.learning and not self.testing:  
            self.update_Qtable(reward, action, next_state)  # update q table  
            self.update_parameter()  # update parameters  

        return action, reward  

# from Robot import Robot  
# g = Maze(maze_size=(6, 12), trap_number=2)  
g = Maze("test_world\maze_01.txt")  
robot = Robot(g)  # remember to change g to the name of the variable you created your maze with  
robot.set_status(learning=True, testing=False)  
print(robot.update())  
g  
('d', -0.1)
Maze of size (12, 12)

2.3 Training the robot with the Runner class

With that done, we are ready to train and tune our robot. We have prepared another class, Runner, to carry out the whole training process and its visualization. Using the code below, you can train the robot and generate a video with the given filename in the current folder, recording the entire training process. By watching this video, you can spot problems during training and optimize your code and parameters.


Try using the following code to train the robot and tune the parameters. Adjustable parameters include:

  • Training parameters:

    • Number of training epochs (epoch)

  • Robot parameters:

    • epsilon0 (the initial epsilon value)

    • The epsilon decay schedule (it can be linear, exponential, etc., at whatever rate you like); you need to adjust it in robot.py

    • alpha

    • gamma

  • Maze parameters:

    • Size of the maze (maze_size)

    • Number of traps in the maze (trap_number)

  • An example set of parameter values:

    • epoch = 20

    • epsilon0 = 0.5

    • alpha = 0.5

    • gamma = 0.9

    • maze_size = (6, 8)

    • trap_number = 2

from Runner import Runner  

epoch = 20  
epsilon0 = 0.5  
alpha = 0.5  
gamma = 0.9  
maze_size = (6, 8)  
trap_number = 2  

g = Maze(maze_size=maze_size, trap_number=trap_number)  
r = Robot(g, alpha=alpha, epsilon0=epsilon0, gamma=gamma)  
r.set_status(learning=True)  

runner = Runner(r, g)  
runner.run_training(epoch, display_direction=True)  
# runner.generate_movie(filename="final1.mp4")  # comment this line out to speed things up, but then you won't get the video  
g  


The runner.plot_results() function plots some statistics about the robot's training process:

  • Success Times: the cumulative number of times the robot reaches the destination during training; this curve should be cumulatively increasing.

  • Accumulated Rewards: the total reward the robot accumulates in each training epoch; this curve should trend upward.

  • Running Times per Epoch: the number of steps the robot takes in each training epoch (an epoch stops and the next one begins once the destination is reached); this curve should gradually decrease.

Output the training results using runner.plot_results().
  runner.plot_results()  
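
After training, you can also evaluate the robot directly by switching it into testing mode, in which choose_action always picks the action with the highest Q value and no random exploration occurs. A minimal sketch, assuming the g and r objects defined in the training code above are still available; the step limit of 100 is just a safety bound.

r.set_status(learning=False, testing=True)

steps, total_reward = 0, 0.0
for _ in range(100):              # safety bound on the number of steps
    action, reward = r.update()   # greedy action, no learning
    steps += 1
    total_reward += reward
    if reward == 50:              # default reward for reaching the destination
        break

print(steps, total_reward)
print(g.sense_robot())            # final position of the robot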

Author: Yang Fei

Source: College creditease.cn/