Introduction

This article introduces the concept of Reinforcement Learning (RL) and uses DQN to train a model that can play FlappyBird.

FlappyBird

Many people have played this game, and it is notoriously punishing. The following is a pygame reproduction of FlappyBird: github.com/sourabhv/Fl…

Install pygame if it is not already available:

pip install pygame

Run flappy.py to start the game; if the keyboard does not respond, run the code with pythonw instead:

pythonw flappy.py

Principle

Unsupervised learning has no labels, such as clustering; supervised learning has labels, such as classification; reinforcement learning sits somewhere in between, with labels accumulated gradually through trial and error.

RL consists of several components:

  • State (S): the state of the environment; in FlappyBird this is the current game screen, which can be represented as an image
  • Action (A): the set of actions that can be taken in each state S; in FlappyBird there are two actions to choose from, "jump" or "do nothing"
  • Reward (R): the reward given after performing an action A in a state S; in FlappyBird, successfully passing a pipe gives a positive reward, while hitting a pipe or falling to the ground gives a negative reward

In this way, the game simply starts from an initial state S, performs an action A, receives a reward R, and moves to the next state S, and so on until a terminal state is reached.
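In the usual notation, one game (an episode) is then a sequence of states, actions, and rewards, with $s_i$, $a_i$, $r_i$ denoting the state, action, and reward at step $i$:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

where $s_n$ is the terminal state.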


First define the sum of rewards over the course of one game:
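$$R = r_1 + r_2 + \cdots + r_n$$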


And the sum of rewards from a certain time step $t$ onward:
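$$R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$$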


But we cannot be entirely sure of the rewards of future steps, so we multiply them by a decay factor $\gamma$ between zero and one:
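$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{\,n-t} r_n$$

This $\gamma$ is the GAMMA parameter (0.99) in the code below.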


In this way, a recursive relationship between the total returns of two adjacent steps is obtained:
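$$R_t = r_t + \gamma R_{t+1}$$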


DQN is a common algorithm in reinforcement learning. It introduces the Q function (Quality, the value function), which gives the maximum total return that can be obtained by executing action A in state S:
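$$Q(s_t, a_t) = \max R_{t+1}$$

that is, the best total return obtainable from the next step onward when $a_t$ is performed in $s_t$ and play continues optimally afterwards.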


With the Q function, for the current state S we only need to compute the Q value of each action A and then select the A with the largest Q value, which gives the optimal action policy (the policy function):
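$$\pi(s) = \arg\max_a Q(s, a)$$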


When the Q function converges, it also satisfies a recursive formula (the Bellman equation):
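$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

where $s'$ is the next state reached after performing $a$ in $s$, $r$ is the reward received, and $a'$ ranges over the actions available in $s'$.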


The Q function can be implemented with a neural network and trained as follows:

  • Define the structure of the neural network and initialize it randomly. The input is S and the number of outputs equals the size of the action set
  • At each step, a random A is selected with a certain probability; otherwise the optimal A is selected with the policy function, i.e. a combination of random exploration and the learned policy (see the sketch after this list)
  • Maintain a replay memory that accumulates the data generated during play
  • Observation (warm-up) period: no training; it mainly lets the replay memory accumulate some data first
  • Exploration period: gradually reduce the random probability, transitioning from random exploration to the learned policy, and sample some data from the replay memory each time to train the model
  • Training period: keep the random probability fixed and continue training so that the Q function converges further
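
A minimal, framework-agnostic sketch of this epsilon-greedy scheme (choose_action, anneal_epsilon and their arguments are placeholder names used only for illustration; the actual TensorFlow implementation follows below):

import numpy as np

def choose_action(q_values, epsilon, num_actions):
    # With probability epsilon take a random action (exploration),
    # otherwise take the action with the largest Q value (exploitation)
    if np.random.random() <= epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(q_values))

def anneal_epsilon(t, observe=10000, explore=3000000, initial_eps=0.1, final_eps=0.0001):
    # Warm-up: epsilon stays at its initial value; exploration: linear decay;
    # training: epsilon stays at its final value
    if t <= observe:
        return initial_eps
    return max(final_eps, initial_eps - (initial_eps - final_eps) * (t - observe) / explore)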

For an introduction to the principles of reinforcement learning and DQN, refer to the following article: ai.intel.com/demystifyin…

Implementation

The implementation is modified from the following project: github.com/yenchenlin/…

The game code is simplified and modified from the earlier flappy.py: the background image is removed, the bird and pipe colors are fixed, and the game automatically restarts and continues after the bird dies, mainly so that the model can play automatically and collect data

Load the library

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import random
import cv2
import sys
sys.path.append('game/')
import wrapped_flappy_bird as fb
from collections import deque

Define some parameters

ACTIONS = 2                # size of the action set: "jump" or "do nothing"
GAMMA = 0.99               # decay factor for future rewards
OBSERVE = 10000            # warm-up steps before training starts
EXPLORE = 3000000          # steps over which epsilon is annealed
INITIAL_EPSILON = 0.1      # initial probability of choosing a random action
FINAL_EPSILON = 0.0001     # final probability of choosing a random action
REPLAY_MEMORY = 50000      # maximum size of the replay memory
BATCH = 32                 # minibatch size
IMAGE_SIZE = 80            # screenshots are resized to IMAGE_SIZE x IMAGE_SIZE

Define the network inputs and some helper functions. Each state S consists of four consecutive game screenshots

S = tf.placeholder(dtype=tf.float32, shape=[None, IMAGE_SIZE, IMAGE_SIZE, 4], name='S')  # stacked screenshots
A = tf.placeholder(dtype=tf.float32, shape=[None, ACTIONS], name='A')                    # one-hot action taken
Y = tf.placeholder(dtype=tf.float32, shape=[None], name='Y')                             # target Q value

k_initializer = tf.truncated_normal_initializer(0, 0.01)
b_initializer = tf.constant_initializer(0.01)

def conv2d(inputs, kernel_size, filters, strides):
    return tf.layers.conv2d(inputs, kernel_size=kernel_size, filters=filters, strides=strides, padding='same', kernel_initializer=k_initializer, bias_initializer=b_initializer)

def max_pool(inputs):
    return tf.layers.max_pooling2d(inputs, pool_size=2, strides=2, padding='same')

def relu(inputs):
    return tf.nn.relu(inputs)

Define the network structure, a typical stack of convolution, pooling, and fully connected layers

h0 = max_pool(relu(conv2d(S, 8, 32, 4)))
h0 = relu(conv2d(h0, 4, 64, 2))
h0 = relu(conv2d(h0, 3, 64, 1))
h0 = tf.contrib.layers.flatten(h0)
h0 = tf.layers.dense(h0, units=512, activation=tf.nn.relu, bias_initializer=b_initializer)

Q = tf.layers.dense(h0, units=ACTIONS, bias_initializer=b_initializer, name='Q')  # Q value for each action
Q_ = tf.reduce_sum(tf.multiply(Q, A), axis=1)  # Q value of the action actually taken (A is one-hot)
loss = tf.losses.mean_squared_error(Y, Q_)
optimizer = tf.train.AdamOptimizer(1e-6).minimize(loss)

Implement the replay memory with a queue, start the game, and choose "do nothing" as the action for the initial state

game_state = fb.GameState()
D = deque()

do_nothing = np.zeros(ACTIONS)
do_nothing[0] = 1
img, reward, terminal = game_state.frame_step(do_nothing)
img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
_, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
S0 = np.stack((img, img, img, img), axis=2)

Continue to play and train the model

sess = tf.Session()
sess.run(tf.global_variables_initializer())

t = 0
success = 0
saver = tf.train.Saver()
epsilon = INITIAL_EPSILON
while True:
    if epsilon > FINAL_EPSILON and t > OBSERVE:
        # linearly anneal epsilon from INITIAL_EPSILON to FINAL_EPSILON over EXPLORE steps
        epsilon = INITIAL_EPSILON - (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE * (t - OBSERVE)

    Qv = sess.run(Q, feed_dict={S: [S0]})[0]
    Av = np.zeros(ACTIONS)
    if np.random.random() <= epsilon:
        action_index = np.random.randint(ACTIONS)
    else:
        action_index = np.argmax(Qv) 
    Av[action_index] = 1

    img, reward, terminal = game_state.frame_step(Av)
    if reward == 1:
        success += 1
    img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
    _, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
    img = np.reshape(img, (IMAGE_SIZE, IMAGE_SIZE, 1))
    S1 = np.append(S0[:, :, 1:], img, axis=2)  # drop the oldest frame and append the new one

    D.append((S0, Av, reward, S1, terminal))
    if len(D) > REPLAY_MEMORY:
        D.popleft()

    if t > OBSERVE:
        minibatch = random.sample(D, BATCH)
        S_batch = [d[0] for d in minibatch]
        A_batch = [d[1] for d in minibatch]
        R_batch = [d[2] for d in minibatch]
        S_batch_next = [d[3] for d in minibatch]
        T_batch = [d[4] for d in minibatch]

        Y_batch = []
        Q_batch_next = sess.run(Q, feed_dict={S: S_batch_next})
        for i in range(BATCH):
            if T_batch[i]:
                Y_batch.append(R_batch[i])
            else:
                Y_batch.append(R_batch[i] + GAMMA * np.max(Q_batch_next[i]))

        sess.run(optimizer, feed_dict={S: S_batch, A: A_batch, Y: Y_batch})

    S0 = S1
    t += 1

    if t > OBSERVE and t % 10000 == 0:
        saver.save(sess, './flappy_bird_dqn', global_step=t)

    if t <= OBSERVE:
        state = 'observe'
    elif t <= OBSERVE + EXPLORE:
        state = 'explore'
    else:
        state = 'train'
    print('Current Step %d Success %d State %s Epsilon %.6f Action %d Reward %f Q_MAX %f' % (t, success, state, epsilon, action_index, reward, np.max(Qv)))

Run dqn_flappy.py to train the model from scratch. At first the bird flaps wildly and cannot pass a single pipe, but as training progresses it learns to play steadily

You can also run the trained model directly using the following code

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import cv2
import sys
sys.path.append('game/')
import wrapped_flappy_bird as fb

ACTIONS = 2
IMAGE_SIZE = 80

sess = tf.Session()
sess.run(tf.global_variables_initializer())

saver = tf.train.import_meta_graph('./flappy_bird_dqn-8500000.meta')
saver.restore(sess, tf.train.latest_checkpoint('./'))
graph = tf.get_default_graph()

S = graph.get_tensor_by_name('S:0')
Q = graph.get_tensor_by_name('Q/BiasAdd:0')

game_state = fb.GameState()

do_nothing = np.zeros(ACTIONS)
do_nothing[0] = 1
img, reward, terminal = game_state.frame_step(do_nothing)
img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
_, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
S0 = np.stack((img, img, img, img), axis=2)

while True:
    Qv = sess.run(Q, feed_dict={S: [S0]})[0]
    Av = np.zeros(ACTIONS) 
    Av[np.argmax(Qv)] = 1

    img, reward, terminal = game_state.frame_step(Av)
    img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
    _, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
    img = np.reshape(img, (IMAGE_SIZE, IMAGE_SIZE, 1))
    S0 = np.append(S0[:, :, 1:], img, axis=2)

References

  • A Flappy Bird Clone using python-pygame: github.com/sourabhv/Fl…
  • Flappy Bird Hack using Deep Reinforcement Learning: github.com/yenchenlin/…
  • Demystifying Deep Reinforcement Learning: ai.intel.com/demystifyin…

Video lecture course

Deep and interesting (1)