Introduction
This article introduces the basic concepts of Reinforcement Learning (RL) and uses a Deep Q-Network (DQN) to train a model that can play Flappy Bird.
FlappyBird
Many people have played this game, and it is notoriously punishing. The following project reproduces Flappy Bird with pygame: github.com/sourabhv/Fl…
Install pygame if it is not already available:
pip install pygame
Run flappy.py to start the game; if the keyboard does not respond, run the script with pythonw instead:
pythonw flappy.py
Principle
Unsupervised learning has no labels, for example clustering; supervised learning has labels, for example classification; reinforcement learning sits somewhere in between, with labels (rewards) accumulated gradually through trial and error.
RL consists of several components:
- State (S): the state of the environment; in FlappyBird, for example, the current game screen, which can be represented as an image
- Action (A): the set of actions that can be taken in each S; in FlappyBird you can choose between two actions, "jump" or "do nothing"
- Reward (R): the reward received after performing an A in an S; in FlappyBird, for example, passing a pipe gives a positive reward, while hitting a pipe or falling to the ground gives a negative reward
The game then proceeds simply: start from an initial S, perform an A, receive an R, move to the next S, and so on until a terminal S is reached.
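In code form, this loop looks roughly like the following sketch; everything named here (initial_state, choose_action, step) is a hypothetical placeholder used only for illustration, not part of the actual implementation below:

S = initial_state                  # hypothetical initial state
while True:
    A = choose_action(S)           # hypothetical policy: "jump" or "do nothing"
    S, R, terminal = step(S, A)    # hypothetical environment step: next state, reward, done flag
    if terminal:                   # stop once a terminal state is reached
        break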
Define a function that computes the total reward accumulated over the whole course of the game,
and also the return accumulated from a certain step onward.
However, we are not entirely sure what each future step will actually yield, so we may as well multiply the future rewards by a decay factor between zero and one.
In this way we obtain a recursive relationship between the total returns of two adjacent steps.
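In symbols, the relations described above can be written as follows (a reconstruction in standard notation, with r_i the reward at step i, n the final step, and \gamma the decay factor):

R = r_1 + r_2 + \cdots + r_n
R_t = r_t + r_{t+1} + \cdots + r_n
R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n
R_t = r_t + \gamma R_{t+1}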
DQN is a common algorithm in reinforcement learning. It introduces the Q function (Quality, a value function), which gives the maximum total return obtainable by executing an A in a given S.
With the Q function, for the current state S we only need to compute the Q value of each A and then choose the A with the largest Q value; this is the optimal action policy (the policy function).
When the Q function converges, a recursive formula for the Q function can also be obtained.
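In the same notation, this can be summarized as follows (a standard reconstruction, with s' the next state reached after executing a in s):

Q(s, a): the maximum total return obtainable by executing a in state s
\pi(s) = \arg\max_a Q(s, a)  (the policy function)
Q(s, a) = r + \gamma \max_{a'} Q(s', a')  (the recursive formula once Q has converged)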
The Q function can be implemented with a neural network and trained as follows:
- Define the structure of the neural network and initialize it randomly; the input is S and the number of outputs equals the size of the action set
- At each step, choose A at random with a certain probability; otherwise choose the optimal A with the policy function, i.e. combine random exploration with a directed policy (see the sketch after this list)
- Maintain a memory module that accumulates the data generated during play
- Warm-up period: no training; its main purpose is to let the memory module accumulate some data first
- Exploration period: gradually reduce the random probability, transitioning from random exploration to the directed policy, and at each step take a batch of data from the memory module to train the model
- Training period: keep the random probability fixed and continue training so that the Q function converges further
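Here is a minimal sketch of the ε-greedy selection and linear annealing schedule just described; the default values mirror the parameters defined in the implementation below, and q_values is assumed to be the Q network's output for the current state:

import numpy as np

def select_action(q_values, epsilon, n_actions=2):
    # With probability epsilon explore a random action, otherwise act greedily on Q
    if np.random.random() <= epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_values))

def anneal_epsilon(t, observe=10000, explore=3000000, eps_start=0.1, eps_end=0.0001):
    # Warm-up period: epsilon stays at its initial value
    if t <= observe:
        return eps_start
    # Exploration period: decay linearly; training period: stay at the final value
    frac = min(1.0, (t - observe) / explore)
    return eps_start - (eps_start - eps_end) * frac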
For an introduction to the principles of reinforcement learning and DQN, refer to the following article: ai.intel.com/demystifyin…
Implementation
The implementation is modified from the following project: github.com/yenchenlin/…
The game code is a simplified version of the earlier flappy.py: the background image is removed, the colors of the bird and the pipes are fixed, and the game restarts automatically after the bird dies, mainly so that the model can play on its own and collect data.
Load the libraries
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import random
import cv2
import sys
sys.path.append('game/')
import wrapped_flappy_bird as fb
from collections import deque
Define some parameters
ACTIONS = 2                  # size of the action set: "do nothing" or "jump"
GAMMA = 0.99                 # decay factor for future rewards
OBSERVE = 10000              # number of warm-up steps before training starts
EXPLORE = 3000000            # number of steps over which epsilon is annealed
INITIAL_EPSILON = 0.1        # initial probability of a random action
FINAL_EPSILON = 0.0001       # final probability of a random action
REPLAY_MEMORY = 50000        # maximum size of the memory module
BATCH = 32                   # minibatch size for training
IMAGE_SIZE = 80              # height and width of the preprocessed screenshots
Define the network inputs and some auxiliary functions. Each S consists of four consecutive game screenshots:
S = tf.placeholder(dtype=tf.float32, shape=[None, IMAGE_SIZE, IMAGE_SIZE, 4], name='S')
A = tf.placeholder(dtype=tf.float32, shape=[None, ACTIONS], name='A')
Y = tf.placeholder(dtype=tf.float32, shape=[None], name='Y')

k_initializer = tf.truncated_normal_initializer(0, 0.01)
b_initializer = tf.constant_initializer(0.01)

def conv2d(inputs, kernel_size, filters, strides):
    return tf.layers.conv2d(inputs, kernel_size=kernel_size, filters=filters, strides=strides, padding='same', kernel_initializer=k_initializer, bias_initializer=b_initializer)

def max_pool(inputs):
    return tf.layers.max_pooling2d(inputs, pool_size=2, strides=2, padding='same')

def relu(inputs):
    return tf.nn.relu(inputs)
Define the network structure: a typical stack of convolution, pooling, and fully connected layers:
h0 = max_pool(relu(conv2d(S, 8, 32, 4)))   # 8x8 conv, 32 filters, stride 4, then 2x2 max pooling
h0 = relu(conv2d(h0, 4, 64, 2))            # 4x4 conv, 64 filters, stride 2
h0 = relu(conv2d(h0, 3, 64, 1))            # 3x3 conv, 64 filters, stride 1
h0 = tf.contrib.layers.flatten(h0)
h0 = tf.layers.dense(h0, units=512, activation=tf.nn.relu, bias_initializer=b_initializer)
Q = tf.layers.dense(h0, units=ACTIONS, bias_initializer=b_initializer, name='Q')   # one Q value per action

Q_ = tf.reduce_sum(tf.multiply(Q, A), axis=1)   # Q value of the action actually taken (A is one-hot)
loss = tf.losses.mean_squared_error(Y, Q_)
optimizer = tf.train.AdamOptimizer(1e-6).minimize(loss)
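For reference, the objective minimized above is the standard DQN regression loss; in the notation of the Principle section it can be written as follows (a reconstruction, not taken verbatim from the source project):

L = ( y - Q(s, a) )^2, \quad y = r \text{ if } s' \text{ is terminal, otherwise } y = r + \gamma \max_{a'} Q(s', a')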
Implement the memory module with a queue, start the game, and take the "do nothing" action to obtain the initial state:
game_state = fb.GameState()
D = deque()          # memory module
do_nothing = np.zeros(ACTIONS)
do_nothing[0] = 1    # one-hot action: index 0 means "do nothing"
img, reward, terminal = game_state.frame_step(do_nothing)
img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)   # resize and convert to grayscale
_, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)                              # binarize the screenshot
S0 = np.stack((img, img, img, img), axis=2)   # initial state: the same frame stacked four times
Continue to play and train the model
sess = tf.Session()
sess.run(tf.global_variables_initializer())

t = 0
success = 0
saver = tf.train.Saver()
epsilon = INITIAL_EPSILON
while True:
    # Anneal epsilon linearly during the exploration period
    if epsilon > FINAL_EPSILON and t > OBSERVE:
        epsilon = INITIAL_EPSILON - (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE * (t - OBSERVE)

    # epsilon-greedy action selection
    Qv = sess.run(Q, feed_dict={S: [S0]})[0]
    Av = np.zeros(ACTIONS)
    if np.random.random() <= epsilon:
        action_index = np.random.randint(ACTIONS)
    else:
        action_index = np.argmax(Qv)
    Av[action_index] = 1

    # Perform the action and preprocess the new frame
    img, reward, terminal = game_state.frame_step(Av)
    if reward == 1:
        success += 1
    img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
    _, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
    img = np.reshape(img, (IMAGE_SIZE, IMAGE_SIZE, 1))
    S1 = np.append(S0[:, :, 1:], img, axis=2)   # drop the oldest frame and append the new one

    # Store the transition in the memory module
    D.append((S0, Av, reward, S1, terminal))
    if len(D) > REPLAY_MEMORY:
        D.popleft()

    # After the warm-up period, train on a random minibatch from the memory module
    if t > OBSERVE:
        minibatch = random.sample(D, BATCH)
        S_batch = [d[0] for d in minibatch]
        A_batch = [d[1] for d in minibatch]
        R_batch = [d[2] for d in minibatch]
        S_batch_next = [d[3] for d in minibatch]
        T_batch = [d[4] for d in minibatch]

        Y_batch = []
        Q_batch_next = sess.run(Q, feed_dict={S: S_batch_next})
        for i in range(BATCH):
            if T_batch[i]:
                Y_batch.append(R_batch[i])
            else:
                Y_batch.append(R_batch[i] + GAMMA * np.max(Q_batch_next[i]))

        sess.run(optimizer, feed_dict={S: S_batch, A: A_batch, Y: Y_batch})

    S0 = S1
    t += 1

    if t > OBSERVE and t % 10000 == 0:
        saver.save(sess, './flappy_bird_dqn', global_step=t)

    if t <= OBSERVE:
        state = 'observe'
    elif t <= OBSERVE + EXPLORE:
        state = 'explore'
    else:
        state = 'train'
    print('Current Step %d Success %d State %s Epsilon %.6f Action %d Reward %f Q_MAX %f' % (t, success, state, epsilon, action_index, reward, np.max(Qv)))
Run dqn_flappy.py to train the model from scratch. At first the bird flaps wildly and cannot clear a single pipe, but as training progresses it learns to play steadily.
You can also run the trained model directly using the following code
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
import cv2
import sys
sys.path.append('game/')
import wrapped_flappy_bird as fb
ACTIONS = 2
IMAGE_SIZE = 80
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Restore the trained graph and weights from the latest checkpoint
saver = tf.train.import_meta_graph('./flappy_bird_dqn-8500000.meta')
saver.restore(sess, tf.train.latest_checkpoint('./'))
graph = tf.get_default_graph()

S = graph.get_tensor_by_name('S:0')
Q = graph.get_tensor_by_name('Q/BiasAdd:0')

game_state = fb.GameState()

do_nothing = np.zeros(ACTIONS)
do_nothing[0] = 1
img, reward, terminal = game_state.frame_step(do_nothing)
img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
_, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
S0 = np.stack((img, img, img, img), axis=2)

while True:
    # Always act greedily with the trained Q network
    Qv = sess.run(Q, feed_dict={S: [S0]})[0]
    Av = np.zeros(ACTIONS)
    Av[np.argmax(Qv)] = 1
    img, reward, terminal = game_state.frame_step(Av)
    img = cv2.cvtColor(cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE)), cv2.COLOR_BGR2GRAY)
    _, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
    img = np.reshape(img, (IMAGE_SIZE, IMAGE_SIZE, 1))
    S0 = np.append(S0[:, :, 1:], img, axis=2)
References
- A Flappy Bird Clone using python-pygame: github.com/sourabhv/Fl…
- Flappy Bird Hack using Deep Reinforcement Learning: github.com/yenchenlin/…
- Demystifying Deep Reinforcement Learning: ai.intel.com/demystifyin…
Video lecture course
Deep and interesting (1)