As a happy homebody, games are a daily necessity for me. Whether it's big hits like Honor of Kings, PUBG, and Genshin Impact, or viral mini-games like Jump Jump, Synthetic Big Watermelon, and 2048, I've dabbled in them all. But to become a "number one player", I'm forever scouring communities and websites for strategy guides. Having grown up on other people's guides, I often wonder: when will it be my turn to become a guide-writing master and have everyone else learn from me? Wouldn't that be something!
Then inspiration struck. After all, I'm a chubby nerd with a bit of technical skill, and I used to obsess over DeepMind and AlphaGo, so why not train an AI to play for me?
The plan: use reinforcement learning to train an agent on 2048 and watch how the AI figures the game out.
Since I wanted to practice, I started with 2048: a simple game that demands no reflexes, pure strategy. A quick search online turned up, as expected, an open-source 2048 game environment.
GitHub address: github.com/rgal/gym-20…
The next step is to hook this environment up to a reinforcement learning algorithm.
The algorithm part is very simple. For now I'm just using the most classic DQN, which reaches a decent model in about 10 minutes of training. If you have ideas of your own, feel free to try Rainbow, PPO, A2C, SAC and other algorithms; I believe they would do even better.
I developed the model on Huawei Cloud ModelArts (an online, out-of-the-box AI platform with free GPU compute you can use every day without limit, pretty cool!), so the code runs in an .ipynb notebook.
Overall, the process breaks down into three steps:
1. Create the environment
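The environment comes from the repo above. Below is a minimal sketch of creating it; the package/import name and the '2048-v0' env id are my assumptions based on typical gym-2048 packages and may differ in this repo, and the observation is assumed to be the 4x4 board one-hot encoded over 16 channels, which is what the network in step 3 expects.

```python
import gym
import gym_2048  # assumed import name; importing it registers the 2048 envs with gym

env = gym.make('2048-v0')   # assumed env id
s = env.reset()             # assumed shape: (4, 4, 16) one-hot encoded board
print(env.action_space)     # Discrete(4): the four move directions
print(s.shape)
```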
2. Create a DQN algorithm
```python
import numpy as np
import torch
import torch.nn as nn
from random import randint, sample

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class DQN:
    def __init__(self, args, behaviour_model, target_model, action_dim):
        # NOTE: the original excerpt does not show __init__; this is a minimal
        # reconstruction that only wires up the attributes used by the methods below.
        self.args = args
        self.action_dim = action_dim
        self.behaviour_model = behaviour_model.to(device)
        self.target_model = target_model.to(device)
        self.optimizer = torch.optim.Adam(self.behaviour_model.parameters(), lr=args.lr)
        self.criterion = nn.MSELoss()
        self.learn_step_counter = 0

    def learn(self, buffer):
        # only start learning once the buffer holds at least one batch
        if buffer.size >= self.args.batch_size:
            # periodically copy the behaviour network into the target network
            if self.learn_step_counter % args.target_update_freq == 0:
                self.target_model.load_state_dict(self.behaviour_model.state_dict())
            self.learn_step_counter += 1

            # sample a batch of transitions from the replay buffer:
            # (observation, action, next observation, done flag, reward)
            s1, a, s2, done, r = buffer.get_sample(self.args.batch_size)
            s1 = torch.FloatTensor(s1).to(device)
            s2 = torch.FloatTensor(s2).to(device)
            r = torch.FloatTensor(r).to(device)
            a = torch.LongTensor(a).to(device)

            if args.use_nature_dqn:
                q = self.target_model(s2).detach()
            else:
                q = self.behaviour_model(s2)

            # Bellman target: q = r + gamma * (1 - done) * max_a' Q(s2, a')
            target_q = r + torch.FloatTensor(args.gamma * (1 - done)).to(device) * q.max(1)[0]
            target_q = target_q.view(args.batch_size, 1)

            # Q-values predicted by the behaviour network for the actions actually taken
            eval_q = self.behaviour_model(s1).gather(1, torch.reshape(a, shape=(a.size()[0], -1)))

            # calculate the loss and update the behaviour network
            loss = self.criterion(eval_q, target_q)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

    def get_action(self, state, explore=True):
        if explore:
            # epsilon-greedy: with probability (1 - args.epsilon) take a random action
            if np.random.uniform() >= args.epsilon:
                action = randint(0, self.action_dim - 1)
            else:
                # choose the best action according to the network
                q = self.behaviour_model(torch.FloatTensor(state).to(device))
                m, index = torch.max(q, 1)
                action = index.data.cpu().numpy()[0]
        else:
            q = self.behaviour_model(torch.FloatTensor(state).to(device))
            m, index = torch.max(q, 1)
            action = index.data.cpu().numpy()[0]
        return action


class ReplayBuffer:
    def __init__(self, buffer_size, obs_space):
        self.s1 = np.zeros(obs_space, dtype=np.float32)
        self.s2 = np.zeros(obs_space, dtype=np.float32)
        self.a = np.zeros(buffer_size, dtype=np.int32)
        self.r = np.zeros(buffer_size, dtype=np.float32)
        self.done = np.zeros(buffer_size, dtype=np.float32)
        # replay buffer size
        self.buffer_size = buffer_size
        self.size = 0
        self.pos = 0

    def add_transition(self, s1, action, s2, done, reward):
        self.s1[self.pos] = s1
        self.a[self.pos] = action
        if not done:
            self.s2[self.pos] = s2
        self.done[self.pos] = done
        self.r[self.pos] = reward
        self.pos = (self.pos + 1) % self.buffer_size
        self.size = min(self.size + 1, self.buffer_size)

    def get_sample(self, sample_size):
        i = sample(range(0, self.size), sample_size)
        return self.s1[i], self.a[i], self.s2[i], self.done[i], self.r[i]
```
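To make the update rule in learn() concrete: the target it regresses toward is r + gamma * (1 - done) * max_a Q_target(s2, a), so terminal transitions keep only the immediate reward. A tiny illustration with made-up numbers (not from the article):

```python
import torch

r     = torch.tensor([4.0, 8.0])    # immediate rewards (e.g. value of merged tiles)
done  = torch.tensor([0.0, 1.0])    # the second transition ends the episode
q_max = torch.tensor([10.0, 50.0])  # max_a Q_target(s2, a) from the target network
gamma = 0.99                        # assumed discount factor

target_q = r + gamma * (1 - done) * q_max
print(target_q)  # tensor([13.9000, 8.0000]) -- the terminal transition ignores q_max
```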
3. Create a network model
What I'm using here is a very simple three-layer convolutional network.
```python
class Net(nn.Module):
    def __init__(self, obs, available_actions_count):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(obs, 128, kernel_size=2, stride=1)
        self.conv2 = nn.Conv2d(128, 64, kernel_size=2, stride=1)
        self.conv3 = nn.Conv2d(64, 16, kernel_size=2, stride=1)
        self.fc1 = nn.Linear(16, available_actions_count)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # observations come in channels-last (N, 4, 4, C); Conv2d wants channels-first
        x = x.permute(0, 3, 1, 2)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = self.fc1(x.view(x.shape[0], -1))
        return x
```
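One thing the excerpts above don't show is the glue code that builds the environment, the agent, and the replay buffer before training. Here's a minimal sketch of that setup; every hyperparameter value, the env id, and the (4, 4, 16) observation shape are assumptions on my part, not the article's original settings.

```python
import argparse
import gym
import gym_2048  # assumed import name for the environment repo above

# assumed hyperparameters -- tune freely
args = argparse.Namespace(
    epochs=2000,             # number of training episodes
    batch_size=128,
    lr=1e-4,
    gamma=0.99,              # discount factor
    epsilon=0.9,             # probability of acting greedily in get_action()
    target_update_freq=200,  # how often to sync the target network
    use_nature_dqn=True,     # use a frozen target network for the TD target
    buffer_size=10000,       # replay buffer capacity
)

env = gym.make('2048-v0')        # assumed env id
obs_shape = (4, 4, 16)           # assumed: 4x4 board, one-hot over 16 tile values
action_dim = env.action_space.n  # the four move directions

dqn = DQN(args,
          behaviour_model=Net(obs_shape[-1], action_dim),
          target_model=Net(obs_shape[-1], action_dim),
          action_dim=action_dim)
memory = ReplayBuffer(args.buffer_size, (args.buffer_size,) + obs_shape)
```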
Complete the three steps above and you can happily start training:
```python
import time

print('\ntraining...')
begin_t = time.time()
max_reward = 0
for i_episode in range(args.epochs):
    # reset the environment at the start of every episode
    s = env.reset()
    ep_r = 0
    while True:
        # pick an action (epsilon-greedy) and step the environment
        a = dqn.get_action(np.expand_dims(s, axis=0))
        s_, r, done, info = env.step(a)
        # store the transition in the replay buffer
        memory.add_transition(s, a, s_, done, r)
        ep_r += r
        # update the network from a random batch of past transitions
        dqn.learn(memory)
        if done:
            print('Ep: ', i_episode, '| Ep_r: ', round(ep_r, 2))
            if ep_r > max_reward:
                max_reward = ep_r
                print("current_max_reward {}".format(max_reward))
                # save the best model seen so far
                torch.save(dqn.behaviour_model, "2048.pt")
            break
        s = s_
print("finish! time cost is {}s".format(time.time() - begin_t))
```
I trained for only 10 minutes, and in this strict environment that tolerates no wrong moves, inference can already reach the 256 tile. With a more advanced algorithm and a longer training run, 2048 is not a dream.
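If you want to watch the trained agent play, a minimal inference loop could look like the sketch below. It assumes it runs in the same notebook (so the Net class is available when unpickling), reuses the env created in the setup sketch, and loads the 2048.pt file written by the training loop; at inference time the action is always the greedy argmax.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("2048.pt").to(device).eval()  # the behaviour network saved during training

s = env.reset()                                  # reuse the env created earlier
done, total_r = False, 0
while not done:
    with torch.no_grad():
        q = model(torch.FloatTensor(np.expand_dims(s, axis=0)).to(device))
    a = int(q.argmax(1).item())                  # greedy action, no exploration
    s, r, done, info = env.step(a)
    total_r += r
print("episode reward:", total_r)
```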
Full code: click here to run it directly online, or download it from marketplace.huaweicloud.com/markets/aih…
I first came across this technique through last year's Huawei Cloud AI full-stack growth plan. Word is that Huawei Cloud has launched a new round of its AI training camp this year, covering Python, ModelArts, the MindSpore AI framework, deep learning, reinforcement learning, and machine learning to help us become the "KING of AI"! While picking up comprehensive AI knowledge in a short time, you can also win great gifts like a Mate 30 Pro, a smart watch, and wireless earphones! I've already scanned the QR code below to sign up. What are you waiting for?
Click Follow to be the first to learn about Huawei Cloud's latest technologies~