Roughly a year after OpenAI's bot beat professional player Dendi in a DOTA 2 1v1 match, the more advanced OpenAI Five system has defeated a team of human players 2-1. GG (the one game the humans won in the end notwithstanding).

So what kind of AI did the human side lose to this time?

This time, OpenAI Five played against five strong human players (casters and ex-pros) — Blitz, Cap, Fogged, Merlini, and Moonmeander — with an average ladder rating of around 7,000. OpenAI Five, on the other hand, plays the equivalent of 180 years of games against itself every day, according to public information. Training it required running a scaled-up version of Proximal Policy Optimization (PPO) on 256 GPUs and 128,000 CPU cores.


It uses a separate LSTM (long short-term memory recurrent neural network) for each hero and, with no human data, learns recognizable strategies, suggesting that reinforcement learning can produce long-horizon planning at a large but achievable scale.

For context: unlike board games with fixed rules, DOTA 2 is a complex 5v5 strategy video game. The game has been in continuous development for over a decade, its logic is implemented in hundreds of thousands of lines of code that are updated roughly every two weeks, and its semantics are constantly changing.

As a result, DOTA is a hard game for an AI to play. Four problems have to be tackled first: long time horizons, a partially observed state, a high-dimensional continuous action space, and a high-dimensional continuous observation space.

▌ Model Architecture

Each of OpenAI Five's networks contains a single-layer, 1024-unit LSTM that observes the current game state (pulled from Valve's Bot API) and emits actions through several possible action heads. Each head has a semantic meaning, such as the number of ticks to delay the action, which action to select, the X or Y coordinate of the action in a grid around the unit, and so on. The action heads are computed independently.
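As a rough illustration, here is a minimal PyTorch sketch of this "shared LSTM core plus independent action heads" layout. Only the single-layer, 1024-unit LSTM matches the description above; the encoder, the particular heads, and their sizes are illustrative assumptions rather than OpenAI's actual architecture.

```python
# Minimal sketch: shared LSTM core + independent action heads (illustrative sizes).
import torch
import torch.nn as nn

class PolicySketch(nn.Module):
    def __init__(self, obs_dim=20000, hidden=1024):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden)   # compress the flat observation
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        # One independent head per semantic output (sizes are assumptions).
        self.heads = nn.ModuleDict({
            "delay":    nn.Linear(hidden, 4),    # how many ticks to delay the action
            "action":   nn.Linear(hidden, 170),  # which action/ability/item to use
            "offset_x": nn.Linear(hidden, 9),    # X coordinate in a grid around the unit
            "offset_y": nn.Linear(hidden, 9),    # Y coordinate in a grid around the unit
        })

    def forward(self, obs, state=None):
        # obs: (batch, time, obs_dim) float tensor built from the Bot API game state
        x = torch.relu(self.encode(obs))
        x, state = self.lstm(x, state)
        # Each head is computed independently from the shared LSTM output.
        logits = {name: head(x) for name, head in self.heads.items()}
        return logits, state

if __name__ == "__main__":
    policy = PolicySketch()
    obs = torch.zeros(1, 1, 20000)             # one timestep of a flattened observation
    logits, state = policy(obs)
    action = {k: torch.argmax(v, dim=-1).item() for k, v in logits.items()}
    print(action)
```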

OpenAI has published interactive demonstrations of the observation space and action space that OpenAI Five uses. OpenAI Five views the world as a list of 20,000 numbers and acts by emitting a list of eight enumeration values. By selecting different actions and targets in those demos, we can see how OpenAI Five encodes each action and how it views the world, compared with what a human would see on screen.
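The encoding can be pictured with the simplified sketch below: a game state flattened into one long numeric vector, and an action reduced to a short tuple of enumeration values. The feature fields, padding scheme, and example values are hypothetical.

```python
# Hypothetical sketch of "world as a flat list of numbers, action as a tuple of enums".
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UnitFeatures:
    x: float
    y: float
    health: float
    is_enemy: float  # 0.0 or 1.0

def flatten_observation(units: List[UnitFeatures], target_len: int = 20000) -> List[float]:
    """Concatenate per-unit features into one flat, fixed-length vector (zero-padded)."""
    flat: List[float] = []
    for u in units:
        flat.extend([u.x, u.y, u.health, u.is_enemy])
    flat.extend([0.0] * (target_len - len(flat)))  # pad so the policy always sees the same shape
    return flat[:target_len]

# An action is emitted as a small tuple of discrete choices, e.g. eight enumeration values.
Action = Tuple[int, int, int, int, int, int, int, int]
example_action: Action = (0, 42, 3, 5, 0, 0, 1, 2)  # e.g. (delay, action id, grid x, grid y, ...)

obs = flatten_observation([UnitFeatures(0.1, 0.2, 1.0, 0.0)])
print(len(obs), example_action)
```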


OpenAI Five can also react to missing pieces of state that correlate with what it does see. For example, until recently OpenAI Five's observations did not include Shrapnel zones (the areas where Sniper's projectiles rain down on enemies), which human players see on screen. Nevertheless, OpenAI Five learned to walk out of (though not avoid entering) active Shrapnel zones, because it could see its health decreasing while inside them.

▌ Exploration

Even with a learning algorithm that can handle long horizons, the environment still has to be explored. Even with all the restrictions in place, there are hundreds of items, dozens of buildings, spells, and unit types, plus long-tail game mechanics and all the combinations they produce, so exploring this vast space effectively is not easy.

OpenAI Five starts from random weights and learns through self-play. To avoid "strategy collapse", the agents train against their current selves 80% of the time and against their past selves 20% of the time. In the first self-play games, the heroes wander aimlessly around the map. After a few hours of training, concepts such as laning and farming begin to appear. After a few days, they consistently adopt basic human strategies, such as trying to steal Bounty Runes from their opponents. With further training, they master more advanced strategies, such as grouping all five heroes to push towers together.
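The 80/20 self-play schedule is simple to express in code. Below is a minimal sketch; the checkpoint pool and parameter placeholders are illustrative, not OpenAI's implementation.

```python
# Minimal sketch: 80% of games against the current self, 20% against a past checkpoint.
import random

class OpponentPool:
    def __init__(self):
        self.past_checkpoints = []  # snapshots of earlier versions of the agent

    def add_checkpoint(self, params):
        self.past_checkpoints.append(params)

    def sample_opponent(self, current_params):
        # 80% of the time, mirror-match against the current self;
        # 20% of the time, play a past self to avoid "strategy collapse".
        if not self.past_checkpoints or random.random() < 0.8:
            return current_params
        return random.choice(self.past_checkpoints)

pool = OpponentPool()
for step in range(5):
    pool.add_checkpoint({"step": step})
print(pool.sample_opponent({"step": "current"}))
```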

OpenAI Five reuses the randomizations developed for the 1v1 bot and adds a new lane-assignment randomization. At the beginning of each training game, each hero is randomly "assigned" to a subset of lanes and is penalized for leaving those lanes until a randomly chosen point in time.
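A sketch of what such a lane-assignment randomization could look like is below; the lane names, penalty size, and cutoff range are illustrative assumptions.

```python
# Illustrative sketch: random lane assignment with a penalty that expires at a random time.
import random

LANES = ["top", "mid", "bottom"]

def sample_lane_assignment(num_heroes=5, max_cutoff_s=600):
    """At the start of a training game, pick a lane subset and a cutoff time per hero."""
    return [
        {
            "lanes": random.sample(LANES, k=random.randint(1, len(LANES))),
            "cutoff": random.uniform(0, max_cutoff_s),  # penalty only applies before this time
        }
        for _ in range(num_heroes)
    ]

def lane_penalty(hero_lane, assignment, game_time_s, penalty=-0.02):
    """Negative reward if the hero is outside its assigned lanes before the cutoff."""
    if game_time_s < assignment["cutoff"] and hero_lane not in assignment["lanes"]:
        return penalty
    return 0.0

assignments = sample_lane_assignment()
print(assignments[0], lane_penalty("mid", assignments[0], game_time_s=120.0))
```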

Reward shaping also helps the agents explore: rewards include metrics such as net worth, kills, deaths, assists, and last hits. Each agent's reward is post-processed by subtracting the other team's average reward, which prevents the agents from finding positive-sum situations.
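That post-processing step amounts to subtracting the opposing team's mean reward from each agent's reward, as in this minimal sketch (the raw reward numbers are made up):

```python
# Minimal sketch: make shaped rewards zero-sum across the two teams.
def postprocess_rewards(team_a, team_b):
    """team_a / team_b: per-hero shaped rewards for one timestep."""
    avg_a = sum(team_a) / len(team_a)
    avg_b = sum(team_b) / len(team_b)
    adjusted_a = [r - avg_b for r in team_a]  # subtract the *other* team's average
    adjusted_b = [r - avg_a for r in team_b]
    return adjusted_a, adjusted_b

a, b = postprocess_rewards([1.0, 0.5, 0.0, 0.2, 0.3], [0.4, 0.4, 0.1, 0.0, 0.1])
print(a, b)
```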

Item and skill builds are hard-coded, and courier management is taken from a scripted baseline.

▌ Rapid


The system was implemented on top of a general-purpose RL training system called "Rapid", which can be applied to any Gym environment.


The training system is split into rollout workers, which each run a copy of the game plus an agent gathering experience, and optimizer nodes, which perform synchronous gradient descent across a fleet of GPUs. Each experiment also includes workers that evaluate the trained agents against reference agents, as well as monitoring software such as TensorBoard, Sentry, and Grafana.
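Conceptually, the layout looks something like the single-process sketch below, where rollout workers push experience into a queue and an optimizer consumes batches from it. The real Rapid system distributes these roles across many machines; all class names and sizes here are placeholders.

```python
# Single-process sketch of a Rapid-style rollout-worker / optimizer split.
import queue
import random

class RolloutWorker:
    def __init__(self, worker_id, experience_queue):
        self.worker_id = worker_id
        self.queue = experience_queue

    def play_step(self):
        # Stand-in for "run a copy of the game and let the agent gather experience".
        transition = {"worker": self.worker_id, "obs": random.random(), "reward": random.random()}
        self.queue.put(transition)

class Optimizer:
    def __init__(self, experience_queue, batch_size=4):
        self.queue = experience_queue
        self.batch_size = batch_size

    def train_step(self):
        batch = [self.queue.get() for _ in range(self.batch_size)]
        # Stand-in for synchronous gradient descent over the batch.
        return sum(t["reward"] for t in batch) / len(batch)

q = queue.Queue()
workers = [RolloutWorker(i, q) for i in range(4)]
optimizer = Optimizer(q)
for _ in range(2):
    for w in workers:
        w.play_step()
print("mean batch reward:", optimizer.train_step())
```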


During synchronous gradient descent, each GPU computes the gradient for its own part of the batch, and the gradients are then averaged globally. Averaging was originally done with an MPI allreduce, but it now uses wrappers around NVIDIA's NCCL2 framework, which parallelize GPU computation and data transfer across the network. The latency for synchronizing 58 MB of data (the size of OpenAI Five's parameters) is shown in the table; it is low enough to be largely masked by the GPU computation running in parallel with it.
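The averaging step itself is easy to illustrate: each GPU produces a gradient for its shard of the batch, and an allreduce-style average turns them into one global gradient that every GPU then applies. The sketch below uses plain Python lists instead of NCCL for clarity.

```python
# Conceptual sketch of synchronous gradient averaging across GPUs (plain Python, no NCCL).
def local_gradient(shard):
    # Stand-in for backprop on one GPU's part of the batch.
    return [2.0 * x / len(shard) for x in shard]

def allreduce_mean(per_gpu_grads):
    """Average element-wise across GPUs; every GPU ends up with the same global gradient."""
    n = len(per_gpu_grads)
    return [sum(g[i] for g in per_gpu_grads) / n for i in range(len(per_gpu_grads[0]))]

shards = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # one data shard per GPU
grads = [local_gradient(s) for s in shards]
print(allreduce_mean(grads))
```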

▌ Differences from humans

OpenAI Five has access to the same information as human players, but it instantly sees data such as positions, health, and item inventories that humans have to check manually. OpenAI Five averages 150-170 actions per minute (with a theoretical maximum of 450, since it acts once every four frames of a 30 fps game: 30 / 4 × 60 = 450) and has a mean reaction time of 80 milliseconds, much faster than an average human.

Many professional players trained against the 1v1 bot after last year's TI ended. According to Blitz, the solo bot has changed the way people think about the pace of 1v1 play: the bot favors a fast-paced style, and most players have now sped up their own play to match it.


The AI's pacing and execution in Dota 2 are very strong, but does that mean there is no room for improvement? Of course not. OpenAI Five still has limitations: for instance, it remains weak at last-hitting. On the other hand, its objective prioritization matches a common professional strategy, and it often gives up short-term rewards to gain long-term ones such as strategic map control.

In exhibition matches at this year's TI, professional players will keep challenging the AI, and OpenAI suggests the results are likely to be humbling for the humans once again. Perhaps more importantly: will we eventually see "AI vs. AI" showdowns in a game as complex as Dota 2?



Original article: blog.csdn.net/dQCFKyQDXYm…