StarCraft has been a classic of the real-time strategy genre for over a decade, and it has now become a major platform for research on deep reinforcement learning and artificial intelligence. Because it poses complex problems such as multi-agent cooperation, multi-task learning and macro-strategic planning, breakthroughs made here can have a significant impact on business and society. Companies such as DeepMind and Facebook have invested heavily in general artificial intelligence research built on it. In this talk, we introduce how Alibaba studies AI algorithms in the StarCraft game environment, focusing on multi-agent cooperation in micro-combat scenarios.
Good afternoon, everyone! I am Long Haitao from the Alibaba Cognitive Computing Lab. Today I would like to talk with you about "StarCraft and Artificial Intelligence". First, I will explain why we chose StarCraft for our exploratory research on artificial intelligence; then I will present our preliminary attempts and results in this field; finally, I will discuss some topics we can continue to study in StarCraft in the future.
Why choose StarCraft as the setting for AI algorithm research
First of all, you might wonder why we chose StarCraft as a platform for our AI research. Our Cognitive Computing Lab sits under the search business, and our team members all come from backgrounds in search, advertising, recommendation and related algorithms; our main work used to be CTR estimation and CVR conversion optimization. After last year's "Double 11", we wanted to do some frontier exploration in cognitive intelligence, and we all agreed that games are a great platform for studying AI algorithms. First, a game is a very clean environment: it can generate data in an endless stream and iterate very fast, and the intelligence it produces is directly observable. It is also close to the real world, and StarCraft has been a very popular game for over a decade, accumulating a great deal of data, so we can learn from past experience. Most importantly, it presents a very large and complex challenge for AI, covering the following six points:
First, it’s an environment of imperfect information
Unlike Go or chess, which are games of complete information where everyone can see the whole board, StarCraft has a fog of war, so you have to send units to scout for information about your opponent and then make intelligent decisions under uncertainty. This is one aspect in which it is very different from, and much harder than, other games.
Second, it has a huge search space
Go has a search space of about 10^170. StarCraft, on a 128×128 map with a population cap of 400 units, has a search space of roughly 10^1685, nearly ten times as many orders of magnitude as Go, and that is without even counting other state such as unit health. No single algorithm is going to solve all of StarCraft's problems.
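As a rough sanity check on where a number like 10^1685 might come from (this derivation is our own assumption, not something stated in the talk), counting only unit positions already gets you there: each of 400 units can stand on any of the 128×128 tiles.

```python
import math

# Back-of-the-envelope estimate of StarCraft's positional state space:
# 400 units, each on one of 128 * 128 map tiles (ignoring health, cooldowns, etc.).
tiles = 128 * 128                       # 16384 possible positions per unit
units = 400
orders_of_magnitude = units * math.log10(tiles)
print(f"~10^{orders_of_magnitude:.1f} states")   # roughly 10^1685, vs ~10^170 for Go
```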
Third, it’s a real-time versus game
In Go you can think for a minute or two before each move, but in StarCraft, which normally runs at 24 frames per second, you have to react within about 42 milliseconds. That reaction is not a single action but a whole series of actions, one for each unit. This places heavy demands on the performance, efficiency and engineering of our algorithms.
Fourth, it requires an agent to have a long-term plan
This is not a matter of subconscious reactions: the agent needs memory, it needs to decide what strategy to adopt in the early game, the mid game and the late game, and it has to adjust that strategic plan dynamically according to whatever scouting reveals. This is an extremely difficult challenge for artificial intelligence.
Fifth, reasoning in time and space
To play StarCraft well you must reason over time and space: positional advantage, where tanks would be better placed, which location gives you an edge when you take an expansion, where to stage your army, and so on. All of this requires the AI to do spatial reasoning.
Sixth, multiple agents collaborate
StarCraft allows up to 400 units, so it requires many agents, that is, many units, to work together, which is a big challenge for AI.
AI research and competitions around StarCraft are not new. As early as 2010 there were already many researchers studying AI in StarCraft, with the main strength at the University of Alberta, including a number of faculty and students, and there are three established tournaments as well as some round-robin competitions in which everyone's bots compete. That kind of AI is classic AI: it has no learning ability, no model and no training, but runs on pre-programmed rules, so it is not very flexible. AI built this way is still very, very far from truly surpassing humans. It can beat the built-in AI, but it is nowhere near professional players, or even ordinary players.
The other category is modern AI, algorithms in which the agent learns autonomously. This area has become very active since last year. On one side, Alibaba and University College London recently collaborated on some new AI work based on StarCraft 1.
On the other side, Google DeepMind partnered with Blizzard last November to open up an API for StarCraft 2 so that people can develop their own AI algorithms on top of it, and Facebook also has teams working in this area.
Deep reinforcement learning
Reinforcement learning is a learning mechanism very similar to the way humans learn. The agent learns through interaction with the environment: it observes the environment, takes actions based on the observed state and feedback, those actions affect the environment to some degree, and the environment responds with rewards, which may be positive or negative. The agent adjusts continuously based on this feedback. Behind the agent there are two very important concepts. One is the policy, which is continuously optimized and determines which action is reasonable under which conditions. The other is the value function, which evaluates how good the current state is.
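To make that loop concrete, here is a minimal sketch of the agent-environment interaction just described; ToyEnv and the random placeholder policy are invented for illustration and stand in for StarCraft and a learned policy.

```python
import random

class ToyEnv:
    """A made-up 1-D environment: move left or right, reward for reaching position 5."""
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state

    def step(self, action):                  # action: -1 or +1
        self.pos += action
        reward = 1.0 if self.pos == 5 else -0.01
        done = self.pos == 5
        return self.pos, reward, done        # feedback from the environment

def policy(state):
    """Placeholder policy: decides which action to take in the current state."""
    return random.choice([-1, 1])

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for _ in range(100):                         # agent-environment interaction loop
    action = policy(state)                   # agent acts based on the observed state
    state, reward, done = env.step(action)   # environment returns new state and reward
    total_reward += reward
    if done:
        break
print(total_reward)
```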
The combination of reinforcement learning and deep learning is called deep reinforcement learning. Deep neural networks are well suited to representation learning and can express very complex functions, so approximating the policy or the value function with a neural network is a big improvement in both engineering and efficiency. Take AlphaGo as an example: its training is divided into three stages. In the first stage, it learns human prior knowledge from records of human games, using supervised learning to obtain a policy network with a decent win rate. In the second stage, starting from that supervised policy network, it plays against itself and improves the policy network through policy gradients, producing a policy better than the one learned before. In the third stage, the reinforcement-learning version of the policy network plays games against itself, and those games are used to train a value network that evaluates positions.
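To illustrate the "policy gradient" step in isolation, here is a minimal REINFORCE-style sketch on a two-armed bandit (our own toy setup, not AlphaGo's training code): actions that happen to earn reward have their probability pushed up.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # logits of a softmax policy over 2 actions
true_reward = np.array([0.2, 0.8])  # hidden payoff probabilities (toy problem)
lr = 0.1

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()      # softmax policy
    a = rng.choice(2, p=probs)                       # sample an action from the policy
    r = float(rng.random() < true_reward[a])         # stochastic reward from the environment
    grad_log_pi = -probs                             # gradient of log pi(a) w.r.t. the logits
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi                    # REINFORCE: reward-weighted update
print(probs)   # should put most probability on the better arm (index 1)
```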
Multi-agent collaboration
In fact, almost all of AI's successful applications so far are based on a single agent. For humans, however, collaboration is a very big part of intelligence: one big reason our Homo sapiens ancestors came to dominate the earth is that they learned how to cooperate at scale, and to do so very flexibly. Imagine a future full of AI agents: can they learn, on their own, human-level collaborative intelligence?
We use the term Artificial Collective Intelligence, and it has enormous implications both now and in the future. Take the Taobao mobile app: the vast majority of its traffic is now driven by algorithmic recommendation, and whether it is advertising or search, there is an AI agent behind it. At present, each of these agents optimizes on its own, promoting its own goods.
What we are considering is, for example, whether modules on the Taobao home page such as "Love Shopping" and "Guess You Like" could jointly decide which products to show, so as to give users the best experience and maximize the value of the platform. The future may well be an algorithmic economy, an AI economy, full of such agents; for instance, the streets may be full of self-driving cars, and they too will need to cooperate to maximize the efficiency of traffic as a whole.
Recently, in StarCraft micro-combat scenarios, we proposed a multi-agent bidirectionally coordinated network. You can download our paper for the details. This work was done in collaboration with UCL to explore and address the problem of multi-agent cooperation.
This is the structure of our proposed BiCNet (Multiagent Bidirectionally Coordinated Net), which follows a classic actor-critic layout and is divided into two parts. The left part is the policy network: the StarCraft environment is abstracted from the bottom up, including map information, the enemy units' health and attack power, and our own units' information, into a shared state. Through a bidirectional RNN, the agents communicate fully in both directions, and then each agent works out its own action, such as whom to attack and where. The policy network on the left decides what action to take given the current state; the value network on the right takes the actions produced by the policy together with the abstracted state and predicts an approximate Q value. When the actions are executed, the environment gives corresponding feedback, such as rewards, indicating whether the step was good or not; the reward then flows back through the network on the right to backpropagate and update the parameters.
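The following schematic sketch (our own simplification, not the paper's code) shows the core idea: per-agent features pass through a shared bidirectional RNN so that information flows both ways along the agent sequence, and each agent then emits its own action.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, feat_dim, hid, n_actions = 3, 8, 16, 5

# Shared parameters: every agent uses the same weights, so agents can die or appear freely.
W_in  = rng.normal(0, 0.1, (feat_dim, hid))
W_rec = rng.normal(0, 0.1, (hid, hid))
W_out = rng.normal(0, 0.1, (2 * hid, n_actions))

def rnn_pass(states, reverse=False):
    """One directional pass of a simple shared RNN over the agent sequence."""
    order = reversed(range(n_agents)) if reverse else range(n_agents)
    h = np.zeros(hid)
    outs = [None] * n_agents
    for i in order:
        h = np.tanh(states[i] @ W_in + h @ W_rec)
        outs[i] = h
    return outs

# Per-agent features abstracted from the shared game state (toy random numbers here).
agent_states = rng.normal(size=(n_agents, feat_dim))

fwd = rnn_pass(agent_states)                 # information flows agent 1 -> N
bwd = rnn_pass(agent_states, reverse=True)   # and agent N -> 1
for i in range(n_agents):
    logits = np.concatenate([fwd[i], bwd[i]]) @ W_out
    action = int(np.argmax(logits))          # each agent picks its own action
    print(f"agent {i} -> action {action}")
```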
This network has a few nice design properties:
First, the network scales very well. Casualties occur during a StarCraft battle, and the network still works normally after an agent dies, and likewise when new agents keep appearing. Because the parameters of the bidirectional network are shared across agents, the number of agents does not matter.
Second, the bidirectional network in the middle strikes a good balance between efficiency and performance. A fully connected network would require too much computation; with a bidirectional network, agents earlier in the sequence pass along what actions they intend to take, agents later in the sequence pass their intentions back, and after combining both directions each agent works out what strategy to adopt.
In fact, when designing algorithms or models, our Cognitive Computing Lab refers to current research results in neuroscience. We believe research in cognitive psychology, the brain, brain-inspired computing and neuroscience can benefit artificial intelligence in two ways.
The first benefit is inspiration. When you think about a specific problem or scenario, you run into difficulties that perhaps nobody has solved before; if neuroscience contains a similar structure or algorithm from that other discipline, it can help you solve your problem and inspire new algorithms.
The second is validation: when you design an algorithm, if neuroscience has a similar structure, there is a good chance the algorithm will work.
In fact, our actor-critic network also has a counterpart in the human brain. On the left is the actor-critic network, and on the right is our brain. In the brain, the striatum handles both the actor and the critic: the ventral striatum is responsible for the critic part, and the dorsal striatum for the actor part. When a reward arrives, the brain computes the difference between the actual reward and the expected reward, and this difference, delivered in the form of dopamine, influences the actor so that it adjusts and produces a better action next time.
In our algorithm, dopamine corresponds to the TD error, that is, the error in the reward we predicted, which is a nice correspondence.
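In code, that dopamine-like signal is just the TD error: the gap between what actually happened (the reward plus the discounted value of the next state) and what the critic expected. A minimal sketch with a toy tabular critic:

```python
# TD error: the "dopamine-like" signal that tells the critic how wrong its prediction was.
gamma = 0.99                  # discount factor
alpha = 0.1                   # critic learning rate
V = {"s0": 0.0, "s1": 0.0}    # toy tabular value function

def td_update(s, r, s_next):
    td_error = r + gamma * V[s_next] - V[s]   # actual outcome vs expected value
    V[s] += alpha * td_error                  # critic moves toward the observed target
    return td_error                           # in actor-critic, this also scales the actor update

print(td_update("s0", 1.0, "s1"))   # positive error: the outcome was better than expected
```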
Experimental platform and results
To implement such an algorithm model, we built an experimental platform based on Facebook's TorchCraft, which wraps StarCraft 1 together with Torch. However, we are used to TensorFlow and Python, so we added a layer on top and exposed the whole architecture through an OpenAI-Gym-style standard interface. If you are interested, you can give it a try.
The framework is divided into two parts, corresponding to the two sides of reinforcement learning:
On the left is the Environment, which is the game StarCraft itself, including the engine and a DLL. The DLL is based on BWEnv, an officially approved DLL; on top of BWEnv we encapsulate the internal state and the instructions, and this side acts as a server.
On the right is the Agent, which is a client, so you can connect many agents to play the game. The Environment streams the data of each frame to the Agent; the Agent abstracts each frame into a state and feeds that state to the model for learning or prediction. In turn, the model predicts actions, which are encapsulated into instructions, for example attack or run away, and sent back to the StarCraft Environment. This is the experimental platform we built for StarCraft.
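A hedged sketch of what such a wrapper can look like: the class and method names below are invented for illustration and do not match the real TorchCraft/BWEnv API, but the reset/step shape is the standard Gym-style interface we expose to the Agent side.

```python
class StarCraftCombatEnv:
    """Hypothetical Gym-style wrapper around a TorchCraft-like client (all names illustrative)."""

    def __init__(self, client):
        self.client = client                           # socket client talking to the BWEnv server

    def reset(self):
        frame = self.client.restart_episode()          # assumed client call, not the real API
        return self._frame_to_state(frame)

    def step(self, actions):
        self.client.send_commands(self._actions_to_commands(actions))  # e.g. attack / move
        frame = self.client.receive_frame()            # assumed client call
        state = self._frame_to_state(frame)
        reward = self._reward(frame)                   # e.g. damage dealt minus damage taken
        done = frame.get("battle_over", False)
        return state, reward, done

    def _frame_to_state(self, frame):
        # Abstract raw frame data (positions, health, ...) into the model's state.
        return [(u["x"], u["y"], u["hp"]) for u in frame.get("units", [])]

    def _actions_to_commands(self, actions):
        return actions                                 # map model actions to game commands

    def _reward(self, frame):
        return frame.get("reward", 0.0)
```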
Here are some results from our platform, summarized as five observable kinds of intelligent behavior.
First, coordinated movement.
In this example, three Marines fight a "super Zergling", a Zergling we edited to have very high health so it cannot be killed quickly. Early in training, the three Marines have not learned to cooperate, so they frequently collide with one another. After tens of thousands of rounds of training, they gradually learn to coordinate their movement so that they no longer bump into each other.
Second, fighting while retreating.
This is a combination of hit-and-run skills: for example, three Marines fight a Zealot and use their ranged-attack advantage to destroy the enemy.
Third, cover attack
In the same scenario of three Marines fighting a Zealot, the Marines also retreat while attacking, but here some of the Marines may deliberately attract or block the enemy, letting the other two seize the time gap to destroy it. What is very interesting is that this kind of collaboration does not appear in every case: if the environment is not challenging enough, simple hit-and-run is sufficient; but when we make the environment harsher, for example when the enemy's numbers increase from 3 to 4 or its health from 210 to 270, the agents learn this more advanced form of cover-attack coordination.
Fourth, focused fire.
This example is 15 Marines against 16 Marines: how would you win? The learned strategy is that 3 or 4 Marines automatically form a group and concentrate their fire, killing one enemy first and then moving on to the next. They focus fire, but not to the extreme of all 15 shooting one target; they spread their fire a little. In the end our side may have 6 Marines left while the other side is wiped out. This grouping is something they learned automatically after many rounds of training.
Fifth, it is not only Marines that learn to cooperate with one another; multiple unit types, that is, heterogeneous agents, can also cooperate.
In this example, two Dropships each carry a tank and fight an Ultralisk. Two tanks fighting an Ultralisk head-on would certainly lose, but with the Dropships coordinating, when the Ultralisk attacks one tank, its Dropship picks that tank up in time so the attack misses, while the other Dropship quickly drops its tank to attack the Ultralisk. In the end the Ultralisk may be wiped out without gaining any advantage. This is a demonstration of the collaboration BiCNet learns, building on our earlier work.
Some thoughts on the future
StarCraft is not only about micro battles; there are also macro strategies. We have some thoughts about how to combine macro and micro to play the full game. They may not be fully mature, but we can discuss them together.
Set a Goal for each level
To play a full game, plain single-level reinforcement learning probably cannot solve the problem, because the action space is too large. A more natural approach is hierarchical learning: perhaps the upper level does strategic planning, and the lower levels handle combat, economic development, pathfinding, map analysis and so on. Level by level, each level sets a goal for the level below it, much as humans decompose a problem.
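A toy sketch of the "one goal per level" idea, with placeholder policies (nothing here is from our actual system): the upper level re-plans on a slow timescale and hands a goal down, and the lower level only tries to satisfy that goal.

```python
import random

def high_level_policy(global_state):
    """Strategic layer: picks an abstract goal such as 'expand', 'attack', 'defend'."""
    return random.choice(["expand", "attack", "defend"])

def low_level_policy(goal, local_state):
    """Tactical layer: only tries to satisfy the goal handed down from above."""
    if goal == "attack":
        return "move_towards_enemy"
    if goal == "expand":
        return "build_base"
    return "hold_position"

global_state, local_state = {}, {}
for step in range(10):
    if step % 5 == 0:                         # the upper level re-plans at a slower timescale
        goal = high_level_policy(global_state)
    action = low_level_policy(goal, local_state)
    print(step, goal, action)
```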
Imitation Learning
Another direction we think is worth studying is imitation learning. The AlphaGo example just mentioned is also imitation learning: the first step is to learn a decent policy through supervised learning, and then improve it through self-play. In StarCraft, for example, when two of our Marines fight a Zergling, a good strategy is for one Marine to kite the Zergling in circles while the other stands near the center and shoots it; that way neither Marine loses a single point of health. However, this strategy is hard to learn from scratch, so we first manually controlled the Marine to run in circles; after a few demonstrated steps the Marine learned to circle around with the Zergling on its own, while the other Marine chased behind and fired. This kind of guided exploration is very effective.
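A minimal behaviour-cloning sketch of the "show it how first" idea, on synthetic data made up for illustration: fit a simple classifier to (state, action) pairs recorded from a scripted teacher, then use it to initialize the policy before reinforcement learning continues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "demonstrations": state = (my_hp, dist_to_enemy), action 1 = retreat, 0 = attack.
# The scripted teacher retreats when its HP is low and the enemy is close (sum below a threshold).
X = rng.uniform(0, 1, size=(500, 2))
y = (X[:, 0] + X[:, 1] < 0.8).astype(int)

Xb = np.hstack([X, np.ones((len(X), 1))])          # add a bias feature
W = np.zeros((3, 2))                               # softmax-regression policy

for _ in range(1000):                              # behaviour cloning = plain supervised learning
    logits = Xb @ W
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    W -= 0.5 * Xb.T @ (probs - np.eye(2)[y]) / len(Xb)   # cross-entropy gradient step

test = np.array([[0.2, 0.3, 1.0], [0.9, 0.9, 1.0]])  # low-HP close fight vs healthy and far away
print(np.argmax(test @ W, axis=1))                   # expect [1 (retreat), 0 (attack)]
```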
Continual Learning
Continual learning: if we want to move towards general intelligence, this is a problem we cannot get around.
Continual learning is like human learning: we learn to walk, then later we learn to speak, and learning to speak does not make us forget how to walk. But in some StarCraft scenarios, when a neural network learns task B, it may forget what it learned for task A.
For example, we first trained a Marine to fight a Zergling controlled by the computer's weak built-in AI. After learning to fight while retreating, the Marine could reliably kill the Zergling. We then switched to training the Zergling against the computer-controlled Marine; the Zergling learned that its best strategy is to chase and bite relentlessly, never hesitating, because hesitation gets it killed, so it became a vicious dog that keeps chasing the Marine. Then we trained the Marine and the Zergling against each other at the same time and reached an equilibrium: the Marine keeps running and the Zergling keeps chasing (StarCraft is designed to be very balanced). But the Marine had now learned only to run, and when we put it back into the original environment against a computer-controlled Zergling, we found that it still just ran all the time; it no longer fought while retreating.
As you can see, after learning A and then learning B, it forgets A, which is a very big challenge for general artificial intelligence. Recently DeepMind published a paper on related work in this direction, which is promising and worth a look. Their algorithm is called EWC (Elastic Weight Consolidation).
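For reference, EWC adds a quadratic penalty that discourages moving the weights that mattered for task A while the network trains on task B. A schematic version of just that penalty term (a sketch of the published idea, not DeepMind's code):

```python
import numpy as np

def ewc_penalty(theta, theta_A, fisher_A, lam=1000.0):
    """EWC regularizer: penalize moving weights that were important for task A.

    theta    -- current parameters while training on task B
    theta_A  -- parameters learned on task A
    fisher_A -- diagonal Fisher information, i.e. how important each weight was for A
    """
    return 0.5 * lam * np.sum(fisher_A * (theta - theta_A) ** 2)

# Total loss on task B would be: loss_B(theta) + ewc_penalty(theta, theta_A, fisher_A)
theta_A  = np.array([1.0, -0.5, 2.0])
fisher_A = np.array([0.9, 0.01, 0.5])   # the middle weight barely mattered for task A
theta    = np.array([1.2, 0.5, 2.0])
print(ewc_penalty(theta, theta_A, fisher_A))
```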
Introducing Memory
Last but not least, among the challenges mentioned earlier is long-term planning. For long-term planning, we think a better approach is to introduce a memory mechanism into reinforcement learning, which is also a popular direction at the moment, for example Memory Networks and the DNC. The problem to solve is: what should we remember during learning so that we can achieve the maximum reward?
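The common building block behind Memory Networks and the DNC is a content-based read: compare a query against stored memory slots and return a softmax-weighted blend. A tiny sketch of that operation (far simpler than either published architecture):

```python
import numpy as np

def read_memory(memory, query):
    """Content-based memory read: attend to slots similar to the query (the shared
    building block of Memory Networks / DNC, greatly simplified here)."""
    scores = memory @ query                            # similarity of each slot to the query
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax attention over slots
    return weights @ memory                            # blended recollection

memory = np.array([[1.0, 0.0, 0.0],     # e.g. "enemy seen at the left expansion"
                   [0.0, 1.0, 0.0],     # e.g. "our tanks are sieged on the ridge"
                   [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])       # what the agent currently wants to recall
print(read_memory(memory, query))
```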
So the main point of today's talk is that StarCraft offers an extremely rich set of scenarios for general artificial intelligence, or cognitive intelligence, and there can be many very interesting topics within it. I have listed just four directions, but there are many, many more.
We welcome anyone who is interested to come and play the game seriously with us. Thank you!
About the speaker
Haitao Long (also known as Deheng) joined Alibaba in 2013. He is currently a technical expert at the Cognitive Computing Lab, focusing on innovation in deep learning, reinforcement learning and neuroscience. During his time at Alibaba he has also been responsible for the architecture design of the search express business, including the next-generation offline system, the online engine and the indexing kernel.
Thanks to Du Xiaofang for correcting this article.