Good morning! Ali Sister has a math problem for you: there are 100,000 items of different sizes to be divided evenly among 10,000 boxes. How should they be distributed?
For someone as smart as you, that one is solvable. But what if the problem gets a little harder: the 100,000 items keep growing and shrinking, and there are many restrictions such as “mineral water can’t sit on top of potato chips”. How do you find the most balanced packing scheme within a few seconds?
Alibaba’s engineers face this kind of question every day. The server requirements of countless applications change from moment to moment, and new applications are constantly being added. How can these applications be distributed evenly across tens of thousands of machines with different specifications?
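To make the simple version of the puzzle concrete, here is a minimal Python sketch of one classic heuristic: sort the items from largest to smallest and always drop the next item into the box that is currently lightest. This is only an illustration of the "divide evenly" problem, not the method Alibaba uses; the function name `distribute_evenly` and the sample sizes are assumptions made for the example, and the heuristic ignores restrictions such as which goods may share a box.

```python
import heapq
import random


def distribute_evenly(sizes, num_boxes):
    """Greedy 'largest item first' heuristic (illustrative only).

    Places each item into the box with the smallest current load.
    Returns (loads, assignment), where assignment[i] is the box of item i.
    """
    # Min-heap of (current_load, box_index): the lightest box is always on top.
    heap = [(0, box) for box in range(num_boxes)]
    heapq.heapify(heap)

    assignment = [None] * len(sizes)
    # Visit item indices from the largest item to the smallest.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for item in order:
        load, box = heapq.heappop(heap)
        assignment[item] = box
        heapq.heappush(heap, (load + sizes[item], box))

    # Recompute the final load of every box from the assignment.
    loads = [0] * num_boxes
    for item, box in enumerate(assignment):
        loads[box] += sizes[item]
    return loads, assignment


if __name__ == "__main__":
    # Hypothetical data: 100,000 items with random sizes, 10,000 boxes.
    random.seed(0)
    sizes = [random.randint(1, 100) for _ in range(100_000)]
    loads, _ = distribute_evenly(sizes, 10_000)
    print("lightest box:", min(loads), "heaviest box:", max(loads))
```

The result is a good approximation rather than a guaranteed optimum, and it has to be recomputed whenever sizes change, which hints at why the harder, constantly changing version of the problem calls for something smarter.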
To free up human engineers and allocate computing resources better, “Da Ling”, Alibaba’s AI allocation officer for computing resources, took office on November 6, ready to meet the challenge of Tmall Double 11.
During the internship, Da Ling pushed the data center resource allocation rate above 90%, saved half of the servers for some businesses, and was able to lock onto abnormal machines within two seconds with a 94% hit rate.
Cutting repetitive manual work saves Alibaba half of its machines
Open the Mobile Taobao app and the home page shows commonly used modules such as “good goods” and “guess you like”. Previously, engineers manually assigned the number of servers for each module and monitored its performance. For them, the challenge was enormous.
“As the scale of Singles’ Day has grown year after year, this kind of work is no longer suitable for people to do,” said Zheng Nan, a senior search R&D expert at Alibaba. To this end, the Alibaba search team put a great deal of training and engineering work into Da Ling. During the internship, Da Ling completely replaced humans in the intelligent scheduling of the recommendation platform and doubled the resource allocation rate, which is equivalent to saving half of the machines.
“This algorithm can quickly produce an optimal deployment plan and move applications and data around based on traffic, ensuring that no machine sits idle,” Zheng said. “All we have to do is feed her data: the size of the data tables, the traffic volume and the current deployment plan. Then we can sit back with a cup of tea and watch her work. She can even clone a real online service and stress test it herself to check whether the plan really is the best one.”
Isolating abnormal machines takes only 2 seconds, with 94% accuracy
What happens if a machine in one of Alibaba’s global data centers fails and is not handled in time? During Tmall Double 11, nearly one million users might be unable to place orders.
To avoid this, a large number of engineers keep watch over the health of the cluster every Singles’ Day. When abnormal machines are found, they are isolated manually or even taken offline directly, which is commonly known in the industry as “killing machines”.
But from the moment an exception occurs, to the moment it is discovered, to the moment it is handled, the whole process can sometimes take several minutes. Ding Yu, a senior expert on Alibaba’s scheduling system, said: “We have already reached the limit of what people can do. Last year we began to explore artificial intelligence in search of a breakthrough, and we finally found a solution by using data algorithms to link uncertain factors such as time, load and service status.”
This year, Ding Yu’s team worked with Da Ling and reworked it in depth to detect abnormal machines more accurately and quickly. The algorithm collects 2.9 billion machine operating-status data points every day. In earlier tests it handled abnormal machines about 1,000 times per day, and during the promotion period its scheduling accuracy reached 94%, with each decision taking only 2 seconds. Once an abnormal machine is found, Da Ling strikes precisely and without hesitation, like a top “assassin”.
From warehouse to data center: AI is everywhere in Tmall Double 11
“In fact, Da Ling’s predecessor really was a warehouse manager, built just to solve the problem of packing parcels,” said Dr. Zhu Shenghuo, head of machine learning algorithms at Alibaba iDST.
A year ago, algorithm engineers from iDST and Cainiao Network developed a set of algorithms that, once a customer places an order, jointly consider each item’s attributes, quantity, weight and volume, and even the position in which it is placed. The algorithms quickly match the goods against the dimensions and load-bearing capacity of the available boxes, then work out how many boxes are needed and how to arrange the goods in them so as to use the least packaging. The whole calculation takes less than a second.
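The flavor of this packing problem can be shown with a small Python sketch. It uses first-fit decreasing, a textbook bin-packing heuristic, with two limits per box (volume and weight). It is not the iDST and Cainiao algorithm, which according to the article also reasons about item attributes and placement position; the names `Box` and `pack_order`, the single box type and the sample order are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A box with a volume limit and a load-bearing (weight) limit."""
    volume_cap: float
    weight_cap: float
    items: list = field(default_factory=list)
    used_volume: float = 0.0
    used_weight: float = 0.0

    def fits(self, volume, weight):
        # An item fits only if both limits are still respected.
        return (self.used_volume + volume <= self.volume_cap
                and self.used_weight + weight <= self.weight_cap)

    def add(self, name, volume, weight):
        self.items.append(name)
        self.used_volume += volume
        self.used_weight += weight


def pack_order(items, volume_cap, weight_cap):
    """First-fit decreasing: items is a list of (name, volume, weight).

    Items are taken from largest to smallest volume and placed into the
    first open box that can hold them; a new box is opened otherwise.
    Geometry and placement position are deliberately ignored here.
    """
    boxes = []
    for name, volume, weight in sorted(items, key=lambda it: it[1], reverse=True):
        if volume > volume_cap or weight > weight_cap:
            raise ValueError(f"{name} does not fit even in an empty box")
        for box in boxes:
            if box.fits(volume, weight):
                box.add(name, volume, weight)
                break
        else:
            # No open box can take the item, so open a new one.
            box = Box(volume_cap, weight_cap)
            box.add(name, volume, weight)
            boxes.append(box)
    return boxes


if __name__ == "__main__":
    # Hypothetical order: (name, volume, weight) in arbitrary units.
    order = [("water", 6, 6), ("chips", 5, 1), ("soap", 4, 2), ("tea", 3, 1)]
    for i, box in enumerate(pack_order(order, volume_cap=10, weight_cap=8)):
        print(i, box.items, box.used_volume, box.used_weight)
```

Treating volume as a single number keeps the sketch short, but real boxes have three dimensions, which is where placement position starts to matter.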
After AI stepped in, Cainiao’s warehouses used more than 5% less packaging material than before. What does that mean in practice? Based on the 467 million parcels generated on Tmall’s Singles’ Day in 2015, the technology could save about 23 million boxes in a single day (5% of 467 million is roughly 23 million).
A world-class puzzle: how do you minimize the surface area of a container when packing different items into it?
“Based on the same idea, we took this algorithm to the data center and developed Da Ling,” Zhu said. Da Ling’s work begins by building deep learning and online learning models on cluster monitoring data, so that the current and future state of every machine and application in the cluster is known. On that basis, by applying reinforcement learning, combinatorial optimization and other techniques, Da Ling can learn and make judgments in a complex environment and take a series of intelligent decisions, such as peak staggering and defragmentation, to optimize the cluster’s resource allocation rate and stability globally.
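The idea of staggering peaks can be illustrated with a short Python sketch: given a predicted load curve for each application, put applications whose busy hours differ on the same machine, so that the combined peak stays low. The sketch uses a simple greedy rule as a stand-in; it is not reinforcement learning and not Da Ling’s actual decision logic. The function name `place_with_staggered_peaks` and the sample load curves are assumptions made for the example.

```python
def place_with_staggered_peaks(app_loads, num_machines):
    """Greedy peak-staggering placement (illustrative only).

    app_loads maps an application name to its predicted load per time slot;
    all curves must have the same length. Each application goes to the
    machine whose combined peak load would be lowest after adding it.
    Returns (placement, peaks): machine index per app, peak load per machine.
    """
    slots = len(next(iter(app_loads.values())))
    machine_load = [[0.0] * slots for _ in range(num_machines)]
    placement = {}

    # Handle the applications with the highest peaks first.
    ordered = sorted(app_loads.items(), key=lambda kv: max(kv[1]), reverse=True)
    for name, load in ordered:
        def peak_after(machine):
            # Peak of the machine's load curve if this app were added to it.
            return max(a + b for a, b in zip(machine_load[machine], load))

        best = min(range(num_machines), key=peak_after)
        machine_load[best] = [a + b for a, b in zip(machine_load[best], load)]
        placement[name] = best

    return placement, [max(curve) for curve in machine_load]


if __name__ == "__main__":
    # Hypothetical predicted loads over two time slots: (daytime, night).
    apps = {
        "day_a": [8, 2],
        "day_b": [7, 1],
        "night_a": [2, 8],
        "night_b": [1, 7],
    }
    placement, peaks = place_with_staggered_peaks(apps, num_machines=2)
    print(placement, peaks)
```

In this toy case each machine ends up hosting one daytime-heavy and one night-heavy application, so neither machine sees both peaks at once.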
The original post was published on November 7, 2017
This article comes from “Ali Technology”, a partner of the cloud community. For more information, follow the “Ali Technology” WeChat official account.