About the author: Liang Dingan works in Tencent's Social Network Operations Department and is responsible for the operation and maintenance of social platforms and value-added services. Liang is also an expert committee member of the Open Operation and Maintenance Alliance, a Tencent Cloud evangelist, and an operations lecturer at Tencent Classroom.

The SNG Social Network Operations Department manages nearly 100,000 Linux servers that support Tencent's massive social businesses and user base, including 247 million daily active QQ users, 596 million monthly active Qzone users (data source: Tencent Q2 2016 earnings report), and many other large services with tens of millions of online users.

As the volume of social UGC business keeps growing, how can we support business growth while still keeping the growth of operating costs under control? Solving the operating cost problem has become an urgent task for the operations team. After continuous exploration, we are glad to have found, over the past two years, an effective approach to equipment cost management: optimizing equipment costs through fine-grained capacity management. For two consecutive years it has saved the company hundreds of millions in operating costs each year.

As is well known, raising equipment utilization is a common and effective way to control operating costs in operations work. But how do we define suitable measurement and management methods for different usage scenarios and different types of equipment? Here are six methods distilled from Tencent's operations practice:

Method 1: Performance management

When measuring server usage, CPU utilization is the first concern. With multi-core, hyper-threaded CPUs now widespread, uneven load across CPU cores has become a major drain on equipment operating costs in massive O&M scenarios.

To find and fix unbalanced load across the cores of a multi-core CPU, we proposed a metric called the CPU range (the gap between the extremes): CPU(range) = CPU(max) - CPU(min). If CPU(range) > 30%, the device is using its CPU unreasonably and needs to be optimized and rectified. (Note: for optimization methods, refer to multi-queue NIC optimization and CPU affinity, which this article does not cover.)
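As an illustration of the CPU range (extreme) metric above, here is a minimal sketch. It assumes the per-core usage percentages have already been collected by some monitoring agent; the function names and the sample values are hypothetical, and only the 30% threshold comes from the text.

```python
from typing import Sequence

# Threshold from the text: a range above 30 percentage points is unreasonable.
CPU_RANGE_THRESHOLD = 30.0


def cpu_range(per_core_usage: Sequence[float]) -> float:
    """CPU(range) = CPU(max) - CPU(min) across the cores of one device."""
    if not per_core_usage:
        raise ValueError("need at least one per-core usage sample")
    return max(per_core_usage) - min(per_core_usage)


def is_unbalanced(per_core_usage: Sequence[float],
                  threshold: float = CPU_RANGE_THRESHOLD) -> bool:
    """True when the device should be flagged for rectification."""
    return cpu_range(per_core_usage) > threshold


# Hypothetical samples: usage percentage of each core on two devices.
balanced = [42.0, 45.5, 40.0, 47.0]
skewed = [92.0, 15.0, 18.0, 20.0]  # e.g. one core handling all NIC interrupts

print(cpu_range(balanced), is_unbalanced(balanced))
print(cpu_range(skewed), is_unbalanced(skewed))
```

A device like `skewed` looks lightly loaded on average, yet one core is nearly saturated, which is exactly the case the metric is meant to surface.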

Similarly, for module capacity management in a distributed cluster, our O&M standards require consistency management across a module, including capacity consistency. For this we also proposed a module capacity difference metric: module CPU difference = CPU usage of the device with the highest CPU usage - CPU usage of the device with the lowest CPU usage. If the CPU usage of devices in the same module differs by more than 30%, the module's capacity is not being used properly and needs to be optimized. (Note: in general, this is caused by inconsistent management of configuration, weights, scheduling and so on.)
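The module-level check can be sketched the same way. This is a hedged example, not the tooling used at Tencent: the IP addresses and usage figures are made up, and the input is assumed to be one average CPU usage value per device in the module.

```python
from typing import Dict, Tuple

MODULE_DIFF_THRESHOLD = 30.0  # percentage points, from the text


def module_cpu_difference(device_cpu: Dict[str, float]) -> Tuple[float, str, str]:
    """Return (difference, hottest device, coldest device) for one module."""
    if not device_cpu:
        raise ValueError("module has no devices")
    hottest = max(device_cpu, key=lambda ip: device_cpu[ip])
    coldest = min(device_cpu, key=lambda ip: device_cpu[ip])
    return device_cpu[hottest] - device_cpu[coldest], hottest, coldest


# Hypothetical module: device IP -> average CPU usage in percent.
module = {"10.0.0.1": 71.0, "10.0.0.2": 35.0, "10.0.0.3": 38.5}

diff, hottest, coldest = module_cpu_difference(module)
if diff > MODULE_DIFF_THRESHOLD:
    print(f"inconsistent module: {hottest} vs {coldest}, difference {diff:.1f}")
```

Reporting which devices are the extremes, not just the difference, makes it easier to check their configuration, weight and scheduling settings first.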

Method 2: Density management method

Whether memory is being used reasonably is hard to judge from memory usage alone. For memory-heavy devices we therefore proposed a density-based control method built on access density. Access density is calculated per device, and the devices under one module should have consistent memory access density; otherwise the module falls into the uneven-load category of consistency rectification. A statistical analysis of access density across all memory-type modules gives an average load level. Depending on the actual needs of capacity management, we can then either raise that average level or optimize the modules that fall below it, which achieves the goal of optimizing equipment cost. The density management method also applies to SSDs. (Note: access density is affected by the size of service request packets, but in massive O&M scenarios this can be ignored in some cases.)
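The density analysis described above might look like the following sketch. The article does not reproduce its access density formula, so this example assumes access density means requests per second per GB of memory; that definition, the module names and all figures are assumptions for illustration only.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (requests per second, memory in GB) for each device; hypothetical figures.
Device = Tuple[float, float]


def access_density(requests_per_sec: float, memory_gb: float) -> float:
    """Assumed definition: requests per second served per GB of memory."""
    if memory_gb <= 0:
        raise ValueError("memory_gb must be positive")
    return requests_per_sec / memory_gb


def module_density(devices: List[Device]) -> float:
    """Average access density over the devices of one module."""
    return mean(access_density(rps, mem) for rps, mem in devices)


modules: Dict[str, List[Device]] = {
    "module_a": [(12000, 64), (11500, 64)],
    "module_b": [(3000, 64), (3200, 64)],
    "module_c": [(9000, 128)],
}

densities = {name: module_density(devs) for name, devs in modules.items()}
fleet_average = mean(densities.values())

# Modules below the fleet-wide average are candidates for optimization.
below_average = sorted(n for n, d in densities.items() if d < fleet_average)
print(below_average)
```

The same per-device densities could also feed a consistency check inside a module, in the spirit of the CPU difference metric in Method 1.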

Method 3: Feature management

The feature management method resembles QPS management of functional modules. It measures whether business logic performs as well as it could in a specific business scenario, and draws conclusions by comparing QPS across similar application scenarios in different products. The approach varies with the business logic, so this article illustrates it with examples.

In mobile Internet O&M, some scenarios call for unconventional capacity measures, and for modules that are unusual but large in scale we use feature management. For example, QQ, Qzone, Xinge (Tencent's push service, literally "carrier pigeon") and other services all have long-connection modules. In this scenario CPU usage is low but memory usage is high. The number of long connections maintained per GB of memory can therefore be used to compare QQ, Qzone, Xinge and other services, and to push low-performing business programs toward rectification and optimization.
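A minimal sketch of the long-connection comparison follows. The service names and figures are placeholders rather than real measurements, and the rule of flagging anything below half of the best performer is an assumption made for the example.

```python
from typing import Dict, List, Tuple


def connections_per_gb(connections: int, memory_gb: float) -> float:
    """Long connections maintained per GB of memory."""
    if memory_gb <= 0:
        raise ValueError("memory_gb must be positive")
    return connections / memory_gb


def rank_services(stats: Dict[str, Tuple[int, float]]) -> List[Tuple[str, float]]:
    """Sort services from most to fewest connections per GB."""
    scored = [(name, connections_per_gb(conns, mem))
              for name, (conns, mem) in stats.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def lagging_services(stats: Dict[str, Tuple[int, float]],
                     ratio: float = 0.5) -> List[str]:
    """Services below `ratio` of the best performer (assumed rule)."""
    ranked = rank_services(stats)
    best = ranked[0][1]
    return [name for name, score in ranked if score < best * ratio]


# Placeholder data: service -> (long connections held, memory used in GB).
sample = {
    "service_a": (6_400_000, 64.0),
    "service_b": (2_560_000, 64.0),
    "service_c": (5_120_000, 64.0),
}

print(rank_services(sample))
print(lagging_services(sample))
```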

As another example, live streaming requires real-time online transcoding of streamers' video, and different development teams may use different transcoding solutions. The same feature management method can be used to measure whether online transcoding performance has room for optimization.

Method 4: Fragmentation management

Tencent's social network business has a long history. From the "big brother" QQ to the "rising star" Penguin FM, its business types cover IM, UGC, multimedia, reading, animation, games, live streaming and other mainstream forms of social entertainment. There are hit products and long-tail products; some modules handle billions of requests per second, others only dozens. The fragmentation management method is designed for small clusters with low request volume. Because of distributed and high-availability O&M requirements, the minimum deployment unit in production is usually two devices. In the physical machine era, this made low-traffic modules a serious waste of cost. With the widespread adoption of virtualization, however, the problem is easily solved: virtualization slices hardware resources into fragments, so that small modules can balance equipment cost against high availability.

In the same spirit as using virtualization to improve the utilization of fragmented resources, we also have the PaaS platform "Honeycomb", built on SPP, the standard development framework of Tencent's social network business, to solve capacity management for small businesses and small modules. (Honeycomb will be covered in a follow-up article.)

Method 5: Bucket management

Tencent's platform-level businesses, such as QQ, Qzone and QQ Music, have largely adopted a SET (special area) disaster recovery architecture with three sites that are all active, which is multi-site active-active in the true sense. (The special session of the 923 Shanghai O&M Conference will include a talk on the O&M practice of remote disaster recovery. If you are interested, we sincerely invite you to attend.) In operating platform-level businesses, our O&M standards call for grouping the modules that together implement a business scenario into a SET, which reduces the number of objects to operate. Different social scenarios yield different types of SETs. Our automated O&M capabilities extend from modules to SETs, so O&M can easily deploy a SET at another site to achieve remote disaster recovery and fault tolerance for that service scenario.

As for SET capacity management: the user and request volume of a platform-level business does not grow explosively, so to operate a SET we must quantify it with indicators such as the number of users and requests. We therefore give each SET a quantifiable indicator; in our scenarios this is, for example, the number of online users or the volume of core requests, depending on what the SET is for. Stress testing then yields the most reasonable capacity value for a single SET. That value follows the bucket principle, which gives this method its name. A SET consists of multiple modules (SET = bucket, module = stave) that together support a certain number of users. The water level of a bucket is set by its shortest stave, so the maximum capacity of a SET is set by the capacity of its lowest-performing module.
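The bucket principle reduces to taking a minimum over the modules of a SET. The sketch below assumes stress testing has produced, for each module, the number of online users it can support; the module names and figures are hypothetical.

```python
from typing import Dict, Tuple


def set_capacity(module_capacity: Dict[str, int]) -> Tuple[str, int]:
    """Return (bottleneck module, SET capacity).

    The SET (bucket) can hold no more than its weakest module (shortest stave).
    """
    if not module_capacity:
        raise ValueError("a SET needs at least one module")
    bottleneck = min(module_capacity, key=lambda name: module_capacity[name])
    return bottleneck, module_capacity[bottleneck]


# Hypothetical stress-test results: module -> online users it can support.
stress_test_results = {
    "access_layer": 6_500_000,
    "logic_layer": 5_000_000,
    "storage_layer": 8_000_000,
}

bottleneck, capacity = set_capacity(stress_test_results)
print(f"SET capacity is {capacity} users, limited by {bottleneck}")
```

Capacity above the bottleneck in the other modules is effectively idle, so either the weakest module is scaled up or the others are scaled down toward it.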

The number of concurrent online users of Tencent's platform-level businesses is relatively stable, so the redundant capacity needed for nationwide multi-site active-active deployment can be anticipated and planned. In other words, the number of SETs to deploy can be quantified in advance. Combined with automated deployment, scheduling schemes, flexible strategies and the lossy service capability of the business, we can achieve multi-site active-active at a very reasonable cost.

For example, suppose we have 10 million concurrent online users in total and that number is relatively stable. We can plan three SETs that each support 5 million online users, and use the scheduling capability of the business architecture to spread load evenly across the three. In a disaster where one SET is unavailable, the other two can fully take over its load. In the extreme case where two SETs are unavailable, lossy service is enabled. With SETs quantified this way, business O&M can flexibly adjust SET capacity levels according to cost management needs and arrive at the most cost-effective high-availability architecture.
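The arithmetic of this example can be checked with a short sketch. The figures are the ones from the text (10 million users, three SETs of 5 million each); the function names and the "full" / "lossy" / "down" labels are made up for illustration.

```python
def surviving_capacity(set_capacity_users: int, total_sets: int,
                       failed_sets: int) -> int:
    """Users that the remaining SETs can support together."""
    if not 0 <= failed_sets <= total_sets:
        raise ValueError("failed_sets must be between 0 and total_sets")
    return set_capacity_users * (total_sets - failed_sets)


def service_mode(total_users: int, set_capacity_users: int,
                 total_sets: int, failed_sets: int) -> str:
    """'full' if every user can be served, 'lossy' if only some, else 'down'."""
    remaining = surviving_capacity(set_capacity_users, total_sets, failed_sets)
    if remaining >= total_users:
        return "full"
    return "lossy" if remaining > 0 else "down"


TOTAL_USERS = 10_000_000   # concurrent online users, from the text
SET_CAPACITY = 5_000_000   # users supported by one SET, from the text
TOTAL_SETS = 3

for failed in range(TOTAL_SETS + 1):
    mode = service_mode(TOTAL_USERS, SET_CAPACITY, TOTAL_SETS, failed)
    print(f"{failed} SET(s) unavailable -> {mode}")
```

With all three SETs healthy and load spread evenly, each SET carries about 3.3 million users, roughly two thirds of its capacity; that headroom is the cost of tolerating the loss of one site.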

Method 6: Hardware selection method

Focus on hardware bottlenecks and upgrade hardware to reduce per-machine operating costs. For example, UGC storage (QQ photo albums, video) used to rely on large numbers of 2 TB hard disks. Once 4 TB and 8 TB disks became available in volume at an acceptable cost, upgrading disk capacity in time effectively raised the storage per machine, and at scale a small price bought large savings. As another example, in picture-based or video-based social networking, diverse product features bring a lot of compute-heavy logic, such as face recognition and pornography detection. Here, choosing GPU devices instead of CPU devices is also an effective way to improve performance. (This method is especially suitable for UGC businesses whose storage only grows, such as Weiyun (micro cloud), network disks, picture storage and video storage.)
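To make the scale effect of a disk upgrade concrete, here is a rough sketch. Every figure in it is hypothetical: the total data volume and the number of disk slots per machine are invented, and replication, RAID overhead and free-space headroom are ignored.

```python
import math


def machines_needed(total_tb: float, disk_tb: float,
                    disks_per_machine: int) -> int:
    """Machines required to hold total_tb, ignoring replication and headroom."""
    if disk_tb <= 0 or disks_per_machine <= 0:
        raise ValueError("disk size and disk count must be positive")
    per_machine_tb = disk_tb * disks_per_machine
    return math.ceil(total_tb / per_machine_tb)


TOTAL_TB = 10_000          # hypothetical data volume
DISKS_PER_MACHINE = 12     # hypothetical chassis

with_2tb = machines_needed(TOTAL_TB, 2, DISKS_PER_MACHINE)
with_8tb = machines_needed(TOTAL_TB, 8, DISKS_PER_MACHINE)
print(with_2tb, with_8tb)
```

Under these assumptions the same data fits on roughly a quarter of the machines, which also cuts rack space, power and maintenance effort, not only disk spend.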

Postscript:

The six capacity management methods above, along with others not covered here, have let us move forward steadily and sustainably in the social UGC business, where user data only grows and never shrinks. Equipment cost management also involves many detailed technical measures and business code optimizations. This article only sets out the thinking behind capacity management from an O&M perspective, in the hope that this modest contribution prompts better ideas and is helpful to our peers. Optimizing bandwidth cost management can bring even greater savings, but it involves more technical points and methodology, so this article does not go into it.




This article is published by the Tencent Cloud technology community with the author's authorization. Please credit the source when reprinting. For more practical cloud computing articles, visit the Tencent Cloud technology community.