Author: Tencent Technology Engineering official account

WeChat Moments consists of pictures and videos. Pictures are characterized by a high request volume and heavy consumption of computing resources, while videos mainly consume bandwidth. Data in Moments is stored permanently, so as the service grows rapidly, the consumption of storage capacity, bandwidth, and equipment grows with it. The usage spikes on major holidays add to this consumption and put great pressure on operations and maintenance (O&M) personnel.

Holiday assurance has three aspects. Software assurance means optimizing and evaluating programs and business logic to reduce load. Hardware assurance mainly means evaluating and expanding bandwidth and machine capacity. Flexible measures mean reducing the resources used by less important features through service adjustments, so that key features keep running normally.

Software assurance

Overview of Moments:

The Moments architecture is divided into two kinds of sites: OC and IDC. An IDC is a data center where data is stored, an OC is an independent equipment room with external network access, and an SOC is a large-scale OC. Each IDC has its own set of interface devices, logic devices, and storage devices to serve users' upload, download, and file storage requests.

The OC provides Internet access and carries users' download traffic. The devices in each OC form a cache pool. On a cache miss in the local OC, the file is pulled from the source in the IDC. All OCs provide the same functions, and users usually download files from the nearest OC. If a single OC fails, users can retry or switch to another OC to download files.

Disaster recovery and retry mechanism:

Module-level disaster recovery (DR) in Moments mainly means automatically removing a single machine when it fails. The master server of each module detects abnormal devices in its IP list through heartbeat checks and other methods, masks the faulty IP addresses, and stops returning them to the front end. Take single-machine removal at the front layer as an example:

If an entire OC or IDC fails, the fault is handled either manually by O&M personnel or through the retry mechanism between modules.
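
The single-machine removal described above can be sketched roughly as follows. This is a minimal illustration, not the actual master implementation: the class name `MasterIpList`, the 3-second timeout, and the explicit `now` parameter (used only to keep the example deterministic) are all assumptions.

```python
import time


class MasterIpList:
    """Hypothetical master-side registry that tracks the last heartbeat per IP."""

    def __init__(self, ips, timeout_s=3.0, now=None):
        self.timeout_s = timeout_s  # assumed heartbeat timeout
        start = time.monotonic() if now is None else now
        self.last_seen = {ip: start for ip in ips}

    def heartbeat(self, ip, now=None):
        """Record that `ip` reported in."""
        self.last_seen[ip] = time.monotonic() if now is None else now

    def healthy_ips(self, now=None):
        """Return only the IPs whose heartbeat is recent; stale ones are masked."""
        now = time.monotonic() if now is None else now
        return sorted(
            ip for ip, seen in self.last_seen.items()
            if now - seen <= self.timeout_s
        )
```

The front end would only ever see the result of `healthy_ips()`, so a machine that stops sending heartbeats silently drops out of the list it is given.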

Retry for Moments downloads:

In both the user download path and the back-to-source path from the OC to the IDC, the system retries by default after two failed attempts. The retry is sent to a remote access point, so that it does not hit the faulty node again. This works because the master of each layer returns at least two groups of IP lists to the front end and guarantees that the two groups belong to different, remote nodes. When a request from the front end fails, it can then be retried against the remote node.
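
As a rough sketch of the remote retry idea, the function below tries the first IP group and falls back to the second, remote group on failure. The function name, the `fetch` callable, the `retry_enabled` flag, and the use of `ConnectionError` are assumptions made for illustration; the text does not describe the real interface.

```python
def download_with_remote_retry(ip_groups, fetch, retry_enabled=True):
    """Try one IP from each group in order, moving to the next (remote) group on failure.

    ip_groups: list of non-empty IP lists, assumed to come from different sites.
    fetch:     hypothetical callable(ip) -> bytes that raises ConnectionError on failure.
    """
    if not ip_groups or any(not group for group in ip_groups):
        raise ValueError("at least one non-empty IP group is required")

    # With retries switched off, only the first (nearest) group is used.
    groups = ip_groups if retry_enabled else ip_groups[:1]
    last_error = None
    for group in groups:
        try:
            return fetch(group[0])
        except ConnectionError as exc:
            last_error = exc  # remember the failure and move to the remote group
    raise last_error
```

Because the second group is on a different node, a retry does not add load to the node that just failed, but it does add one extra request to the system for every failure.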

However, retry is a double-edged sword, because it increases the number of requests. During holidays, when request volume is already very high, retries are more likely to cause problems, so the retry behavior needs to be adjusted in one of two ways:

1. Control retries through the IP lists delivered by the master route. This is used on New Year's Day and the Spring Festival, when request volume multiplies.

2. Have the personnel on duty closely monitor the IDC and manually disable retries if the IDC failure rate exceeds 20%. This is used on the Mid-Autumn and National Day holidays, when growth is low.
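
The 20% rule in the second option is applied manually in the text. The sketch below only illustrates the check an operator or a dashboard might evaluate; the class name `RetrySwitch` and the sliding window of 100 requests are assumptions.

```python
from collections import deque


class RetrySwitch:
    """Hypothetical helper: tracks recent request results and flags when retries should be off."""

    def __init__(self, threshold=0.20, window=100):
        self.threshold = threshold           # 20% failure rate, as stated in the text
        self.results = deque(maxlen=window)  # assumed sliding window of recent results

    def record(self, ok):
        self.results.append(bool(ok))

    def failure_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def retry_allowed(self):
        """Retries stay on only while the failure rate does not exceed the threshold."""
        return self.failure_rate() <= self.threshold
```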

Retry control interface of Front module:

Hardware assurance

Capacity assessment and device expansion:

Before the holidays, O&M personnel and the resource group expand the capacity of equipment rooms and modules based on the service budget, service demand, and actual load. Request growth beyond the budget is reduced or rejected through flexible measures or overload protection.

  • The capacity of an equipment room is determined by the upper limit of switch bandwidth.
  • The capacity of devices at the access layer is evaluated based on the CPU and memory load ratios and on the traffic and packet ratios of the NICs.
  • The capacity of a storage tier is determined by the CPU load ratio, memory load ratio, and disk I/O count.

Spring Festival Moments upload load:

For the Spring Festival, the business side required support for nine-fold growth in uploads and a doubling of downloads; requests beyond those ratios may be rejected. However, even after the budgeted expansion, some modules cannot support this growth. The compress module is the main problem: each additional multiple of growth requires a large number of extra virtual machines, which the budget does not cover, so a flexible strategy is needed.

Flexible strategy

The flexible strategy for Moments has two layers:

The first layer is coarse-grained: upload and download requests are restricted directly, by ratio and by service. Restricted requests are returned to the user as failures, the same as in WeChat C2C. This measure is generally used to restore service quickly when the load exceeds the system's estimated capacity.
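
A ratio-based restriction of this kind can be sketched as below. The class name `RatioLimiter` and the credit-counter approach are assumptions chosen to keep the example deterministic; a production limiter might sample randomly or work per service.

```python
class RatioLimiter:
    """Hypothetical limiter that admits a fixed fraction of requests and rejects the rest."""

    def __init__(self, pass_ratio):
        if not 0.0 <= pass_ratio <= 1.0:
            raise ValueError("pass_ratio must be between 0 and 1")
        self.pass_ratio = pass_ratio
        self._credit = 0.0

    def allow(self):
        """Return True if the request is admitted, False if it should fail fast."""
        self._credit += self.pass_ratio
        if self._credit >= 1.0:
            self._credit -= 1.0
            return True
        return False


# One limiter per service lets each service be restricted by its own ratio.
limiters = {"upload": RatioLimiter(0.5), "download": RatioLimiter(1.0)}
```

A rejected request is simply returned to the user as a failure, which is why this layer is blunt but fast to apply.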

The second layer is flexibility based on service characteristics: at the business level, system load is reduced by lowering the clarity of pictures and videos and by delaying user updates. Service-level flexibility is described in detail below.

Main growth and bottlenecks of the WeChat business:

As the device load estimate above shows, both the access layer and the logic layer support only 5-fold growth within the budget, and the compress module supports only 1-fold growth.
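
To make the gap concrete, the short calculation below compares the nine-fold upload growth requested by the business side with the growth each module supports within budget. The dictionary keys and the function name are illustrative; only the numbers 9, 5, and 1 come from the text.

```python
REQUIRED_UPLOAD_GROWTH = 9  # growth multiple requested for uploads

# Growth multiple each module supports within budget, as stated in the text.
SUPPORTED_GROWTH = {"access layer": 5, "logic layer": 5, "compress": 1}


def shortfalls(required, supported):
    """Return, per module that falls short, how many times the demand exceeds its capacity."""
    return {
        name: required / capacity
        for name, capacity in supported.items()
        if capacity < required
    }


gaps = shortfalls(REQUIRED_UPLOAD_GROWTH, SUPPORTED_GROWTH)
bottleneck = max(gaps, key=gaps.get)  # the module with the largest gap
```

Under these figures every listed module falls short, and the compress module has by far the largest gap, which is why it is the first target of the flexible measures.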

1. Compress flexibility

The compress module compresses the original images sent by clients into the formats and sizes required by specific service scenarios, which saves storage space and bandwidth. As compression technology has advanced, newer formats achieve a higher compression ratio at the same resolution, but they also consume more computing resources.

Therefore, reverting from the current HEVC format to JPEG saves compression resources. The actual CPU load of the compress module can be reduced to about 20% of its previous level, which is equivalent to a five-fold increase in capacity. However, the average image size also goes up, which increases download traffic.

The compromise was to lower the image resolution setting from 70 to 50 while switching uploads back to JPEG. This reduced the average file size and offset the traffic increase caused by switching back to JPEG. Tests showed that users hardly notice the reduction in sharpness, so the user experience is not affected if this measure is enabled briefly during holidays.

2. Small-video bit rate flexibility

The bandwidth of small videos usually exceeds 1TB and grows noticeably during festivals. The traffic reduction method is similar to the one used for pictures: the bit rate of uploaded videos is lowered, and bandwidth is saved through the smaller average file size.

Flexible measure: small-video bit rate 1800 -> 1200; average size 2.1MB -> 1.3MB

Tests showed that the lower bit rate does not affect the user experience. However, because it applies only to newly uploaded videos, the reduction in download bandwidth appears with a considerable delay and takes about 4 hours to reach full effect. This measure therefore has to be activated before the holiday and cannot be used in an emergency.
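
The figures above allow a quick estimate of the saving. The sketch below computes the size reduction from the stated averages and then models the 4-hour ramp as linear. The linear ramp is purely an assumption made for illustration; the text only says that the full effect takes about 4 hours.

```python
OLD_SIZE_MB = 2.1  # average small-video size before the change (from the text)
NEW_SIZE_MB = 1.3  # average small-video size after the change (from the text)
RAMP_HOURS = 4.0   # time to full effect (from the text)


def size_saving(old_mb, new_mb):
    """Fraction of the average file size saved by the lower bit rate."""
    return 1.0 - new_mb / old_mb


def bandwidth_factor(hours_since_switch):
    """Download bandwidth relative to baseline, assuming a linear ramp (an assumption)."""
    progress = min(max(hours_since_switch / RAMP_HOURS, 0.0), 1.0)
    return 1.0 - size_saving(OLD_SIZE_MB, NEW_SIZE_MB) * progress
```

With these averages the saving is roughly 38% per file, but under the assumed ramp only about half of that would be visible two hours after the switch, which matches the advice to enable the measure ahead of the peak.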

Traffic changes when the bit rate reduction takes effect

3. TSSD buffer pools for upload flexibility

The preupload interface machines and the logic modules in the layers below them cannot support a 10-fold increase. Therefore, two additional TSSD buffer pools are set up in the architecture. The buffer pools temporarily store newly uploaded files and support both reads and writes. As shown in the figure above, buffer pool 1 is added to the zone module and buffer pool 2 to the preupload module. The two buffer pools play different roles:

  • If the zone module is overloaded, upload requests that would otherwise be rejected by overload protection do not fail directly. Instead, they are written to buffer pool 1 and then delivered slowly to the back-end modules. Files in buffer pool 1 cannot be downloaded. The main function of buffer pool 1 is therefore to smooth out a burst of upload requests over a short period, rather than to reject them outright.
  • Buffer pool 2 is added to the preupload module, which limits the number of write requests sent to the TFS storage. If upload requests exceed what TFS can accept, preupload writes them to buffer pool 2. When a user downloads a file, the file identifier is checked: if the file is stored in buffer pool 2 instead of TFS, it is fetched from buffer pool 2. Buffer pool 2 can thus stand in for TFS and protect the underlying modules. When buffer pool 2 is taken offline, the files in it need to be written to TFS manually.
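
The different roles of the two pools can be sketched as follows. Everything here is hypothetical: the class names, the per-tick limits, the `tfs:`/`pool2:` identifier prefixes, and the in-memory dictionaries standing in for TFS and TSSD are assumptions made only to show the control flow.

```python
from collections import deque


class Preupload:
    """Buffer pool 2 sketch: a readable overflow store in front of TFS."""

    def __init__(self, tfs_write_limit):
        self.tfs, self.pool2 = {}, {}  # in-memory stand-ins for TFS and the TSSD pool
        self.tfs_write_limit = tfs_write_limit
        self.writes_this_tick = 0

    def new_tick(self):
        self.writes_this_tick = 0      # reset the write budget for the next interval

    def write(self, file_id, data):
        """Write to TFS while under the limit, otherwise to buffer pool 2."""
        if self.writes_this_tick < self.tfs_write_limit:
            self.writes_this_tick += 1
            self.tfs[file_id] = data
            return "tfs:" + file_id
        self.pool2[file_id] = data
        return "pool2:" + file_id

    def read(self, identifier):
        """Use the identifier to decide where to look; fall back to TFS after a flush."""
        store, file_id = identifier.split(":", 1)
        if store == "pool2" and file_id in self.pool2:
            return self.pool2[file_id]
        return self.tfs[file_id]

    def flush_pool2_to_tfs(self):
        """Manual step when buffer pool 2 is taken offline."""
        self.tfs.update(self.pool2)
        self.pool2.clear()


class ZoneBuffer:
    """Buffer pool 1 sketch: absorbs overloaded uploads and drains them slowly."""

    def __init__(self, backend_write, drain_per_tick):
        self.pending = deque()         # files here cannot be downloaded yet
        self.backend_write = backend_write
        self.drain_per_tick = drain_per_tick

    def upload(self, file_id, data, overloaded):
        if overloaded:
            self.pending.append((file_id, data))
        else:
            self.backend_write(file_id, data)

    def tick(self):
        """Deliver a bounded number of buffered uploads to the back end."""
        for _ in range(min(self.drain_per_tick, len(self.pending))):
            self.backend_write(*self.pending.popleft())
```

The key difference is visible in the interfaces: `ZoneBuffer` has no read method at all, while `Preupload.read` serves files from buffer pool 2 as if they were already in TFS.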

4. Timeline flexibility in Moments

The timeline is the timestamp that WeChat uses to detect updates in Moments. The principle of this measure is that when a friend posts, the timestamp update that would notify the user is cached first instead of being delivered to the user's WeChat client. The client then sees no new Moments content and generates no image or video download requests, which directly reduces download traffic.

Moments does not update after timeline flexibility is enabled

But there are a few caveats:

  • It easily causes user complaints, because users clearly notice that there is less content in Moments.
  • If the timeline has been cached for a long time, the cached updates must be released slowly; otherwise, download traffic will surge further.
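
The caching and slow release described above can be sketched as below. The class name `TimelineGate`, the `push` callable, and the batch-based release are assumptions for illustration, not the actual WeChat mechanism.

```python
from collections import deque


class TimelineGate:
    """Hypothetical gate that holds timeline updates while flexibility is enabled."""

    def __init__(self):
        self.flexible = False
        self.cached = deque()

    def on_update(self, user_id, timestamp, push):
        """Cache the update while flexible, otherwise deliver it immediately."""
        if self.flexible:
            self.cached.append((user_id, timestamp))
        else:
            push(user_id, timestamp)

    def release(self, push, batch_size):
        """Deliver at most `batch_size` cached updates and return how many were sent."""
        sent = min(batch_size, len(self.cached))
        for _ in range(sent):
            push(*self.cached.popleft())
        return sent
```

Calling `release` with a small `batch_size` at regular intervals spreads the backlog out, so the clients that receive the delayed updates do not all start downloading at once.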

Flexible steps performed manually during the Spring Festival




This article has been authorized by the author for publication in the Tencent Cloud Technology community.