Brief introduction:
Tencent Cloud TSF is an enterprise-grade distributed application development and hosting platform, a PaaS built by combining external open-source frameworks with platforms honed inside Tencent over many years of practice. This article focuses on the service-hosting PaaS side of TSF and explains, from a technical perspective, how the platform hosts and governs services that are invoked more than one trillion times a day.
The predecessor of the TSF PaaS platform is CAE (Cloud App Engine), whose core architecture was developed with reference to the design of Cloud Foundry. To provide more convenient services for developers, TSF integrates with many of the company's basic services, such as the Tencent gateway TGW, the name service L5, internal authentication services, and message queues, so that users can complete development, release, and hosting in one stop on the TSF platform. It also provides hosted applications with services such as health checks, process monitoring, and aggregated log display. Developers only need to care about their application code, while everything else is provided by the platform, which greatly improves developer efficiency and reduces operations and maintenance costs. The following diagram briefly describes the relationship between the PaaS platform and users in different roles.
Figure: Overall architecture diagram
Introduction to core capabilities:
Currently, Tencent has tens of thousands of applications hosted on the TSF PaaS platform, and these applications receive more than a trillion requests per day. The sections below introduce the problems the TSF PaaS platform solves and its core capabilities.
Elastic scaling capability
In many of the company's Internet businesses, a large number of users often access a set of services at the same moment, for example during flash-sale ("seckill") events or when a game event starts on the hour. Servers then face a flood of requests in a short period of time that quickly eats up CPU and memory and causes them to crash, and retries from front-end users lead to an avalanche. As a result, for very important business activities, operations staff have to prepare a large number of machines in advance (estimating capacity well above actual demand), deploy the programs, and wait for the activity to start. With an automatic scaling mechanism, capacity can be expanded automatically during the activity and released easily when it is no longer needed, which makes the whole operation much simpler. Elastic scaling is one of the basic capabilities of a PaaS platform, and TSF provides flexible rules for the different business scenarios found within the company:
Rule 1: You can configure rules across multiple dimensions for the nodes an application runs on, such as physical load, request volume, latency, and returned error codes over a given period of time. Once an elastic condition is triggered, the platform automatically expands or reduces capacity accordingly.
Rule 2: For applications whose request volume has periodic peaks and troughs, you can configure timed (scheduled) elastic scaling, as shown in the following figure; a minimal rule sketch follows the figure.
Figure: Timed capacity expansion and reduction rule
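The exact rule format is internal to TSF and not shown in this article. Purely as an illustration of the two rule types above, the sketch below models a metric-threshold rule and a timed rule as plain Java value classes; all class names, fields, and values here are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical model of the two scaling-rule types described above; not the real TSF schema. */
public class ScalingRules {

    /** Rule 1: scale when a metric stays above a threshold for a sustained period. */
    record MetricRule(String metric,        // e.g. "cpu", "latency", "error_rate"
                      double threshold,     // trigger value, e.g. 80.0 (% CPU)
                      Duration window,      // how long the metric must stay above the threshold
                      int scaleOutBy) {}    // instances to add when triggered

    /** Rule 2: scale on a fixed schedule for predictable daily peaks. */
    record TimedRule(LocalTime scaleOutAt,  // e.g. 19:30, before the evening peak
                     LocalTime scaleInAt,   // e.g. 23:30, after the peak
                     int targetInstances) {}

    public static void main(String[] args) {
        MetricRule cpuRule = new MetricRule("cpu", 80.0, Duration.ofMinutes(3), 2);
        TimedRule eveningPeak = new TimedRule(LocalTime.of(19, 30), LocalTime.of(23, 30), 20);
        System.out.println(cpuRule);
        System.out.println(eveningPeak);
    }
}
```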
Whether scaling is metric-driven or scheduled, the underlying implementation is similar. The overall scheduling architecture is as follows:
Figure: Schematic diagram of the elastic expansion module
Configuration system: users can set elastic scaling trigger rules on the console based on their service's characteristics. The rules cover the following dimensions (a minimal sketch of how such a rule might be evaluated appears after this list):
Sampling interval: 1-60 s, freely configurable; the platform can collect data at second-level granularity.
Consecutive high-load count: the first condition for automatic scale-out is that at least one instance in a service group reports high-load data several times in a row; the required count is specified in the rule.
Cooldown time: during the cooldown period the platform ignores high-load reports from instances, to avoid triggering another expansion before the previous one has completed.
Performance indicators: CPU, memory, disk, network interface traffic, number of TCP connections, number of requests, error ratio, and latency.
Notification center: responsible for pushing high-load information from modules and scaling decision information to external systems.
Workflow system: after receiving information from the notification center, it triggers the automatic scaling workflow, including resource provisioning, program package installation, and configuration delivery.
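TSF's actual trigger logic is not published here; the following is a minimal sketch, under the assumptions described above (consecutive high-load samples plus a cooldown window), of how such a scale-out condition might be evaluated. The class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative scale-out trigger: N consecutive high-load samples, then a cooldown. Not TSF code. */
public class ScaleOutTrigger {
    private final double threshold;          // e.g. 80.0 (% CPU)
    private final int requiredConsecutive;   // consecutive high-load samples needed
    private final Duration cooldown;         // ignore samples for this long after triggering

    private int consecutiveHighs = 0;
    private Instant lastTriggered = Instant.EPOCH;

    public ScaleOutTrigger(double threshold, int requiredConsecutive, Duration cooldown) {
        this.threshold = threshold;
        this.requiredConsecutive = requiredConsecutive;
        this.cooldown = cooldown;
    }

    /** Feed one sample (e.g. every 1-60 s); returns true if a scale-out should be triggered now. */
    public boolean onSample(double value, Instant now) {
        // During the cooldown window, ignore high-load reports so we do not
        // trigger again before the previous expansion has finished.
        if (now.isBefore(lastTriggered.plus(cooldown))) {
            return false;
        }
        consecutiveHighs = value >= threshold ? consecutiveHighs + 1 : 0;
        if (consecutiveHighs >= requiredConsecutive) {
            consecutiveHighs = 0;
            lastTriggered = now;
            return true;   // the notification center / workflow system would take over from here
        }
        return false;
    }
}
```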
The elastic scaling capability was put to full use during the Spring Festival when WeChat red envelopes were born. In 2014, WeChat red envelopes became popular across the country before the Spring Festival, with huge numbers of users pouring in every second. TSF scaled out automatically according to pre-configured rules and successfully supported the sudden massive traffic, winning precious time for the early development of WeChat red envelopes. The following figure shows the automatic scaling task list: the same service automatically triggered many scale-out operations within a short period of time, greatly reducing the pressure on operations and truly achieving unattended operation.
Figure: Elastic scaling task query list
Grayscale release and traffic control capability
The TSF platform manages the entire application life cycle, from code development through CI/CD to version upgrade and rollback. When launching new features, a grayscale process is usually required to control the proportion of traffic that can reach the new version; if problems are found, the service rolls back to the stable version in time. Grayscale release is also not necessarily a short-lived process. A major framework or system upgrade may take a long time: old and new versions of a service may coexist for several months, or two versions may need to be iterated separately. From a product perspective the picture can be even more fluid; there may be five or six schemes online collecting data at once, and whenever new ideas come up, small versions are rolled out to observe their effect. In this situation there will never be a single unified version online; instead, grayscale becomes the norm for responding to changing needs and challenges.
The TSF grayscale release system covers the following two aspects:
Accurate traffic distribution control: from the perspective of operations risk control, the affected traffic must be limited to a precise range, so that it is known before going live which users could be affected if problems occur, rather than finding out afterwards who was affected. A common scenario is to make a new version accessible only to employees within the company, and then roll it out gradually city by city and province by province. The TSF grayscale release system can operate on application instances in groups: the new version is released in grayscale by group, and part of the traffic is directed to the grayscale groups to observe whether it behaves as expected. The traffic that is directed there can be limited to internal employees, or divided along other dimensions (a rough routing sketch follows this list).
Monitoring system support: accurately distributing the traffic is only the first step; it is more important to obtain key metrics for each version. For operations these might be system-level metrics such as error rate, throughput, latency, and CPU and memory consumption; for products they may be changes in business indicators such as CTR, PV, and UV. All of these can be presented graphically in the PaaS console, making it easy to decide on the next grayscale step.
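The real routing rules are configured on the TSF console rather than written by hand. As a rough, assumption-based sketch of the kind of dimension-based traffic splitting described above (internal employees first, then a fixed percentage of ordinary users), the fragment below shows one way such a decision could look; all names are hypothetical.

```java
import java.util.Set;

/** Illustrative grayscale router: send employees, or a fixed share of users, to the new version. */
public class GrayscaleRouter {
    private final Set<String> employeeIds;   // e.g. loaded from an internal staff directory
    private final int grayscalePercent;      // 0-100, share of ordinary users sent to grayscale

    public GrayscaleRouter(Set<String> employeeIds, int grayscalePercent) {
        this.employeeIds = employeeIds;
        this.grayscalePercent = grayscalePercent;
    }

    /** Decide which instance group should serve this user. */
    public String route(String userId) {
        if (employeeIds.contains(userId)) {
            return "grayscale-group";                   // internal employees always see the new version
        }
        // Stable hash so the same user consistently lands in the same group.
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < grayscalePercent ? "grayscale-group" : "stable-group";
    }

    public static void main(String[] args) {
        GrayscaleRouter router = new GrayscaleRouter(Set.of("staff-001"), 5);
        System.out.println(router.route("staff-001"));   // grayscale-group
        System.out.println(router.route("user-42"));     // stable-group or grayscale-group (~5%)
    }
}
```

Pinning each user to a group with a stable hash keeps the metrics observed for each version clean, since a given user does not bounce between old and new versions across requests.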
Complete logging and monitoring system
A comprehensive statistics, monitoring, and logging system is one of the most fundamental capabilities of a PaaS platform. The monitoring and logging system of TSF is built on an EFK (Elasticsearch + Filebeat + Kibana) stack at the underlying level. The monitored data falls mainly into the following categories:
A. Access volume: number of requests, number of successes, number of failures, success rate, and year-over-year and period-over-period fluctuation percentages
B. Response packet size and response delay
C. Machine load: CPU, memory, disk reads and writes, number of TCP connections, inbound and outbound traffic, and inbound and outbound packets
D. Process monitoring
The first two categories are collected at minute-level granularity, while machine load and process monitoring are collected at second-level granularity. Each service can customize this as required.
The metrics above support text-based alarms delivered via WeChat, SMS, and email, as well as curve (graphical) alarms as shown in the following figure. You can set alarm trigger conditions as required (a simple threshold sketch follows the figure below).
In addition, the TSF backend performs big-data analysis and comparison on historically reported data together with the alarm policies, to support intelligent operations and maintenance.
Figure: Schematic diagram of statistical monitoring
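The alarm conditions themselves are configured in the console. As a hedged sketch of what a simple trigger condition over the per-minute request statistics listed above might look like, the fragment below evaluates an error-rate threshold with a minimum sample size; the thresholds and names are illustrative only, not the TSF alarm engine.

```java
/** Illustrative alarm condition over per-minute request statistics; not the TSF alarm engine. */
public class ErrorRateAlarm {
    private final double maxErrorRatio;   // e.g. 0.01 == alert above 1% failures
    private final long minRequests;       // ignore tiny samples to avoid noisy alarms

    public ErrorRateAlarm(double maxErrorRatio, long minRequests) {
        this.maxErrorRatio = maxErrorRatio;
        this.minRequests = minRequests;
    }

    /** Returns true if an alarm (WeChat / SMS / email) should be sent for this sample. */
    public boolean shouldAlert(long requests, long failures) {
        if (requests < minRequests) {
            return false;
        }
        return (double) failures / requests > maxErrorRatio;
    }

    public static void main(String[] args) {
        ErrorRateAlarm alarm = new ErrorRateAlarm(0.01, 1000);
        System.out.println(alarm.shouldAlert(50_000, 800));  // true, 1.6% failures
        System.out.println(alarm.shouldAlert(50_000, 200));  // false, 0.4% failures
    }
}
```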
Distributed job capability
TSF PaaS's distributed job system is built on Quartz and fully inherits its powerful scheduling capabilities, diverse scheduling modes, and distributed clustering capability (a plain Quartz usage example follows the figure below).
The TSF distributed job control platform can not only perform all O&M operations, but also automatically select nodes with low load to which computing tasks are dispatched, making full use of device resources. The execution details of all jobs can be queried on the console. As the following figure shows, the distributed job platform currently executes hundreds of thousands of tasks a day.
Figure: Schematic diagram of the distributed job platform
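TSF adds its own control plane (node selection, O&M operations, execution records) on top of Quartz. The fragment below is not TSF code, just a plain example of the standard Quartz API it builds on, defining a job that fires on a cron schedule.

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

/** Plain Quartz usage example (not TSF's control-plane code): run a job every five minutes. */
public class ReportJobExample {

    public static class ReportJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // Business logic for one task execution goes here.
            System.out.println("Generating report at " + context.getFireTime());
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(ReportJob.class)
                .withIdentity("reportJob", "demo-group")
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("reportTrigger", "demo-group")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0/5 * * * ?"))  // every 5 minutes
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```

When Quartz runs in clustered mode against a shared job store, the scheduler instances coordinate so that each trigger fires on only one node; this is the clustering capability the job platform inherits.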
Conclusion:
This article has introduced the core service life-cycle management capabilities of the TSF platform, based on several core features that Tencent's internal services hosted on TSF focus on in their daily operation. If you would like to know more about the implementation details, you are welcome to get in touch for further discussion.