1. Background
As the company's business continues to expand, user traffic keeps increasing, and the scale and complexity of the R&D system grow with it. The stability of online services matters more than ever, and service performance and capacity problems are becoming increasingly visible.
We therefore need an effective load testing system that provides safe, efficient, and realistic full-link load testing online, safeguarding our online services.
The industry has published many articles on building full-link load testing, but very few go into the concrete technical implementation. This article describes in detail how we designed and implemented a full-link load testing system, from design to deployment. I hope it offers some reference and inspiration to peers in the industry, from the perspective of engineering practice.
2. Solutions
2.1 Industry Practice
Full-link load testing has been widely practiced in the industry: Amazon, Alibaba's PTS[1][2], Meituan's Quake[3][4], JD's ForceBOT[5], AutoNavi's TestPG[6], and others all provide rich practical experience and many excellent technical solutions. We studied the full-link load testing experience of the major Internet companies extensively and, based on ByteDance's business requirements, designed and built our own full-link load testing system, Rhino.
2.2 Architecture
As a company-level full-link load testing platform, Rhino aims to provide safe, reliable, realistic, and efficient load testing, both single-service and full-link, for every business in the company, helping teams complete performance testing efficiently and conveniently and accurately assess the performance and capacity risks of their online services.
From the beginning of Rhino's design, we therefore set the following goals:
- Safety: all load tests run online, so in principle every test can affect online users. The platform guarantees safety from two aspects: service status and test data.
- Efficiency: reduce the cost of scripting, data construction, and monitoring, and automate every phase of the process.
- Accuracy: accurate load control, accurate link monitoring during the test, and accurate test reports and performance & capacity data.
- Coverage: support the load testing requirements of the company's different business lines, such as search, advertising, e-commerce, education, and gaming.
Rhino is a distributed, horizontally scalable full-link load testing system that can simulate the real business scenarios of a huge number of users and run comprehensive performance tests online. It is mainly divided into the control center (Rhino Master), the load test link services, the monitoring system, and the load generation engine, as shown in the figure. Each module consists of multiple microservices; each box in the diagram represents one or more microservices.
3. Core functions
The core of a full-link load testing platform covers data construction, test isolation, link governance, task scheduling, circuit breaking, the load generation engine, monitoring, and so on. Below we take a closer look at how these are designed and implemented in Rhino.
3.1 Data Construction
Data construction is the most important and most complicated part of load testing; the modeling of test data directly affects the accuracy of the results.
- For performance defect scanning, performance tuning, and services not yet live, we recommend constructing fake data to exercise the specified paths.
- For online capacity planning, performance verification, and performance diffs, we recommend replaying real online traffic to make the results more realistic.
- If user accounts and login state are involved, we recommend using accounts dedicated to load testing, to avoid affecting real online users.
Basic data construction
To construct specific fake test data efficiently, the Rhino platform provides several data construction methods:
- CSV file: data is split by column, with the field names taken from the first row of the CSV file, and read row by row in a loop. If a load test task is split into multiple jobs, the data file is also split, so that jobs do not receive duplicate data.
- Increment: a numeric variable incremented by 1 on each request, cycling from the minimum value to the maximum value.
- Random: a numeric variable generated randomly on each request.
- Constant: a fixed value that can be customized.
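The four construction methods above can be sketched as simple value generators. This is an illustrative sketch, not Rhino's actual code; the type and method names (`IncrementVar`, `Next`, etc.) are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

// IncrementVar cycles a numeric variable from min to max, +1 per request.
type IncrementVar struct {
	min, max, cur int64
}

func NewIncrementVar(min, max int64) *IncrementVar {
	return &IncrementVar{min: min, max: max, cur: min - 1}
}

func (v *IncrementVar) Next() string {
	v.cur++
	if v.cur > v.max {
		v.cur = v.min // wrap around to the minimum
	}
	return strconv.FormatInt(v.cur, 10)
}

// RandomVar returns a random number in [min, max] on each request.
type RandomVar struct{ min, max int64 }

func (v RandomVar) Next() string {
	return strconv.FormatInt(v.min+rand.Int63n(v.max-v.min+1), 10)
}

// ConstantVar always returns the same user-defined value.
type ConstantVar struct{ val string }

func (v ConstantVar) Next() string { return v.val }

func main() {
	inc := NewIncrementVar(1, 3)
	for i := 0; i < 5; i++ {
		fmt.Print(inc.Next(), " ") // 1 2 3 1 2
	}
	fmt.Println()
}
```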
Load test accounts
Some test requests need to log in and maintain a session, and many involve user account data such as UserID and DeviceID. Constructing user accounts has always been a hard problem in load testing. Through the user center, the Rhino platform set up a dedicated load test account service, which solves the problems of login state and test accounts during a test. The specific flow and user interface are shown below.
3.2 Load Test Isolation
Load test isolation needs to solve two problems: isolating the test traffic and isolating the test data.
Traffic isolation is mainly solved by building a separate load test environment, such as an offline environment or a swimlane/Set, that completely separates test traffic from online traffic. The advantage is that test traffic is fully isolated and does not affect online users. The disadvantages are the high cost of machine resources and maintenance, and that the test results must be converted to estimate online capacity, so their accuracy is questionable. At present, load tests in the company run on the online clusters, and online swimlanes are under construction.
Data isolation is mainly achieved by coloring the test traffic, so that online services can distinguish test traffic from normal traffic and handle it specially, thereby isolating the data. The overall isolation framework of the Rhino platform is shown below.
Stress markers
Tagging requests with a stress marker is the most common way to color test traffic.
- For RPC protocols, a Key:Value field is added to the request header as the stress marker.
- For HTTP and other protocols, the stress marker (a key-value pair) is automatically injected into the request header.
- The marker has the form Stress_Tag:Value, where the key Stress_Tag is fixed but each load test task carries a unique value. This is mainly used to avoid conflicts between the data of different test tasks and to locate performance problems.
Stress marker pass-through
All of the company's infrastructure components, storage components, and RPC frameworks already support pass-through of the stress marker. The principle is to store the marker's KV pair in the Context and carry the Context in all downstream requests; a downstream service can then identify test traffic from the marker in the Context. In real business code, the change is very simple: just pass the Context through.
- Golang services: write the stress marker into the Context.
- Python services: store it in the thread context using threading.local().
- Java services: store it in the thread context using ThreadLocal.
Load test switches
To address the safety of online load testing, we also introduced the load test switch component.
- Each service and each cluster has a load test switch. Test traffic can flow into a service only when its switch is on; otherwise the request is rejected directly by the underlying microservice framework, without the business layer noticing.
- Each IDC region has a global load test master switch. Test traffic is allowed to flow within the IDC only when the global switch is on.
- If test traffic causes a problem online, then besides stopping the traffic at the source, turning off the switch on the target service immediately blocks the test traffic.
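A rough sketch of how the underlying framework might gate stress traffic on the per-service and global switches. All names here are hypothetical; the real checks live inside the microservice framework, invisible to the business layer.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// SwitchRegistry holds per-service switches plus the global per-IDC switch.
type SwitchRegistry struct {
	mu      sync.RWMutex
	global  bool
	service map[string]bool
}

func NewSwitchRegistry() *SwitchRegistry {
	return &SwitchRegistry{service: make(map[string]bool)}
}

func (r *SwitchRegistry) SetGlobal(on bool) { r.mu.Lock(); r.global = on; r.mu.Unlock() }

func (r *SwitchRegistry) SetService(name string, on bool) {
	r.mu.Lock()
	r.service[name] = on
	r.mu.Unlock()
}

var ErrStressRejected = errors.New("stress traffic rejected by switch")

// Admit is what a framework middleware would call before dispatching a
// request: stress-tagged traffic passes only if both switches are on.
func (r *SwitchRegistry) Admit(serviceName string, isStress bool) error {
	if !isStress {
		return nil // normal traffic always passes
	}
	r.mu.RLock()
	defer r.mu.RUnlock()
	if !r.global || !r.service[serviceName] {
		return ErrStressRejected
	}
	return nil
}

func main() {
	reg := NewSwitchRegistry()
	reg.SetGlobal(true)
	reg.SetService("order", true)
	fmt.Println(reg.Admit("order", true))   // <nil>
	fmt.Println(reg.Admit("payment", true)) // stress traffic rejected by switch
}
```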
Test data isolation
The most complex problem in online load testing is that the test link involves write operations: how do we avoid polluting online data while keeping the test request path identical to the online path? The industry has many solutions, including shadow tables, shadow databases, and data offsets, as shown in the figure [7].
The Rhino platform uses different solutions for different storage systems:
- MySQL and MongoDB: shadow tables. The SDK determines whether the traffic belongs to a load test; if so, the table name is mapped to a shadow table name according to the configuration. Two configuration policies are available: read and write the shadow table, or read and write the online table.
- Redis: the Redis key is prefixed with the stress tag value. For example, if Stress_Tag=Valuex, reads and writes go to the key Valuex_Key. This also solves data conflicts between multiple concurrent test tasks. After the test completes, the data can be cleaned up simply by deleting or expiring all keys with prefix Valuex.
- MQ: for message queues, the Rhino platform has two policies. One is to discard the message directly and, if the message queue itself needs testing, load test it separately; the other is to pass the stress marker through in the message header, with the consumer handling it specially according to the marker and business requirements. Discarding is the default; services can configure the policy as required.
- Other storage systems, such as ES and ClickHouse, have dedicated load test clusters. During a test, test requests are sent to the designated test cluster.
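The shadow-table and Redis-prefix rules can be illustrated with a small sketch. The function names and the `_shadow` suffix are assumptions for the example; the real SDK reads the mappings from configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// ShadowTable maps a table name to its shadow table for stress traffic.
func ShadowTable(table string, isStress bool) string {
	if isStress {
		return table + "_shadow"
	}
	return table
}

// StressRedisKey prefixes the Redis key with the task's stress tag value, so
// concurrent tasks do not collide and cleanup is a prefix delete/expire.
func StressRedisKey(stressValue, key string) string {
	return stressValue + "_" + key
}

// CleanupMatch reports whether a key belongs to a finished task and can be
// expired, judging by its prefix.
func CleanupMatch(stressValue, key string) bool {
	return strings.HasPrefix(key, stressValue+"_")
}

func main() {
	fmt.Println(ShadowTable("orders", true))                // orders_shadow
	fmt.Println(StressRedisKey("Valuex", "user:1001"))      // Valuex_user:1001
	fmt.Println(CleanupMatch("Valuex", "Valuex_user:1001")) // true
}
```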
Retrofitting services for load testing
Before a service can be load tested, its readiness must be checked. Services that do not meet the requirements (i.e., test data isolation) must be retrofitted.
- Readiness check: for storage services, all read and write operations should be rejected while the load test switch is off. If they are not, the Context is not being passed through in the storage calls, and the code must be modified.
- Retrofitting: retrofitting services is a very important and difficult part of promoting full-link load testing. For services already live, the changes may introduce new bugs, so they are often hard to push through. To address this, the Rhino platform offers several aids:
A. Minimize code changes and provide complete guidance manuals and code examples, reducing RD workload and the chance of coding errors.
B. Provide simple, convenient online and offline HTTP & RPC request debugging tools to verify the code changes.
C. For new projects, include load test readiness in the development specification from the start, to reduce later code changes.
3.3 Link Governance
Link tracing
The request call chain is very important for online load testing:
- It provides a clear map of the test traffic and complete link monitoring.
- It completes the sorting of service dependencies, checking whether the services and middle platforms the test depends on are ready for load testing or need retrofitting.
- It enables switch management along the link and keeps upstream and downstream services informed of the test.
The Rhino platform retrieves call chains through the company's streaming log system. A LogID is passed along when a service is called and when it calls downstream. The RPC framework prints call-chain logs (RPC logs on the caller side, Access logs on the callee side), all of which contain the LogID. By stringing together all the service logs that share one LogID, the whole call chain of a request is reconstructed.
The Rhino platform builds on the link tracing capability provided by the streaming log system and optimizes it further for load testing needs:
- Automatic tracing: because of the company's microservice architecture, the call link behind each request is extremely complex and cannot be maintained by hand. The user only needs to provide the LogID of a request, and the Rhino platform can quickly work out the service nodes the request passes through, as shown below:
- Real-time tracing: the call chain of the same request keeps changing as online services change and as services go online and offline. The Rhino platform generally recommends tracing with a LogID less than an hour old.
- Multi-link merging: for the same interface, different parameters produce different call chains. The Rhino platform automatically merges the tracing results of multiple LogIDs to complete the call chain and ensure the accuracy and integrity of the result.
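The idea of stitching call chains by LogID and merging several of them can be sketched roughly as follows. The `CallLog` shape and the `MergeChains` helper are invented for this example; the real system parses RPC/Access logs from the streaming log system.

```go
package main

import "fmt"

// CallLog is one call-chain log line; From->To is one observed call edge.
type CallLog struct {
	LogID string
	From  string
	To    string
}

// MergeChains merges the edges observed under several LogIDs into one
// de-duplicated adjacency list, approximating multi-LogID link merging.
func MergeChains(logs []CallLog, logIDs ...string) map[string][]string {
	wanted := make(map[string]bool)
	for _, id := range logIDs {
		wanted[id] = true
	}
	graph := make(map[string][]string)
	seen := make(map[string]bool)
	for _, l := range logs {
		if !wanted[l.LogID] {
			continue
		}
		edge := l.From + "->" + l.To
		if seen[edge] {
			continue // same edge observed under another LogID
		}
		seen[edge] = true
		graph[l.From] = append(graph[l.From], l.To)
	}
	return graph
}

func main() {
	logs := []CallLog{
		{"id1", "gateway", "api"},
		{"id1", "api", "user"},
		{"id2", "gateway", "api"},
		{"id2", "api", "feed"}, // branch only reached with other parameters
	}
	g := MergeChains(logs, "id1", "id2")
	fmt.Println(g["api"]) // [user feed]
}
```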
Load test notification
While the Rhino platform has many safety measures for load testing, for large tests it is important to keep information flowing smoothly. The platform therefore provides several notification mechanisms:
- One-click group chat: after the link is traced, the owners of all upstream and downstream services in the link can be pulled into one group chat before the test, to synchronize test information.
- Test announcements: when each load test starts executing, information such as the test QPS and duration is pushed to the load test notification group.
- Test events: when a test starts, the Rhino platform also sends a load test event to the event queue of the target service, so that stability problems can be quickly attributed to (or ruled out as) the load test, reducing interference with RDs' troubleshooting of online problems.
Load test switch management
Before a test, the load test switches along the whole link must be turned on; otherwise the test traffic is rejected by the services and the test fails.
- One-click enabling: before a test, the Rhino platform can turn on the switches of all nodes in the link with one click.
- Notification: when a switch is turned on, the Rhino platform automatically pushes the relevant information to the service owner, so the owner knows about the test and that upstream test traffic will pass through the service.
- Silent shutdown: when a switch expires, Rhino automatically turns it off silently to keep online services safe.
Service mocking
Services in the call chain that cannot be load tested (sensitive services), and third-party services, must be mocked to preserve the integrity of the request. Common mocking schemes in the industry include:
- Modifying the business code to replace the service call with a no-op. Advantages: low implementation cost. Disadvantages: fixed return values, heavy intrusion into code and business, hard to push through; if the mock point is far downstream, beyond the businesses the department covers, pushing the change is very troublesome.
- A generic mock service. A generic MockServer configures different mock rules for different users, applies the corresponding response delay, and returns the corresponding response data. Advantages: no code intrusion, transparent to the business. Disadvantage: higher implementation cost.
Since the whole company uses a microservice architecture and a single load test involves a long link, a fast, business-transparent mocking method became the preferred choice. The Rhino platform implements efficient, business-transparent service mocks through the company's Service Mesh and ByteMock systems.
Before a test executes, the Rhino platform registers the coloring and forwarding rules with the Service Mesh and the mock rules with the Mock Server. The service mock is then performed by injecting the mock coloring marker into the test traffic:
- Colored traffic forwarding based on Service Mesh. First, the forwarding color marker is injected into the test traffic and the corresponding forwarding rule is registered in the Service Mesh. When the Service Mesh detects colored traffic, it forwards it to the designated Mock Server, as shown.
- Request rule matching on the Mock Server. The mock rule is first registered on the Mock Server, together with the matching response and response latency. When the Mock Server receives a request, it responds according to the rules, as shown in the figure.
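A toy version of the Mock Server's rule matching might look like the following; the `MockRule`/`MockServer` types and the match-on-(service, method) rule shape are invented for illustration, not ByteMock's actual rule model.

```go
package main

import (
	"fmt"
	"time"
)

// MockRule matches on (service, method) and describes the canned response.
type MockRule struct {
	Service, Method string
	Delay           time.Duration
	Response        string
}

type MockServer struct{ rules []MockRule }

func (s *MockServer) Register(r MockRule) { s.rules = append(s.rules, r) }

// Serve looks up the first matching rule, sleeps for the configured delay to
// imitate the real downstream latency, and returns the canned response.
func (s *MockServer) Serve(service, method string) string {
	for _, r := range s.rules {
		if r.Service == service && r.Method == method {
			time.Sleep(r.Delay)
			return r.Response
		}
	}
	return "no mock rule matched"
}

func main() {
	srv := &MockServer{}
	srv.Register(MockRule{
		Service: "payment", Method: "Charge",
		Delay: 10 * time.Millisecond, Response: `{"code":0}`,
	})
	fmt.Println(srv.Serve("payment", "Charge")) // {"code":0}
	fmt.Println(srv.Serve("payment", "Refund")) // no mock rule matched
}
```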
3.4 Load Generation
Minimum scheduling unit
On the Rhino platform, the load test Agent is the minimum scheduling unit. A load test task is usually divided into multiple sub-jobs, which are sent to multiple Agents to execute.
- Minimal container deployment to reduce resource waste. Load testing is very expensive in machine resources, with CPU and memory usage usually above 80% while a test runs, but below 5% when no test is running. Occupying a lot of resources long-term would waste machines badly, so all Agents are deployed in containers and the resource specification of each container is kept as small as possible. This meets daily load testing needs without occupying too many machines.
- Exclusive Agents: only one Agent process runs in a container, and each Agent is occupied by at most one test task at a time. This avoids interference and resource competition between tasks and processes, improving test stability.
- Dynamic scaling to support massive QPS: during peak load test periods, the Rhino platform temporarily requests machine resources to scale out quickly and sustain massive QPS. After the test completes, the machines are released immediately to reduce waste.
During the 2020 Spring Festival, Rhino temporarily scaled up to 4,000+ instances, supporting a single load test of over 30 million QPS, while its everyday deployment of only 100+ instances is enough for daily needs.
Intelligent load regulation
- Dynamic Agent allocation: during a test, the CPU/memory usage of an Agent often climbs too high (>90%), causing the load generation to fall short of the target QPS, or inflating the measured latency and making the results inaccurate. During load generation, the Rhino platform monitors the CPU/memory usage of each Agent in real time; when usage exceeds the threshold (>90%), it dynamically allocates additional Agents to reduce the load on each one and keep the test stable.
- Intelligent load adjustment: a test usually requires continuously adjusting the QPS to reach the performance target, which takes a lot of time and energy. The Rhino platform can adjust the QPS automatically according to the performance targets set for the task, and stop the test via the automatic circuit breaker once the target is reached.
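A much-simplified sketch of the intelligent adjustment loop: raise the QPS step by step until a performance target (here, a p99 latency threshold) is breached, then stop. Everything in this sketch, including the `probe` callback, is a stand-in for Rhino's real scheduler and monitoring.

```go
package main

import "fmt"

// AutoAdjustQPS raises the offered QPS step by step until the measured p99
// exceeds the target, then reports the last healthy QPS. probe stands in for
// one round of load generation plus monitoring readout.
func AutoAdjustQPS(start, step, maxQPS int, targetP99Ms float64,
	probe func(qps int) (p99Ms float64)) int {
	best := 0
	for qps := start; qps <= maxQPS; qps += step {
		if probe(qps) > targetP99Ms {
			break // circuit-break: performance target reached
		}
		best = qps
	}
	return best
}

func main() {
	// Fake service whose p99 grows linearly with load.
	probe := func(qps int) float64 { return float64(qps) / 10 }
	fmt.Println(AutoAdjustQPS(100, 100, 2000, 80, probe)) // 800
}
```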
Load test link simulation
By default, the Rhino platform divides full-link load testing into public-network and internal-network testing. Public-network tests mainly exercise the IDC's network bandwidth and latency and the IDC gateway's new-connection and forwarding capabilities; internal tests mainly exercise the performance and capacity of the target service and target cluster.
- By default, internal test traffic stays within the IDC, to reduce interference from network latency.
- For public-network tests, the Rhino platform deploys Agent nodes on the company's CDN nodes, using the CDN nodes' spare computing capacity to build public-network load generation capacity.
Multiple server rooms, same-city and cross-region
The Rhino platform deploys Agent clusters in every IDC. By default, the nearest Agent is selected for each tested service, reducing the interference of network latency, making the results more accurate, and making problems easier to locate.
Edge compute node Agents
Besides the multi-IDC deployment, the Rhino platform also deploys Agents on edge compute nodes to simulate traffic from different carriers in different regions, making the traffic sources and distribution more realistic. On the Rhino platform, you can choose to generate load from different carriers in different regions of the country.
3.5 Load Test Circuit Breaking
To address the risks of online load testing, the Rhino platform provides two types of circuit breaker to minimize the impact on online services in an emergency.
Alarm-based circuit breaking
Each load test task can be associated with the alarm rules of any service in the chain. During the test, the Rhino platform actively listens to the alarm service; when any service in the call chain raises an alarm, the test is stopped immediately. Alarms that are not associated are also recorded by the platform, to help locate test problems.
Metric-based circuit breaking
Users can define monitoring metrics and thresholds; when a threshold is reached, the system automatically stops the test. CPU, memory, upstream stability, error logs, and other custom metrics are currently supported.
Besides the circuit breakers provided by the Rhino platform itself, the company's service governance architecture provides additional protections, such as the load test switch, which cuts off test traffic with one click, and overload protection, which automatically discards test traffic when a service is overloaded.
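A minimal sketch of metric-threshold fusing; the types and names are hypothetical, and the real platform reads the metric snapshot from the monitoring system rather than taking it as a map.

```go
package main

import "fmt"

// Threshold is one user-defined fuse rule: stop the test when the named
// metric reaches the limit.
type Threshold struct {
	Metric string
	Limit  float64
}

// ShouldFuse checks the latest metric snapshot against all thresholds and
// returns the first rule that fires, if any.
func ShouldFuse(snapshot map[string]float64, rules []Threshold) (Threshold, bool) {
	for _, r := range rules {
		if v, ok := snapshot[r.Metric]; ok && v >= r.Limit {
			return r, true
		}
	}
	return Threshold{}, false
}

func main() {
	rules := []Threshold{
		{Metric: "cpu_percent", Limit: 90},
		{Metric: "error_rate", Limit: 0.01},
	}
	snap := map[string]float64{"cpu_percent": 71, "error_rate": 0.02}
	if r, fuse := ShouldFuse(snap, rules); fuse {
		fmt.Println("stop load test, rule fired:", r.Metric) // error_rate
	}
}
```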
3.6 Task Model
HTTP tasks
For the HTTP protocol, we took Postman as a reference and made everything a visual operation, so that anyone can get started; this greatly lowers the threshold and cost of load testing.
RPC tasks
For RPC tasks, Rhino automatically parses the IDL and converts it to JSON, making parameterization easy for users.
Custom tasks: Go Plugin
For non-HTTP/RPC protocols and load test tasks with complex logic, the Rhino platform also provides a complete solution: the Go Plugin.
Go's plugin mechanism dynamically loads shared objects compiled from Go code. By defining a set of conventions or interfaces between the main program and the shared library, the host program can dynamically load the shared library after compilation, forming a hot-pluggable plugin system. The developers of the main program and of the shared library do not need to share code; as long as the agreed conventions stay unchanged, the main program does not need to be recompiled when the shared library changes.
The user only needs to implement a piece of load generation logic that follows the specification. The Rhino platform automatically pulls the code, triggers compilation, and distributes the compiled plugin .so file to multiple Agents; each Agent dynamically loads the .so file and runs it concurrently to generate the load. Rhino also maintains a library of sample code for common Go Plugin load scenarios, so beginners can get going by simply modifying the business logic. This solves load testing for unusual protocols and complex scenarios.
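The convention between Agent and plugin can be illustrated with an interface sketch. The `Scenario` interface and its method names are invented for this example; in production the Agent would load the compiled `.so` with `plugin.Open` and look up an agreed symbol, while here a stub implementation stands in so the sketch runs as a single file.

```go
package main

import "fmt"

// Scenario is the agreed convention every load plugin must satisfy. In
// production the plugin is built with `go build -buildmode=plugin` and the
// Agent loads it via plugin.Open / Lookup; the interface is all they share.
type Scenario interface {
	Setup() error // build test data, dial connections, etc.
	Fire() error  // send one request; the engine calls this concurrently
	Teardown()    // release resources when the task stops
}

// echoScenario is a trivial stub implementation of the convention.
type echoScenario struct{ fired int }

func (s *echoScenario) Setup() error { return nil }
func (s *echoScenario) Fire() error  { s.fired++; return nil }
func (s *echoScenario) Teardown()    {}

// runScenario is the engine side: it only knows the interface, never the
// plugin's code, so the two can be compiled and shipped independently.
func runScenario(s Scenario, requests int) (int, error) {
	if err := s.Setup(); err != nil {
		return 0, err
	}
	defer s.Teardown()
	for i := 0; i < requests; i++ {
		if err := s.Fire(); err != nil {
			return i, err
		}
	}
	return requests, nil
}

func main() {
	n, err := runScenario(&echoScenario{}, 5)
	fmt.Println(n, err) // 5 <nil>
}
```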
3.7 Load Generation Engines
Single Agent, multiple engines
The minimum scheduling unit is the Agent, but each Agent has several load generation engines mounted to support different scenarios. The Rhino platform inserts an engine adapter layer between the test data and the engines to decouple them. The adapter generates test data in different schemas depending on the engine selected, and drives the chosen engine to run the test, transparently to users.
Engine selection
We use both open-source and self-developed load generation engines.
Open-source engines have many maintainers, rich functionality, stability, and good performance; the downsides are that their input formats are fixed and customization is difficult. In addition, an open-source engine usually runs in a separate process from the Agent, and the inter-process communication makes it harder to control.
A self-developed engine usually runs in the same process as the Agent and is easy to control; the downside is that its performance may be slightly worse. But Golang naturally supports high concurrency, so the gap between our own engine and the open-source ones is not significant.
- HTTP protocol: Gatling by default, whose single-node load generation performance is excellent, far better than JMeter's. For intelligent load testing or dynamic adjustment, we switch to the self-developed engine.
- RPC protocol: self-developed engine, mainly using Golang goroutines plus an RPC connection pool for highly concurrent load generation.
- Go Plugin: the self-developed engine uses Golang's plugin loading feature to load the custom load plugin and run the test.
3.8 Load Test Monitoring
Client-side monitoring
The company's monitoring system has a minimum time granularity of 30 seconds, with all data within 30 seconds converging into one point; that granularity is unacceptable for load testing. Rhino therefore built its own client-side monitoring system.
- Each request is bucketed by its start time.
- Within an Agent, the data generated within one second for the same interface of the same task is aggregated locally and reported to Kafka.
- The monitoring service consumes the data from Kafka, aggregates the data reported by multiple Agents, and writes it to the database.
- The front-end monitoring report pulls the real-time aggregated data from the database and draws real-time monitoring curves.
- In this aggregation process, computing the PCT99 of request response time is the hard part:
- The Rhino platform currently uses the t-digest algorithm to compute the PCT99 within each second.
- The PCT99 over the whole period is aggregated as PCT & AVG: the per-second PCT99 is computed by t-digest, and the whole-period PCT99 is the average of the per-second PCT99 values.
- This scheme has been aligned with the company's server-side monitoring algorithm, to reduce the gap between client-side and server-side monitoring and the interference in analyzing test results.
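The per-second-PCT99-then-average aggregation can be illustrated as below. For simplicity the sketch computes an exact nearest-rank percentile per second, whereas Rhino uses t-digest sketches so that data from many Agents can be merged cheaply.

```go
package main

import (
	"fmt"
	"sort"
)

// p99Exact computes an exact (nearest-rank style) PCT99 of one second's
// latencies; the production system uses t-digest sketches instead.
func p99Exact(latenciesMs []float64) float64 {
	s := append([]float64(nil), latenciesMs...)
	sort.Float64s(s)
	idx := int(0.99 * float64(len(s)))
	if idx >= len(s) {
		idx = len(s) - 1
	}
	return s[idx]
}

// overallP99 averages the per-second PCT99 values, matching the
// "per-second PCT99, then AVG over the period" aggregation described above.
func overallP99(perSecond [][]float64) float64 {
	sum := 0.0
	for _, sec := range perSecond {
		sum += p99Exact(sec)
	}
	return sum / float64(len(perSecond))
}

func main() {
	sec1 := []float64{10, 12, 11, 200} // one slow outlier dominates this second
	sec2 := []float64{10, 11, 12, 13}
	fmt.Println(overallP99([][]float64{sec1, sec2})) // 106.5
}
```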
Server-side monitoring
Server-side monitoring connects directly to the company's Metric system.
- During a test, the Rhino platform provides a dashboard of core metrics for all nodes along the link, and highlights nodes that may be at risk, giving real-time warnings.
- Real-time, detailed monitoring graphs are also provided for each node.
- Each node provides core metrics such as CPU, memory, QPS, and error rate by default. You can modify the monitoring configuration on the Rhino platform to add other custom metrics.
Performance profiling
During a test, the Rhino platform can also capture a real-time performance profile of the target service process and display it as a flame graph for easy performance analysis and optimization, as shown in figure 2.
4. Load Testing in Practice
The Rhino platform is a one-stop, full-link load testing platform for everyone at ByteDance. The Rhino R&D team is responsible not only for developing the platform, but also for cooperating with QA and RD to complete the performance testing of the company's major projects and key businesses.
4.1 Support for major projects
The Rhino platform actively participates in and fully supports the load testing of major projects within the company. Typical examples are the Douyin Spring Festival Gala, Xigua Video's Million Heroes, and the Spring Festival red envelope rain.
Among them, the load testing for the Byte Spring Festival red envelope rain was completed by the Rhino team. The red envelope rain is a series of large-scale traffic-driving activities launched on all Byte clients during the Spring Festival, such as collecting cards for cash, red envelope koi, and red envelope rain. Its huge traffic scale, traffic burstiness, and the high complexity of the business logic and network architecture all posed great challenges for the Rhino platform.
During the red envelope rain, all user traffic enters the aggregation rooms at the network edge through dedicated carrier lines, and is filtered and verified before being forwarded to the core rooms. The IDCs back each other up; the traffic routes are shown in the figure. Here we needed to verify not only whether the back-end services could bear the expected traffic, but also the bandwidth of each dedicated line, the bandwidth and forwarding capacity of each gateway, the carrying capacity of each IDC, and the bandwidth between them.
To this end, we divided the whole load test into multiple stages, to simplify its complexity and make problems easier to locate:
- Dial-up/CDN load tests verified the carrying capacity, bandwidth, and gateway performance of each aggregation room.
- Agents deployed in each aggregation room simulated the user traffic distribution and load tested the performance of the back-end services deployed in the core rooms.
- Single-interface single-instance tests, single-interface single-room tests, scenario-level full-link single-room tests, and scenario-level full-link full-resource tests verified back-end service performance in stages.
- Finally, a whole-network dial test simulated the real peak traffic of the Spring Festival red envelope rain and verified the performance of the whole system end to end.
Through these large projects, the Rhino team not only learned a lot about business and architecture design, but also learned how business developers view load testing and use the platform, which helped us find more problems and drove the platform's continuous iterative optimization.
4.2 Daily Load Testing Support
Supporting daily load testing is another important task for the Rhino platform. For the various problems encountered in day-to-day testing, we adopted several measures:
- A dedicated on-call rotation providing one-on-one guidance.
- A detailed load-testing knowledge base that covers not only how to use the platform, but also how to adapt services for load testing, how to draft a test plan, and how to localize performance problems.
- A comprehensive performance training program: regular performance-testing talks and professional load-testing training for the QA and RD teams.
4.3 Online Traffic Scheduling
The Rhino platform also schedules online traffic on a regular basis to automatically load-test online instances [8]:
- Online traffic is gradually shifted onto a target instance to probe the instance's performance limit, and a performance profile is produced to analyze the instance's bottlenecks.
- Long-term traffic scheduling makes it possible to observe performance changes of service instances and track service performance trends.
- From an instance's performance at different resource watermarks, the capacity of the entire cluster can be estimated, completing service capacity estimation and online risk assessment.
- With swimlane-based traffic scheduling, service cluster capacity can be predicted precisely.
The implementation works as follows:
- Change the target instance's Weight value in the load balancer and increase it step by step, diverting more and more traffic to the instance until a specified stop threshold is reached.
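The weight-adjustment loop above can be sketched as follows. This is a minimal illustration, not Rhino's actual implementation: `lb_client` and `metrics_client` are hypothetical stand-ins for the real load-balancer and monitoring APIs, and the thresholds are assumed values.

```python
import time

def schedule_traffic(lb_client, metrics_client, instance,
                     cpu_stop_threshold=0.85, weight_step=10,
                     max_weight=500, interval_s=60):
    """Gradually raise the instance's LB weight until a stop threshold is hit.

    Returns the (weight, qps, cpu) samples collected along the way, which
    together form the instance's performance profile.
    """
    weight = lb_client.get_weight(instance)
    profile = []
    while weight < max_weight:
        weight += weight_step
        lb_client.set_weight(instance, weight)    # divert more traffic
        time.sleep(interval_s)                    # let traffic stabilize
        qps = metrics_client.qps(instance)
        cpu = metrics_client.cpu_usage(instance)
        profile.append((weight, qps, cpu))
        if cpu >= cpu_stop_threshold:             # stop condition reached
            break
    lb_client.set_weight(instance, lb_client.default_weight)  # restore
    return profile
```

The stop threshold here is CPU usage, but in practice it could equally be error rate or latency; the key property is that the loop always restores the original weight when it exits.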
At present, 500+ microservices have been onboarded, and traffic scheduling runs on a daily schedule to track the performance trends of online services, as shown in the figure below.
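One simple way to turn the per-instance results into the cluster capacity estimate mentioned above is to extrapolate linearly from a single instance's measured limit. The article does not specify Rhino's estimation model, so this sketch assumes homogeneous instances, near-linear scaling, and an arbitrary safety factor:

```python
def estimate_cluster_capacity(single_instance_qps_at_watermark,
                              instance_count, safety_factor=0.8):
    """Rough cluster-capacity estimate from one instance's measured limit.

    Assumes near-linear scaling across homogeneous instances; the safety
    factor leaves headroom below the resource watermark used in the test.
    """
    return single_instance_qps_at_watermark * instance_count * safety_factor

# e.g. one instance sustains 1,200 QPS at a 70% CPU watermark, 40 instances:
# estimate_cluster_capacity(1200, 40)  -> 38400.0
```

A production estimator would also have to account for shared downstream dependencies and uneven traffic distribution, which is exactly what the swimlane-based scheduling mentioned above helps measure.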
4.4 Normalized Load Testing
The Rhino platform is also building out the company's normalized load-testing system, which runs automated full-link load tests on a periodic schedule to:
- Monitor the capacity of online service clusters and catch service performance degradation early.
- Monitor online link capacity and catch link performance degradation early.
Normalized load tests on Rhino run periodically and unattended, and push their results automatically. During a test, the platform turns on the load-test switches along the call chain and injects test traffic, monitors service performance in real time, and automatically circuit-breaks the test based on metrics and alarm monitoring to keep it safe.
Several business teams have already adopted normalized load testing to safeguard the stability of their online services.
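The automatic circuit-breaking check described above can be sketched as a set of metric guardrails polled during the test. The metric names and thresholds below are illustrative assumptions, not Rhino's actual configuration or API:

```python
# Guardrails for an unattended load test: if any observed metric exceeds
# its limit, the test is aborted. Thresholds here are assumed examples.
GUARDRAILS = {
    "error_rate": 0.01,      # abort if more than 1% of requests fail
    "p99_latency_ms": 800,   # abort if tail latency blows past 800 ms
    "cpu_usage": 0.90,       # abort if the target's CPU saturates
}

def should_circuit_break(metrics: dict) -> bool:
    """Return True if any observed metric exceeds its guardrail."""
    return any(metrics.get(name, 0) > limit
               for name, limit in GUARDRAILS.items())
```

In an unattended run, a scheduler would call a check like this on every polling interval and tear down the load generators as soon as it returns True, which is what makes periodic tests safe without a human watching the dashboards.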
4.5 Load Testing in the DevOps Pipeline
A service release goes through pre-release, small-scale online canary, and then full online rollout. In this process, online test cases and canary releases can intercept functional defects before full rollout, but they are much less effective at intercepting performance defects.
Our online fault-tracking system shows that many performance defects escape to production precisely because no performance load test was run before release.
To intercept such performance defects, the Rhino platform integrates with the DevOps platform. The load-testing service is registered as an atomic service on the DevOps platform, so developers can place a load-test node at any point in any pipeline and run routine load tests before going live. Load testing in the DevOps pipeline not only helps RD find performance problems in code, but also diffs results against a performance baseline to catch early signs of performance deterioration.
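The baseline diff performed by a pipeline load-test node can be sketched as below. The metric names, the 10% tolerance, and the "higher is worse" convention are all assumptions for illustration; the article does not describe Rhino's actual diff rules:

```python
def diff_against_baseline(baseline: dict, current: dict, tolerance=0.10):
    """Return the metrics that regressed beyond `tolerance` (fractional).

    Compares the new build's load-test results against the stored baseline.
    Assumes "higher is worse" metrics (latency, error rate); a throughput
    metric like QPS would need the opposite comparison in a real system.
    """
    regressions = {}
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is None:
            continue  # metric not measured this run; skip
        if cur > base * (1 + tolerance):
            regressions[metric] = (base, cur)
    return regressions
```

A pipeline stage would fail the build whenever this returns a non-empty dict, e.g. a p99 latency that moved from 120 ms to 150 ms exceeds a 10% tolerance and gets flagged.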
5. Summary and Outlook
5.1 Summary
In less than two years since the project started, the Rhino load-testing platform has begun to take shape, as shown in the figure (monthly load-test execution statistics). We are very grateful to all the teams we worked with across the company, especially the architecture and middle-platform teams, for their support of the platform; without it, building full-link load testing would have been very difficult.
5.2 Future Development
Deep customization for business lines
The general-purpose load-testing platform is now largely in place and can meet the daily load-testing needs of the business lines. However, daily support work has shown that different business lines still have a lot of pre- and post-test work that must be done manually.
How can we further reduce the cost of adapting services for load testing, lower the cost of preparing test data and environments, clean up test data quickly, and pinpoint performance problems fast? The Rhino platform will follow up on these questions in close cooperation with each major business line, providing deeper business-specific customization and better support for the business lines' development.
Load testing and capacity planning
Are a business's current resources sufficient, and what exactly is its capacity? Given current business growth, how long will its machine resources last?
What is the current utilization of service resources? Can it be improved? How can we further raise resource utilization and cut machine costs?
How many resources need to be requisitioned for a major event? Can these conclusions be drawn without running a load test at all, for example automatically from online traffic data or from routine load-test results?
Load testing and SRE
How do we ensure service stability, monitor service performance deterioration and warn about it in time, and verify that service-governance measures such as rate limiting, timeouts, retries, and circuit breaking are configured correctly? The Rhino platform will also explore how chaos testing can be used for disaster-recovery drills to further safeguard service stability.
6. Recruiting
The Rhino team is currently still small and is hiring R&D engineers for performance testing and back-end development. If you are interested, we would love to have you join us. Resumes: [email protected]; email subject: Name – years of experience – Rhino.
References
[1] jm.taobao.org/2017/03/30/…
[2] testerhome.com/topics/1949…
[3] tech.meituan.com/2018/09/27/…
[4] tech.meituan.com/2019/02/14/…
[5] www.open-open.com/lib/view/op…
[6] www.infoq.cn/article/Nvf…
[7] tech.bytedance.net/articles/31…
[8] www.usenix.org/conference/…