Background

In addition to providing high-quality video services for hundreds of millions of users every day, iQiyi also runs sports, live streaming, literature, and other businesses to serve a wider range of users. Many business lines run marketing campaigns almost every day, and the resulting traffic can introduce uncertainty into our services at any time. The iQiyi payment team provides comprehensive collection and payment services for all business lines and is responsible for the users' payment experience. Besides keeping services stable, the team also has to handle traffic surges that may break out at any time. Accurate capacity assessment and contingency planning are therefore very important for the payment system, and full-link stress testing provides a strong guarantee in this respect.

Full-link stress testing runs in the production environment: it simulates the massive requests of peak hours to test the entire system link, and the results drive capacity assessment and system tuning. Because payment services are data-sensitive and complex, it is difficult to evaluate the call links between systems accurately and comprehensively and to stress test them. Before we adopted full-link stress testing, we often ran into the following problems:

  • Production traffic has a complex composition, and stress testing a single service or interface in isolation cannot effectively evaluate the capacity of the production environment.

  • Estimates of traffic conversion do not match real user behavior, so contingency plans fail to achieve the expected effect.

  • Bottlenecks in shared resources and services are hard to expose in local stress tests and can only be verified under real peak traffic.

  • Capacity is not aligned along the link, so the weakest service limits the whole link while resources elsewhere are seriously wasted.

The main cause of these problems was that we had not stress tested the system in the production environment with traffic that reflects real scenarios, so we could not make an accurate evaluation. To solve them, we began exploring and practicing full-link stress testing, with the payment business as the core scenario.

Exploration and practice

To carry out full-link stress testing, we explored and practiced in the following areas:

  • **Core link mapping:** clarify the target link and branch links of the stress test, and eliminate stress test risks;

  • **Stress test environment preparation:** the stress test link must pass the stress test tag through transparently and handle stress test traffic correctly;

  • **Traffic construction:** build the stress test traffic model by analyzing real data together with business strategy;

  • **Execution and safeguards:** verify the business before execution, execute according to plan, and monitor and review properly.

1

Core link mapping

Core link mapping strings multiple business lines together. In practice, each business team maps its own core links, clarifying its dependencies on downstream services and identifying its bypass dependencies. The core links of all business lines are then combined into the final full link to be stress tested. The core link may be coupled to external services that cannot take part in the stress test, such as the third-party channels the payment system depends on. In practice, we built our own channel mock service that supports payment requests for the various channels and simulates payment success rate, payment callback delay, payment notification, and so on, which closes the loop of the payment link.
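To make the idea of a channel mock concrete, here is a minimal sketch of a stand-in for a third-party channel that simulates a configurable success rate and callback delay. It is an illustration only, not the team's mock service: the names `MockChannel`, `MockChannelConfig`, and `MockPaymentResult`, and all default values, are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class MockChannelConfig:
    # Hypothetical defaults; a real mock would be tuned per channel.
    success_rate: float = 0.95             # share of payments reported as successful
    callback_delay_ms: tuple = (50, 500)   # min and max simulated callback delay


@dataclass
class MockPaymentResult:
    order_id: str
    success: bool
    callback_delay_ms: int


class MockChannel:
    """Stands in for a third-party payment channel during a stress test."""

    def __init__(self, config, seed=None):
        self.config = config
        self._rng = random.Random(seed)

    def pay(self, order_id):
        # Decide the outcome according to the configured success rate.
        success = self._rng.random() < self.config.success_rate
        # Pick how long the asynchronous callback would be delayed.
        low, high = self.config.callback_delay_ms
        delay = self._rng.randint(low, high)
        return MockPaymentResult(order_id, success, delay)
```

A caller would schedule the asynchronous notification after `callback_delay_ms`; the sketch only decides the outcome and the delay and does not sleep or send anything.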

Mapping bypass dependencies is also an important part: accurate and comprehensive mapping is what makes the later stress test go smoothly. Bypass dependencies are evaluated case by case according to the actual business. For example, we mock the risk control system, and we degrade the accounting service through the messaging component.

After mapping the links of a flash-sale ticketing scenario, we found that the core link mainly involves six services: user login authentication, ticketing campaign, ticketing, cashier, payment, and notification. The bypass dependencies on risk control, recommendation, and push were degraded after evaluation, and the important payment channel and ticketing services were mocked. Finally, a core link that closes the business loop is formed, as shown in the following figure:

2

Prepare the stress test environment

Besides the business servers, the core link depends on databases, message queues, caches, logs, and other components. At this stage we mainly considered and supported the following points:

  • Transparent pass-through of the stress test tag

Stress test traffic is distinguished by injecting an agreed identifier into the entry traffic. Recognizing this identifier and passing it through transparently is the foundation of full-link stress testing. We monitor and manage our systems with APM (Application Performance Management), which already carries the information of a single call across HTTP, RPC, thread pools, and middleware, so APM became the preferred carrier for the stress test tag. In practice, a simple modification of APM was enough to pass through and recognize the tag.
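The pass-through idea can be sketched independently of any APM product: read the tag at the entry point, keep it in the call context, and copy it onto every outbound request. The sketch below uses Python's standard `contextvars` for the call context. The header name `X-Stress-Test` and the function names are assumptions made for illustration, not the identifiers used in the actual system.

```python
import contextvars

# Hypothetical header name agreed between services.
STRESS_HEADER = "X-Stress-Test"

# Call-scoped flag; defaults to normal traffic.
_is_stress = contextvars.ContextVar("is_stress", default=False)


def extract_tag(headers):
    """Inbound: read the tag from request headers into the call context."""
    _is_stress.set(headers.get(STRESS_HEADER) == "1")


def inject_tag(headers):
    """Outbound: copy the tag from the call context onto a downstream request."""
    out = dict(headers)
    if _is_stress.get():
        out[STRESS_HEADER] = "1"
    return out


def is_stress_traffic():
    """Lets storage and messaging layers ask whether the current call is tagged."""
    return _is_stress.get()
```

Handing work to a thread pool does not carry the context along automatically, so the submitting side would copy it explicitly, for example with `contextvars.copy_context().run(...)`. This is the kind of gap a tracing agent normally closes.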

  • Storage data isolation

Stress test traffic that includes writes can pollute the database and corrupt data, which makes isolation especially important for payment traffic. For relational databases, we create a shadow table with the same structure as the original table and direct stress test traffic into it, which isolates the stress test data. How to route stress test traffic to the shadow table depends on the project; we implemented shadow table routing in two ways, with ShardingJDBC and with a MyBatis interceptor.
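The routing step itself amounts to rewriting table names when the current call carries the stress test tag. The sketch below shows that idea with a plain regular expression. It is not the ShardingJDBC or MyBatis interceptor API, and the `shadow_` prefix and the table names are hypothetical.

```python
import re

# Hypothetical naming convention and table list.
SHADOW_PREFIX = "shadow_"
ISOLATED_TABLES = {"pay_order", "pay_refund"}

# Match whole table names only, so a column such as pay_order_id is left alone.
_TABLE_PATTERN = re.compile(r"\b(" + "|".join(sorted(ISOLATED_TABLES)) + r")\b")


def route_sql(sql, is_stress):
    """Return the SQL unchanged for normal traffic, or pointed at shadow tables."""
    if not is_stress:
        return sql
    return _TABLE_PATTERN.sub(lambda m: SHADOW_PREFIX + m.group(1), sql)
```

A text-level rewrite like this would also touch a table name that appears inside a string literal, which is one reason a production setup would rely on a SQL-aware layer instead.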

  • Messages

In the messaging middleware, instead of adding new topics, we add a stress test marker to the message. Consumers that need to degrade handle stress test messages through their subscription policy. For example, we attach an agreed UserProperty to the RocketMQ message, and consumers subscribe as needed.
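The marker-plus-subscription approach can be sketched without a broker. The `Message` class below is a stand-in for a message that carries user properties, not the RocketMQ client API, and the property key `stress_test` is a hypothetical convention.

```python
from dataclasses import dataclass, field

# Hypothetical property key agreed between producers and consumers.
STRESS_PROPERTY = "stress_test"


@dataclass
class Message:
    topic: str
    body: bytes
    properties: dict = field(default_factory=dict)


def tag_message(msg, is_stress):
    """Producer side: mark the message when the current call is stress traffic."""
    if is_stress:
        msg.properties[STRESS_PROPERTY] = "true"
    return msg


def should_consume(msg, accept_stress):
    """Consumer side: a degraded consumer passes accept_stress=False."""
    is_stress = msg.properties.get(STRESS_PROPERTY) == "true"
    return accept_stress or not is_stress
```

Keeping one topic and filtering on a property means normal and stress test messages share the same broker path, so the middleware itself is exercised by the test.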

  • Data water level

The middleware mostly handles stress test traffic through data isolation. Before the stress test, the shadow table must be filled to the same water level as the original table, because an empty table behaves differently from a table holding tens of millions of rows.
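As a small illustration of matching the water level, the sketch below creates a shadow table with the same columns as the original and copies the rows across, using the standard `sqlite3` module. It assumes a single database and trusted table names, and the function name is hypothetical. A real payment system would more likely generate or desensitize the data in batches rather than copy it directly.

```python
import sqlite3


def fill_shadow_table(conn, table, shadow_table):
    """Create the shadow table if needed and bring it to the original's row count.

    Table names are interpolated into SQL, so they must come from trusted config.
    """
    # Copy the column layout without copying any rows.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {shadow_table} AS SELECT * FROM {table} WHERE 0"
    )
    # Start from a clean shadow table so repeated runs give the same result.
    conn.execute(f"DELETE FROM {shadow_table}")
    conn.execute(f"INSERT INTO {shadow_table} SELECT * FROM {table}")
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {shadow_table}").fetchone()[0]
```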

Many technical components on the core link may need to support stress test traffic. The principles and solutions are similar: isolate the data according to the actual business situation. For example, Redis can use shadow keys or a shadow cluster, MongoDB a shadow collection, Elasticsearch a shadow index, and logs a shadow directory. The following figure shows how the core link of iQiyi's payment system handles stress test traffic once the full-link stress test environment is ready:

3

Traffic construction

In single-interface stress tests, we generate a batch of data that conforms to the interface specification in advance, but such data is too simple and does not match real business scenarios. The ideal method is to capture log data at the gateway layer in the production environment, process it, and then replay the traffic. We did not implement this method in the initial exploration stage. Instead, we analyzed existing data and interface call volumes, adjusted the data model according to business strategy, and arrived at the final stress test model. For example, in the full-link stress test of payment, we analyzed user data in the production environment to determine the share of each payment method and the call ratio of each service, factored in the effect of business strategy on users' purchase metrics, fine-tuned the cashier conversion rate, and finally determined the call volume ratio model for the order service, channel notification, and order query service.
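A call volume ratio model of this kind can be written down as a few multipliers applied to the entry traffic. The sketch below is one possible shape for such a model. The field names and the way the services are chained are assumptions for illustration, and any concrete numbers would come from production analysis rather than from this example.

```python
from dataclasses import dataclass


@dataclass
class TrafficModel:
    checkout_conversion: float   # share of cashier visits that place an order
    payment_success_rate: float  # share of orders that trigger a channel notification
    queries_per_order: float     # average order status queries per order

    def target_qps(self, cashier_qps):
        """Derive the per-service target QPS from the cashier entry QPS."""
        order_qps = cashier_qps * self.checkout_conversion
        return {
            "cashier": cashier_qps,
            "order": order_qps,
            "channel_notify": order_qps * self.payment_success_rate,
            "order_query": order_qps * self.queries_per_order,
        }
```

For instance, with a hypothetical conversion of 0.5, success rate of 0.8, and 2 queries per order, a cashier target of 1000 QPS implies 500 QPS on the order service, about 400 QPS of channel notifications, and 1000 QPS of order queries.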

4

Stress test execution and safeguards

Before the stress test, verify the business and the environment along the whole link. Confirm that bypass dependencies have been degraded and that data has been isolated, so that stress test traffic does not affect normal business data.

Monitoring is an important means of evaluating system health during a full-link stress test, and it helps us find problems and stop losses in time. We prepared indicators in the following dimensions in advance:

  • Request volume, success rate, and latency of core link services;

  • Performance indicators such as message backlog, Redis water level and hit ratio, and database load;

  • Basic machine metrics.
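One way to turn such indicators into a stop-loss decision is to compare each monitoring snapshot against thresholds agreed before the test. The sketch below is illustrative only: the `Snapshot` fields cover a subset of the indicators listed above, and every threshold value is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    success_rate: float     # core link success rate, 0.0 to 1.0
    latency_ms: float       # core link latency
    message_backlog: int    # unconsumed messages
    db_load: float          # database load, 0.0 to 1.0


# Hypothetical stop-loss thresholds agreed before the test.
MIN_SUCCESS_RATE = 0.99
MAX_LATENCY_MS = 500.0
MAX_MESSAGE_BACKLOG = 10_000
MAX_DB_LOAD = 0.8


def breached_indicators(s):
    """Return the names of all indicators outside their thresholds."""
    breached = []
    if s.success_rate < MIN_SUCCESS_RATE:
        breached.append("success_rate")
    if s.latency_ms > MAX_LATENCY_MS:
        breached.append("latency_ms")
    if s.message_backlog > MAX_MESSAGE_BACKLOG:
        breached.append("message_backlog")
    if s.db_load > MAX_DB_LOAD:
        breached.append("db_load")
    return breached


def should_stop(s):
    """Stop the stress test as soon as any indicator is breached."""
    return bool(breached_indicators(s))
```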

Before the stress test, each business owner must set up rate limiting and degradation for their own services and reserve a safety buffer for normal business traffic, rather than relying entirely on the circuit-breaking conditions of the stress test platform. The stress test also verifies whether our current rate limits are reasonable and whether the degradation plans execute properly and meet expectations.
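Reserving a safety buffer can be sketched as a limiter that caps stress test traffic below the total capacity, so that part of each window is always left for normal requests. The class name, the fixed-window counting, and the numbers in the usage note are assumptions for illustration. A production limiter would also handle time and concurrency.

```python
class ReservedCapacityLimiter:
    """Fixed-window limiter that keeps part of the capacity for normal traffic."""

    def __init__(self, capacity_per_window, reserved_for_normal):
        if not 0 <= reserved_for_normal <= capacity_per_window:
            raise ValueError("reserved capacity must be within total capacity")
        self.capacity = capacity_per_window
        # Stress traffic may never use the reserved share.
        self.stress_limit = capacity_per_window - reserved_for_normal
        self.normal_count = 0
        self.stress_count = 0

    def allow(self, is_stress):
        if self.normal_count + self.stress_count >= self.capacity:
            return False
        if is_stress:
            if self.stress_count >= self.stress_limit:
                return False
            self.stress_count += 1
        else:
            self.normal_count += 1
        return True

    def reset_window(self):
        """Called by the owner at the start of each new window."""
        self.normal_count = 0
        self.stress_count = 0
```

With a capacity of 10 and 4 reserved, stress test traffic is cut off after 6 requests in a window, and normal traffic can still get through afterwards.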

Stress test practice and results

Once the full-link stress test capability was in place, we carried out full-link stress tests in the following two areas:

1

Verifying service capacity

The payment team works with the live streaming and ticketing teams to run full-link stress tests for marketing and purchase campaigns. With the capacity required by the business as the goal, the tests locate the weakest services on the full link, and scaling out, asynchronous processing, and degradation are then used to bring the full-link capacity up to target. Rate limiting and degradation policies are also verified to ensure service availability at peak hours.

During the stress test, the payment system supports multiple payment methods, and the channel mock service provides close-to-real scenarios such as payment launch delay, simulated order success rate, synchronous responses, asynchronous server notifications, and notification delay, forming a closed loop for the full-link stress test.

2

Stress testing the system to its limit

The payment system is itself structurally complex. Cashier, accounting, risk control, authentication, channel, and other services are each maintained by their own team. Inadequate protection at any point may affect the stability of the whole system, so the payment system needs a full-link stress test of its own.

The implementation is divided into two directions:

  • Virtual currency consumption stress test

Virtual currency consumption does not involve third-party channels, yet it is a standard payment method with the same life cycle as other payment methods such as WeChat and Alipay. Because it does not rely on third-party channels, it effectively verifies the capacity of the payment system itself.

In the initial stress test, we found that the TPS of virtual currency payment reached only 60% of the designed capacity. Latency analysis of the link showed that some calls to the order number generation service had been degraded, which increased the overall RT. Further analysis showed that the auto-increment sequence for distributed order numbers depended on Redis, and that interactions with Redis were jittering. After confirming with the middleware team, we traced the cause to the forked child process pausing the service during RDB persistence. Given the requirements of this scenario, we disabled RDB and also optimized the degradation strategy. After several rounds of stress testing and tuning, the TPS of the payment system came close to the load limit of the currently deployed services.

  • Mixed stress test

Based on our analysis of production data, we constructed a mix of payment methods in a fixed ratio to stress the cashier service. This scenario realistically simulates the upstream traffic of the payment system and covers business scenarios more comprehensively. The stress test located capacity mismatches at some service nodes, and we finally aligned capacity through service-level scaling and code optimization.
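Capacity alignment can be reasoned about with simple arithmetic: if each entry request causes a known average number of calls to each service, the entry QPS the link can sustain is bounded by the service whose capacity divided by its call ratio is smallest. The sketch below computes that bound. The function name is hypothetical, and the service names and numbers in the usage note are made up for illustration.

```python
def link_capacity(node_capacity_qps, calls_per_request):
    """Return (maximum entry QPS, name of the bottleneck service).

    node_capacity_qps: measured capacity of each service in QPS.
    calls_per_request: average calls to each service per entry request.
    """
    limits = {
        name: node_capacity_qps[name] / calls
        for name, calls in calls_per_request.items()
        if calls > 0
    }
    bottleneck = min(limits, key=limits.get)
    return limits[bottleneck], bottleneck
```

For example, with capacities of 5000, 2000, and 3000 QPS for cashier, order, and channel services, and call ratios of 1.0, 0.5, and 2.0, the link is limited to 1500 entry QPS by the channel service, so scaling the other two would not raise the overall capacity.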

Full-link stress testing requires cooperation across business lines. Each stress test should be planned in advance: what the target is, what data is needed, what contingency plans are needed, how the traffic strategy is set and executed, and so on. Review and follow-up on the results are also needed so that stress tests proceed in an orderly and controllable way.

Future plans

Full-link stress testing in practice involves many scattered requirements, such as data models, traffic scheduling, monitoring and tracing, and performance analysis, and many details are still being explored. As we gain practical experience, we will plan these in a unified way and build a one-stop solution that lowers the onboarding difficulty and implementation cost of full-link stress testing.

In full-link stress tests that involve the payment system, traffic comes from upstream services, and constructing a diverse mix of payment methods carries a cost. We plan to support automatically mapping a single kind of upstream traffic to diversified payment requests. Because payment-related businesses are data-sensitive, we also plan to provide stress test data reports to upstream businesses, accounting, and other related parties, so that verification requirements can be met when necessary.