Background

Every system has its core metrics. Take order acquiring as an example: for an order-acquiring system, the most important thing is to ensure the accuracy of the acquired transactions, and the second most important is to ensure acquiring efficiency. For a clearing and settlement system, the most important thing is to ensure accurate payouts, and the second most important is to ensure timely payouts. The system we are responsible for is the core link of Meituan-Dianping intelligent payment. It carries 100% of intelligent payment traffic and is internally referred to as core transaction. Because it involves the flow of funds between all of Meituan-Dianping's offline merchants and users, the first and the second most important things for core transaction are both stability.

Problems

As a platform team, our ideal was to quickly support the business in the first stage, take control of one direction in the second stage, and in the third stage observe the market and lead the overall direction.

The ideal was plump, but reality was skinny: daily orders grew from several hundred thousand at the beginning of 2017 to more than 7 million by the end of the year, and the system faced enormous challenges. Payment channels kept increasing, call links kept lengthening, and system complexity grew accordingly. From the initial POS machines to the later QR-code products, the small white box, the small black box, second payment, and so on, the products diversified and the positioning of the system kept changing. The system responded to change like a tortoise racing a hare.

Because of the rapid business growth, incidents would occur even without any system development or upgrade, and their frequency kept increasing. Upgrading the system itself was also often difficult: infrastructure upgrades and upstream or downstream upgrades frequently produced a "butterfly effect", with impact arriving without warning.

Problem analysis

The stability of core transactions is fundamentally a question of how to achieve high availability.

The industry’s standard for high availability is measured in terms of system downtime:

Since the industry standard is a lagging indicator, and considering its guiding value for daily work, we usually use OCTO, our service governance platform, to calculate availability. The calculation method is:

There are two other key metrics of system reliability commonly used in the industry:

  • Mean Time Between Failures (MTBF): the average time a system operates normally before a failure occurs
  • Mean Time To Repair (MTTR): the average time it takes the system to go from the failed state back to the working state (see the formula after this list)
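
For reference, these two metrics relate to availability through the standard textbook formula (not necessarily the exact OCTO calculation): Availability = MTBF / (MTBF + MTTR). As an illustrative example, an MTBF of 30 days (43,200 minutes) with an MTTR of about 4.3 minutes corresponds to roughly 99.99% availability.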

For core transaction, the ideal for availability is to have no failures at all. When a failure does occur, what determines its impact is not only the duration but also the scope. The availability of core transaction can be broken down into:

Problem solving

1. Keep failure frequency low: don't die when others die

1.1 Eliminate dependencies, weaken dependencies, control dependencies

Let's walk through an example using the STAR method (Situation, Task, Action, Result):

We need to design a system A: Meituan-Dianping POS machines will connect to banks through system A to complete payment. There will also be promotional features such as spend-threshold discounts and the use of points.

Analyze the explicit and implicit requirements for system A:

  1. Receive upstream parameters, including merchant information, user information, device information, and promotion information.
  2. Generate an order number and write the transaction's order information to the database.
  3. Encrypt sensitive information.
  4. Call the downstream bank's interface.
  5. Support refunds.
  6. Synchronize order information to the departments handling point redemption and verification.
  7. Provide merchants with an interface to view their orders.
  8. Perform payment settlement for merchants.

Based on these requirements, analyze how to keep the core link, "payment by POS machine", stable.

Analysis: requirements 1 through 4 form the necessary payment link and can be implemented in one subsystem; let's call it the acquiring subsystem. Requirements 5 through 8 are relatively independent, and each can be its own subsystem, depending on the number of developers, maintenance cost, and so on. It is worth noting that requirements 5 through 8 do not depend on the acquiring subsystem functionally, but on its data: they all rely on the generated order data. The acquiring subsystem is the core of the whole system and has very high stability requirements; problems in the other subsystems must not affect it. Based on this analysis, we need to decouple the acquiring subsystem from the other subsystems and manage the data handed to the other systems in a unified way, through what we call the "data subscription and forwarding subsystem". As long as that subsystem does not affect the stability of the acquiring subsystem, we are fine. The rough architecture diagram is as follows:

As can be seen from the figure above, there is no direct dependency among the acquiring subsystem, the refund subsystem, the settlement subsystem, the information synchronization subsystem, and the order viewing subsystem; this architecture achieves dependency elimination. The acquiring subsystem does not need to depend on the data subscription and forwarding subsystem, while the latter does need the acquiring subsystem's data. We control the direction of the dependency: the data subscription and forwarding subsystem pulls data from the acquiring subsystem, rather than having the acquiring subsystem push data to it. That way, if the data subscription and forwarding subsystem goes down, the acquiring subsystem is not affected. As for how the data is pulled: if the data is stored in a MySQL database, for example, it can be pulled by subscribing to the Binlog. If a message queue is used for the data transfer, there is a dependency on the message queue middleware. If we design a disaster recovery plan in which, should the message queue go down, data is transferred by direct RPC calls instead, then we have weakened the dependency on the message queue.
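
To make the "pull, not push, with a degradation path" idea concrete, here is a minimal Java sketch. All the names (OrderRecord, BinlogPuller, MessageQueueClient, RpcForwardClient) are hypothetical and do not refer to the actual Meituan-Dianping components; the point is only that the forwarding side owns the dependency and can fall back from the message queue to a direct RPC call.

```java
import java.util.List;

/** Hypothetical order record pulled from the acquiring subsystem's Binlog. */
record OrderRecord(String orderId, String payload) {}

/** Hypothetical interfaces; real implementations would wrap a Binlog client, an MQ producer, and an RPC stub. */
interface BinlogPuller { List<OrderRecord> pullNextBatch(); }
interface MessageQueueClient { void publish(OrderRecord record) throws Exception; }
interface RpcForwardClient { void forward(OrderRecord record); }

/** Data subscription and forwarding subsystem (sketch): it pulls from the acquiring side,
 *  so the acquiring subsystem has no runtime dependency on it. */
public class SubscriptionForwarder {
    private final BinlogPuller puller;
    private final MessageQueueClient mq;
    private final RpcForwardClient rpcFallback;

    public SubscriptionForwarder(BinlogPuller puller, MessageQueueClient mq, RpcForwardClient rpcFallback) {
        this.puller = puller;
        this.mq = mq;
        this.rpcFallback = rpcFallback;
    }

    public void forwardOnce() {
        for (OrderRecord record : puller.pullNextBatch()) {
            try {
                mq.publish(record);          // normal path: hand off via the message queue
            } catch (Exception mqDown) {
                rpcFallback.forward(record); // degraded path: direct RPC when the queue is unavailable
            }
        }
    }
}
```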

1.2 No external calls inside transactions

External calls include calls to external systems and to underlying components. External calls are characterized by unpredictable latency, and including them in a database transaction inevitably produces large transactions. A large database transaction can prevent other requests from obtaining database connections, which leaves all services that depend on the database waiting, fills up the connection pool, and brings down multiple services at once. If this is not handled properly, it rates five stars on the danger scale. The following diagram shows the uncontrollable timing of external calls:

Solutions:

  • Review each system's code to check whether a transaction contains RPC calls, HTTP calls, message queue operations, cache access, looping queries, or other time-consuming operations. These should be moved outside the transaction; ideally, only database operations are handled inside the transaction (see the sketch after this list).
  • Add monitoring and alarms for large transactions, so that email and SMS alerts are sent when one occurs. For database transactions, we generally use three alarm levels: over 1 s, over 500 ms, and over 100 ms.
  • Prefer annotations over XML for configuring transactions. With XML-configured transactions, first, readability is poor; second, the pointcuts tend to be configured too broadly, which easily produces oversized transactions; and third, nested transaction rules are harder to handle.
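
As a minimal illustration of the first point, here is a hedged Java/Spring sketch (the class and method names are hypothetical, not our actual code): the slow RPC call runs after the transactional method has returned, so only database writes execute inside the transaction.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical domain types for the sketch.
class Order { String orderId; }
interface OrderRepository { void insert(Order order); }
interface BankRpcClient { void requestPayment(Order order); }

/** Hypothetical persistence service: only database work runs inside the transaction. */
@Service
class OrderPersistenceService {
    private final OrderRepository orderRepository;

    OrderPersistenceService(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @Transactional
    public void saveOrder(Order order) {
        orderRepository.insert(order);   // short transaction, database only
    }
}

/** Orchestrator: the slow external call happens after the transaction has committed. */
@Service
class PaymentService {
    private final OrderPersistenceService persistence;
    private final BankRpcClient bankRpcClient;   // external call with unpredictable latency

    PaymentService(OrderPersistenceService persistence, BankRpcClient bankRpcClient) {
        this.persistence = persistence;
        this.bankRpcClient = bankRpcClient;
    }

    public void pay(Order order) {
        persistence.saveOrder(order);          // 1. commit the database transaction first
        bankRpcClient.requestPayment(order);   // 2. then make the external call, outside any transaction
    }
}
```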

1.3 Set proper Timeout and Retry parameters

Our system depends on external systems and on basic components such as caches and message queues. Suppose one of these dependencies suddenly has a problem; our system's response time then becomes: internal processing time + dependency timeout × number of retries. If the timeout is set too long, or the dependency simply never responds, the connection pool can fill up and the whole system can die. If the timeout is set too short, 499 errors increase and availability drops. Here's an example:

Service A relies on data from two downstream services to complete this operation. Normally there is no problem. But if, without your knowledge, service B's response time grows or it stops serving altogether, and your client timeout is set too long, then the time to complete the whole request also grows. If an incident happens at that moment, the consequences can be very serious.

Java Servlet containers, whether Tomcat or Jetty, use a multi-threaded model in which worker threads process requests, and the number of worker threads has an upper limit. Once the maximum is reached, remaining requests are placed in a waiting queue, which is also bounded. Once the queue is full, the web server rejects service and Nginx returns 502. If your service handles high QPS, in this scenario it will be dragged down as well; and if your upstream has not set a proper timeout, the failure keeps spreading upward. This process of failure amplification is the service avalanche effect.

Solutions:

  • First, find out how long the dependency's own timeout is when it calls its downstream. The caller's timeout should be larger than the time the dependency needs to call its downstream.
  • Look at the interface's TP99 response time and add about 50% to it. If the interface depends on a third party whose latency fluctuates a lot, the TP95 response time can be used instead.
  • Retry count: if the service is highly important, retry three times by default; otherwise, do not retry. A minimal timeout-and-retry sketch follows this list.
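
A minimal sketch of the idea in Java (the URL and the specific numbers are placeholders, not our production values): a bounded per-request timeout plus a small, bounded retry count, so the worst case the caller sees is internal time + timeout × attempts.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutRetryExample {
    // Connect timeout for establishing the connection.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(300))
            .build();

    public static String callDependency(String url, int maxRetries) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(500))   // e.g. roughly TP99 + 50%, tuned per interface
                .GET()
                .build();
        Exception last = null;
        // Worst-case latency seen upstream: internal time + timeout * (maxRetries + 1).
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                return response.body();
            } catch (Exception e) {
                last = e;   // timed out or failed; retry only a bounded number of times
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Retry a few times for an important interface; pass 0 to disable retries.
        System.out.println(callDependency("https://example.com/api", 3));
    }
}
```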

1.4 Resolve slow queries

Slow queries degrade the application's response time and concurrency. As service traffic grows, the database server's CPU usage spikes; in serious cases, the database stops responding and has to be restarted. For more on slow queries, see our earlier technology blog post "MySQL Indexing Principles and Slow Query Optimization".

Solutions:

  • Divide queries into real-time, near-real-time, and offline queries. A query center built on Elasticsearch can take over the near-real-time and offline queries.
  • Read/write separation: writes go to the master database, reads go to the slave databases.
  • Index optimization: too many indexes hurt database write performance, and too few lead to slow queries. Our DBAs recommend no more than four indexes per table.
  • Avoid overly large tables: once a table reaches tens of millions of rows, MySQL query efficiency starts to drop sharply.

1.5 Circuit breaking

When a dependency becomes unavailable, the calling service should use technical means to provide a degraded (lossy) service upward, to keep the service flexibly available. Without circuit breaking, if one downstream service on the invocation link fails because of code logic problems, network problems, call timeouts, a traffic surge during a promotion, or insufficient capacity, other services at the access layer may become unavailable as well. Below is a fishbone-diagram analysis of what happens without circuit breaking:

Solutions:

  • Automatic circuit breaking: use Netflix's Hystrix or Meituan-Dianping's own Rhino to fail fast (see the sketch after this list).
  • Manual circuit breaking: if a downstream payment channel jitters or becomes unavailable, the channel can be closed manually.
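
For the automatic case, here is a minimal Hystrix sketch (the command name, the bank-call placeholder, and the fallback value are hypothetical): when the wrapped call fails, times out, or the circuit is open, the fallback returns quickly instead of letting the caller block.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

/** Wraps a call to a downstream payment channel so that failures fail fast with a fallback. */
public class PayChannelCommand extends HystrixCommand<String> {
    private final String orderId;

    public PayChannelCommand(String orderId) {
        super(HystrixCommandGroupKey.Factory.asKey("PayChannel"));
        this.orderId = orderId;
    }

    @Override
    protected String run() throws Exception {
        // Placeholder for the real downstream call (e.g. a bank interface).
        return callBank(orderId);
    }

    @Override
    protected String getFallback() {
        // Fast-failure path: returned when run() throws, times out, or the circuit is open.
        return "PAY_CHANNEL_UNAVAILABLE";
    }

    private String callBank(String orderId) {
        throw new RuntimeException("simulated downstream failure for " + orderId);
    }

    public static void main(String[] args) {
        // execute() runs the command in a Hystrix thread pool; the fallback is printed here.
        System.out.println(new PayChannelCommand("order-1").execute());
    }
}
```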

2. Keep failure frequency low: don't kill ourselves

Not killing ourselves comes down to two things: first, don't do reckless things; second, build the system so it doesn't die.

2.1 Don't do reckless things

On not doing reckless things, I have summarized the following seven points:

  1. Don't be an early guinea pig: use only mature technology, so that problems in the technology itself do not affect system stability.
  2. Keep responsibilities single: do not let responsibility coupling weaken or crowd out the ability to fulfill the most important responsibility.
  3. Standardize processes: reduce the impact of human factors.
  4. Automate processes: make the system run more efficiently and more safely.
  5. Keep redundant capacity: to cope with traffic surges (for example, users flooding in when a competing system is unavailable) and for disaster recovery, the system should have at least twice the required capacity.
  6. Refactor continuously: continuous refactoring is an effective way to keep the code healthy over the long term.
  7. Fix vulnerabilities promptly: Meituan-Dianping has a security vulnerability operations mechanism that reminds and urges every department to fix security vulnerabilities.

2.2 Don't die

As for not dying, there are five "undying" creatures on Earth: the tardigrade (water bear), which can shut down its metabolism in harsh environments; the rejuvenating jellyfish Turritopsis; clams, which rest inside their shells; planarians, which live in water, on land, and as parasites of almost everything; and rotifers, with their cryptobiotic abilities. What they have in common, translated into system design, is strong fault tolerance. "Fault tolerance" here means giving the system the ability to tolerate faults, that is, to keep completing the specified process even when a fault occurs. Note that what is tolerated is a Fault, not an Error: it is fault tolerance, not error tolerance.

3. Keep failure frequency low: don't get killed by others

3.1 Rate limiting

In an open network environment, externally facing systems often receive all kinds of intentional and unintentional malicious attacks, such as DDoS attacks and users frantically retrying failed requests. Although all our teammates are elites, we still have to guard against upstream mistakes; after all, nobody can guarantee that a colleague will never write code that retries indefinitely when a downstream response does not meet expectations. If these massive internal and external calls are left unprotected, they tend to spread to the backend services and can eventually bring down the underlying services. Below is a problem-tree analysis of the effects of not limiting traffic:

Solutions:

  • A reasonably accurate maximum QPS can be obtained by load-testing the service performance of a single server.
  • The three algorithms commonly used for flow control are token bucket, leaky bucket, and counter. They can be implemented with Guava's RateLimiter: SmoothBursty is based on the token bucket algorithm and SmoothWarmingUp is based on the leaky bucket algorithm (see the sketch after this list).
  • For core transaction, we use OCTO, Meituan-Dianping's service governance platform, to intercept Thrift traffic. It supports per-interface quotas, single-machine and cluster quotas, quotas for designated consumers, a test mode, and timely alarm notification. In test mode, the system only alarms without actually limiting traffic; with test mode off, exceptions are thrown once the limiting threshold is exceeded. Traffic-limiting policies can be disabled at any time.
  • Netflix's Hystrix or Meituan-Dianping's own Rhino can also be used for specific rate-limiting needs.
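
A minimal Guava RateLimiter sketch (the QPS value is an arbitrary placeholder, not a recommendation): requests that cannot obtain a permit are rejected quickly instead of piling up behind the limiter.

```java
import com.google.common.util.concurrent.RateLimiter;

public class RateLimitExample {
    // SmoothBursty limiter: roughly 100 permits (requests) per second, a value derived from load tests.
    private static final RateLimiter LIMITER = RateLimiter.create(100.0);

    public static String handleRequest(String orderId) {
        // tryAcquire() returns immediately; without a permit we fail fast instead of queueing.
        if (!LIMITER.tryAcquire()) {
            return "REJECTED_BY_RATE_LIMIT";
        }
        return "PROCESSED " + orderId;   // placeholder for the real business logic
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(handleRequest("order-" + i));
        }
    }
}
```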

4. Keep the fault scope small: isolation

Isolation means separating systems or resources so that, when a failure occurs, its propagation and impact are limited.

Physical server isolation principles

① Internal vs. external: treat the internal system differently from the externally open platform. ② Internal isolation: separate upstream and downstream services onto different physical servers, and merge low-traffic services. ③ External isolation: isolate by channel, so that channels do not affect each other.

Thread pool resource isolation

  • Hystrix uses the command pattern to encapsulate each type of business request as a corresponding command. Each command is assigned its own thread pool, and the thread pools are kept in a ConcurrentHashMap. Note: although the thread pool provides thread isolation, the client's underlying code must still set timeouts; it cannot block indefinitely, or the thread pool will stay saturated (see the sketch below).
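
A hedged sketch of per-command thread pool isolation with Hystrix (the keys and sizes are placeholders, not our production settings): each command group gets its own bounded thread pool, plus an execution timeout so threads are not held forever.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

/** Each downstream dependency is wrapped in a command with its own small, bounded thread pool. */
public class IsolatedQueryCommand extends HystrixCommand<String> {

    public IsolatedQueryCommand() {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("OrderQuery"))
                // A dedicated thread pool for this dependency, so saturation stays local.
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("OrderQueryPool"))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(10)            // placeholder size, tuned by load testing
                        .withMaxQueueSize(20))
                // Timeout so a hung dependency cannot hold a worker thread forever.
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(500)));
    }

    @Override
    protected String run() {
        return "ok";        // placeholder for the real downstream call
    }

    @Override
    protected String getFallback() {
        return "degraded";  // returned on failure, timeout, rejection, or open circuit
    }
}
```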

Semaphore resource isolation

  • Developers can use Hystrix to limit the maximum number of concurrent requests to a dependency, which is essentially a rate-limiting policy. Each time the dependency is called, Hystrix checks whether the semaphore limit has been reached and, if so, rejects the call (see the sketch below).
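
A hedged sketch of semaphore isolation with Hystrix (the limit is a placeholder): the call executes on the caller's thread, and concurrency beyond the semaphore limit is rejected.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixCommandProperties.ExecutionIsolationStrategy;

/** Semaphore-isolated command: runs on the calling thread, capped at a maximum concurrency. */
public class SemaphoreIsolatedCommand extends HystrixCommand<String> {

    public SemaphoreIsolatedCommand() {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("LocalLookup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE)
                        // Reject calls once 10 are already in flight (placeholder value).
                        .withExecutionIsolationSemaphoreMaxConcurrentRequests(10)));
    }

    @Override
    protected String run() {
        return "ok";                 // placeholder for a cheap in-process or middleware call
    }

    @Override
    protected String getFallback() {
        return "rejected-or-failed"; // returned when the semaphore is exhausted or run() throws
    }
}
```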

5. Recover from faults quickly: fast discovery

Discovery can be divided into pre-incident, in-incident, and post-incident discovery. The main pre-incident means are load testing and failure drills; the main in-incident means is monitoring and alarms; the main post-incident means is data analysis.

5.1 Full-link online pressure testing

Is your system suitable for full-link online pressure testing? In general, it applies to the following scenarios:

① For systems with long links, many nodes, and complex service dependencies, full-link online pressure testing can locate problems faster and more accurately. ② Monitoring and alarms are complete, so the test can be stopped at any time if a problem appears. ③ There are obvious traffic peaks and troughs; even if something goes wrong during a trough, the impact on users is relatively small.

The main purposes of full-link online pressure testing are:

to check the processing capacity of the entire system, find performance bottlenecks, verify whether the rate limiting, degradation, circuit breaking, and alarm mechanisms meet expectations, and analyze the resulting data to adjust thresholds.

A simple implementation of full-link pressure testing:

① Collect online log data for traffic replay; to isolate the replayed traffic from real data, offset certain fields. ② Color the data: traffic tags can be fetched and passed along by the middleware. ③ Shadow tables can be used to isolate the test traffic, but watch disk space; if the remaining disk space is less than 70%, use another method to isolate the traffic. ④ External calls may need to be mocked; one implementation is a Mock service that randomly generates delays following the online latency distribution of the external calls. As for pressure testing tools, core transaction uses pTest, developed in-house at Meituan-Dianping.
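
As a small illustration of points ② and ③, here is a hedged Java sketch (the tag mechanism and table names are hypothetical and are not how pTest actually works): a traffic tag carried with the request decides whether writes go to the real table or a shadow table.

```java
/** Hypothetical traffic-coloring helper; not the actual pTest mechanism. */
public final class PressureTestContext {
    // In practice the tag would be propagated by RPC/HTTP middleware; a ThreadLocal is just for the sketch.
    private static final ThreadLocal<Boolean> PRESSURE_TAG = ThreadLocal.withInitial(() -> false);

    public static void markPressureTraffic() { PRESSURE_TAG.set(true); }
    public static void clear() { PRESSURE_TAG.remove(); }
    public static boolean isPressureTraffic() { return PRESSURE_TAG.get(); }

    /** Route colored traffic to a shadow table so test data never mixes with real orders. */
    public static String resolveTableName(String baseTable) {
        return isPressureTraffic() ? baseTable + "_shadow" : baseTable;
    }

    public static void main(String[] args) {
        System.out.println(resolveTableName("t_order"));   // t_order
        markPressureTraffic();
        System.out.println(resolveTableName("t_order"));   // t_order_shadow
        clear();
    }
}
```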

6. Recover from faults quickly: fast location

Locating a problem requires reliable data. "Reliable" means closely related to the problem at hand; irrelevant data creates blind spots and slows down locating. So for logging, establish a concise logging specification. In addition, system monitoring, service monitoring, component monitoring, and real-time analysis and diagnosis tools are all effective aids for locating problems.

7. Recover from faults quickly: fast resolution

To resolve a fault, the prerequisites are discovery and location. How fast it is resolved also depends on whether the handling is automated, semi-automated, or manual. For core transaction we intend to build a highly available system. Our slogan is: "Don't reinvent the wheel; use the wheels well." We want an integrated platform whose mission is to "focus on the high availability of core transaction, making it better, faster, and more efficient."

Meituan-Dianping already offers many systems and platforms for discovery, location, and handling, but if we had to open their links or log in to them one by one, it would inevitably slow down resolution. So we integrate them, so that problems can be solved in a one-stop fashion. An example of the desired effect:

Tool introduction

Hystrix

Hystrix implements the circuit breaker pattern to monitor failures. When the circuit breaker detects that calls to an interface are experiencing long waits, it fails fast and returns an error response upward instead of blocking. Here we focus on Hystrix's thread pool resource isolation and semaphore resource isolation.

Thread pool resource isolation

Advantages

  • Threads give complete isolation from third-party code; the requesting thread can be returned quickly.
  • When a failed dependency becomes healthy again, its thread pool clears and becomes available immediately, rather than needing a long recovery.
  • Fully supports asynchronous invocation, making asynchronous programming convenient.

Disadvantages

  • The main disadvantage of thread pools is the added CPU cost: each command execution involves queuing (avoided by default by using SynchronousQueue), scheduling, and context switching.
  • It adds complexity to code that relies on thread state, for example ThreadLocal, which must be passed and cleaned up manually. (Netflix considers the overhead of thread isolation small enough to have no significant cost or performance impact.)

Semaphore resource isolation

Developers can use Hystrix to limit the maximum number of concurrent requests to a dependency; this is essentially a rate-limiting policy. Each call to the dependency checks whether the semaphore limit has been reached and is rejected if so.

Advantages

  • No new thread is created to execute the command, which reduces context switching.

Disadvantages

  • Circuit breaking cannot be configured; every call will still attempt to acquire the semaphore.

Compare thread pool resource isolation with semaphore resource isolation

  • With thread isolation, the dependency runs on a separate thread unrelated to the main thread; with semaphore isolation, it runs on the same thread as the main request.
  • Semaphore isolation can also limit concurrency and prevent blocking from spreading; the main difference from thread isolation is that the thread executing the dependent code is still the requesting thread.
  • Thread pool isolation suits third-party applications or interfaces and high-concurrency isolation; semaphore isolation suits internal applications or middleware, and scenarios where the concurrency requirement is not high.

Rhino

Rhino is a stability assurance component developed and maintained by the Meituan-Dianping infrastructure team, providing fault simulation, degradation drills, service circuit breaking, service rate limiting, and other functions. Compared with Hystrix:

  • It uses CAT (Meituan-Dianping's open-source monitoring system; see the earlier blog post "In-depth Analysis of the Open-Source Distributed Monitoring System CAT") for a series of instrumentation points, making abnormal-service alarms convenient.
  • It is connected to the configuration center, so parameters such as forced circuit breaking and failure-rate thresholds can be modified dynamically.

Concluding thoughts

In "Poetic Remarks in the Human World" (Renjian Cihua), Wang Guowei wrote about his scholarly experience: those who achieve great undertakings or great learning, past and present, must pass through three realms.

The first realm

Last night the west wind withered the green trees. Alone I climbed the high tower, gazing to the end of the road at the edge of the world.

The second realm

My clothes hang ever looser, yet I have no regrets; for her I am willing to waste away.

The third realm

Having searched for him a thousand times in the crowd, I suddenly turn my head, and there he is, where the lights are dim.

The high availability of core transaction is currently in the first realm: gazing far ahead to see the roads others have taken, and starting by learning from their experience.

In the next phase, now that the goal is clear, we will keep working hard on high availability. Eventually, after doing many of these things, we will look back with a clearer and deeper understanding of high availability. Stay tuned for our next post.

About the author

Xiaojing graduated from the Computer Science department of Northeastern University at the age of 20. At her first company after graduation, thanks to an outstanding talent for languages, she learned Japanese from scratch within one year, passed Level 1 of the Japanese Language Proficiency Test with high scores, and worked as a Japanese translator for two years. She later joined Renren and became an Internet developer. She is a psychology graduate student at the Chinese Academy of Sciences, holds nearly 100 technology invention patents, and has been a partner in a start-up company. She has technical support experience in Tokyo and Silicon Valley. She is currently a technical expert at Meituan-Dianping, responsible for core transaction. (Welcome to follow Jing'er's personal technology public account: Programming Life.)

Recruiting

The core transaction team of Meituan Finance is recruiting interns. A rapidly growing business needs a rapidly growing team. As a core department, we urgently need people who believe that technology changes the world! If you are interested, please follow my personal technology public account and leave a message.

If you are interested in our team, you can also follow our column.