About the author

Ang Xu is Senior Director of the Technical Operations Department and the Risk Control Management Department at Ele.me. He specializes in lean operations and fine-grained risk control. By cooperating with other teams across the company, he drives the informatization, standardization, and servitization of operations, gradually moving toward automated operations, automated delivery, and data visualization, so as to keep the system stable at low cost. Through data and rule fitting, product design, manual auditing, and the construction of a risk control platform, he works to ensure that every yuan of subsidy goes toward the company's stated goals.

Ele.me is not just a food-delivery business: it also runs Hummingbird (its delivery network), a breakfast service, and Restaurant of the Future, among other platforms, and it is expanding rapidly. The takeout product chain is long: roughly 30 minutes pass between placing an order and final delivery, which puts a strong demand on timeliness.

From a technology point of view, the biggest challenge for Ele.me is incidents. This article centers on incidents and is divided into two parts: technical operations experience and broader operational lessons. The first part covers three stages: fine-grained division of labor, stability (capacity and change), and efficiency. The second part is the author's understanding of operations as a service.

I. Technical operations experience

The responsibility of technical operations is to cooperate with as many people as possible to keep the system stable. This can be split into two stages, operations support and operations as a service, and Ele.me is currently in the service stage. The technical operations team, acting as Party B (the service provider), is responsible for the products built by the development teams and the services they have tested: maintenance, stability, performance optimization, and resource utilization.

What does a technical team need to do in a rapidly expanding business?


The first stage: fine-grained division of labor

Fine-grained division of labor speeds up parallel work: letting professionals apply their professional knowledge is the most effective way to improve working efficiency and code throughput, and establishing communication channels speeds up decision-making and keeps information flowing, which safeguards stability.

The fine-grained division of labor covers three parts:

The first part is database splitting and code decoupling. The technical work focused on splitting the database: vertical splits first, with horizontal splits done only where necessary, so that the business could expand faster, together with part of the code-decoupling work.

Code decoupling means treating the original code base as one big ball of mud and gradually breaking it into many pieces. There are now more than ten business modules, each maintained by a dedicated team, and domains are divided further within each module.

At Ele.me, the database split and the code split proceeded in parallel. Teams were then required to move onto the new release system and to single-instance, single-application deployment, that is, physical separation.

Throughout the process of code decoupling and fine-grained division of labor, the team ran into many problems. Two typical incidents were:

Incident 1: Timeouts against a slow back-end service set off a chain reaction that led to an avalanche of front-end services.

A user request's latency depends on the response time of every service on the RPC call path. When one node slows down and the whole cluster becomes unavailable, the usual emergency response is to stop services starting from the front of the call chain and restart them from the back.

When such problems occur without a circuit breaker, the front-end services avalanche because of the dependency and cannot recover on their own. After a circuit-breaker mechanism was added, the front-end services recover by themselves once the faulty back-end node is restarted or the network jitter subsides.
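The article does not show Ele.me's actual implementation; as a rough illustration of the mechanism, here is a minimal circuit-breaker sketch in Python. The class, thresholds, and the rpc_client call mentioned in the comment are all hypothetical.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then allow a single probe call after a cool-down period."""

    def __init__(self, max_failures=5, reset_timeout=10.0):
        self.max_failures = max_failures     # failures before opening
        self.reset_timeout = reset_timeout   # seconds to stay open
        self.failures = 0
        self.opened_at = None                # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        # While open, fail fast instead of piling requests onto a slow back end.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: let one probe call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        else:
            # A success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return result

# Hypothetical usage:
# breaker = CircuitBreaker()
# order = breaker.call(rpc_client.get_order, order_id)
```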

Incident 2: For three consecutive days merchants had to keep retrying before they could receive orders, which turned out to be a Redis governance problem.

When a switch bug caused network jitter, Redis was hit hardest. During the jitter, the backlog of requests opened far too many Redis connections, which pushed Redis response latency from 1 ms up to 300 ms. Service processing was slowed by the slow Redis requests while external requests kept piling up, causing an avalanche.

At the start of the failure, Zabbix's collection interval was too long for the O&M engineers to see what was happening; it took them three days of reproducing the problem under load before they could locate the fault. The O&M engineers later built a new infrastructure monitoring tool, which collects all the indicators under /proc every 10 seconds and can locate such problems in about three minutes.
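The monitoring tool itself is not described beyond its 10-second collection of /proc indicators; the sketch below only illustrates that idea, with a hand-picked subset of /proc files and a print statement standing in for real metric parsing and storage.

```python
import time

# Illustrative subset of /proc files; the real tool reportedly collects
# all indicators under /proc every 10 seconds.
PROC_FILES = ["/proc/loadavg", "/proc/meminfo", "/proc/net/dev", "/proc/net/snmp"]

def snapshot():
    """Read each /proc file and return its raw contents keyed by path."""
    data = {}
    for path in PROC_FILES:
        try:
            with open(path) as f:
                data[path] = f.read()
        except OSError:
            data[path] = None   # a file may not exist on every kernel
    return data

def run(interval=10):
    while True:
        metrics = snapshot()
        # In a real system these would be parsed and shipped to storage;
        # here we just print the 1-minute load average as an example.
        loadavg = metrics["/proc/loadavg"]
        if loadavg:
            print(time.strftime("%H:%M:%S"), "load1 =", loadavg.split()[0])
        time.sleep(interval)

if __name__ == "__main__":
    run()
```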

In addition, packet loss and retransmission also seriously hurt Redis performance: a single front-end HTTP request may fan out into dozens or even hundreds of Redis requests on the back end, and if just one of them hits a retransmission and retry, the impact on the service is fatal.
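One common way to keep a request backlog from opening ever more Redis connections, in the spirit of the governance described above, is a bounded blocking connection pool with short timeouts. Below is a sketch using the redis-py client; the host, limits, timeouts, and the merchant-status lookup are illustrative, not Ele.me's actual configuration.

```python
import redis

# Cap the number of connections so a backlog of requests queues inside the
# application instead of opening thousands of sockets against Redis.
pool = redis.BlockingConnectionPool(
    host="127.0.0.1",
    port=6379,
    max_connections=50,        # hard ceiling per process (illustrative)
    timeout=0.2,               # seconds to wait for a free connection
    socket_connect_timeout=0.1,
    socket_timeout=0.05,       # fail fast instead of hanging on slow replies
)

client = redis.Redis(connection_pool=pool)

def get_merchant_status(merchant_id):
    """Hypothetical lookup: fall back to a safe default on Redis trouble
    rather than retrying, since one retry multiplied across dozens of
    Redis calls per HTTP request can be fatal."""
    try:
        return client.get("merchant:%s:status" % merchant_id)
    except redis.RedisError:
        return None
```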

The second part of fine-grained division of labor is organizing the teams: big data, for example, is a horizontal team, while each business line is a vertical team. After the division, the company's business growth curve remained very steep, showing that technology did not hold back the rapid development of the business; in other words, technology throughput and the efficiency of new product development stayed healthy.

During this period the O&M engineers also did several things: splitting monitoring into Metric, Log, Trace, and infrastructure monitoring; setting up an NOC team responsible for emergency response, which notifies everyone through Oncall as soon as a problem is found; and all kinds of cleanup work, access and release control, SOA, and the development of downgrade and circuit-breaker mechanisms.

Spring cleaning

What does this cleanup mean? After analyzing historical incidents, the engineers write technical summaries and turn the mistakes they commonly make into actionable procedures, which are then publicized to the backbone members of each department. The specific contents include:

  • Service governance for SOA

    The main emphasis here is on domain division, high cohesion and low coupling.

  • Governance of common components.

    Databases and Redis are covered by two professional teams, one DA and one DBA. DA governance mainly means collecting information from the various business partners, planning capacity, managing how developers use the components, and solidifying that experience into the R&D process.

  • Sorting out business indicators

    This includes defining TPS concepts (based on state transitions rather than on return status), the dwell time of each state, and the backlog depth of each state; it mainly concerns the state transitions of some back-end services.

  • Reasonable settings for the timeout chain and retry mechanism (see the sketch after this list).

  • External dependencies and switches.

    Why the emphasis on external dependencies? They fall into two categories. One is cooperation with other companies, such as calling another company's payment interface. The other is dependencies between teams: do not trust anyone's services, because bugs can happen at any time.

  • Critical path.

    Why define a critical path? One reason is circuit breaking, the other is downgrading: when a non-critical path has a problem, just drop it so that it does not affect the critical path. Another benefit is that later compensation work can be precisely targeted.

  • Logging.

    The team has also had many log-related incidents, which are best explained case by case.

  • Setting goals for blind drills and achieving them.

    The code interactions among eight or nine hundred technical engineers form a complex system in themselves: the business chain is very long and the critical path involves more than 100 services. Simple functional tests are fine, but under heavy load it is hard to locate problems between services, such as coupling between code owned by team A and acceptance owned by team B. The solution the team came up with was blind drills.

    Blind drills can be used for acceptance not only on the business side but also on the infrastructure, including the Redis clusters, the MySQL clusters, and the network. In one test, packet loss of 1% was imposed on a single Redis instance, and the whole site's business hit bottom. At that time the Redis cluster had 12 machines and hundreds of instances; a problem on a single instance had that much impact. Through blind drills, the team keeps looking for solutions that minimize the impact of a single node failure.
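Returning to the timeout-chain and retry item above: the article does not spell out the rules, so the sketch below is only one way to express the idea, passing a shrinking deadline down the call chain so that an inner call never outlives its caller's budget and retries happen only while budget remains. All names and numbers are hypothetical.

```python
import time

class Deadline:
    """Carries the remaining time budget down an RPC call chain."""

    def __init__(self, total_seconds):
        self.expires_at = time.monotonic() + total_seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

def call_with_budget(func, deadline, max_retries=2, min_budget=0.05):
    """Retry func only while the overall deadline still has budget left,
    so retries in a lower layer cannot blow the caller's timeout."""
    last_error = None
    for _ in range(max_retries + 1):
        budget = deadline.remaining()
        if budget < min_budget:
            break                        # not enough time left; give up early
        try:
            return func(timeout=budget)  # inner call never exceeds the chain budget
        except TimeoutError as exc:
            last_error = exc
    raise last_error or TimeoutError("deadline exhausted")

# Hypothetical usage: the front end gives the whole chain one second.
# deadline = Deadline(1.0)
# order = call_with_budget(order_service.get_order, deadline)
```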


The second stage: stability. Enemy number one is capacity.

In the phase of rapid business expansion, the biggest enemy of system stability is capacity; the danger builds up like a frog in slowly boiling water, or strikes like a sudden avalanche. Because different languages determine capacity in different ways, because Ele.me's complex system is composed of more than 1,000 services, and because business scenarios shift rapidly and services change frequently, the capacity problem dragged on for nearly a year.

In the end, the team adopted regular online full-link load testing. A campaign involving 100 people ran for more than a month and fixed nearly 200 hidden risk points, which basically solved the capacity problem. Full-link load tests are also run during off-peak hours, and they can be combined with the load tests teams run before going live, with the data then consolidated for analysis.
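Ele.me's full-link load-testing platform is not described here; purely as a sketch of the basic idea, the snippet below drives concurrent requests at a single endpoint and reports error counts and latency percentiles, the kind of data that can later be merged with pre-launch test results. The URL, concurrency, and request count are made up.

```python
import time
import statistics
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://staging.example.com/api/v1/order"  # hypothetical endpoint
CONCURRENCY = 50
REQUESTS = 1000

def one_request(_):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET, timeout=2) as resp:
            resp.read()
        ok = True
    except Exception:
        ok = False
    return ok, time.monotonic() - start

def main():
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(one_request, range(REQUESTS)))
    latencies = sorted(lat for ok, lat in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    print("errors:", errors)
    if latencies:
        print("p50: %.0f ms" % (statistics.median(latencies) * 1000))
        print("p99: %.0f ms" % (latencies[int(len(latencies) * 0.99) - 1] * 1000))

if __name__ == "__main__":
    main()
```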

A major incident: the 517 flash sale

In the run-up to the 517 flash-sale promotion, the plan was to handle it with the clusters that serve daily traffic, with overall capacity more than doubled before the event. But orders soared that day: in the first few seconds after the flash sale began, instantaneous concurrent requests were 50 times the usual level. When the traffic peak arrived, the flood of requests jammed the front-end Nginx layer outright.

In hindsight, the causes were a lack of experience with flash-sale scenarios, an underestimate of the flood peak the campaign would bring, and the absence of prioritized rate limiting by URL.

The improvement was to build a dedicated system for flash sales, focused mainly on layered protection, user-side caching, swim lanes, separate clusters, and contention caching.
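One of the lessons above was the lack of prioritized rate limiting by URL. As a rough sketch (the limits and URL classes are invented, not Ele.me's), a token bucket per URL class lets the system shed low-priority traffic first when a flood peak arrives:

```python
import time

class TokenBucket:
    """Simple token bucket: allow() returns False once the rate is exceeded."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical priority classes: critical ordering/payment URLs get a much
# larger budget than browsing or marketing URLs during a flash sale.
LIMITS = {
    "critical": TokenBucket(rate=2000, burst=4000),   # e.g. /order, /pay
    "normal":   TokenBucket(rate=500,  burst=500),    # e.g. /shop, /search
    "low":      TokenBucket(rate=50,   burst=50),     # e.g. /activity, /banner
}

def admit(url_class):
    """Return True if the request should be served, False to reject it early."""
    bucket = LIMITS.get(url_class, LIMITS["low"])
    return bucket.allow()
```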


The third stage: efficiency. Improving efficiency through tools, resources, and architecture transformation.

Incident 1: Hummingbird delivery suffered incidents of all kinds for two consecutive weeks

The causes were RMQ message buildup, UDP handle exhaustion, and incorrect use of the circuit breaker. This shows that during the fast delivery of new business, code quality and the way external components are used are the main incident risk points.

Incident 2: MySQL

Slow SQL queries have dropped from two or three per week to almost none recently. The solution is component governance. It starts with turning the team's own resources and capacity into services; second comes rate limiting and downgrading; third, constraining the ways developers are allowed to use the component.
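The article does not say how slow queries are tracked; as a minimal sketch of the "constrain the usage" idea, a thin wrapper around any DB-API cursor can log every statement that exceeds a threshold. The threshold and logger name here are arbitrary.

```python
import time
import logging

logger = logging.getLogger("slow_sql")
SLOW_THRESHOLD = 0.5   # seconds; anything slower gets logged (illustrative)

def timed_execute(cursor, sql, params=None):
    """Run a query through any DB-API 2.0 cursor and log it if it is slow."""
    start = time.monotonic()
    try:
        return cursor.execute(sql, params or ())
    finally:
        elapsed = time.monotonic() - start
        if elapsed > SLOW_THRESHOLD:
            logger.warning("slow query (%.2fs): %s", elapsed, sql)
```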

With those three things done, the team moves on to automation, mainly informatization, standardization, and orchestration. Another leading indicator is a KPI for components: when a component is first adopted, some quantitative evaluation has to be done. With all of this in place, major failures can basically be prevented.

Governing usage patterns brings the greatest benefit to stability. A few key points:

  • It is necessary to have someone who is proficient in the component, has read the source code and knows every pitfall discussed in the community, but who also goes to the front line of business development, understands the business scenarios, and makes an initial judgment on how the component should be used in those scenarios.

  • Engineers carry out knowledge transfer, using every channel to pass on standardization, development norms, clustering, and correct usage patterns.

  • Solidify experience and red lines into the resource application process, the architecture review process, and tools as soon as possible.

Incident 3: RMQ

RMQ is used in a wide variety of scenarios, from both Python and Java. At the beginning of 2016, although the engineers reviewed the technology and its configuration, there were still many unexpected scenarios, mainly involving the following problems:

  • Network partition. This came up during a cutover in which the core switches were being upgraded: some RMQ clusters were configured to heal themselves after the network cutover, but many clusters were not.

    The team therefore keeps a dedicated cold-standby RMQ cluster, with the configuration of every production cluster deployed onto it. If any of the more than 20 RMQ clusters in production goes down, traffic can be switched over in time.

  • Queue congestion. This is mainly about tracking consumption capacity: when business surges and consumers cannot keep up, queues back up very easily.

  • Usage patterns. For example, if the connection or queue is rebuilt for every message sent or received, the rebuilds trigger RMQ's internal Event mechanism. Once requests grow beyond a certain level, RMQ throughput is directly affected and capacity drops to one tenth of its original level (see the sketch below).
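Assuming the cluster is RabbitMQ and the producer is written in Python with the pika client (neither is stated explicitly above), the usage-pattern problem roughly corresponds to the difference between the two publishers sketched below: reuse one connection and channel instead of rebuilding them for every message. The broker address and queue name are illustrative.

```python
import pika

PARAMS = pika.ConnectionParameters(host="127.0.0.1")  # illustrative broker address

def publish_badly(messages):
    """Anti-pattern: a new connection, channel, and queue declaration per message.
    Under load this churns broker-side resources and collapses throughput."""
    for body in messages:
        conn = pika.BlockingConnection(PARAMS)
        channel = conn.channel()
        channel.queue_declare(queue="orders", durable=True)
        channel.basic_publish(exchange="", routing_key="orders", body=body)
        conn.close()

def publish_well(messages):
    """Reuse one connection and channel for the whole batch."""
    conn = pika.BlockingConnection(PARAMS)
    channel = conn.channel()
    channel.queue_declare(queue="orders", durable=True)
    for body in messages:
        channel.basic_publish(exchange="", routing_key="orders", body=body)
    conn.close()
```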

Long-standing difficulties: Fault location and recovery efficiency

The main reason fault location is slow is that Ele.me's system produces too much information; when a problem occurs, the engineer leading the investigation is handed far too much of it.

The current approach is a divide-and-conquer, carpet-style sweep. What does that mean? Gather enough information first, divide it up, and require every engineer involved to look at their own part. It covers takeout, merchants, payment, and logistics, plus basic business and network monitoring, Internet-facing traffic, server load, and so on.

At this point, the orderly self-verification of the technical engineers becomes very important. What can be done at present is for everyone to check whether the service they are responsible for has a problem. Tools also need to be provided, for example for detecting switch packet loss or server packet loss, so that engineers can find problems quickly, although this process takes time.

Also, during self-verification, engineers must check carefully. Each technical engineer is responsible for their own section, and if a fault slips through because of personal negligence or inadequate self-inspection, they have to take the blame themselves. Once a fault is located, the focus shifts to improving recovery efficiency and fixing it.

Emergency drills are also important; they are directly tied to how efficiently the system recovers. When a cluster goes wrong, the team can then restore it quickly.

II. Operational experience

No incident is accidental. Many problems can be avoided through correct usage patterns, advance capacity estimation, gray releases, and other methods. If the team only solves the single issue at hand, case by case, an incident will often recur at some other point in time.

This requires engineers to work in a systematic way, through incident reviews, incident-report reviews, and acceptance teams. Then, at each stage, the key points involved in an incident are raised again and again and gradually distilled into workable operating procedures.

Solving problems often requires a shift in mindset about how to carve time out of the important and urgent work of the day.

And dare to tinker. What does tinkering mean? Engineers should be so familiar with the systems they maintain that they can locate and fix faults with great precision.

Finally, there is the problem of blind spots under the lamp, especially in infrastructure. This used to be a headache: checking a single infrastructure problem took anywhere from ten minutes to an hour. Later, a colleague changed his way of thinking and built a system that solved this big problem for the team very well. The Ele.me technical team dares to think and dares to try.

This article is excerpted from CTO Speak. Source: the 51CTO Technology Stack subscription account.
