Author | Liang Yong

Background

Hello has evolved into a comprehensive mobile travel platform covering two-wheel mobility (Hello bike, Hello moped, Hello electric bike) and four-wheel mobility (Hello ride-hailing, network-wide ride-hailing, Hello taxi), and is also exploring local life services such as hotels and in-store group buying. As the company’s business keeps growing, so does its traffic. We have found that serious production incidents are often caused by sudden traffic surges, so controlling and protecting traffic to keep systems highly available is especially important. This article shares Hello’s experience in governing message traffic and microservice invocations.

About the author

Liang Yong (Lao Liang), co-author of the RocketMQ Field and Progress column and a reviewer of RocketMQ Technology Insider; speaker at the ArchSummit Global Architect Conference and the QCon Case Study Club.

He currently focuses on back-end middleware and has published more than 100 source-code practice articles on the WeChat public account [Guannong Lao Liang], covering the RocketMQ, Kafka, gRPC, Nacos, Sentinel, and Java NIO series. He is currently a senior technical expert at Hello Travel.

Talk about governance

Before we start, let’s talk about governance. The following is Lao Liang’s personal understanding:

What is governance for?

  • Making our environment better

How do we know what is not good enough?

  • Past experience
  • User feedback
  • Comparison

How do we know whether it stays good?

  • Monitoring and tracking
  • Alerting and notification

How do we make it better when things go wrong?

  • Control measures
  • Emergency plans

Contents

  1. Build a distributed message governance platform
  2. Pitfalls and solutions from RocketMQ in practice
  3. Build a high availability management platform for micro services

Background

RabbitMQ running naked

The company previously used RabbitMQ. The following are the pain points we had with it, many of which were caused by flow control being triggered in the RabbitMQ cluster.

  • The backlog is too large; should we clear it or not? That is a question, let me think about it.
  • An excessive backlog triggered flow control in the cluster? That really hurts the business.
  • You want to consume the data from the last two days? Please have it sent again.
  • You want to know which services are consuming this topic? You will have to wait a little longer; let me go collect the IPs.
  • Are there risky usages such as big messages? Let me guess.

Services running naked

There was an incident in which multiple businesses shared one database, and that database crashed during an evening rush hour.

  • Upgrading the single database to the maximum configuration still did not solve the problem
  • After a restart it recovered for a while, then soon hung again
  • The cycle repeated; we suffered through it and silently waited for the peak to pass

Reflection: both messages and services need sound governance

Build a distributed message governance platform

Design guidelines

Which are our key metrics and which are our secondary metrics is the primary question of message governance.

The design goal is to mask the complexity of the underlying middleware (RocketMQ/Kafka) and to route messages dynamically through unique identifiers. On top of that, we build a message governance platform that integrates resource management and control, retrieval, monitoring, alerting, inspection, disaster recovery, and visual operations, to ensure that the message middleware runs smoothly and healthily.

Points to consider when designing a messaging governance platform

  • Provide an easy-to-use API
  • Which key metrics tell us whether client usage carries hidden risks
  • Which key indicators measure the health of a cluster
  • Which common user and operations tasks should be visualized
  • Which measures can be taken against these unhealthy conditions

As easy to use as possible

Design guidelines

Making a complex problem simple is real skill.

Minimalist unified API: we provide a single, unified SDK that encapsulates both messaging middlewares (Kafka/RocketMQ).
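
As an illustration of what such a facade might look like, here is a minimal sketch; the interface and type names are hypothetical, not Hello's actual SDK.

```java
import java.util.Map;

/**
 * Minimal sketch of a unified messaging facade (hypothetical names, not Hello's actual SDK).
 * Business code only sees a logical topic; the SDK resolves which backing middleware
 * (RocketMQ or Kafka) and which cluster the topic's unique identifier routes to.
 */
public interface UnifiedProducer {

    /** Send result, independent of the underlying middleware. */
    final class SendResult {
        public final boolean success;
        public final String msgId;

        public SendResult(boolean success, String msgId) {
            this.success = success;
            this.msgId = msgId;
        }
    }

    SendResult send(String topic, byte[] body, Map<String, String> headers);
}
```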

Applying for resources

Automatically creating topics and consumer groups is not suitable for the production environment: it leads to loss of control and is detrimental to lifecycle management and cluster stability. The application process needs to be controlled, but kept as simple as possible, for example: one application takes effect in all environments at once and generates the associated alarm rules.
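
A sketch of what such an application form might carry, purely for illustration (the field names and thresholds are assumptions):

```java
/**
 * Hypothetical shape of a topic/consumer-group application form; the field names
 * are illustrative. One submission takes effect in every environment and also
 * seeds the default alarm rules for the new resource.
 */
public class TopicApplication {
    public String appId;          // owning application
    public String topic;          // logical topic name
    public String consumerGroup;  // consumer group, if the app also consumes
    public long expectedTps;      // estimated peak send rate
    public int maxMessageKb;      // expected message size; should stay under 10 KB
    public String owner;          // person to notify when alarms fire
}
```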

Client governance

Design guidelines

Monitor how clients use the middleware and take appropriate measures when problems are found

Scenario replay

Scenario 1: Instantaneous traffic and cluster flow control

If the cluster's current TPS is 10,000 and it suddenly jumps to 20,000 or more, such an excessive and sudden traffic surge is very likely to trigger cluster flow control. In this scenario we monitor the client's sending speed; once the speed and the steep-increase threshold are reached, the sending rate is smoothed out.
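
One way to smooth such a spike on the client side is a warm-up rate limiter. The sketch below uses Guava's RateLimiter as an illustration (our choice; the article does not prescribe a specific library), with illustrative numbers:

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.TimeUnit;

/**
 * Minimal client-side smoothing sketch using Guava's warm-up RateLimiter;
 * the rate and warm-up period below are illustrative values.
 */
public class SmoothedSender {

    // Cap this client at 10,000 msg/s; a sudden burst ramps up over 3 seconds
    // instead of hitting the cluster all at once.
    private final RateLimiter limiter = RateLimiter.create(10_000, 3, TimeUnit.SECONDS);

    public void send(Runnable actualSend) {
        limiter.acquire();   // blocks just long enough to flatten the spike
        actualSend.run();
    }
}
```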

Scenario 2: Large messages and cluster jitter

When a client sends large messages, for example several hundred KB or even several MB each, disk I/O can take a long time and the cluster may jitter. To govern message size in this scenario, we identify the services sending large messages through after-the-fact inspection and push their owners to compress or restructure the messages so that they stay within 10 KB.
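
A pre-send guard can also catch this on the client side. The sketch below checks the payload against the 10 KB figure from the text and gzips oversized bodies; the reporting hook is hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

/**
 * Sketch of a client-side size guard: flag oversized payloads and compress them.
 * The 10 KB threshold comes from the article; the reporting hook is hypothetical.
 */
public class MessageSizeGuard {

    private static final int WARN_BYTES = 10 * 1024;

    public static byte[] prepare(byte[] body) throws IOException {
        if (body.length <= WARN_BYTES) {
            return body;
        }
        // metrics.report("oversized_message", body.length);  // hypothetical hook for later inspection
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(body);   // shrink the payload before it reaches the broker
        }
        return bos.toByteArray();
    }
}
```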

Scenario 3: Client version that is too old

As functionality iterates, SDK versions keep being released, and the changes can introduce risks as well as features. Using a version that is too old means, first, that new functionality is not supported and, second, that there may be security risks. To know how the SDK is being used, clients report their SDK version, and inspection is then used to push users to upgrade.
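
Reporting can be as simple as pushing the version string at client startup; the sketch below is illustrative, and the reporting sink is hypothetical:

```java
/**
 * Sketch of reporting the SDK version when the client starts, so inspection can
 * later find applications stuck on risky old versions. The version constant and
 * the reporting sink are hypothetical.
 */
public class SdkVersionReporter {

    public static final String SDK_VERSION = "1.4.2";   // illustrative version number

    public static void reportOnStartup(String appId) {
        // In a real SDK this would be pushed to the governance platform's metadata store.
        System.out.printf("app=%s sdkVersion=%s%n", appId, SDK_VERSION);
    }
}
```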

Scenario 4: Consumption traffic removal and recovery

Consumption traffic removal and recovery are generally needed in the following situations: first, traffic needs to be removed before an application release; second, traffic needs to be removed while troubleshooting a problem. To support this, the client needs to listen for removal/recovery events and pause or resume consumption accordingly.
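
With RocketMQ's push consumer, pausing and resuming can be built on the client's suspend/resume calls; the event wiring below is hypothetical:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;

/**
 * Sketch of reacting to remove/recover events pushed by the governance platform.
 * The event-listener wiring is hypothetical; suspend() and resume() are the
 * RocketMQ push-consumer calls that pause and resume message pulling.
 */
public class ConsumerTrafficSwitch {

    private final DefaultMQPushConsumer consumer;

    public ConsumerTrafficSwitch(DefaultMQPushConsumer consumer) {
        this.consumer = consumer;
    }

    /** Called when the platform pushes a "remove traffic" event, e.g. before a release. */
    public void onRemove() {
        consumer.suspend();
    }

    /** Called when the platform pushes a "recover traffic" event. */
    public void onRecover() {
        consumer.resume();
    }
}
```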

Scenario 5: Send/consume latency detection

How long does it take to send or consume a message? By monitoring this latency, inspection can find applications with poor performance and push targeted improvements.
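
A simple way to collect the latency is to wrap the send (or consume) call with a timer; the metrics sink in this sketch is hypothetical, and the 800 ms figure echoes the inspection threshold mentioned later in this section:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * Sketch of timing a send call; the same wrapper can surround the consume listener.
 * The metrics sink is hypothetical.
 */
public class TimedSender {

    public static <T> T timedSend(Supplier<T> sendCall) {
        long start = System.nanoTime();
        try {
            return sendCall.get();
        } finally {
            long costMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            // report(costMs);  // hypothetical hook; inspection later flags apps above e.g. 800 ms
            System.out.println("send cost " + costMs + " ms");
        }
    }
}
```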

Scenario 6: Improving troubleshooting efficiency

When troubleshooting a problem, we often need to retrieve information about a message's lifecycle, such as what was sent, where it was stored, and when it was consumed. The lifecycle of a single message can be chained together by its msgId. Messages belonging to the same request can be strung together by burying a link identifier, similar to an rpcId/traceId, in the message header.
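
With RocketMQ this can be done through the message's user properties; the property name below is our own choice, not a RocketMQ convention:

```java
import org.apache.rocketmq.common.message.Message;

/**
 * Sketch of burying a link identifier (rpcId/traceId style) in the message header
 * so that all messages of one request can be strung together. The property name
 * "X-Trace-Id" is illustrative.
 */
public class TraceAwareMessageBuilder {

    public static Message build(String topic, byte[] body, String traceId) {
        Message msg = new Message(topic, body);
        msg.putUserProperty("X-Trace-Id", traceId);  // consumers and the search tool read this back
        return msg;
    }
}
```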

Extraction of governance measures

Required monitoring information

  • Send/consume speed
  • Sending/consuming time
  • Message size
  • Node information
  • Link id
  • Version information

Common governance measures

  • Regular inspection: with the buried-point data in place, risky applications can be found through inspection, for example send/consume latency above 800 ms, message size above 10 KB, or an SDK version below the specified one.
  • Send smoothing: for example, when instantaneous traffic of 10,000 is detected and it surges by more than 2x, the spike can be flattened through warm-up.
  • Consumption rate limiting: when a third-party interface needs to be protected, the consumption that calls it can be rate limited; this can be implemented together with the high-availability framework (see the sketch after this list).
  • Consumption removal: the consumer client is paused and resumed by listening for removal events.
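
For the consumption rate-limiting measure, one possible shape, using Sentinel as the high-availability framework purely for illustration (the resource name and threshold are examples), is:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

/**
 * Sketch of limiting how fast messages that call a third-party interface are consumed.
 * Sentinel is used as an illustration; the resource name and QPS cap are examples.
 */
public class ConsumeRateLimit {

    static {
        FlowRule rule = new FlowRule();
        rule.setResource("thirdPartyApiConsume");
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setCount(200);   // cap the downstream third-party interface at 200 calls/s
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static boolean tryConsume(Runnable handleMessage) {
        Entry entry = null;
        try {
            entry = SphU.entry("thirdPartyApiConsume");
            handleMessage.run();
            return true;
        } catch (BlockException e) {
            return false;   // let the consumer framework retry this message later
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```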

Topic/consumer group governance

Design guidelines

Monitor the resource usage of topics and consumer groups

Scenario replay

Scenario 1: Impact of consumption backlog on the business

Some business scenarios are sensitive to consumption backlog while others are not. For example, unlocking a bike has to happen within seconds, whereas batch-processing scenarios around data aggregation are insensitive to backlog. By collecting backlog metrics, real-time alarms are sent to the owners of applications that cross the threshold, so that they know the consumption situation immediately.

Scenario 2: Impact of the sending speed dropping

Has the sending speed dropped to zero? In some scenarios the speed must not fall to zero; if it does, the service is abnormal. By collecting speed metrics, real-time alarms can be generated for applications that cross the threshold.

Scenario 3: Consumer node going offline

When a consumer node goes offline, the application owner needs to be notified. This requires collecting the registered node information; when a node goes offline, an alarm notification is triggered in real time.

Scenario 4: Unbalanced sending/consumption

Unbalanced sending or consumption hurts performance. I remember one consultation where a user had set the message key to a constant; by default partitions are chosen by hashing the key, so every message went into a single partition, and no amount of tuning could recover the performance (see the key-selection sketch after this section). In addition, the consumption backlog of each partition should be checked, and a real-time alarm triggered when the imbalance becomes excessive.
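
The key pitfall from Scenario 4 looks roughly like this with a Kafka producer (the topic name and key field are illustrative):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Sketch of the key-selection pitfall from Scenario 4: with the default
 * hash-by-key partitioner, a constant key pins every message to one partition.
 * The topic name and the choice of orderId as key are illustrative.
 */
public class PartitionKeyExample {

    public static void send(KafkaProducer<String, String> producer, String orderId, String payload) {
        // Bad: new ProducerRecord<>("order-events", "CONSTANT_KEY", payload)
        //      -> every message hashes to the same partition, one consumer does all the work.
        // Better: use a well-distributed business id as the key.
        producer.send(new ProducerRecord<>("order-events", orderId, payload));
    }
}
```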

Extraction of governance measures

Required monitoring information

  • Send/consume speed
  • Sending partition details
  • Consumption backlog of each partition
  • Consumer group backlog
  • Registered node information

Common governance measures

  • Real-time alarms: real-time alarm notifications for consumption backlog, send/consume speed, node disconnection, and partition imbalance.
  • Performance improvement: if consumption cannot keep up with the backlog, measures such as increasing pull threads, consumption threads, and the number of partitions can be taken.
  • Self-service retrieval: multi-dimensional retrieval tools are provided, for example retrieving a message's lifecycle by time range, by msgId, or by link-system identifiers.

Cluster Health Management

Design guidelines

What are the core metrics for measuring cluster health?

Scenario replay

Scenario 1: Cluster health check

A cluster health check answers one question: is this cluster in good shape? It is answered by detecting the number of nodes in the cluster, the heartbeat of each node, the cluster's write TPS water level, and the cluster's consumption TPS water level.

Scenario 2: Cluster stability

Cluster flow control often reflects insufficient cluster performance, and cluster jitter also causes client send timeouts. By collecting the heartbeat response time of each node and the change rate of the cluster's write TPS water level, we can tell whether the cluster is stable.

Scenario 3: High availability of the cluster

High availability mainly targets extreme scenarios in which an availability zone becomes unavailable or some topics and consumer groups in the cluster become abnormal. Measures include cross-availability-zone deployment of MQ within the same city, dynamic migration of topics and consumer groups to a disaster recovery cluster, and multi-active deployment.

Extraction of governance measures

Required monitoring information

  • Number of cluster nodes
  • Heartbeat response time of each cluster node
  • Cluster write TPS water level
  • Cluster consumption TPS water level
  • Change rate of the cluster write TPS

Common governance measures

  • Regular inspection: periodically inspect the cluster's TPS and hardware water levels.
  • DR measures: active/standby deployment across availability zones in the same city, dynamic DR migration to a DR cluster, and cross-region multi-active deployment.
  • Cluster tuning: operating system version and parameters, and cluster parameter tuning.
  • Cluster classification: split clusters by line of business and by core/non-core services.

Focusing on the most core indicator

Which of these key indicators is the most important? I would pick the heartbeat detection of each node in the cluster, namely its response time (RT). Let's look at what might affect RT.
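
A heartbeat/RT probe can be as simple as sending a tiny test message to each queue and timing the call; the sketch below uses the RocketMQ producer API with a hypothetical probe topic:

```java
import java.util.List;

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageQueue;

/**
 * Sketch of a heartbeat/RT probe: send a tiny test message to each queue (covering
 * every broker) and record the response time. The probe topic name is hypothetical
 * and exception handling is trimmed for brevity.
 */
public class BrokerRtProbe {

    public static void probe(DefaultMQProducer producer) throws Exception {
        List<MessageQueue> queues = producer.fetchPublishMessageQueues("CLUSTER_HEALTH_PROBE");
        for (MessageQueue mq : queues) {
            long start = System.currentTimeMillis();
            producer.send(new Message("CLUSTER_HEALTH_PROBE", "ping".getBytes()), mq);
            long rt = System.currentTimeMillis() - start;
            // Raise an alarm if RT on this broker exceeds the threshold (value is illustrative).
            System.out.printf("broker=%s queue=%d rt=%dms%n", mq.getBrokerName(), mq.getQueueId(), rt);
        }
    }
}
```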

About alarms

  • Most monitoring metrics are detected at second-level granularity
  • Alarms that cross their thresholds are pushed to the unified alarm system and notified in real time
  • Inspection risk notifications are pushed to the company's inspection system and summarized weekly

Message platform illustrations

Architecture diagram

Dashboard illustration

  • Multi-dimensional: cluster dimension and application dimension
  • Fully aggregated: key indicators are fully aggregated

Pitfalls and solutions from RocketMQ in practice

Guide

We keep running into pits, and we keep filling them in.

1. CPU spikes (burrs) in the RocketMQ cluster

Problem description


Both the slave and master nodes of the RocketMQ cluster showed frequent CPU spikes with obvious burrs, and on many occasions the slave node crashed outright.

Error messages appeared only in the system log:

```
2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: []? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-16T17:56:07.505721+08:00 VECS0xxxx kernel:
2020-03-16T17:56:07.505724+08:00 VECS0xxxx kernel: []? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505726+08:00 VECS0xxxx kernel: []? dev_queue_xmit+0xd0/0x360
2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: []? ip_finish_output+0x192/0x380
2020-03-16T17:56:07.505732+08:00 VECS0xxxx kernel: []?
```

Tuning various system parameters could only alleviate the problem, not eliminate it; burrs of more than 50% remained.

The solution

We upgraded all machines in the cluster from CentOS 6 to CentOS 7 and the kernel from version 2.6 to 3.10, and the CPU burrs disappeared.

2. Delayed message consumption failure in an online RocketMQ cluster

Problem description

By default, the RocketMQ community edition supports 18 delay levels, and each level is consumed by consumers accurately at the configured time. We had even specifically tested whether the consumption intervals were accurate, and the test results showed they were very accurate. Yet such an accurate feature still ran into a problem: we received a report from a business colleague that delayed messages in an online cluster were not being consumed. Strange!

The solution

The "delayOffset.json" file and the "consumequeue/SCHEDULE_TOPIC_XXXX" directory were moved to another location, and the broker nodes were restarted one by one. After the restart, delayed messages were sent and consumed normally.

Build a high availability management platform for micro services

Design guidelines

Which are our core services and which are our non-core services is the primary question of service governance

Design goals

Services can cope with sudden traffic surges, and in particular the smooth operation of core services is guaranteed.

Application tiering and grouped deployment

Application tiering

Based on the two dimensions of user and business impact, the application is divided into four levels.

  • Business impact: the scope of business affected when the application fails
  • User impact: the number of users affected when the application fails

S1: Core products. A failure makes the product unusable for external users or causes large financial losses, for example the core links of the main businesses (unlocking and locking bikes and mopeds, the order-issuing and order-taking links of hitch rides) and the applications on which those core links strongly depend.

S2: Does not directly affect transactions, but relates to the management and maintenance of important front-office business configurations or to important back-office business processing.

S3: A failure has little impact on users or on core product logic, and has no impact on the main businesses or affects only a small amount of new business; also important tools for internal users that do not directly affect the business, where the related management functions have little impact on the front office.

S4: Systems for internal users that do not directly affect the business, or systems that are to be decommissioned.

Grouped deployment

S1 services are the company's core services and the key objects of protection. They must be protected from accidental impact by traffic from non-core services.

  • S1 services are deployed in both Standalone and Stable environments
  • Traffic from non-core services calling S1 services is routed to the Standalone environment
  • Circuit-breaker policies must be configured for S1 services' calls to non-core services (see the sketch after this list)
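
One possible shape of such a circuit-breaker rule, using Sentinel purely for illustration (the resource name and thresholds are examples, not Hello's actual configuration):

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

/**
 * Sketch of a circuit-breaker rule guarding an S1 service's call to a non-core service.
 * Sentinel is used as an illustration; the resource name and thresholds are examples.
 */
public class NonCoreCallDegrade {

    public static void init() {
        DegradeRule rule = new DegradeRule();
        rule.setResource("nonCoreServiceCall");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);  // break the circuit on slow calls
        rule.setCount(100);        // average RT threshold in milliseconds
        rule.setTimeWindow(10);    // stay open for 10 seconds before probing again
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }
}
```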

Building a variety of rate-limiting and circuit-breaking capabilities

High-availability platform capabilities that we built

Some effects of rate limiting


  • Warm-up illustration

  • Queueing illustration

  • Warm-up + queueing illustration (a rule sketch for these behaviors follows)
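
Expressed as Sentinel flow rules purely for illustration (resource names and numbers are examples), the three behaviors above might look like:

```java
import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

/**
 * Sketch of the three limiting behaviors illustrated above as Sentinel flow rules.
 * Resource names, QPS caps, warm-up periods, and queueing timeouts are examples.
 */
public class LimitingBehaviors {

    public static void init() {
        // 1. Warm-up: ramp up to the full 1000 QPS over 10 seconds.
        FlowRule warmUp = new FlowRule();
        warmUp.setResource("coreApiWarmUp");
        warmUp.setGrade(RuleConstant.FLOW_GRADE_QPS);
        warmUp.setCount(1000);
        warmUp.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP);
        warmUp.setWarmUpPeriodSec(10);

        // 2. Queueing: excess requests wait in line for up to 500 ms instead of being rejected.
        FlowRule queueing = new FlowRule();
        queueing.setResource("coreApiQueue");
        queueing.setGrade(RuleConstant.FLOW_GRADE_QPS);
        queueing.setCount(1000);
        queueing.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_RATE_LIMITER);
        queueing.setMaxQueueingTimeMs(500);

        // 3. Warm-up combined with queueing.
        FlowRule both = new FlowRule();
        both.setResource("coreApiWarmUpQueue");
        both.setGrade(RuleConstant.FLOW_GRADE_QPS);
        both.setCount(1000);
        both.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP_RATE_LIMITER);
        both.setWarmUpPeriodSec(10);
        both.setMaxQueueingTimeMs(500);

        FlowRuleManager.loadRules(Arrays.asList(warmUp, queueing, both));
    }
}
```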

High availability platform illustration


  • All middleware is integrated
  • Dynamic configuration takes effect in real time
  • Detailed traffic for each resource and each IP node

Conclusion

  • Which are our key metrics and which are our secondary metrics is the primary question of message governance
  • Which are our core services and which are our non-core services is the primary question of service governance
  • Reading source code and hands-on practice is a better way to work and learn.