In this article, we share Ximalaya's RocketMQ governance practices to help you understand the problems you may run into when using message-oriented middleware.
Author: Cao Rong, Ximalaya, works on microservices and messaging middleware.
Current business background and problems encountered
1. Message queue overview
(1) Online scenario: RabbitMQ, 9 instances;
(2) Offline scenario: Kafka, 8 clusters;
2. Problems encountered
Lack of governance in online scenarios:
• Services share clusters and interfere with each other. Excessive non-core traffic triggers rate limiting across the whole cluster.
• Load on nodes is unbalanced and resources are wasted.
• Resources are not associated with applications and messages are stored.
Upgrade the online MQ cluster
1. Selection criteria
(1) Business convenience: easy to develop against, use, and maintain, with efficiency features built in, such as automatic retry, dead-letter queues, and transactional guarantees;
(2) Performance: able to absorb business uncertainty, such as short-term traffic bursts and message backlogs;
(3) Simplicity: the architecture is well suited to splitting into small clusters; it is written in Java, which makes troubleshooting and secondary development easier; the community is active and widely used, so problems are easy to investigate;
2. Governance scheme
(1) Cluster division policy: split into small clusters to reduce mutual interference and limit the blast radius;
(2) Partitioning scheme:
- As shown in the figure above, for business related to “money” and for the company's core business, we provide ample resources and also require high data safety. SYNC_MASTER is used here.
- For non-core business, we want the cluster to be cost-effective and to carry as much data as possible with as few resources as possible. ASYNC_MASTER is used here, which also scales better.
- For other services that do not require high data safety, including message traces, we use a single-master cluster, which meets the performance requirements with low resource usage.
- A dedicated cluster for delayed messages would mostly hold backlogged messages and make poor use of the page cache, so we have not built one yet. In the future we will consider buying capacity on demand in the cloud. Temporary clusters are not covered yet either.
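The tiers above can be read as a lookup from business tier to replication mode. The following is a minimal sketch, not Ximalaya's implementation: the tier names (`CORE`, `NON_CORE`, `TRACE`) and the `ClusterPlanner` class are hypothetical, and only the `SYNC_MASTER` and `ASYNC_MASTER` values (as used for the `brokerRole` key in a broker configuration) come from the text. The role chosen for the single-master tier is an assumption.

```java
import java.util.EnumMap;
import java.util.Map;

final class ClusterPlanner {
    /** Hypothetical tier names; only the broker roles come from the article. */
    enum BusinessTier { CORE, NON_CORE, TRACE }

    /** Replication choice for one tier. */
    static final class Plan {
        final String brokerRole; // value for the brokerRole key in the broker config
        final boolean hasSlave;  // false means a single-master cluster

        Plan(String brokerRole, boolean hasSlave) {
            this.brokerRole = brokerRole;
            this.hasSlave = hasSlave;
        }
    }

    private static final Map<BusinessTier, Plan> PLANS = new EnumMap<>(BusinessTier.class);

    static {
        PLANS.put(BusinessTier.CORE, new Plan("SYNC_MASTER", true));
        PLANS.put(BusinessTier.NON_CORE, new Plan("ASYNC_MASTER", true));
        // Assumption: a single master with no slave runs with the ASYNC_MASTER role.
        PLANS.put(BusinessTier.TRACE, new Plan("ASYNC_MASTER", false));
    }

    /** Returns the replication plan for the given tier. */
    static Plan planFor(BusinessTier tier) {
        return PLANS.get(tier);
    }

    public static void main(String[] args) {
        for (BusinessTier tier : BusinessTier.values()) {
            Plan plan = planFor(tier);
            System.out.println(tier + " -> brokerRole=" + plan.brokerRole + ", slave=" + plan.hasSlave);
        }
    }
}
```

Keeping the mapping in one place makes it easy to review which business tiers pay the latency cost of synchronous replication.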
3. Control plane management
(1) Unified message management console;
A single management console covers all of our message-oriented middleware, as shown in the figure above.
(2) For RabbitMQ, the console only maintains existing resources and keeps their associations up to date automatically;
(3) For RocketMQ, the console focuses on user experience: sending and querying messages, one-click access demos, resending dead-letter messages, and so on.
Message Management interface
Configure the demo
Dead letter resend
(4) PaaS approval
We moved message resource management into a PaaS workflow and restricted some functions. Developers and operations staff can submit and approve requests only in the test environment. Once approved, the request is synchronized to the other environments, the resources are created, and the users are notified.
4. Unified access SDK
(1) Users only care about which resources they use and do not need to know the NameServer address, which reduces the chance of errors;
(2) Dynamic configuration takes effect without a restart, saving users' time;
(3) Failed sends and receives are retried, which provides a backstop for the middleware;
(4) Circuit breaking and rate limiting, which provide a backstop for the business;
(5) Grayscale message receiving (a message switch), which covers special business scenarios;
(6) Integration with other company facilities, such as call-chain tracing and full-link stress testing;
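The retry-with-backstop behavior in item (3) can be sketched as follows. This is an illustration under stated assumptions, not the actual SDK: `RetryingSender`, `sendWithRetry`, and the fallback callback are hypothetical names, and the real send call is abstracted as a `Predicate` that returns `true` on success.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

final class RetryingSender {
    /**
     * Tries to send a message up to maxAttempts times.
     * Returns the attempt number that succeeded, or -1 if every attempt
     * failed and the message was handed to the fallback instead.
     */
    static <T> int sendWithRetry(T message, Predicate<T> send, int maxAttempts, Consumer<T> fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (send.test(message)) {
                    return attempt;
                }
            } catch (RuntimeException e) {
                // A thrown exception counts as a failed attempt; keep retrying.
            }
        }
        // Backstop: e.g. persist locally for later resend (hypothetical behavior).
        fallback.accept(message);
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated sender that fails twice and succeeds on the third call.
        int attempts = sendWithRetry("order-created", m -> ++calls[0] >= 3, 5,
                m -> System.out.println("fallback: " + m));
        System.out.println("succeeded on attempt " + attempts);
    }
}
```

A production version would normally add a delay between attempts and distinguish retryable from non-retryable errors; both are omitted here to keep the sketch short.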
5. Multi-dimensional monitoring
- O&M: overall status, top-ranked items within each cluster, and the status of the underlying physical machines;
- Resources: per-topic upstream and downstream QPS, consumer lag, and so on;
- Users: whether each instance sends and receives messages evenly, and message latency.
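As an illustration of the lag metric, the sketch below assumes lag is computed per queue as the broker's latest offset minus the consumer group's committed offset, then summed over the topic's queues. The `LagCalculator` class and its method are hypothetical, and the article does not say how the monitoring system derives its numbers.

```java
final class LagCalculator {
    /**
     * Sums the per-queue difference between broker offset and consumer offset.
     * Both arrays are indexed by queue id and must have the same length.
     */
    static long totalLag(long[] brokerOffsets, long[] consumerOffsets) {
        if (brokerOffsets.length != consumerOffsets.length) {
            throw new IllegalArgumentException("offset arrays must cover the same queues");
        }
        long lag = 0;
        for (int i = 0; i < brokerOffsets.length; i++) {
            // Clamp at zero in case the two offsets were sampled at slightly different times.
            lag += Math.max(0, brokerOffsets[i] - consumerOffsets[i]);
        }
        return lag;
    }

    public static void main(String[] args) {
        long[] broker = {100, 250, 80};
        long[] consumer = {90, 250, 60};
        System.out.println("total lag = " + totalLag(broker, consumer));
    }
}
```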
Old cluster migration solution
1. Manual migration
Scenario: the team owns both the upstream (producer) and downstream (consumer) services;
Migration process: RocketMQ & RabbitMQ -> dual send -> single send (RocketMQ) -> single send;
Caveat: business refactoring and topic merging can leave the downstream receiving messages with several different tags on one topic, which leads to abnormal lag readings;
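The producer side of this process can be sketched as a phase switch. This is an assumption-laden illustration: the phase names and the `MigrationSender` class are hypothetical, and the sketch covers only which brokers a producer writes to in each phase. During dual send, a consumer subscribed to both brokers may see a message twice, so the sketch assumes consumers deduplicate.

```java
import java.util.ArrayList;
import java.util.List;

final class MigrationSender {
    /** Hypothetical producer-side phases of a manual migration. */
    enum Phase { RABBIT_ONLY, DUAL_SEND, ROCKET_ONLY }

    /** Returns the brokers a producer should write to in the given phase. */
    static List<String> targets(Phase phase) {
        List<String> out = new ArrayList<>();
        if (phase != Phase.ROCKET_ONLY) {
            out.add("RabbitMQ");
        }
        if (phase != Phase.RABBIT_ONLY) {
            out.add("RocketMQ");
        }
        return out;
    }

    public static void main(String[] args) {
        for (Phase phase : Phase.values()) {
            System.out.println(phase + " -> " + targets(phase));
        }
    }
}
```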
2. Automatic migration
- RabbitMQ <–> RocketMQ mirror migration;
- Granularity: exchange <-> topic;
- Note: Tasks in the same group are mutually exclusive;
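The mutual-exclusion rule can be sketched as a registry that admits at most one running task per group. The `MigrationTaskRegistry` class, its method names, and the direction strings are hypothetical; the article does not describe how the migrator enforces the rule.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class MigrationTaskRegistry {
    // group name -> direction of the task currently running for that group
    private final Map<String, String> running = new ConcurrentHashMap<>();

    /** Starts a task for the group unless one is already running; returns true if started. */
    boolean tryStart(String group, String direction) {
        return running.putIfAbsent(group, direction) == null;
    }

    /** Stops whatever task is running for the group, if any. */
    void stop(String group) {
        running.remove(group);
    }

    public static void main(String[] args) {
        MigrationTaskRegistry registry = new MigrationTaskRegistry();
        System.out.println(registry.tryStart("orders", "RabbitMQ->RocketMQ")); // first task starts
        System.out.println(registry.tryStart("orders", "RocketMQ->RabbitMQ")); // rejected while the first runs
        registry.stop("orders");
        System.out.println(registry.tryStart("orders", "RocketMQ->RabbitMQ")); // allowed after stop
    }
}
```

Rejecting the reverse task while the forward task runs also prevents a message from being mirrored back and forth between the two brokers.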
(1) RabbitMQ -> RocketMQ migration scheme
- Synchronize each RabbitMQ exchange to a RocketMQ topic, adding a topic prefix: for example, the exchange abc.exchange maps to the topic topic_abc.exchange.
- Map the RabbitMQ routing key to the RocketMQ tag.
- The migrator task consumes messages from the RabbitMQ queues and forwards them to RocketMQ, handling each exchange type accordingly.
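The naming rules above can be sketched as two small mapping functions. This is a sketch under assumptions: the `NameMapper` class is hypothetical, the prefix is taken to be `topic_` as the example in the text suggests, and the routing key is copied to the tag unchanged. Whether the real migrator also rewrites characters in the exchange name is not stated in the article, so this sketch only adds the prefix.

```java
final class NameMapper {
    // Assumed prefix, taken from the example in the text.
    static final String TOPIC_PREFIX = "topic_";

    /** RabbitMQ exchange name -> RocketMQ topic name. */
    static String toTopic(String exchange) {
        return TOPIC_PREFIX + exchange;
    }

    /** RabbitMQ routing key -> RocketMQ tag (copied as-is in this sketch). */
    static String toTag(String routingKey) {
        return routingKey;
    }

    /** Reverse mapping, for the RocketMQ -> RabbitMQ direction. */
    static String toExchange(String topic) {
        if (!topic.startsWith(TOPIC_PREFIX)) {
            throw new IllegalArgumentException("not a migrated topic: " + topic);
        }
        return topic.substring(TOPIC_PREFIX.length());
    }

    public static void main(String[] args) {
        String topic = toTopic("abc.exchange");
        System.out.println(topic + " / tag=" + toTag("order.created"));
        System.out.println("back to exchange: " + toExchange(topic));
    }
}
```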
(2) RocketMQ -> RabbitMQ migration scheme:
The method is similar, as shown below:
(3) Several cases of automatic migration:
- Configure RabbitMQ -> RocketMQ tasks.
- Configure the RocketMQ -> RabbitMQ task.
- Configure the RabbitMQ -> RocketMQ task; after the migration is complete, stop it, then configure the RocketMQ -> RabbitMQ task.
This article is the original content of Aliyun and shall not be reproduced without permission.