Upgrading a Large-Scale Production RocketMQ Cluster Gracefully Without Downtime: A Practical Account

This article is included in the GitHub repository: github.com/dingwpmz/Ja…

Greetings, I’m Wesker. I graduated from an ordinary second-tier college and entered the workplace without ever having touched distributed systems, microservices, or high concurrency. By sharing technical knowledge I grew into an outstanding RocketMQ community evangelist, a senior architect, and a co-author of the book “RocketMQ Technology Insider”. You are welcome to follow the “Middleware Interest Circle” account and star it, so that we can communicate and progress together. Follow “Middleware Interest Circle” and reply RMQPDF to get two e-books that condense the author’s experience with message flows at the scale of billions.

In line with the administrative requirements of the security department, hundreds of RocketMQ machines in the production environment must be upgraded within half a month and must support ACL to avoid security risks.

The job of designing and carrying out the RocketMQ cluster upgrade naturally fell to me. This article describes not only how I performed the upgrade but also the methodology an architect uses to tackle this kind of problem, as a glimpse into the daily work of an architect at a large company.

Tip: ACL-related content, including the twists and turns of upgrading from 4.1.0 to 4.8 and enabling ACLs, will be shared separately in follow-up articles.

1. Urgency of version upgrade


Our RocketMQ servers were still on version 4.1.0. Before 4.4.0, RocketMQ did not support ACLs (access control lists), so any machine in the production environment could subscribe to any topic. Anyone could install RocketMQ-Console on any production application server and control the entire cluster, including deleting topics and consumer groups. The thought sends a chill down the spine.

2. Upgrade scheme


2.1 Determine the upgrade version

The RocketMQ release notes show that ACLs were officially introduced in version 4.4.0, so we had to upgrade to at least 4.4.0. There is an unwritten rule in the industry for adopting open-source software: generally, don’t use the latest version, and don’t be the guinea pig.

But RocketMQ is a special case.

The RocketMQ version history shows that the RocketMQ client has changed very little: message sending and message consumption, the parts closest to users, are very stable and in theory compatible across versions. Each release has also fixed some major bugs and brought noticeable performance improvements. So this time I decided to go against conventional wisdom and upgrade to the latest version, 4.8.0.

Without going into detail, here is a brief list of milestone RocketMQ versions.

  • RocketMQ 4.3.0 officially introduces transaction messages; 4.6.1 is the minimum recommended version if you want to use them.

  • RocketMQ 4.4.0 introduces ACLs and message tracing; 4.7.0 is the minimum recommended version if you want to use these features.

  • RocketMQ 4.5.0 introduces multiple replicas (master/slave switchover); 4.7.0 is the recommended version.

  • RocketMQ 4.6.0 introduces the request-response model.

2.2 Upgrade Roadmap

The basic requirement for the upgrade is that services cannot be stopped; the upgrade must be imperceptible to applications.

If there are enough spare machines, the best migration approach is to expand the cluster first and then shrink it, as shown in the following figure:

The main idea is to expand the cluster first by adding two new-version Broker servers. Then disable write permission on the old-version Brokers, remove them once their stored messages have expired, and finally upgrade the NameServers, completing the online migration without downtime.

However, the upgrade had to be finished within about half a month, and we could not obtain that many cold standby machines for the cluster. The expand-then-shrink approach therefore could not meet our needs.

Could we instead upgrade the Broker software in place, with the higher-version Broker directly using the storage directory of the lower version? In other words, upgrade the software directly, as illustrated below:

The idea is to stop the old version of the Broker and then start it with the new version, but with the old configuration file.

With an idea in hand, the next step is to verify that the scheme is feasible.

2.3 Scheme Verification

Theory is theory. Before making any changes to the production environment, there must be adequate testing and verification.

2.3.1 Verifying server Version Compatibility

Set up a mixed-version MQ cluster. The key points to verify are:

  • Whether a higher-version Broker can register a route with a lower-version NameServer

  • Whether a lower-version Broker can register routes with a higher-version NameServer

Use RocketMQ-Console to create several topics and check whether their routing information is correct. Our verification showed that it was, as expected.
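The same check can also be made from the command line. The following is a minimal sketch, assuming mqadmin is run from the RocketMQ bin directory. clusterList and topicRoute are standard mqadmin subcommands, while the helper name check_routes and the MQADMIN and DRY_RUN variables are hypothetical names introduced for illustration. By default the helper only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) the mqadmin commands that show which
# Brokers have registered with a NameServer and what route a topic has.
MQADMIN="${MQADMIN:-sh ./mqadmin}"   # assumption: run from the RocketMQ bin directory

check_routes() {
  namesrv="$1"   # NameServer address, e.g. host:9876
  topic="$2"     # a topic created for the test
  for cmd in "clusterList -n $namesrv" "topicRoute -t $topic -n $namesrv"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$MQADMIN $cmd"     # dry run: only show the command
    else
      $MQADMIN $cmd            # real run: query the NameServer
    fi
  done
}
```

Running the helper once against the lower-version NameServer and once against the higher-version one shows whether Brokers of both versions appear in each cluster list.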

2.3.2 Verifying the Compatibility between the Client and server

The RocketMQ client API is simple: sending messages, sending in batches, and consuming messages. Since version 4.1 does not support transaction messages, this upgrade does not even require validating them.

  • Whether a lower-version client can normally send and consume messages to a higher-version Broker

  • Whether a higher-version client can send and consume messages to a lower-version Broker

In fact, we don’t need to write the test cases ourselves. We can use the official Demo directly, and the code screenshots are as follows:
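One way to run the official quickstart demo from the command line is sketched below. It assumes each RocketMQ release is unpacked in its own directory and that the release ships the quickstart example classes; the function name demo_commands and the directory paths are hypothetical. The function only prints the commands, so the same pair can be generated for every combination of client release and cluster under test.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that run the quickstart Producer and
# Consumer shipped with one RocketMQ release against a given NameServer.
demo_commands() {
  client_home="$1"   # assumption: directory of an unpacked RocketMQ release
  namesrv="$2"       # NameServer address of the cluster under test
  for cls in Producer Consumer; do
    # NAMESRV_ADDR is the environment variable the client reads for the NameServer address
    echo "NAMESRV_ADDR=$namesrv sh $client_home/bin/tools.sh org.apache.rocketmq.example.quickstart.$cls"
  done
}
```

For example, calling demo_commands with a 4.1.0 directory and the NameServer of a 4.8.0 cluster yields the low-version-client case, and swapping the two yields the high-version-client case.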

Client validation in a real rollout is actually more complex than server-side validation. Each team uses a different client version, and some project teams even use C++, Python, or other non-Java clients, so accurately collecting the connection information (client version, language type) of every client in the cluster is critical.

The official release supports querying consumer group connection information fairly well. We can write a script that first queries all consumer groups in the system and then iterates over each one to obtain its clients’ IP addresses, client versions, languages, and so on. Support for producers in the open-source version is poor, however: there is no interface for retrieving all senders.

Consumer connection information looks like this:

We therefore judged which client types were in use mainly from the consumer group connection information. During this upgrade I also did some custom development on RocketMQ that makes it easy to obtain the connection information of all senders, and I will submit it as a PR to the official project in the future.
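The consumer-side inventory described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the script actually used: it assumes group names can be derived from retry topics named %RETRY%<group> in the output of mqadmin topicList, and the function names are hypothetical. topicList and consumerConnection are standard mqadmin subcommands.

```shell
#!/bin/sh
# Hypothetical helper: derive consumer group names from `mqadmin topicList`
# output read on stdin.
# Assumption: each consumer group owns a retry topic named %RETRY%<group>.
groups_from_topics() {
  sed -n 's/^%RETRY%//p' | sort -u
}

# Read group names on stdin and print one consumerConnection command per group.
# The output of that command lists each client's address, language and version.
connection_commands() {
  namesrv="$1"
  while IFS= read -r group; do
    if [ -n "$group" ]; then
      echo "sh ./mqadmin consumerConnection -g $group -n $namesrv"
    fi
  done
}
```

A possible pipeline is: sh ./mqadmin topicList -n <namesrv> | groups_from_topics | connection_commands <namesrv>, whose output can then be reviewed or piped to sh.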

2.3.3 Broker storage format verification

Since there are no spare resources, the upgrade replaces the software in place, so the old and new versions share the same storage directory. This relies on the premise that RocketMQ’s message storage format has not changed since version 4.0.0. The key points to verify are:

  • Whether version 4.8.0 can directly use the storage files (CommitLog files) generated by 4.1.0

  • Whether version 4.1.0 can directly use the storage files generated by 4.8.0

Why verify that 4.1.0 can read files written by 4.8.0? Because if the upgrade fails, we need to roll back. If 4.1.0 could not use the 4.8.0 files, there would be no way back, which is unacceptable in an architectural design.

After verification, it is found that the stored files are compatible with each other.

2.3.4 Verifying the Test Environment

After the three verification steps above, the upgrade is nearly ready, but first the setup should run stably in the test environment for a day. The test environment can be upgraded to the following architecture:

That is, a mix of versions is exercised by all application servers in the test environment. If it runs smoothly there, the upgrade can proceed in production.

2.4 Implementation Plan

With the scheme above fully validated, the upgrade can be carried out in production. Before execution, the theory must be turned into a concrete, executable implementation plan. The plan must include a rollback procedure, and the rollback must be easy to perform; otherwise the plan is not reliable.

The following covers the key steps of the implementation. The whole upgrade is a rolling upgrade, that is, machines are upgraded one at a time.

1. Disable write permissions on a Broker

Disabling Broker write permissions allows applications to smoothly migrate traffic to other nodes. This can effectively avoid service impact when the machine is restarted.

sh ./mqadmin updateBrokerConfig -b 192.168.x.x:10911 -n 192.168.xx.xx:9876 -k brokerPermission -v 4

2. Close the Broker when TPS is close to 0

ps -ef | grep java
kill pid


3. Start the Broker with the new version

Note that the old configuration file is used here, so write permission is still disabled. Starting the Broker therefore does not affect clients writing messages.

4. Enable the write permission

After the new version is successfully started, you can enable the write permission

sh ./mqadmin updateBrokerConfig -b 192.168.xx.xx:10911 -n 192.168.xx.xx:9876 -k brokerPermission -v 6

Observe the traffic.

Repeat the steps above for each remaining Broker.
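The permission switches and the "TPS close to 0" check in the steps above can be wrapped in small helpers. The sketch below is an illustration under stated assumptions, not the script used in this upgrade: the function names and the MQADMIN and DRY_RUN variables are hypothetical, and put_tps assumes that mqadmin brokerStatus prints a line of the form "putTps : <value> ...". Stopping the old Broker and starting the new one are left as manual steps.

```shell
#!/bin/sh
# Hypothetical helpers for one Broker's rolling upgrade.
# By default (DRY_RUN=1) commands are printed instead of executed.
MQADMIN="${MQADMIN:-sh ./mqadmin}"   # assumption: run from the RocketMQ bin directory

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# set_permission <brokerAddr> <namesrvAddr> <perm>
# perm 4 = read only (step 1), perm 6 = read and write (step 4).
set_permission() {
  run $MQADMIN updateBrokerConfig -b "$1" -n "$2" -k brokerPermission -v "$3"
}

# Extract the first putTps figure from `mqadmin brokerStatus` output on stdin.
# Assumption: the output contains a line like "putTps : 0.0 0.0 0.0".
put_tps() {
  awk -F: '/^putTps/ { split($2, a, " "); print a[1]; exit }'
}

# is_drained <tps> [threshold]: succeeds when tps is below the threshold
# (default 1), i.e. write traffic has moved to other Brokers (step 2).
is_drained() {
  awk -v tps="$1" -v max="${2:-1}" 'BEGIN { exit !(tps + 0 < max + 0) }'
}
```

For example, set_permission 192.168.x.x:10911 192.168.xx.xx:9876 4 prints the command from step 1, and the result of sh ./mqadmin brokerStatus -b <broker> -n <namesrv> | put_tps can be passed to is_drained before the old process is killed.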

Upgrading the NameServers is even simpler. Use a rolling upgrade as well: kill the old NameServer process and start the new version on the same machine.

3. Trivia


Finally, a small episode. Although the plan above was very detailed and had been tested repeatedly, MQ is so important in our company that the operations colleague did not dare to start and asked me to stand by and watch while he worked. At a moment like this an architect should have the courage to take responsibility. I told him clearly: as long as you follow the procedure correctly, I will bear responsibility for any resulting failure. I personally consider this a very important soft skill for an architect: being in control of the technology you are responsible for, and being accountable for it.

That’s all for this issue. I hope it helps you, and I would appreciate your support through a like or a share.

Finally, I’d like to share a core RocketMQ e-book from which you can gain operations experience with message flows at the scale of billions.

How to get it: search WeChat for “Middleware Interest Circle” and reply RMQPDF.
