precondition

This solution does not cover every case; for other deployment models, treat it as a reference only.

  • Cluster mode: multiple masters and multiple slaves (e.g., 2 masters and 4 slaves)
  • Master/slave switchover is supported via DLedger. If you do not use DLedger, you can still follow this as a reference, provided the previous requirement is met
  • During a broker upgrade (broker restart), the producer and consumer sides must not log any ERROR on the client; services must not be affected, and the client should be unaware of the restart

scenario

MQ cluster nodes need to be upgraded in rotation, or restarted after modifying the broker configuration. While a broker restarts, the client must not generate ERROR logs.

A broker restart may cause errors on the client:

  1. The producer fails to send messages (routing information is not updated in time, so it still tries to connect to the stopped broker)
  2. The consumer fails to persist its consumption offset (the offset is persisted to the broker master every 5s by default; the call fails once the connection is down)

Why does this matter? The error logs surface directly in the business system.

By default, the consumer writes to logs/rocketmq_client.log under its home directory, but the client can be configured to log through the business system's SLF4J instead. The business system collects those logs and raises alarms on ERROR-level entries (even though some ERROR logs have no actual business impact). To keep service teams from being paged while a broker master restarts (an alarm that does not affect the business is an unnecessary alarm), the restart must run smoothly enough that nothing at ERROR level is logged. Note that a heartbeat failure is only an INFO-level log.
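As a hedged sketch of the SLF4J routing mentioned above: the RocketMQ client honors the `rocketmq.client.logUseSlf4j` system property; `your-app.jar` is a placeholder for your own application.

```shell
# Route RocketMQ client logs through the application's SLF4J binding
# instead of the default logs/rocketmq_client.log file.
java -Drocketmq.client.logUseSlf4j=true -jar your-app.jar
```

With this flag set, whatever appender configuration the business system already uses (Logback, Log4j2, etc.) also receives the client's log events.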

Suppose you want to upgrade the MQ cluster from 4.7.1 to 4.8.0. The operations are as follows.

steps

The cluster model is as follows:

The Broker1 master/slave group and the Broker2 master/slave group are upgraded in sequence.

1. Disable write permission on the Broker1 master so producers can no longer send messages to Broker1. Broker2 then takes the full load. Resource redundancy matters here: once write permission is turned off, all producer traffic, including Broker1's original share, is cut over to Broker2.

sh mqadmin updateBrokerConfig -n 'nameserver:9876' -k brokerPermission -v 4 -b broker1master:10911
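The brokerPermission values used in these commands are bit flags (in RocketMQ's PermName class: PERM_READ = 4, PERM_WRITE = 2, PERM_INHERIT = 1), so the procedure is just clearing bits. A quick sanity check:

```shell
# brokerPermission is a bitmask: READ=4, WRITE=2, INHERIT=1.
# 6 = READ|WRITE (normal operation), 4 = read-only, 1 = neither read nor write.
echo $((4 | 2))    # READ|WRITE: full permission
echo $((6 & ~2))   # drop the WRITE bit from 6, leaving read-only
```

So `-v 4` in step 1 leaves consumers reading while blocking producers, and `-v 1` in step 2 blocks both.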

2. From the console, the command line, or your monitoring platform, depending on what tools you have (on the command line, use the clusterList command), watch until the Broker1 master's inTPS and outTPS drop to 0. This confirms that all messages on that node have been consumed and there is no backlog. Then remove read permission as well:

sh mqadmin updateBrokerConfig -n 'nameserver:9876' -k brokerPermission -v 1 -b broker1master:10911
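To confirm the node is drained before running the command above, you can script the check against a captured clusterList line. This is only a sketch: the sample line and the column positions are assumptions and should be verified against the output of your mqadmin version.

```shell
# Hypothetical captured row from `sh mqadmin clusterList -n 'nameserver:9876'`
# (cluster, broker name, broker id, address, version, InTPS, OutTPS)
line="DefaultCluster  broker1  0  broker1master:10911  V4_7_1  0.00  0.00"

in_tps=$(echo "$line" | awk '{print $6}')
out_tps=$(echo "$line" | awk '{print $7}')

if [ "$in_tps" = "0.00" ] && [ "$out_tps" = "0.00" ]; then
  echo "drained: safe to remove read permission"
fi
```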

The consumer will then log at WARN level that pulling is forbidden, but this has no impact, because all messages have already been consumed.

3. Check that the inTps of the slave nodes Slave1_1 and Slave1_2 is 0, to ensure no messages are still being synchronized. Then stop, upgrade, and restart Slave1_1 and Slave1_2 one at a time.

4. Stop the master1 node, making sure the interval between this step and step 2 is at least 2 minutes. After master1 stops, one of the slave nodes is automatically elected master. Then start the just-stopped broker with the new version; it rejoins as a slave node. (If this is only a restart rather than a version upgrade, change brokerPermission for this node back to 6. It is currently 1, because the previous steps removed both read and write permission.)
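For the restart-only case mentioned above, restoring full read/write permission (brokerPermission back to 6) mirrors the earlier commands:

```shell
sh mqadmin updateBrokerConfig -n 'nameserver:9876' -k brokerPermission -v 6 -b broker1master:10911
```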

During the upgrade, the consumer side may report a WARN log while switching from the old node to the new master; do not worry about it. It stems from an incomplete implementation of load balancing and queue allocation on the consumer side, a known, unfixed, but harmless behavior. The detailed reasons will be explained separately later.

Broker1 then has no effect on business clients during the entire restart/upgrade process: there are only a small number of WARN logs and no ERROR logs.

5. Verify that Broker1 is healthy, then repeat steps 1-4 for the Broker2 nodes until the entire cluster upgrade is complete.

Why you should wait at least two minutes after disabling read permission before stopping the master

As noted earlier, after read permission is turned off in step 2, you must wait at least 2 minutes before stopping the broker, to avoid the consumer logging an ERROR from a connection exception when it tries to persist the consumption offset. Let's break down the Broker1 scenario step by step.

When does the heartbeat fire? Take a look at the code in the startScheduledTask() method of the MQClientInstance class; note the comments I added:

        this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

            @Override
            public void run() {
                try {
                    // Runs every 30 seconds by default.
                    // Traverses the cached broker addresses and removes any broker that no
                    // longer appears in the topic routing information (topicRouteInfo).
                    MQClientInstance.this.cleanOfflineBroker();
                    // Sends a heartbeat to the brokers in the address table. If there are no
                    // consumer instances, the heartbeat goes only to master nodes; otherwise
                    // it goes to all brokers.
                    // Each heartbeat makes the broker (re)create the retry topic, so even if
                    // the retry topic is deleted, it reappears after the next heartbeat as
                    // long as a consumer is running. The retry topic is only created after a
                    // consumer group has started.
                    MQClientInstance.this.sendHeartbeatToAllBrokerWithLock();
                } catch (Exception e) {
                    log.error("ScheduledTask sendHeartbeatToAllBroker exception", e);
                }
            }
        }, 1000, this.clientConfig.getHeartbeatBrokerInterval(), TimeUnit.MILLISECONDS);

This heartbeat does not affect the consumption-offset update; focus on the comments above. Once a broker's queues lose the relevant permission, they drop out of the client's cached route data (write permission for the producer's publish info, read permission for the consumer's subscription info), so the subscription information no longer contains that broker's queue data.

There is also a scheduled task that updates the topic routing information every 30 seconds (code omitted for brevity). When routing information is updated, the subscription information in the cache is updated too:

                            {
                                // Update sub info
                                Set<MessageQueue> subscribeInfo = topicRouteData2TopicSubscribeInfo(topic, topicRouteData);
                                Iterator<Entry<String, MQConsumerInner>> it = this.consumerTable.entrySet().iterator();
                                while (it.hasNext()) {
                                    Entry<String, MQConsumerInner> entry = it.next();
                                    MQConsumerInner impl = entry.getValue();
                                    if (impl != null) {
                                        impl.updateTopicSubscribeInfo(topic, subscribeInfo);
                                    }
                                }
                            }

By default, the consumption offset is persisted every 5s. As long as the Broker1 address is no longer cached when the offset is persisted, stopping Broker1 is safe.

        this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

            @Override
            public void run() {
                try {
                    // The offset is persisted every 5 seconds by default.
                    // If the stopped broker's address is still cached when the offset is
                    // persisted, the persist call fails and an error is reported.
                    MQClientInstance.this.persistAllConsumerOffset();
                } catch (Exception e) {
                    log.error("ScheduledTask persistAllConsumerOffset exception", e);
                }
            }
        }, 1000 * 10, this.clientConfig.getPersistConsumerOffsetInterval(), TimeUnit.MILLISECONDS);

In short, the offset-persist task updates the offsets of all locally subscribed message queues, and the set of subscribed queues comes from the most recent topic routing update. Without read permission, Broker1 contributes no queue information, so the persist task no longer touches consumption offsets on that broker. The worst case on the client side is therefore 30 + 5 = 35s (route refresh interval plus persist interval). Now look at the broker side: after read permission is removed, it takes the broker up to 30 seconds to re-register the topic's routing information.

        this.scheduledExecutorService.scheduleAtFixedRate(new Runnable() {

            @Override
            public void run() {
                try {
                    // brokerConfig.isForceRegister() defaults to true
                    BrokerController.this.registerBrokerAll(true, false, brokerConfig.isForceRegister());
                } catch (Throwable e) {
                    log.error("registerBrokerAll Exception", e);
                }
            }
            // brokerConfig.getRegisterNameServerPeriod() defaults to 30s and is clamped
            // to [10s, 60s]: the broker registers itself (including its topic
            // information) with the name server every 30s.
        }, 1000 * 10, Math.max(10000, Math.min(brokerConfig.getRegisterNameServerPeriod(), 60000)), TimeUnit.MILLISECONDS);

Topic routing information is reported to the name server every 30 seconds, after which the consumer starts to perceive the change.

So the total is 65 seconds: to guarantee the consumer is unaffected, wait at least 65s after read permission is removed before stopping the broker. In practice these schedules overlap (and some of this information is also updated outside the scheduled tasks), so it can complete in as little as 10-20 seconds. I recommend 2 minutes because it is easy to remember and absolutely safe, with no need to worry about the details.
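The timing budget above can be written out explicitly; all intervals are the client and broker defaults quoted earlier.

```shell
route_refresh=30    # client topic-route refresh interval (seconds)
offset_persist=5    # client consumer-offset persist interval (seconds)
broker_register=30  # broker -> name server registration interval (seconds)

# The broker must first re-register the permission change, then the client must
# refresh its routes and run one more offset persist cycle.
client_worst=$((route_refresh + offset_persist))
total=$((broker_register + client_worst))

echo "client worst case: ${client_worst}s"
echo "total worst case:  ${total}s"
```

Rounding 65s up to 2 minutes is simply a comfortable safety margin on this worst case.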