MQ

Make sure you don’t lose messages:

  • On the producer and consumer sides, this is guaranteed by business code and a request-confirmation (acknowledgement) mechanism
  • On the server side, by persistence and replication

Copying messages to multiple nodes:

  • Resolves the message-loss problem
  • Ensures the high availability (HA) of the messaging service

So MQ is deployed in clustered mode with message replication enabled.

So what are the problems that message replication needs to solve?

1 Indicators of message replication

MQ is expected to offer high performance, high availability, and data consistency. Many MQ products claim to support all three, but each claim comes with preconditions.

1.1 Performance

Any type of replication requires data to be written to multiple nodes before a response is returned, so its write performance can never match that of a single node.

The more nodes that must be written, the better the availability and data reliability, but the lower the write performance. Replication, however, has little impact on consumption performance: whatever the replication method, a consumer reads a message from only one of the replicas, just as in the single-node case.

1.2 consistency

MQ has the following data consistency requirements:

  • Do not lose messages
  • Deliver messages in strict order

To ensure data consistency, master/slave replication must be used.

In master/slave mode, data is written to the master node first, and the slave nodes replicate data only from the master. If the data on the master and a slave ever diverges, the master’s data prevails.

The master node is not fixed: in many replication implementations, when the master fails, another node can be elected as the new master. The invariant is that a cluster must never have more than one master at any time; otherwise data consistency breaks down (the classic split-brain problem).

1.3 High availability

With master/slave replication, high availability means that when the master goes down, another node is promoted to master as soon as possible.

1.3.1 Implementation methods

1.3.1.1 Management service

One approach is to manage the nodes through a third-party management service: if the master goes down, the management service appoints a new master, much like Redis Sentinel. Introducing such a service, however, brings its own series of problems, such as how to ensure the high availability and data consistency of the management service itself.

1.3.1.2 Self-election

Some MQs choose self-election, where the surviving nodes vote to elect a new master node.

  • Advantages

No external dependencies; the cluster manages itself

  • Disadvantages

Voting algorithms are complex to implement, and elections are slow, taking several seconds to tens of seconds; until the new master is elected, the service is unavailable

Most replication implementations do not wait for the message to be written to every replica before returning confirmation: while that guarantees data consistency, the write gets stuck the moment any replica goes down. If the write is acknowledged as soon as the message reaches only some of the replicas, the gridlock is avoided and performance is better. So how many replicas must be written before a write counts as successful? Assume the cluster uses one master, two slaves, three replicas in total:

  • If the message must be written to two replicas, then at most one replica may be down; otherwise the service cannot accept writes
  • If only one replica is required, a write succeeds as soon as the message reaches the master, so two of the three replicas may be down and the system can still serve writes; availability is better

However, some messages on the master may not yet have been copied to any slave in time. If the master then breaks down, those messages are lost, and with them data consistency.
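To make the trade-off concrete, here is a minimal sketch in Java (illustrative only, not code from any particular MQ) of how the required acknowledgement count determines fault tolerance and consistency risk:

```java
// Illustrative sketch: with N replicas and W required write acknowledgements,
// how many replica failures can writes tolerate, and does a committed message
// survive the loss of the master?
public class QuorumTradeoff {
    public static void main(String[] args) {
        int replicas = 3; // 1 master + 2 slaves

        for (int requiredAcks = 1; requiredAcks <= replicas; requiredAcks++) {
            // Writes keep succeeding as long as requiredAcks replicas are alive.
            int tolerableFailures = replicas - requiredAcks;
            // A message survives a master crash only if it was acknowledged
            // by at least one node besides the master, i.e. requiredAcks >= 2.
            boolean survivesMasterCrash = requiredAcks >= 2;
            System.out.printf(
                "acks=%d: writes tolerate %d down, survives master crash: %b%n",
                requiredAcks, tolerableFailures, survivesMasterCrash);
        }
    }
}
```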

Each MQ’s choice of replication implementation thus has its own advantages and disadvantages.

2 RocketMQ replication

2.1 Traditional replication

In RocketMQ, the basic unit of replication is the Broker, the server-side process. Master/slave replication is usually configured as one master with one slave, or one master with multiple slaves.

RocketMQ offers two replication modes:

Asynchronous replication

The message is sent to the master node, which returns write success immediately and then asynchronously copies the message to the slave node.

Synchronous double write

The message is written to both the master and the slave, and write success is returned only after the data has reached both nodes. The essential difference between the two modes is how many replicas must be written before “write success” is returned: one replica for asynchronous replication, two for synchronous double write.
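On the broker side the mode is selected by the brokerRole setting (ASYNC_MASTER or SYNC_MASTER). As a concrete illustration, here is a minimal sketch using RocketMQ’s Java client, which surfaces replication problems through the SendStatus of a synchronous send; the producer group, name-server address, topic, and tag below are placeholders:

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class SyncDoubleWriteCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder producer group and name-server address.
        DefaultMQProducer producer = new DefaultMQProducer("demo_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();

        Message msg = new Message("DemoTopic", "TagA", "hello".getBytes());
        SendResult result = producer.send(msg); // synchronous send

        // With a SYNC_MASTER broker, statuses such as FLUSH_SLAVE_TIMEOUT or
        // SLAVE_NOT_AVAILABLE signal that the message was not replicated in time.
        switch (result.getSendStatus()) {
            case SEND_OK:
                System.out.println("written and replicated as configured");
                break;
            default:
                System.out.println("degraded write: " + result.getSendStatus());
        }

        producer.shutdown();
    }
}
```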

In general, if “write success” is returned before enough replicas have the message, messages can be lost. So for RocketMQ, do you lose messages with asynchronous replication? You do not.

Why no messages are lost

RocketMQ’s Broker roles are fixed by configuration and do not support dynamic switching. If the master goes down, producers can no longer send messages, but consumers automatically switch to the slave and continue consuming. At this point some messages may not yet have been copied to the slave, but they are still sitting on the master’s disk. Unless that disk is physically damaged, the messages will continue replicating to the slave once the master returns to service, and consumption can continue: no messages are lost, and message order is intact.

This master/slave replication approach sacrifices availability in exchange for better performance and data consistency.

Availability

Since a single master/slave pair is not highly available on its own, RocketMQ uses many pairs: a topic can be distributed across multiple master/slave pairs, with each pair hosting a portion of the topic’s queues. If one master goes down, producers automatically switch to another master and continue sending messages.

This solves two problems:

  • Availability
  • Topic performance, which can be improved through horizontal scaling

Defect of the traditional replication

Since order cannot be guaranteed at the topic level, strictly ordered messages must be sent to a specified queue, and any given queue lives on one fixed master/slave pair. If that master fails, no other master can take over for it, or strict order would be broken. So under this replication mode, strict order and high availability are one or the other: you cannot have both.
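To illustrate why a queue is pinned, here is a minimal sketch of ordered sending with RocketMQ’s Java client, where a MessageQueueSelector hashes a business key to one fixed queue; the producer group, address, topic, and order ID are placeholders:

```java
import java.util.List;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageQueue;

public class OrderedSend {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("order_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // placeholder address
        producer.start();

        long orderId = 42L; // placeholder business key that must stay ordered
        Message msg = new Message("OrderTopic", ("order-" + orderId).getBytes());

        // All messages with the same orderId hash to the same queue, and every
        // queue lives on one fixed master/slave pair -- which is exactly why a
        // master failure breaks strict order in this replication mode.
        producer.send(msg, (List<MessageQueue> mqs, Message m, Object arg) -> {
            long id = (Long) arg;
            return mqs.get((int) (id % mqs.size()));
        }, orderId);

        producer.shutdown();
    }
}
```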

2.2 The new replication

Dledger, a new replication method, was introduced at the end of 2018.

When writing messages, Dledger requires that a message be copied to a majority of the nodes before returning write success to the client. In addition, Dledger supports dynamic switching of the master node.

How it works

Take three nodes as an example. When the master goes down, the two surviving slaves vote to elect a new master, which, compared with traditional master/slave replication, solves the availability problem. Because a message must be copied to at least two nodes before write success is returned, even if the master goes down, at least one surviving node holds the same messages as the master. The election always selects a slave whose data matches the master’s as the new master, which guarantees data consistency: no messages are lost, and strict order is preserved.
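The rule behind this can be shown in a short sketch (illustrative only, not Dledger’s actual code): a write commits once a majority of nodes hold it, and only a node holding every committed entry is electable.

```java
import java.util.Arrays;

// Illustrative majority-replication sketch for a 3-node Dledger-style group.
public class MajorityReplication {
    public static void main(String[] args) {
        int nodes = 3;
        int majority = nodes / 2 + 1; // 2 of 3

        // Highest entry index acknowledged by each node (node 0 is the master).
        long[] acked = {100, 100, 97};

        // The committed index is the highest entry held by a majority of nodes.
        long[] sorted = acked.clone();
        Arrays.sort(sorted);
        long committed = sorted[nodes - majority]; // 100: nodes 0 and 1 hold it

        // If the master crashes, only node 1 is electable, because node 2
        // is missing committed entries 98..100.
        for (int n = 1; n < nodes; n++) {
            System.out.printf("node %d electable: %b%n", n, acked[n] >= committed);
        }
    }
}
```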

Defect of the new replication

Services cannot be provided during an election. At least three nodes are required to ensure data consistency, and with three nodes deployed only one may fail: if two nodes are down, the cluster cannot provide service even though one node is still alive, so resource utilization is low. And because a message must be copied to a majority of nodes before write success is returned, writes are not as fast as with master/slave asynchronous replication.

3 Kafka replication

In Kafka, the basic unit of replication is the partition. The several replicas of each partition form a small replication cluster of their own, and the Broker is just a container for these partition replicas, so Kafka’s Brokers themselves have no master or slave roles.

The replicas of a partition use one master with multiple slaves and replicate asynchronously on writes. However, a message written to the master does not immediately count as a write success: Kafka waits until enough nodes have replicated it, and “enough” is decided by the user. The corresponding concept is the In-Sync Replicas (ISR), the set of replicas that keep their data synchronized with the master. The number of in-sync replicas is configurable, and the ISR includes the master itself.

Kafka uses ZooKeeper to monitor the nodes of each partition. When it finds that a partition’s master node is down:

  • Kafka uses ZooKeeper to elect a new master node, which addresses availability
  • The new master is elected from among the ISR nodes, which ensures data consistency

By default, if all the ISR nodes go down, the partition stops providing service. Alternatively, you can configure the partition to keep serving, so that it stays available as long as any one node is alive; the price is that data consistency is no longer guaranteed and messages may be lost.
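For example, with Kafka’s Java client these trade-offs are tuned through a few real configuration keys: acks on the producer, plus the topic-level settings min.insync.replicas and unclean.leader.election.enable. The broker address and topic name below are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ConsistentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Wait until all in-sync replicas (the ISR) have the message.
        props.put("acks", "all");

        // Topic-side settings (set via the admin tools, shown as comments):
        //   min.insync.replicas=2 makes "all" mean at least 2 replicas;
        //   unclean.leader.election.enable=false refuses to elect a leader
        //   outside the ISR, trading availability for consistency.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}
```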

Kafka’s highly configurable replication mode

  • Advantages

You can tune these replication parameters to make trade-offs among availability, performance, and consistency for your service

  • Disadvantages

A higher learning cost

4 Summary

There is no perfect replication solution; evaluate high performance, high availability, and consistency based on business requirements.