
Contents

1. Preface
2. The core process of the reliable-message eventual consistency scheme
3. Production practice: high-availability guarantees for the reliable-message eventual consistency scheme

1. Preface

In the last article, we talked about TCC distributed transactions. For common microservice systems, most interface calls are synchronous, where one service directly calls another service’s interface.

At this point, it is appropriate to use the TCC distributed transaction scheme to ensure that all interface calls either succeed together or roll back together.

But in the development of a real system, it is possible that calls between services are asynchronous.

That is, one service sends a message to messaging middleware (MQ) such as RocketMQ, RabbitMQ, Kafka, or ActiveMQ.

Another service then consumes a message from MQ and processes it. This becomes an asynchronous call based on MQ.

So how do you guarantee distributed transactions between services for this kind of asynchronous, MQ-based invocation?

In other words, what we want is for the business logic of multiple services that invoke each other asynchronously through MQ to either succeed together or fail together.

At this point, a reliable-message eventual consistency scheme is needed to implement distributed transactions.

Setting aside the technical challenges of high concurrency and high availability, this kind of distributed transaction scheme is fairly simple as far as the "reliable message" and "eventual consistency" parts themselves are concerned.

2. The core process of the reliable-message eventual consistency scheme

(1) The upstream service delivers messages

To implement a reliable-message eventual consistency scheme, you generally write your own reliable message service and implement some business logic in it.

First, the upstream service needs to send a message to the reliable messaging service.

In plain English, you can think of this message as a call to an interface of the downstream service, which contains the corresponding request parameters.

The reliable messaging service then has to store the message in its database with the status “to be confirmed.”

The upstream service then performs its own local database operations and, based on the result, invokes the reliable messaging service's interface again.

If the local database operation succeeded, it asks the reliable messaging service to confirm the message. If the local database operation failed, it asks the reliable messaging service to delete the message.
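As a minimal sketch of this upstream flow — send first, do the local work, then confirm or delete — here is what the calling side might look like. ReliableMessageClient, OrderDao, the topic name, and the method names are hypothetical, used only to illustrate the three steps:

```java
// Minimal sketch of the upstream side of the protocol; all names are hypothetical.
interface ReliableMessageClient {
    String sendToBeConfirmed(String topic, String payload); // returns the stored message id
    void confirm(String messageId);
    void delete(String messageId);
}

interface OrderDao {
    void insert(String orderJson);
}

public class UpstreamOrderService {

    private final ReliableMessageClient messageClient;
    private final OrderDao orderDao;

    public UpstreamOrderService(ReliableMessageClient messageClient, OrderDao orderDao) {
        this.messageClient = messageClient;
        this.orderDao = orderDao;
    }

    public void placeOrder(String orderJson) {
        // 1. Deliver the message first; the reliable message service stores it as "to be confirmed".
        String messageId = messageClient.sendToBeConfirmed("order_paid_topic", orderJson);
        try {
            // 2. Perform the local database operation.
            orderDao.insert(orderJson);
            // 3a. Local operation succeeded: ask the reliable message service to confirm the message.
            messageClient.confirm(messageId);
        } catch (Exception e) {
            // 3b. Local operation failed: ask the reliable message service to delete the message.
            messageClient.delete(messageId);
            throw e;
        }
    }
}
```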

At this point, if it receives a confirmation, the reliable messaging service updates the message's status in the database to "sent" and delivers the message to MQ.

A key point here: updating the message status in the database and posting the message to MQ must live in one method, wrapped in a single local transaction (a sketch follows the list below).

What does that mean?

  • If updating the message status in the database fails, throw an exception and return without posting to MQ;

  • If posting to MQ fails, throw an exception so the local database transaction is rolled back;

  • The two operations must succeed together or fail together.
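Here is a minimal Spring-style sketch of that confirm step, assuming a reliable_message table accessed through JdbcTemplate and a thin MqSender wrapper around the real producer (the table layout and MqSender are assumptions, not an existing API):

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ReliableMessageService {

    private final JdbcTemplate jdbcTemplate;
    private final MqSender mqSender; // hypothetical thin wrapper around the real MQ producer

    public ReliableMessageService(JdbcTemplate jdbcTemplate, MqSender mqSender) {
        this.jdbcTemplate = jdbcTemplate;
        this.mqSender = mqSender;
    }

    // Both operations run inside one local transaction: if either the UPDATE or the
    // MQ post throws, the transaction rolls back and the status stays "TO_BE_CONFIRMED".
    @Transactional
    public void confirmAndDeliver(long messageId) {
        int updated = jdbcTemplate.update(
                "UPDATE reliable_message SET status = 'SENT' WHERE id = ? AND status = 'TO_BE_CONFIRMED'",
                messageId);
        if (updated != 1) {
            throw new IllegalStateException("message " + messageId + " not in TO_BE_CONFIRMED state");
        }
        String payload = jdbcTemplate.queryForObject(
                "SELECT payload FROM reliable_message WHERE id = ?", String.class, messageId);
        // If this throws (broker down, timeout, ...), the status update above is rolled back.
        mqSender.send("order_paid_topic", payload);
    }
}

interface MqSender {
    void send(String topic, String payload);
}
```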

If instead the upstream service sends a deletion notification, the reliable messaging service simply deletes the message.

(2) Downstream services receive messages

The downstream service waits to consume the message from MQ and, once it receives one, performs its own local database operations.

If the operation succeeds, it in turn notifies the reliable messaging service that it has processed the message successfully, and the reliable messaging service sets the message's status to "completed."
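A rough sketch of the downstream side, again with hypothetical names (the MQ listener plumbing is omitted; ReliableMessageAck stands in for whatever call reports completion back to the reliable message service):

```java
// Hypothetical sketch of a downstream consumer; CreditDao and ReliableMessageAck are assumed names.
public class DownstreamCreditService {

    private final CreditDao creditDao;
    private final ReliableMessageAck ackClient;

    public DownstreamCreditService(CreditDao creditDao, ReliableMessageAck ackClient) {
        this.creditDao = creditDao;
        this.ackClient = ackClient;
    }

    // Called by the MQ consumer (RocketMQ/Kafka/RabbitMQ listener) for each message.
    public void onMessage(String messageId, String payload) {
        // 1. Perform the downstream service's own local database operation.
        creditDao.addCreditsFromOrder(payload);
        // 2. Tell the reliable message service the message is done, so it marks it "COMPLETED".
        ackClient.markCompleted(messageId);
    }
}

interface CreditDao {
    void addCreditsFromOrder(String orderPayload);
}

interface ReliableMessageAck {
    void markCompleted(String messageId);
}
```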

(3) How can the upstream service deliver the message 100% reliably?

Now that you have read the core process above, a big question remains: what if something goes wrong while the message is being delivered?

How do we ensure that messages are delivered 100% reliably from upstream services to downstream services? Don’t worry, let’s break it down one by one.

If the upstream service fails to send the initial "to be confirmed" message to the reliable messaging service, that is fine: the upstream service can sense the invocation exception and simply does not execute the rest of the process, so nothing becomes inconsistent.

But what if something goes wrong after the upstream service has already performed its local database operation, while it is notifying the reliable messaging service to confirm or delete the message?

For example, the notification never succeeds, or it succeeds but the reliable messaging service then fails to deliver the message to MQ. What if any step in this sequence goes wrong?

It doesn’t matter, because in these cases, the status of that message in the reliable messaging service’s database will always be “to be confirmed.”

To handle this, we add a thread to the reliable messaging service that runs periodically in the background and keeps checking the status of each message.

If it’s “to be confirmed” all the time, there’s something wrong with the message.

At this point, you can call back to an interface provided by the upstream service and ask, “Brother, did you successfully execute the database operation corresponding to this message?”

If the upstream service replies that I executed successfully, the Reliable Messaging service changes the message status to “sent” and delivers the message to MQ.

If the upstream service replies that the execution did not succeed, the Reliable Messaging service simply deletes the message from the database.

With this mechanism, it is guaranteed that the reliable messaging service will attempt to complete the delivery of messages to MQ.
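As a sketch, such a background check might look like the following scheduled task inside the reliable message service. The table and column names, the MySQL-style time arithmetic, and UpstreamCallbackClient are assumptions; confirmAndDeliver is the transactional method from the earlier sketch:

```java
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ToBeConfirmedChecker {

    private final JdbcTemplate jdbcTemplate;
    private final UpstreamCallbackClient upstreamClient;          // hypothetical callback to the upstream service
    private final ReliableMessageService reliableMessageService;  // confirmAndDeliver from the earlier sketch

    public ToBeConfirmedChecker(JdbcTemplate jdbcTemplate,
                                UpstreamCallbackClient upstreamClient,
                                ReliableMessageService reliableMessageService) {
        this.jdbcTemplate = jdbcTemplate;
        this.upstreamClient = upstreamClient;
        this.reliableMessageService = reliableMessageService;
    }

    // Every 60s, pick up messages that have been "TO_BE_CONFIRMED" for more than 2 minutes.
    @Scheduled(fixedDelay = 60_000)
    public void checkStaleMessages() {
        List<Map<String, Object>> stale = jdbcTemplate.queryForList(
                "SELECT id FROM reliable_message " +
                "WHERE status = 'TO_BE_CONFIRMED' AND create_time < NOW() - INTERVAL 2 MINUTE");
        for (Map<String, Object> row : stale) {
            long messageId = ((Number) row.get("id")).longValue();
            // Ask the upstream service whether its local database operation actually succeeded.
            if (upstreamClient.didLocalTxSucceed(messageId)) {
                reliableMessageService.confirmAndDeliver(messageId); // mark "SENT" and post to MQ
            } else {
                jdbcTemplate.update("DELETE FROM reliable_message WHERE id = ?", messageId);
            }
        }
    }
}

interface UpstreamCallbackClient {
    boolean didLocalTxSucceed(long messageId);
}
```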

(4) How to ensure the 100% reliable reception of messages by downstream services?

What if there is a problem with the downstream service consumption message and it is not consumed? What if the downstream service fails to process the message?

It doesn’t matter. Develop a background thread in the Reliable messaging service to constantly check the status of messages.

If the status of the message remains “sent” and never becomes “completed,” then the downstream service never succeeds in processing.

At this point the reliable messaging service can try again to repost the message to MQ for downstream services to process again.

The only requirement is that the downstream service's interface logic is idempotent, so that a message processed more than once does not insert duplicate data.
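One common way to get that idempotency is a dedup table with a unique key on the message id, checked in the same local transaction as the business write. A sketch, assuming MySQL-style tables named consumed_message and account_credit:

```java
import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

public class IdempotentCreditHandler {

    private final JdbcTemplate jdbcTemplate;

    public IdempotentCreditHandler(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // The dedup insert and the business insert share one local transaction, so a redelivered
    // message either hits the unique-key conflict and is skipped, or is processed exactly once.
    @Transactional
    public void handle(String messageId, String payload) {
        try {
            // message_id carries a UNIQUE constraint; a second delivery of the same message fails here.
            jdbcTemplate.update(
                    "INSERT INTO consumed_message (message_id) VALUES (?)", messageId);
        } catch (DuplicateKeyException e) {
            return; // already processed once, safely ignore the redelivery
        }
        jdbcTemplate.update(
                "INSERT INTO account_credit (order_payload) VALUES (?)", payload);
    }
}
```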

(5) How to implement the reliable-message eventual consistency scheme based on RocketMQ?

In the general scheme design above, it relies entirely on the various self-checking mechanisms of the reliable messaging service to ensure that:

  • If the upstream service’s database operation fails, the downstream service does not receive any notification

  • If the upstream service’s database operation succeeds, the reliable messaging service will ensure that an invocation message is delivered to the downstream service, and it will ensure that the downstream service must successfully process the message.

Through this mechanism, distributed transactions are guaranteed between services that invoke or notify each other asynchronously through MQ.

In fact, Alibaba's open-source RocketMQ implements the functionality of a reliable messaging service itself, in its transactional messages, with a similar core idea.

RocketMQ does a very good job of implementing a complex architecture to ensure high concurrency, high availability and high performance.

If you’re interested, you can check out RocketMQ support for distributed transactions.
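For a feel of what that looks like, here is a minimal sketch of RocketMQ's transactional-message API (TransactionMQProducer and TransactionListener are the real client classes; the group name, name-server address, topic, and the stubbed local-transaction bodies are placeholders):

```java
import java.nio.charset.StandardCharsets;
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class RocketMqTxDemo {

    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("order_tx_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // placeholder address

        producer.setTransactionListener(new TransactionListener() {
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // Run the upstream service's local database operation here.
                    return LocalTransactionState.COMMIT_MESSAGE;   // deliver the half message
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // discard the half message
                }
            }

            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                // RocketMQ's own "back check": did the local transaction actually commit?
                return LocalTransactionState.COMMIT_MESSAGE; // in real code, look this up in your DB
            }
        });

        producer.start();
        Message msg = new Message("order_paid_topic",
                "{\"orderId\": 1}".getBytes(StandardCharsets.UTF_8));
        producer.sendMessageInTransaction(msg, null);
        producer.shutdown();
    }
}
```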

3. Production practice: high-availability guarantees for the reliable-message eventual consistency scheme

(1) Background introduction

In fact, many readers probably already know the scheme and the reasoning above; it was covered mainly to lay the theoretical groundwork for what follows.

In actual production, if you do not have a high-concurrency scenario, you can follow the ideas above and develop your own reliable message service on top of an MQ middleware.

If you have a high concurrency scenario, you can use RocketMQ’s distributed transaction support to implement all of the above processes.

One of the core topics I want to share with you today is how this solution ensures 99.99% high availability.

In fact, you have probably noticed that the biggest dependency for high availability in this scheme is the high availability of MQ itself.

Any MQ middleware, whether it’s RabbitMQ, RocketMQ or Kafka, has a full set of high availability mechanisms.

So when using the reliable-message eventual consistency scheme in a large organization, we usually rely on the company's infrastructure team to keep MQ highly available.

In other words, you trust that this sibling team can guarantee MQ at 99.99% availability, and that the distributed transactions of the company's business systems will never stop working because the MQ cluster is down.

But the harsh reality is that many small to medium sized companies, and even some large to medium sized companies, have to some extent experienced an overall MQ cluster failure scenario.

Once MQ is completely unavailable, the various services of the business system cannot deliver messages through MQ, resulting in the interruption of business processes.

For example, a friend's company, which also runs an e-commerce business, recently had the MQ middleware cluster deployed on its own machines fail completely, which broke every distributed transaction that depended on MQ and interrupted a large number of business processes.

In this case, it is necessary to implement a set of high availability guarantee mechanism for this distributed transaction scheme.

(2) A high-availability degradation scheme based on queues in a KV store

What follows is a high-availability degradation mechanism for the reliable-message eventual consistency scheme that I once helped a friend's company design.

The mechanism is not complicated, but it is simple and effective: as soon as the MQ middleware fails, the system automatically degrades to a backup plan, which was enough to meet that company's high-availability requirements.

(1) A self-encapsulated MQ client component with failure detection

First of all, to detect MQ failures automatically and then degrade automatically, you have to wrap the MQ client in your own component and publish it to the company's Nexus repository.

All business services that need to support MQ degradation then send messages to MQ and consume messages from MQ through this self-encapsulated component.

Inside the component, you can judge whether MQ has failed based on how writes to MQ behave.

For example, if 10 consecutive retries to deliver a message to MQ fail with exceptions, network errors, and so on, the component can automatically conclude that MQ is down and trip the degrade switch.
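One way such a wrapper might count consecutive failures and trip the switch, as a sketch; MqApi, KvQueueSender, and DegradeSwitch are hypothetical stand-ins for the real producer, the KV fallback described below, and the ZK-backed switch:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper around the real MQ producer; the interfaces below are assumed, not a real library.
public class DegradableMqSender {

    private static final int FAILURE_THRESHOLD = 10;

    private final MqApi mq;                    // real MQ producer underneath
    private final KvQueueSender kvFallback;    // writes to the Redis/KV queues instead
    private final DegradeSwitch degradeSwitch; // backed by a ZooKeeper node
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    public DegradableMqSender(MqApi mq, KvQueueSender kvFallback, DegradeSwitch degradeSwitch) {
        this.mq = mq;
        this.kvFallback = kvFallback;
        this.degradeSwitch = degradeSwitch;
    }

    public void send(String topic, String payload) {
        if (degradeSwitch.isOn()) {          // already degraded: go straight to the KV queues
            kvFallback.send(topic, payload);
            return;
        }
        try {
            mq.send(topic, payload);
            consecutiveFailures.set(0);      // any success resets the counter
        } catch (Exception e) {
            // 10 consecutive delivery failures => assume MQ is down and trip the degrade switch.
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                degradeSwitch.turnOn();
            }
            kvFallback.send(topic, payload); // do not lose this message either
        }
    }
}

interface MqApi { void send(String topic, String payload); }
interface KvQueueSender { void send(String topic, String payload); }
interface DegradeSwitch { boolean isOn(); void turnOn(); void turnOff(); }
```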

(2) Degrading to queues in the KV store

If MQ is down and you want to continue delivering messages, you must find an alternative to MQ.

My friend's company, for example, does not have a high-concurrency scenario: the message volume is small, but the availability requirement is high. In that case MQ can be temporarily replaced with queues built on a KV store such as Redis.

Since Redis already supports lists and other queue-like data structures, you can write the messages into those queue-like structures in the KV store.

PS: For basics such as Redis's data formats and supported data structures, there is plenty of material online to consult.

However, there are several large pits that must be paid attention to.

First, do not write too much data into any single collection-type data structure in the KV store; otherwise you end up with a huge value, which has serious consequences.

So you cannot simply create one key in Redis and keep writing every message into that one data structure.

Second, never keep writing data to the data structures behind only a small number of keys, because that creates hot keys, i.e. keys that receive a disproportionate share of the traffic.

Keep in mind that a KV cluster generally shards data across machines by hashing the key; if you write to only a few keys, one machine in the cluster will receive too much traffic and become overloaded.

Based on the above considerations, the following is the scheme designed by the author at that time:

  • Based on their daily message volume, a fixed set of several hundred queues is created in the KV store, corresponding to several hundred keys.

  • This ensures that not too many messages pile up in any single key's data structure, and that writes are not concentrated on just a few keys.

  • In the event of an MQ failure, the reliable messaging service hashes each message evenly across these few hundred fixed queues in the KV store.

At the same time, a degrade switch needs to be flipped via ZooKeeper (ZK), so that all MQ reads and writes across the whole system are degraded immediately (a sketch of the write side follows).
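A sketch of the write side of that fallback using Jedis; the key prefix, queue count, and the way the message id is hashed are assumptions chosen only to illustrate the sharding:

```java
import redis.clients.jedis.Jedis;

public class KvQueueWriter {

    private static final int QUEUE_COUNT = 300;                   // fixed number of fallback queues
    private static final String KEY_PREFIX = "mq_fallback_queue_"; // assumed key naming

    private final Jedis jedis;

    public KvQueueWriter(Jedis jedis) {
        this.jedis = jedis;
    }

    // Hash each message evenly across the fixed set of list keys so that no single key
    // grows into a huge value and no single key becomes a hot key.
    public void send(String messageId, String payload) {
        int bucket = Math.abs(messageId.hashCode() % QUEUE_COUNT);
        String key = KEY_PREFIX + bucket;
        jedis.lpush(key, payload); // LPUSH onto that bucket's Redis list
    }
}
```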

(3) How downstream services perceive the degradation

Downstream services also consume from MQ through the self-encapsulated component. When the component learns from ZK that the degrade switch is on, it first checks whether it can still consume data from MQ.

If it cannot, it starts multiple threads that concurrently pull data from the several hundred queues in the KV store.

Each piece of data it pulls is handed over to the downstream service's business logic to execute.
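And a sketch of that read side, again with Jedis; the thread count, key names, and the omission of the degrade-switch check are all simplifications:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class KvQueueConsumer {

    private static final int QUEUE_COUNT = 300;    // must match the writer's queue count
    private static final int WORKER_THREADS = 16;
    private static final String KEY_PREFIX = "mq_fallback_queue_";

    private final JedisPool pool;
    private final MessageHandler handler; // the downstream service's business logic
    private final ExecutorService workers = Executors.newFixedThreadPool(WORKER_THREADS);

    public KvQueueConsumer(JedisPool pool, MessageHandler handler) {
        this.pool = pool;
        this.handler = handler;
    }

    // Each worker owns every WORKER_THREADS-th queue key and polls its slice in a loop.
    public void start() {
        for (int w = 0; w < WORKER_THREADS; w++) {
            final int offset = w;
            workers.submit(() -> {
                try (Jedis jedis = pool.getResource()) {
                    while (!Thread.currentThread().isInterrupted()) {
                        for (int q = offset; q < QUEUE_COUNT; q += WORKER_THREADS) {
                            // BRPOP blocks for up to 1 second waiting for a message on this key.
                            List<String> item = jedis.brpop(1, KEY_PREFIX + q);
                            if (item != null && item.size() == 2) {
                                handler.handle(item.get(1)); // item = [key, payload]
                            }
                        }
                    }
                }
            });
        }
    }
}

interface MessageHandler { void handle(String payload); }
```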

Through this mechanism, automatic fault awareness and automatic degradation of MQ failures are achieved. If the load and concurrency of the system are not too high, this scheme is generally fine.

In production, including a large number of disaster-recovery drills and the actual failures that did occur, it has proven effective at keeping business processes running automatically when MQ fails.

(4) Automatic fault recovery

Once the degrade switch is on, the self-encapsulated component also starts a thread that, every so often, tries to send a test message to MQ to see whether it has recovered.

If MQ has recovered and can deliver messages normally, the degrade switch is turned off via ZK, the reliable messaging service resumes delivering messages to MQ, and the downstream services switch back to consuming from MQ once they have confirmed that the queues in the KV store are empty.
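A sketch of that recovery probe, reusing the hypothetical MqApi and DegradeSwitch from the wrapper sketch above; the heartbeat topic and the 30-second interval are made up:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically probes MQ while degraded; once a probe succeeds, flips the ZK-backed switch off.
public class MqRecoveryProbe {

    private final MqApi mq;                    // from the earlier wrapper sketch (hypothetical)
    private final DegradeSwitch degradeSwitch; // ZK-backed switch (hypothetical)
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public MqRecoveryProbe(MqApi mq, DegradeSwitch degradeSwitch) {
        this.mq = mq;
        this.degradeSwitch = degradeSwitch;
    }

    public void start() {
        scheduler.scheduleWithFixedDelay(() -> {
            if (!degradeSwitch.isOn()) {
                return; // not degraded, nothing to probe
            }
            try {
                mq.send("mq_heartbeat_topic", "ping"); // test delivery
                degradeSwitch.turnOff();               // MQ is back: stop degrading new traffic
                // Downstream services switch back to MQ only after draining the KV queues.
            } catch (Exception e) {
                // still down; try again on the next tick
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```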

(5) More business details

The scheme above is only a general degradation approach; how it actually lands has to be decided against each company's specific business details, many of which cannot be covered in an article.

For example, do you need to preserve message ordering? Do you need to generate large numbers of keys dynamically based on the business? And so on.

Also, implementing this solution has a real cost, so the first recommendation is to push your company's infrastructure team as hard as possible to guarantee 99.99% MQ availability and avoid outages in the first place.

Second, make the decision based on your company's actual high-availability requirements. If the occasional MQ outage is tolerable, then you do not need to implement this degradation scheme at all.

But if company leaders believe that business system processes must continue to run after MQ middleware goes down, there are some highly available degradation options to consider, such as the one mentioned in this article.

Finally, for companies dealing with high concurrency in the tens or hundreds of thousands of requests per second, an MQ degradation scheme is far more complex to design and far harder to implement.

END
