This article was shared by Jing Song of the Ali Xianyu technical team. The original title was “99.9% arrival rate: Xianyu messaging changes its engine on the highway”; the text has been revised. Thanks for sharing.

1. Introduction

At the beginning of 2020, I took over Xianyu's IM (instant messaging) system. At that time messaging had all kinds of problems, and complaints from online users kept coming in.

Typical complaints included:

  • 1) “Chat messages are often lost”;
  • 2) “User avatars in messages are mixed up”;
  • 3) “The order status is wrong” (I believe some of you reading this article are still complaining about Xianyu's messages).

So the stability and reliability of Xianyu's instant messaging system was a problem that urgently needed solving.

We investigated some solutions within the group, such as Dingding's IMPaaS. A hasty direct migration would carry relatively high technical cost and risk: server data would need to be double-written, old and new versions would need to stay compatible, and so on.

So how could we improve message stability and reliability on top of Xianyu's existing instant messaging architecture and technology stack? Where should the work begin? What was the current state of the system? How could it be measured objectively? I hope this article lets you see a different Xianyu instant messaging system.

PS: If you are not familiar with IM message reliability, you are advised to read this introductory article “Getting Started with Zero Base IM Development (3) : What is IM System Reliability?” .

2. Series of articles

This is the fourth in a series of articles with the following table of contents:

  • Ali IM Technology Sharing (I) : The King of Enterprise-level IM — The Excellence of Dingding in back-end Architecture
  • Ali IM Technology Sharing (II) : Xianyu IM Mobile Terminal Cross-transformation Practice based on Flutter
  • Ali IM Technology Sharing (III) : The Architecture Evolution of Xianyu's Hundred-Million-Level IM Message System
  • Ali IM Technology Sharing (IV) : Reliable Delivery Optimization Practice of Xianyu's Hundred-Million-Level IM Message System (* this article)

3. Industry solutions

After reviewing the mainstream approaches to reliable message delivery that have been shared online, here is a brief summary.

Generally, the delivery path of an IM message is divided into three steps:

  • 1) Sender sends;
  • 2) The server receives the message and persists it to the database;
  • 3) The server notifies the receiver.

In particular, the network environment on mobile devices is complicated:

  • 1) You may be sending a message when the network suddenly goes down;
  • 2) A message may still be in the sending state when the network suddenly recovers, and it then needs to be resent.

The technical schematic diagram is as follows:

PS: Many people may not have a systematic understanding of how complex mobile networks are. The following articles are worth reading:

  1. Easy to Understand the “weak” and “Slow” of mobile Networks
  2. Summary of the Most Complete Mobile Weak Network Optimization Methods in history
  3. Why is WiFi signal bad?
  4. Why cell phone signal is bad?
  5. How difficult is wireless Internet access on high-speed trains?

Then, how to deliver IM messages stably and reliably in such a complex network environment?

The sender does not know whether a message has been delivered. To guarantee delivery, an acknowledgement mechanism needs to be added.

The mechanism follows roughly this logic:

  • 1) The sender sends a message “Hello” and enters the waiting state;
  • 2) The receiver receives the message “Hello” and tells the sender that I have received the message;
  • 3) After the sender receives the confirmation message, the process is considered complete, otherwise it will retry.
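
The three steps above can be sketched as a small retry loop. This is only an illustration, not Xianyu's implementation: `send_with_retry`, `flaky_send`, and the `max_attempts` limit are hypothetical names and parameters, and `send_once` stands in for whatever transport call reports that the confirmation arrived in time.

```python
from typing import Callable, List


def send_with_retry(send_once: Callable[[str], bool], text: str,
                    max_attempts: int = 3) -> bool:
    """Send `text` and wait for a confirmation, retrying when none arrives.

    `send_once` is a hypothetical transport call that returns True only
    when the receiver's confirmation came back before the timeout.
    """
    for _ in range(max_attempts):
        if send_once(text):
            return True   # confirmation received, the process is complete
    return False          # give up and let the caller surface the failure


# Simulated flaky link: the confirmation is lost on the first two attempts.
attempts: List[str] = []


def flaky_send(text: str) -> bool:
    attempts.append(text)
    return len(attempts) >= 3


delivered = send_with_retry(flaky_send, "Hello")
print(delivered, len(attempts))
```

Because the same message may be sent more than once, the receiving side has to tolerate duplicates, which is one reason a unique message ID matters later in this article.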

The above process seems simple, but the key is that there is a server-side forwarding process in the middle. The question is who sends the acknowledgement back and when.

The model most often found online is the following guaranteed-delivery model:

The packet types are described as follows:

As shown in the two figures above, the sending process is as follows:

  • 1) A sends a message request packet to the IM-server, namely MSG :R1;
  • 2) After successful processing, the IM-server replies to A with a message response packet, namely MSG :A1;
  • 3) If B is online, the IM-server sends a message notification packet to B, namely MSG :N1 (of course, if B is not online, the message will be stored offline).

As shown in the two figures above, the receiving process is as follows:

  • 1) B sends an ACK request packet to the IM-server, that is, ACK :R2.
  • 2) After successful processing, the IM-server replies to B with an ACK response packet, that is, ACK :A2;
  • 3) The IM-server sends an ACK notification packet to A, that is, ACK :N2.

As the model above shows, the reliable delivery of one message is guaranteed by six packets. A failure at any step can be detected through the request-ACK mechanism and handled by retrying.
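
To make the six-packet exchange concrete, the sketch below lists the packets for one message from A to B. The packet names come from the model above; the helper functions and the tuple layout are assumptions made for illustration only.

```python
from typing import List, Tuple

Packet = Tuple[str, str, str]  # (packet name, from, to)


def delivery_trace(receiver_online: bool) -> List[Packet]:
    """List the packets exchanged for one message sent from A to B."""
    trace: List[Packet] = [
        ("MSG:R1", "A", "IM-server"),   # A's message request
        ("MSG:A1", "IM-server", "A"),   # server's response to A
    ]
    if not receiver_online:
        return trace                    # the message is stored offline instead
    trace += [
        ("MSG:N1", "IM-server", "B"),   # message notification to B
        ("ACK:R2", "B", "IM-server"),   # B's ACK request
        ("ACK:A2", "IM-server", "B"),   # server's response to B
        ("ACK:N2", "IM-server", "A"),   # ACK notification to A
    ]
    return trace


def sender_knows_delivered(trace: List[Packet]) -> bool:
    """In this sketch, A treats the message as delivered once ACK:N2 arrives."""
    return any(name == "ACK:N2" and to == "A" for name, _, to in trace)


print(len(delivery_trace(True)), sender_knows_delivered(delivery_trace(False)))
```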

The solution we finally adopted is based on the above model. The client sends messages directly over HTTP, so no retry is needed there for now. Retry logic is added where the server pushes to the client.

For reasons of space, this article will not go into more detail. If you are interested, the following are worth studying systematically:

  1. “Message Reliability and Delivery Mechanism of Mobile IM from client’s Perspective”
  2. Realization of IM Message Delivery Guarantee Mechanism (I) : Ensuring reliable Delivery of Online Real-time Messages
  3. Realization of IM Message Delivery Guarantee Mechanism (II) : Ensuring reliable Delivery of offline Messages
  4. How to design a “Retry Failure” mechanism for a completely self-developed IM?
  5. A set of IM Architecture Technologies for Hundreds of millions of Users (Part 2) : Reliability, Order, Weak Network Optimization, etc.
  6. Understanding the “Reliability” and “Consistency” Problems of IM Messaging and Exploring solutions
  7. Rongyun Technology Sharing: Fully reveal the Reliable delivery mechanism of 100 million IM Messages

4. Specific problems currently faced

4.1 Overview

Before addressing the issue of reliable delivery of messages, we should certainly first identify the specific problems we are facing.

However, when I took over this instant messaging system, there was no accurate data to refer to. The first step was therefore a complete survey of the system, so we added tracking points (instrumentation) along the full message link.

The tracking points along the link are as follows:

Based on the whole link of messages, we sorted out several key indicators:

  • 1) Sending success rate;
  • 2) Message arrival rate;
  • 3) Client persistence rate (messages written to the client's local database).

All of these statistics are based on tracking points, but while adding them we found a big problem: the instant messaging system had no globally unique message ID. As a result, the life cycle of a message could not be uniquely identified across the full link.

4.2 Message uniqueness problem

As shown in the figure above, the uniqueness of the current message is determined by three variables:

  • 1) SessionID: ID of the current session.

  • 2) SeqID: the serial number of the message sent locally by the user. The server does not care about this value and passes it through untouched.

  • 3) Version: This is the important one. It is the serial number of the message within the current session. The server's value is authoritative, but the client also generates a provisional (fake) Version locally.

The figure above shows what happens when A and B send messages at the same time. Each client generates the key information above locally. When A's message (yellow) reaches the server first, no other message with that version exists yet, so the server returns the original data to A. When client A receives it, it merges it with the local message and keeps only one copy. At the same time, the server also sends this message to B. Because B also has a local message with version=1, the message from the server is filtered out, causing message loss.

When B's message then reaches the server, the server increments its version to 2 because a message with version=1 already exists. This message is sent to A and merges with A's local messages without a problem. However, when it is returned to B and merged with B's local messages, two identical messages appear, which is message duplication. This is the main reason why Xianyu used to lose and duplicate messages.
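
The race described above can be reproduced with a deliberately simplified model. This is not the real client code: it assumes the local merge is keyed only by Version and that an already occupied slot wins, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Msg:
    sender: str
    text: str
    version: int  # position in the session; clients guess it locally


class Server:
    def __init__(self) -> None:
        self.last_version = 0

    def accept(self, msg: Msg) -> Msg:
        """Assign the authoritative version and return the stored message."""
        self.last_version += 1
        return Msg(msg.sender, msg.text, self.last_version)


class Client:
    def __init__(self) -> None:
        self.by_version: Dict[int, Msg] = {}

    def send_local(self, sender: str, text: str) -> Msg:
        guess = max(self.by_version, default=0) + 1   # provisional version
        msg = Msg(sender, text, guess)
        self.by_version[guess] = msg
        return msg

    def receive(self, msg: Msg) -> None:
        # Merge keyed by version only: an occupied slot filters the message out.
        self.by_version.setdefault(msg.version, msg)

    def texts(self) -> List[str]:
        return [m.text for _, m in sorted(self.by_version.items())]


server, a, b = Server(), Client(), Client()
a_msg = a.send_local("A", "hi from A")   # both clients guess version=1
b_msg = b.send_local("B", "hi from B")

for stored in (server.accept(a_msg), server.accept(b_msg)):
    a.receive(stored)
    b.receive(stored)

print(a.texts())  # A ends up with both messages
print(b.texts())  # B lost A's message and shows its own message twice
```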

4.3 Message push Logic Problems

The message push logic also had a big problem. Because the sender uses HTTP requests, sending the message content is basically not an issue; the problem arises when the server pushes to the other end.

As shown below:

As shown in the figure above: when the server pushes to the client, it first checks whether the client is online. If the client is online, it pushes the message over the long connection. If not, it sends an offline push instead.

This approach is simple and crude: if the long connection is unstable, the real state of the client and the state stored on the server become inconsistent, and the message is never pushed to the client.

4.4 Client Logic Faults

In addition to the server-side problems above, there is another class of problems caused by the design of the client itself.

It can be summarized as follows:

  • 1) Multi-threading problems: users reported that the layout of the message list page gets garbled, because the UI starts rendering before the local data is fully initialized;

  • 2) Inaccurate unread counts and red dots: the data displayed locally is inconsistent with what is stored in the database;

  • 3) Message merge problem: local messages are merged in segments, which cannot guarantee the continuity and uniqueness of messages.

Given these problems, we first went through the client code and refactored its architecture.

The architecture is shown below:

5. Our optimization work 1: upgrading the core

The first step is to solve the uniqueness of the current message system.

We also investigated Dingding's solution, in which the server globally maintains the unique ID of each message. Considering the historical baggage of Xianyu's instant messaging system, we adopted a UUID as the unique message ID, which greatly improves both link tracking and message de-duplication.

5.1 Resolving message Uniqueness

On newer versions of the app, the client generates the UUID. For older versions that cannot, the server fills it in.
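
A minimal sketch of this rule, using Python's standard `uuid` module. The function names are hypothetical, and the choice of a random (version 4) UUID is an assumption; the article only says that a UUID is used.

```python
import uuid
from typing import Optional


def new_message_id() -> str:
    """Generate a 32-character lowercase hex ID from a random UUID."""
    return uuid.uuid4().hex


def ensure_message_id(client_supplied: Optional[str]) -> str:
    """Server side: keep the client's ID, or fill one in for old clients."""
    return client_supplied if client_supplied else new_message_id()


print(len(new_message_id()))
```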

A message ID looks like a1a3ffa118834033ac7a8b8353b7c6d9. After receiving messages, the client first de-duplicates them by MessageID and then sorts them by Timestamp. Although the clocks on different clients may not agree, the probability of a repeat is still fairly small.

Taking iOS as an example, the code is as follows:

- (void)combileMessages:(NSArray<PMessage *> *)messages {
    // ...

    // 1. Deduplicate by messageId: a message with an existing ID replaces the old entry
    NSMutableDictionary *messageMaps = [self containerMessageMap];
    for (PMessage *message in messages) {
        [messageMaps setObject:message forKey:message.messageId];
    }

    // 2. Sort the messages after merging
    NSMutableArray *tempMsgs = [NSMutableArray array];
    [tempMsgs addObjectsFromArray:messageMaps.allValues];
    [tempMsgs sortUsingComparator:^NSComparisonResult(PMessage * _Nonnull obj1, PMessage * _Nonnull obj2) {
        // Sort ascending by timestamp; return a real NSComparisonResult, not a BOOL
        if (obj1.timestamp < obj2.timestamp) return NSOrderedAscending;
        if (obj1.timestamp > obj2.timestamp) return NSOrderedDescending;
        return NSOrderedSame;
    }];

    // ...
}

5.2 Implementing message retransmission and reconnection

Based on the retransmission and reconnection model in section 3 (industry solutions) of this article, we improved the message retransmission logic on the server side and the disconnect-and-reconnect logic on the client side.

Specific measures are as follows:

  • 1) The client will periodically check whether the ACCS long connection is connected;
  • 2) The server checks whether the device is online. If it is, the server pushes the message and waits for an acknowledgement, with a timeout;
  • 3) After receiving the message, the client will return an Ack.
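
One way to picture the server side of these measures is a table of pushes that are still waiting for an Ack. The sketch below is an assumption about how such bookkeeping could look, not the actual ACCS or Xianyu server code; `PendingAcks`, its methods, and the 5-second timeout are all hypothetical.

```python
from typing import Dict, List


class PendingAcks:
    """Track pushed messages until the client acknowledges them."""

    def __init__(self, timeout: float = 5.0) -> None:
        self.timeout = timeout
        self._sent_at: Dict[str, float] = {}

    def pushed(self, message_id: str, now: float) -> None:
        self._sent_at[message_id] = now          # start the timeout wait

    def acked(self, message_id: str) -> None:
        self._sent_at.pop(message_id, None)      # Ack arrived, stop tracking

    def due_for_retry(self, now: float) -> List[str]:
        """Messages whose Ack did not arrive within the timeout."""
        return sorted(m for m, t in self._sent_at.items()
                      if now - t >= self.timeout)


pending = PendingAcks(timeout=5.0)
pending.pushed("m1", now=0.0)
pending.pushed("m2", now=1.0)
pending.acked("m1")                     # the client returned an Ack for m1
print(pending.due_for_retry(now=6.0))   # m2 timed out and should be pushed again
```

With this kind of bookkeeping, a stale "online" flag no longer loses the message silently: a push that is never acknowledged shows up as due for retry.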

5.3 Optimizing the Data Synchronization Logic

Retransmission and reconnection solve the problems at the basic network layer. Next, let's look at the business layer.

In the existing messaging system, many complex situations were handled by adding compatibility code in the business layer, and message data synchronization is a typical example.

Before improving the data synchronization logic, we also investigated Dingding's complete data synchronization solution. It is guaranteed mainly by the server side and is backed by a stable long connection.

Dingding's data synchronization process is as follows:

Our server side does not have this capability, so on the Xianyu side the data synchronization logic can only be controlled from the client.

Data synchronization modes include:

  • 1) Pull session;
  • 2) Pull message;
  • 3) Push messages, etc.

Because the scenarios involved are complex, there used to be a case where a push would trigger an incremental sync. If too many pushes arrived, several network requests were triggered at the same time. To solve this, we isolated the push and pull queues.

The client-side strategy is: if a pull is in progress, pushed messages are added to a cache queue, and when the pull result comes back it is merged with the locally cached messages. This avoids issuing multiple network requests.
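
The strategy can be sketched as follows. This is a single-threaded illustration under stated assumptions: messages are plain dicts with an "id" key, and `SyncController` and its method names are hypothetical.

```python
from typing import Dict, List


class SyncController:
    """Keep pushes from triggering extra pulls while a pull is in flight."""

    def __init__(self) -> None:
        self.pulling = False
        self.store: Dict[str, dict] = {}     # message id -> message
        self.push_cache: List[dict] = []     # pushes held back during a pull
        self.pull_requests = 0

    def start_pull(self) -> bool:
        if self.pulling:
            return False                     # a pull is already running
        self.pulling = True
        self.pull_requests += 1              # the only place a request is made
        return True

    def on_push(self, msg: dict) -> None:
        if self.pulling:
            self.push_cache.append(msg)      # cache it instead of pulling again
        else:
            self._merge([msg])

    def finish_pull(self, pulled: List[dict]) -> None:
        self._merge(pulled + self.push_cache)
        self.push_cache.clear()
        self.pulling = False

    def _merge(self, msgs: List[dict]) -> None:
        for m in msgs:
            self.store[m["id"]] = m          # same id overwrites, so no duplicates


ctl = SyncController()
ctl.start_pull()
ctl.on_push({"id": "m2"})                    # arrives while the pull is running
ctl.finish_pull([{"id": "m1"}, {"id": "m2"}])
print(sorted(ctl.store), ctl.pull_requests)
```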

5.4 Client Data Model optimization

Client data is organized into two main types: sessions and messages. Sessions are further divided into virtual nodes, session nodes, and folder nodes.

On the client side, a tree like the one shown above is built. The tree mainly stores the information needed to display sessions, such as unread counts, red dots, and the latest message summary. An update to a child node is automatically propagated to its parent node, so building the tree is also the process of updating read and unread state.

One of the more complex scenarios is Xianyu Qingbaoshe (the Xianyu intelligence bureau). It is actually a folder node that contains many child sessions, which makes its message sorting, red-dot counting, and summary update logic more complex. The server tells the client the list of child sessions, and the client then assembles the data model from them.
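
The upward propagation of unread counts can be sketched with a small tree. This is a simplified assumption about the data model (one counter per node, no red dots or summaries), and `SessionNode` is a hypothetical name.

```python
from typing import Optional


class SessionNode:
    """A node in the session tree; folders and sessions share this shape."""

    def __init__(self, name: str, parent: Optional["SessionNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.unread = 0

    def add_unread(self, delta: int) -> None:
        """Apply a change here and propagate it to every ancestor."""
        node: Optional[SessionNode] = self
        while node is not None:
            node.unread += delta
            node = node.parent


root = SessionNode("root")                    # virtual node
folder = SessionNode("folder", parent=root)   # folder node with child sessions
chat_a = SessionNode("chat-a", parent=folder)
chat_b = SessionNode("chat-b", parent=root)

chat_a.add_unread(3)     # three new messages in a session inside the folder
chat_b.add_unread(1)
chat_a.add_unread(-2)    # the user read two of them
print(chat_a.unread, folder.unread, root.unread)
```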

5.5 Server Storage Model Optimization

The previous section outlined the client's request logic: historical messages are synchronized through incremental and full domain syncs.

The domain is actually a server-side concept. In essence it is a cache layer for user messages: messages are temporarily stored in it to speed up reads.

However, this design also has a defect: the domain ring has a limited length and holds at most 256 messages. When a user has more than 256 messages, the rest can only be read from the database.
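
The limit can be illustrated with a bounded ring in front of a database. Only the 256-message capacity comes from the text; the class, the list standing in for the database, and the read policy are assumptions for this sketch.

```python
from collections import deque
from typing import Deque, List, Tuple

DOMAIN_CAPACITY = 256  # maximum number of messages the domain ring can hold


class DomainCache:
    def __init__(self, capacity: int = DOMAIN_CAPACITY) -> None:
        self.ring: Deque[int] = deque(maxlen=capacity)
        self.database: List[int] = []        # stand-in for persistent storage

    def append(self, message: int) -> None:
        self.database.append(message)
        self.ring.append(message)            # the oldest entry is evicted when full

    def read_latest(self, count: int) -> Tuple[List[int], str]:
        """Serve from the ring when it holds enough, else fall back to the database."""
        if count <= len(self.ring):
            cached = list(self.ring)
            return cached[len(cached) - count:], "cache"
        start = max(0, len(self.database) - count)
        return self.database[start:], "database"


cache = DomainCache()
for i in range(300):
    cache.append(i)

print(cache.read_latest(10)[1])    # recent messages come from the cache
print(cache.read_latest(280)[1])   # more than 256 requires the database
```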

As for the server-side storage model, we also investigated Dingding's approach, write diffusion. Its advantage is that messages can be customized for each user; its disadvantage is that it requires a very large amount of storage.

Our approach sits somewhere between read diffusion and write diffusion. This design not only complicates the client logic but also slows down data reads on the server side, and it is something we can optimize later.

6. Our optimization work 2: adding a quality monitoring system

While reworking the full link across the client and the server, we also built monitoring and troubleshooting for online message behavior.

6.1 Full-link troubleshooting

Full-link troubleshooting is based on users' real-time behavior logs. Client tracking events are cleaned by Flink, the group's real-time processing engine, and written to SLS.

User actions include:

  • 1) Message processing by message engine;
  • 2) User’s behavior of clicking/visiting the page;
  • 3) User’s network request.

The server side also produces logs for long-connection pushes and retries, and these are cleaned into SLS as well. Together they form a way to troubleshoot the full link from the server to the client.

6.2 Reconciliation system

Of course, in order to verify the accuracy of the message, we also made a reconciliation system:

When a user leaves a session, the client takes a certain number of messages from that session, generates an MD5 check code, and reports it to the server. The server then uses the check code to determine whether the messages are correct.

Based on verification of sampled data, message accuracy is roughly 99.99%.
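
A minimal sketch of such a check code, using Python's standard `hashlib`. What exactly gets hashed is not stated in the article; here it is assumed to be the IDs of the most recent messages in the session, and `session_check_code` and `sample_size` are hypothetical names.

```python
import hashlib
from typing import List


def session_check_code(message_ids: List[str], sample_size: int = 20) -> str:
    """MD5 over the IDs of the most recent `sample_size` messages, in order."""
    start = max(0, len(message_ids) - sample_size)
    joined = "\n".join(message_ids[start:])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


client_view = ["m1", "m2", "m3"]
server_view = ["m1", "m2", "m3"]
print(session_check_code(client_view) == session_check_code(server_view))

# A message missing on the client changes the check code.
print(session_check_code(["m1", "m3"]) == session_check_code(server_view))
```

MD5 is used here only as a cheap consistency fingerprint, not as a security measure.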

7. Optimization of statistical methods of data indicators

When we calculated the key message indicators, we ran into a problem: we used to compute them from user tracking data, and found a discrepancy of 3% to 5% in the results.

Later, we used the sampled real-time reported data to calculate data indicators:

Message arrival rate = number of messages actually received by the client / number of messages the client should have received

A message counts as actually received by the client once it has been persisted to the client's local database.

This indicator does not distinguish between offline and online users. We take the last time the user's device was updated on a given day; in theory, every message delivered that day before that time should have been received.
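
The definition above can be worked through with a small calculation. The data layout (pairs of message ID and delivery time, plus a set of persisted IDs) and the function name are assumptions for illustration, and the numbers are made up.

```python
from typing import List, Set, Tuple


def arrival_rate(delivered: List[Tuple[str, float]],
                 persisted_ids: Set[str],
                 last_update_time: float) -> float:
    """Messages persisted on the client / messages it should have received."""
    # Only messages delivered before the device's last update are expected.
    expected = {mid for mid, sent_at in delivered if sent_at < last_update_time}
    if not expected:
        return 1.0   # convention chosen for this sketch: nothing expected, nothing missed
    return len(expected & persisted_ids) / len(expected)


delivered = [("m1", 10.0), ("m2", 20.0), ("m3", 30.0), ("m4", 50.0)]
persisted = {"m1", "m2", "m4"}

# m4 was delivered after the last device update (t=40), so it is not expected.
rate = arrival_rate(delivered, persisted, last_update_time=40.0)
print(round(rate, 4))
```

In this made-up example three messages were expected and two were persisted, so the arrival rate is 2/3.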

After the optimization work above, the message arrival rate of our latest version has basically reached 99.9%. Judging from user feedback, complaints about lost messages have indeed dropped considerably.

8. Future planning

Overall, after a year of optimization and governance, the indicators of our instant messaging system are steadily improving.

But there are some areas that need to be optimized:

  • 1) Insufficient message security: messages can easily be abused by attackers to send illegal content;

  • 2) Weak message extensibility: adding new cards or capabilities requires shipping a new version, and dynamic, extensible capabilities are lacking;

  • 3) Low extensibility of the underlying protocol: the protocol is hard to extend, so we should standardize it later.

From a business perspective, messaging should be a horizontal supporting tool or platform product that second-party and third-party teams can integrate with quickly.

In the future, we will continue to pay attention to user feedback about messaging, and we hope that Xianyu's instant messaging system can help users complete their transactions more smoothly. (This article is simultaneously published at: www.52im.net/thread-3706…)