This article was shared by the Alibaba Xianyu technical team and has been revised for this publication.

1. Introduction

After several generations of iteration, the Xianyu instant messaging (IM) system can now stably support message volumes in the hundreds of millions.

In building this messaging system, we went from simple to complex and from being stuck to breaking through. Each technical change was made to better solve the problems the business faced at the time.

This article shares how the architecture of the Xianyu instant messaging system evolved from scratch, in the hope that peers can draw on this experience and find useful inspiration.

2. Series of articles

This article is the third in a series of articles, with the following table of contents:

  1. Ali IM Technology Sharing (I): The King of Enterprise-level IM: The Excellence of DingTalk in Back-end Architecture
  2. Ali IM Technology Sharing (II): Xianyu IM Mobile Cross-platform Transformation Practice Based on Flutter
  3. Ali IM Technology Sharing (III): The Architecture Evolution of the Xianyu 100-Million-Level IM Message System (* this article)
  4. Ali IM Technology Sharing (IV): The Practice of Reliable Delivery in the Xianyu 100-Million-Level IM Message System (* to be released later)

3. Version 1.0: Business start-up, minimum viable system

3.1 Technical Background

In 2014, we launched "Xianyu", a standalone app for trading idle (second-hand) goods. The first phase completed the app's main flow: item publishing → search → item details → IM session → transaction.

As a start-up app, the business needed to go live as soon as possible to validate the idea. On the technical side, the Xianyu messaging system had to be built from scratch.

3.2 Technical Solutions

For an instant messaging system, the minimum set of capabilities includes:

  • 1) Message storage: session, summary, message;

  • 2) Message synchronization: push and pull;

  • 3) Message channel: long connection, vendor push.

Unlike the general IM session model, a Xianyu session is centered on the item: a session is built from the elements "person + person + item".

Because of this difference in session model, the existing Taobao-family messaging systems could not meet the business requirements in the short term, while building a fully independent messaging system would have taken Xianyu a lot of time.

To get the business online efficiently, the technology selection maximized the reuse of existing system capabilities and avoided reinventing the wheel.

Therefore, our technical solution is as follows:

  • 1) The data model and the underlying storage are built on the amoebae private message system;
  • 2) For message data acquisition, the client pulls message data from the server in full;
  • 3) The communication protocol uses an SDK and MTOP.

The overall architecture is shown in the following figure. It allowed fast delivery while providing the minimum capabilities the business needed:

4. Version 2.0: Rapid user growth, rebuilding the messaging system

4.1 Technical Background

The number of Xianyu users quickly exceeded 1 million, and calls to the instant messaging services soared. Users increasingly reported delayed message loading and blank screens, and the large volume of Push messages triggered frequent system alarms.

The cause of these problems: in the 1.0 architecture, message data is fetched by full pull, and the client does not store any data locally.

Specifically:

  • 1) When a user views messages, whether the pull succeeds depends on the network and on data access speed, which occasionally causes delays and blank screens;
  • 2) Data storage is centralized, reads far outnumber writes, and under high concurrency the server load becomes too heavy.

On the second point: for example, with 10,000 users chatting online at the same time, the current architecture pulls full message data concurrently, for an estimated 50,000 QPS. With 100,000 concurrent online chat users, the pressure on the server is easy to imagine.
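The estimate above can be reproduced with simple arithmetic. The sketch below is only illustrative: the per-user request rate is not stated in the article and is merely implied by the 10,000 users / 50,000 QPS figures, and the linear scaling to 100,000 users is an assumption.

```python
def estimated_qps(online_users: int, requests_per_user_per_sec: float) -> float:
    """Rough load estimate, assuming load grows linearly with online users."""
    return online_users * requests_per_user_per_sec


# Rate implied by the article's figures: 50,000 QPS for 10,000 online users.
implied_rate = 50_000 / 10_000

# Linear extrapolation (an assumption) to 100,000 concurrent online users.
projected_qps = estimated_qps(100_000, implied_rate)

print(f"implied rate: {implied_rate} requests/s per user")
print(f"projected load: {projected_qps:,.0f} QPS")
```

Under these assumptions, a tenfold increase in online users means a tenfold increase in full-pull traffic, which is why the full-pull model could not be kept.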

4.2 Technical Solutions

Based on the above issues, we decided to rebuild the messaging system architecture to handle the larger user growth in the future.

Back to the core features of IM systems:

4.2.1) Message storage model:

  • 1) Session model: Owner, ItemID, User and sessionType together uniquely identify a session, and extended attributes are added to support personalization;

  • 2) Summary model: each user's view of a session, so that different users in the same session can see a personalized presentation. A summary is uniquely identified by userID and SID;

  • 3) Message model: consists of the sender, message content, message version and SID. SID + message version uniquely identifies a message;

  • 4) Instruction model: a convention between server and client, that is, a set of instructions issued by the server and executed by the client, such as do-not-disturb and delete instructions.
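The identifying keys of the four models can be sketched as plain data classes. This is a minimal illustration, not the actual Xianyu schema: the field types, the `Instruction` fields and the sample values are assumptions; only the key compositions come from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SessionKey:
    """Owner + ItemID + User + sessionType uniquely identify a session."""
    owner: str
    item_id: str
    user: str
    session_type: int


@dataclass(frozen=True)
class SummaryKey:
    """userID + SID uniquely identify one user's view of a session."""
    user_id: str
    sid: str


@dataclass(frozen=True)
class MessageKey:
    """SID + message version uniquely identify a message."""
    sid: str
    version: int


@dataclass
class Message:
    key: MessageKey
    sender: str
    content: str


@dataclass
class Instruction:
    """Issued by the server, executed by the client (names are hypothetical)."""
    name: str  # e.g. "do_not_disturb" or "delete"
    sid: str
    params: dict = field(default_factory=dict)


# The same two people talking about two different items form two sessions.
key_a = SessionKey("seller_1", "item_A", "buyer_9", 1)
key_b = SessionKey("seller_1", "item_B", "buyer_9", 1)
sessions = {key_a: "sid-1", key_b: "sid-2"}
print(len(sessions))
```

Frozen data classes are hashable, so the keys can be used directly as dictionary keys, which mirrors the "unique identifier" role they play in the models.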

4.2.2) Message channel:

1) Online channel: uses ACCS, the long-connection service of Taobao's wireless infrastructure, which provides a full-duplex, low-latency and highly secure channel;

2) Offline channel: uses AGOO, Taobao's message push platform. It hides the complexity of integrating with the mainstream device vendors and serves business systems directly.

4.2.3) Message synchronization model:

1) The client builds a local database to store message data: once message data is stored on the device, message synchronization is optimized from full pull to full + incremental synchronization.

Specifically, incremental and full synchronization mean:

  • A. Incremental synchronization: the client stores its message position and, by comparing it with the latest position on the server, synchronizes only the new messages.
  • B. Full synchronization: when the user reinstalls the app or the position gap is too large, the client pulls the full message history and rebuilds its local data.

2) The server builds a personal message domain ring (inbox model) to perform incremental data synchronization with the client. At the same time, the read-heavy, write-light imbalance of the 1.0 architecture is balanced by write diffusion (fan-out on write) into the personal domain rings.

The following figure shows the path of a message from sending to receiving, and what the server and the client do along the way:

As shown in the figure above, suppose Ua sends a message to Ub. The message write is diffused to both Ua's and Ub's domain rings:

  • 1) When the client is online and the position of the pushed message equals the client's current domain version + 1, the message is simply merged into the local message database.
  • 2) When the client is offline, only an offline push notification is sent. When the user comes back online, data synchronization is performed, and the server decides whether to trigger incremental or full synchronization.

For point 2, the logic is as follows:

  • 1) If the domain ring version gap is smaller than the threshold, incremental synchronization is performed and the result is merged into the local message database.
  • 2) If the domain ring version gap is greater than the threshold, the full message data is pulled and the client's local data is rebuilt.
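The decision logic above can be sketched as two small functions. This is a simplified illustration under stated assumptions: the function and enum names are hypothetical, a gap exactly equal to the threshold is treated as a full synchronization (the text does not specify this case), and a missing local version stands for a fresh install.

```python
from enum import Enum
from typing import Optional


class SyncAction(Enum):
    NONE = "none"
    INCREMENTAL = "incremental"
    FULL = "full"


def can_merge_push(local_version: int, pushed_version: int) -> bool:
    """Online case: a push is merged directly only if it is the next version."""
    return pushed_version == local_version + 1


def choose_sync(local_version: Optional[int], server_version: int,
                threshold: int) -> SyncAction:
    """Reconnect case: pick incremental or full sync from the version gap."""
    if local_version is None:  # fresh install: no local data to build on
        return SyncAction.FULL
    gap = server_version - local_version
    if gap <= 0:
        return SyncAction.NONE  # already up to date
    if gap < threshold:
        return SyncAction.INCREMENTAL
    return SyncAction.FULL  # assumption: gap == threshold also goes full


print(choose_sync(100, 103, threshold=50))
print(choose_sync(100, 400, threshold=50))
```

Keeping the decision as a pure function of two versions and one threshold makes it easy to test and to tune the threshold independently of the transport.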

The entire synchronization logic is built on the Xianyu IM domain ring, which can be thought of as a user's message inbox with a fixed capacity: every message sent to a user is synchronized into that user's domain ring.

Specifically:

  • 1) Domain ring storage: the domain ring must support highly concurrent reads and writes, so it is implemented on Tair, Alibaba's distributed KV storage system;
  • 2) Domain ring capacity: to reduce full synchronization, the capacity of a personal domain ring is planned from the average number of messages a user needs to synchronize the next time they open Xianyu. Historical data is overwritten cyclically in FIFO order;
  • 3) Domain ring version: the user's current message position. When a message enters the personal domain ring, a Tair counter increments the domain ring version strictly and continuously. The version is used to decide between full and incremental synchronization.
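The three properties above (fixed capacity, FIFO overwrite, strictly increasing version) can be illustrated with an in-memory stand-in. This sketch does not use Tair or its API: a Python `deque` plays the role of the ring and a plain integer plays the role of the Tair counter, and the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class DomainRing:
    """In-memory model of a per-user inbox with fixed capacity."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.version = 0  # stands in for the strictly increasing counter
        self._slots: Deque[Tuple[int, str]] = deque(maxlen=capacity)

    def append(self, message: str) -> int:
        """Write a message; the oldest entry is dropped when the ring is full."""
        self.version += 1
        self._slots.append((self.version, message))
        return self.version

    def since(self, client_version: int) -> Optional[List[Tuple[int, str]]]:
        """Messages newer than client_version, or None if some were overwritten."""
        if client_version >= self.version:
            return []
        oldest_version = self._slots[0][0]
        if client_version + 1 < oldest_version:
            return None  # gap not covered by the ring: a full sync is needed
        return [(v, m) for v, m in self._slots if v > client_version]


ring = DomainRing(capacity=3)
for i in range(1, 6):
    ring.append(f"m{i}")

print(ring.since(3))  # covered by the ring: incremental sync is possible
print(ring.since(1))  # version 2 was overwritten: fall back to full sync
```

The capacity directly controls how often clients fall back to full synchronization, which is why the article sizes it from the average number of messages a returning user needs.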

After this work was completed, Xianyu had its own independent instant messaging system, which relieved the problems at the time and greatly improved the user experience.

5. Version 3.0: Rapid business growth, ensuring system stability

5.1 Technical Background

As the Xianyu business ecosystem grew richer, IM session types and message content types kept expanding. At the same time, with the rapid growth in users, user complaints such as messages not being received and messages arriving late became increasingly prominent.

5.2 Problem Analysis

Problem 1: The Xianyu app process has no effective keepalive mechanism, and the system suspends the process soon after the app moves to the background, which breaks the long connection. Messages are then pushed through the vendor channel, which has poor real-time performance and varying push priorities, so users perceive message delay.

Problem 2: When ACCS pushes online messages, the average delay is short, but false connections occur. In addition, the current push link has no ACK mechanism, so the server believes a message has been delivered even though the client never received it. The user sees the message only the next time they open the app, and perceives it as delayed.

PS: A false connection arises when the user sends the app to the background and the ACCS long connection is interrupted, but the device status update is delayed.

Problem 3: The current push mode (ACCS) and pull mode (MTOP) are not isolated on the client and are processed asynchronously, so in some extreme cases the message database is handled incorrectly and messages are lost.

For example, after a user comes online and receives several messages in a row, one of them triggers a "domain black hole". While the client is rebuilding its data during message synchronization, there is a small probability that an error occurs.

Problem 4: Most online message problems are discovered through user feedback. When messages go wrong, the system is unaware of it, has no remedy, and troubleshooting is difficult. Problems can only be fixed with a new app release.

Problem 5: As the business grew richer, service accounts, mini-program content marketing, message groups and other capabilities were incubated on top of the messaging system. All of these sending links share the same domain rings and data stores, which causes stability problems.

For example, the personal domain ring holds both IM chat messages and marketing messages. IM chat is triggered by users and requires strong delivery guarantees. Marketing messages, however, are generally sent in batches by the system, for example through scheduled batch jobs ("shuttle buses"). Their volume is large and their TPS is high, which affects the stability of the IM service.

5.3 Solutions

Based on the above analysis, we addressed the problems one by one.

1) Message retransmission and push and pull isolation:

As shown above:

  • A. ACK: ensures that messages arrive in time. After receiving an ACCS message and processing it successfully, the client returns an ACK to the server. On receiving the ACK, the server updates the message's arrival status and stops retrying. This guards against message loss caused by false connections or unstable networks.
  • B. Resend: a delayed-resend policy decides when to resend a message, to make message arrival deterministic. The adaptive delayed-retransmission policy first probes the device's network state with four short retries at a fixed delay of N seconds, then, depending on the network state, switches to delays that grow by a fixed step M. The goal is to deliver the message in the shortest time with the fewest retransmissions.
  • C. Message queue: a message queue is introduced on the client to process messages in order and ensure they are handled correctly. Push and pull are also isolated and consumed in order through the queue, which solves the errors caused by concurrent processing of message data in complex situations.
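The ACK and resend behavior described above can be sketched as follows. This is one possible reading, not the actual Xianyu policy: the article does not give concrete values for N or M, the network-state detection is left out, and the names `resend_delays` and `DeliveryTracker` are hypothetical.

```python
from typing import Dict, List, Optional


def resend_delays(n_seconds: float, step_m: float, max_attempts: int,
                  probes: int = 4) -> List[float]:
    """Delay before each resend: `probes` fixed delays, then growth by step_m."""
    delays = []
    for attempt in range(max_attempts):
        if attempt < probes:
            delays.append(n_seconds)  # short fixed-delay probing phase
        else:
            delays.append(n_seconds + step_m * (attempt - probes + 1))
    return delays


class DeliveryTracker:
    """Server-side bookkeeping: resend until an ACK arrives or retries run out."""

    def __init__(self, delays: List[float]) -> None:
        self._delays = delays
        self._attempts: Dict[str, int] = {}

    def on_push(self, msg_id: str) -> None:
        self._attempts.setdefault(msg_id, 0)

    def on_ack(self, msg_id: str) -> None:
        self._attempts.pop(msg_id, None)  # arrival confirmed: stop retrying

    def next_delay(self, msg_id: str) -> Optional[float]:
        """Delay before the next resend, or None if acked or exhausted."""
        attempt = self._attempts.get(msg_id)
        if attempt is None or attempt >= len(self._delays):
            return None
        self._attempts[msg_id] = attempt + 1
        return self._delays[attempt]


schedule = resend_delays(n_seconds=2, step_m=5, max_attempts=7)
tracker = DeliveryTracker(schedule)
tracker.on_push("msg-1")
first_delay = tracker.next_delay("msg-1")
tracker.on_ack("msg-1")
print(schedule, first_delay, tracker.next_delay("msg-1"))
```

With the illustrative values N = 2 and M = 5, the first four resends happen quickly and later ones back off, so a device with a briefly unstable network is retried fast while a device that is truly unreachable is not flooded.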

2) Data storage split:

More than half of the instant messages Xianyu sends every day are marketing messages. Marketing sends have obvious peaks and troughs, which can cause jitter in the message database and affect IM messages. We therefore isolated the storage of messages, summaries and domain rings by business, to meet the different stability requirements of different business scenarios.

The specific approach is:

  • 1) IM messages require extremely high stability, so their messages and summaries continue to be stored in MySQL;
  • 2) Marketing messages have a short retention period and lower stability requirements than IM, so they are stored in Lindorm;
  • 3) Domain rings are isolated at the instance level, so that the capacity of the IM domain ring is not consumed by other messages, which would affect message synchronization.

PS: Lindorm is a multi-model cloud-native database service that offers low cost, custom TTL and horizontal scaling.
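The split above amounts to routing each message type to its own storage and its own domain ring instance. The sketch below only illustrates that routing idea: the enum, the `StorageRoute` type and the instance names are hypothetical, and no real MySQL or Lindorm client is involved.

```python
from dataclasses import dataclass
from enum import Enum


class MessageKind(Enum):
    IM = "im"
    MARKETING = "marketing"


@dataclass(frozen=True)
class StorageRoute:
    message_store: str         # where messages and summaries are stored
    domain_ring_instance: str  # which isolated ring instance receives writes


# Hypothetical routing table reflecting the split described above.
ROUTES = {
    MessageKind.IM: StorageRoute("mysql", "ring-im"),
    MessageKind.MARKETING: StorageRoute("lindorm", "ring-marketing"),
}


def route(kind: MessageKind) -> StorageRoute:
    """Pick the storage and ring instance for a message type."""
    return ROUTES[kind]


print(route(MessageKind.IM))
print(route(MessageKind.MARKETING))
```

Because the two kinds never share a ring instance or a database, a marketing burst can no longer evict IM messages from the ring or slow down IM reads and writes.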

3) Online problem discovery and recovery:

A key to guaranteeing stability is monitoring the core metrics, and monitoring first needs a data source. We instrument the key link nodes on both the server and the client. Based on the group's UT and SLS, the data is cleaned by real-time computation (Blink) into unified, standardized log data that is written to SLS and used for real-time monitoring and link tracing.

The core goal of the messaging system is that messages are sent successfully, delivered, and delivered on time, so we monitor the stability of the system through the send success rate, the arrival rate and the message delay.
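The three metrics can be computed from per-message delivery records. The sketch below assumes definitions the article does not spell out: send success rate is successful sends over all attempts, arrival rate is acknowledged messages over successful sends, and delay is the time from send to ACK. The record fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DeliveryRecord:
    send_ok: bool              # the server accepted and pushed the message
    sent_at: float             # send timestamp in seconds
    acked_at: Optional[float]  # client ACK timestamp, None if never acknowledged


def stability_metrics(records: List[DeliveryRecord]) -> Dict[str, float]:
    """Send success rate, arrival rate and average delay over a batch of logs."""
    sent = [r for r in records if r.send_ok]
    arrived = [r for r in sent if r.acked_at is not None]
    delays = [r.acked_at - r.sent_at for r in arrived]
    return {
        "send_success_rate": len(sent) / len(records) if records else 0.0,
        "arrival_rate": len(arrived) / len(sent) if sent else 0.0,
        "avg_delay_seconds": sum(delays) / len(delays) if delays else 0.0,
    }


sample = [
    DeliveryRecord(True, 0.0, 1.0),
    DeliveryRecord(True, 0.0, 3.0),
    DeliveryRecord(True, 0.0, None),   # pushed but never acknowledged
    DeliveryRecord(False, 0.0, None),  # failed to send
]
metrics = stability_metrics(sample)
print(metrics)
```

In a real pipeline these values would be computed continuously over the standardized logs rather than over an in-memory list, but the definitions stay the same.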

In addition, to make user-reported problems easier to investigate:

  • 1) We designed a set of instructions. Following an agreed instruction protocol, the server sends instructions to a specified user, and the client executes them and reports abnormal data, which makes investigation more efficient;
  • 2) We added instructions such as forced full synchronization and data correction to repair a specific user's message data. Previously, a serious bug could only be resolved by asking the user to uninstall and reinstall the app, so this approach is clearly more user-friendly.
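On the client, such an instruction set can be handled by a small dispatcher that maps instruction names to handlers. This is a minimal sketch: the instruction names, the payload shape and the handler results are hypothetical, and the real protocol is not described in the article.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]


class InstructionDispatcher:
    """Client-side registry that executes instructions sent by the server."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    def execute(self, instruction: dict) -> str:
        handler = self._handlers.get(instruction.get("name", ""))
        if handler is None:
            return "ignored"  # unknown instructions are skipped, not fatal
        return handler(instruction.get("params", {}))


dispatcher = InstructionDispatcher()
dispatcher.register("report_diagnostics",
                    lambda p: f"uploaded logs for session {p['sid']}")
dispatcher.register("force_full_sync",
                    lambda p: "local data rebuilt from full sync")

print(dispatcher.execute({"name": "report_diagnostics", "params": {"sid": "sid-1"}}))
print(dispatcher.execute({"name": "force_full_sync"}))
print(dispatcher.execute({"name": "unknown_instruction"}))
```

Ignoring unknown names is a deliberate choice in this sketch: it lets the server introduce new instructions without breaking older client versions.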

After this series of targeted improvements, technical complaints dropped by 50%, a message stability system was built from 0 to 1, and the user experience improved further.

6. Looking to the future

Xianyu is an e-commerce transaction app in which IM is the step that precedes a transaction, so the IM product experience strongly affects users' transaction efficiency.

A user survey conducted some time ago found that the NPS of Xianyu IM was lower than expected (NPS is a measure of user loyalty: the percentage of promoters minus the percentage of detractors).
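As a worked example of the NPS formula, the sketch below uses the common convention for 0 to 10 survey scores, where 9 and 10 count as promoters and 0 to 6 as detractors. The scores are made up for illustration and are not Xianyu survey data.

```python
from typing import List


def net_promoter_score(scores: List[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6), on 0-10 scores."""
    if not scores:
        raise ValueError("at least one score is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Illustrative responses: 5 promoters, 2 passives, 3 detractors.
sample_scores = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(net_promoter_score(sample_scores))
```

Passive respondents (scores 7 and 8) count toward the total but not toward either group, so NPS ranges from -100 to 100.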

From user feedback:

  • 1) Some users have strong demands for product functions, such as message search and grouping;
  • 2) Most users find the violation notices they run into when sending messages hard to understand;
  • 3) There is still a lot of feedback about messages not being received or arriving late.

Mapped onto the current instant messaging system, this means our architecture still needs a lot of continuous improvement.

Typical problems include: redundancy in the synchronization protocol, which easily causes problems as requirements iterate; the lack of an effective keepalive mechanism, which hurts real-time message delivery; offline messages failing to arrive on niche phone models; and an online database swollen by years of accumulated data. All of these affect the iteration speed of the Xianyu business and its NPS.

As the technical team, our next step is to make improving NPS the core technical goal. Version 4.0 of the Xianyu instant messaging architecture is on the way... (This article is simultaneously published at: www.52im.net/thread-3699…)