Today, a steadily growing body of travel notes, guides, buzz posts, and other high-quality shared content attracts more and more users, inspires them to travel, and drives the growth of transactions on Hornet’s Nest. The IM system plays an important role in helping users make travel decisions and complete transactions.
The IM system establishes a direct communication channel between users and merchants and helps users get answers to their questions when purchasing travel products. It not only helps close orders but also dispels users’ doubts and helps them realize their travel plans. As the business has grown rapidly, the Hornet’s Nest IM system has gone through several major architectural evolutions and transformations in recent years.
IM 1.0 — Early stage
In the early stage, services had to launch quickly, traffic was low, and concurrency requirements were modest. The technical architecture of the IM system therefore aimed mainly at simplicity and availability, and the functions it implemented were very basic.
IM 1.0 was developed in PHP and implemented the basic IM functions: user and customer service access, message sending and receiving, and consultation list management. When a user starts a consultation, the user is assigned to a customer service agent under an equal allocation policy, and the association between the user and the agent is recorded. When the user or the agent sends a message, the message forwarding module delivers it to the other party’s Redis blocking queue. To receive messages, the client calls the message polling module over an HTTP long connection. If a message is available, the request returns immediately; if not, the request blocks for a period of time before returning. The purpose of blocking is to reduce the polling frequency. The message sending and receiving model is as follows:
Message polling module optimization
In the above model, the long-connection requests of the message polling module are held on the blocking queue by PHP-FPM processes. If the PHP-FPM processes cannot be released in time, they consume a great deal of server resources and the load becomes very high.
To solve this problem, we optimized the message polling module, using Lua coroutines on the OpenResty framework to fix the problem of PHP-FPM processes being held for a long time. Requests forwarded by Nginx carry a marker, and the Lua coroutine uses it to decide whether to intercept the request. If it does, the Lua coroutine handles the blocking operation itself, so PHP-FPM is released in time and the load on the server drops. The optimized processing flow is shown in the following figure:
IM 2.0 — Requirements customization phase
With the rapid growth of the business, the IM system faced a large number of customization requirements in a short period of time, and many new business modules were developed. At the same time, the volume of user consultations overwhelmed customer service capacity. IM 2.0 therefore focused on improving the business functions. For example, the single allocation method used previously was replaced with several options, including average, weighted, and queued allocation. To improve customer service efficiency, optional features such as automatic reply and FAQ were added for customer service replies.
Take a typical user consultation scenario as an example. When a user opens the app or web page, a long connection is established through the connection layer. When the user then initiates a consultation at a consultation entry point, the request carries message-line information that is used to initialize the message link and establish a reusable, retrievable message line. When a message is sent, the message service stores it in the DB and uses the message line to check whether the current consultation has already been assigned to a customer service agent. The allocation service is called to complete the customer service information for the current consultation. Finally, the customer service information is written back to the link relationship.
In this way, a complete message link is established, and messages sent by the user or the customer service agent are then delivered to the other party through the forwarding service, as shown in the following figure:
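To make the difference between the allocation modes concrete, here is a minimal Go sketch of average and weighted allocation. The article does not describe the actual algorithms, so everything here is an assumption for illustration: the `Agent` type, its `Weight` field, and the idea of driving both policies from an increasing ticket counter. Queued allocation is not shown.

```go
package main

import "fmt"

// Agent is a hypothetical customer service record; Weight is an assumed field.
type Agent struct {
	Name   string
	Weight int
}

// PickAverage spreads consultations evenly by cycling through the agents.
// ticket is an assumed non-negative counter that grows with each consultation.
func PickAverage(agents []Agent, ticket int) (Agent, bool) {
	if len(agents) == 0 || ticket < 0 {
		return Agent{}, false
	}
	return agents[ticket%len(agents)], true
}

// PickWeighted gives each agent a share of tickets proportional to its weight.
// Agents with a non-positive weight never receive a consultation.
func PickWeighted(agents []Agent, ticket int) (Agent, bool) {
	total := 0
	for _, a := range agents {
		if a.Weight > 0 {
			total += a.Weight
		}
	}
	if total == 0 || ticket < 0 {
		return Agent{}, false
	}
	n := ticket % total
	for _, a := range agents {
		if a.Weight <= 0 {
			continue
		}
		if n < a.Weight {
			return a, true
		}
		n -= a.Weight
	}
	return Agent{}, false // not reached: n is always below the weight total
}

func main() {
	agents := []Agent{{Name: "A", Weight: 1}, {Name: "B", Weight: 3}}
	for ticket := 0; ticket < 4; ticket++ {
		avg, _ := PickAverage(agents, ticket)
		weighted, _ := PickWeighted(agents, ticket)
		fmt.Println(ticket, avg.Name, weighted.Name)
	}
}
```

With weights 1 and 3, agent B receives three of every four consultations under the weighted policy, while the average policy alternates between the two agents.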
IM 3.0 – Service disassembly phase
As business volume kept growing, the IM system’s code expanded rapidly with each added module. Because coding standards were not unified, interfaces had more than one responsibility, and modules were coupled to one another, changing one requirement was likely to affect other modules, which made developing and maintaining new requirements very costly.
To address this, the IM system’s architecture had to be upgraded, and the first task was to split it into services. The IM system is currently divided into four services: customer service, user service, IM service, and data service, as shown in the following figure:
- Customer service: provides various ways to improve customer service efficiency and user experience. Group management, member management, and quality inspection services raise the operation and management level of the customer service team; allocation and transfer services make the reception of users more flexible and efficient; automatic reply, FAQ, and knowledge base services improve the efficiency of replies to consultations.
- User service: analyzes user behavior, builds interest recommendations and user portraits, and measures users’ satisfaction with the customer service of Hornet’s Nest merchants.
- IM service: supports single chat and group chat, and provides real-time message notification, offline message push, historical message roaming, contact lists, file upload and storage, and risk control and detection of message content.
- Data service: defines data indicators by collecting information such as the source and entry point of a consultation, whether the user consulted or placed an order, whether a customer service agent received the user, and the times of user consultations and customer service replies. It runs offline calculations over this data and finally provides data statistics externally. The main indicators include the 30-second and 1-minute response rates, number of inquiries, number of unanswered inquiries, average reply time, sales volume from inquiries, inquiry conversion rate, referral conversion rate, reception pressure by time period, on-duty status, and service score.
User status flow
In the existing IM system, a complete user status flow during user consultation is shown in the figure below:
When the user clicks the consult button, an event is triggered and the user status enters the initial state. When a message is sent, the system changes the user status to “to be assigned”. After a customer service agent is assigned by invoking the assignment service, the user status changes to “assigned/unsolved”. When the agent resolves the problem, or when the user stays silent for a long time after the agent’s reply, the system marks the consultation as solved: the user status changes to “solved” and the consultation process ends.
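The status flow described above can be sketched as a small table-driven state machine. This is an illustrative Go sketch, not the production code: the state and event names are assumptions chosen to mirror the description, and `Resolved` stands for both an explicit resolution by customer service and the automatic timeout.

```go
package main

import "fmt"

// State is a user status during one consultation (names are assumed).
type State int

const (
	Initial State = iota
	ToBeAssigned
	Assigned // assigned to customer service, problem not yet solved
	Solved
)

// Event is something that moves the consultation forward (names are assumed).
type Event int

const (
	MessageSent Event = iota
	AgentAssigned
	Resolved // solved by customer service, or closed automatically after a long silence
)

// transitions lists the only moves the described flow allows.
var transitions = map[State]map[Event]State{
	Initial:      {MessageSent: ToBeAssigned},
	ToBeAssigned: {AgentAssigned: Assigned},
	Assigned:     {Resolved: Solved},
}

// Next returns the state that follows s on event e, or an error when the
// flow does not allow that event in that state.
func Next(s State, e Event) (State, error) {
	if to, ok := transitions[s][e]; ok {
		return to, nil
	}
	return s, fmt.Errorf("event %d is not allowed in state %d", e, s)
}

func main() {
	s := Initial
	for _, e := range []Event{MessageSent, AgentAssigned, Resolved} {
		next, err := Next(s, e)
		if err != nil {
			fmt.Println("rejected:", err)
			return
		}
		s = next
	}
	fmt.Println("consultation solved:", s == Solved)
}
```

Keeping the allowed moves in one table makes out-of-order events, such as a resolution arriving before assignment, easy to reject.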
IM service reconstruction
During service splitting, we had to consider the generality, availability, and degradation strategy of each service, and to reduce dependencies between services as much as possible so that the unavailability of a single service would not bring down the others. During this period, more and more business lines needed the IM service, and both the frequency and the volume of use kept increasing. In the early IM service, when the number of connections grew large, horizontal scaling could only be achieved by modifying code. To onboard a new business, the OpenResty environment and the Lua coroutine code had to be set up on that business’s servers. Onboarding was very inconvenient, and the IM service was not general-purpose.
Given these problems, we rebuilt the IM service completely. The goal was to extract the IM service into a standalone module that does not depend on other services and that exposes unified integration and invocation methods. Because the IM service requires high-concurrency processing with low overhead, we chose Go to develop this module. The new IM service design is shown as follows:
The more important Proxy layer and Exchange layer provide the following services:
1. Routing rules, such as IP hash, round-robin, and least connections, are used to distribute clients across different ChannelManager instances.
2. Client access management: connection information is synchronized to the DispatcherTable module so that the Dispatcher can look it up.
3. The communication protocol between the ChannelManager and the client, covering connection establishment, disconnection, active disconnection, heartbeat, notification, sending and receiving messages, and message QoS.
4. A REST interface for sending single-chat and group messages. Whether to use this interface depends on the scenario. For example, when a user consults customer service, messages must be sent through this interface. The main reasons are as follows:
- Sending a message involves logic for creating the message line and assigning a customer service agent, which is currently implemented in PHP, and the IM service needs to know the result of that PHP logic. One option is to reimplement the logic in Go; another is for the IM service to call back into PHP through a REST interface. The latter would cause excessive network interaction between the IM service and the PHP service and hurt performance.
- To forward a message directly, multiple ChannelManager instances would need to communicate with each other. For example, user A on ChannelManager1 sends a message to customer service agent B on ChannelManager2. Without a communication mechanism between instances, the message cannot be forwarded. And every time a ChannelManager instance is added, it would have to establish communication with all existing instances, which makes scaling the system more complex.
- If the client does not support the WebSocket protocol, the fallback is to receive messages by HTTP polling and to send messages over short connections.
Other scenarios that do not need message forwarding and only pass messages to the ChannelManager can send directly over WebSocket.
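Two of the routing rules named in item 1 can be sketched in a few lines of Go. This is only an illustration of the general techniques, since the article does not show how the Proxy layer implements them. The `Instance` type, the function names, and the choice of the FNV hash are all assumptions, and both functions assume a non-empty instance list.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Instance is a hypothetical view of one ChannelManager instance.
type Instance struct {
	Addr  string
	Conns int // current number of client connections
}

// PickByIPHash maps a client IP to an instance, so the same client keeps
// landing on the same instance while the instance list is unchanged.
// It assumes instances is non-empty.
func PickByIPHash(clientIP string, instances []Instance) Instance {
	h := fnv.New32a()
	h.Write([]byte(clientIP))
	return instances[h.Sum32()%uint32(len(instances))]
}

// PickLeastConnections returns the instance with the fewest connections;
// on a tie the earliest instance in the list wins.
// It assumes instances is non-empty.
func PickLeastConnections(instances []Instance) Instance {
	best := instances[0]
	for _, in := range instances[1:] {
		if in.Conns < best.Conns {
			best = in
		}
	}
	return best
}

func main() {
	instances := []Instance{
		{Addr: "cm1:9000", Conns: 120},
		{Addr: "cm2:9000", Conns: 45},
		{Addr: "cm3:9000", Conns: 300},
	}
	fmt.Println("ip hash:", PickByIPHash("10.0.0.7", instances).Addr)
	fmt.Println("least connections:", PickLeastConnections(instances).Addr)
}
```

IP hash keeps a client on a stable instance, while least connections evens out load. Note that with a plain modulo hash, adding or removing an instance remaps most clients.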
Modified IM service invocation process
Initializing the message line and assigning a customer service agent are handled by the PHP business service. When a message needs to be forwarded, the PHP service invokes the message sending interface of the Dispatcher service. The Dispatcher service looks up the ChannelManager instance where the receiver is connected in the shared DispatcherTable data and sends the message to that instance via RPC, and the ChannelManager pushes the message to the client over WebSocket. The IM service invocation process is as follows:
If the number of connections exceeds the capacity of the current ChannelManager cluster, you only need to add ChannelManager instances; the listening side is notified dynamically through etcd. A browser version of the JS SDK has already been developed, and other business lines can integrate the IM service easily by following the integration documentation.
There are three issues to consider in the design of the Exchange layer:
1. Multi-terminal message synchronization
Currently, the clients include PC browsers, a Windows client, H5, and iOS/Android. If a user is logged in on multiple terminals, all of the user’s connections must be found when a message arrives. If one of the user’s terminals disconnects, that specific connection must be located.
As mentioned above, connection information is stored in the DispatcherTable module, so the DispatcherTable module must be able to retrieve connection information quickly from user information. The DispatcherTable module is built on Redis Hash storage. When a client establishes a connection with the ChannelManager, the metadata to be synchronized includes UID (user information), uniqueField (a unique value that corresponds to one connection), wsid (connection identifier), clientip (client IP), serverip (server IP), and channel (channel). The corresponding structure is roughly as follows:
The key (UID) finds all of a user’s connections, and the key plus the field locates a single connection. Connection information expires after 2 hours by default. This keeps stale data from building up in the DispatcherTable when a client connection is interrupted abnormally and the server does not detect the disconnection.
2. Synchronize the online status of users
For example, a user who has consulted four customer service agents appears in the lists of all four. When the user comes online, all four agents must see that the user is online.
There are two ways to do this. One is for each agent to poll for the user’s status, but while the user’s online status is unchanged this produces many useless requests. The other is to push an online notification to the agents when the user comes online, which causes message fan-out, because the notification has to be sent to every agent the user has consulted. We chose the second method and, when pushing, send the user status only to agents who are online.
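The restriction on the push approach amounts to filtering the fan-out list by online status. A minimal Go sketch, assuming a hypothetical `OnlineTargets` helper and an online set that the real system would derive from its connection data:

```go
package main

import "fmt"

// OnlineTargets returns the agents that should receive a user's status
// change: those the user has consulted who are online right now.
func OnlineTargets(consulted []string, online map[string]bool) []string {
	var targets []string
	for _, agent := range consulted {
		if online[agent] {
			targets = append(targets, agent)
		}
	}
	return targets
}

func main() {
	consulted := []string{"cs1", "cs2", "cs3", "cs4"}
	online := map[string]bool{"cs2": true, "cs4": true}
	// Only cs2 and cs4 are notified; cs1 and cs3 read the status when they log in.
	fmt.Println(OnlineTargets(consulted, online))
}
```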
3. Messages are not lost or repeated
To avoid message loss in long-connection polling mode, the client includes the ID of the last message it has read in each request, and the server computes the messages the client has not yet received and returns them. In WebSocket mode, the server pushes the message to the client and waits for the client’s ACK. If the server does not receive an ACK, it retries the push several times.
Retries can cause duplicates: the client may have received the message but, for some reason, failed to deliver the ACK, which triggers a retry. The client therefore needs to deduplicate messages by message ID.
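The interaction between server retries and client-side deduplication can be sketched as follows. This Go example is illustrative only: `Message`, `Client`, and `PushWithRetry` are hypothetical names, the network is simulated by a function value, and the retry limit is an arbitrary choice.

```go
package main

import "fmt"

// Message carries the ID that the client uses for deduplication.
type Message struct {
	ID   int64
	Body string
}

// Client remembers which message IDs it has already accepted.
type Client struct {
	seen  map[int64]bool
	Inbox []Message
}

func NewClient() *Client {
	return &Client{seen: make(map[int64]bool)}
}

// Receive stores the message unless its ID was seen before.
// It reports whether the message was new.
func (c *Client) Receive(m Message) bool {
	if c.seen[m.ID] {
		return false
	}
	c.seen[m.ID] = true
	c.Inbox = append(c.Inbox, m)
	return true
}

// PushWithRetry calls send until it reports an ACK or maxAttempts is used up.
// send stands in for "push over WebSocket and wait for the ACK".
func PushWithRetry(send func(Message) bool, m Message, maxAttempts int) (attempts int, acked bool) {
	for attempts < maxAttempts {
		attempts++
		if send(m) {
			return attempts, true
		}
	}
	return attempts, false
}

func main() {
	client := NewClient()
	ackLost := false
	// Simulated link: the message always arrives, but the first ACK is lost.
	send := func(m Message) bool {
		client.Receive(m)
		if !ackLost {
			ackLost = true
			return false
		}
		return true
	}
	attempts, acked := PushWithRetry(send, Message{ID: 1, Body: "hello"}, 3)
	fmt.Println("attempts:", attempts, "acked:", acked, "stored:", len(client.Inbox))
}
```

In this run the server pushes the message twice, yet the client stores it once, which is the behavior the message ID check is meant to guarantee.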
Message flows for IM services
As mentioned above, the IM service must support multiple terminals, and roles are divided into the user side and the merchant side. To output message and notification content dynamically according to domain, terminal, and role, we introduced DDD (domain-driven design) modeling into message processing. The processing flow is shown in the figure below:
Summary and Outlook
As Hornet’s Nest deepens its “content + transaction” model, the IM system architecture has gone through several stages of evolution and upgrading, moving from an initially rough and disorderly state to unified management, and it has gradually become standardized and reached scale.
We have made some progress, but there is still a long way to go. In the future, we will continue to optimize the IM system in step with the company’s business development and the team’s technical capabilities. We currently plan to rewrite the server-side code of the message polling module in Go so that it no longer depends on the PHP and OpenResty environments and is better decoupled. We will also explore intelligent customer service based on TensorFlow: by training models and analyzing data, we aim to further improve the problem-solving efficiency of human customer service, improve user experience, and better empower the business.
Author: IM R&D team, Hornet’s Nest e-commerce platform.
(Original content from Hornet’s Nest Technology. Reprints must credit the source and keep the QR code image at the end of the article. Thank you for your cooperation.)