Original article from the Hornet’s Nest technology team. For more technical content, search for the WeChat official account MFWtech.

Instant messaging (IM) is very important for e-commerce platforms, especially travel e-commerce.

In terms of product complexity, a single travel product may cover clothing, food, accommodation, transportation, and other aspects of the user’s upcoming trip. In terms of spending, a single purchase is often large. Add unfamiliarity with the destination and the problems that can come up during a trip, and users have a strong need to communicate with merchants before, during, and after a purchase. A good IM system can therefore help drive the GMV of the e-commerce business to some extent.

In this article, we walk through the evolution of the IM service behind Hornet’s Nest travel e-commerce, focusing on its Go-based rebuild, in the hope that it is a useful reference for anyone facing similar problems.

Part.1 Technical background and problems

Unlike general-purpose instant messaging, each e-commerce business line has its own business logic, such as the customer service assignment logic of the customer service chat system or sensitive word detection, and this logic tends to get coupled into the communication flow. As more business lines were connected, the IM service became increasingly bloated. Tracing a message across the whole link also became complicated, and service stability was affected by business logic.

Previously, message push in our IM application relied mainly on polling. The long-polling requests of the message polling module were implemented by PHP-FPM processes blocking on a queue. When request volume was high, PHP-FPM processes could not be released in time, which was very costly for server performance.

To address this, we used OpenResty + Lua and moved the entire polling capability from PHP to Lua coroutines, taking that pressure off PHP. This improved performance somewhat, but the mixed PHP-Lua setup made the system troublesome to use, upgrade, debug, and maintain, and it generalized poorly. Many business scenarios still depended on the PHP interface, so the overall gain was not significant.

To solve the problems above, we decided to rebuild the IM service around the specific needs of e-commerce IM. The core goal was to separate business logic from the instant messaging service.

Part.2 Two-layer distributed IM architecture based on Go

2.1 Goals

1. Service decoupling

Business logic is separated from the communication flow so that the IM service architecture is clearer. Fully separating IM logic from e-commerce business logic keeps the service stable.

2. Flexible access modes

Previously, connecting a new business required configuring an OpenResty environment and Lua coroutine code on that business’s servers, which was inconvenient and meant the IM service was not general-purpose. Given the state of existing businesses, we wanted the IM system to offer both HTTP and WebSocket access so that each business can choose according to its scenario.

For example, businesses that are already connected and running well, such as the customized travel team’s internal system, the customized travel order-grabbing system, and the complaint-related systems, have no significant high-concurrency requirements. They can connect quickly over HTTP without having to learn the somewhat more complex WebSocket protocol, which avoids unnecessary development cost.

3. Scalable architecture

To meet the performance challenges that come with continued business growth, we designed the instant messaging service as a distributed architecture so that the system can keep scaling and improving.

2.2 Language Selection

At present, the Hornet’s Nest technology stack mainly includes PHP, Java, and Golang. The stack is fairly diverse, so each business can pick the tool and language that best fit its problem.

Considering the IM application scenario, we chose Go for the following reasons:

1. Performance

Go performs well, especially in I/O-intensive scenarios such as network communication, where its performance comes close to that of C/C++.

2. Development efficiency

Go is easy to use, efficient to write, and quick to pick up. Developers with some C++ background in particular can be writing code within about a week.

2.3 Architecture Design

The overall architecture diagram is as follows:

Terminology:

  • Customer: Generally refers to the user who buys the product

  • Merchant: the supplier that provides the service. A merchant has customer service staff who offer customers online consultation.

  • Dispatch module: the Dispatcher, which acts as a bridge and distributes messages to the designated work module

  • Work module: the Worker server, which provides the WebSocket service and does the actual message delivery.

Architectural layers:

  • Presentation layer: provides HTTP and WebSocket access modes.

  • Business layer: responsible for initializing message threads and processing business logic. When a client connects over HTTP, it sends the message in JSON format to the business server, which decodes it, assigns a customer service agent, and filters sensitive words, then passes it to the message distribution module for the next step. Businesses connected over WebSocket do not need message distribution; their messages go directly to the message processing module over WebSocket.

  • Service layer: consists of the message distribution layer and the message processing layer, deployed as multiple distributed Dispatcher and Worker nodes respectively. The Dispatcher looks up which server the receiver is connected to and sends the message to that Worker via RPC; the message processing module then pushes the message to the client over WebSocket.

  • Data layer: a Redis cluster that records a unique key built from the user identity, connection information, client platform (mobile, web, desktop), and so on.

2.4 Service Process

Step 1

As shown on the right side of the figure above, the user’s client establishes a long-lived WebSocket connection with the message processing module. A load balancing algorithm connects the client to a suitable server (one of the Workers in the message processing module). Once connected, the server records the user’s connection information, including the user role (customer or merchant), client platform (mobile, web, desktop), and so on, as a unique key in the Redis cluster.

Step 2

As shown on the left side of the figure, when a customer wants to send a message to a merchant’s customer service agent, the message is first sent to the business server over HTTP, and the business server applies the business logic to it.

(1) This step is a plain HTTP request, so clients written in any language can connect. The message is sent to the business server in JSON format. The business server first decodes it and then determines which merchant’s customer service agent it is addressed to.

(2) If the customer has not chatted with this merchant before, the business server logic must first assign a customer service agent, that is, establish a link between the customer and the merchant’s customer service, and obtain the agent’s ID for message delivery. If they have chatted before, this step is skipped.

(3) The business server persists the message to the database so that it is not lost.

Step 3

The business server sends the message to the message distribution module over HTTP. The distribution module’s job is to relay the message to the specified merchant.

Step 4

Based on the user connection information in the Redis cluster, the message distribution module forwards the message to the WebSocket server that the target user is connected to (one of the Workers in the message processing module).

(1) The distribution module forwards the message over RPC to the Worker that the target user is connected to. RPC is faster and transfers less data, which reduces server cost.

(2) When messages are passed to Workers, several policies are in place to ensure that they actually reach the Worker.
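The lookup-and-forward step can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the `Conn` shape, the `Pusher` interface, and its `Push` method are hypothetical names, the Redis query is replaced by a plain function, and the RPC call is replaced by a fake that only records which Worker would have been called.

```go
package main

import (
	"errors"
	"fmt"
)

// Conn describes where one client connection lives (hypothetical shape).
type Conn struct {
	ConnID   string
	ServerID string // the Worker holding the WebSocket connection
}

// Pusher abstracts the RPC call to a Worker. The method name is an
// assumption made for this sketch, not an API of any RPC framework.
type Pusher interface {
	Push(serverID, connID string, msg []byte) error
}

// ErrOffline is returned when the receiver has no live connection.
var ErrOffline = errors.New("receiver has no live connection")

// Dispatcher looks up the receiver's connections and forwards the message.
type Dispatcher struct {
	lookup func(uid string) []Conn // stand-in for the Redis query
	rpc    Pusher
}

// Dispatch forwards msg to every connection of uid and returns how many
// pushes succeeded. Failed pushes are left to a separate retry policy.
func (d *Dispatcher) Dispatch(uid string, msg []byte) (int, error) {
	conns := d.lookup(uid)
	if len(conns) == 0 {
		return 0, ErrOffline
	}
	sent := 0
	for _, c := range conns {
		if err := d.rpc.Push(c.ServerID, c.ConnID, msg); err == nil {
			sent++
		}
	}
	return sent, nil
}

// recorder is a fake Pusher that remembers which Workers were called.
type recorder struct{ calls []string }

func (r *recorder) Push(serverID, connID string, msg []byte) error {
	r.calls = append(r.calls, serverID+"/"+connID)
	return nil
}

func main() {
	table := map[string][]Conn{
		"merchant-7": {
			{ConnID: "c1", ServerID: "worker-1"},
			{ConnID: "c2", ServerID: "worker-3"},
		},
	}
	rec := &recorder{}
	d := &Dispatcher{
		lookup: func(uid string) []Conn { return table[uid] },
		rpc:    rec,
	}
	n, err := d.Dispatch("merchant-7", []byte("hello"))
	fmt.Println(n, err, rec.calls)
}
```

Keeping the lookup and the push behind small interfaces means the same routing logic can be exercised without Redis or a running Worker.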

Step 5

The message processing module pushes the message to the client through the WebSocket protocol:

(1) On delivery, the receiver should send an ACK back to the Worker server to confirm that it has received the message.

(2) If the receiver does not send the ACK, the Worker server resends the message to the receiver after a certain interval.

(3) If the message did reach the client but the ACK was lost because of network jitter, the server will deliver the message again. In that case the client deduplicates by message ID before displaying it.
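The client-side half of this protocol, deduplicating by message ID while still acknowledging every delivery, might look like the sketch below. The `Message` and `Client` types are hypothetical and the seen-set is an unbounded in-memory map, which a real client would need to cap or persist.

```go
package main

import "fmt"

// Message is a hypothetical pushed message; only the ID matters here.
type Message struct {
	ID   string
	Body string
}

// Client remembers the IDs it has already displayed so that a redelivered
// message is acknowledged again but not shown twice.
type Client struct {
	seen      map[string]bool
	Displayed []string
}

func NewClient() *Client {
	return &Client{seen: make(map[string]bool)}
}

// Receive returns the ID to ACK. The ACK is returned even for a duplicate,
// because the earlier ACK may have been lost on the way to the server.
func (c *Client) Receive(m Message) (ackID string) {
	if !c.seen[m.ID] {
		c.seen[m.ID] = true
		c.Displayed = append(c.Displayed, m.Body)
	}
	return m.ID
}

func main() {
	c := NewClient()
	first := Message{ID: "m-1", Body: "Is breakfast included?"}
	c.Receive(first)
	c.Receive(first) // redelivery after a lost ACK
	c.Receive(Message{ID: "m-2", Body: "Yes, for two guests."})
	fmt.Println(len(c.Displayed), c.Displayed)
}
```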

The data flow of the above steps is roughly shown in the figure below:

2.5 System integrity design

2.5.1 Reliability

(1) Messages are not lost

To avoid losing messages, we added a timeout-and-retransmit mechanism. After pushing a message to the client, the server waits for the client’s ACK. If no ACK arrives, the server pushes the message again, up to several times.

The current default timeout is 18 seconds. If retransmission fails three times, the connection is closed and the client reconnects to the server. After reconnecting, the client pulls historical messages to ensure nothing is missing.
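A minimal sketch of the server-side wait-and-retransmit loop is shown below. The 18-second timeout and the limit of three retransmissions come from the text; the function name `Deliver`, its parameters, and the short `demoTimeout` used to keep the example fast are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultAckTimeout = 18 * time.Second      // value quoted in the text
	defaultMaxRetries = 3                     // retransmissions before giving up
	demoTimeout       = 20 * time.Millisecond // shortened for the example only
)

// Deliver pushes a message and waits for an ACK. It retransmits up to
// maxRetries times and returns false when no ACK ever arrived, at which
// point the caller should close the connection and let the client
// reconnect and pull history.
func Deliver(push func(), ack <-chan struct{}, timeout time.Duration, maxRetries int) bool {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		push()
		timer := time.NewTimer(timeout)
		select {
		case <-ack:
			timer.Stop()
			return true
		case <-timer.C:
			// No ACK within the timeout: fall through and retransmit.
		}
	}
	return false
}

func main() {
	ack := make(chan struct{}, 1)
	pushes := 0
	ok := Deliver(func() {
		pushes++
		if pushes == 2 {
			ack <- struct{}{} // the client answers the first retransmission
		}
	}, ack, demoTimeout, defaultMaxRetries)
	fmt.Println(ok, pushes)
}
```

In production the call would use `defaultAckTimeout`; a new timer per attempt is acceptable here because there are at most four attempts per message.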

(2) Multi-terminal message synchronization

Clients include PC browsers, a Windows client, H5, and iOS/Android. The system allows a user to be online on several terminals at the same time, and a single terminal can have multiple states. Messages therefore have to stay in sync across terminals, users, and states.

We use Redis Hash storage to record user information, with a value per unique connection that includes the connection ID, client IP, server ID, role, channel, and so on. This lets us find all of a user’s connections across terminals by key (UID), and locate a single connection by key + field.
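To make the key and key + field lookups concrete, here is a sketch that mimics the layout with an in-memory map instead of Redis, so it runs without any external service. The `ConnInfo` field names and the choice of the connection ID as the hash field are assumptions based on the attributes listed above.

```go
package main

import (
	"fmt"
	"sort"
)

// ConnInfo is a hypothetical record for one connection; the field set
// follows the attributes listed in the text.
type ConnInfo struct {
	ConnID   string
	ClientIP string
	ServerID string // which Worker holds the connection
	Role     string // e.g. "customer" or "merchant"
	Channel  string // e.g. "ios", "web", "desktop"
}

// Registry mimics one Redis Hash per user: key = UID, field = connection ID.
// It is not safe for concurrent use; Redis would provide that in production.
type Registry struct {
	hashes map[string]map[string]ConnInfo
}

func NewRegistry() *Registry {
	return &Registry{hashes: make(map[string]map[string]ConnInfo)}
}

// Put stores one connection, comparable to HSET uid connID value.
func (r *Registry) Put(uid string, c ConnInfo) {
	if r.hashes[uid] == nil {
		r.hashes[uid] = make(map[string]ConnInfo)
	}
	r.hashes[uid][c.ConnID] = c
}

// All returns every connection of a user, comparable to HGETALL uid.
func (r *Registry) All(uid string) []ConnInfo {
	out := make([]ConnInfo, 0, len(r.hashes[uid]))
	for _, c := range r.hashes[uid] {
		out = append(out, c)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ConnID < out[j].ConnID })
	return out
}

// One looks up a single connection, comparable to HGET uid connID.
func (r *Registry) One(uid, connID string) (ConnInfo, bool) {
	c, ok := r.hashes[uid][connID]
	return c, ok
}

// Remove deletes a closed connection, comparable to HDEL uid connID.
func (r *Registry) Remove(uid, connID string) {
	delete(r.hashes[uid], connID)
	if len(r.hashes[uid]) == 0 {
		delete(r.hashes, uid)
	}
}

func main() {
	r := NewRegistry()
	r.Put("u1", ConnInfo{ConnID: "c1", ServerID: "worker-1", Role: "customer", Channel: "ios"})
	r.Put("u1", ConnInfo{ConnID: "c2", ServerID: "worker-2", Role: "customer", Channel: "web"})
	for _, c := range r.All("u1") {
		fmt.Println(c.ConnID, c.ServerID, c.Channel)
	}
}
```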

2.5.2 Availability

As mentioned above, the two-layer design involves communication between two servers. Within a process, Go channels are enough; across processes, the options are a message queue or RPC. Weighing performance and server resource utilization, we chose RPC for inter-server communication. When selecting a Go-based RPC framework, we compared the following mainstream options:

  • Go standard RPC: the RPC package in the Go standard library; best performance, but no service governance

  • RPCX: roughly 2x the performance of gRPC, plus service governance

  • gRPC: cross-language, but performance not as good as RPCX

  • TarsGo: cross-language, roughly 5x the performance of gRPC; the drawback is that the framework is large and hard to integrate

  • Dubbo-go: lower performance; better suited to scenarios where Go and Java communicate

In the end we chose RPCX because it offers both good performance and service governance.

The two layers also need to find each other across processes, so we use etcd to implement service registration and discovery.

Without a registry, adding a Worker would mean managing its configuration through configuration files, which is cumbersome. And when a new Worker is added, the distribution module needs to discover it immediately, without delay.

When a new service comes online, the distribution module needs to become aware of it quickly. In the other direction, the key lease renewal mechanism handles failures: if no lease renewal is observed for a key within a certain period, the service is considered dead and is removed.
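The lease idea can be illustrated with a small in-memory simulation. This is not etcd's API: with etcd the lease and its expiry are handled by the registry itself and changes are observed through watches. The `LeaseRegistry` type and its methods are hypothetical, and time is passed in as plain seconds so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// LeaseRegistry is an in-memory stand-in for a registry with key leases.
// Times are plain seconds so the example stays deterministic.
type LeaseRegistry struct {
	ttl     int64
	expires map[string]int64 // Worker address -> lease expiry time
}

func NewLeaseRegistry(ttlSeconds int64) *LeaseRegistry {
	return &LeaseRegistry{ttl: ttlSeconds, expires: make(map[string]int64)}
}

// KeepAlive registers a Worker or renews its lease.
func (r *LeaseRegistry) KeepAlive(addr string, now int64) {
	r.expires[addr] = now + r.ttl
}

// Alive drops Workers whose lease has run out and returns the rest, sorted.
func (r *LeaseRegistry) Alive(now int64) []string {
	var out []string
	for addr, exp := range r.expires {
		if exp <= now {
			delete(r.expires, addr) // deleting during range is allowed in Go
			continue
		}
		out = append(out, addr)
	}
	sort.Strings(out)
	return out
}

func main() {
	r := NewLeaseRegistry(10)
	r.KeepAlive("worker-1:8972", 0)
	r.KeepAlive("worker-2:8972", 0)
	r.KeepAlive("worker-1:8972", 8) // worker-1 renews, worker-2 goes silent
	fmt.Println(r.Alive(12))
}
```

The Dispatcher side would consult the live list (or, with a real registry, react to watch events) before choosing a Worker.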

When selecting the registry, we mainly evaluated etcd, ZooKeeper, and Consul. Their stress test results are as follows:

The results showed that etcd performed best. In addition, etcd has strong backing (Alibaba is among its contributors), belongs to the Go ecosystem, and is already used by our internal K8S cluster.

After weighing all of this, we chose etcd as the service registration and discovery component, and we run it in cluster mode: if one server fails, the other servers in the cluster keep serving normally.

With reliable communication between services and processes, plus etcd running in cluster mode, the IM service as a whole is highly available.

2.5.3 Scalability

Both the message distribution module and the message processing module can be scaled horizontally. When overall load is high, nodes can be added to share it, preserving message timeliness and service stability.

2.5.4 Security

For security, we set up a blacklist mechanism that can restrict a single UID or IP. For example, if the number of connections established by one UID exceeds a threshold, that UID is considered risky and service to it is suspended. If the UID keeps sending requests while suspended, the suspension is extended accordingly.
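One way to express the "suspend, and extend the suspension on further requests" rule is sketched below. The threshold, the penalty length, and the `Limiter` type are illustrative assumptions; time is passed in as plain seconds to keep the example deterministic.

```go
package main

import "fmt"

// Limiter suspends a UID whose connection count exceeds a threshold and
// extends the suspension when the UID keeps sending requests meanwhile.
type Limiter struct {
	maxConns  int
	penalty   int64            // suspension length in seconds
	conns     map[string]int   // live connections per UID
	blockedTo map[string]int64 // UID -> time the suspension ends
}

func NewLimiter(maxConns int, penaltySeconds int64) *Limiter {
	return &Limiter{
		maxConns:  maxConns,
		penalty:   penaltySeconds,
		conns:     make(map[string]int),
		blockedTo: make(map[string]int64),
	}
}

// Allow reports whether a new connection from uid is accepted at time now.
func (l *Limiter) Allow(uid string, now int64) bool {
	if until, ok := l.blockedTo[uid]; ok {
		if now < until {
			// A request during the suspension extends it.
			l.blockedTo[uid] = until + l.penalty
			return false
		}
		// Suspension is over. This sketch assumes the UID's old
		// connections were dropped when it was suspended.
		delete(l.blockedTo, uid)
		l.conns[uid] = 0
	}
	if l.conns[uid] >= l.maxConns {
		l.blockedTo[uid] = now + l.penalty
		return false
	}
	l.conns[uid]++
	return true
}

// Release is called when one of the UID's connections closes.
func (l *Limiter) Release(uid string) {
	if l.conns[uid] > 0 {
		l.conns[uid]--
	}
}

func main() {
	l := NewLimiter(2, 60)
	for _, now := range []int64{0, 1, 2, 30, 100, 200} {
		fmt.Println(now, l.Allow("u1", now))
	}
}
```

The same structure works for IP-based limits by keying the maps on the client IP instead of the UID.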

2.6 Performance optimization and pitfalls

2.6.1 Performance Optimization

(1) JSON codec

At first we used the standard library’s JSON codec. In pursuit of performance, we switched to json-iterator, open-sourced by Didi, which is compatible with Go’s native JSON codec while being significantly more efficient. The benchmark comparison is shown below:

(2) time.After

During stress testing we found that memory usage was very high. Using go tool pprof to analyze memory allocation by function, we found that time.After timers were being created continuously, and traced the problem to the heartbeat goroutine.

The original code is as follows:

The optimized code is:

The key point of the optimization is to avoid using select + time.After inside a for loop.
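As a rough illustration of this point (a sketch, not the team's actual code), the loop below allocates one `time.Timer` outside the loop and re-arms it with `Reset`, instead of calling `time.After` on every iteration. In Go releases before 1.23, each timer created by `time.After` stays in memory until it fires, so a busy loop can accumulate many of them. The function name `runHeartbeat` and the short `demoInterval` are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

const demoInterval = 10 * time.Millisecond // shortened for the example only

// runHeartbeat calls beat every interval until stop is closed.
//
// Pattern to avoid, because it creates a new timer on every iteration:
//
//	for {
//		select {
//		case <-stop:
//			return
//		case <-time.After(interval):
//			beat()
//		}
//	}
func runHeartbeat(interval time.Duration, stop <-chan struct{}, beat func()) {
	timer := time.NewTimer(interval) // allocated once, outside the loop
	defer timer.Stop()
	for {
		select {
		case <-stop:
			return
		case <-timer.C:
			beat()
			timer.Reset(interval) // safe: the channel was just drained
		}
	}
}

func main() {
	stop := make(chan struct{})
	n := 0
	runHeartbeat(demoInterval, stop, func() {
		n++
		if n == 3 {
			close(stop) // stop after three heartbeats
		}
	})
	fmt.Println("heartbeats sent:", n)
}
```

For a strictly periodic heartbeat, `time.NewTicker` is an equally reasonable choice; the timer form is shown because it is also easy to re-arm early.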

(3) Use of Map

We use a map to store connection information. We had already hit this pitfall in an earlier TCP socket project: Go’s map is not safe for concurrent use by multiple goroutines. When several goroutines read and write the same map at once, the runtime aborts with a fatal error: concurrent map read and map write.
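A common way to avoid this fatal error is to guard the map with a `sync.RWMutex`, sketched below. The `ConnTable` type and its value type are hypothetical, and the text does not say which remedy the team used; `sync.Map` from the standard library is another option for read-heavy workloads.

```go
package main

import (
	"fmt"
	"sync"
)

// ConnTable guards a plain map with a RWMutex so that many goroutines can
// read concurrently while writes are exclusive.
type ConnTable struct {
	mu    sync.RWMutex
	conns map[string]string // connection ID -> user ID (illustrative)
}

func NewConnTable() *ConnTable {
	return &ConnTable{conns: make(map[string]string)}
}

func (c *ConnTable) Set(id, uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.conns[id] = uid
}

func (c *ConnTable) Get(id string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	uid, ok := c.conns[id]
	return uid, ok
}

func (c *ConnTable) Delete(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.conns, id)
}

func (c *ConnTable) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.conns)
}

// fillConcurrently writes and reads n entries from n goroutines at once.
// With an unguarded map this access pattern can crash the process.
func fillConcurrently(tbl *ConnTable, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			id := fmt.Sprintf("conn-%d", i)
			tbl.Set(id, "u1")
			tbl.Get(id)
		}(i)
	}
	wg.Wait()
}

func main() {
	tbl := NewConnTable()
	fillConcurrently(tbl, 100)
	fmt.Println("connections stored:", tbl.Len())
}
```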

2.6.2 Pitfalls encountered

(1) Goroutine anomalies

Considering development cost and service stability, we built our WebSocket service on the gorilla/websocket library. One problem we ran into is that the write goroutine is not aware when the read goroutine exits. As a result, after the read goroutine exits, the write goroutine keeps running until it hits an error. On the surface this does not affect business logic, but it wastes backend resources. When coding, make sure the write goroutine is actively notified when the read goroutine exits. Under high concurrency, this small optimization saves a lot of resources.
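The notification can be as simple as a channel that the read loop closes on exit and the write loop selects on. The sketch below is library-independent: a closed `frames` channel stands in for the peer disconnecting, and the `session` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// session pairs a read loop and a write loop for one connection.
type session struct {
	send chan string   // messages queued for the client
	done chan struct{} // closed by the read loop when it exits
}

// readLoop consumes incoming frames until the peer goes away (modeled here
// by the frames channel being closed), then signals the writer.
func (s *session) readLoop(frames <-chan string) {
	defer close(s.done)
	for range frames {
		// A real server would decode and handle the frame here.
	}
}

// writeLoop exits as soon as the read loop is gone instead of lingering
// until a write fails.
func (s *session) writeLoop(write func(string)) {
	for {
		select {
		case <-s.done:
			return
		case msg := <-s.send:
			write(msg)
		}
	}
}

// run starts both loops and returns once both have finished.
func run(frames <-chan string, s *session, write func(string)) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); s.readLoop(frames) }()
	go func() { defer wg.Done(); s.writeLoop(write) }()
	wg.Wait()
}

func main() {
	frames := make(chan string)
	s := &session{send: make(chan string, 8), done: make(chan struct{})}
	close(frames) // the client disconnects immediately
	run(frames, s, func(string) {})
	fmt.Println("both loops exited")
}
```

With a real WebSocket connection, the read loop would typically exit when the library's read call returns an error, and the deferred close would fire in the same way.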

(2) Heartbeat design

We also took a few detours when developing the heartbeat feature. Initially the server sent heartbeats on a fixed schedule, but in real business scenarios we found it is better to send a heartbeat only when the server has been read-idle. While users are actively chatting, sending heartbeat frames just wastes effort and bandwidth.

Our suggestion: when you get stuck on a piece of code during business development, stop writing and first work through the logic in words against the business requirements. You will likely find that the implementation goes much more smoothly afterwards.
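A read-idle heartbeat can be sketched as a timer that is pushed back every time a frame is read, so it only fires after a quiet period. The function name `idleHeartbeat`, the channel-based model of incoming frames, and the short `demoIdle` value are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const demoIdle = 20 * time.Millisecond // shortened for the example only

// idleHeartbeat calls beat only after no frame has been read for idle.
// Every received frame pushes the next heartbeat further out. It returns
// when the frames channel is closed.
func idleHeartbeat(frames <-chan string, idle time.Duration, beat func()) {
	timer := time.NewTimer(idle)
	defer timer.Stop()
	for {
		select {
		case _, ok := <-frames:
			if !ok {
				return // connection closed
			}
			if !timer.Stop() {
				select { // drain a tick that fired while we were busy
				case <-timer.C:
				default:
				}
			}
			timer.Reset(idle)
		case <-timer.C:
			beat()
			timer.Reset(idle)
		}
	}
}

func main() {
	frames := make(chan string, 3)
	frames <- "hi"
	frames <- "is the tour still available?"
	frames <- "thanks"
	beats := 0
	idleHeartbeat(frames, demoIdle, func() {
		beats++
		close(frames) // end the demo after the first idle heartbeat
	})
	fmt.Println("idle heartbeats:", beats)
}
```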

(3) Daily log splitting

During initial evaluation, we chose Uber’s open source zap library for logging, based on performance considerations and because it meets our business logging requirements. The choice of logging library matters a great deal; a poor choice affects system performance and stability. The advantages of zap include:

  • Line numbers: we need them in log output, which zap supported and Logrus did not at the time of our evaluation. Line numbers are important for locating problems.

  • zap is more efficient than Logrus: it writes JSON logs with a built-in, reflection-free JSON encoder that concatenates strings directly through explicitly typed calls, minimizing overhead.

A small pitfall:

zap does not currently support writing one log file per day out of the box, so you need to write your own code for it or ask the systems department for support.
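One way to write that code yourself is an `io.Writer` that opens a new file whenever the date changes, sketched below using only the standard library. The `DailyWriter` type, the `prefix-YYYY-MM-DD.log` naming, and the injectable clock are assumptions; a writer like this could be handed to a logger through something like zap's `zapcore.AddSync`, though that wiring is not shown here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// DailyWriter is an io.Writer that switches to a new file when the date
// changes. The file naming scheme is an assumption for this sketch.
type DailyWriter struct {
	mu     sync.Mutex
	dir    string
	prefix string
	now    func() time.Time // injectable clock; time.Now in production
	day    string
	file   *os.File
}

func NewDailyWriter(dir, prefix string) *DailyWriter {
	return &DailyWriter{dir: dir, prefix: prefix, now: time.Now}
}

// Write appends p to the file for the current day, rotating if needed.
func (w *DailyWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	day := w.now().Format("2006-01-02")
	if w.file == nil || day != w.day {
		if w.file != nil {
			w.file.Close()
			w.file = nil
		}
		name := filepath.Join(w.dir, w.prefix+"-"+day+".log")
		f, err := os.OpenFile(name, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
		if err != nil {
			return 0, err
		}
		w.file, w.day = f, day
	}
	return w.file.Write(p)
}

// Close releases the current file, if any.
func (w *DailyWriter) Close() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.file == nil {
		return nil
	}
	err := w.file.Close()
	w.file = nil
	return err
}

// rotateDemo writes one line before and one after a simulated midnight and
// returns how many log files were created.
func rotateDemo() (int, error) {
	dir, err := os.MkdirTemp("", "dailylog")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	clock := time.Date(2024, 1, 1, 23, 59, 0, 0, time.UTC)
	w := NewDailyWriter(dir, "im")
	w.now = func() time.Time { return clock }
	defer w.Close()
	if _, err := w.Write([]byte("before midnight\n")); err != nil {
		return 0, err
	}
	clock = clock.Add(2 * time.Minute) // crosses into the next day
	if _, err := w.Write([]byte("after midnight\n")); err != nil {
		return 0, err
	}
	files, err := filepath.Glob(filepath.Join(dir, "im-*.log"))
	return len(files), err
}

func main() {
	n, err := rotateDemo()
	fmt.Println("log files created:", n, err)
}
```

The date check runs on every write, which keeps the writer simple; cleanup of old files and size-based rotation are left out of this sketch.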

Part.3 Performance

Stress test 1:

This test ran in the online production environment, together with the business side. The full flow of the customized travel business was already connected, and we wrote a client that simulates sending heartbeat frames at regular intervals. Using Docker, we started 50 containers, each simulating 20,000 connections, for a total of one million connections to a single server. Memory usage on the single node was about 30 GB.

Stress test 2:

With 3,000, 4,000, and 5,000 simultaneous connections and the sending frequency adjusted, the corresponding uplink volumes were 600,000, 800,000, 1 million, and 2 million, with a log structure of about 6K each.

Half of the traffic was heartbeat packets and half was log structures. Downlink latency under the different loads was as follows:

Conclusion: as uplink concurrency increases, latency stays between 24 and 66 ms, so downlink delivery sees only a slight delay. In addition, with 600,000 5K uplink running at the same time, another script opened 50 goroutines to push 1K downlink bodies concurrently; latency rose by about 40 ms compared with the run without concurrent downlink.

Part.4 Summary

Built on WebSocket, the Go-based IM service is designed as a two-layer architecture made up of a message distribution module and a message processing module. Business logic is handled up front, before messages enter the messaging layer, which keeps the instant messaging service clean and stable. At the same time, the HTTP interface of the message distribution module makes it easy to integrate from any programming language, so every business line can connect to the IM service quickly.

Finally, a word of support for Go. Many people know that the Hornet’s Nest architecture is primarily based on PHP, and that some core businesses are migrating to Java. Meanwhile, Go is playing a role in a growing number of projects. Cloud native is becoming a mainstream trend, and Go is the main development language of many core projects used to build cloud native applications, such as Kubernetes, Docker, Istio, etcd, and Prometheus, as well as the third-generation open source distributed database TiDB.

So Go can fairly be called the native language of the cloud era. “The cloud native era is the best time for developers.” In this wave, the sooner we get into Go, the sooner we may secure a key position in the new era. We hope more people will join us in developing with and learning Go, broaden their skill set, and embrace cloud native.

Author of this article: Anti Walker, R&D engineer on the Hornet’s Nest travel e-commerce transaction infrastructure platform.