HTTP is a stateless request/response protocol built on TCP. Requests can only be initiated by the client and answered by the server. In most scenarios this pull model is sufficient. However, in scenarios such as message push and notifications, data must be synchronized to the client in real time, which requires the server to be able to push data actively.

Server push technology has a long history and evolved through short polling and long polling. These approaches solve the problem to some extent but have shortcomings such as poor timeliness and wasted resources. The WebSocket specification, introduced alongside the HTML5 standard, largely ended this situation and has become the mainstream server push solution.

Integrating WebSocket into a system is simple, and discussion and material on the topic are plentiful. However, there is not yet a mature solution for implementing a general-purpose WebSocket push gateway. Cloud vendors currently focus on push for iOS, Android, and other mobile platforms and offer little support for WebSocket. This article shares our thinking and experience in implementing a WebSocket long-connection gateway based on Netty.

1 Current WebSocket usage at iQiyi

The platform described here is iQiyi's platform for content creation, distribution, and monetization, covering businesses such as self-media, web content, web dramas, children's content, knowledge, and documentaries. It is an important part of iQiyi's content ecosystem. As a user-facing system, it has high requirements for user experience, which directly affects the enthusiasm of creators. At present, it uses WebSocket push in several business scenarios, including:

  • User comments. Comments are pushed to the browser in real time.
  • Real-name authentication. Before signing a contract, the user must pass real-name authentication. The user scans a QR code and is taken to a third-party authentication page. After authentication completes, the browser is notified of the result asynchronously.
  • Liveness detection. Similar to real-name authentication: when liveness detection completes, the result is pushed to the browser asynchronously.

In actual business development, we found the following problems with how WebSocket push was used. First, the WebSocket technology stack was not unified: some implementations were based on Netty and others on the web container, which made development and maintenance difficult. Second, WebSocket implementations were scattered across projects and tightly coupled to business systems. Any other business that needed WebSocket faced repeated development, wasted cost, and low efficiency. Third, WebSocket is a stateful protocol. A client connects to only one node in the cluster and communicates with that node for the lifetime of the connection, so a WebSocket cluster must solve the problem of session sharing. A single-node deployment avoids the problem, but it cannot scale horizontally to support higher load and is a single point of failure.

Finally, monitoring and alarms were missing. The number of WebSocket long connections can be roughly estimated from the number of Linux socket connections, but the figure is inaccurate, and business-level indicators such as the number of users are unavailable. The implementations also could not be integrated with the existing microservice monitoring to achieve unified monitoring and alarms.

2 Design and implementation of the long-connection gateway

To solve these problems, we implemented a unified WebSocket long-connection gateway with the following characteristics:

1. Centralized long-connection management and push capability. A unified technology stack turns long connections into a shared base capability, which eases feature iteration, upgrades, and maintenance.

2. Service decoupling. Business logic is separated from long-connection communication, so business systems no longer deal with communication details, avoiding repeated development and wasted R&D cost.

3. Ease of use. The gateway provides an HTTP push channel that any development language can call. A business system needs only a simple call to push data, which improves R&D efficiency.

4. Distributed architecture. The gateway runs as a multi-node cluster and scales horizontally to meet business growth. The failure of a node does not affect overall service availability, ensuring high reliability.

5. Multi-terminal message synchronization. Users may be logged in from multiple browsers or tabs at the same time, and messages are delivered to all of them.

6. Multi-dimensional monitoring and alarms. Custom monitoring metrics connect to the existing microservice monitoring system, so problems are reported promptly and service stability is maintained.

2.1. Technology selection

Among the many WebSocket implementations, we chose Netty after considering performance, scalability, and community support. Netty is a high-performance, event-driven, asynchronous, non-blocking network communication framework that is widely used in well-known open-source software.

WebSocket is stateful, so a cluster cannot freely load-balance each message across nodes as it can with plain HTTP requests. Once a long connection is established, the session is held by one node on the server side, so the cluster needs a way to find out which node a session belongs to. There are two solutions. One is to use a registry to maintain the mapping between sessions and nodes. The other is to broadcast an event to every node and let each node determine whether it holds the session. The two schemes are compared in Table 1.

Plan | Advantages | Disadvantages
Registry | The session mapping is explicit; better suited to large clusters | Complex to implement; strong dependency on the registry; additional operational cost
Event broadcast | Simple to implement; more lightweight | With many nodes, every message is broadcast to all nodes, which wastes resources

Table 1: WebSocket clustering scheme

Considering implementation cost and cluster size, we chose the lightweight event broadcast scheme. Broadcast can be implemented with RocketMQ message broadcast, Redis Publish/Subscribe, ZooKeeper notifications, and other approaches; their advantages and disadvantages are shown in Table 2. Weighing throughput, real-time performance, persistence, and implementation difficulty, we chose RocketMQ.

Plan | Advantages | Disadvantages
RocketMQ | High throughput, high availability, and reliability | Not as real-time as Redis
Redis | Highly real-time; simple to implement | Delivery is not guaranteed
ZooKeeper | Simple to implement | Poor write performance; unsuitable for frequent writes

Table 2: Comparison of broadcast implementations

2.2. System architecture

The overall architecture of the gateway is shown in Figure 1.

Figure 1: WebSocket long-connection gateway architecture

The overall process of the gateway is as follows:

1. The client performs a handshake with any node of the gateway to establish a long connection, and the node adds the client to the long-connection queue it maintains in memory. The client periodically sends heartbeat messages to the server. If no heartbeat arrives within the configured period, the long connection is considered broken: the server closes the connection and clears the session from memory.

2. When a business system needs to push data to a client, it sends the data to the gateway through the HTTP interface that the gateway provides.

3. After receiving the push request, the gateway writes the message to RocketMQ.

4. The gateway, acting as a consumer, consumes the message in broadcast mode, so every node receives it.

5. After receiving the message, each node checks whether the message target is in the long-connection queue held in its own memory. If it is, the node pushes the data over the long connection; otherwise, it ignores the message.
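The five steps above can be sketched in a few lines of Python. This is only an in-process illustration, not the gateway's code: the `Broker` class stands in for RocketMQ broadcast consumption, a connection is modelled as a plain list, and all class and method names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a message queue consumed in broadcast mode."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        # Broadcast consumption: every subscribed node receives every message.
        for handler in self.subscribers:
            handler(message)


class GatewayNode:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        # uid -> list of connections held in this node's memory.
        # Each connection is modelled as a list that collects pushed data.
        self.connections = defaultdict(list)
        broker.subscribe(self.on_message)

    def connect(self, uid):
        """Step 1: a client establishes a long connection with this node."""
        outbox = []
        self.connections[uid].append(outbox)
        return outbox

    def push(self, uid, data):
        """Steps 2-3: accept a push request and write it to the broker."""
        self.broker.publish({"uid": uid, "data": data})

    def on_message(self, message):
        """Steps 4-5: deliver only if the target is connected to this node."""
        for outbox in self.connections.get(message["uid"], []):
            outbox.append(message["data"])


broker = Broker()
node_a = GatewayNode("a", broker)
node_b = GatewayNode("b", broker)
alice = node_a.connect("alice")
# The push request lands on a node that does not hold the session.
node_b.push("alice", "hello")
```

The push request reaches `node_b`, but because every node consumes the broadcast, `node_a` finds the session in its own memory and delivers the data, while `node_b` ignores it.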

The gateway is a multi-node cluster. Each node holds a portion of the long connections, which provides load balancing. When facing a massive number of connections, the load can be spread by adding nodes, achieving horizontal scaling. In addition, when a node fails, its clients try to re-handshake with another node to establish a new long connection, which preserves overall service availability.
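The heartbeat timeout described in step 1 of the process can be illustrated as follows. This is a minimal sketch under stated assumptions: the `HeartbeatTracker` class and the 30-second timeout are hypothetical, and time is passed in explicitly so the behavior is deterministic. In a Netty application this role is commonly played by `IdleStateHandler`; the article does not say which mechanism the gateway uses.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per connection and reports overdue ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # connection id -> time of the last heartbeat

    def heartbeat(self, conn_id, now):
        # Called whenever a heartbeat message arrives on a connection.
        self.last_seen[conn_id] = now

    def expire(self, now):
        """Remove and return the connections whose heartbeat is overdue."""
        dead = [
            conn_id
            for conn_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        ]
        for conn_id in dead:
            del self.last_seen[conn_id]
        return dead


tracker = HeartbeatTracker(timeout_seconds=30)
tracker.heartbeat("conn-1", now=0)
tracker.heartbeat("conn-2", now=20)
expired = tracker.expire(now=40)  # conn-1 has been silent for 40 s, conn-2 for 20 s
```

The server would close each connection returned by `expire` and clear its session from memory.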

2.3. Session Management

After a long connection is established, the session is kept in the memory of the node that holds it. The SessionManager component is responsible for session management and internally uses a hash table to map each UID to a UserSession. A UserSession represents a session at the user level. Because a user may hold several long connections at the same time, UserSession in turn uses a hash table to map each Channel to a ChannelSession. To prevent a user from creating an unlimited number of long connections, UserSession closes the earliest ChannelSession once the number of ChannelSessions exceeds a configured limit, which reduces server resource usage. The relationship between SessionManager, UserSession, and ChannelSession is shown in Figure 2.
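A rough sketch of this two-level structure is shown below. The class names follow the article, but the method names, the use of string channel IDs in place of Netty `Channel` objects, and the default limit of three connections per user are assumptions made for illustration.

```python
from collections import OrderedDict


class ChannelSession:
    """One long connection; `open` stands in for the real channel state."""

    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.open = True

    def close(self):
        self.open = False


class UserSession:
    def __init__(self, uid, max_channels):
        self.uid = uid
        self.max_channels = max_channels
        # channel id -> ChannelSession, kept in insertion (oldest-first) order.
        self.channels = OrderedDict()

    def add(self, channel_id):
        """Register a connection and evict the earliest ones over the limit."""
        self.channels[channel_id] = ChannelSession(channel_id)
        evicted = []
        while len(self.channels) > self.max_channels:
            _, oldest = self.channels.popitem(last=False)
            oldest.close()
            evicted.append(oldest)
        return evicted


class SessionManager:
    def __init__(self, max_channels_per_user=3):
        self.max_channels_per_user = max_channels_per_user
        self.users = {}  # uid -> UserSession

    def register(self, uid, channel_id):
        if uid not in self.users:
            self.users[uid] = UserSession(uid, self.max_channels_per_user)
        return self.users[uid].add(channel_id)

    def unregister(self, uid, channel_id):
        user = self.users.get(uid)
        if user is None:
            return
        session = user.channels.pop(channel_id, None)
        if session is not None:
            session.close()
        if not user.channels:
            # Drop the user entry once its last connection is gone.
            del self.users[uid]

    def connection_count(self):
        return sum(len(user.channels) for user in self.users.values())

    def user_count(self):
        return len(self.users)
```

Keeping the per-user map in insertion order makes "close the earliest connection" a constant-time operation, and the two counters are the kind of values a monitoring component could read.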

Figure 2: The SessionManager component

2.4. Monitoring and alarm

The gateway provides basic monitoring and alarm capabilities that show how many long connections are established in the cluster and how many users they belong to. The gateway integrates Micrometer and exposes the number of connections and the number of users as custom metrics for Prometheus to collect, which connects it to the existing microservice monitoring system. The number of connections, number of users, JVM, CPU, memory, and other metrics can be viewed in Grafana to understand the gateway's current capacity and load. Alarm rules can also be configured in Grafana to trigger alarms on odd message (an internal alarm platform) when the data is abnormal.
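To make the two custom metrics concrete, the sketch below renders a connection count and a user count as gauges in the Prometheus text exposition format, which is what Prometheus reads when it scrapes a service. The gateway itself does this through Micrometer in Java; this Python function only illustrates the shape of the output, and the metric names `gateway_connections` and `gateway_users` are hypothetical.

```python
def render_metrics(connection_count, user_count):
    """Render two gauges in the Prometheus text exposition format."""
    gauges = [
        # (metric name, help text, current value); names are hypothetical.
        ("gateway_connections", "Current number of long connections", connection_count),
        ("gateway_users", "Current number of connected users", user_count),
    ]
    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


metrics_text = render_metrics(connection_count=5, user_count=2)
```

Gauges fit here because both values can go down as well as up when connections close.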

2.5. Performance pressure test

Two 4-core, 16 GB VMs were used, one as the server and one as the client. During the pressure test, the gateway opened 20 ports and 20 clients were started. Each client used one server port to establish 50,000 connections, so one million connections existed at the same time. The number of connections and memory usage are shown in Figure 3.

Figure 3: One million connections

Sending one message to all one million long connections at once, using a single thread, takes the server about 10 seconds on average, as shown in Figure 4.

Figure 4: Server push time

In practice, the number of long connections that one user holds at the same time is in the single digits. Taking 10 long connections per user as an example, with a concurrency of 600 and a duration of 120 s, the TPS of the push interface is roughly 1,600 or higher, as shown in Figure 5.

Figure 5: Pressure test data for 10 long connections, concurrency 600, duration 120 s
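The figures in this section can be put side by side with a little arithmetic. The inputs below are taken from the text; the last calculation assumes that each push call is delivered to all 10 of the user's connections, which the article implies but does not state outright.

```python
# Connection count: 20 server ports, 50,000 connections per port.
ports = 20
connections_per_port = 50_000
total_connections = ports * connections_per_port

# Broadcast of one message to every connection, single thread, about 10 s.
broadcast_seconds = 10
writes_per_second = total_connections / broadcast_seconds

# Push interface: reported TPS with 10 long connections per user.
push_tps = 1600
connections_per_user = 10
# Assumption: every push call is written to all of the user's connections.
connection_writes_per_second = push_tps * connections_per_user
```

This gives one million connections, on the order of 100,000 connection writes per second for the single-threaded broadcast, and about 16,000 connection writes per second behind the push interface under the stated assumption.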

These performance figures meet the current business scenarios of iQiyi and can support future business growth.

3 Business Cases

To illustrate the effect more concretely, we close with a case in which iQiyi uses the WebSocket gateway: adding filter effects to a cover image.

When publishing a video, creators on iQiyi can choose to add a filter effect to the cover image, which guides them toward providing better covers. When a user selects a cover image, an asynchronous background processing task is submitted. When the task completes, the images processed with the different filter effects are returned to the browser through WebSocket. The business scenario is shown in Figure 6.

Figure 6: iQiyi video cover filter

In terms of R&D efficiency, integrating WebSocket into a business system takes at least 1-2 days. With the gateway's push capability, data push is achieved with a simple interface call, development time drops to minutes, and R&D efficiency improves greatly. In terms of operation and maintenance cost, the business system no longer contains communication details unrelated to business logic, so the code is more maintainable, the system architecture is simpler, and operation and maintenance costs are greatly reduced.
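As an illustration of what such an interface call might look like from a business system, the sketch below builds an HTTP POST for the gateway. The article does not document the gateway's API, so the endpoint URL, the JSON field names, and the payload are all hypothetical; the request is built but not sent, so the example runs without a network.

```python
import json
import urllib.request


def build_push_request(gateway_url, uid, payload):
    """Build (but do not send) an HTTP POST asking the gateway to push data."""
    # Hypothetical body layout: target user plus the data to deliver.
    body = json.dumps({"uid": uid, "data": payload}).encode("utf-8")
    return urllib.request.Request(
        gateway_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_push_request(
    "http://gateway.example.internal/push",  # hypothetical endpoint
    uid="creator-42",
    payload={"type": "cover_filter_done", "images": ["filtered-1.jpg"]},
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Because the channel is plain HTTP, the same call can be made from any language with an HTTP client, which is the point of the design.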

4 Closing remarks

WebSocket is the mainstream technology for server push. Used properly, it improves system responsiveness and user experience. The WebSocket long-connection gateway lets a system gain data push capability quickly, reduces operation and maintenance costs, and improves development efficiency.

The value of the long-connection gateway lies first in encapsulating the WebSocket communication details and decoupling them from business systems, so that the gateway and the business systems can be optimized and iterated independently, which avoids repeated development and eases maintenance. Second, the gateway provides a simple, easy-to-use HTTP push channel that can be called from multiple development languages, which eases system integration. In addition, the gateway adopts a distributed architecture that provides horizontal scaling, load balancing, and high availability. Finally, the gateway integrates monitoring and alarms, so that it can warn promptly when the system is abnormal and keep the service healthy and stable.


At present, the WebSocket long-connection gateway is used in iQiyi business scenarios such as image filter result notification and MCN electronic seals. Many areas remain to be explored, such as message resending and ACK, support for WebSocket binary data, and multi-tenant support. We will continue to optimize and enrich its features to give developers a better experience.
