HTTP is a stateless, TCP-based request/response protocol. Requests can only be initiated by the client and answered by the server. In most scenarios this pull model is sufficient. However, in scenarios such as message push and notifications, data must reach the client in real time, which requires the server to be able to push data actively.

Server-side push technology has a long history and evolved through short polling and long polling. These approaches solve the problem to some extent, but they fall short in timeliness and waste resources. The WebSocket specification that arrived with HTML5 largely put an end to this situation and is now the mainstream solution for server push.

Integrating WebSocket into a single system is simple. However, there is not yet a mature scheme for building a general-purpose WebSocket push gateway. Cloud service vendors currently focus on push for mobile platforms such as iOS and Android and offer little support for WebSocket. This article introduces our thinking and experience in implementing a WebSocket long-connection gateway based on Netty.

1 Current state of WebSocket usage at iQiyi

The iQiyi creator platform is iQiyi’s platform for content creation, distribution and monetization, covering businesses such as “we media”, “online movies”, “web dramas”, “children”, “knowledge” and “documentary”. It is an important component of iQiyi’s content ecosystem. As a user-facing system, it has high requirements for user experience, which directly affects creators’ enthusiasm. At present, several of its business scenarios use **WebSocket push technology**, including:

  • **User comments.** Comment messages are pushed to the browser in real time.

  • **Real-name authentication.** Users must complete real-name authentication before signing a contract. The user scans a QR code and enters a third-party authentication page. When authentication is complete, the browser is notified of the result asynchronously.

  • **Liveness detection.** Similar to real-name authentication: when liveness detection is complete, the result is pushed to the browser asynchronously.

In actual business development, we found the following problems with the way WebSocket push was used. **First**, the WebSocket technology stack was not unified: some implementations were based on Netty and others on the web container, which made development and maintenance difficult. **Second**, WebSocket implementations were scattered across projects and tightly coupled to the business systems. Any other business that needed WebSocket faced repeated development, wasted cost and low efficiency. **Third**, WebSocket is a stateful protocol. A client connects to only one node in the cluster and communicates only with that node for the lifetime of the connection, so a WebSocket cluster has to solve the problem of session sharing. Deploying a single node avoids the problem, but a single node cannot scale horizontally to support higher load and is a single point of failure.

**Finally**, monitoring and alerting were missing. The number of WebSocket long connections can be roughly estimated from the number of socket connections in Linux, but that number is not accurate, and it reveals nothing about metrics with business meaning such as the number of users. It also could not be integrated with the existing microservice monitoring to achieve unified monitoring and alerting.

2 Design and implementation of the long-connection gateway

In order to solve the above problems, we have implemented a unified WebSocket long-connection gateway with the following characteristics:

**1. Centralized long-connection management and push.** A unified technology stack turns long connections into a shared base capability that is easy to iterate on, upgrade and maintain.

**2. Decoupled from business logic.** Business logic is separated from long-connection communication, so business systems no longer deal with communication details, and repeated development and wasted R&D cost are avoided.

**3. Easy to use.** The gateway provides an HTTP push channel that can be called from any development language. A business system pushes data with a simple call, which improves R&D efficiency.

**4. Distributed architecture.** A multi-node cluster supports horizontal scaling to meet the challenges of business growth. The failure of one node does not affect overall availability, which keeps the service highly reliable.

**5. Multi-client message synchronization.** A user can be online in several browsers or tabs at the same time, and messages are delivered to all of them.

**6. Multi-dimensional monitoring and alerting.** Custom metrics are integrated with the existing microservice monitoring system, so problems trigger alerts promptly and the service stays stable.

2.1. Technology selection

After comparing the many WebSocket implementations on performance, scalability and community support, we chose Netty. Netty is a high-performance, event-driven, asynchronous, non-blocking network communication framework that is widely used in well-known open source software.

Because WebSocket is stateful, it cannot be load balanced across a cluster request by request the way plain HTTP can. Once a long connection is established, its session lives on one server node. In a cluster, there are therefore two ways to find out which node holds a session. One is to use a registry that maintains the global mapping between sessions and nodes. The other is to use event broadcast and let each node determine for itself whether it holds the session. The two schemes are compared in Table 1.

| Scheme | Advantages | Disadvantages |
| --- | --- | --- |
| Registry | The session mapping is explicit; suitable for larger clusters | Complex to implement, strongly dependent on the registry, and adds operation and maintenance cost |
| Event broadcast | Simple to implement and more lightweight | With many nodes, every message is broadcast to all nodes, which wastes resources |

Table 1: WebSocket clustering schemes

Considering the implementation cost and our cluster size, we chose the lightweight event broadcast scheme. Broadcast can be implemented with RocketMQ message broadcasting, Redis Publish/Subscribe, ZooKeeper notifications, and so on. The advantages and disadvantages of these options are shown in Table 2. Weighing throughput, real-time performance, persistence and implementation difficulty, we selected RocketMQ.

| Scheme | Advantages | Disadvantages |
| --- | --- | --- |
| RocketMQ | High throughput, high availability and reliability | Not as real-time as Redis |
| Redis | High real-time performance and simple to implement | Reliability is not guaranteed |
| ZooKeeper | Simple to implement | Poor write performance; unsuitable for frequent writes |

Table 2: Comparison of broadcast implementation schemes

2.2. System architecture

The overall architecture of the gateway is shown in Figure 1.

Figure 1: WebSocket long-connection gateway architecture

The overall process of the gateway is as follows:

1. The client completes a handshake with any gateway node to establish a long connection, and the node adds the connection to a session queue kept in its memory. The client periodically sends heartbeat messages to the server. If the server receives no heartbeat within the specified time, it considers the connection broken, closes it and clears the session from memory.

2. When a business system needs to push data to a client, it sends the data to the gateway through the HTTP interface that the gateway provides.

3. After receiving the push request, the gateway writes the message to RocketMQ.

4. Each gateway node, acting as a consumer, consumes messages in broadcast mode, so every node receives every message.

5. After receiving a message, a node checks whether the message’s target is in the long-connection queue kept in its own memory. If it is, the node pushes the data over that connection; otherwise it ignores the message.
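Steps 3 to 5 can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: `PushMessage`, `GatewayNode` and `Broadcaster` are hypothetical names, the in-memory `Broadcaster` stands in for RocketMQ's broadcast consumption, and a plain list (an "outbox") stands in for each WebSocket channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical push payload: the target user and the message body. */
final class PushMessage {
    final String uid;
    final String body;

    PushMessage(String uid, String body) {
        this.uid = uid;
        this.body = body;
    }
}

/** One gateway node; it only knows the connections established with itself. */
final class GatewayNode {
    // uid -> outboxes; each outbox stands in for one WebSocket channel.
    private final Map<String, List<List<String>>> localSessions = new ConcurrentHashMap<>();

    /** Registers a new connection for the user and returns its outbox. */
    List<String> connect(String uid) {
        List<String> outbox = new CopyOnWriteArrayList<>();
        localSessions.computeIfAbsent(uid, k -> new CopyOnWriteArrayList<>()).add(outbox);
        return outbox;
    }

    /** Called for every broadcast message; returns the number of local deliveries. */
    int onBroadcast(PushMessage msg) {
        List<List<String>> channels = localSessions.get(msg.uid);
        if (channels == null) {
            return 0; // the target is not connected to this node: ignore
        }
        for (List<String> channel : channels) {
            channel.add(msg.body);
        }
        return channels.size();
    }
}

/** In-memory stand-in for a message queue consumed in broadcast mode. */
final class Broadcaster {
    private final List<GatewayNode> nodes = new ArrayList<>();

    void subscribe(GatewayNode node) {
        nodes.add(node);
    }

    /** Every subscribed node receives every message. */
    int publish(PushMessage msg) {
        int delivered = 0;
        for (GatewayNode node : nodes) {
            delivered += node.onBroadcast(msg);
        }
        return delivered;
    }
}
```

The trade-off from Table 1 is visible here: the publisher never needs to know which node holds a user, but every node does a lookup for every message, including the nodes that end up ignoring it.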

The gateway nodes form a multi-node cluster. Each node is responsible for a share of the long connections, which balances the load. When the number of connections grows very large, more gateway nodes can be added to share the load and scale horizontally. At the same time, when a node goes down, its clients try to re-establish long connections with other nodes, which keeps the service as a whole available.
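The heartbeat rule from step 1 can be sketched with JDK classes only. This is an illustrative assumption about how such a check might work, not the gateway's implementation: `HeartbeatMonitor` and its methods are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks the last heartbeat per connection and reports the ones that timed out. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // channelId -> timestamp (ms) of the most recent heartbeat
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat (or the initial handshake) for a connection. */
    void onHeartbeat(String channelId, long nowMillis) {
        lastSeen.put(channelId, nowMillis);
    }

    /** Removes and returns the connections whose last heartbeat is older than the timeout. */
    List<String> sweep(long nowMillis) {
        List<String> expired = new ArrayList<>();
        lastSeen.forEach((channelId, seen) -> {
            // remove(key, value) only succeeds if no newer heartbeat arrived meanwhile
            if (nowMillis - seen > timeoutMillis && lastSeen.remove(channelId, seen)) {
                expired.add(channelId);
            }
        });
        return expired;
    }

    int activeCount() {
        return lastSeen.size();
    }
}
```

A caller would run `sweep` periodically and close every returned connection. In a Netty pipeline the same effect is usually obtained with `IdleStateHandler`, which fires an `IdleStateEvent` when no data has been read for a configured time; the article does not say which approach the gateway uses.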

2.3. Session Management

After a long connection is established, its session is kept in the memory of the node that accepted it. The SessionManager component is responsible for session management and internally uses a hash table to map each UID to a UserSession. A UserSession represents a session at the user level. Because a user may hold several long connections at the same time, a UserSession also uses a hash table internally, mapping each Channel to a ChannelSession. To prevent a user from creating an unlimited number of long connections, a UserSession closes its earliest ChannelSession when the number of ChannelSessions it holds exceeds a threshold, which reduces server resource usage. The relationship between SessionManager, UserSession and ChannelSession is shown in Figure 2.

Figure 2: SessionManager component
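The structure described above could look roughly like the following sketch. The class names come from the article, but every field, method and the string channel id are assumptions made for illustration; a real ChannelSession would wrap a Netty `Channel`.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** One long connection (stand-in for a wrapper around a Netty Channel). */
final class ChannelSession {
    final String channelId;
    private volatile boolean open = true;

    ChannelSession(String channelId) { this.channelId = channelId; }

    boolean isOpen() { return open; }

    void close() { open = false; }
}

/** All connections of one user, kept in insertion order so the earliest is evicted first. */
final class UserSession {
    private final int maxChannels;
    private final LinkedHashMap<String, ChannelSession> channels = new LinkedHashMap<>();

    UserSession(int maxChannels) { this.maxChannels = maxChannels; }

    synchronized void add(ChannelSession session) {
        channels.put(session.channelId, session);
        while (channels.size() > maxChannels) {
            Iterator<ChannelSession> it = channels.values().iterator();
            ChannelSession earliest = it.next();
            it.remove();
            earliest.close(); // over the threshold: shut down the earliest connection
        }
    }

    /** Removes one connection; returns true when the user has no connections left. */
    synchronized boolean remove(String channelId) {
        ChannelSession removed = channels.remove(channelId);
        if (removed != null) {
            removed.close();
        }
        return channels.isEmpty();
    }

    synchronized int size() { return channels.size(); }
}

/** Node-local registry: UID -> UserSession. */
final class SessionManager {
    private final Map<String, UserSession> users = new ConcurrentHashMap<>();
    private final int maxChannelsPerUser;

    SessionManager(int maxChannelsPerUser) { this.maxChannelsPerUser = maxChannelsPerUser; }

    void register(String uid, ChannelSession session) {
        // compute() is atomic per key, so a concurrent unregister cannot drop this session
        users.compute(uid, (k, us) -> {
            UserSession target = (us != null) ? us : new UserSession(maxChannelsPerUser);
            target.add(session);
            return target;
        });
    }

    void unregister(String uid, String channelId) {
        // returning null removes the user entry once its last connection is gone
        users.computeIfPresent(uid, (k, us) -> us.remove(channelId) ? null : us);
    }

    int userCount() { return users.size(); }

    int connectionCount() {
        return users.values().stream().mapToInt(UserSession::size).sum();
    }
}
```

A `LinkedHashMap` is used for the inner table because it preserves insertion order, which makes "the earliest ChannelSession" cheap to find. `userCount()` and `connectionCount()` are the kind of values that section 2.4 exposes as monitoring metrics.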

2.4. Monitoring and alerting

The gateway provides basic monitoring and alerting, so that we know how many long connections the cluster currently holds and how many users they belong to. The gateway integrates Micrometer and exposes the number of connections and the number of users as custom metrics for Prometheus to collect, which connects it to the existing microservice monitoring system. In Grafana, the connection count, user count, JVM, CPU, memory and other metrics can be viewed easily to understand the gateway’s current capacity and load. Alert rules can also be configured in Grafana to trigger an alert through the internal alerting platform when the data is abnormal.

2.5. Performance pressure test

For the pressure test, we used two virtual machines, each with 4 cores and 16 GB of memory, one as the server and one as the client. The gateway listened on 20 ports and 20 clients were started at the same time. Each client established 50,000 connections to one server port, so one million connections were held concurrently. The connection count and memory usage are shown in Figure 3.

Figure 3: One million connections

Sending one message to all one million long connections from a single thread takes the server about 10 s on average, as shown in Figure 4.

Figure 4: Server push time

In practice, the number of long connections that one user holds at the same time is in the single digits. Taking 10 long connections per user as an example, with a concurrency of 600 and a test duration of 120 s, the push interface reaches a TPS of about 1,600+, as shown in Figure 5.

Figure 5: Pressure test data with 10 long connections per user, a concurrency of 600 and a duration of 120 s
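The figures in this section can be cross-checked with a little arithmetic. The sketch below only restates the reported numbers; `PressureTestMath` and its method names are hypothetical, and the last calculation assumes that each push request fans out to all 10 of a user's connections, which the article implies but does not state. A likely reason for spreading the test over 20 server ports is that a TCP connection is identified by source address, source port, destination address and destination port, so one client address can open at most 65,535 connections to a single server address and port.

```java
/** Back-of-the-envelope checks for the pressure test figures. */
final class PressureTestMath {
    /** Upper bound on source ports for one client IP talking to one server IP and port. */
    static final int MAX_TCP_PORTS = 65_535;

    private PressureTestMath() {}

    /** Total connections when each server port accepts the same number of connections. */
    static long totalConnections(int serverPorts, int connectionsPerPort) {
        if (connectionsPerPort > MAX_TCP_PORTS) {
            throw new IllegalArgumentException("one client IP cannot open that many connections to one port");
        }
        return (long) serverPorts * connectionsPerPort;
    }

    /** Average delivery rate of a broadcast to all connections. */
    static long messagesPerSecond(long messages, long seconds) {
        return messages / seconds;
    }

    /** WebSocket writes per second, assuming every push reaches all of a user's connections. */
    static long writesPerSecond(long pushTps, int connectionsPerUser) {
        return pushTps * connectionsPerUser;
    }
}
```

With the reported values, 20 ports with 50,000 connections each give 1,000,000 connections, one million messages in 10 s is 100,000 messages per second from a single thread, and 1,600 TPS with 10 connections per user corresponds to about 16,000 WebSocket writes per second.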

These performance figures meet the needs of iQiyi’s current business scenarios and leave room for future business growth.

3 Business Cases

To illustrate the effect more concretely, we close with a case in which iQiyi uses the WebSocket gateway: adding filter effects to cover images.

When an iQiyi “we media” creator publishes a video, they can choose to add a filter effect to the cover image, which encourages creators to provide better covers. When the user selects a cover image, an asynchronous background processing task is submitted. When the task completes, the images processed with the different filter effects are returned to the browser through WebSocket. The business scenario is shown in Figure 6.

Figure 6: iQiyi video cover image filter

From the perspective of R&D efficiency, integrating WebSocket into a business system takes at least 1-2 days. With the gateway’s push capability, data push requires only a simple interface call, so development time drops to a matter of minutes and R&D efficiency improves greatly. From the perspective of operation and maintenance cost, the business system no longer contains communication details unrelated to its business logic, so the code is more maintainable, the system architecture is simpler, and operation and maintenance cost is greatly reduced.
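To give a feel for how small the "simple interface call" can be, here is a sketch of what a push request might look like from a Java business system, using the JDK's `java.net.http` API. The article does not document the gateway's real interface, so the `/push` path, the `uid` and `message` fields and the host `gateway.example` are all assumptions, as is the helper class `PushRequests`.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Builds push requests for a hypothetical gateway HTTP interface. */
final class PushRequests {
    private PushRequests() {}

    /** Minimal JSON string escaping, so the sketch needs no JSON library. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Hypothetical request body: which user to notify and what to send. */
    static String payload(String uid, String message) {
        return "{\"uid\":" + jsonString(uid) + ",\"message\":" + jsonString(message) + "}";
    }

    /** Builds (but does not send) a POST to the assumed /push endpoint. */
    static HttpRequest build(String gatewayBaseUrl, String uid, String message) {
        return HttpRequest.newBuilder(URI.create(gatewayBaseUrl + "/push"))
                .timeout(Duration.ofSeconds(3))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload(uid, message), StandardCharsets.UTF_8))
                .build();
    }
}
```

A caller would then send the request with something like `HttpClient.newHttpClient().send(PushRequests.build("http://gateway.example", uid, body), HttpResponse.BodyHandlers.ofString())`. Because the channel is plain HTTP, the equivalent call is just as short in any other language.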

4 Final thoughts

WebSocket is currently the mainstream technology for server push. Used properly, it improves system responsiveness and user experience. With the WebSocket long-connection gateway, a system can quickly gain data push capability while reducing operation and maintenance cost and improving development efficiency.

The value of the long-connection gateway lies in four things. First, it encapsulates the details of WebSocket communication and decouples them from the business systems, so the gateway and the business systems can be optimized and iterated independently, which avoids repeated development and eases development and maintenance. Second, the gateway **provides a simple, easy-to-use HTTP push channel** that supports access from many development languages and makes system integration easy. Third, the gateway **uses a distributed architecture**, which provides horizontal scaling, load balancing and high availability. Finally, the gateway **integrates monitoring and alerting**, so abnormal conditions raise timely warnings and the service stays healthy and stable.

At present, the **WebSocket long-connection gateway is used in iQiyi business scenarios such as image filter result notification and MCN electronic seals**. There is still much to explore, such as message retransmission and ACK, WebSocket binary data support and multi-tenant support. We will continue to optimize and extend the gateway to give developers a better experience.