Introduction

In a well-built instant messaging application, WebSocket is a critical link: it gives the client and server of a Web application a full-duplex communication channel. However, neither WebSocket itself nor the underlying TCP connection is fully stable, so developers have to design a complete scheme for keeping the connection alive, checking that it is alive, and reconnecting. Only with such a scheme can real-time behavior and high availability be guaranteed in practice. As for reconnection, its speed directly affects the “immediacy” of the upper-layer application and the user experience. Imagine that a full minute after the network comes back, WeChat still cannot send or receive messages. Wouldn’t that drive you mad?

Therefore, quickly restoring the availability of a WebSocket connection when the network changes is particularly important.

A quick look at WebSocket

WebSocket was created in 2008, became an international standard in 2011, and is now supported by all major browsers. It is an application-layer protocol in its own right, designed specifically for true full-duplex communication between Web clients and servers.

WebSocket can be compared with HTTP. Their differences:

· The URL scheme of HTTP is http, and that of WebSocket is ws

· HTTP requests can only be initiated by the client, so the server cannot actively push messages to the client; with WebSocket it can

· HTTP requests are subject to the same-origin policy, and communication between different origins requires cross-origin handling, while WebSocket has no same-origin restriction

Similarities:

· Both are communication protocols of the application layer

· They use the same default ports: 80, or 443 over TLS

· Both can be used for communication between browser and server

· Both are based on TCP

Relationship to TCP: both HTTP and WebSocket are application-layer protocols that run on top of a TCP connection.

Breaking down the reconnection process

Let’s start with a question: when do we need to reconnect?

The most obvious case is that the WebSocket connection has been broken, and we need to connect again before we can send and receive messages. In many scenarios, however, the connection has not been closed but is in fact unusable: the device switches networks, a router somewhere along the path fails, or the server is under such high load that it cannot respond. In these scenarios the WebSocket is not disconnected, yet the upper layer cannot send or receive data normally. So before reconnecting, we need a mechanism that senses whether the connection and the service are available, and senses it quickly, so that we can recover quickly from the unavailable state.

Once the connection is found to be unusable, we can discard it and close it, then initiate a new connection. These two steps look simple, but doing them quickly is not easy.

The first step is to close the old connection. How can the client do this quickly? According to the protocol, a WebSocket connection is closed only after the client and server have negotiated the close. But if the client cannot reach the server and so cannot negotiate, how can it close the connection and recover quickly?

The second step is to initiate the new connection quickly. “Quickly” here does not mean connecting immediately, which would put unexpected load on the server. Reconnection is usually initiated after a delay chosen by a backoff algorithm. But how do you balance the reconnection interval against the performance cost? How do you initiate the connection quickly at the “right point in time”?

With these questions in mind, let’s take a closer look at these three processes.

Quickly sensing when to reconnect

Reconnection scenarios fall into three categories: the connection is broken, the connection is not broken but is unusable, and the service on the peer end is unavailable.

In the first scenario, the connection is simply broken and must be reconnected.

For the latter two, whether the connection is unusable or the service is unavailable, the effect on the upper-layer application is the same: it can no longer send or receive instant messages. From this perspective, a simple and crude way to sense that a reconnect is needed is a heartbeat timeout: the client sends a heartbeat packet, and if no reply arrives from the server within a specified period, the service is considered unavailable. This method is the most direct. But to sense quickly this way, you have to send heartbeats more often, and heartbeats that are too frequent consume too much traffic and power on mobile devices. So this method cannot deliver fast sensing; it is better used as a fallback mechanism for checking the connection and the service.
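As a rough illustration of the heartbeat-timeout check described above, the sketch below records when a heartbeat was sent and whether a reply has arrived since. The class name `HeartbeatMonitor` and its methods are assumptions made for this example and are not part of any WebSocket API. Timestamps are passed in explicitly so the logic does not depend on a particular timer.

```typescript
// Hypothetical helper: decides whether a heartbeat reply is overdue.
class HeartbeatMonitor {
  private timeoutMs: number;
  // Time the oldest unanswered heartbeat was sent, or null if none is pending.
  private pendingSince: number | null = null;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call right after sending a heartbeat packet.
  markSent(nowMs: number): void {
    if (this.pendingSince === null) {
      this.pendingSince = nowMs;
    }
  }

  // Call when the server's heartbeat reply arrives.
  markReply(): void {
    this.pendingSince = null;
  }

  // True when a heartbeat has gone unanswered for longer than the timeout.
  isTimedOut(nowMs: number): boolean {
    return this.pendingSince !== null && nowMs - this.pendingSince > this.timeoutMs;
  }
}
```

A caller would typically invoke `markSent` when it writes the heartbeat, `markReply` in its message handler, and poll `isTimedOut` to decide whether to start a reconnect.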



To sense an unusable connection quickly, besides heartbeat detection we can also watch the network status, because losing the network or switching networks is a direct cause of the connection becoming unusable.

To sum up, sending heartbeat packets periodically is stable and covers every scenario, but it is slow to react. Judging by network status is fast and sensitive, with no need to wait for a heartbeat interval, but it covers only some scenarios. We can therefore combine the two: send heartbeat packets periodically at a moderate frequency, such as once every 40 or 60 seconds depending on the application, and also send one immediately when the network status changes from offline to online to check whether the current connection is still usable. In most cases the upper-layer application can then recover quickly from the unavailable state, and in the remaining cases the periodic heartbeat acts as a backstop, so recovery happens within one heartbeat cycle.
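The combined scheme can be sketched as a small scheduler that answers one question: should a heartbeat be sent right now? The names `LinkStatus` and `HeartbeatScheduler` are hypothetical, and restarting the periodic cycle after an immediate heartbeat is a design assumption of this sketch, not something the scheme requires.

```typescript
type LinkStatus = "online" | "offline";

// Hypothetical scheduler combining a periodic heartbeat with an
// immediate heartbeat on the offline -> online transition.
class HeartbeatScheduler {
  private intervalMs: number;
  private lastSentAt: number;
  private linkStatus: LinkStatus;

  constructor(intervalMs: number, nowMs: number, initialStatus: LinkStatus) {
    this.intervalMs = intervalMs;
    this.lastSentAt = nowMs;
    this.linkStatus = initialStatus;
  }

  // Call on a regular timer tick. Returns true if the periodic heartbeat is due.
  onTick(nowMs: number): boolean {
    if (nowMs - this.lastSentAt >= this.intervalMs) {
      this.lastSentAt = nowMs;
      return true;
    }
    return false;
  }

  // Call whenever the network status is reported. Returns true only on the
  // offline -> online transition, when a heartbeat should be sent at once.
  onNetworkStatus(newStatus: LinkStatus, nowMs: number): boolean {
    const recovered = this.linkStatus === "offline" && newStatus === "online";
    this.linkStatus = newStatus;
    if (recovered) {
      // Assumption: the immediate heartbeat restarts the periodic cycle.
      this.lastSentAt = nowMs;
    }
    return recovered;
  }
}
```

In a browser, the `online` and `offline` events on `window` are one possible source for `onNetworkStatus`; whenever either method returns true, the caller sends a heartbeat and waits for the reply or a timeout.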

Quickly disconnecting the old connection

In general, if the old connection still exists when the next connection is about to be initiated, the old one should be closed first. This releases resources on both the client and the server, and it prevents data from still being sent or received over the old connection.

WebSocket transmits data over TCP. The two ends of the connection are the server and the client, and the TCP TIME_WAIT state, which is held by whichever side closes the TCP connection first, is meant to be held by the server. Therefore, in most normal cases, the server rather than the client should initiate the close of the underlying TCP connection. That is, when closing a WebSocket connection, a server that is told to close it should close the TCP connection immediately, while a client that is told to close it should signal the server and then wait until the server closes the underlying TCP connection or a timeout expires.

When the client wants to close the old WebSocket, there are two cases: the connection is usable, or it is not. If the old connection is usable, the client can send the close signal to the server directly, and the server then initiates the disconnect. If the old connection is unusable, for example because the client has switched Wi-Fi networks, the client sends the close signal but the server never receives it, and the client can only wait for the timeout before the connection is allowed to close. Closing by timeout takes a relatively long time, so is there a way to disconnect faster?

The upper-layer application cannot change the rule that only the server initiates the disconnect, so it has to work at the level of application logic. For example, it can use its business logic to make sure the old connection is completely inert, treat it as disconnected (a simulated disconnect), and then initiate a new connection to resume communication. This amounts to abandoning the old connection instead of trying to close it, so the next step can begin right away. When doing this, it is essential that the old connection can no longer affect the business logic: any data still arriving on the old connection is discarded, the old connection does not block the establishment of the new one, and its eventual close on timeout does not affect the new connection or the upper-layer business logic.
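One way to make an abandoned connection harmless is to have every handler check that its socket is still the active one. The sketch below is a minimal illustration under that assumption; `SocketLike` and `ConnectionManager` are hypothetical names, and `SocketLike` only mimics the few members this example needs rather than the full browser `WebSocket` interface.

```typescript
// Minimal socket shape assumed for this sketch.
interface SocketLike {
  onmessage: ((data: string) => void) | null;
  onclose: (() => void) | null;
  close(): void;
}

class ConnectionManager {
  private current: SocketLike | null = null;
  readonly received: string[] = [];
  disconnectCount = 0;

  // Make `socket` the active connection.
  attach(socket: SocketLike): void {
    this.current = socket;
    socket.onmessage = (data: string) => {
      // Discard data arriving on a socket that is no longer active.
      if (socket !== this.current) return;
      this.received.push(data);
    };
    socket.onclose = () => {
      // A late close of an abandoned socket must not touch the new one.
      if (socket !== this.current) return;
      this.current = null;
      this.disconnectCount += 1;
    };
  }

  // Abandon the active socket without waiting for the close handshake.
  abandon(): void {
    const old = this.current;
    if (old === null) return;
    this.current = null;        // the upper layer now sees "disconnected"
    this.disconnectCount += 1;  // simulated disconnect
    old.onmessage = null;
    old.onclose = null;
    old.close();                // best effort; may only complete on timeout
  }

  isConnected(): boolean {
    return this.current !== null;
  }
}
```

After `abandon()` returns, the caller can attach a new socket right away; whatever the old socket does later is ignored.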

Quickly initiating the new connection

Anyone with IM development experience knows that when a reconnect is caused by network problems, a new connection must never be initiated immediately. Otherwise, whenever the network jitters, every device connects to the server at the same moment, which has the same effect as a denial-of-service attack in which an attacker consumes network bandwidth with a flood of requests. For the server this is a disaster. So a backoff algorithm is usually used to delay the reconnect for some time, as shown in the process on the left of the figure below.
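The text does not say which backoff algorithm is used. As one common possibility, the sketch below uses exponential backoff with random jitter, so that many clients that lost the network at the same moment do not all retry at the same moment. The function name, the parameters, and the choice of jitter (half fixed, half random) are assumptions of this example.

```typescript
// Delay before retry number `attempt` (0-based).
// The ceiling doubles with each attempt up to maxMs; the actual delay is
// drawn from [ceiling / 2, ceiling) so it is never zero and never in lockstep.
function jitteredBackoffMs(
  attempt: number,
  baseMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}
```

Passing `random` as a parameter is only there to make the function easy to test; in normal use the default `Math.random` is enough.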



What if you want to reconnect quickly? The most direct method is to shorten the retry interval: the shorter the interval, the sooner communication is restored once the network recovers. But retrying too often wastes a great deal of performance, bandwidth, and power. How do we make a good trade-off?

A reasonable approach is to increase the retry interval gradually as the number of retries grows and, at the same time, to monitor network changes. When the network status changes from offline to online, a reconnect is more likely to succeed, so the interval can be shortened appropriately, as shown on the right of the figure above (from there the interval again grows with the number of retries). The two methods can be used together.

In addition, the business logic can adjust the interval according to how likely the reconnect is to succeed. For example, the interval can be made longer while the network is down or the application is in the background, and shorter when the network is normal, to speed up reconnection.
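A minimal sketch of the two ideas used together: the delay grows with each retry and drops back when the network comes back online. `ReconnectPolicy` and its methods are hypothetical names, and resetting all the way to the shortest interval on recovery is just one way to "shorten appropriately".

```typescript
// Hypothetical retry policy: interval grows per retry, shrinks on recovery.
class ReconnectPolicy {
  private baseMs: number;
  private maxMs: number;
  private attempt = 0;

  constructor(baseMs: number, maxMs: number) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
  }

  // Delay to wait before the next reconnect attempt; doubles each call up to maxMs.
  nextDelayMs(): number {
    const delay = Math.min(this.maxMs, this.baseMs * Math.pow(2, this.attempt));
    this.attempt += 1;
    return delay;
  }

  // Network changed from offline to online: a retry is more likely to
  // succeed, so start again from the shortest interval.
  onNetworkRecovered(): void {
    this.attempt = 0;
  }

  // Connection established: the next outage starts from the shortest interval.
  onConnected(): void {
    this.attempt = 0;
  }
}
```

With `new ReconnectPolicy(1000, 30000)`, successive calls to `nextDelayMs()` give 1000, 2000, 4000, 8000, 16000, 30000, 30000, and after `onNetworkRecovered()` the sequence starts again at 1000.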
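The adjustment by application state might look like the following. The `ClientState` shape and the concrete millisecond values are purely illustrative assumptions; real values depend on the application.

```typescript
// Hypothetical snapshot of what the business logic knows about the client.
interface ClientState {
  networkOnline: boolean;
  appInForeground: boolean;
}

// Pick a base retry interval from how likely a reconnect is to succeed.
// All numbers here are placeholder values for illustration.
function retryBaseIntervalMs(state: ClientState): number {
  if (!state.networkOnline) return 30000;   // network down: success unlikely
  if (!state.appInForeground) return 10000; // in background: less urgent
  return 1000;                              // network normal: retry soon
}
```

The returned value can serve as the starting interval for whatever backoff algorithm the application uses.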

Conclusion

To conclude, this article broke WebSocket disconnection and reconnection into three steps: determining when to reconnect, disconnecting the old connection, and initiating the new one. It then analyzed how to complete each step quickly under different WebSocket states and network conditions. First, send heartbeat packets periodically to check whether the current connection is usable, and also listen for network recovery events, sending a heartbeat immediately after recovery to sense the current state quickly and decide whether a reconnect is needed. Second, under normal circumstances the server closes the old connection; when the server cannot be reached, the old connection is simply abandoned and the upper layer simulates the disconnect, which achieves a fast disconnect. Finally, a backoff algorithm delays the new connection for a period of time. To balance wasted resources against reconnection speed, the interval can be lengthened while the network is offline and shortened when the network is normal or has just changed from offline to online, so that the connection is restored as soon as possible.

