As a leading audio and video service provider, NetEase Yunxin is committed to delivering a top-tier audio and video calling experience and providing users with reliable audio and video services in all kinds of harsh environments. Providing reliable audio and video services under extremely weak network conditions is a top priority for NetEase Yunxin. This article describes the architecture and technical solutions NetEase Yunxin adopts to improve the timeliness of reliable data in weak network environments.
Introduction
Most traditional audio and video services on the market build reliable data transmission on top of TCP, but TCP's own characteristics bring some inherent drawbacks:
- Low transmission efficiency
TCP's conservative sending behavior (slow start and loss-based backoff) makes transmission slow and inefficient, especially on weak networks.
- High connection-setup latency
The three-way handshake built into TCP's design means the first connection takes a long time to establish, delaying the appearance of the first screen.
- Poor resistance to weak networks
TCP's reliable-transmission semantics mean that even a small amount of packet loss sharply degrades throughput and, in severe cases, causes the link to disconnect.
- Severe head-of-line blocking
Packets are delivered in order, so the loss of a packet with a small sequence number blocks all subsequent packets until retransmission succeeds.
As a result, in weak network environments traditional audio and video services see reliable data links disconnect before the media links do, and signaling is delayed, hurting the user experience. Improving the reliability and timeliness of the reliable data link is a problem every audio and video service provider needs to solve. With the development of technology, the now-popular QUIC protocol emerged.
QUIC overview and advantages
QUIC stands for Quick UDP Internet Connections. It is a protocol proposed by Google in 2013 that uses UDP for multiplexed, concurrent transmission. QUIC creates a connection between two endpoints over UDP and supports multiplexing within that connection. From the start, QUIC aimed to provide SSL/TLS-level network security, reduce the latency of data transmission and connection creation, and control bandwidth in both directions to avoid network congestion. Let’s take a look at QUIC’s advantages over TCP.
- **Simplified TLS handshake, reducing the RTT of the first connection**
The biggest optimization in the QUIC protocol is simplifying the handshake to 0/1 RTT. A full HTTPS handshake takes 3 RTT, while a QUIC connection takes at most 1 RTT. The following figure contrasts TCP’s cumbersome handshake process with QUIC’s reduced 0-1 RTT process.
- With the multi-stream design, head-of-line blocking on one stream does not block data on another stream.
Application-layer protocol data is exchanged through streams. Each stream carries an ordered sequence of bytes, but there is no ordering relationship between streams; streams are isolated and independent of each other. Packet loss or unreachability on one stream does not affect the data of other streams in the connection. This design avoids TCP-style head-of-line blocking and lets us isolate data of different priorities.
- Improved congestion control
TCP congestion control operates per connection: all traffic is governed by a single congestion control module. QUIC, in addition, can apply flow control per stream.
- Supports dynamic connection migration
Connection migration means that when any element of the connection four-tuple changes, the connection survives and business logic continues uninterrupted. QUIC can support connection migration because it no longer identifies a connection by the four-tuple but by a 64-bit random number, the Connection ID. Even if the IP address or port changes, as long as the Connection ID stays the same, the connection is maintained: the upper-layer business logic sees no change, nothing breaks, and no reconnection is needed.
- Forward error correction via redundant packets, reducing retransmissions
QUIC can send redundant packets for higher-priority data. Lost packets can then be reconstructed from the redundant packets, reducing retransmissions and improving transmission efficiency.
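The forward-error-correction idea can be illustrated with a minimal single-parity XOR sketch (illustrative only; a production FEC scheme over QUIC packets would be more sophisticated than this):

```python
def make_parity(packets):
    """Build one XOR parity packet over a group of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Recover a single lost packet by XOR-ing the parity with all received packets.

    `received` maps packet index -> payload; at most one packet of the
    original group may be missing.
    """
    missing = bytearray(parity)
    for pkt in received.values():
        for i, b in enumerate(pkt):
            missing[i] ^= b
    return bytes(missing)

# Example: a group of three packets plus one parity packet.
group = [b"hello", b"world", b"quic!"]
parity = make_parity(group)

# Packet 1 ("world") is lost in transit; rebuild it from the rest.
received = {0: group[0], 2: group[2]}
print(recover(received, parity))  # b'world'
```

Because the lost packet is computed locally from redundancy, no retransmission round trip is needed, which is exactly where the efficiency gain on weak networks comes from.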
All of these QUIC advantages are attractive across the Internet industry, and NetEase Yunxin has designed its own acceleration proxy architecture around them.
The acceleration architecture of NetEase Yunxin's reliable link
Drawing on these advantages of QUIC, NetEase Yunxin developed its own QUIC acceleration proxy service to proxy reliable data transmission between clients and edge server nodes and to improve the timeliness of reliable data. The following figure shows the architecture of the NetEase Yunxin acceleration proxy.
As the figure shows, NetEase Yunxin runs the QUIC and TCP protocols in parallel, and the client can establish either a QUIC link or a TCP link. Under normal conditions, clients prefer QUIC and reach the back end through the QUIC acceleration service. When the QUIC connection fails, the client falls back to the standby TCP link and connects to the back end directly. Keeping the original TCP path covers user scenarios where UDP is unavailable and ensures high availability of the service.
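The QUIC-first, TCP-fallback selection can be sketched as follows (a simplified illustration; the actual client logic, timeouts, and connector interfaces in the Yunxin SDK are not public, so the names here are hypothetical):

```python
def connect_with_fallback(quic_connect, tcp_connect):
    """Prefer the QUIC link; fall back to a direct TCP link if QUIC fails.

    `quic_connect` and `tcp_connect` are hypothetical callables that
    return a connection object or raise OSError on failure.
    """
    try:
        return "quic", quic_connect()
    except OSError:
        # e.g. UDP blocked by a LAN firewall: all QUIC packets are dropped,
        # so the standby TCP link keeps the service available.
        return "tcp", tcp_connect()

# Usage with stub connectors: QUIC is blocked, so TCP is chosen.
def blocked_quic():
    raise OSError("UDP unreachable")

transport, conn = connect_with_fallback(blocked_quic, lambda: "tcp-conn")
print(transport)  # tcp
```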
Design rationale of the NetEase Yunxin acceleration architecture
We adopted this architecture for the following reasons:
- Accelerating the last mile
The last mile to the user is the link most prone to weak-network failures, so accelerating reliable data on this link yields outsized returns.
- UDP/TCP dual insurance for high availability
UDP is disabled on the firewalls of some LANs, so UDP packets are unreachable on those networks and all UDP-based QUIC packets are discarded. In such environments only TCP transmission can be expected, so NetEase Yunxin keeps TCP as the backup data link.
- The acceleration service is isolated from the business and agnostic to business data
As a transparent proxy, the NetEase Yunxin acceleration service is only responsible for receiving QUIC packets, unpacking them, and relaying them to the back end. Because the acceleration proxy does not inspect the QUIC payload, it generalizes well and can accelerate data transmission for many services that need it.
- Compatibility with existing architecture
Providing flexible upgrade policies for existing customers is one of NetEase Yunxin's product upgrade principles, so deploying this architecture does not affect existing users: they can keep their original links, while new users adopt the accelerated link by default.
Next, let’s take a detailed look at the architecture of the NetEase Yunxin acceleration server.
Architecture of the data acceleration server
The data acceleration proxy server is divided into two modules: the QUIC pre-proxy module and the post-proxy module.
QUIC pre-proxy module
The QUIC pre-proxy module starts a UDP listener to receive QUIC packets from client users. The pre-proxy module is mainly responsible for the following work:
- Receives UDP packets of the QUIC protocol from the client
- Unpacks received QUIC packets and filters out redundant copies
- Packs outgoing packets according to the QUIC protocol
- Calculates bandwidth and redundancy according to network conditions and sends redundant packets
- Verifies the integrity of received packets
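The redundancy-filtering step above can be sketched as a simple sequence-number de-duplication filter (illustrative only; the real pre-proxy works on QUIC packet numbers and would keep bounded state rather than an ever-growing set):

```python
class RedundancyFilter:
    """Pass each packet upstream once, dropping redundant duplicate copies.

    With FEC-style redundancy, several copies of the same packet may
    arrive; they are identified here by a sequence number.
    """
    def __init__(self):
        self.seen = set()

    def accept(self, seq):
        if seq in self.seen:
            return False  # redundant copy: filter it out
        self.seen.add(seq)
        return True

f = RedundancyFilter()
arrivals = [1, 2, 2, 3, 1, 4]          # packets 1 and 2 arrive twice
delivered = [s for s in arrivals if f.accept(s)]
print(delivered)  # [1, 2, 3, 4]
```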
QUIC post-proxy module
The QUIC post-proxy module is responsible for establishing TCP connections to the back end or initiating HTTP requests to it. Its main tasks are as follows:
- Initiates connection requests to the back end according to front-end requests
- Packs the business data packets from the front-end proxy and relays them transparently to the back-end server
- Receives business packets from the back-end server, compresses and checksums them, and hands them to the front-end proxy module, which forwards them over QUIC
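The pack-compress-verify path can be sketched as a length-prefixed frame with zlib compression and a CRC32 checksum (a sketch of the idea only; the actual wire format is not public and the layout here is an assumption):

```python
import struct
import zlib

def pack_frame(payload: bytes) -> bytes:
    """Frame a business payload: compress, checksum, then length-prefix."""
    body = zlib.compress(payload)
    crc = zlib.crc32(body)
    # Hypothetical header: 4-byte body length + 4-byte CRC32, big-endian.
    return struct.pack("!II", len(body), crc) + body

def unpack_frame(frame: bytes) -> bytes:
    """Verify and decompress a frame; raise if the CRC check fails."""
    length, crc = struct.unpack("!II", frame[:8])
    body = frame[8:8 + length]
    if zlib.crc32(body) != crc:
        # A CRC failure means corrupted data; the connection is torn down.
        raise ValueError("CRC mismatch: corrupted frame")
    return zlib.decompress(body)

msg = b'{"cmd":"join_room","uid":12345}' * 4   # repetitive JSON compresses well
frame = pack_frame(msg)
assert unpack_frame(frame) == msg
assert len(frame) < len(msg)                    # compression paid off
```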
The front-end proxy and back-end server are usually deployed close together, or even on the same machine, so the delay from proxy to back-end server is almost negligible. The dominant delay is between the client and the front-end proxy, so optimization focuses on the QUIC transfer from the client to the front-end proxy.
Audio and video service optimizations based on the QUIC acceleration service
Building on the QUIC acceleration service, NetEase Yunxin has optimized mainly the following aspects:
- First-screen opening speed optimization
Establishing a connection between the NetEase Yunxin client and server costs 0-1 RTT, a great advantage over the 3 RTT of TCP+TLS+HTTP/2. After one successful connection, if the client caches or persists the configuration, a 0-RTT connection can be established by reusing it, combined with a locally generated private key for encrypted data transfer, with no further handshake. This saves users at least 2 RTT when opening the first screen. In remote areas with poor networks in particular, it can cut 100-300 ms off the first-screen delay and improve the user experience.
- Multi-stream design
Multiple streams are created under one QUIC link for the data of different application-layer protocols. The streams are isolated from one another, so head-of-line blocking in one priority's data queue does not hold up data of another priority.
- Signaling priority design
Higher-priority data is sent with higher redundancy, and lower-priority data with lower redundancy. Higher-priority data therefore arrives first, and lower-priority data does not block higher-priority signaling.
- Optional compression of transmitted data
The data proxy supports zlib compression. For character data such as signaling JSON, zlib can achieve a compression ratio of around 20%, reducing transmission bandwidth and alleviating bandwidth congestion.
- CRC checks on transmitted data
QUIC carries streaming data, so a CRC check is added to each piece of transmitted data. If the check fails, the connection is torn down, guaranteeing the reliability and accuracy of data transmission and preventing impact on the business.
- Dynamically adjusting redundancy according to network conditions
NACK packets are used to assess the condition of the full link. When NACKs occur, redundancy is increased; once the link stabilizes, redundancy is reduced again. This minimizes bandwidth usage, saves user bandwidth, and avoids queue congestion.
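The NACK-driven adjustment can be sketched as a small controller (illustrative only; the thresholds, step sizes, and cap below are made-up values, not Yunxin's actual parameters):

```python
class RedundancyController:
    """Raise redundancy when NACKs arrive; decay it while the link stays clean."""
    def __init__(self, base=0.1, step=0.1, cap=0.5):
        self.base, self.step, self.cap = base, step, cap
        self.ratio = base          # fraction of redundant packets to send
        self.clean_rounds = 0

    def on_round(self, nacks: int) -> float:
        if nacks > 0:
            # Loss detected on the link: increase redundancy immediately.
            self.ratio = min(self.cap, self.ratio + self.step)
            self.clean_rounds = 0
        else:
            self.clean_rounds += 1
            if self.clean_rounds >= 3:  # stable: back off to save bandwidth
                self.ratio = max(self.base, self.ratio - self.step)
        return self.ratio

rc = RedundancyController()
print(rc.on_round(nacks=2))   # 0.2  (loss -> more redundancy)
for _ in range(3):
    r = rc.on_round(nacks=0)
print(r)                       # 0.1  (stable -> back to the base ratio)
```

Reacting instantly to loss but decaying only after several clean rounds avoids oscillating on bursty links, which matches the goal of compensating heavy loss without wasting bandwidth.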
Weak-network performance of the NetEase Yunxin acceleration service
We ran a series of comparisons between accelerated and non-accelerated data transmission, especially in weak network environments.
First screen time and login time
The figure above shows the first-screen opening time and login time of Yunxin audio and video services over QUIC and TCP. First-screen opening improves by 20% and login by nearly 30%, a clearly visible effect. The main reason is QUIC's 0-1 RTT handshake; for some remote provinces in particular, it saves around 100 ms of delay.
Anti-packet loss capability
The figure above shows the maximum packet loss rate that Yunxin signaling data can withstand over QUIC and TCP links. QUIC still provides service at 70% upstream packet loss, and the downstream direction can withstand as much as 75%. The TCP link disconnects and reconnects at 45% packet loss. Compared with the TCP signaling link, the QUIC link is roughly 50% more loss-tolerant. The main reasons:
- NetEase Yunxin implements dynamic redundancy, increasing redundancy once packet loss is detected, so redundant packets compensate for heavy loss and improve loss resistance.
- QUIC’s improved flow control and congestion control algorithms give it a greater transmission advantage on weak networks.
Limited bandwidth
We also examined QUIC's bandwidth utilization under constrained bandwidth: QUIC generally achieves more than 90% utilization, whereas TCP fares much worse.
Test results
In summary, Yunxin achieves a 50% improvement in packet-loss resistance, a 20% improvement in first-screen opening speed, and 90% bandwidth utilization under constrained bandwidth, which is a leading level in the audio and video service industry.
Outlook & Summary
NetEase Yunxin has greatly improved reliable data transmission through reliable data acceleration, but some areas still need optimization:
- Currently, once packet loss occurs in one direction, redundancy is increased in both directions on both the server and the client, producing unnecessary redundancy. In the future, unidirectional packet loss will be detected so that redundancy is increased only on the lossy link.
- For high-RTT, high-packet-loss scenarios, the QUIC congestion control algorithm needs continuous optimization.
NetEase Yunxin will continue to provide users with quality service in all extreme situations in the audio and video field.
About the author
Ji Song is a senior audio and video server development engineer at NetEase Yunxin, responsible for Yunxin's live streaming business and the audio and video data link acceleration business, including TFboys live streams with 700,000 users. He has rich experience in audio and video data transmission and network data forwarding.
For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.