Communication is an enduring human pursuit. We have always been eager to break through the limits of time and space and to shorten the distance between people.

As RTC, live streaming and related technologies mature, more real-time and higher-quality communication is increasingly within reach. Combined with traditional IM messaging, "fusion communication" has become a hot field in recent years, and Yunxin is committed to becoming the industry's "first brand in fusion communication cloud". To achieve this goal, the transmission quality of communication data is critical, yet guaranteeing that quality over long distances and under complex network conditions has always been difficult. Beyond higher quality, lower cost and a more general, more flexible architecture are also important measures of how good a communication system is.

Against this background, Yunxin developed WE-CAN (Communications Acceleration Network), a new-generation large-scale distributed transmission network. It greatly improves end-to-end communication quality, reduces communication cost, and can be applied to a wide variety of scenarios.

WE-CAN not only has ambitious goals, an advanced architecture and a solid engineering implementation; it has also been fully deployed and validated in Yunxin's production business:

  1. It carries hundreds of billions of messages and hundreds of millions of minutes of media streaming data every day;
  2. It has multiple nodes in the major countries of the Asia-Pacific region, and nodes also cover India, the Middle East, Europe, North America, North Africa and other regions. Every province-level region in China has a large number of edge nodes, and the network covers 200+ regions worldwide;
  3. For domestic audio and video transmission, the in-network high-quality transmission rate exceeds 99.9% and the end-to-end high-quality transmission rate exceeds 99%;
  4. Cross-border communication approaches dedicated-line quality, with a worldwide delay of no more than 250 ms.

This article analyzes the architecture design and key technical challenges of WE-CAN from several angles.

WE-CAN definition

WE-CAN (Communications Acceleration Network) is a complex network system built on the public Internet. It improves data transmission quality and reduces data transmission cost through intelligent scheduling of various resources.

WE-CAN goals

As the foundation on which Yunxin builds its "first brand in fusion communication", WE-CAN has one fundamental goal: to create a general-purpose transport network that can carry arbitrary data stably, quickly and efficiently from any point in the world to any other. This network is built on the public Internet. It uses no special hardware and no dedicated lines, and achieves its goals through software.

Compared with similar products, the goals of WE-CAN are:

  • Faster than CDN
  • Cheaper than SD-WAN
  • More versatile than RTN

WE-CAN advantages

Compared with other similar products, WE-CAN has unique advantages. Compared with a CDN, WE-CAN not only distributes content at large scale and at the edge, but also does so faster. Compared with an RTN, WE-CAN supports not only the RTC streaming-media business but other transmission modes as well.

  • WE-CAN transmits streaming media with a high arrival rate and low delay. In addition to the QoS policies of the media itself, WE-CAN can apply optional, business-transparent redundancy policies such as ARQ and FEC, which are shared by all other WE-CAN transmission modes.
  • WE-CAN can also distribute live video at large scale. Path cascading and reuse remove the bottleneck on room size and reduce bandwidth cost, bringing cost close to that of a CDN while keeping real-time performance close to that of RTC, so that low-delay live streaming is better supported.
  • WE-CAN can also reliably transfer signaling, IM or other data. "Reliable transmission" here means that data is guaranteed to arrive and is delivered in order;
  • WE-CAN's services and protocols feature an industry-leading decoupled, layered design that is clean to implement, easy to use and flexible. For example, it abstracts and encapsulates the reliable transport protocol behind a minimal interface, which we call MessageBus. The goal of MessageBus is to provide a globally deployed distributed message queue service.
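The two guarantees of "reliable transmission" named above, arrival and ordering, can be illustrated with a receiver-side reorder buffer. This is a minimal sketch, not WE-CAN's actual protocol: the class `InOrderReceiver`, its method names, and the use of integer sequence numbers starting at 0 are all assumptions made for illustration.

```python
from typing import Dict, List


class InOrderReceiver:
    """Hypothetical receiver: buffers out-of-order packets, releases them in sequence."""

    def __init__(self) -> None:
        self.next_seq = 0                    # next sequence number to deliver
        self.pending: Dict[int, bytes] = {}  # packets that arrived early

    def receive(self, seq: int, data: bytes) -> List[bytes]:
        """Accept one packet and return every packet that is now deliverable, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []                        # duplicate (e.g. a retransmission): ignore
        self.pending[seq] = data
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def missing(self) -> List[int]:
        """Sequence numbers the sender would be asked to retransmit."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending)) if s not in self.pending]


rx = InOrderReceiver()
first = rx.receive(1, b"world")    # arrives early, held back
second = rx.receive(0, b"hello")   # fills the gap, both are released in order
rx.receive(3, b"!")                # packet 2 is still outstanding
gaps = rx.missing()
```

Ordering comes from the buffer, while arrival depends on the sender retransmitting whatever `missing()` reports.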

WE-CAN results

The following data comes from the WE-CAN production environment. It is an A/B comparison in which Yunxin RTC routes some channels through WE-CAN, while the media servers of the other channels connect to each other directly (without WE-CAN).

The figure below shows the end-to-end high-quality transmission rate over the past 13 days. As can be seen from the figure, the network arrival rate of channels that use WE-CAN increased significantly. Here the high-quality transmission rate is the proportion of all statistical windows in which the arrival rate is greater than 95%.
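The metric defined above can be written down directly. The sketch below is illustrative only: the function name and the sample window values are invented, and only the 95% threshold comes from the text.

```python
from typing import Sequence


def quality_transmission_rate(window_arrival_rates: Sequence[float],
                              threshold: float = 0.95) -> float:
    """Fraction of statistical windows whose arrival rate is greater than `threshold`."""
    if not window_arrival_rates:
        raise ValueError("at least one statistical window is required")
    good = sum(1 for rate in window_arrival_rates if rate > threshold)
    return good / len(window_arrival_rates)


# Made-up sample: four windows, one of which falls below the 95% threshold.
windows = [0.99, 0.97, 0.94, 1.0]
rate = quality_transmission_rate(windows)
```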

Stutter rate comparison

The figure below shows the overall audio stutter rate of Yunxin RTC over the past 13 days:

Delay comparison

The figure below shows the overall distribution of delay by tier over 24 hours:

WE-CAN architecture

Consider the process of sending a message from A to B. From the perspective of WE-CAN, this process can be divided into three stages:

  1. From client A to its access server A';
  2. From server A' to server B', the access server of client B;
  3. From server B' to client B.

So WE-CAN needs to optimize two different transport scenarios:

  1. S2S (Server to Server), that is, intra-network transmission.
  2. Last mile, that is, edge access.

The optimization of both transport scenarios can be further divided into two dimensions: quality and cost. In-network transport and edge access each call for their own ways of optimizing quality and cost, and for a balance between the two. From this perspective, the ultimate goal of WE-CAN is to provide high-quality, low-cost edge access and intra-network transport for all types of data transmission needs.

In-network transmission

Core nodes

The core system of the in-network transmission part of WE-CAN is mainly divided into three types of nodes: Edge, Relay and Controller.

Access node (A-Node): The Edge, acting as the access node, is the part of WE-CAN that is physically closest to the client.

Relay node: The Relay is responsible for relaying data within the network. Relay nodes form a full-mesh network with one another. Relaying is typically needed in the following cases:

  • The first case is long-distance transmission, especially cross-border transmission, where relaying is common and sometimes takes several hops. For example, when data from a data center in Guangzhou is sent to Los Angeles, WE-CAN may send it to Hong Kong first and then from Hong Kong to Los Angeles, or even route it from Guangzhou through Hong Kong and Singapore to Los Angeles. The actual path is determined by current network conditions.
  • The second case is a single-line A-Node crossing ISPs (network providers). For example, data on a Jiangsu Mobile A-Node has to be relayed to a Zhejiang Telecom node through a Jiangsu (or Zhejiang) three-line Relay.
  • The third common case is the failure or network jitter of one line in a multi-line data center. For example, if the Unicom egress of a Hangzhou three-line node fails, data from other domestic Unicom data centers can first be sent to the nearest three-line Relay node and then on to the Telecom or Mobile egress of the Hangzhou three-line node, so that the Hangzhou node keeps providing service despite its failed Unicom egress.

Apart from a few preset rules (for example, traffic between two nodes on different ISPs must be relayed), whether data between any two points in WE-CAN is relayed, and along which path, is decided by dynamic routing based on current network conditions.
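The Guangzhou to Los Angeles example can be sketched as a shortest-path search over measured link costs. The source does not say which routing algorithm WE-CAN uses, so Dijkstra's algorithm is a stand-in here, and the node names and cost figures are invented for illustration.

```python
import heapq


def best_path(links, src, dst):
    """Dijkstra over {(a, b): cost}; every link is treated as bidirectional.

    Returns (total_cost, [nodes...]) or None if dst is unreachable.
    """
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    heap = [(0, src, [src])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + link_cost, nxt, path + [nxt]))
    return None


# Hypothetical link costs (for example, a score derived from delay and loss).
links = {
    ("guangzhou", "los-angeles"): 220,
    ("guangzhou", "hong-kong"): 10,
    ("hong-kong", "los-angeles"): 160,
    ("hong-kong", "singapore"): 35,
    ("singapore", "los-angeles"): 170,
}
normal = best_path(links, "guangzhou", "los-angeles")

# If the Hong Kong to Los Angeles link degrades, the route shifts via Singapore.
degraded_links = dict(links)
degraded_links[("hong-kong", "los-angeles")] = 400
degraded = best_path(degraded_links, "guangzhou", "los-angeles")
```

Because the costs are recomputed from current measurements, the same search yields a different path as conditions change, which is the behavior the text describes.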

Controller node: Relays probe the quality of the links between them and report the results to the Controller.

If two Relays exchange traffic directly (with no forwarding in between), the traffic packets and their Ack packets are used for link quality statistics. For Relay-to-Relay links with no direct traffic, WE-CAN constructs artificial probe traffic following a specific pattern and measures link quality from it. The Controller determines, for any pair of Relays, whether link quality probing and statistics are required. Some links are not probed: for example, traffic between China Telecom and China Unicom nodes must always be relayed, so probing the link between them is meaningless. The Controller therefore computes each Relay's networking and probing policy from the preset configuration and sends it to the Relay.
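The probing decision described above can be sketched as a small policy function. The three policy labels (`skip`, `passive`, `active`), the data shapes and the relay names below are assumptions for illustration, not WE-CAN's configuration format.

```python
from itertools import combinations


def probing_policy(relay_carriers, pairs_with_traffic):
    """Decide how each Relay pair is measured.

    relay_carriers: {relay_name: set of carriers the relay is connected to}
    pairs_with_traffic: set of frozenset({a, b}) pairs that exchange traffic directly
    """
    policy = {}
    for a, b in combinations(sorted(relay_carriers), 2):
        if not (relay_carriers[a] & relay_carriers[b]):
            # No shared carrier: traffic must always be relayed, so probing is pointless.
            policy[(a, b)] = "skip"
        elif frozenset((a, b)) in pairs_with_traffic:
            # Real traffic and its Acks already provide the statistics.
            policy[(a, b)] = "passive"
        else:
            # No direct traffic: generate artificial probe traffic.
            policy[(a, b)] = "active"
    return policy


relay_carriers = {
    "r-telecom": {"telecom"},
    "r-unicom": {"unicom"},
    "r-three-line": {"telecom", "unicom", "mobile"},
}
pairs_with_traffic = {frozenset(("r-telecom", "r-three-line"))}
policy = probing_policy(relay_carriers, pairs_with_traffic)
```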

Quality optimization

WE-CAN improves in-network transmission quality through real-time intelligent routing between transit nodes.

WE-CAN transit nodes periodically probe the quality of the links between one another and report the results to the control node. After receiving quality information for the whole network, the control node computes the best routes for the whole network and delivers them to each transit node. When the direct connection between two A-Nodes is poor, the traffic between them is forwarded through transit nodes. The "route" determines which transit nodes are used and in what order they are chained.

Because each transit node has limited carrying capacity, routing must avoid overloading transit nodes. When a transit node fails or its network jitter worsens, its traffic should be spread evenly across other transit nodes, so that none of them is overwhelmed and triggers a network-wide avalanche. Moreover, the control node collects quality information and computes and delivers routes periodically, so when jitter occurs at a transit node, route switching and avoidance lag behind. WE-CAN transit nodes therefore also modify their routing tables dynamically according to their current probe results, so as to respond quickly to network congestion and node failures. In short, WE-CAN improves transmission quality and reduces packet loss and delay through close cooperation among all nodes: the control node periodically updates and delivers routing tables, and access and transit nodes correct them in real time.
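The local correction step can be sketched as follows: a node holds a list of candidate next hops delivered by the control node, then filters and ranks them using its own current probe results and load figures. The function name, the 5% loss limit and all values are hypothetical.

```python
def pick_next_hop(candidates, local_loss, load, capacity, max_loss=0.05):
    """Choose a next hop from the Controller's candidates using local measurements.

    candidates: next hops from the routing table delivered by the control node
    local_loss: {node: packet loss seen by this node's own probes}
    load, capacity: {node: current traffic}, {node: maximum traffic}
    Returns None when no candidate is both healthy and below capacity.
    """
    usable = [
        n for n in candidates
        if local_loss.get(n, 1.0) <= max_loss      # unknown loss counts as unhealthy
        and load.get(n, 0) < capacity.get(n, 0)    # never push a node over capacity
    ]
    if not usable:
        return None
    # Prefer the least utilised node so that diverted traffic spreads out evenly.
    return min(usable, key=lambda n: load.get(n, 0) / capacity[n])


candidates = ["hong-kong", "singapore", "tokyo"]
local_loss = {"hong-kong": 0.30, "singapore": 0.01, "tokyo": 0.02}  # Hong Kong is jittery
load = {"hong-kong": 100, "singapore": 700, "tokyo": 300}
capacity = {"hong-kong": 1000, "singapore": 1000, "tokyo": 1000}
choice = pick_next_hop(candidates, local_loss, load, capacity)
```

Ranking by utilisation rather than always taking the first healthy candidate is one simple way to avoid piling a failed node's traffic onto a single neighbor.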

WE-CAN also performs packet-level ARQ (retransmission on timeout) and FEC (recovery of lost packets) between nodes to improve transmission quality. These ARQ and FEC policies are transparent to the transmitted content. Because they consume additional bandwidth, whether they are enabled, and to what degree, is optional. WE-CAN can also send data redundantly over multiple paths between two points, improving transmission quality at the cost of a multiple of the bandwidth. This strategy is also optional, and WE-CAN can ensure that the redundant paths do not overlap.
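To make the FEC idea concrete, here is the simplest possible scheme: one XOR parity packet per group, which can rebuild any single lost packet. The source does not say which FEC code WE-CAN uses, so this is only an illustration of the principle, and it assumes all packets in a group have the same length.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of equal-length packets."""
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("this sketch assumes equal-length packets")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single lost packet (marked None) from the others plus parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can repair exactly one loss per group")
    rebuilt = bytearray(parity)
    for packet in received:
        if packet is not None:
            for i, byte in enumerate(packet):
                rebuilt[i] ^= byte
    return missing[0], bytes(rebuilt)


group = [b"pkt0", b"pkt1", b"pkt2"]
parity = xor_parity(group)               # sent alongside the group: 1/3 extra bandwidth
received = [b"pkt0", None, b"pkt2"]      # the middle packet was lost in transit
index, payload = recover_missing(received, parity)
```

The recovery needs no round trip to the sender, which is why FEC reduces delay compared with ARQ, at the price of the extra parity bandwidth mentioned above.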

Cost optimization

WE-CAN reduces in-network transmission cost through public network transmission and by pushing edge nodes down closer to users.

WE-CAN does not rely on dedicated lines to guarantee transmission quality between nodes. Instead, it uses ordinary public network transmission with intelligent real-time routing, which reduces bandwidth cost. WE-CAN also avoids expensive BGP nodes and three-line (multi-carrier) nodes for access: A-Nodes are usually deployed only in single-line (single-carrier) data centers, while transit nodes are mainly deployed in three-line data centers, which greatly reduces bandwidth cost. In addition, route calculation takes the historical peak bandwidth of transit nodes into account, so that different transit nodes do not all reach a high peak within the same billing cycle (month), and tries to distribute bandwidth steadily across transit nodes without affecting route quality.

WE-CAN supports delivering a single message to multiple destinations. The destination A-Nodes and the transit nodes on the path of a multicast message are organized into a tree-like cascade that reduces traffic through path reuse, which works well in large-channel or low-delay live streaming scenarios.
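The saving from path reuse can be estimated by counting link transmissions. In the sketch below, sending one copy per destination costs the sum of all hop counts, while a cascade tree sends each message once per distinct link. The node names are invented.

```python
def link_transmissions(paths):
    """Compare per-destination unicast with a tree that reuses shared links.

    paths: one node list per destination, each starting at the same source.
    Returns (unicast_count, tree_count) for delivering a single message.
    """
    unicast = sum(len(path) - 1 for path in paths)
    tree_links = {(a, b) for path in paths for a, b in zip(path, path[1:])}
    return unicast, len(tree_links)


paths = [
    ["src", "relay-hk", "relay-sg", "edge-sg-1"],
    ["src", "relay-hk", "relay-sg", "edge-sg-2"],
    ["src", "relay-hk", "edge-hk-1"],
]
unicast, tree = link_transmissions(paths)
```

In this example the shared src to relay-hk and relay-hk to relay-sg links carry the message once instead of two or three times, and the gap widens as more destinations hang off the same transit nodes.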

Edge access

Access nodes

From the perspective of in-network transmission, a WE-CAN A-Node is a single service called Edge, but it is actually an Edge Cluster composed of a group of services:

  • Gateway: Responsible for receiving and caching data, and for possible future protocol conversion, so that downstream services can be hot-swapped and upgraded without risk of data loss;
  • Broker: Responsible for Topic management, encapsulation of the MessageBus reliable transmission protocol, and so on;
  • Driver: Responsible for data processing, including unpacking, packet assembly, retransmission, ordering and Session management;
  • Edge: Responsible for managing the status of other Edge and Relay nodes;
  • Monitor: Monitors the entire Edge Cluster, including health and load status. It also evaluates and reports the quality of the links between nodes in the network;
  • ENS-Registrar: Responsible for registering and querying Topics. ENS (Edge Name Service) is part of the Edge Cluster, while the Registrar is a centralized service.

Control platform

The WE-CAN Control Platform (Dashboard) consists of a Web page, a configuration database and a set of APIs, which are responsible for the configuration, monitoring and management of the various WE-CAN resources.

Unified scheduling

The unified scheduling system obtains the static configuration of each Edge node from the configuration database of the control platform. Combining this with the dynamic load information that each Edge reports through MessageBus, it schedules and allocates Edge nodes according to preset rules and feedback from historical data.

The long-term goal of the unified scheduling system is to manage and allocate all of Yunxin's resources according to common rules.

Quality optimization

WE-CAN improves edge access quality through real-time intelligent scheduling of A-Nodes, and deploys enough A-Nodes globally for the scheduling system to allocate one near each user.

A-Nodes report their real-time load and aggregation information to the scheduling system. For each access request, the scheduling system assigns the optimal A-Node of the same carrier to guarantee access quality. The optimal A-Node is generally located in the same province or country as the user, though it is not necessarily the closest in a straight line.

When selecting the optimal node, the scheduling system takes the aggregation information into account. For example, users in the same channel are allocated to the same server where possible, provided that the server is "close" to each of them.

The scheduling system assigns BGP nodes to users of small carriers and to overseas users, for whom no single-line node of the same carrier exists. It also revises allocation results based on historical data, using positive feedback correction, negative feedback correction and static lookup-table correction. After correction, users of small carriers can also be assigned to single-line nodes, avoiding BGP nodes that are expensive yet give unsatisfactory results.
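The allocation rules in this section can be sketched as one selection function: prefer a same-carrier node, prefer the user's own region, prefer a node that already serves the user's channel, and fall back to BGP when no same-carrier node exists. Everything here is hypothetical (the `ANode` fields, the `overrides` table standing in for static lookup-table correction, the 0.8 load limit), and the feedback corrections based on historical data are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ANode:
    name: str
    carrier: str   # e.g. "telecom"; "bgp" marks a multi-carrier BGP node
    region: str
    load: float    # 0.0 (idle) to 1.0 (full), as reported by the node


def assign_node(nodes, user_carrier, user_region,
                channel_nodes=frozenset(), overrides=None, max_load=0.8):
    """Pick an A-Node for one access request, or None if nothing has headroom."""
    overrides = overrides or {}
    # Static lookup-table correction: map (carrier, region) to a better carrier.
    carrier = overrides.get((user_carrier, user_region), user_carrier)
    pool = [n for n in nodes if n.carrier == carrier and n.load < max_load]
    if not pool:
        # No single-line node of this carrier: fall back to BGP nodes.
        pool = [n for n in nodes if n.carrier == "bgp" and n.load < max_load]
    if not pool:
        return None
    # Same region first, then nodes already serving the channel, then lowest load.
    return min(pool, key=lambda n: (n.region != user_region,
                                    n.name not in channel_nodes,
                                    n.load))


nodes = [
    ANode("js-telecom-1", "telecom", "jiangsu", 0.5),
    ANode("js-telecom-2", "telecom", "jiangsu", 0.2),
    ANode("zj-telecom-1", "telecom", "zhejiang", 0.1),
    ANode("sh-bgp-1", "bgp", "shanghai", 0.3),
]
aggregated = assign_node(nodes, "telecom", "jiangsu", channel_nodes={"js-telecom-1"})
fallback = assign_node(nodes, "small-isp", "jiangsu")
corrected = assign_node(nodes, "small-isp", "jiangsu",
                        overrides={("small-isp", "jiangsu"): "telecom"})
```

The first call lands on the busier `js-telecom-1` because the channel is already there, which is the aggregation effect described above.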

Cost optimization

WE-CAN reduces access cost through aggregation of allocation results and tiered selection of A-Nodes.

Aggregating allocations not only improves transmission quality (by avoiding transmission between nodes) but also reduces transmission cost. The access node that is nearest to, or best for, a user may also be more expensive, so WE-CAN selectively allocates lower-cost nodes depending on the business type. In low-latency live streaming, for example, giving up a certain amount of delay makes allocations more aggregated and allows lower-cost nodes to be chosen.
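The trade-off between delay and cost can be sketched as choosing the cheapest node that still fits a per-business delay budget. The budgets, node names, delays and unit costs below are made-up numbers, and the function is an illustration rather than WE-CAN's actual selection logic.

```python
def cheapest_within_budget(candidates, delay_budget_ms):
    """Pick the lowest-cost node whose access delay fits the business's budget.

    candidates: list of (name, access_delay_ms, unit_cost) tuples.
    Returns the node name, or None if no node meets the budget.
    """
    eligible = [c for c in candidates if c[1] <= delay_budget_ms]
    if not eligible:
        return None
    # Lowest cost wins; delay only breaks ties between equally priced nodes.
    return min(eligible, key=lambda c: (c[2], c[1]))[0]


candidates = [
    ("edge-near", 20, 1.0),
    ("edge-mid", 45, 0.6),
    ("edge-far", 120, 0.3),
]
rtc_choice = cheapest_within_budget(candidates, delay_budget_ms=30)    # interactive call
live_choice = cheapest_within_budget(candidates, delay_budget_ms=150)  # low-latency live
```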

Layered decoupling

The WE-CAN transmission protocol is fully decoupled into layers. First, it is decoupled from the business, so it can support all kinds of data and transmission modes. Second, the functions and service guarantees that WE-CAN provides are implemented in three layers, each handled by a different service using its own protocol header:

  • Application layer: At present, the application layer provides the MessageBus protocol encapsulation, including mechanisms such as Topic subscription and consumption. It will be extended for other business scenarios in the future.
  • Transport layer: The transport layer is responsible for Session management, packet ordering, retransmission, fragmentation, reassembly and so on. WE-CAN's transport layer implements a reliable transmission mechanism on top of UDP.
  • Network layer: The network layer is responsible for data routing, traffic scheduling and congestion control. It also applies hop-by-hop redundancy policies such as ARQ and FEC between forwarding nodes to improve the arrival rate and reduce delay.

Because WE-CAN abstracts and encapsulates the protocol of each layer, the layers work independently without affecting one another, which improves system stability, speeds up feature iteration and lowers development difficulty. Thorough layering also lets WE-CAN offer more flexible and diverse tiered services. For example, besides the software-defined intelligent routing network, the network layer can provide a dedicated-line service and even switch flexibly between dedicated lines and the public network. Likewise, the transport layer can provide multiple packet retransmission and data redundancy policies.
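The benefit of giving each layer its own header can be sketched with nested encapsulation. The header layouts below (field names, sizes, byte order) are entirely hypothetical and are not WE-CAN's wire format. The point is only that a forwarding node reads and rewrites the network header while treating everything inside it as opaque bytes.

```python
import struct

NET_HDR = struct.Struct("!HHB")    # hypothetical: source node, destination node, TTL
TRANS_HDR = struct.Struct("!II")   # hypothetical: session id, sequence number


def wrap_app(topic: str, body: bytes) -> bytes:
    """Application layer: prefix the body with a length-tagged Topic name."""
    name = topic.encode("utf-8")
    return struct.pack("!B", len(name)) + name + body


def wrap_transport(session_id: int, seq: int, app_pdu: bytes) -> bytes:
    """Transport layer: add session and sequence information."""
    return TRANS_HDR.pack(session_id, seq) + app_pdu


def wrap_network(src: int, dst: int, ttl: int, transport_pdu: bytes) -> bytes:
    """Network layer: add routing information."""
    return NET_HDR.pack(src, dst, ttl) + transport_pdu


def forward(packet: bytes):
    """A forwarding node touches only the network header; the rest stays opaque."""
    src, dst, ttl = NET_HDR.unpack_from(packet)
    if ttl <= 1:
        return None   # drop instead of forwarding forever
    return NET_HDR.pack(src, dst, ttl - 1) + packet[NET_HDR.size:]


packet = wrap_network(1, 2, 8, wrap_transport(42, 7, wrap_app("im", b"hello")))
hop = forward(packet)
```

Because `forward` never parses the transport or application bytes, those layers can change their own headers without any change to forwarding nodes.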

Conclusion

Relying on the advanced design concepts of "layered decoupling, tiered service and path reuse", WE-CAN has become the industry's first transport foundation that is independent of business logic. NetEase Yunxin's ambitions go further, however. As an expert in integrated communication cloud services, NetEase Yunxin not only keeps refining its audio, video and instant messaging capabilities, but is also committed to becoming a leader in global intelligent routing networks. Striving for excellence in every direction, NetEase Yunxin will continue to empower customers through technological innovation to achieve endogenous growth and value.
