Background

Real-time live broadcasting became a standard technology for Internet applications after last year's "war of a thousand broadcast platforms," but running a live platform has always been expensive. Besides poaching anchors and Internet celebrities, the bandwidth bill behind each platform is its single biggest cost.

At present, live broadcast transmission is split into two parts: a CDN, which distributes the streaming media, and a Lianmai (co-anchoring) system, which handles real-time interactive communication among multiple anchors. We have long believed that CDN-plus-Lianmai live streaming is inherently a high-cost, high-consumption technology, and the losses of the major live platforms bear this out. Besides bandwidth cost, latency is the other major drawback of current live streaming technology, and we realized long ago that traditional live streaming cannot support large-scale interactive live classes for online education. Therefore, in the second half of 2016, Xuebajun began developing an interactive live streaming system based on UDP and P2P technology. The design goals of the system were:

  • End-to-end delay within seconds.
  • Save as much distribution bandwidth as possible without compromising video quality.

The whole P2P-based distribution architecture has been tested and tuned for nine months on a live platform with 100,000+ concurrent users, and the design goals have been preliminarily met. So how is the system designed? What techniques were used to reach these goals? Let me walk through the architectural design and technical details.

P2P distribution network architecture

In the transmission and distribution network, we merge the Lianmai system and the distribution system, treating distributed stream pushing plus edge-node distribution as a single transmission system. Minimal Lianmai delay is achieved through P2P communication and routing between servers. The architecture is shown below:

The whole transmission and distribution network is divided into three parts: stream pushing, inter-server P2P, and client-node P2P. The transport network has a system anchor, a fixed design assumption: no packet loss or delay occurs when the broadcaster pushes the stream to an Edge server, and each Edge server quickly redistributes the received stream data to the other Edge servers over inter-server P2P, again without delay or packet loss. Why is such an anchor needed? The P2P network among client nodes must guarantee smoothness and minimal delay, which requires every Edge server to hold complete data within the shortest possible time. We will return to why this matters when we discuss stream compensation.

I will go through the technical details of the entire stream-data transfer process, but first I need to address media data sharding, since every transmission step is designed around segments.

Media data sharding

Media data sharding is the most basic part of the whole distribution and transmission system. In designing the sharding we mainly weighed delay against overhead. If fragments are too large, transmission delay grows, as with HLS; if fragments are too fine, the network is flooded with feedback packets, and the extra overhead in a P2P network becomes a problem. In the end, drawing on RTP and FLV, we shard the data by frame, which has the following advantages:

  • Frame-level fragmentation keeps the delay granularity small, so per-frame transmission delay can be optimized.
  • It is simple to implement and matches how codecs produce data.
  • Fragments combine flexibly and can be assembled seamlessly into the playback buffer.

Each fragment is called a segment and is identified by an auto-incrementing 32-bit ID, which is used to verify data integrity during transmission.
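
To make the segment design concrete, here is a minimal sketch in Go of a frame-level segment and the gap check that the auto-incrementing ID enables. Only the 32-bit ID comes from the article; the other fields and the helper are illustrative assumptions, not the system's actual wire format.

```go
package main

import "fmt"

// Segment is one frame-level fragment of the media stream. Only the
// auto-incrementing 32-bit ID is specified by the article; the other
// fields are illustrative.
type Segment struct {
	ID       uint32 // auto-incrementing segment ID
	KeyFrame bool   // whether this fragment starts a GOP
	Payload  []byte // encoded frame data
}

// missingIDs finds gaps in a received ID sequence, which is how the
// ID stream lets a node detect lost segments.
func missingIDs(received []uint32) []uint32 {
	if len(received) == 0 {
		return nil
	}
	seen := make(map[uint32]bool, len(received))
	lo, hi := received[0], received[0]
	for _, id := range received {
		seen[id] = true
		if id < lo {
			lo = id
		}
		if id > hi {
			hi = id
		}
	}
	var missing []uint32
	for id := lo; id <= hi; id++ {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	fmt.Println(missingIDs([]uint32{7, 8, 10, 12})) // [9 11]
}
```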

Stream pushing and Lianmai

Once the media fragmentation is settled, streaming can begin. We merge the pushing and distribution paths, so users push their stream segments to the nearest Edge server rather than to a dedicated Lianmai system. Stream transmission uses our RUDP transport, whose congestion control is similar to BBR, driven by delay and packet loss, and which discards packets under congestion, as shown in the diagram below:

For details on RUDP, see my article "How to Make Unreliable UDP Reliable?". As for why we did not push the stream over RTP or RTMP/TCP: RTP is UDP-based, but reliability must be layered on via RTCP and NACK, a congestion algorithm must still be designed, RTP itself must be modified and extended, and we would remain constrained by the RTP protocol, so we did not adopt RTP directly. RTMP/TCP is simple to implement, but was rejected at the design stage because of its high latency and reconnection behavior in weak network environments.
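
As a rough illustration of delay- and loss-driven congestion control in the spirit of the BBR-like scheme described above, here is a toy pacing-rate controller. All thresholds and gains are invented for the sketch and are not the system's actual algorithm.

```go
package main

import "fmt"

// pacingRate adjusts the send rate from measured RTT and loss,
// backing off under congestion and probing when the path is clean.
// Thresholds and gains are illustrative assumptions.
func pacingRate(currentKbps, rttMs, baseRttMs, lossRate float64) float64 {
	switch {
	case lossRate > 0.10 || rttMs > baseRttMs*2:
		return currentKbps * 0.85 // congested: back off and let queues drain
	case rttMs < baseRttMs*1.25 && lossRate < 0.02:
		return currentKbps * 1.05 // path is clean: probe for more bandwidth
	default:
		return currentKbps // hold steady
	}
}

func main() {
	rate := 800.0 // kbps, e.g. a 480P/800kbps stream
	rate = pacingRate(rate, 45, 40, 0.0)
	fmt.Printf("new rate: %.0f kbps\n", rate)
}
```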

P2P between Edge servers

Because the whole transmission and distribution network is distributed across multiple Edge servers, media fragments arriving at one Edge server must, per the system anchor, reach the other Edge servers as quickly as possible. Initially we forwarded through a BGP server, which consumed a great deal of BGP bandwidth; worse, if the BGP server failed, communication between the Edge servers was cut off. In practice, the latency between Edge servers across carriers is usually not as large as feared, so point-to-point communication between Edge servers is a viable alternative. We therefore designed a windowless, multipath RUDP transmission model for inter-server communication, shown in the following figure:

The model above is a multipath parallel communication model. Before RUDP sends, we consult a path routing table that records a distribution probability for each path; each time RUDP sends a packet to the receiver, it selects a path according to these probabilities. Determining the routing-table probabilities is therefore crucial. We obtain the network state (packet loss, delay, and jitter) from RUDP's real-time ACK feedback and periodic path pings, then feed these state parameters into an approximation function F(x) to determine each route's probability. The principle: if the direct connection between two Edge servers has sufficiently low delay and packet loss, the direct route's probability approaches 100%; if a route periodically disconnects or its delay exceeds 200 ms, its probability approaches 0. The following diagram illustrates the route-probability evaluation process:
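
To complement the diagram, here is a sketch of how such a probability routing table might be maintained and consulted. The scoring inside updateProbs is a stand-in for the approximation function F(x), whose actual form the article does not disclose.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Path is one candidate route between two Edge servers, with network
// state measured via ACK feedback and periodic pings.
type Path struct {
	Name  string
	RttMs float64
	Loss  float64 // packet loss rate, 0..1
	Prob  float64 // normalized send probability
}

// updateProbs stands in for F(x): good paths (low delay/loss) approach
// probability 1, paths over 200 ms or with heavy loss approach 0.
// The exact scoring is an assumption for illustration.
func updateProbs(paths []Path) {
	total := 0.0
	for i := range paths {
		score := math.Max(0, 1-paths[i].RttMs/200) * math.Pow(1-paths[i].Loss, 4)
		paths[i].Prob = score
		total += score
	}
	for i := range paths {
		if total > 0 {
			paths[i].Prob /= total
		}
	}
}

// pickPath selects a route for the next packet according to the
// probabilities in the routing table.
func pickPath(paths []Path) *Path {
	r := rand.Float64()
	acc := 0.0
	for i := range paths {
		acc += paths[i].Prob
		if r <= acc {
			return &paths[i]
		}
	}
	return &paths[len(paths)-1]
}

func main() {
	paths := []Path{
		{Name: "direct", RttMs: 30, Loss: 0.01},
		{Name: "via-relay", RttMs: 120, Loss: 0.05},
	}
	updateProbs(paths)
	fmt.Println(pickPath(paths).Name)
}
```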

P2P network construction process

Media stream data reaches every Edge server through the inter-server P2P multipath network, and each Edge server must then distribute the fragments to its client nodes. Lianmai (co-anchoring) nodes get special transmission handling to minimize delay, via a process similar to an ordinary RTC communication model, which I will not describe here. Viewer-side distribution uses a self-organizing P2P network among the client nodes. Since delivery is via P2P, a P2P network must be built within each group of client nodes. How? In three steps: connection, evaluation, and stratification.

Connection

Client applications mostly run behind routers or NATs, so peers must traverse each other's NATs and firewalls to establish connections. Although many NAT traversal methods exist, such as STUN and ICE, the traversal success rate is always a concern: if it is too low, many high-quality node resources go unused. In designing our traversal scheme we put connectivity rate first and, by modifying the STUN protocol, built a traversal mechanism based on multiple port guesses and attempts. First, STUN-style probing determines the NAT type, the NAT's port-allocation pattern, and whether the NAT has a blacklist mechanism; this information is stored on the Edge server governing the node. When two peer nodes attempt traversal, they exchange this information, and different combinations yield different traversal strategies. The process and result of every traversal attempt are recorded in our backend database, which we periodically analyze to adjust the negotiated traversal strategies. The diagram below:
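
As a simplified illustration of the "multiple port guesses and attempts" idea, here is a sketch of port prediction from an observed NAT allocation pattern. Real traversal also negotiates timing through the Edge server and adapts the strategy per NAT type; everything here is an assumption for illustration.

```go
package main

import "fmt"

// guessPorts predicts a peer's next NAT ports from the last port
// observed by a STUN-like probe and the NAT's port-allocation step,
// producing candidates to probe with UDP packets.
func guessPorts(lastPort, step, count int) []int {
	ports := make([]int, 0, count)
	for i := 1; i <= count; i++ {
		p := lastPort + i*step
		if p > 65535 {
			break
		}
		ports = append(ports, p)
	}
	return ports
}

func main() {
	// e.g. a symmetric NAT observed to allocate ports in steps of 2
	fmt.Println(guessPorts(52000, 2, 5)) // [52002 52004 52006 52008 52010]
}
```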

After traversal, nodes perform a connection handshake and certificate-based identity authentication (more on why certificates are needed later), establishing communication and a neighbor relationship. So how do neighbor relationships change dynamically?

Neighbor relationships and evaluation

Neighbor problem

Once connected, nodes become neighbors and exchange state and heartbeats. Then a problem arises: a live broadcast can involve thousands of nodes, and if they were all pairwise connected, heartbeat traffic alone could saturate a client node's upload bandwidth. We therefore designed an LRU-style eviction list of neighbors: each node keeps in contact with 40 neighbors, old nodes are evicted and new nodes admitted, with additions and evictions driven by the communication state between each neighbor and the node itself (see the sketch after the list below). The principles are:

  • Nearest first: intranet peers take priority, then same-city, same-carrier peers.
  • Periodically evaluate delay and media-fragment hit rate; the worst performer is evicted.
  • When the LRU list holds fewer than 40 nodes, new nodes are drawn from a standby list and added.
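
Here is the promised sketch of the 40-neighbor list. The article calls it an LRU list but evicts by communication state rather than pure recency, so this sketch evicts by a periodic score; the scoring formula is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

const maxNeighbors = 40

// Neighbor holds the state the article says is periodically evaluated.
type Neighbor struct {
	ID      string
	RttMs   float64
	HitRate float64 // media-fragment request hit rate, 0..1
}

// score is an illustrative ranking: high hit rate good, high RTT bad.
func score(n Neighbor) float64 { return n.HitRate*100 - n.RttMs/10 }

// NeighborList keeps at most 40 live neighbors, evicting the worst
// performer when a better candidate arrives from the standby list.
type NeighborList struct {
	live []Neighbor
}

func (l *NeighborList) Add(n Neighbor) {
	l.live = append(l.live, n)
	sort.Slice(l.live, func(i, j int) bool { return score(l.live[i]) > score(l.live[j]) })
	if len(l.live) > maxNeighbors {
		l.live = l.live[:maxNeighbors] // evict the worst neighbor
	}
}

func main() {
	var l NeighborList
	l.Add(Neighbor{ID: "lan-peer", RttMs: 5, HitRate: 0.9})
	l.Add(Neighbor{ID: "remote-peer", RttMs: 180, HitRate: 0.4})
	fmt.Println(len(l.live), l.live[0].ID)
}
```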

Node evaluation

Each client node differs in computing power, communication capability, and network position, so every node must be evaluated. The evaluation has two parts: the neighbors' evaluation of the node, and the node's evaluation of itself. Neighbor evaluation mainly covers:

  • RTT
  • Packet loss rate
  • Request hit ratio

From these three parameters, each node computes an affinity score for every neighbor, which is used later when choosing where to distribute.

Self-evaluation mainly covers:

  • CPU and memory
  • Network type: WiFi / 4G / wired
  • Upload bandwidth

The node computes both sets of parameters periodically and feeds them through a network QOS convergence function F(x) to derive its node capability and the QOS policy toward each neighbor.
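
A sketch of both evaluations feeding a stand-in for the QOS convergence function F(x); all weights and clamps are illustrative assumptions, not the system's values.

```go
package main

import (
	"fmt"
	"math"
)

// NeighborStats are the three per-neighbor metrics from the article.
type NeighborStats struct {
	RttMs   float64
	Loss    float64 // packet loss rate, 0..1
	HitRate float64 // request hit ratio, 0..1
}

// affinity computes the score used later for distribution selection.
func affinity(s NeighborStats) float64 {
	return 0.5*s.HitRate + 0.3*(1-s.Loss) + 0.2*(1-math.Min(s.RttMs, 500)/500)
}

// SelfStats are the node's own capability inputs.
type SelfStats struct {
	CPUFree    float64 // idle CPU fraction, 0..1
	MemFreeMB  float64
	Wired      bool // wired or strong WiFi, vs 4G/weak WiFi
	UploadKbps float64
}

// capability stands in for the QOS convergence function F(x).
func capability(s SelfStats) float64 {
	c := 0.4*s.CPUFree +
		0.2*math.Min(s.MemFreeMB, 1024)/1024 +
		0.4*math.Min(s.UploadKbps, 4000)/4000
	if !s.Wired {
		c *= 0.5 // penalize mobile or weak wireless uplinks
	}
	return c
}

func main() {
	fmt.Printf("affinity=%.2f capability=%.2f\n",
		affinity(NeighborStats{RttMs: 40, Loss: 0.01, HitRate: 0.85}),
		capability(SelfStats{CPUFree: 0.6, MemFreeMB: 512, Wired: true, UploadKbps: 2000}))
}
```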

Node stratification

The ultimate goal of node evaluation is to let capable nodes become super nodes and share the Edge Server's distribution load. So what does it take for a node to become a super node? The conditions are:

  • Enough upload bandwidth: nodes on 4G or weak WiFi cannot become super nodes.
  • Spare CPU and memory: low-end mobile devices that lack computing power cannot become super nodes.
  • Friendly to its neighbors, not isolated in the network.
  • Appointed by the Edge Server, with good communication to it.

What if a super node's performance degrades? It is demoted to an ordinary node, because node evaluation runs periodically in real time; if a node's performance drops, the Edge Server demotes it.

Once appointed, how does a super node work? Each super node is assigned a group ID at appointment; the Edge Server groups the super nodes under its jurisdiction, and each group, made up of several super nodes, is responsible for distributing its own share of the media fragments. For example, with 5 super-node groups, the first group distributes segments numbered 1, 6, 11, 16, and so on, while the second group distributes segments 2, 7, 12, 17, and so on. This guards against the failure of any single super node and makes P2P distribution more stable. The schematic diagram is as follows:
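
The grouping rule in the example reduces to modular arithmetic over segment IDs; a minimal sketch:

```go
package main

import "fmt"

// groupFor maps a segment ID to the super-node group responsible for
// distributing it, matching the article's example: with 5 groups,
// group 0 handles segments 1, 6, 11, 16, ... and group 1 handles
// segments 2, 7, 12, 17, ...
func groupFor(segmentID, numGroups uint32) uint32 {
	return (segmentID - 1) % numGroups
}

func main() {
	for _, id := range []uint32{1, 2, 6, 7, 11, 17} {
		fmt.Printf("segment %d -> group %d\n", id, groupFor(id, 5))
	}
}
```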

P2P streaming media distribution process

From the construction process above, the whole P2P network is in effect a layered, directed-graph distribution network. So how is stream data actually distributed? Again in three steps: push first, then pull, then compensate. The details follow.

Push

As mentioned when introducing super nodes, the Edge Server pushes each segment to the super-node group matching its segment ID. What does a super node do with the segments it receives? By the P2P design, it should forward them to the super nodes of other groups and to ordinary nodes. But if everything were pushed blindly this way, the network would carry redundant copies and waste bandwidth.

To solve this, we designed a pre-subscription mechanism: each P2P client node subscribes ahead of time, starting from the largest segment ID in its buffer, for the media fragments expected about 10 seconds later, directing each subscription request according to the affinity scores from node evaluation. A super node receiving these requests stores the pre-subscription information, and when the Edge Server pushes a matching fragment to it, it unconditionally forwards the fragment to the pre-subscribed nodes, as shown in the figure below:

The figure above illustrates the following principles (a sketch of the pre-subscription bookkeeping follows the list):

  • The path from the Edge Server to any node is at most two hops, which bounds link latency.
  • Super nodes in different groups subscribe to each other's segments.
  • Ordinary nodes subscribe only to super nodes.
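
A minimal sketch of the pre-subscription bookkeeping on a super node. A real implementation would also weight subscriptions by affinity score and expire stale entries; the types and names here are assumptions.

```go
package main

import "fmt"

// SubTable records, per future segment ID, which nodes pre-subscribed.
type SubTable struct {
	subs map[uint32][]string // segment ID -> subscriber node IDs
}

func NewSubTable() *SubTable { return &SubTable{subs: make(map[uint32][]string)} }

// Subscribe registers a node's interest in a segment expected later
// (the article says roughly 10 seconds ahead of the buffer head).
func (t *SubTable) Subscribe(segID uint32, nodeID string) {
	t.subs[segID] = append(t.subs[segID], nodeID)
}

// OnSegment is called when the Edge Server pushes a segment to this
// super node: forward it unconditionally to every pre-subscribed
// node, then drop the subscription state.
func (t *SubTable) OnSegment(segID uint32, send func(nodeID string)) {
	for _, n := range t.subs[segID] {
		send(n)
	}
	delete(t.subs, segID)
}

func main() {
	t := NewSubTable()
	t.Subscribe(101, "nodeA")
	t.Subscribe(101, "nodeB")
	t.OnSegment(101, func(n string) { fmt.Println("forward segment 101 to", n) })
}
```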

Pull

Pre-subscription pushes segments to every client node, but the network loses packets, and a super node may quit midway, either of which leaves the end node missing packets. What then? We designed a mechanism for pulling missing fragments from neighbors. The general process:

  1. Each node periodically collects which fragments it has lost and received, and exchanges this buffer information with its neighbors via a gossip protocol.
  2. On receiving a neighbor's gossip message, a node records the peer's fragment information locally.
  3. The node matches its missing fragments against the neighbors' fragment information, selects a neighbor at random weighted by affinity score, and sends it a pull request.
  4. A node receiving a pull request sends the requested fragment back to the requester.

The whole procedure retries periodically several times, as shown in the following diagram:
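
To make the pull step concrete, here is a sketch of choosing pull targets from gossiped buffer maps. The article selects randomly with an affinity bias; for brevity this sketch reduces the bias to "highest affinity wins," which is an assumption.

```go
package main

import "fmt"

// pickPullTargets chooses, for each missing segment, a neighbor that
// advertised it in a gossip exchange, preferring high-affinity peers.
func pickPullTargets(missing []uint32, neighborBuf map[string]map[uint32]bool,
	aff map[string]float64) map[uint32]string {
	targets := make(map[uint32]string)
	for _, seg := range missing {
		best, bestAff := "", -1.0
		for nbr, buf := range neighborBuf {
			if buf[seg] && aff[nbr] > bestAff {
				best, bestAff = nbr, aff[nbr]
			}
		}
		if best != "" {
			targets[seg] = best // send a pull request to this neighbor
		}
	}
	return targets
}

func main() {
	neighborBuf := map[string]map[uint32]bool{
		"nodeA": {9: true, 11: true},
		"nodeB": {11: true},
	}
	aff := map[string]float64{"nodeA": 0.7, "nodeB": 0.9}
	fmt.Println(pickPullTargets([]uint32{9, 11}, neighborBuf, aff))
}
```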

It is worth noting that because buffer gossip is exchanged periodically, the smaller the gossip message, the better. We designed a Bloom-filter-like structure to describe the buffer state, which both shrinks the gossip message and makes comparison fast.
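
Here is a minimal Bloom-filter-style digest of buffered segment IDs, of the kind that could be gossiped cheaply. The bit-array size and hash count are illustrative, not the system's parameters.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// SegmentBloom is a compact, gossip-friendly digest of the segment
// IDs a node is buffering.
type SegmentBloom struct {
	bits [256]byte // 2048 bits
}

// positions derives three bit positions for an ID via salted FNV hashes.
func (b *SegmentBloom) positions(id uint32) [3]uint32 {
	var p [3]uint32
	for i := range p {
		h := fnv.New32a()
		h.Write([]byte{byte(i), byte(id), byte(id >> 8), byte(id >> 16), byte(id >> 24)})
		p[i] = h.Sum32() % 2048
	}
	return p
}

func (b *SegmentBloom) Add(id uint32) {
	for _, pos := range b.positions(id) {
		b.bits[pos/8] |= 1 << (pos % 8)
	}
}

// MayContain can return false positives but never false negatives,
// which is acceptable here: a false positive just costs one wasted
// pull request.
func (b *SegmentBloom) MayContain(id uint32) bool {
	for _, pos := range b.positions(id) {
		if b.bits[pos/8]&(1<<(pos%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var b SegmentBloom
	b.Add(42)
	fmt.Println(b.MayContain(42), b.MayContain(43))
}
```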

Compensation

Because P2P client nodes are unstable, a segment may still be missing after several pull attempts while its playback position draws near. In that case the node requests the Edge server directly to compensate by resending the fragment as fast as possible. This shields playback from P2P packet loss, and it is why every Edge Server must hold all the fragment data: the system anchor. The process is as follows:

Most of the time this works without issue. But if most client nodes are missing the same segments at the same time, a flood of compensation requests hits the Edge server, causing a network storm. We answered this with a scarcity-evaluation and denial-of-service mechanism: when too many compensation requests reach the Edge server within a unit of time, it rejects the requests beyond its capacity and resends only what it can handle. It also evaluates the scarcity of the requested fragments; if most nodes are missing a fragment, the Edge server proactively pushes that fragment again through the super-node groups.
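
A sketch of that guard on the Edge server, combining the per-interval capacity cap with scarcity detection; the capacity and threshold values are assumptions.

```go
package main

import "fmt"

// Compensator rate-limits compensation resends per unit time and
// flags segments that many nodes are missing as "scarce" so they can
// be re-pushed through the super-node groups.
type Compensator struct {
	capacity int            // resends allowed this interval
	used     int
	requests map[uint32]int // segment ID -> number of nodes asking
}

// Handle records a compensation request and says whether to resend.
func (c *Compensator) Handle(segID uint32) (resend bool) {
	c.requests[segID]++
	if c.used >= c.capacity {
		return false // beyond capacity: reject the request
	}
	c.used++
	return true
}

// ScarceSegments returns segments requested by at least `threshold`
// nodes; these get proactively re-pushed via super nodes.
func (c *Compensator) ScarceSegments(threshold int) []uint32 {
	var out []uint32
	for id, n := range c.requests {
		if n >= threshold {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	c := &Compensator{capacity: 2, requests: make(map[uint32]int)}
	for i := 0; i < 4; i++ {
		fmt.Println("resend:", c.Handle(77))
	}
	fmt.Println("scarce:", c.ScarceSegments(3))
}
```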

Buffer and delay control

Through the three stages above, every data segment can reach every client node, but a client node needs a buffer to coordinate the three stages with local playback. If the buffer is too long, it adds needless delay; if too short, playback stutters and the three stages cannot complete. We therefore designed a three-stage dynamic buffer, shown in the figure below:

Here is what the intervals above mean (a small sketch follows the list):

  • Push interval: since fragments are pushed from different super nodes, some jitter is inevitable, so the head of the buffer is a jitter buffer that extends to the fragment push position reported in the first neighbor's gossip message. This phase generally lasts 100 to 300 ms.
  • Pull interval: once the fragment sequence enters the pull interval, missing fragments are periodically identified and pulled from neighbors based on gossip. Up to three attempts are made, each waiting one RTT between the local node and the neighbor; after three failures, the fragment moves to compensation.
  • Compensation interval: once the fragment sequence enters the compensation interval, lost fragments are periodically identified and requested directly from the Edge server by ID. Up to four attempts are made, each waiting one RTT between the local node and the Edge Server; after four failures the node simply waits, either for neighbor gossip or for an active push from the Edge Server.
  • Expiration interval: fragments that have been played are moved into this interval rather than deleted immediately. Why? Playback positions differ across nodes, so a fragment already played locally may be exactly what another node is missing and trying to pull. We therefore keep played fragments for 3 seconds before deleting them.
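
Here is the promised sketch: classifying a segment into its buffer stage by its time offset from the playback position. The interval boundaries are assumptions consistent with the description above, not the system's exact values.

```go
package main

import (
	"fmt"
	"time"
)

// Stage is where a segment sits in the three-stage dynamic buffer.
type Stage int

const (
	PushStage       Stage = iota // absorbing push jitter (~100-300 ms)
	PullStage                    // gossip-driven neighbor pulls, 3 attempts
	CompensateStage              // direct Edge-server requests, 4 attempts
	ExpiredStage                 // played; kept ~3 s for neighbors to pull
)

// stageOf maps a segment's time until playback to its buffer stage.
func stageOf(untilPlay time.Duration) Stage {
	switch {
	case untilPlay < 0:
		return ExpiredStage
	case untilPlay < 400*time.Millisecond:
		return CompensateStage // close to playback: ask the Edge server
	case untilPlay < 900*time.Millisecond:
		return PullStage // pull missing pieces from neighbors
	default:
		return PushStage // still arriving via pre-subscribed pushes
	}
}

func main() {
	fmt.Println(stageOf(1200*time.Millisecond), stageOf(600*time.Millisecond),
		stageOf(100*time.Millisecond), stageOf(-1*time.Second))
}
```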

The second-open (instant playback) problem

The three distribution phases and buffer control above solve continuous stream distribution and playback-delay control, but any modern live streaming technology must also achieve "second-open": the user sees the anchor's video the instant they enter the live room. P2P distribution actually makes second-open easier than plain server/CDN forwarding. The usual approach is for a newly joined client node to ask the server's edge node for data starting at the most recent GOP key frame, then decode and play from that key frame at accelerated speed with zero wait. In our P2P network, a newly joined node instead receives the latest GOP's fragment ID from the Edge Server and quickly pulls the whole GOP's fragment data from several neighbors in parallel, rather than relying on the Edge Server alone to send it. This shortened second-open time by an average of 100 milliseconds.

P2P Content Authorization

Beyond transmission and distribution, live distribution technology must also consider content theft and authorization, and a P2P system must consider its own security. We introduced CA certificates and a two-party negotiated encryption scheme to ensure link validity. Roughly: every valid node unit (the Edge Servers and all client nodes) requests validity verification from the CA; if it passes, the CA uses its RSA key to issue a certificate over the node ID, the node's RSA public key, and the authorization start and end times. When two certified node units communicate, they first exchange certificates and verify each other's validity; then one side encrypts a symmetric KEY with the RSA public key from the peer's certificate and returns it, and the peer decrypts the symmetric KEY with its RSA private key. Both sides have then completed the validity check and use the exchanged KEY to encrypt and decrypt packets. The process is as follows:
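
A minimal sketch of the key-exchange half of this scheme, assuming the certificates have already been exchanged and verified (certificate issuance and verification are omitted, and OAEP padding is this sketch's choice, not necessarily the system's).

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

func main() {
	// Stand-in for the peer's RSA key pair; in the real flow the
	// public key would come from the peer's verified certificate.
	peerKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}

	// Generate a random symmetric session KEY for packet encryption.
	sessionKey := make([]byte, 32)
	if _, err := rand.Read(sessionKey); err != nil {
		panic(err)
	}

	// Encrypt the session KEY with the peer's RSA public key ...
	wrapped, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, &peerKey.PublicKey, sessionKey, nil)
	if err != nil {
		panic(err)
	}

	// ... and the peer recovers it with its RSA private key.
	recovered, err := rsa.DecryptOAEP(sha256.New(), rand.Reader, peerKey, wrapped, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("session key exchanged:", string(recovered) == string(sessionKey))
}
```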

Online data comparison

The technical analysis above only explains how the system works; naturally, online data is needed to prove its feasibility. The figure below shows comparison data from a live platform with 100,000+ concurrent users after adopting this P2P system. For the same live stream on the same Edge server, we disabled P2P for half of the user nodes and enabled it for the other half, then observed the two groups' bandwidth consumption on that Edge server over one day.

As the figure shows, bandwidth consumption in P2P mode is only 1/4 of that in non-P2P mode; our P2P system saves 75% of the bandwidth cost. The sample for this data is a single 480P live stream at an 800 kbps bit rate, with 1000+ real nodes at peak and an average end-to-end delay of 1.07 seconds across all terminals.

Afterword

That concludes the technical analysis of our P2P distribution network. P2P technology has been around for 19 years, P2P CDN is set to be a core technology of the next generation of CDNs, and P2P techniques and models keep evolving. We applied UDP and P2P to live broadcast distribution to solve the cost and delay problems of interactive online education; different starting points lead to different results. If cost and delay trouble you too, this technology is worth trying.

About the author

Yuan Rongxi is a senior architect at Xuebajun and a C programmer of 16 years, skilled in building high-performance server systems and in system performance tuning, P2P communication networks, the TCP/IP protocol stack, and authentication and encryption technology. He joined Xuebajun in 2015 and is responsible for building the intelligent routing of Xuebajun's real-time audio/video transmission system and network, solving real-time problems in audio and video communication.

Thanks to Yuta Tian Guang for planning and reviewing this article.