I. Foreword

With the arrival of 5G, users' network speeds have steadily increased, and users now expect lower live-streaming latency and a better overall playback experience. In this context, the Youku technical team, drawing on the industry's mainstream live-streaming architectures, has proposed two Low Latency HLS (LHLS) schemes based on HTTP Live Streaming (HLS). Both have been formally deployed in Youku's live-streaming business.

The advantages and disadvantages of the industry's current mainstream low-latency live-streaming schemes compare as follows:

1. RTMP (Real-Time Messaging Protocol)

Advantages:

1) Designed specifically for streaming media, with good Flash support;

2) Low latency, typically 1-3 seconds;

Disadvantages:

1) Runs over TCP (Transmission Control Protocol) on a non-standard port (1935), which firewalls may block;

2) RTMP is a proprietary Adobe protocol; some devices cannot play it natively and may need a third-party module;

2. HTTP-FLV (Hypertext Transfer Protocol - Flash Video)

Advantages:

1) Uses port 80, so it passes through firewalls more easily;

2) Supports flexible scheduling and load balancing via HTTP 302 redirects;

3) Supports HTTPS (HyperText Transfer Protocol Secure) for encrypted transmission and is compatible with many kinds of terminals;

Disadvantages:

1) Because of how it is transmitted, the media is cached on the local client, which weakens content protection;

2) Switching resolution requires switching the player instance, so seamless switching is not possible. During the switch the picture may glitch (freezing on a frame, a black loading screen, etc.), and the experience gets worse under noticeable network jitter;

3. HLS (HTTP Live Streaming)

Advantages:

1) Widely supported by mainstream platforms and widely used;

2) Uses port 80, so it avoids firewall interception;

3) Supports seamless switching between resolutions;

4) Well supported by CDNs;

Disadvantages:

1) High latency, typically more than 10 seconds;

2) Reducing latency directly by shortening the GOP (Group of Pictures) tends to increase the bit rate;

4. RTP (Real-time Transport Protocol)

Advantages:

1) Good real-time performance, with latency typically under 1 second;

2) Based on UDP (User Datagram Protocol), with high transmission efficiency;

Disadvantages:

1) Mostly used for video conferencing and similar scenarios, and rarely applied directly to PGC (Professionally Generated Content) or OGC (Occupationally Generated Content);

2) Poor support for high bit rates; smoothness is usually achieved by sacrificing bit rate;

Comparing the advantages and disadvantages of these schemes shows that HLS is the best fit for live-streaming scenarios. HLS is also widely adopted by overseas vendors, and Apple has been actively promoting a low-latency live-streaming scheme built on traditional HLS. The Youku technical team therefore chose the LHLS approach.

II. Where the latency comes from

To understand where latency comes from, we first need to look at how HLS is structured:

Simply put, HLS consists of two parts: the M3U8 file (the playlist) and the media content it points to (TS, CMAF, fMP4, etc.). The client downloads the media as directed by the M3U8 and periodically refreshes the M3U8 file to get the latest content list.

Here is a master playlist with multiple bit rates (with only one bit rate, the master playlist can be omitted):

Here is the M3U8 content of one of the media playlists (with only one bit rate, you can serve just this playlist):

The tags above are described as follows. There are many more tags; for details, see the HLS RFC (RFC 8216):
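To make the tag semantics concrete, here is a minimal Python sketch of how a client might read a media playlist. It only understands EXT-X-TARGETDURATION, EXT-X-MEDIA-SEQUENCE and EXTINF and ignores every other tag; the sample playlist and the segment names are invented for illustration, and a real player needs a complete RFC 8216 parser.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MediaPlaylist:
    target_duration: int = 0
    media_sequence: int = 0  # sequence number of the first segment listed
    segments: List[Tuple[int, float, str]] = field(default_factory=list)  # (sequence, duration, uri)


def parse_media_playlist(text: str) -> MediaPlaylist:
    pl = MediaPlaylist()
    pending_duration = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-TARGETDURATION:"):
            pl.target_duration = int(line.split(":", 1)[1])
        elif line.startswith("#EXT-X-MEDIA-SEQUENCE:"):
            pl.media_sequence = int(line.split(":", 1)[1])
        elif line.startswith("#EXTINF:"):
            # Format is "#EXTINF:<duration>,[<title>]"
            pending_duration = float(line[len("#EXTINF:"):].split(",", 1)[0])
        elif line and not line.startswith("#") and pending_duration is not None:
            # A URI line closes the segment announced by the preceding EXTINF.
            seq = pl.media_sequence + len(pl.segments)
            pl.segments.append((seq, pending_duration, line))
            pending_duration = None
    return pl


# Hypothetical sample playlist.
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:1800
#EXTINF:4.000,
seg1800.ts
#EXTINF:4.000,
seg1801.ts
#EXTINF:4.000,
seg1802.ts
"""
```

For the sample, the next segment the client should expect is `media_sequence + len(segments)`, which is 1803.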

1. Playlist refresh and playback mechanism

EXT-X-TARGETDURATION is usually set to the GOP (Group of Pictures) duration. If EXT-X-TARGETDURATION is 4 seconds, it takes 4 seconds to encode a segment and update the playlist. The client polls for the next version of the playlist, so a poll made anywhere in the (4, 8) second interval only returns the playlist containing the first segment. Moreover, requesting and receiving the playlist itself costs one round-trip time (RTT); on a mobile network this can add hundreds of milliseconds before a segment download even begins. The player also cannot start until it has buffered enough data (some players buffer 2-3 segments before starting, which can push latency beyond 12 seconds).
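The arithmetic above can be captured in a deliberately simplified Python model. It assumes the client polls on a fixed interval, that a poll only sees a segment if it is sent after the segment is published, and that the response arrives one RTT later; CDN caching, decoding and rendering are ignored. The function names are hypothetical.

```python
import math


def discovery_delay(ready_at: float, first_poll: float, poll_interval: float, rtt: float) -> float:
    """Seconds between a segment being published and the client seeing it in a playlist.

    Polls happen at first_poll + k * poll_interval; only a poll sent at or after
    ready_at returns the new segment, and its response arrives one RTT later.
    """
    if first_poll >= ready_at:
        k = 0
    else:
        k = math.ceil((ready_at - first_poll) / poll_interval)
    poll_time = first_poll + k * poll_interval
    return poll_time + rtt - ready_at


def startup_latency(segment_duration: float, buffered_segments: int, discovery: float) -> float:
    """Rough floor on latency when the player waits for N whole segments before starting."""
    return buffered_segments * segment_duration + discovery
```

With 4-second segments, a 3-segment start-up buffer alone accounts for 12 seconds, and an unlucky polling phase (a poll sent just before the segment is published) adds nearly another full segment duration on top.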

2. CDN caching mechanism

As shown in the figure below, the origin server has already advanced to the fourth segment, but the CDN edge node still caches the previous version of the playlist (which contains only three segments). The edge node fetches the latest version only after the cached file's TTL expires, so in the worst case the client waits up to one more full TTL. Nor can this cache TTL simply be removed: if every client request reaching the CDN edge went back to the origin for the latest version, the origin could be overwhelmed by traffic.
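The trade-off can be illustrated with a toy model of an edge node. This Python sketch is hypothetical and does not describe any particular CDN: the origin publishes a new playlist version every 4 seconds, the edge caches whatever it last fetched for a fixed TTL, and only an expired entry triggers a request to the origin.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class EdgeCache:
    """Toy CDN edge node that caches a single playlist for a fixed TTL."""
    ttl: float
    fetch_from_origin: Callable[[float], int]  # returns the origin's playlist version at time t
    origin_requests: int = 0
    _version: Optional[int] = field(default=None, init=False)
    _expires_at: float = field(default=float("-inf"), init=False)

    def get(self, now: float) -> Optional[int]:
        # Only an expired entry goes back to the origin; all other requests
        # are answered from the edge, even if the origin has moved on.
        if now >= self._expires_at:
            self._version = self.fetch_from_origin(now)
            self._expires_at = now + self.ttl
            self.origin_requests += 1
        return self._version


def origin_version(t: float) -> int:
    """Hypothetical origin: a new playlist version (one new segment) every 4 seconds."""
    return int(t // 4)
```

For example, with a 2-second TTL and a first fetch at t=3.9 s, a thousand clients asking at t=5.5 s all receive version 0 even though the origin already has version 1, yet the origin is hit only once per TTL window. Shrinking the TTL reduces staleness at the cost of more origin requests.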

3. Network factors

Network conditions are another major source of latency, adding anywhere from a few hundred milliseconds to several seconds depending on the connection.

III. Standard scheme based on Apple's specification: optimizing the protocol itself

To cut latency from 10-30 seconds to under 2 seconds, LHLS proposes six improvements based on Apple's specification:

(1) Reduce segment publishing latency

(2) Optimize the segment discovery mechanism

(3) Eliminate segment request time

(4) Use incremental (delta) updates for the M3U8

(5) Speed up switching between live streams at different bit rates

(6) Dynamic playback-start strategy model based on network scoring

Each optimization point is detailed below:

(1) Reduce segment publishing latency

To reduce publishing latency, the EXT-X-PART and EXT-X-PART-INF tags were introduced. An example M3U8 is shown below: previously an entire TS segment had to be produced before it could be published, whereas now it is published in small pieces called parts.
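A minimal Python sketch of what publishing in parts looks like on the playlist side, assuming a 4-second target duration and 0.5-second parts. The URI naming scheme (`seg1803.part0.ts` and so on) is hypothetical, and the output leaves out tags a real low-latency playlist also carries, such as EXT-X-VERSION, EXT-X-SERVER-CONTROL, INDEPENDENT=YES and the parts of recently completed segments.

```python
from typing import List


def render_ll_playlist(
    target_duration: int,
    part_target: float,
    media_sequence: int,
    full_segments: List[float],  # durations of completed segments
    open_parts: List[float],     # durations of parts of the segment still being encoded
    name: str = "seg",           # hypothetical URI prefix
) -> str:
    lines = [
        "#EXTM3U",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-PART-INF:PART-TARGET={part_target:.3f}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
    ]
    for i, duration in enumerate(full_segments):
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(f"{name}{media_sequence + i}.ts")
    open_seq = media_sequence + len(full_segments)
    for j, duration in enumerate(open_parts):
        # Each part is listed as soon as it is encoded, before its segment is complete.
        lines.append(f'#EXT-X-PART:DURATION={duration:.3f},URI="{name}{open_seq}.part{j}.ts"')
    return "\n".join(lines)
```

With three finished segments starting at 1800 and two 0.5-second parts already encoded, the playlist advertises `seg1803.part0.ts` and `seg1803.part1.ts` while segment 1803 is still being written.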

(2) Optimize the segment discovery mechanism

With blocking playlist reload (EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES), the M3U8 carries the EXT-X-MEDIA-SEQUENCE number, so the client can compute the sequence number of the next segment from the number of segments received and the EXT-X-MEDIA-SEQUENCE base value, and then request the corresponding M3U8 from the server directly:

GET example.com/live.m3u8?_…

This request asks the server to block and return the M3U8 content once segment 1803 appears in the live stream.

Combined with improvement (1), if the server is allowed to deliver partial segments (parts), the request looks like this:

GET example.com/live.m3u8?_…

This request asks the server to block and return the M3U8 content once the first part of segment 1803 (_HLS_part=0) appears in the live stream.
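The client-side arithmetic can be sketched in Python as follows. The query parameter names `_HLS_msn` and `_HLS_part` are the delivery directives defined in Apple's Low-Latency HLS specification; because the requests above are truncated, this is an illustration rather than a reconstruction of them, and the base URL is a placeholder.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def next_msn(media_sequence: int, segment_count: int) -> int:
    """Sequence number of the first segment not yet present in the playlist."""
    return media_sequence + segment_count


def blocking_reload_url(base: str, msn: int, part: Optional[int] = None) -> str:
    """Ask the server to hold the response until segment `msn` (or one of its parts) exists."""
    params: Dict[str, int] = {"_HLS_msn": msn}
    if part is not None:
        params["_HLS_part"] = part
    return f"{base}?{urlencode(params)}"
```

A client holding a playlist whose EXT-X-MEDIA-SEQUENCE is 1800 with three segments would block on segment 1803, or on part 0 of it when parts are enabled.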

(3) Eliminate segment request time

The last parameter of the request above, _HLS_push, is more subtle and is also a major change in the HLS protocol upgrade. It requires the server to support HTTP/2, mainly for its multiplexing and server-push capabilities: when the client requests the playlist, the content of the segment or part is pushed down immediately after it, saving one RTT.

Comparing HLS before and after these three improvements, the client can now begin decoding the first frame with low latency. In the example in the figure, each part is 0.5 seconds long and network transmission takes 0.5 seconds, so in theory the latency observed by the client can be as low as 1 second. The part duration can be shortened further, for example to 0.2 seconds, to achieve even lower latency.

(4) Use incremental (delta) updates for the M3U8

Because M3U8 requests can reach 3-4 per second, an incremental update mechanism was introduced to further reduce the amount of data transmitted over the network:

Here, EXT-X-SERVER-CONTROL:CAN-SKIP-UNTIL=36.0 tells the client that the server can skip segments more than 36 seconds older than the current live edge. This value must be at least six times EXT-X-TARGETDURATION. The client can then ask the server for an incremental update with the request parameter _HLS_skip=YES.
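A server-side sketch in Python of how such a delta update could be produced. The skipping rule used here (drop every segment that ends at least CAN-SKIP-UNTIL seconds before the live edge and replace those segments with a single EXT-X-SKIP:SKIPPED-SEGMENTS tag) is a simplification of Apple's rules, and the segment names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, float]  # (uri, duration in seconds)


def delta_update(segments: List[Segment], skip_until: float) -> Tuple[int, List[Segment]]:
    """Return (number of skipped segments, segments that must still be listed)."""
    kept: List[Segment] = []
    behind_edge = 0.0  # seconds from the end of the current segment to the live edge
    for uri, duration in reversed(segments):
        if behind_edge >= skip_until:
            break  # this segment and all older ones lie beyond the skip boundary
        kept.append((uri, duration))
        behind_edge += duration
    kept.reverse()
    return len(segments) - len(kept), kept


def render_delta(media_sequence: int, target_duration: int,
                 segments: List[Segment], skip_until: float) -> str:
    skipped, kept = delta_update(segments, skip_until)
    lines = [
        "#EXTM3U",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
        f"#EXT-X-SERVER-CONTROL:CAN-SKIP-UNTIL={skip_until:.1f}",
    ]
    if skipped:
        lines.append(f"#EXT-X-SKIP:SKIPPED-SEGMENTS={skipped}")
    for uri, duration in kept:
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(uri)
    return "\n".join(lines)
```

With 6-second segments, 36.0 is exactly six target durations, matching the ratio described above; a twelve-segment playlist shrinks to six segment entries plus one EXT-X-SKIP line.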

This is especially useful in some scenarios. Some live streams let users rewind over a period of time, so their M3U8 files can grow large, possibly to hundreds of KB. Incremental updates can greatly reduce the amount of data transferred.

(5) Speed up switching between live streams at different bit rates

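The solution is to append EXT-X-RENDITION-REPORT tags to the end of the M3U8 (e.g., GET example.com/1M/live.m3u…). Each report gives the latest media sequence number (LAST-MSN) and part (LAST-PART) of another rendition, so a client switching bit rates can immediately issue a blocking request for the right position in the new playlist.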
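As an illustration, this Python sketch reads one EXT-X-RENDITION-REPORT line. URI, LAST-MSN and LAST-PART are the attribute names used in Apple's specification; the relative URI in the example is hypothetical, and the attribute-list parser is deliberately minimal (it does not cover every quoting edge case).

```python
import re
from typing import Dict, Optional, Union

# KEY=value pairs in an HLS attribute list; quoted values may contain commas.
_ATTR = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(attr_list: str) -> Dict[str, str]:
    return {key: value.strip('"') for key, value in _ATTR.findall(attr_list)}


def latest_position(report_line: str) -> Dict[str, Union[str, int, Optional[int]]]:
    """Latest segment/part of another rendition, taken from a rendition report."""
    attrs = parse_attributes(report_line.split(":", 1)[1])
    return {
        "uri": attrs["URI"],
        "msn": int(attrs["LAST-MSN"]),
        "part": int(attrs["LAST-PART"]) if "LAST-PART" in attrs else None,
    }
```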

(6) Dynamic playback-start strategy model based on network scoring

A graded network model uses parameters such as the user's signal strength, bandwidth, and packet loss rate to quantify network quality (for example: good network, 3 points; average network, 2 points; poor network, 1 point; no network, 0 points), and the playback start position is then chosen according to the score. Once in effect, the dynamic strategy model balances latency against stalling well: in practice, compared with HLS live streaming, LHLS reduced latency by 50%-80% without a significant increase in the stalling rate.
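The grading idea can be sketched in Python. Every threshold and every start-up offset below is invented for illustration; the text gives only the 0-3 grading, not Youku's actual parameters.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NetworkSample:
    signal_dbm: float      # received signal strength
    bandwidth_kbps: float  # measured throughput
    packet_loss: float     # fraction of packets lost, 0.0-1.0


def network_score(s: NetworkSample) -> int:
    """Grade the network 0-3 as in the text; all thresholds are hypothetical."""
    if s.bandwidth_kbps <= 0:
        return 0  # no network
    if s.bandwidth_kbps >= 4000 and s.packet_loss < 0.01 and s.signal_dbm > -90:
        return 3  # good
    if s.bandwidth_kbps >= 1500 and s.packet_loss < 0.05:
        return 2  # average
    return 1  # poor


# Hypothetical: seconds behind the live edge at which playback starts, per grade.
START_OFFSET_S: Dict[int, float] = {3: 1.0, 2: 2.0, 1: 4.0}


def start_offset(s: NetworkSample) -> Optional[float]:
    """Better networks start closer to the live edge; with no network, do not start yet."""
    return START_OFFSET_S.get(network_score(s))
```

Starting further from the live edge on weak networks gives the player more buffer against stalls, which is the latency-versus-stalling trade-off the model is meant to balance.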

IV. Community LHLS: optimizing with HTTP chunked transfer

Before Apple released its HLS-based low-latency draft, the major technical communities had already explored similar ideas. The core technique of their scheme is the chunked transfer coding mechanism of HTTP/1.1 (RFC 2616). Chunked transfer encoding is mainly used to send data of unknown length that is not yet fully available when the client makes its request, as in this simple HTTP response:
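A minimal Python sketch of the chunked framing itself: each chunk is its size in hexadecimal, CRLF, the data, CRLF, and a zero-size chunk ends the body. It ignores chunk extensions and trailers, and the helper names are hypothetical.

```python
from typing import Iterable, Iterator


def encode_chunked(pieces: Iterable[bytes]) -> Iterator[bytes]:
    """Frame each piece as an HTTP/1.1 chunk, then emit the terminating zero-size chunk."""
    for piece in pieces:
        if piece:  # an empty chunk would signal the end of the body too early
            yield b"%x\r\n%s\r\n" % (len(piece), piece)
    yield b"0\r\n\r\n"


def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked body (no chunk extensions or trailers)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end], 16)
        if size == 0:
            return bytes(out)
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the data and its trailing CRLF
```

Because the sender only has to know the size of each piece, not of the whole body, it can start responding before the segment is finished.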

As this example shows, chunked transfer coding is a natural fit for delivering HLS segment data that is "yet to come". The core change the community LHLS scheme makes to standard HLS is to list segment URLs in the playlist several segments in advance. For example, when a live stream starts and its first frame arrives at the server from the publisher, the server immediately publishes an HLS media playlist with three (configurable) segments. When a client receives the playlist, it requests the first TS segment (some clients requested all three at once, but this was shown to cause bandwidth contention and has since been widely changed). The server answers each request with chunked transfer encoding. A request for the first segment first receives whatever data the segment has accumulated when the request arrives; the remaining data (for the rest of the segment's duration) is sent to the client as it actually arrives. Once the first TS segment has been received, the client requests the second one directly, and so on.
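The server behaviour described above (send what has accumulated so far, then forward the rest as it arrives) can be sketched in Python with a condition variable. `GrowingSegment` is a hypothetical helper; in a real server, each yielded piece would be written to the response as one HTTP chunk.

```python
import threading
from typing import Iterator, List


class GrowingSegment:
    """A TS segment the encoder is still writing (hypothetical helper)."""

    def __init__(self) -> None:
        self._chunks: List[bytes] = []
        self._done = False
        self._cond = threading.Condition()

    def append(self, data: bytes) -> None:
        with self._cond:
            self._chunks.append(data)
            self._cond.notify_all()

    def finish(self) -> None:
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def stream(self) -> Iterator[bytes]:
        """Yield everything accumulated so far, then each new piece as it arrives."""
        sent = 0
        while True:
            with self._cond:
                self._cond.wait_for(lambda: sent < len(self._chunks) or self._done)
                pending = self._chunks[sent:]
                done = self._done
            # Yield outside the lock so the encoder is never blocked by a slow client.
            for piece in pending:
                yield piece
            sent += len(pending)
            if done and not pending:
                return
```

A request arriving mid-segment therefore gets the backlog immediately and then stays open, receiving new data with no extra request round trips until the segment is finished.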

The benefit of advertising segments in the playlist before they are available is that it removes the playlist latency caused by the client's polling frequency and by the playlist's TTL in the CDN cache. Because segments are announced several seconds before they actually contain media, the CDN can cache the playlist without any impact on latency. The client learns about upcoming segments and requests them a few seconds in advance, so it receives the data as soon as the server gets it.
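This reasoning can be turned into a rough Python check. It assumes the server always advertises `lookahead` segments (the one being written plus `lookahead - 1` future ones), adds one new future segment per segment duration, and that the client's view of the playlist lags the origin by at most CDN TTL + polling interval + RTT. The function names are hypothetical.

```python
def advertised_lead_time(lookahead: int, segment_duration: float) -> float:
    """Seconds between a segment URL first appearing and media starting to flow into it."""
    return (lookahead - 1) * segment_duration


def playlist_lag_is_hidden(lookahead: int, segment_duration: float,
                           cdn_ttl: float, poll_interval: float, rtt: float) -> bool:
    """True if the client always learns a segment's URL before its media starts."""
    worst_lag = cdn_ttl + poll_interval + rtt
    return worst_lag <= advertised_lead_time(lookahead, segment_duration)
```

Under these assumptions, three advertised 4-second segments absorb up to 8 seconds of playlist staleness, which is why the playlist can be cached normally; advertising only one segment ahead leaves no margin at all.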

V. Comparison of LHLS technical schemes

1. Standard LHLS scheme:

(1) Advantages: The LHLS low-latency scheme based on Apple's specification both conforms to Apple's standard and allows flexible extension, which makes it easier to optimize the user's playback experience (especially stalling and latency). It also supports overseas expansion (this approach is currently the one mainly recognized overseas);

(2) Disadvantages: Using certain HTTP/2 features (such as multiplexing and server push) requires HTTP/2 support on the CDN side, so vendors that only support HTTP/1.1 cannot take full advantage of the scheme;

2. Community LHLS scheme:

(1) Advantages: Based on HTTP chunked transfer encoding, which natively supports streaming data whose size is not known in advance, so it is easier to implement;

(2) Disadvantages: Its acceptance for overseas expansion is uncertain.

These new schemes not only solve the pain points of the existing ones but may also give rise to new forms of live-streaming business in the near future. The Youku technical team will keep working on live-streaming optimization, improving the viewing experience of Youku video while helping drive technical iteration across the industry.