
Author: HSY

Background

For online video playback, content can be divided into VOD (video on demand) and live streaming according to how real-time it is. Nowadays, when a video needs to be played in a Web environment, you can usually use the video tag and hand every aspect of playback over to the browser.

From the perspective of the video consumer, the main technical links of online video playback are the decoding and display of video data on one hand, and the transmission of video data on the other. The Web standards decouple these two links through the video tag, so developers do not need to care about decoding and display and can instead extend the data loading link.

In the VOD scenario, the producer of the video has prepared the video data in advance, and a developer standing on the consumer side only needs to point the src attribute of the video tag at the corresponding resource address. However, some more complex requirements, such as carefully controlling the preloading timing and the amount of preloaded data, require a further understanding of a few video specifications and related technical points.

This article briefly introduces the technical content needed to extend the data loading link in the VOD scenario.

Video file parameters

Video files have some common parameters that serve as a starting point for video-related technology; a brief understanding of them helps you quickly grasp the topics built on top of them.

Generation of video

Originally, video came from light reflected by things in nature, collected by image-acquisition equipment at a certain frequency and then saved; when it needs to be watched, the saved content is played back at a certain frequency. This general process continues to this day, although thanks to digital technology video content can now also be generated directly through software.

The spectrum and the frequency of change that humans can perceive both have a certain range; changes below or above this range cannot be perceived by most people. Therefore, when saving and playing back video, its parameters need to be tuned according to the viewing experience and the limits of hardware and software resources.

Frame rate

Video works like a rapidly switching slideshow.

Each switch of the picture is called a frame. Frame rate is the number of frames switched per second, so its unit is frames per second (FPS). Frame rate is not directly related to the sharpness of a video, but it is an important factor in the viewing experience.

Human perception of picture switching has its limits; around 60 FPS is generally a suitable range. But this is not always the goal. When recording a fleeting change, enough frames must be prepared to capture the subtle variations; when shooting a slow, gradual progression, the frame rate does not need to be as high and resolution matters more.

The choice of frame rate should take into account not only the content being played but also the refresh rate of the display device; if the frame rate exceeds what the display device supports, the extra frames are simply discarded.

Resolution

As the video plays, each frame that appears on the screen contains the same number of pixels. A pixel is the smallest unit of light on a display device, and the resulting picture is a combination of several pixels.

Resolution refers to the number of pixels contained in each frame of a video, which is expressed as the number of pixels in the horizontal direction × the number of pixels in the vertical direction. For example, 720p = 1280 × 720 is a common resolution.

The P here stands for Progressive Scanning, as opposed to the I for Interlaced Scanning, as shown in the figure below

In the figure, the leftmost column is progressive scanning and the middle column is interlaced scanning. Interlaced scanning loses some image information, but in return it captures the image faster, and the amount of pixel information — that is, the file size — is smaller than with progressive scanning.

When the resolution of the video is lower than that of the display device, the device has more pixels than the video needs. Various interpolation algorithms are then used to generate color values for the otherwise unused pixels so that all pixels of the device are lit; otherwise black dots would appear on the screen.

Here’s a short video on interpolation algorithms for reference: Resizing Images.

Because the information in those pixels is generated algorithmically, when the resolution of the video is significantly lower than that of the display device, too many pixel values have to be interpolated and the picture loses perceived clarity. Conversely, if the video's resolution exceeds the limit of the device, the extra information is discarded, so the picture does not look any better.

Therefore, the selection of video resolution needs to be combined with the resolution of the display device.

Bitrate

The unit of bitrate is bit/s, indicating how many bits one second of video contains.

Today's image-acquisition equipment can capture a huge amount of pixel information, so the video author needs to compress and transcode the original footage to make the video easier to distribute.

Bitrate is determined by the size of the video file per unit of time, which in turn is affected by the original content, the transcoding resolution, the frame rate, and the encoding scheme (codec) used for transcoding.

So bit rate is not a parameter directly related to video sharpness.

For offline playback, bitrate is not an important parameter; for online on-demand playback, however, it becomes an indicator that must be considered.

Bitrate represents the number of bits that must be transmitted to display one second of picture, which makes it easy to compare with bandwidth. For online VOD, it is necessary to transmit as many bits as possible per second under limited bandwidth, and those bits must be enough for the picture to be transmitted without problems.
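As a rough, illustrative calculation (the 5 Mbps and 4 Mbps figures below are made-up values, not recommendations), the relationship between bitrate, file size and bandwidth looks like this:

```ts
// Back-of-the-envelope numbers relating bitrate to file size and bandwidth.
const bitrateBps = 5_000_000;    // a 5 Mbps video (illustrative value)
const durationSeconds = 10 * 60; // a 10-minute video

// One second of video needs bitrateBps bits, so the whole file is roughly:
const fileSizeBytes = (bitrateBps * durationSeconds) / 8;
console.log(`~${(fileSizeBytes / 1024 / 1024).toFixed(0)} MiB`); // ~358 MiB

// To stream without stalling, the available bandwidth must at least match the bitrate:
const bandwidthBps = 4_000_000;  // a 4 Mbps connection (illustrative value)
console.log(bandwidthBps >= bitrateBps ? "can stream smoothly" : "will buffer");
```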

Video format

Because the computer can only follow the established program logic, the video data needs to be arranged in a pre-defined format before it can be saved on the device. In the formulation of the format, two main types of information are saved:

  1. Video metadata
  2. Video body data

Metadata contains descriptive information about the body data, such as the size and resolution of the video and the encoding schemes of the audio and video.

How metadata and body data are combined for storage is specified by the container format; common container formats include MP4, AVI, etc. MP4 is generally chosen as the container format because it is widely supported across systems and devices.

For the video body data, additional parameters, usually called the encoding scheme (codec), specify how it is encoded. Different encoding schemes trade off between file size and final playback quality.
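To make the distinction concrete, here is a small sketch showing how a browser is asked about a container-plus-codec combination via a MIME type string; the codec strings are common examples, not recommendations:

```ts
// The container format (video/mp4, video/webm, ...) and the codecs inside it are
// expressed together in a MIME type string when asking the browser what it can play.
const probe = document.createElement("video");

// MP4 container carrying H.264 video and AAC audio
console.log(probe.canPlayType('video/mp4; codecs="avc1.42E01E, mp4a.40.2"')); // "probably", "maybe" or ""

// WebM container carrying VP9 video
console.log(probe.canPlayType('video/webm; codecs="vp9"'));
```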

Segmented loading of video data

Some common video parameters have been briefly described above. In the VOD scenario, the producer has already finished the video: the content has been encoded with a certain encoding scheme and stored, together with other information, in a container format. From the consumer's perspective, to start watching as soon as possible it is unacceptable to wait for the whole file to load before playback begins, so segmented loading and decoding of video data must be supported technically.

As mentioned at the beginning of this article, the Web standards decouple decoding and playback from data loading, so as a developer you only need to implement segmented loading of video data.

Before we get into the technical details of segmented loading, a simple example gives a feel for the whole process.

Take HLS as an example

This example uses the HLS protocol. HLS is just one of many protocols for segmented loading of video data; its details are covered in the next section. Start by running the example to get a concrete feel for segmented data loading.

To run the example, make sure FFmpeg is installed on your computer and your network connection is working:

wget -qO- https://gist.githubusercontent.com/hsiaosiyuan0/412a4ca26d00ff7c45318227e7599c3d/raw/de5e3c467800d1f2315e74ba71f0eb6c6760a7a0/hls.sh | bash

This command downloads a script from the network and executes it. The script does the following:

  1. Download a video from the Internet
  2. Compress and segment the video with FFmpeg
  3. Generate an M3U8 file to describe and index the segments
  4. Generate an HTML file for the demo, in which the video tag is used for VOD playback
  5. Start a local Web server using svrx

If you can’t run the script, you can watch this video or GIF to see the demo.

As the demo shows, the browser loads the segments and switches between segments of different definitions as the network environment changes.

HTTP Live Streaming (HLS) protocol

HLS is a streaming playback protocol introduced by Apple to meet online on-demand needs. Because it is based on HTTP, it can make full use of existing HTTP infrastructure and optimizations.

As the protocol's workflow shows, after recording is completed the video needs to be uploaded to a Web server for segmentation and indexing, and the time consumed by this step means the information users receive lags slightly. So although it is called Live Streaming, there is still a gap between it and real-time.

Apple not only provides the details of the HLS protocol but also tools for implementing it; see About Apple’s HTTP Live Streaming Tools for an introduction. Since the protocol itself is open, you can also use FFmpeg to do some of the same things, as in the example in the previous section.

The key to the HLS protocol, as well as the key to the segmented loading of data, lies in two parts:

  1. What strategy is used to segment the data
  2. How the segmentation results are described and indexed

For the segmentation strategy, a video is usually divided into small segments of equal playing time; in the example above, the video was divided into 10-second segments.

In the HLS protocol, the segments are described and indexed by an M3U8 file. As in the example above, besides segmenting the video, the producer needs to generate an M3U8 file that describes the segments; once the consumer obtains the M3U8 file, it can flexibly choose how to load the segments.

Next, we’ll get a better understanding of segmentation by looking at the M3U8 file in more detail.

M3U8 file

M3U is a file format for storing the information needed to play audio and video files, and M3U8 is the name of the format when it is encoded in UTF-8.

An M3U8 file plays the role of a Playlist in the HLS protocol, similar to a playlist in music-player software, except that a music playlist is for users to choose from, while an M3U8 playlist is for the player to choose from.

Because it is encoded in UTF-8, it can be opened with an ordinary text editor. The following is the content of the M3U8 file generated in the previous example:

See RFC 8216 – HTTP Live Streaming for a complete description of the M3U8 format. The parts of the protocol involved in the screenshot are as follows:

  • There are only three kinds of lines in the file: URIs, blank lines, and lines starting with #. Blank lines, like comments, are ignored by the parser

  • The protocol specifies that a line starting with # is either a comment or a Tag: if the # is immediately followed by the text EXT (case sensitive), the line is a Tag; otherwise it is a comment and is ignored by the parser

  • So the first line, #EXTM3U, is a tag. It indicates that the current file is an extended M3U file, and the protocol requires it to appear on the first line of the file

  • The remaining lines starting with #EXT are also tags; they are interleaved with URI lines and describe the URI resources that follow them

  • Some tags carry values in the form of an Attribute List. The attributes in an attribute list are separated by commas (,), and each attribute takes the form AttributeName=AttributeValue. The value of the #EXT-X-STREAM-INF tag, for example, is an attribute list

  • Playlists come in two types. The one in the screenshot is a Master Playlist; the URIs in a Master Playlist point to the other type, Media Playlists, which record the actual segment information of a stream

  • The #EXT-X-STREAM-INF tag describes a Variant Stream. Most of its attributes correspond to the video parameters described in the first section. BANDWIDTH is the peak bit rate over the segments of the stream; another common attribute, AVERAGE-BANDWIDTH (not present here), is the average bit rate over the segments. The client selects the appropriate stream for the next segment based on these two values and its current bandwidth

Therefore, in the example above, by pointing the video tag at a Master Playlist that offers three definitions of the same video, the browser automatically switches among the three streams according to its current bandwidth. Of course, if automatic definition switching is not needed, you can also point the video tag directly at a Media Playlist to play a stream of one specific definition.
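To make the Master Playlist concrete, here is a small sketch that parses the #EXT-X-STREAM-INF tags of an illustrative playlist (not the exact file produced by the earlier script) and picks a variant by bandwidth:

```ts
// An illustrative master playlist; the URIs and bandwidth values are made up.
const masterPlaylist = `#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1400000,RESOLUTION=842x480
480p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p.m3u8`;

type Variant = { bandwidth: number; uri: string };

function parseMaster(text: string): Variant[] {
  const lines = text.split("\n").map((l) => l.trim());
  const variants: Variant[] = [];
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("#EXT-X-STREAM-INF:")) {
      const bw = Number(/BANDWIDTH=(\d+)/.exec(lines[i])?.[1] ?? 0);
      variants.push({ bandwidth: bw, uri: lines[i + 1] }); // the URI line follows its tag
    }
  }
  return variants;
}

// Pick the highest-bandwidth variant that fits the measured bandwidth,
// falling back to the lowest-bandwidth one otherwise.
function pickVariant(variants: Variant[], measuredBps: number): Variant {
  const fitting = variants.filter((v) => v.bandwidth <= measuredBps);
  return fitting.length
    ? fitting.reduce((a, b) => (a.bandwidth > b.bandwidth ? a : b))
    : variants.reduce((a, b) => (a.bandwidth < b.bandwidth ? a : b));
}

console.log(pickVariant(parseMaster(masterPlaylist), 1_500_000).uri); // "480p.m3u8"
```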

Dynamic Adaptive Streaming over HTTP (DASH)

The HLS protocol is widely supported on Apple devices because Apple introduced it; in Safari on those devices, the video tag can act directly as an HLS client.

In browser environments that do not support HLS, a similar protocol developed and maintained directly by the MPEG group, MPEG-DASH, is often used instead. MPEG is the international group behind many audio/video standards, so DASH is more vendor-neutral than HLS.

DASH is another HTTP-based streaming protocol whose technical details resemble HLS. For example, it provides functionality similar to the M3U8 file through the Media Presentation Description (MPD) file. Below is a screenshot of an MPD file's contents:

The DASH protocol is not natively supported by the major browsers, so it usually requires a client-side implementation built on MSE, which we'll cover next, to perform segmented loading. For mobile playback, HLS generally suits more devices: devices that can play DASH (that is, support MSE) can usually also play HLS through an MSE-based HLS plug-in, while the reverse is not always true.

Media Source Extensions

MSE, or Media Source Extensions, is the interface defined in the Web standards for custom audio/video data loading; Web applications can implement their own data loading schemes on top of it. Since WebRTC is not a VOD protocol, MSE is effectively the only audio/video data loading interface in VOD scenarios.

The HLS and DASH protocols mentioned above can be supported, in browsers that do not support them natively but do support MSE, by plug-ins implemented on top of MSE.

Below is an overview of current MSE compatibility; also see Caniuse – MediaSource:

On Apple's mobile devices, MSE is currently supported on iPadOS but not on iOS. Android devices generally have good support, which is why the HLS protocol mentioned above can cover more devices: it is supported natively on Apple devices and through MSE-based HLS plug-ins on Android devices.

The purpose of MSE is to decouple audio/video playback from data loading. Web developers do not need to intervene in the existing decoding and playback behavior of audio/video elements; they only need to pass the playback data to the element through a MediaSource. The following diagram illustrates the relationship:

The audio/video element is responsible only for decoding and playing the data; the application code applies its custom loading strategy and loads the data; the two sides interact through the MediaSource.
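A minimal wire-up sketch of that relationship, assuming an MP4/H.264/AAC stream (the MIME string must match the actual media):

```ts
const videoEl = document.querySelector("video")!;
const mimeCodec = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"';

if ("MediaSource" in window && MediaSource.isTypeSupported(mimeCodec)) {
  const mediaSource = new MediaSource();
  // The element now pulls its data from the MediaSource instead of a plain URL.
  videoEl.src = URL.createObjectURL(mediaSource);

  mediaSource.addEventListener("sourceopen", () => {
    const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
    // The application decides when and what to load, then hands the bytes over:
    // sourceBuffer.appendBuffer(someArrayBuffer);
  });
}
```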

AbortController

MSE is used to customize the audio/video data loading strategy, and an important part of any loading strategy is the ability to abort in-flight requests whose data is no longer needed, in order to save bandwidth. With the Fetch API, aborting a request requires AbortController.

Here is an overview of AbortController’s current compatibility; also see Caniuse – AbortController:

It is not supported in IE at all; setting IE aside, its compatibility is roughly the same as MSE's, so the two can be used together.

XMLHttpRequest.abort() is a more widely compatible alternative when AbortController is not available. AbortController, however, can "group" requests under one signal, which XMLHttpRequest does not offer by default.
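A minimal sketch of aborting fetches that share one signal (the segment URLs are placeholders):

```ts
const controller = new AbortController();

async function loadSegment(url: string): Promise<ArrayBuffer> {
  // All requests created with this signal belong to the same "group".
  const res = await fetch(url, { signal: controller.signal });
  return res.arrayBuffer();
}

loadSegment("/segments/seg-480p-01.ts").catch((e) => console.log(e.name)); // "AbortError"
loadSegment("/segments/seg-480p-02.ts").catch((e) => console.log(e.name));

// e.g. the user seeks or the definition switches, so these segments are useless now:
controller.abort();
```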

Video loading rate optimization

Generally speaking, there are two directions for optimizing video loading speed: dynamic definition switching and video content preloading. Both are briefly introduced below.

Dynamic definition switching

Dynamic definition switching means switching among streams of different definitions according to the device's current bandwidth while the video is playing.

With HLS, the server compresses the video into segments in advance and provides the M3U8 file; after receiving the M3U8 file, the client dynamically switches which segments it loads according to its current bandwidth.

For example, three streams of different definitions are prepared for the same video, and each stream is divided into three segments:

720P --seg11--  --seg12-- --seg13--
480P --seg21--  --seg22-- --seg23--
360P --seg31--  --seg32-- --seg33--

Under the simplest algorithm, the client can select the middle-definition stream, 480P, and load its first segment (seg21). From the loading time and the size of seg21, the current download rate can be estimated; if it is lower than the bandwidth required for smooth 480P playback, say 1500 kbps, the next segment is chosen from a stream that fits the current bandwidth — in this example seg32 from the 360P stream.

Of course, since bandwidth changes dynamically and is only an estimate, switching after a single segment as in this example is largely unreasonable; it is for demonstration only. For a more practical algorithm, refer to the Adaptive Bit Rate logic of the DASH protocol.
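A toy sketch of the idea above — measure the download rate of one segment, then pick the next stream; the bandwidth thresholds and segment URLs are made-up values:

```ts
type Rendition = { name: string; requiredKbps: number; nextSegmentUrl: string };

const renditions: Rendition[] = [
  { name: "720P", requiredKbps: 3000, nextSegmentUrl: "/720p/seg12.ts" },
  { name: "480P", requiredKbps: 1500, nextSegmentUrl: "/480p/seg22.ts" },
  { name: "360P", requiredKbps: 800,  nextSegmentUrl: "/360p/seg32.ts" },
];

async function measureAndPick(segmentUrl: string): Promise<Rendition> {
  const start = performance.now();
  const buf = await (await fetch(segmentUrl)).arrayBuffer();
  const seconds = (performance.now() - start) / 1000;
  const observedKbps = (buf.byteLength * 8) / 1000 / seconds; // download rate of this segment

  // Choose the highest definition whose required rate the observed rate can sustain,
  // falling back to the lowest definition otherwise.
  return renditions.find((r) => observedKbps >= r.requiredKbps) ?? renditions[renditions.length - 1];
}
```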

Video content preloading

Video content preloading, as the name suggests, preloads a portion of the video before it starts playing, so that when the user clicks play, they get immediate feedback.

The algorithm of video content preloading is developed based on two points:

  1. Predicting which video the user is about to play
  2. Preloading without impeding the current playback progress

For the first point, specific adjustments need to be made according to business requirements.

For the second point, the simplest algorithm sets a buffer safety threshold, such as 5 seconds, for the playing video: if the currently playing video has at least 5 seconds of content buffered ahead, it can try to preload the next video.

So in general, preloading is a matter of doing what is technically feasible, with the specific strategy shaped by actual needs. In scenarios where MSE is not supported, preloading can still be enabled through the preload attribute of the video tag, but the timing and the amount of preloaded data are then managed by the browser.
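A minimal sketch of the 5-second threshold idea, using the element's buffered ranges (the threshold value and the preloadNext callback are assumptions):

```ts
const BUFFER_SAFETY_SECONDS = 5;

// How many seconds are buffered ahead of the current playback position?
function bufferedAhead(video: HTMLVideoElement): number {
  const { buffered, currentTime } = video;
  for (let i = 0; i < buffered.length; i++) {
    if (buffered.start(i) <= currentTime && currentTime <= buffered.end(i)) {
      return buffered.end(i) - currentTime;
    }
  }
  return 0;
}

function maybePreloadNext(video: HTMLVideoElement, preloadNext: () => void) {
  // Only start preloading the next video when the current one is safely buffered.
  if (bufferedAhead(video) >= BUFFER_SAFETY_SECONDS) {
    preloadNext();
  }
}
```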

MSE and MP4

The HLS and DASH protocols and MSE have been briefly introduced, along with ways to optimize video loading speed. For both HLS and DASH, existing open-source implementations such as hls.js and dash.js can be used directly for client-side playback support.

The prerequisite for HLS and DASH is that video files are segmented in advance according to the protocol and the actual business requirements — in other words, the server's cooperation is required.

The following introduces the feasibility of using HTTP Range requests and the MP4 container format, plus MSE, to achieve segmented loading of video content without HLS or DASH. This approach needs almost no extra cooperation from the server, and walking through it is also a good way to learn more about MSE and the MP4 container format.

HTTP Ranges

In VOD scenarios, the distribution of audio and video files is usually accelerated through a CDN, and the Web servers in a CDN usually support HTTP Range requests. This feature allows a client that knows the layout of a file in advance to retrieve portions of it on demand.

So the ability to load one segment of video data is provided by the CDN by default. What remains is how the client can know the segment layout in advance. In HLS or DASH, this is achieved through the index file, M3U8 or MPD: by loading the index file, the client learns the segments of the entire video and their relative offsets, and can then carry out the subsequent loading.
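A minimal sketch of an HTTP Range request with the Fetch API (the URL is a placeholder; the server must support Range requests):

```ts
const res = await fetch("https://example.com/video.mp4", {
  headers: { Range: "bytes=0-1023" }, // ask for the first 1 KiB only
});

console.log(res.status); // 206 Partial Content when the range is honored
const chunk = new Uint8Array(await res.arrayBuffer());
console.log(chunk.byteLength); // 1024
```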

The role of the M3U8 file can be played directly by the MP4 container format, because MP4 defines a moov box that records the index information of the media data. Details of the MP4 container format are covered in the next section.

Depending on the software used to produce the video, the moov box sometimes sits at the end of the file. So, to let client players reach the moov box as early as possible, CDN providers generally suggest that video providers move the moov information to the front before distribution; the NOS documentation, for example, gives similar advice — see Flash Player On Demand.

As you can see, understanding the MP4 file format is an important step in this approach, so it is briefly introduced below.

MP4 Container Format

MP4 is a container format. "Container" is emphasized because the format's job is to organize all the information describing a video into one unified store; the often-mentioned encoding format is just one of the many pieces of information the MP4 container holds.

MP4 format has the following features:

  • MP4 files are binary, and the format is specified in an object-oriented way
  • Video information is organized by category into Box types, so an entire MP4 file is made up of boxes, each an instance of its corresponding Box type
  • Boxes are nested to express the inclusion relationships between them

You can view the structure of an MP4 file by loading this demo video into the online tool provided by mp4box.js:

Besides the online tool, AtomicParsley is a command-line program that can quickly show the internal box structure of an MP4 file.

Use the following command:

AtomicParsley 1.mp4 -T

In the figure above, the Box Tree View on the left shows the boxes contained in the file and their hierarchy, and the right side shows the properties of the currently selected box.

The top-level boxes in the figure include ftyp, free, mdat and moov. The moov box contains further boxes, shown as child nodes. The mdat box holds the audio/video data itself, while the moov box holds the metadata describing that data.

The numerous box types in MP4 and their relationships are shown below and also in ISO/IEC 14496-12:

The indentation in the table shows the inclusion relationships between boxes. One of the more important boxes inside moov is the trak box. In MP4, picture data and audio data are saved separately; a video typically contains one video track and one audio track, and their information is stored in two separate trak boxes.

A trak box mainly records the encoding information of its track and the index of its data chunks; the chunks themselves are stored in the mdat box.

So as soon as the client obtains a video's moov information, it can load the audio/video data in mdat on demand.

After understanding the relationships between boxes, the following figure briefly shows the data structure of a box:

The data of a box can be divided into a head and a body. This split is not stipulated in the specification, but it is a convenient way to think about it. Note the size field in the head, which gives the size of the entire box; this design lets the client skip quickly from box to box and parse only the parts it is interested in.

If the file as a whole is regarded as one big box, then ftyp, moov and the other top-level boxes are its children. The order of boxes within their parent is not fixed; for example, the standard only requires ftyp to appear as early as possible, not necessarily first.

Because the order of boxes in the container is not fixed, client software can only find the moov box by repeatedly requesting top-level box headers and skipping ahead, which can be demonstrated with a short program. Here is the source of the demo code, which can be run with the following command:

deno --allow-net https://gist.githubusercontent.com/hsiaosiyuan0/a7b215b53b48b9e66d2e0bad9eb8d1dd/raw/6bf88be0c81d55f620b1d94d51f5a56b55691007/locate-moov.ts

If running this script is not convenient, you can also view the resulting demo here.

The script takes advantage of the box data structure: it requests the first 8 bytes from the start of the file — the head of the box mentioned above — which contain the size and type of the box, 4 bytes each. Each subsequent request adds the box size to the offset to reach the next box, again requesting only the header. After a few hops, the moov box is found.
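A simplified sketch of that loop, assuming every top-level box uses a 32-bit size field (real files may use 64-bit "largesize" boxes or a zero size meaning "to end of file", which this skips handling):

```ts
async function findMoov(url: string): Promise<{ offset: number; size: number } | null> {
  let offset = 0;
  while (true) {
    // Request only the 8-byte box header: 4-byte size + 4-byte type.
    const res = await fetch(url, { headers: { Range: `bytes=${offset}-${offset + 7}` } });
    if (!res.ok) return null; // e.g. the range is past the end of the file

    const head = new DataView(await res.arrayBuffer());
    if (head.byteLength < 8) return null;

    const size = head.getUint32(0); // box sizes are big-endian
    const type = String.fromCharCode(
      head.getUint8(4), head.getUint8(5), head.getUint8(6), head.getUint8(7),
    );
    if (type === "moov") return { offset, size };
    if (size < 8) return null; // 0 or 1 would need the special handling we skip here

    offset += size; // jump to the next top-level box
  }
}
```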

This example also shows why moving the moov information forward matters: it reduces the number of offset requests needed to locate the moov box.

MSE-FORMAT-ISOBMFF

So far we have covered MSE, HTTP Range requests and the MP4 container format. From the MP4 container format we learned that the data inside an MP4 file is stored in chunks, and the index information for those chunks lives in the moov box.

So it seems that it should be enough to load chunks of data according to the index in the moov box and then play them through the MediaSource, as in the following code from Google Developers – Media Source Extensions, whose most important call is appendBuffer:

The application code loads the segmented data and then calls appendBuffer to pass it to the audio/video element via the MediaSource.
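A condensed sketch in the spirit of that example (not the original code); the MIME string and the URL of the fragmented MP4 file are assumptions:

```ts
const videoEl = document.querySelector("video")!;
const mimeCodec = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"';
const mediaSource = new MediaSource();
videoEl.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener("sourceopen", async () => {
  const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
  // The whole (already fragmented) file is fetched in one go here for simplicity;
  // a real loader would fetch and append segment by segment.
  const data = await (await fetch("/media/fragmented.mp4")).arrayBuffer();

  sourceBuffer.addEventListener("updateend", () => mediaSource.endOfStream());
  sourceBuffer.appendBuffer(data); // hand the bytes to the player via MediaSource
});
```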

In reality things are a little more complicated: the MP4 container format is just one of many container formats, while MSE, as a general data interface, has to accept data from all of them, so it defines a common fragmented data format. This fragmented format is called mse-format-isobmff; it is based on, and modifies, the ISO Base Media File Format (ISOBMFF).

MP4 (MPEG-4 Part 14) vs. ISOBMFF

As the comparison shows, MP4 extends the ISOBMFF format with more kinds of information, while mse-format-isobmff introduces the concept of segments on top of ISOBMFF.

The mse-format-isobmff format is as follows:

The figure above shows the two main segment types in mse-format-isobmff: the initialization segment and the media segment. These segments are themselves made up of boxes.

Therefore, for files in plain MP4 format to be played through MSE, the format has to be converted from right to left as shown below. The overall process is:

The far right is the original video content, and the dashed box in the middle is code written by the developer, which has to download the original video data and reassemble it.

The conversion does not involve transcoding the video data, only converting between container formats, but even so the whole process is cumbersome, so it is easier to rely on an existing implementation.

mp4box.js is a feature-rich implementation in this area. Its documentation is not as rich as its feature set, so the test cases in the repository are a useful reference.

To make it easy to get a locally runnable demo, a demo project has been prepared. In it, data is loaded through a custom Downloader and handed to MP4Box for reassembly; the reassembled data is then added to the MediaSource and passed on to the player. To implement preloading, you only need to adjust the Downloader slightly according to your business requirements.

Wrapping up

This article started from the common parameters of video and ended with a VOD approach that needs no extra cooperation from the server, covering some of the technical points involved in building VOD features in a Web environment. Given how much there is in this area, plus the limits of my knowledge and time, this is all I can present for now. I hope it helps, and I look forward to your comments and suggestions.

References

  • What is video bitrate and why it matters?
  • What bitrate should I use when encoding my video
  • What are these 240p, 360p, 480p, 720p, 1080p units for videos
  • An Introduction to Frame Rates, Video Resolutions, and the Rolling Shutter Effect
  • MDN – Media Source Extensions API
  • MDN – MediaSource
  • Google Developers – Media Source Extensions
  • Apple – HTTP Live Streaming
  • Youtube – Building a Media Player
  • The structure of an MPEG-DASH MPD
  • What Is an M3U8 File
  • MP4 File Format
  • MPEG-4 Part 14

This article is published by the NetEase Cloud Music front-end team. Reproduction in any form without authorization is prohibited. We're always hiring, so if you're ready for a change and you like Cloud Music, join us!