With the rapid rise of live video platforms such as Bilibili, Douyin, Kuaishou, and Taobao Live, multimedia has grown into its own branch of front-end technology, and a new force, the Web multimedia team, has emerged inside the traditional front-end organizations of many companies. Judging by the team title alone, this is a group with rare skills at the intersection of front end and multimedia. Inside Alibaba there are many Web multimedia teams, and we compiled an internal getting-started guide to multimedia front-end technology. Today we are sharing that guide publicly, so that everyone can see the directions we have been working in over recent years and what has been practiced and landed. This article covers the following topics:

  1. What is a multimedia front end?

  2. W3C standard media technology

  3. Playback scenarios and solutions

  4. Consumption side: the business system in live video

  5. Production side: live streaming, video editing, and other tools

  6. Development and planning of the multimedia direction by the Alibaba front-end committee

What is a multimedia front end?

The multimedia front end uses professional front-end capabilities to solve technical and business problems in multimedia scenarios. At present, multimedia front-end engineers mostly come from two backgrounds: people who studied digital media and then moved into front-end development, and front-end engineers who later learned multimedia. It is a new role born from the collision of two mature bodies of knowledge. It requires front-end skills such as high-fidelity visual restoration, experience polishing, cross-end development, and engineering, as well as multimedia skills such as audio and video, streaming protocols, playback technology, and Web media technology. The talent gap in this area is currently very large.

W3C standard media technology

At the beginning of multimedia front-end development, the scenarios are usually simple: play a video or audio clip on a web page and implement basic playback controls (play, pause, progress bar, volume, mute, and so on). Before the HTML5 standard, plug-ins such as Adobe Flash or Microsoft Silverlight were required to play video in the browser. Because plug-ins are cumbersome to install and insecure, the W3C HTML5 standard defined a set of new media elements so that plug-ins are no longer needed. Here are the HTML elements most commonly used by media developers:

HTML elements

  • The <video> element embeds and plays video content in the page

  • The <audio> element embeds and plays audio content

  • The <source> tag is placed inside <video> or <audio> to specify one or more media sources

  • The <track> tag is placed inside <video> or <audio> to specify timed text tracks such as subtitles and captions

The media player functions browsers provide out of the box are fairly simple: a <video> or <audio> tag plus a src attribute is all it takes. But this lacks features a modern player should have, such as segmented loading, bitrate switching, partial loading, and memory management. For more customized requirements or richer multimedia applications, the media APIs are needed:
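
As a minimal illustration of that built-in capability, here is a small sketch wiring the basic controls to a <video> element (the source URL is a placeholder):

```typescript
// Minimal playback control with the built-in <video> element.
const video = document.createElement('video');
video.src = 'https://example.com/demo.mp4'; // placeholder URL
document.body.appendChild(video);

const toggle = () => (video.paused ? video.play() : video.pause());
const seekTo = (seconds: number) => { video.currentTime = seconds; };
const setVolume = (v: number) => { video.volume = Math.min(1, Math.max(0, v)); };
const mute = (on: boolean) => { video.muted = on; };

video.addEventListener('timeupdate', () => {
  // Drive a progress bar from currentTime / duration.
  const progress = video.duration ? video.currentTime / video.duration : 0;
  console.log(`progress: ${(progress * 100).toFixed(1)}%`);
});
```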

Media APIs

  • Media Source Extensions API

    MSE (Media Source Extensions) extends the browser's media playback capability: it lets you construct a media stream dynamically in JavaScript as a MediaSource object and feed it to a <video> element for playback (a minimal sketch follows this list).

  • Web Audio API

    The Web Audio API processes and synthesizes audio in Web applications, allowing developers to synthesize sound, add audio effects, visualize audio, and more. With the Web Audio API we can build nearly professional Web audio tools (such as a metronome or a tuner).

  • Media Capture and Streams API

    The Media Stream API allows developers to capture and record audio and video from a local camera and microphone, capture the computer screen, or read local video for compositing. It is commonly used for Web cameras, photo capture, screen recording, video calls, and so on. The 1688 Web video editor described in the video editing section below also uses this technology. MediaStream is the middle layer between the WebRTC API and the underlying physical streams: WebRTC processes audio and video through its Voice/Video Engine and then exposes them to the upper layer through the MediaStream API.

  • WebRTC

    WebRTC is a set of W3C JavaScript APIs that let browsers hold real-time audio and video conversations. It covers audio and video capture, codecs, network transport, and rendering, so that any two users on the Internet can exchange real-time audio, video, and arbitrary data without relaying through a server. Around 2010, real-time communication was only available through proprietary software, plug-ins, or Adobe Flash; in 2013, Chrome and Firefox completed the first cross-browser video call, the prelude to plugin-free audio and video calls between browsers. Today WebRTC is used in many scenarios, such as audio and video communication, live streaming, cloud editing, and cloud gaming.
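
To make the MSE flow mentioned above concrete, here is a hedged sketch that constructs a MediaSource, attaches it to a <video> element, and appends fMP4 segments fetched over HTTP; the segment URLs and codec string are placeholders, not taken from any team's actual player:

```typescript
// Hedged MSE sketch: build a MediaSource in JS and feed it to <video>.
const video = document.querySelector('video')!;
const mimeCodec = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"';

if ('MediaSource' in window && MediaSource.isTypeSupported(mimeCodec)) {
  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);

  mediaSource.addEventListener('sourceopen', async () => {
    const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
    const segments = ['init.mp4', 'seg-1.m4s', 'seg-2.m4s']; // placeholder URLs
    for (const url of segments) {
      const buf = await (await fetch(url)).arrayBuffer();
      await appendBuffer(sourceBuffer, buf);
    }
    mediaSource.endOfStream();
  });
}

// appendBuffer() is asynchronous; wait for `updateend` before the next append.
function appendBuffer(sb: SourceBuffer, data: ArrayBuffer): Promise<void> {
  return new Promise((resolve) => {
    sb.addEventListener('updateend', () => resolve(), { once: true });
    sb.appendBuffer(data);
  });
}
```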

In addition to these media APIs, there are some technologies commonly used alongside them:

Technologies used with the media APIs

  • Canvas API

    The Canvas API lets you draw, manipulate, and transform image content on a <canvas>. It combines with media in several ways, for example drawing video playback or camera capture onto a <canvas> to capture or manipulate video frames (a small sketch follows this list).

  • WebGL

    WebGL provides OpenGL ES compatible APIs on top of the existing Canvas API, making it possible to create powerful 3D graphics on the Web. It is typically used on a canvas to add 3D imagery to media content.

  • WebVR

    The Web Virtual Reality API supports Virtual Reality (VR) devices, enabling developers to translate a user’s location and movement into movement in a 3D scene, which is then displayed on the device. WebVR is expected to be gradually replaced by WebXR, which covers a wider range of use cases.

  • WebXR

    WebXR, which is intended to eventually replace WebVR, supports the creation of both virtual reality (VR) and augmented reality (AR) content. Mixed reality content can be displayed on the device's screen or in a headset or glasses.
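
As a small illustration of the Media Stream + Canvas combination described above, here is a sketch that previews the camera and grabs a single frame; the flow is illustrative only:

```typescript
// Capture the camera with getUserMedia, draw one frame to a <canvas>,
// and return it as a PNG data URL.
async function captureFrame(): Promise<string> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const video = document.createElement('video');
  video.srcObject = stream;
  await video.play();

  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d')!.drawImage(video, 0, 0);

  // Stop the camera once the frame has been captured.
  stream.getTracks().forEach((t) => t.stop());
  return canvas.toDataURL('image/png');
}
```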

Playback scenarios and solutions

Media source files are large. To make them easier to transmit and store, the original video is compressed by encoding, and the compressed video, audio, and subtitles are then combined into a container by muxing. This is the process of encoding and container packaging (think of compressing biscuits and then sealing them in a bag: there are many different compression processes and packaging specifications).

On the playback side, the corresponding demuxing and decoding are needed (first open the bag to take out the biscuits, which can then be eaten directly, crushed, or soaked in milk). The browser's built-in <video> and <audio> elements can only demux and decode a limited set of container and encoding formats natively.
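
Before choosing a playback path, a player typically probes what the built-in element can handle natively versus what must go through MSE or a WASM decoder. A minimal sketch of such a probe:

```typescript
// Probe container + codec support before picking a playback path.
const video = document.createElement('video');

// canPlayType() answers "probably" / "maybe" / "" for a container + codec pair.
console.log(video.canPlayType('video/mp4; codecs="avc1.42E01E"'));     // H.264 in MP4
console.log(video.canPlayType('application/vnd.apple.mpegurl'));       // native HLS (e.g. Safari)
console.log(video.canPlayType('video/mp4; codecs="hvc1.1.6.L93.B0"')); // H.265/HEVC

// For the MSE path, isTypeSupported() tells whether a SourceBuffer can accept it.
console.log(MediaSource.isTypeSupported('video/mp4; codecs="avc1.42E01E, mp4a.40.2"'));
```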

Beyond container and encoding formats, there are also streaming protocols, rendering containers, multi-instance playback, and other problems the multimedia front end has to solve. Let's look at them one by one:

Multiple protocols and container formats

With the development of streaming services, many new transport protocols have appeared. Media content is wrapped in one more layer of transport protocol (taking HLS as an example, an m3u8 index file is added and the source content is sliced and packaged into the TS container format), so the browser's <video> element cannot play it directly and the stream has to be converted first.

flv.js from Bilibili and hls.js from the community are both player libraries that use MSE to solve the multi-protocol, multi-container problem.

1. flv.js

flv.js is Bilibili's open-source HTML5 FLV player. Built around the HTTP-FLV streaming protocol, it remuxes FLV in pure JavaScript so that FLV content can be played on the Web.
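
A minimal flv.js usage sketch (the stream URL is a placeholder); the library remuxes FLV and feeds it to the <video> element through MSE:

```typescript
import flvjs from 'flv.js';

const videoElement = document.getElementById('player') as HTMLVideoElement;

if (flvjs.isSupported()) {
  const player = flvjs.createPlayer({
    type: 'flv',
    isLive: true,                               // HTTP-FLV live stream
    url: 'https://example.com/live/stream.flv', // placeholder address
  });
  player.attachMediaElement(videoElement);
  player.load();
  player.play();
}
```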

However, flv.js cannot play every video; it has certain requirements on the browser environment. Its limitations are:

  • Video must be AVC (H.264) encoding, and audio must be AAC or MP3 encoding

  • The browser environment must support MSE (support list: caniuse.com/#feat=media…). iOS requires version 13 or later and MSE is completely unavailable on mobile iOS; Android requires 4.4 or later

  • The browser must support the <video> element, and at least one of the Fetch API, XHR, or WebSocket

2. hls.js

hls.js is a JavaScript library that implements HTTP Live Streaming (HLS) playback on the Web using Media Source Extensions.

Because the HLS protocol was proposed by Apple and is widely supported on mobile devices, it is broadly used in live streaming. On PC, hls.js only requires MSE support; on mobile, the native <video> tag can simply have its src set to the m3u8 URL. hls.js first requests the m3u8 file and reads the segment list, encoding format, video duration, and so on. It then requests the TS segments in sequence and appends the binary buffers through MSE to assemble a playable media resource.
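
A minimal hls.js usage sketch reflecting the description above: use MSE where available and fall back to native HLS on mobile (the URL is a placeholder):

```typescript
import Hls from 'hls.js';

const video = document.getElementById('player') as HTMLVideoElement;
const src = 'https://example.com/live/stream.m3u8'; // placeholder address

if (Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(src);    // request the m3u8 manifest
  hls.attachMedia(video); // append segments through MSE
  hls.on(Hls.Events.MANIFEST_PARSED, () => video.play());
} else if (video.canPlayType('application/vnd.apple.mpegurl')) {
  video.src = src;        // e.g. iOS Safari plays HLS natively
}
```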

Several teams inside Alibaba have similar player products, such as Aliyun's Aliplayer, Tao's VideoX, and Youku's KPlayer. Their implementation ideas are basically the same.

3. Aliyun Aliplayer

After several iterations, Aliplayer's architecture has become more reasonable, giving both the team and its users more room to extend it and the ability to add playback types and features independently. For example, to support FLV playback in H5, only the FLV Extend module needs to be added, without touching the code of other modules; the same goes for HLS Extend and others. This keeps other features from being disturbed and preserves the stability of each release.

4. Tao VideoX

Videox’s underlying playback layer has gone through several changes, starting with a simple

5. Youku KPlayer

KPlayer’s current solution is fragmented and includes several libraries and components, mainly the KMux transencapsulation library, the KDRM WebDRM plug-in, the core abstraction of the KCTRL player control behavior, the core abstraction of the MediaEngine decode & Render & play, and the KUI embedded UI framework. KPlayer player core design is as follows:

Multiple encoding formats

The previous section addressed protocols and container formats that the browser's <video> element does not support (being unable to unwrap the biscuits); whether the browser can decode the encoding format at all is a separate problem.

The newer video coding standards H.265 (HEVC), AV1, and so on achieve higher compression rates than the traditional H.264, but the browser's <video> element generally cannot decode them natively.

For H.265 playback on the Web, several teams in Alibaba, including Youku, Ali Cloud, the Tao department, and ICBU, have made attempts with essentially the same idea: implement a JS player with FFmpeg + WebAssembly. Streaming and demuxing are handled in JS, FFmpeg's H.265 decoding capability is compiled into a WASM module for JS to call, FFmpeg decodes the video into YUV frame data which is drawn through WebGL, and the audio data is played directly through the Web Audio API because the browser already supports mainstream audio formats such as AAC and MP3. The design idea is as follows:
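
A heavily hedged sketch of this pipeline; loadDecoder, demuxPackets, decodeToYUV, and renderYUVFrame are hypothetical names standing in for the FFmpeg WASM module and a WebGL YUV renderer, not real APIs of any of the players mentioned:

```typescript
// Hypothetical interfaces standing in for the FFmpeg WASM decoder and a
// WebGL YUV->RGB renderer -- assumptions for illustration, not real libraries.
interface YUVFrame { y: Uint8Array; u: Uint8Array; v: Uint8Array; width: number; height: number; }
declare function loadDecoder(wasmUrl: string): Promise<{ decodeToYUV(pkt: Uint8Array): YUVFrame | null }>;
declare function demuxPackets(data: Uint8Array): Iterable<Uint8Array>;
declare function renderYUVFrame(gl: WebGLRenderingContext, frame: YUVFrame): void;

async function playHevc(url: string, canvas: HTMLCanvasElement) {
  const decoder = await loadDecoder('hevc-decoder.wasm'); // hypothetical WASM module
  const gl = canvas.getContext('webgl')!;
  const data = new Uint8Array(await (await fetch(url)).arrayBuffer());

  // Demux in JS, feed compressed packets to the WASM decoder,
  // and draw each returned YUV frame with a WebGL shader.
  for (const packet of demuxPackets(data)) {
    const frame = decoder.decodeToYUV(packet);
    if (frame) renderYUVFrame(gl, frame);
  }
}
```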

Multiple rendering containers

The scenarios described above mainly target PC Web browsers, but more of our business runs on mobile. Consumer scenarios are very demanding about playback experience and performance, which introduces cross-end support for players, especially across rendering containers (WebView / Weex / mini programs). On the client side the player is mainly the Native player, which the front end wraps as a Weex or same-layer-rendering component that runs inside each rendering container.

Using a native <video> inside the app can cause compatibility and performance problems. The player occupies a lot of memory, and undisciplined use by business code can bring stability risks and even crash the app. Taking H5 inside the app as an example, the same-layer rendering scheme was built jointly with the Rax team, the client infrastructure team, and the client player team:

  • Native player: provides the basic player capability; the player kernel fetches the stream address, demuxes, and decodes, then outputs to the Native component layer above. The Native player provides both on-demand and live capability

  • Native component: encapsulates the underlying player instance. It is mainly responsible for connecting to the communication layer and controlling rendering

  • Communication layer: manages rendering channels through WindMix, then binds events, passes properties, and exposes API capabilities through WindVane, connecting to the front-end player

  • Front-end player core: the output of the front-end player; it translates the events, APIs, and properties provided by the lower layers into W3C player standards (see the sketch after this list) and is responsible for player-stability-related capabilities

  • Business player wrapper: encapsulates the playback control protocol on top of the front-end player core and manages the playback of each player through the event center; at the same time, from the business perspective, it monitors network and device performance to decide whether to degrade
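
A rough sketch of the "front-end player core" idea from the list above: translating events arriving over the Native bridge into W3C-style media events so business code can treat the same-layer player like an ordinary media element. The bridge interface and event names are assumptions for illustration, not the actual WindVane protocol:

```typescript
// Hypothetical Native bridge interface -- an assumption for illustration.
interface NativeBridge {
  on(event: string, handler: (payload: unknown) => void): void;
  call(api: string, params?: Record<string, unknown>): void;
}

class FrontEndPlayerCore extends EventTarget {
  constructor(private bridge: NativeBridge) {
    super();
    // Map native player notifications onto W3C media event names.
    const mapping: Record<string, string> = {
      onPrepared: 'canplay',
      onStart: 'playing',
      onPause: 'pause',
      onCompletion: 'ended',
      onError: 'error',
    };
    for (const [native, w3c] of Object.entries(mapping)) {
      bridge.on(native, () => this.dispatchEvent(new Event(w3c)));
    }
  }
  play() { this.bridge.call('play'); }
  pause() { this.bridge.call('pause'); }
  seek(seconds: number) { this.bridge.call('seek', { seconds }); }
}
```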

Multi-instance control

The scenarios mentioned so far involve a single video. In real business there are pages with multiple players, such as feed lists and venue pages, which introduces the problem of multi-instance player control: you must ensure that only one player on the page is playing (the playback "sweet spot"), manage memory, and so on.

This requires the player to have strong playback control, that is, play and scheduling capabilities:

  • Event center driven

  • A to-be-played queue is maintained inside the player, implemented with rax-view's onAppear and onDisappear. When onAppear fires, the unique ID of the corresponding player is pushed into the queue; when onDisappear fires, the player is removed from the queue (see the sketch after this list)

  • Playback in the queue is triggered in several ways, such as playing a specified ID or playing on onAppear; all of them are controlled by events, which the event center distributes to the players, telling each one whether it may play

  • Keeping the players in the PCOM library ensures that every player can be reached by the event center, whether in venue scenarios or in source-code development scenarios

  • Playback scheduling

  • A player instance occupies roughly 20-40 MB of memory. For an H5 page that is too much and can lead to serious problems such as crashes, so during event-driven playback the event center ensures that one and only one player instance is playing
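
A hedged sketch of the event-center idea described in this list: players register on onAppear, unregister on onDisappear, and the center guarantees at most one instance plays at a time. All names are illustrative, not the real implementation:

```typescript
type PlayCommand = (playerId: string, shouldPlay: boolean) => void;

class PlayerEventCenter {
  private queue: string[] = [];          // player ids currently in the viewport
  private playing: string | null = null;
  constructor(private notify: PlayCommand) {}

  onAppear(id: string) {
    if (!this.queue.includes(id)) this.queue.push(id);
    this.schedule();
  }
  onDisappear(id: string) {
    this.queue = this.queue.filter((x) => x !== id);
    if (this.playing === id) this.playing = null;
    this.schedule();
  }
  playById(id: string) {                 // explicit "play this one" command
    if (this.queue.includes(id)) this.start(id);
  }
  private schedule() {
    // Keep memory in check: only the head of the queue is allowed to play.
    const next = this.queue[0] ?? null;
    if (next && next !== this.playing) this.start(next);
  }
  private start(id: string) {
    if (this.playing) this.notify(this.playing, false); // stop the old instance
    this.playing = id;
    this.notify(id, true);
  }
}
```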

Consumption side: the business system in live video

Everything above is about plain playback, but the actual business scene is a live room or a full-screen short-video page. Between the player and the live room / short video there is another important part, the user interaction layer, which brings us to the business system in live video.

Taking live streaming as an example, the audience has two demands: watching the live stream and participating in interactions. The general design of a live room is as follows:

There are generally three live room architectures: the Web live room, the Hybrid live room, and the mini program live room. For example, DingTalk open classes use the Web live room architecture, Ant and Tmall Genie use the mini program live room architecture, and the vast majority of other businesses use the Hybrid live room architecture.

Web live room architecture

A pure Web live room is an H5 live room that runs mainly in the Web browser, including the H5 player, the event channel, UI components, and so on. However, browser compatibility on mobile is poor and playback latency on mobile is high, so business scenarios with high requirements on experience and performance generally choose the Hybrid live room architecture.

The Hybrid live room uses the native player, so compatibility is better and the experience is smoother; meanwhile the event channel is a Native channel, which is more reliable than the WebSocket used in the Web live room.

Hybrid live room architecture

In the Hybrid live room the host container is Native and the Native player is used. For interactive gameplay, an interaction layer (a WebView or Weex container) is overlaid on top of the player and can communicate with it.

The interaction layer in the Hybrid architecture has itself gone through several stages of evolution: from an independent layer for each gameplay component at the beginning, to packaging all gameplay components into a single layer, and then to defining a live room container that loads components dynamically:

The live room container has the following characteristics:

  • Unified component message protocol, including component package name, component behavior, and user-defined business fields

  • Dynamic loading: the live room is different from other detail pages; which interactions are delivered depends on the anchor's operations and on when the user enters the room, and each user may join different interactions, so loading interactive components dynamically is critical to first-screen performance (see the sketch after this list)

  • Caching and dependency deduplication: the anchor may push the same interaction multiple times, and the base libraries (rax-xxx, universal-xxx) of each interaction overlap heavily, so a well-designed caching and dependency-deduplication mechanism is also important for performance

  • HOC higher-order components: business development inside the live room differs from an ordinary standalone source-code page. Capabilities such as fetching live room data, receiving messages and listening to events, full-screen state, jumping to the floating small window, and watch-duration tracking all depend on the environment or on client APIs. Business components need these basic capabilities, so they are enhanced through HOCs
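
A hedged sketch of the dynamic loading plus caching idea from the list above; the message shape and the loadBundle() loader are assumptions for illustration:

```typescript
// Hypothetical component message shape based on the "unified message protocol"
// described above -- not the actual container protocol.
interface ComponentMessage {
  packageName: string;               // which interactive component to load
  action: 'show' | 'hide';
  bizData?: Record<string, unknown>; // business-defined fields
}

declare function loadBundle(packageName: string): Promise<unknown>; // hypothetical loader

const componentCache = new Map<string, Promise<unknown>>();

async function onContainerMessage(msg: ComponentMessage) {
  if (msg.action !== 'show') return;
  // Cache + de-duplicate: the same component pushed repeatedly is fetched once.
  if (!componentCache.has(msg.packageName)) {
    componentCache.set(msg.packageName, loadBundle(msg.packageName));
  }
  const component = await componentCache.get(msg.packageName);
  // ...mount the component into the interaction layer with msg.bizData
  console.log('mounted', msg.packageName, component);
}
```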

Developing and debugging components under the Hybrid architecture requires a complete live room environment and live room container; without a supporting engineering system, component development is very inefficient. So this architecture also needs a supporting engineering system, which mainly includes the following parts:

  • Def suite: scaffolding for live room component development that enhances debugging, including live room simulation, a debugging proxy, hot update, compilation and checks, and other functions

  • Live room Debug tool: a Debug component built on the live room container, providing log debugging, container API calls, data mocking, message mocking, and other functions

  • VS Code plug-in: the PC counterpart of the live room Debug tool; combined with the simulator, components can be developed and debugged independently on the PC

By building the live room container and the engineering system, the multimedia front-end team can build the business system in live video efficiently and stably.

Of course, because it combines Native and the front end, the Hybrid architecture brings architectural complexity, especially in live room state management and in managing player and WebView instances when sliding between live streams and videos. This can lead to state bugs, such as the "magic three strings" bug that once appeared on a live platform: in Viya's live room, Li Jiaqi's stream was playing while a third anchor's merchandise was on display:

Based on the Hybrid architecture we can consider a further upgrade that integrates the player and the interaction layer more tightly into a "super video", that is, an interactive video solution, and perhaps even define a new <video> standard.

Mini program live room architecture

With the emergence of cross-end business, especially cross-app scenarios, the Hybrid live room architecture is too heavyweight and has no natural cross-app advantage (the cost of client integration and SDK maintenance is extremely high). Against this background the mini program live room architecture appeared. Its design is divided into the following layers:

  • Live plugin: in addition to the basic player, infinite list, and official components, it also encapsulates basic capabilities such as component layout rules, the event center, and the container API module

  • Cross-end communication layer: since the JS contexts of the plugin and the host mini program instance are completely isolated, @Alipay/Armer-x uses the AppX feature that declared functions are not serialized to bridge communication between the plugin and the host mini program

  • Mini program components: components customized by second and third parties, which can be built with the help of the live room container capabilities and must follow specific access rules

  • Mini program instance: with the second-party components and the live plugin in place, they can be integrated, packaged, and built to generate a new mini program instance. Instances can be generated through the IDE or through a build platform; for "Morpho", the current strategy is to build instances through Morpho on the Alipay side and to upload them through the IDE on the Baichuan side, publishing on the Taobao development platform

This mini program live room solution, jointly built by Ant and Taobao Live, has been widely used inside many apps (including external media apps).

Production side: live streaming, video editing, and other tools

What we covered above is mostly the consumption side. As the business grows, multimedia teams have begun to pay more attention to capability building on the production side; after all, producing content is the blood supply of the whole content ecosystem. As developers we need to provide efficient, usable production tools for creators such as merchants, anchors, and influencers. On the production side there are two main directions: live stream push tools for live streaming, and editing tools for video.

Live stream push

There are two main approaches to live stream push on the front end:

  • Desktop client: the Electron + OBS approach. OBS is a free and open-source package for recording and webcasting, written in C and C++, providing real-time source and device capture, scene composition, encoding, recording, broadcasting, and so on. By design OBS is not responsible for business functions and acts purely as a push-stream SDK. The OBS interfaces are wrapped as V8 interfaces through IPC, then packaged into Node modules with CMake or GYP for the front end to call. The playback area (size, padding, and so on), element interactions (rotation, scaling, and so on), push parameters, and more are all wrapped as interfaces exposed to the front end. The whole app is a BrowserView in Electron, with React for the view and MobX for data management; cross-page communication and the corresponding data flow, caching, and window management are all handled by the Electron main process. The front end hands the WebView window handle to the OBS process to keep the push preview rendering smooth

    Taobao Live anchor workbench and 1688 Live partner both adopt this scheme:

  • Web browser: the WebRTC approach, which has lower definition and tighter performance constraints than the desktop client. For example, because of its business characteristics the financial media team could not use the third-party open-source OBS, so it chose this approach and built a PC Web capture, push, and stream-mixing SDK that supports pushing from the camera, the computer screen, and a placeholder (slate) source (a minimal capture sketch follows)
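
A minimal sketch of the browser push path: capture the camera, add the tracks to an RTCPeerConnection, and exchange SDP with the streaming backend. The signaling endpoint is a placeholder, not the actual SDK:

```typescript
async function startWebPush() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  const pc = new RTCPeerConnection();
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Exchange SDP with the media server over the business's own signaling channel.
  const resp = await fetch('/api/push/offer', {   // placeholder endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ sdp: offer.sdp }),
  });
  const { sdp } = await resp.json();
  await pc.setRemoteDescription({ type: 'answer', sdp });
}
```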

Video editing

Compared with live production, video production has broader coverage, and more teams are building in this direction, including Ali Mom, the Tao department, Hema, 1688, Ali Xiaomei, Ant, and others. Producing an article requires a rich-text editor; producing a video requires a video editing tool, but the complexity of the two is completely different. Whether we build intelligent production or efficient, easy-to-use deep editing tools, the hope is to lower production costs so that editing a video becomes as convenient as editing an article.

Traditional desktop editing software provides a GUI that gives the user feedback on editing effects and, when editing is done, generates an edit description; graphics, audio processing, and video encoding modules then generate the final video file. Looking at how most open-source editing software is implemented, the first half, the GUI, is built on the operating system's graphical interface or cross-platform toolkits such as QT, SDL, or SWT; we call this part the "GUI front end". The latter half, which involves graphics, audio, and encoding, is mostly supported by a media editing kernel, such as the well-known MLT and GStreamer on Linux; we call this part the "editing kernel".

Based on this GUI front end + editing kernel idea, the multimedia front end usually implements editing tools in the following ways:

  • Desktop editing: the Electron approach; the GUI front end runs in the WebView and the editing kernel runs in Native, packaged as a Node module for the front end to call (similar to the Electron live push scheme above). The Tao department's Marvel editing tool for the pro-shooting business uses this approach

  • Pure Web editing: the browser approach; both the GUI front end and the editing kernel run in the browser. The editing kernel provides editing, rendering, compositing, and other capabilities through the Media Stream API or FFmpeg + WebAssembly. Examples include the 1688 video editor (Magic Sparrow) and Ali Xiaomi's video creation tool (a small sketch of the Media Stream path follows this list). Taking Xiaomi's scheme as an example:

  • Web cloud editing: the browser + server approach; the GUI front end runs in the browser and the editing kernel runs on the server. Because the GUI front end and the editing kernel sit on different sides, you still have to decide how the GUI presents the editing kernel's output. There are usually two schemes: the browser actively pulls the rendered result, or the result is streamed to the user interface through a WebRTC live stream; the latter has lower latency and a better experience. Examples include Ali Mom's Aliwood, Hema's video editor, the Tao department's PC Video, and Ant's FMS short video production tool. Taking Aliwood's scheme as an example:

  • For now, Web cloud editing in practice uses the "browser actively pulls the rendered result" scheme; the author has not yet seen a WebRTC scheme in use (similar to cloud gaming), and the Tao department is currently building a WebRTC cloud editing scheme.
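
As a small illustration of the pure-Web (Media Stream API) editing path mentioned in the list, here is a hedged sketch that composites a video frame and a text overlay onto a <canvas> and records the canvas stream with MediaRecorder; it illustrates the browser APIs only, not any team's actual editing kernel:

```typescript
function recordComposite(source: HTMLVideoElement, caption: string): Promise<Blob> {
  const canvas = document.createElement('canvas');
  canvas.width = 1280;
  canvas.height = 720;
  const ctx = canvas.getContext('2d')!;

  const draw = () => {
    ctx.drawImage(source, 0, 0, canvas.width, canvas.height); // video layer
    ctx.font = '48px sans-serif';
    ctx.fillStyle = 'white';
    ctx.fillText(caption, 40, canvas.height - 60);            // text overlay
    requestAnimationFrame(draw);
  };
  draw();

  // Record the composited canvas at 30 fps into a WebM blob.
  const recorder = new MediaRecorder(canvas.captureStream(30), { mimeType: 'video/webm' });
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();

  return new Promise((resolve) => {
    source.onended = () => {
      recorder.onstop = () => resolve(new Blob(chunks, { type: 'video/webm' }));
      recorder.stop();
    };
  });
}
```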

Gameplay production

In addition to live stream push and video editing, there is also gameplay production combined with streaming-media recognition: face detection, gesture detection, object recognition, and so on, overlaid with filters, stickers, beautification, text, and other effects to achieve intelligent media gameplay for live streaming and video. Outside the company there are Facebook's AR Studio, Snapchat's Lens Studio, and TikTok's Effect Creator; inside there is Tao's MAI (MediaAI Studio), which is widely used in Taobao live streaming, content browsing, and other businesses.

Development and planning of the multimedia direction by the Alibaba front-end committee

With the growth of multimedia businesses in recent years, Alibaba Group now has a large number of multimedia front-end teams, including Lazada, CBU 1688, ICBU, Ali Mom, Ali Cloud, Local Life, Ali Learning, Cainiao, financial media, Ant, Youku, Laifeng, Taobao Live, Dingdou, and so on. With the widespread use of audio and video and the upgrading of hardware, the front-end committee this year chose the multimedia field as one of the directions for future layout and breakthrough, and began to lay out the multimedia field under Web standards.

We surveyed the multimedia technology teams of every BU in the Group: their main business scenarios, their main problems, the capabilities they can contribute to the economy, their expectations for the front-end committee's technical plan, and their demands in the multimedia direction. Based on this baseline information, we decided to prioritize building in the directions of Web video editing and playback. Our goal is to go deep into the multimedia field, align the Group's Web multimedia technology, build leading Web video editing and playback solutions, and lay out at least one forward-looking multimedia technology direction (such as WebXR).

If you have any questions, feel free to contact me. WeChat ID: JovenPan.