Author: Zhang Lingzhu (Yisheng)
This is the second article in Youku's playback "black tech" series. The first article is available here: Youku playback black tech | Free-view technology: experience optimization in practice. Follow along so you don't get lost~
Unlike VOD, live streaming attracts users with its real-time and interactive nature: while watching, users can like, comment, send gifts, and so on. This kind of interaction centers on users' social behavior. Content-based interaction has also entered the public eye with the emergence of "interactive episodes" in VOD, where users can choose what a character does at key points in the plot and branch into different storylines. Content-based interaction is likewise being explored and applied in live streaming. One example is simultaneous live broadcasting from multiple camera angles, letting users switch to and watch a specific angle of the live content. Especially in medium and large live events (such as Street Dance of China or the UEFA Champions League), viewers have clear focal points (their favorite performers or players), so offering multiple viewing angles, or a dedicated focus angle, to switch between greatly improves the user experience.
Based on this, after research and exploration, we implemented a low-latency, high-definition multi-view live streaming capability on top of WebRTC, which was officially used in the annual finals of Street Dance of China. The demo video is as follows:
The video can be viewed here: Youku playback black tech | A technical analysis of WebRTC-based multi-view live streaming in the "cloud"
Scheme selection
Whether a solution can deliver low-latency switching, high definition, and full coverage is the key criterion for deciding which technical approach to adopt. Common application-layer protocols in the industry include:
- HLS (including LHLS)
- RTMP
- RTP (strictly speaking, RTP sits between the application layer and the transport layer)
Among them, RTMP, given its distinct pros and cons (low latency, but easily blocked by firewalls), is used on the stream-pushing side of live broadcast and is not covered further in this article; below we mainly analyze the stream-pulling (playback) process. Based on the number of streams pulled on the client, the implementation schemes can be divided into:
- Single-stream pseudo multi-view live broadcast: based on the HLS protocol, only one stream is pulled on the client at a time; switching views requires re-requesting a new stream URL and restarting playback.
- Multi-stream multi-view live broadcast: based on the HLS protocol, multiple streams are pulled, decoded, and rendered on the client simultaneously; switching views does not require restarting playback from a new URL.
- Single-stream multi-view live broadcast: based on the RTP protocol, only one stream is pulled on the client at a time; switching views does not require re-requesting a stream URL to restart playback.
Side-by-side comparison
The following table compares the three schemes:

| Scheme | Protocol | Simultaneous preview | Seamless switching | Bit rate | Client performance load | Incremental cost |
| --- | --- | --- | --- | --- | --- | --- |
| Single-stream pseudo multi-view live broadcast | HLS | No | No | Normal | Normal | None |
| Multi-stream multi-view live broadcast | HLS | Yes | Yes | High | High | CDN |
| Single-stream multi-view live broadcast | RTP | Yes | Yes | Normal | Normal | Edge service and traffic bandwidth |
Comparing the options, we finally decided to adopt RTP-based single-stream multi-view live broadcast, i.e., the edge scheme.
Overview of WebRTC
WebRTC, short for Web Real-Time Communication, is an API that enables web browsers to hold real-time voice or video conversations. It was open-sourced on June 1, 2011 and, with support from Google, Mozilla, and Opera, was incorporated into the W3C's recommendation track. By default WebRTC transmits audio and video data over UDP (more precisely, RTP/RTCP), although TCP can also be used. It is currently used mainly for video conferencing and co-streaming (mic-connecting).
Internal structure of WebRTC
Voice Engine: responsible for audio capture and transmission, and provides functions such as noise reduction and echo cancellation. Video Engine: responsible for network jitter optimization and codec optimization for transmission over the Internet. On top of the audio and video engines sits a set of C++ APIs.
WebRTC protocol stack
ICE, STUN, and TURN are used for NAT traversal, solving the problem of obtaining and binding the externally mapped address. DTLS (the UDP variant of TLS) is used to encrypt the transmitted content. RTP/SRTP carries audio and video data with strict real-time requirements, while RTCP/SRTCP controls RTP transmission quality. SCTP is used to transfer custom application data.
System design
The overall multi-view live streaming link involves stream production, the domain name scheduling service, the edge service, the confluence (stream merging) service, and the playback control service, among others. The overall architecture is as follows:
- Stream production: the streams captured by on-site cameras are uploaded to the live broadcast center, which encodes and optimizes them.
- Confluence service: after aligning audio and video by timestamp, the multiple input streams are mixed according to templates and output as multiple merged streams.
- Edge service: pre-caches the confluence service's output streams, encodes the output stream, and transmits it to the client over an RTP connection.
- Domain name scheduling service: configures the IP addresses and ports used for communication between the client and the edge service.
- Playback control service: provides playback control capabilities such as configuring confluence service request parameters and the multi-view service capability.
Detailed design
To reuse the main playback control link as much as possible and minimize intrusion into it, a multi-view player was added on the client side. The multi-view player keeps the same call interface as the other players, and the playback control layer decides whether to create it based on the business type. A multi-view plug-in was added at the business layer, decoupled from other business logic, so it can be easily mounted and removed.
The client-side architecture is designed as follows:
The core processes
When a user enters multi-view mode, the main startup flow is as follows:
Three display modes are currently supported: Mix mode, Cover mode, and Source mode. The following figure shows the specific modes and the switching process between them.
Edge-side instructions
The client and the edge node mainly exchange control instructions, with data transmitted over the RTP protocol. The generic header format is shown below; PT=96 indicates H.264 media data.
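For reference, the RTP fixed header defined in RFC 3550 is 12 bytes, with the payload type in the lower 7 bits of the second byte. The following is a minimal C++ parsing sketch; the PT=96 mapping to H.264 is specific to this scheme, the rest is standard RTP:

```cpp
#include <cstddef>
#include <cstdint>

// Minimal parser for the 12-byte RTP fixed header (RFC 3550).
struct RtpHeader {
    uint8_t  version;          // always 2
    bool     marker;
    uint8_t  payload_type;     // 96 => H.264 media data in this scheme
    uint16_t sequence_number;
    uint32_t timestamp;        // 90 kHz clock for video
    uint32_t ssrc;
};

bool ParseRtpHeader(const uint8_t* buf, size_t len, RtpHeader* out) {
    if (len < 12) return false;  // fixed header is 12 bytes (before any CSRCs)
    out->version         = buf[0] >> 6;
    out->marker          = (buf[1] & 0x80) != 0;
    out->payload_type    = buf[1] & 0x7F;
    out->sequence_number = static_cast<uint16_t>((buf[2] << 8) | buf[3]);
    out->timestamp       = (uint32_t(buf[4]) << 24) | (uint32_t(buf[5]) << 16) |
                           (uint32_t(buf[6]) << 8)  |  uint32_t(buf[7]);
    out->ssrc            = (uint32_t(buf[8]) << 24) | (uint32_t(buf[9]) << 16) |
                           (uint32_t(buf[10]) << 8) |  uint32_t(buf[11]);
    return out->version == 2;
}
```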
- Connect instruction: blocking. The client establishes an RTP connection with the edge node and must wait for the edge node's response; if no response arrives within a certain period, the connection is considered to have failed.
- Disconnect instruction: non-blocking. The client disconnects from the edge node without waiting for a reply.
- Play instruction: non-blocking. The client sends a play instruction carrying the stream ID and the OSS stream configuration so that the edge node can locate the correct stream.
- Stream-switch instruction: non-blocking. The client sends a view-switch instruction to the edge node; to stay synchronized with the original view, the frame timestamp of the original view must be passed to the edge service (see the sketch after this list).
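The concrete wire format of these instructions is internal to the service; purely as an illustration, the instruction set could be modeled as follows (all names and fields here are hypothetical, not the actual protocol):

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical model of the client-to-edge control instructions described above.
enum class EdgeCommand : uint8_t {
    kConnect = 1,     // blocking: wait for the edge node's reply (with timeout)
    kDisconnect = 2,  // non-blocking: fire and forget
    kPlay = 3,        // non-blocking: carries stream ID + OSS stream config
    kSwitchStream = 4 // non-blocking: carries the current frame timestamp
};

struct EdgeInstruction {
    EdgeCommand cmd;
    std::string stream_id;                 // which view to play / switch to
    std::string oss_stream_config;         // lets the edge node locate the stream
    std::optional<uint32_t> last_frame_ts; // RTP timestamp of the old view,
                                           // used to keep the new view in sync
    uint32_t timeout_ms = 0;               // > 0 only for blocking commands
};

// Example: a switch request that stays aligned with the view being left.
EdgeInstruction MakeSwitchRequest(const std::string& new_stream_id,
                                  uint32_t current_frame_ts) {
    return {EdgeCommand::kSwitchStream, new_stream_id, "", current_frame_ts, 0};
}
```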
Putting the solution into production
Playback capability adjustments
- Audio sampling rate adjustment
WebRTC does not support a 48 kHz audio sampling rate by default, but current large-scale live broadcasts use a relatively high sampling rate; without adjustment, the audio decoding module crashes.
- AAC audio decoding capability extension
WebRTC uses Opus for audio encoding by default, but most current live streams use AAC. The client therefore needs to add AAC codec information to the offer SDP and implement an AAC decoder, while the edge service cooperates by delivering AAC data packaged in RTP (see the SDP sketch after this list).
- H.264 encoding support
To keep latency as low as possible, WebRTC's video encoding uses VP8 and VP9; this needed to be migrated to the more widely used H.264.
- Transmission mode: WebRTC uses P2P for media transfer, which only addresses one-to-one communication and is clearly unsuitable for large-scale PGC and OGC live content. The hole-punching logic was therefore removed from the network transport layer; an RTP connection is used to transmit the streaming media data, with RTCP for transmission quality control.
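The codec changes above are made inside a customized WebRTC build; the sketch below is only a conceptual illustration of what extending the offer SDP with AAC and H.264 descriptions amounts to. The payload type numbers and fmtp values are assumptions, not the production configuration:

```cpp
#include <string>

// Illustrative only: conceptually, the offer SDP is extended so the edge
// service can negotiate AAC audio and H.264 video.
std::string PatchOfferSdp(std::string sdp) {
    // Advertise AAC (48 kHz stereo) on the audio m-line.
    const std::string aac_lines =
        "a=rtpmap:97 MPEG4-GENERIC/48000/2\r\n"
        "a=fmtp:97 streamtype=5;mode=AAC-hbr;sizelength=13;"
        "indexlength=3;indexdeltalength=3\r\n";
    // Advertise H.264 baseline on the video m-line.
    const std::string h264_lines =
        "a=rtpmap:102 H264/90000\r\n"
        "a=fmtp:102 level-asymmetry-allowed=1;packetization-mode=1;"
        "profile-level-id=42e01f\r\n";

    // Naive insertion: append codec attributes right after each media section.
    // (A production patch would also edit the payload lists on the m= lines.)
    auto insert_after = [&sdp](const std::string& anchor, const std::string& text) {
        size_t pos = sdp.find(anchor);
        if (pos == std::string::npos) return;
        pos = sdp.find("\r\n", pos);
        if (pos != std::string::npos) sdp.insert(pos + 2, text);
    };
    insert_after("m=audio", aac_lines);
    insert_after("m=video", h264_lines);
    return sdp;
}
```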
Integrating with the existing playback system
To minimize the impact on the main player, the multi-view player keeps the same data request flow, the same playback capability interfaces, and the same timing of analytics reporting as the main player:
- Multi-view player: wraps WebRTC and extends a multi-view player dedicated to multi-view playback.
- Playback control logic adjustment: adjusts the data acquisition flow of the live bypass and merges the data returned by the confluence service, AMDC, and other services into the data stream.
- Playback capability & statistics adjustment: keeps the original playback capabilities and callback interfaces, distinguished by extended interface parameters.
- Extended error codes: extends the multi-view player's error codes based on the main player's error rules.
Client-side problems solved
Ultimately, we need the main player window shown below, with several child player windows rendered simultaneously:
The sliding list on the right uses UITableView (RecyclerView on Android), and each sub-window gets a rendering instance, a GLKView managed through the RTCEAGLVideoView wrapper. By creating, removing, and updating the sub-windows' rendered frames at the right moments, multiple views can be played simultaneously from a single stream.
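A platform-agnostic sketch of this sub-window management idea is shown below; the Renderer type stands in for the GLKView/RTCEAGLVideoView wrapper, and all names are illustrative:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Each visible list cell owns one renderer instance, created when the cell
// appears and destroyed when it is recycled.
struct VideoFrame { /* decoded frame data omitted */ };

class Renderer {
 public:
    void OnFrame(const VideoFrame& /*frame*/) { /* draw this view's crop */ }
};

class SubWindowManager {
 public:
    // Called when a cell becomes visible (e.g. cellForRowAt / onBindViewHolder).
    Renderer* AttachWindow(const std::string& view_id) {
        auto& slot = renderers_[view_id];
        if (!slot) slot = std::make_unique<Renderer>();
        return slot.get();
    }
    // Called when a cell is recycled; releasing here keeps renderer
    // instances from piling up.
    void DetachWindow(const std::string& view_id) { renderers_.erase(view_id); }

    // Fan each decoded frame out to every visible sub-window.
    void OnDecodedFrame(const VideoFrame& frame) {
        for (auto& entry : renderers_) entry.second->OnFrame(frame);
    }

 private:
    std::unordered_map<std::string, std::unique_ptr<Renderer>> renderers_;
};
```

Releasing the renderer explicitly when a cell is recycled is also what the memory-leak fix described later comes down to.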
During this work we mainly solved an experience problem, flicker when switching views, and a stability problem, black-screen rendering caused by a memory leak. A brief introduction to each follows.
Playback flicker
[Problem] With N camera angles, our scheme produces N*3 streams (see the three display modes in the system design). Each time the user taps to switch views, the actual flow is: the player issues an RTP stream-switch instruction, the edge node changes the stream ID and sends back an SEI message, and the player re-crops and updates each window after receiving the SEI. The problem is that the window update takes some time after the switch succeeds, so the first frame or first few frames of the new stream are still cropped with the old stream's template, and the user briefly sees visual residue.
[Solution] After the kernel receives the stream-switch instruction, it drops frames for a short window; once the SEI message indicating a successful switch arrives, rendering is restored. During this window there is no new frame data, so the user sees a static frame. This shields the sub-windows from rendering mismatched content, the upper-layer UI shows no residual-frame flicker, and to the user the switch feels essentially smooth.
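A minimal sketch of this gating logic, under the assumption that the switch confirmation arrives as an SEI message carrying the new stream ID (names are illustrative):

```cpp
#include <atomic>

// Anti-flicker gate: after a switch instruction is sent, frames are dropped
// until the SEI confirming the new stream arrives, so the old template's
// cropping is never applied to the new stream's frames.
class SwitchGate {
 public:
    void OnSwitchRequested() { dropping_ = true; }
    // Called when an SEI carrying the new stream ID is parsed from the bitstream.
    void OnSeiForNewStream() { dropping_ = false; }
    // Decide per frame whether to forward it to the renderer.
    bool ShouldRender() const { return !dropping_; }

 private:
    std::atomic<bool> dropping_{false};
};
```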
Memory leak
[Problem] As the user keeps switching views, the list is refreshed and cells are re-created, and memory usage keeps climbing; once it reaches the limit, windows turn into black screens.
[Solution] The leak was caused by OpenGL rendering instances being created in batches inside the kernel as cells were re-created, while the old instances were not destroyed in time. The first step was to confirm that the upper-layer business code had indeed released the UIView objects via removeFromSuperview; LLDB showed the reference count was still not zero, so the problem lay in the kernel's instance management. __bridge_retained means that once an Objective-C object is transferred to a CF object, its memory must be released manually, so the destruction and release has to be done explicitly in the C++ layer by calling the corresponding Filter. Memory behavior after the fix:
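Conceptually, the destruction path looks like the sketch below; the names are illustrative, not the actual kernel API:

```cpp
#include <cstdio>

// The rendering filter created by the kernel owns GPU resources, so releasing
// the upper-layer view is not enough; the bridging layer must explicitly call
// a C++ destruction path once ownership was transferred via __bridge_retained.
class RenderFilter {
 public:
    ~RenderFilter() { ReleaseGlResources(); }

 private:
    void ReleaseGlResources() { std::puts("GL textures/framebuffers released"); }
};

// Exposed to the Objective-C++ bridge: hands ownership back to C++ and frees it.
extern "C" void DestroyRenderFilter(void* handle) {
    delete static_cast<RenderFilter*>(handle);
}
```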
Service concurrency optimization
The pre-optimization version had supported live broadcasts of CUBA-related events, but it had many problems: high cost, high switching latency, and no ability to handle large-scale concurrency. The goal of this round of optimization was to support live broadcasts of large events and greatly reduce cost while preserving the experience.
Pre-encoding
The figure below shows the link before optimization. Each client required its own re-compositing and re-encoding, two operations that consume a lot of resources; even a T4 GPU could support only about 20 concurrent viewers.
Analysis shows that in the multi-view scenario the view combinations a user can watch are limited: in the Street Dance mode there are 3*N viewing angles, where N is the number of original capture angles. If we produce these 3N streams in advance, the edge service can simply copy the already-encoded H.264 data, which greatly reduces CPU consumption and increases the number of concurrent users by at least an order of magnitude.
The confluence service in the figure only needs to produce the content once, and the output can be shared by all edge service nodes.
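In other words, the per-client work on the edge node reduces to copying pre-encoded access units into RTP packets. A simplified sketch of that idea (packetization details such as FU-A fragmentation per RFC 6184 are omitted; names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// The merging service pre-produces all 3*N view streams once, so the edge
// node only copies already-encoded H.264 access units into RTP packets
// instead of cropping and re-encoding per client.
struct EncodedFrame {
    std::vector<uint8_t> h264_au;  // one encoded access unit
    uint32_t rtp_timestamp;
    bool keyframe;
};

struct RtpPacket {
    std::vector<uint8_t> payload;
    uint32_t timestamp;
};

// Packetization simplified to a single copy per frame.
std::vector<RtpPacket> ForwardWithoutTranscoding(const EncodedFrame& frame) {
    RtpPacket pkt;
    pkt.payload = frame.h264_au;       // byte copy only, no decode/encode
    pkt.timestamp = frame.rtp_timestamp;
    return {pkt};
}
```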
However, this simple stream switching has a problem: there is no frame-level alignment, so when selecting a different view the user may find that the picture of the new view is out of step with the previous one. To solve this, we asked our colleagues on the Alibaba Cloud directing side to align the streams by absolute timestamp when merging, and to pass the timestamp information through transparently to the edge service.
Because of GOP alignment, this still does not give true frame-level alignment. Readers familiar with H.264 know that video is encoded and decoded in units of GOPs, whose length may range from 1 s to 10 s or even longer; a typical length in live streaming is 2 s. If the user happens to switch right at the start of a new GOP, simple stream switching works. But we cannot dictate when users may or may not switch; users want to switch whenever they like. Our solution: if there is no GOP boundary at the moment the user switches, the edge service produces one.
As the figure above shows, when the user switches from view 1 to view 2, we generate a new GOP at the switch point; once this is pushed to the client, it can decode seamlessly and render the new view.
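A sketch of this "bridge GOP" idea, assuming the edge node keeps the target view's decoded YUV420 frames on hand and can re-encode from the switch point onward (the Encoder here is a stand-in for the actual codec):

```cpp
#include <cstdint>
#include <vector>

// If the switch lands mid-GOP, re-encode the remaining frames of the target
// view starting with an IDR at the switch timestamp; playback then continues
// with the pre-encoded stream from the next regular GOP.
struct YuvFrame     { uint32_t pts; /* YUV420 planes omitted */ };
struct BridgedFrame { uint32_t pts; bool idr; std::vector<uint8_t> data; };

class Encoder {
 public:
    BridgedFrame EncodeAsIdr(const YuvFrame& f) { return {f.pts, true, {}}; }
    BridgedFrame Encode(const YuvFrame& f)      { return {f.pts, false, {}}; }
};

std::vector<BridgedFrame> BridgeGop(const std::vector<YuvFrame>& target_view,
                                    uint32_t switch_pts, Encoder& enc) {
    std::vector<BridgedFrame> out;
    bool started = false;
    for (const auto& f : target_view) {
        if (f.pts < switch_pts) continue;            // frames before the switch
        if (!started) { out.push_back(enc.EncodeAsIdr(f)); started = true; }
        else          { out.push_back(enc.Encode(f)); }   // until the next GOP
    }
    return out;
}
```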
Through the steps above we greatly reduced the encoding cost, but to respond quickly when a user switches views, we had to keep all the original frames (YUV420) of the source views ready for generating new GOPs on demand. Requirements keep changing, though: with 4 capture angles we could pre-decode all 4*3=12 streams, but when the business side wanted 9*3=27 streams, decoding 27 1080p streams simultaneously was too much even for a 32-core machine, and even more streams may be required later.
On-demand decoding
Users want more views, and we need to satisfy them, so the previous approach of pre-decoding everything had to change. We implemented on-demand decoding: a frame (YUV420) is prepared only when it is actually needed. The biggest challenge is real-time performance, because when the user switches, the current picture may be anywhere within a GOP, and decoding can only start from the GOP's first frame. After multiple rounds of optimization, however, the latency presented to the user is imperceptible.
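The core of on-demand decoding can be sketched as follows: decode from the key frame at the start of the GOP, discard everything before the switch point, and hand back the first frame the user should see (Decoder is a stand-in for the actual codec; names are illustrative):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct CodedFrame { uint32_t pts; bool keyframe; std::vector<uint8_t> data; };
struct YuvFrame   { uint32_t pts; /* planes omitted */ };

class Decoder {
 public:
    YuvFrame Decode(const CodedFrame& f) { return {f.pts}; }
};

// Decode one GOP only as far as needed: frames before the switch point are
// decoded (the reference chain requires it) but discarded.
std::optional<YuvFrame> DecodeAt(const std::vector<CodedFrame>& gop,
                                 uint32_t switch_pts, Decoder& dec) {
    if (gop.empty() || !gop.front().keyframe) return std::nullopt;
    for (const auto& f : gop) {
        YuvFrame y = dec.Decode(f);
        if (y.pts >= switch_pts) return y;  // first frame the user should see
    }
    return std::nullopt;
}
```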
Client-side dynamic caching
Anyone who has worked on audio and video is familiar with the stall rate; keeping it under control is never easy, and multi-view live streaming faces an even trickier version of the problem. The most basic way to reduce stalls is to increase the playback buffer, but we also want users to see the new view quickly, so the buffer cannot be too large. We therefore introduced dynamic caching: simply put, when the user switches views we run with an extremely low buffer water level to guarantee the switching experience, and when the user is not switching, the buffer slowly recovers to a certain level to guard against network jitter.
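A minimal sketch of this dynamic water-level idea; the concrete durations below are assumptions for illustration, not the production values:

```cpp
#include <algorithm>
#include <cstdint>

// Right after a view switch the target buffer level is dropped very low so the
// new view shows up fast, then raised gradually back to the normal level to
// absorb network jitter.
class DynamicJitterBuffer {
 public:
    void OnViewSwitched() { target_ms_ = kLowWaterMs; }

    // Called periodically (e.g. once per second) during stable playback.
    void OnTick() {
        target_ms_ = std::min<int64_t>(kNormalWaterMs, target_ms_ + kRampStepMs);
    }

    int64_t TargetBufferMs() const { return target_ms_; }

 private:
    static constexpr int64_t kLowWaterMs    = 200;   // fast start after a switch
    static constexpr int64_t kNormalWaterMs = 1500;  // steady-state protection
    static constexpr int64_t kRampStepMs    = 100;   // recover ~100 ms per tick
    int64_t target_ms_ = kNormalWaterMs;
};
```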
Summary and outlook
The multi-view capability officially launched at the Street Dance of China finals and, as a new way of watching unique to Youku, received plenty of positive user feedback on Weibo and in WeChat Moments. Youku plans to further improve the user experience through interaction optimization and latency optimization.
Follow the [Alibaba Mobile Technology] WeChat official account: three mobile technology practices and insights every week to get you thinking!