To comprehensively improve the experience and interactive capability of Taobao Live, the Taoshi technical team and Alibaba Cloud spent three years building the first full-link RTC real-time transmission network, which has brought major improvements in latency, cost, resistance to weak networks, and other indicators. For this talk we invited Mr. Chen Jufeng (Fenghuo), a senior technical expert of Alibaba's Tao Department, to introduce GRTN's technology evolution route and future planning.
By Chen Jufeng (Fenghuo)
Organized by LiveVideoStack
I started working on Taobao Live in 2016 and did a lot of sharing in 2017. From its launch in 2016 to the present, Taobao Live has accumulated a great deal of technical experience. I remember that in 2017 our earliest proposal was to optimize the live-streaming uplink through RTC. At that time this approach was seldom discussed in the industry, because most solutions were the traditional RTMP on the uplink and FLV on the downlink. If I remember correctly, we introduced RTC technology into live streaming relatively early, and Taobao Live was quite forward-looking technically. Technology has kept progressing: from 2016 to now, RTC has been applied more and more widely in live streaming, and the research has gone deeper and deeper.
From Taobao's experience, the introduction of RTC technology has greatly helped the overall experience and the core business indicators of Taobao Live. GRTN is an RTC network jointly built by the Taobao Live and Alibaba Cloud teams. I will share the landing and planning of GRTN in the Taoshi content business, and also introduce GRTN's key technical points.
#01 Business data of Taobao Live
Looking back at Taobao Live's past business data: launched in 2016, the overall growth from then until the 2020 Double 11 is astonishing. This was a business built from 0 to 1. At the start there was no business foundation and technical reserves were lacking; all Taobao Live had in 2016 was a very small "Love Shopping" section on the Taobao home page, with only 600,000 DAU. The good news was that there was already some accumulation on the anchor supply side. In fact, before the live business took off, the Tao Department had made many attempts on smaller content tracks such as curated goods and "must-buy" lists, and Love Shopping was a typical example. Of course we met many difficulties later: DAU fell sharply and the home-page entry was moved down. But that early stage accumulated a large number of high-quality content producers, influencers, and writers for the Tao Department, and this accumulation helped Taobao Live get off the ground.
Going back to when Taobao Live launched in 2016: the business model of the whole product was clearly defined very early. In 2016 there was no concept of e-commerce live streaming in the market. The so-called "war of a hundred broadcasters" actually referred to show, game, or traditional YY-style mobile live-streaming platforms; e-commerce or "selling goods" was rarely mentioned in live scenarios. It is worth emphasizing that it was indeed Taobao Live that defined the standard of e-commerce live streaming, buy while watching: the purchase loop formed by the anchor presenting, the user asking questions, and the anchor answering. This is the core business model that was established right after the first version of Taobao Live went online in 2016. We have done a lot of work in the years since, including low latency, optimization of interactive capability, and many kinds of multimedia interactive gameplay, but all of these are expansions and upgrades around the edges of that model; the essence has not changed. Over the past four years the core has remained "buying while watching", with the interactive capability of live streaming applied well. For a long time the conversion rate of Taobao Live has been second only to that of "Guess You Like". For most people, shopping on Taobao is highly purposeful: they seldom browse "Guess You Like", and mostly open the search bar and search directly. Interactivity plus e-commerce live streaming has a huge impact on the conversion rate of the whole purchase funnel, and that is the power of innovation.
The overall data has basically doubled year after year, whether measured by DAU in the live rooms or by the total number of broadcasts started, and the transaction scale exceeded 400 billion for the whole of last year.
The content mix has also changed a lot. The typical early scenarios were women's clothing and makeup, but in recent years many forms such as jewelry, food, and village live streams have emerged, with good data performance. In addition to the diversification of content, in terms of transaction scale there were 33 live rooms that crossed the hundred-million mark during Double 11. Overall it presents a very healthy state, which is the result of our past efforts on content diversity.
On the supply side, the path to going live has also been greatly simplified. On the original e-commerce platform, opening a shop, listing new items, and the authentication and security mechanisms were fairly complex, and within Taobao Live the full process of going live, setting up the room, and shipping goods was also complex. So last year we made the business side of the go-live path minimalist: one-tap go-live, one-tap item listing, one-tap pinning of items, and merchants' broadcast frequency improved as a result. Technically, during last year's Double 11 the number of simultaneous online viewers in a single live room broke through 2 million. This is genuinely concurrent online viewership (Taobao Live's metric is strictly real online: a long connection must be maintained while in the watching state). With 2 million truly online, the delay of a single live room stayed within 1.2 s. An even bigger challenge is keeping the picture and the messages synchronized at very low latency: ensuring that the item link pops up on screen at the same moment is critical. And because "grabbing it is earning it", the purchase mode in the live room has become a strongly real-time "flash sale" mode, which means the pushed item link must be synchronized with the picture and delivered with the lowest possible delay to every user in the room, so they can "grab it the moment they see it". This is actually the biggest technical challenge at this scale.
In addition, without optimization at the codec and bitrate level, this scale would bring a linear increase in cost. Therefore we also did pre-research on a full link based on H.265, and even H.266. Using the characteristic of 95th-percentile peak bandwidth billing, we do some intelligent scheduling by peak clipping and valley filling to achieve two goals: first, to reduce the billed peak; second, to raise the everyday bitrate under the same peak (i.e. the same total cost).
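To make the peak-clipping idea concrete, here is a minimal sketch in Python, assuming the common 95th-percentile billing convention (5-minute bandwidth samples, billed on the 95th-percentile sample of the period); the function names and numbers are illustrative, not Taobao's actual scheduler.

```python
# Sketch of "peak clipping, valley filling" under 95th-percentile billing.
# Assumptions: bandwidth is sampled every 5 minutes and the bill is driven by
# the 95th-percentile sample; names and numbers are illustrative only.
import math

def percentile_95(samples_mbps):
    """Return the 95th-percentile bandwidth sample (simple nearest-rank method)."""
    ordered = sorted(samples_mbps)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def headroom_mbps(samples_mbps, current_mbps):
    """Extra bitrate that can be spent right now without raising the billed peak."""
    return max(0.0, percentile_95(samples_mbps) - current_mbps)

# Example: if the billed 95th percentile is 1000 Mbps and the current load is
# 600 Mbps, roughly 400 Mbps of "free" headroom can go to per-stream quality.
if __name__ == "__main__":
    history = [600, 650, 700, 900, 1000, 980, 720, 640, 610, 630] * 10
    print(headroom_mbps(history, current_mbps=600))
```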
My theme will also revolve around these aspects.
#02 Gameplay introduction
When we watch a live room, in addition to the viewing experience itself, we also provide the anchor with some marketing capabilities, including lotteries, rewards, and countdown red envelopes. From a business perspective I will focus on "situational interaction".
In fact, before we did real-time content understanding of the live room, distributing live-room content was very difficult, because a live room is strongly real-time: it may end in two hours, which means the content can only be distributed within a window of at most two hours. For example, once the anchor has finished selling an item, distributing that segment is no longer worth much; only the simple value of a content introduction remains.
The second point is how to maximize the distribution value of a live room while the broadcast is still in progress. Real-time distribution itself is not the problem; that is Alibaba Group's biggest strength. The difficulty is that what happens in a live room is unpredictable. Unlike short video and image-text content, where a lot of preprocessing is done on known material to structure it before distribution, live streaming is different. It is the combination of image recognition, voice information, and even the context information mentioned above that structures the content of the live room. The ideal state is: the anchor is selling a dress, and when users search for "dress" in the main search they hit the dress the anchor is introducing at that very moment. We are now very close to that state.
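As an illustration of the structuring step only, the sketch below merges vision labels, the speech transcript, and the items currently listed into searchable tags; the recognizers, data shapes, and weights are assumptions, not the production pipeline.

```python
# Sketch of structuring a live room in real time so that search can hit it.
# The inputs below are placeholders for whatever image/ASR models are used;
# field names and weights are assumptions for illustration only.
from collections import Counter

def structure_live_segment(frame_labels, asr_text, listed_items):
    """Combine vision labels, the speech transcript, and the items on the shelf
    into a small set of searchable tags for this moment of the stream."""
    tags = Counter()
    for label, score in frame_labels:          # e.g. [("dress", 0.92)]
        if score > 0.7:
            tags[label] += 2                   # visual evidence weighted higher
    for word in asr_text.split():              # crude keyword pass over the transcript
        tags[word] += 1
    for item in listed_items:                  # items the anchor has listed
        tags[item["title"]] += 3
    return [t for t, _ in tags.most_common(10)]

print(structure_live_segment(
    frame_labels=[("dress", 0.92)],
    asr_text="this dress is pure silk",
    listed_items=[{"title": "silk dress"}],
))
```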
#03 Technical Architecture
Let me describe our technical architecture.
First, the infrastructure at the bottom is mainly based on Alibaba Cloud's multimedia system, including edge ingest, center transcoding, time-shifting, and recording capabilities. The second part is the distribution mechanism toward the player side, which is also an application of Alibaba Cloud technology. A big difference is that Taobao Live has built its own distribution system, the RTC distribution system, on top of Alibaba Cloud, and this was a gradual process rather than a single step. Together these constitute the bottom layer of Taobao Live.
The live-broadcast open platform at the core is divided into two sides. One side is the production side, including the codec system, the anchor app, pre-processing (beautification), scene recognition, on-device and stream-push capabilities, and online processing including flow control. The other side is the viewing side. The whole Taobao Live room uses a self-developed player and does a lot of post-processing, including picture enhancement and a self-developed H.265 software decoder. It is worth mentioning that Taobao Live is the only one that has achieved full-link H.265 coverage. This means there is no transcoding anywhere on the link (production, streaming, distribution, and playback), and the coverage ratio is very high. Some students doubted how feasible software H.265 decoding on the device would be; solving that problem solves H.265 decode coverage on the device side, and combined with the hardware-decode path it brings a big improvement while also saving a lot of transcoding cost.
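A rough sketch of the playback-side decision that makes this coverage possible: prefer hardware H.265, fall back to the software decoder if it can sustain the target frame rate, and only then drop to a transcoded H.264 stream. The capability probes are stubbed out, so treat this as an outline under assumptions rather than the actual player logic.

```python
# Sketch of the playback-side decode decision behind full-link H.265.
# Real players probe the platform codec list and benchmark the soft decoder;
# here those probes are passed in as plain arguments for illustration.

def pick_stream(supports_hw_hevc, soft_hevc_fps, target_fps=30):
    """Return which stream/decoder combination to use for this device."""
    if supports_hw_hevc:
        return ("h265", "hardware")            # best case: no transcoding, no extra CPU
    if soft_hevc_fps >= target_fps:
        return ("h265", "software")            # soft decoder keeps H.265 coverage high
    return ("h264", "fallback")                # last resort: server-side transcoded stream

print(pick_stream(supports_hw_hevc=False, soft_hevc_fps=42))   # ('h265', 'software')
```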
On the streaming-media link, the whole live-broadcast business domain model, such as stream state, interaction capabilities, item management capabilities, bullet comments, and marketing gameplay, is divided up, and we put together a whole set of APIs to support the core business.
At the bottom of live streaming there are two core transmission links: one is the streaming-media link, the other is the message link. The message link carries all comment messages and marketing gameplay; it is distributed downstream in parallel with the streaming-media link and kept in sync with it. These are some of our core technical points.
#04 Technology Evolution
In addition to basic optimization, the biggest underlying transformation of live streaming in the past few years has been the move from the traditional CDN-based centralized distribution mechanism to a decentralized network. Let me first introduce the characteristics and problems of the original network.
The first phase (upper left) is by far the most typical solution used by most live-streaming vendors. It is based on the most traditional CDN central distribution network, mainly carrying FLV. On the one hand, the amount of transformation needed on the CDN is low; on the other hand, both RTMP and FLV have the most complete service support, and the flow control of the underlying protocol all goes through TCP, so not much optimization is needed. It is the fastest way to build a live business system from 0 to 1. But what is the problem with this mode? Although it is very fast to build (and the first phase of Taobao Live used it), the whole link must go from L1 to L2 to the central node, and there is an unavoidable back-to-source mechanism. Because this is a static structure with no dynamic awareness, it brings a lot of back-to-source cost and delay. The delay problem is partly related to back-to-source, but also to the lack of finer-grained control in the flow-control protocol: RTMP and device-side FLV are typical protocols on top of TCP, which can only be optimized at limited points at the business layer. Tuning the local cache and the L1 cache gives only limited control over GOP size, which brings a lot of extra cost.
Under this system we found an even more serious issue for Taobao Live. Compared with show and game live streaming, Taobao Live's anchors are scattered. In show live streaming the popular anchors are concentrated: the head may account for 60% or even 80% of total traffic, so centralized distribution can amortize the cost. But most of Taobao's online anchors are below the average waterline; they are very different from the head, yet they account for the vast majority. The centralized mechanism just described mainly solves the problem of concentrated hotspots, but in Taobao Live 1 million online viewers may be spread across tens of thousands of live rooms, so this mechanism further magnifies the cost problem.
The above is why Taobao has evolved toward the decentralized GRTN. There was a formation process here.
The first stage originated from a very simple question. Analysis of online data showed that the vast majority of stutters on the playback side were caused by jitter on the uplink, essentially on the first hop (fluctuations between the anchor's RTMP push and the L1 node caused most of the lag), but under the conditions at the time that link could not be optimized directly. So the typical design was: we deployed Taobao's own ingest nodes alongside the L1 nodes, the anchor pushed over a private RTC protocol, and the stream went from those RTC nodes onto L1; the middle of the link stayed exactly the same. In other words, only the first kilometer was moved onto a private protocol, and that was the first step toward RTC.
The second step was to transform the link between the anchor and the L1 node into full RTC. The effect after the transformation was better, but mostly in solving stutter: after upgrading the RTMP uplink to a WebRTC-based uplink, the improvement was very obvious, with latency reduced by nearly one third, especially in weak-network and overseas push scenarios. Looking at the delay statistics, though, encoding and decoding at the two ends take about 60 ms and the transmission itself is under 100 ms, and the same is true on the playback side; the bulk of the delay actually sits in the distribution link in the middle. So solving these two edges solves the stutter problem but not the delay problem, which brings us back to how to truly achieve full-link RTC.
This is no longer just a business-layer change: the distribution mechanism in the middle has to touch the core technology of the CDN itself, because the original CDN distribution mechanism is essentially a file distribution network, not a streaming-media distribution network. So how to transform it? We combined multiple teams and finally completed the upgrade to full-link RTC in GRTN. The benefits almost completely solve the previous problems. On the one hand it is fully decentralized, with no back-to-source logic at all; for rooms whose viewers are geographically close, the dynamic routing strategy can completely avoid the waste of going through the center.
The benefits of full-link RTC are mutual awareness between nodes and finer control over transmission, including control tailored to the particularities of streaming media. In streaming-media transmission some packets can be dropped, and the policy can be controlled at a finer granularity; delivery does not have to be fully reliable. Before multi-network convergence was proposed, live streaming, video conferencing, and co-streaming in many cases actually went through separate channels: playback might use FLV, while co-streaming between users used RTC.
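A minimal sketch of the "some packets can be dropped" idea: a relay only retransmits a lost packet if it can still arrive before its play deadline and the frame matters. Field names and thresholds are illustrative assumptions, not GRTN's actual policy.

```python
# Sketch of deadline-aware, frame-aware retransmission for streaming media.
# In an RTC distribution tree, a lost packet is only worth resending if it can
# still arrive before its frame is due to be played and if the frame matters.
import time

def should_retransmit(pkt, rtt_ms, now_ms=None):
    now_ms = now_ms if now_ms is not None else time.time() * 1000
    time_left = pkt["play_deadline_ms"] - now_ms
    if time_left < rtt_ms:                 # cannot arrive in time: let it go
        return False
    if pkt["frame_type"] == "non_reference":
        return time_left > 2 * rtt_ms      # only bother if there is comfortable margin
    return True                            # reference/key frames: always try

pkt = {"play_deadline_ms": time.time() * 1000 + 120, "frame_type": "non_reference"}
print(should_retransmit(pkt, rtt_ms=40))
```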
Until now, most online-education companies have used this split strategy because of cost. But once the network is full-link RTC, any node can send both upstream and downstream at the same time, and a live-streaming downlink is essentially no different from a two-way call, which means co-streaming and live streaming can be carried in the same channel. Video conferencing is similar once the peripheral system adds an MCU or SIP gateway. In other words, live streaming, co-streaming, video conferencing, and point-to-point calls are unified: four networks in one. Today this direction is generally accepted.
Cost issues, including back-to-source cost and link length, are solved as well. Full duplex and multi-network convergence are really one concept: every node carries both uplink and downlink logic.
The above is basically the process of the network's evolution.
#05 Low delay live broadcast
Going back to the mechanism of pushing item links in the live room, it essentially involves several issues:
Message arrival rate: messaging at over a million concurrent viewers is a completely different technical system from distributing to thousands or tens of thousands online. In practice, besides heavily transforming the original push mode, including the distribution mechanism and the saturation mechanism, the biggest technical change is to layer the messages. Messages the anchor pushes to users are the ones we believe have the highest arrival-rate value to the business, so they stay on the push path; the original enter/leave-room messages, comments, and their fan-out are transformed into static resources served through CDN. In the end the whole message mechanism combines push and pull, and the 5-second message arrival rate exceeded three nines, and it was roughly three nines at 3 seconds as well.
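Below is a minimal sketch of that layered push/pull split, assuming a simple priority list; the message types, channels, and batch handling are illustrative stand-ins for the real system.

```python
# Sketch of the layered push/pull message design at million-viewer scale.
# High-value messages (item push, marketing events) go out on the long
# connection immediately; high-volume, low-value messages (enter/leave,
# ordinary comments) are batched into a feed that clients pull from CDN.
# The message classes and channel objects are assumptions for illustration.

HIGH_PRIORITY = {"item_push", "lottery", "red_envelope"}

def route_message(msg, push_channel, cdn_batch):
    if msg["type"] in HIGH_PRIORITY:
        push_channel.append(msg)          # push immediately to every long connection
    else:
        cdn_batch.append(msg)             # aggregated and published as a pull resource

push_channel, cdn_batch = [], []
for m in [{"type": "item_push", "item": 42},
          {"type": "comment", "text": "looks great"},
          {"type": "enter_room", "user": "u1"}]:
    route_message(m, push_channel, cdn_batch)
print(len(push_channel), len(cdn_batch))   # 1 pushed, 2 left for CDN pull
```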
Message arrival rate and stable timeliness alone are not enough if they cannot be aligned with the picture; otherwise the effect does not show. After the anchor pushes an item link, the user should see the corresponding picture quickly, and the link should appear for the user at the same moment as the picture, with as little delay as possible. This relies on the message link on one hand and on GRTN's low transmission delay on the other. The synchronization of the two is mainly based on "dyeing" frames: currently we use SEI, i.e. extra information attached to a particular frame, to guarantee that the message and the picture are synchronized, and the key is to keep the synchronization window as small as possible. If the gap is too large, the message and the picture mismatch and users effectively see the link ahead of (or behind) the picture. This has long been a point of concern. In 2018 there were prize-quiz live streams, where this technology is even more critical: the picture and the question must be highly consistent, which is strong synchronization between live video and messages.
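A toy sketch of the frame-dyeing idea: an item marker travels in the SEI of a frame, and the player holds the card until it renders a frame carrying that marker. Class and field names are assumptions for illustration.

```python
# Sketch of "dyeing" frames so an item card pops up exactly with the picture.
# When the anchor pushes an item, a marker is written into the SEI (per-frame
# side data) of the current frame; the player holds the card until it renders
# a frame carrying that marker.

class CardSynchronizer:
    def __init__(self):
        self.pending = {}                  # marker_id -> card payload

    def on_message(self, marker_id, card):
        self.pending[marker_id] = card     # arrived via the message link, not shown yet

    def on_frame_rendered(self, sei_markers):
        """Called by the player for every rendered frame with its SEI markers."""
        shown = [self.pending.pop(m) for m in sei_markers if m in self.pending]
        return shown                       # cards to display on this exact frame

sync = CardSynchronizer()
sync.on_message("item-42", {"title": "silk dress", "price": 199})
print(sync.on_frame_rendered(["item-42"]))   # card shows on the matching frame
```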
Relying on the full RTC link, the delay can be kept under 1 s on average, and in certain cases the delay can be even lower.
#06 Taobao Live co-streaming (Lianmai)
The advantage of the dual-purpose RTC channel is that it can carry live-streaming push and co-streaming at the same time, and with the help of matching peripheral systems (an MCU) it can also support video conferencing. This is the direction we are working toward: we hope the four scenarios (live streaming, calls, video conferencing, co-streaming) each have a real business carried on the same network, at a certain scale.
Taobao's current client-side mixing and cloud mixing each went through their own evolution stage, but they eventually converge into an integrated approach: client-side mixing brings cost advantages, while cloud mixing has more advantages in scheduling and scalability, so we support both directions. The evolution path basically started from the early stage when RTMP carried co-streaming; later, because the uplink and downlink were upgraded to RTC in phases, there was a mixed state; finally, full-link RTC completely solved the problem of live streaming and co-streaming going through inconsistent channels.
#07 GRTN dynamic path planning
There are several key points in the dynamic path planning of GRTN.
One is dynamic route planning. The core is that every node can perceive the path to reach any other node, measured by many indicators, such as the dynamically estimated packet loss rate, delay, jitter, capacity, and cost between any two nodes. The whole network continuously sends probe packets to record this information, so a node keeps a dynamic policy table recording the routes from its connected nodes to any node. There are several policy modes. One is quality-first: packet loss, delay, and jitter each carry a weight in the design, to guarantee transmission at the lowest delay and the lowest loss rate. The other is cost-first: the intermediate path may be longer, and the cost of the nodes it passes through differs; some hops go over dedicated lines and some go over LTN, so the cost differs.
At present, the version of GRTN running online still integrates multiple strategies. A typical example is the RTC call scenario, which demands the highest quality and the lowest latency, so it chooses the quality-first strategy. So this part also involves different designs for different scenes.
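As a toy illustration of the policy table, the sketch below runs Dijkstra over probed link metrics with two interchangeable edge weights, quality-first (a weighted mix of delay, loss, and jitter) and cost-first; the graph, metric values, and weights are made up for the example and are not GRTN's real parameters.

```python
# Toy sketch of dynamic path selection over probed link metrics.
# Each directed edge carries the latest probe results; "quality" minimizes a
# weighted mix of delay, loss and jitter, while "cost" minimizes money.
import heapq

def edge_weight(metrics, policy):
    if policy == "quality":
        return metrics["delay_ms"] + 500 * metrics["loss"] + 2 * metrics["jitter_ms"]
    return metrics["cost"]                                    # cost-first policy

def best_path(graph, src, dst, policy="quality"):
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue
        for nxt, metrics in graph.get(node, {}).items():
            nd = d + edge_weight(metrics, policy)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

graph = {
    "edgeA": {"core1": {"delay_ms": 30, "loss": 0.01, "jitter_ms": 5, "cost": 3},
              "edgeB": {"delay_ms": 12, "loss": 0.002, "jitter_ms": 2, "cost": 5}},
    "core1": {"edgeC": {"delay_ms": 25, "loss": 0.005, "jitter_ms": 4, "cost": 2}},
    "edgeB": {"edgeC": {"delay_ms": 15, "loss": 0.001, "jitter_ms": 2, "cost": 6}},
}
print(best_path(graph, "edgeA", "edgeC", policy="quality"))   # edgeA -> edgeB -> edgeC
```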
#08 Parameter self-learning
Across the whole network, the externally visible metrics (latency and stutter) are only the end result. At fine granularity there are more than 400 parameters, such as encoder-side, decoder-side, and client-side cache settings, and the optimal combination of these parameters is hard to determine and hard to verify during convergence.
Therefore we adopted parameter self-learning: manual pruning in the early stage, and then, using large-scale online A/B experiment systems, the 400 parameters were made to converge very quickly. Pure parameter optimization brings benefits: with the underlying mechanisms unchanged, parameter optimization alone delivered obvious gains in both stutter and delay.
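A toy sketch of that self-learning loop under stated assumptions: sample candidate configurations from a pruned space and keep the one with the best score. The parameter space and the evaluator below are stand-ins; the real system scores configurations with online A/B traffic and far more parameters.

```python
# Toy sketch of parameter self-learning: random search over a pruned parameter
# space, scored by an online metric (faked here). Everything below is an
# illustrative stand-in for the real experiment-driven system.
import random

SPACE = {
    "jitter_buffer_ms": (50, 400),
    "nack_max_retries": (1, 6),
    "encoder_bitrate_kbps": (800, 2500),
}

def sample_config():
    # Continuous sampling for simplicity; integer parameters would be rounded.
    return {k: random.uniform(*v) for k, v in SPACE.items()}

def online_score(config):
    """Stand-in for a real A/B evaluation: lower is better (stutter + weighted delay)."""
    lag = abs(config["jitter_buffer_ms"] - 180) / 100 + abs(config["nack_max_retries"] - 3)
    delay = config["jitter_buffer_ms"] / 1000 + 0.1 * config["nack_max_retries"]
    return lag + 2 * delay

best = min((sample_config() for _ in range(200)), key=online_score)
print({k: round(v, 1) for k, v in best.items()})
```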
#09 Gameplay
Last year we did a lot of gameplay-oriented optimization, including marketing gameplay (lotteries, red envelopes) and community or game-style interactive gameplay, which shortens the distance between anchors and users (like Douyin's "submarine" game some time ago) and produces both fun and content. However, this mode had not been applied well in live streaming before, and several problems come up:
First, when a large number of inference models run on the device, neither the MNN engine nor small-sample training of on-device models could fully keep up. This year we can basically reach alignment with the industry.
Second, a lot of exploration cannot be done with on-device computing power alone. For example, 3D live rooms and virtual-anchor technology make comprehensive demands on computing power; simply putting all of it on the device is unrealistic. How do we use a better solution to apply computing power on the device on one hand and in the cloud on the other, and remove the difference between device and cloud? Essentially, pre-processing is recognition or effect rendering, and whether it runs on the device or in the cloud should be imperceptible to the viewer and the anchor. For face recognition or special-effect recognition, for example, we can either put the model in the anchor-side app or PC tool before the stream goes out, or process it on a GRTN node and send the result back to the anchor-side app through a real-time echo link. So as long as the delay problem is solved for the anchor, it makes no difference whether the echo is local or from the cloud. Our fundamental goal is to fully and flexibly mobilize cloud computing power, and to solve the live-streaming interactive gameplay problem by combining it with GRTN's real-time echo link.
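Here is a small sketch of that placement decision, assuming a per-operator choice between the anchor's device and a nearby GRTN node with a real-time echo; the capability inputs and the latency budget are illustrative assumptions.

```python
# Sketch of placing a pre-processing operator (e.g. face recognition, matting)
# either on the anchor's device or on a nearby GRTN node with a real-time echo
# back to the anchor. The numbers and thresholds are illustrative, not measured.

def place_operator(device_can_run, device_fps, echo_rtt_ms,
                   target_fps=25, echo_budget_ms=100):
    """Return where to run this operator."""
    if device_can_run and device_fps >= target_fps:
        return "device"                      # cheap, and no extra echo path needed
    if echo_rtt_ms <= echo_budget_ms:
        return "cloud"                       # run on a GRTN node, echo the result back
    return "device_degraded"                 # neither fits: run a lighter on-device model

print(place_operator(device_can_run=False, device_fps=0, echo_rtt_ms=60))   # 'cloud'
```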
A quick introduction to some of the gameplay shown above. In the second one, "Nian Beast", the anchor controls the left-right movement of the cannon at the bottom of the screen with their face, hits the little bees at the top, and triggers props through facial input. The game is built on the mechanics just described: face recognition, stream processing, rendering, and final composition all run on the same device, and we made a cloud version as well.
In the 3D live room, the person stands in front of a green screen, and the scene behind is a 3D digital live room, produced by matting and composition. Running the 3D scene on the device is not realistic, because the device does not have enough computing power. With simple background replacement the anchor does not interact with the 3D scene at all; once the scene is fully 3D, the anchor can interact with elements in the scene, and it also addresses the problem of remote samples in the live room.
#10 Multimedia processing center
The overall design pattern is similar to what was introduced above. The most important points are to plug the front-end capabilities in as operators of GRTN, which can also be linked with the scheduling system on the device side. In addition, the GRTN real-time echo mode realizes a cloud-device integrated design.
#11 Intelligent control
The last core part is control based on the current bandwidth level and on business data: for the overall line bitrate, at a finer crowd granularity, we balance cost and quality. During Double 11 this played a good role; it can lower the bitrate during the evening peak and also automatically raise quality as line conditions change.
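A hedged sketch of that control loop: per viewer tier, shave bitrate when load approaches the billed peak and spend headroom on quality when load is low. Tiers, thresholds, and step sizes below are assumptions, not the production values.

```python
# Sketch of intelligent rate control: trade cost against quality per viewer
# tier, using headroom below the billed bandwidth peak. Illustrative only.

def adjust_bitrate(base_kbps, viewer_tier, current_mbps, billed_peak_mbps):
    load = current_mbps / billed_peak_mbps
    if load > 0.95:                                   # approaching the billed peak
        cut = {"high_value": 1.0, "normal": 0.85, "low_sensitivity": 0.7}[viewer_tier]
        return int(base_kbps * cut)                   # shave the least sensitive tiers first
    if load < 0.7:                                    # plenty of headroom at the same cost
        return int(base_kbps * 1.2)                   # spend it on quality
    return base_kbps

print(adjust_bitrate(1500, "low_sensitivity", current_mbps=980, billed_peak_mbps=1000))
```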
That is all I wanted to share. Thank you!