Author | Bixuan    Source | Alibaba Cloud Native public account

With the continued maturation of new technologies such as 5G, chips, and blockchain, the spread of cloud computing, and the convenience brought by the cloud-native era, the challenges facing developers and architects are no longer just zero-to-one construction problems; how technology can bring more business value has become a topic worth discussing. Lin Hao (known within Alibaba as Bixuan), a researcher at Alibaba Group and head of Alibaba Cloud Intelligent Video Cloud, delivered a keynote at the QCon global software development conference on the second half of the 5G and cloud-native era, using audio and video in 5G scenarios as an example to explore the relevant technologies. The following is the content of that talk.

Many people may have heard, for example, Daniel Zhang (Xiaoyaozi), chairman of the board of directors of Alibaba Group, say on many occasions that the most certain change in the world is digitalization, meaning that most of the big changes ahead are "the acceleration of digitalization." And within this digital trend, we see even greater certainty in "video."

What changes will 5G + cloud native bring to business?

The two aspects of 5G most relevant to business systems are lower latency and wider bandwidth.

Current mainstream networks such as 4G have a latency of roughly 10 ms to 100 ms, which is a fairly wide range. 5G typically brings this down to between 1 ms and 10 ms, with a long-term target of around 1 ms. So what changes will we see in business as latency gets lower and bandwidth gets wider?

The figure above shows typical cases as bandwidth grows and latency falls. Take cloud gaming, a particularly hot topic right now. Games such as racing and competitive titles have very high latency requirements. A 4G network simply cannot get latency low enough, but in 5G scenarios, once latency drops below 50 ms, many such businesses may become reality.

So at the business level, what we focus on with 5G is which services need more bandwidth and which need lower latency.

Cloud native is certainly a hot topic right now. At last year's Double 11 we said the biggest change was that all the core systems had moved to the cloud, while at this year's Double 11 we said all the core systems had started to go cloud native.

But we also said that not everyone has the same picture of cloud native in mind; many people do not really know what cloud native means.

Why is Alibaba pushing cloud native so aggressively? I was once the architect responsible for moving Alibaba's entire core system to the cloud. In my view, the most important thing in the evolution of the whole business is that every business starts to move from a closed, self-contained technology system to an open one; this is the most important change cloud native brings.

After cloud native, technology systems across society will become more and more open and public, which helps business innovation enormously. Before, you had to build a great deal yourself; now you can do much of it on top of relatively mature technology. At Alibaba, for example, we saw that businesses that moved to the cloud greatly accelerated innovation and business iteration.

The most typical scenario: video

As noted above, 5G brings low latency and large bandwidth, while cloud native brings an open, public technology system. What is the most typical scenario after 5G + cloud native, the scenario to which both are especially appealing?

What we are fairly sure of right now is video. Because of the epidemic, video seems to have suddenly become a particularly hot area of business and technological innovation across the industry this year. Yet video technology has been around for years, and it seems to be exploding again.

Many people will have noticed that most business systems used to have no video, but now most are starting to include it in one way or another. Short video, live streaming, and audio/video communication are the hottest scenarios at the moment.

From the scenario level, we believe video is a very typical 5G + cloud-native scenario. The reason is that all video services, whether live streaming, short video, or audio and video calls, put experience first.

For video, experience is what matters most: whether a live stream is smooth enough, whether the picture is sharp. The same goes for short video, and audio/video calls even more so. In a video conference, for example, what people care about most is whether they can hear the other side clearly and whether the picture is smooth enough.

So once such a business launches, the first concern is experience. To deliver a good video experience, the first problem we face is whether the video can be distributed well to a point close to each user.

Frankly, most small and medium startups, and even large companies, have a hard time solving this. Generally, to make the whole experience truly good, most businesses rely on a huge network behind them, and such a network is usually only provided by cloud companies, because building it takes an enormous investment.

So in terms of experience, video is very typical, and it is usually more sensible to use a cloud-native service than to build one from scratch.

Beyond experience, the second big problem video businesses face is cost. Video differs from many other businesses. If those businesses do not reach scale, the cost may not be large: a few machines for computing, a little storage, a small database. Of course, if you are doing big data or AI, the investment is relatively bigger.

But once you do video, you face a big challenge in bandwidth, because bandwidth is money. Beyond bandwidth, there are storage costs as soon as the video business grows a little, because you have to save the content, and video files are obviously larger than almost anything that came before.

With storage comes the problem of computing consumption, because the video may need processing, such as encoding and decoding or other work, which consumes a relatively large amount of computing resources. So overall, video must not only solve the experience problem but also face huge cost consumption, and solving the cost problem raises all kinds of further problems. This is why cloud-native video services are a relatively good choice for many teams.
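To make the cost structure concrete, here is a minimal back-of-envelope sketch; every price and usage figure in it is a hypothetical illustration, not an Alibaba number, but it shows why delivery bandwidth usually dwarfs storage and transcoding:

```python
# Back-of-envelope monthly cost model for a small live/VOD service.
# All prices and usage figures below are hypothetical illustrations.

def monthly_cost(viewers, avg_watch_hours, bitrate_mbps,
                 gb_price=0.05, storage_gb=50_000, storage_gb_price=0.02,
                 transcode_hours=2_000, transcode_hour_price=0.3):
    # Bandwidth: megabits streamed -> gigabytes delivered to viewers.
    gb_delivered = viewers * avg_watch_hours * bitrate_mbps * 3600 / 8 / 1000
    return {
        "bandwidth": round(gb_delivered * gb_price, 2),
        "storage": round(storage_gb * storage_gb_price, 2),
        "transcode": round(transcode_hours * transcode_hour_price, 2),
    }

costs = monthly_cost(viewers=100_000, avg_watch_hours=5, bitrate_mbps=3)
print(costs)  # bandwidth dominates by more than an order of magnitude
```

Even with modest hypothetical numbers, the delivery-bandwidth line is tens of times larger than storage plus transcoding, which is the point the talk makes: bandwidth is money.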

Another thing I feel about the video business is that it demands very large investment in basic technology. For example, to control bandwidth better during distribution and playback, we may have to reduce bandwidth cost and control bitrate while keeping the picture quality most users see largely unchanged. This matters for many companies, because in most businesses a small share of videos accounts for most of the bandwidth cost, yet the quality of those videos cannot simply be reduced, because lower quality hurts the user experience.

Solving this may require devoting many people to codec optimization. Open-source codecs are available and their quality is not bad, but doing better than open source requires a very large investment.
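The bandwidth-versus-quality trade-off described above is the idea behind content-adaptive ("per-title") encoding: among several trial encodes of the same clip, keep the cheapest bitrate that still meets a perceptual-quality target. A minimal sketch, assuming we already have per-rung quality scores (for example from VMAF measurements of test encodes; the numbers below are made up):

```python
# Sketch of content-adaptive bitrate selection: given per-title quality
# measurements from trial encodes, pick the cheapest rung that still
# meets the quality target. All scores below are invented.

def pick_bitrate(ladder, min_quality=90):
    """ladder: list of (bitrate_kbps, quality_score) pairs, any order."""
    ok = [(rate, q) for rate, q in ladder if q >= min_quality]
    if not ok:
        # No rung reaches the target: fall back to the best available.
        return max(ladder, key=lambda rq: rq[1])[0]
    return min(rate for rate, _ in ok)

# A low-motion talking-head clip compresses well...
talking_head = [(800, 88), (1200, 93), (2500, 96), (4500, 97)]
# ...while a high-motion sports clip needs far more bits for the same score.
sports = [(800, 71), (1200, 80), (2500, 89), (4500, 92)]

print(pick_bitrate(talking_head))  # 1200
print(pick_bitrate(sports))        # 4500
```

The design point: easy content ships at a fraction of the bitrate of hard content at the same perceived quality, which is exactly where the bandwidth savings the text describes come from.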

In addition, as you may have heard, when watching a video the content itself determines which regions need to be very clear and which are relatively less important. This can be combined with AI for video content understanding, then dynamic encoding optimization, optimizing around points of interest. Behind this may sit several teams, a codec team, an AI team, an algorithm team, so even a small improvement may require a very large investment.

Making latency lower

Let me give a few examples of the benefits of lower latency.

The first is online education. In its early days, the teacher recorded a video in advance and students clicked to play it. But for many customers, such as parents, this was not acceptable, because there was no real interaction with the teacher. Online education later moved toward real-time interaction between teacher and student rather than none at all.

For interaction, the key is latency. Traditional live-streaming technology usually has a delay of about five seconds. Live TV latency is somewhat longer, but that comes from other requirements; at the technical level it is about 5 seconds, a result of the protocol. Online education aims to bring latency down to a few hundred milliseconds so that audio and video interaction works better.

The second is e-commerce, where Alibaba feels this very strongly. When live streaming first started in the Taobao mobile app, it also used traditional technology. The biggest problem in that scenario: the anchor would come on and say, "I'm going to sell something," and viewers would respond through messages. But the anchor's speech and the audience's messages travel through two paths with different delays: a message may take 1 second while the video takes 5 or 6 seconds.

The problem is that messages and video are not in sync on the same screen: the host may have already moved on to the next item while buyers are still asking about the previous one.

So in the Taobao mobile scenario, we kept working with that team to push latency down as far as possible. At this year's Double 11, for example, Taobao mobile adopted low-latency live streaming at scale, reducing latency to about 1 second. With latency kept within 1 second, we saw a great help to GMV conversion, because the anchors interact much more strongly with the audience.

We see this appeal for lower latency in all live-streaming systems. Live streaming now wants to move toward strong interactivity rather than the original one-way broadcast, because audiences also want stronger interaction.

The last example is video conferencing, the scenario felt most strongly during the epidemic. Video-conference latency is now technically several hundred milliseconds, so video conferencing has become common. Teleconferencing used to be more prevalent, but video conferencing is clearly on the rise; after all, communication is more about seeing people than just hearing a voice on the phone.

To take another example, many companies fly candidates in and complete decisive or critical interview rounds face to face, because many things are hard to judge over a phone interview when you cannot see the person. With video conferencing, some of those interviews can be conducted without bringing the person in.

So the role of latency technology in video is very clear: going from a few seconds to hundreds of milliseconds has driven a great deal of scenario innovation.

But for video, that is still not enough. A research report from an academic institution shows, for instance, that a video conference with several hundred milliseconds of delay is still quite different from in-person communication.

Everyone in video conferences has had this experience: people still talk over one another. You may start a sentence, and before you finish, the other person starts speaking. This is bound to happen, because face-to-face conversation does not carry hundreds of milliseconds of delay.

In video scenarios we therefore have a strong motivation to push latency even lower so that people have a more realistic experience; many companies are doing a lot of work to make remote meetings feel closer to face-to-face communication. For us, ever-lower latency is a very good thing, because we can build more business-level innovation on top of it.

Breaking down audio and video transmission latency

Audio and video technology differs somewhat from system-level technology, so let's look at where the latency comes from. In live streaming, a typical audio/video scenario, you pick up a phone and start shooting: that is capture. The captured frames are then encoded, much of which may be done on the server side. This stage now takes roughly 60 ms.

After capture, the stream (for example a live broadcast or camera stream) is pushed to the remote end, mostly the cloud or one's own servers. Live content usually needs to be reviewed once, and for more complex broadcasts other work may be needed, such as adding logos, editing, and switching cameras.

If there are multiple camera positions, choosing which one to broadcast is also involved. Then comes distribution: how the server side pushes the stream to many points. Finally the client pulls the stream locally, decodes it, and plays it.

In terms of total time, latency used to be 3 to 5 seconds, with most of it spent on the pull side, a consequence of the protocol. RTMP is a standard protocol. The low-latency live streaming now popular in the industry pushes latency from 3 seconds down to about 1 second; at that point we call it low-latency live streaming, with lower delay than before.

As the overall optimization in the figure above shows, the work is mostly about replacing the protocol layer. Most companies' low-latency live streaming is based on RTC, that is, Google's open-source WebRTC protocol. With RTC-based push and RTP-based distribution, replacing the earlier protocol layers can press the pull-side startup delay to under 1 second. The overall latency of Alibaba's live streaming currently sits in the range of 1 to 1.2 seconds, which is enough for message-based interaction: when the anchor interacts with the audience through messages or rewards, nobody feels a long delay. In this scenario, protocol substitution alone pulls the whole latency down.
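The stage-by-stage picture above can be sketched as a simple latency budget. The per-stage numbers below are illustrative, chosen only to be consistent with the talk's figures (about 60 ms for capture plus encode, 3 to 5 s traditional total, about 1 s with RTC), not measurements:

```python
# Illustrative end-to-end latency budgets (milliseconds) for a traditional
# RTMP-based pipeline vs an RTC-based low-latency pipeline. The per-stage
# numbers are rough illustrations consistent with the talk, not measurements.

rtmp_pipeline = {
    "capture_encode": 60,
    "push": 100,
    "processing_distribution": 340,
    "pull_and_player_buffer": 3500,   # protocol-driven buffering dominates
}
rtc_pipeline = {
    "capture_encode": 60,
    "push": 80,
    "processing_distribution": 260,
    "pull_and_player_buffer": 600,    # RTP + small jitter buffer instead
}

for name, stages in [("RTMP", rtmp_pipeline), ("RTC", rtc_pipeline)]:
    total = sum(stages.values())
    worst = max(stages, key=stages.get)
    print(f"{name}: total {total} ms, dominated by {worst}")
```

The budget makes the text's point visible: capture and encode barely move between the two pipelines; nearly all of the improvement comes from shrinking the pull-side buffering that the protocol dictates.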

But you can also see that much of the remaining delay comes from the network itself, and when the network is the cause there are few good solutions; it is genuinely hard. Standard RTC can reach roughly 200 to 300 ms, and that is where things stand.

Beyond the technical differences, these three kinds of latency also differ greatly in cost when adopted. As you push latency lower and lower, cost rises sharply. Compared with traditional live streaming, RTC may cost more than 7 times as much. With low-latency live streaming, companies are now trying to bring the two cost levels as close together as possible.

The most important step in controlling latency is replacing the protocol, because once you move from TCP to UDP, you have to do a great deal yourself.

The most important indicator video vendors watch is resistance to packet loss. Most companies aim to still meet their scenario's demands at 50%, 60%, even 70% packet loss. For example, if a video conference is just a discussion, the biggest demand is actually audio: clarity and smoothness; a slightly choppy picture is barely acceptable. But if the conference is a slide presentation, that is unacceptable, and the priority may shift to video clarity. Different scenarios therefore require a variety of different strategies.

What is the biggest difference between watching a live stream and a video conference? In a live stream, as long as the link between me as the anchor and the camera, and between me and the server, has no big problem, viewers do not affect one another: one viewer may stutter while another does not. But in a ten-person meeting, if any one participant's video or audio breaks, the efficiency of the whole meeting suffers.

In such scenarios, guaranteeing both latency and smoothness requires a great deal of work at the packet-loss-resistance level, including comprehensive strategies.
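A sketch of what such a comprehensive strategy might look like. The policy and thresholds below are hypothetical, but they capture the standard trade-off: retransmission (NACK) is cheap yet costs about one round trip, while forward error correction (FEC) spends extra bandwidth up front to survive loss without waiting:

```python
# Sketch of a loss-recovery policy of the kind the text describes. Under a
# latency budget, retransmission (NACK) helps only if the answer can arrive
# in time, while FEC adds redundant packets proactively. Thresholds are
# hypothetical illustrations, not any vendor's actual tuning.

def recovery_plan(loss_rate, rtt_ms, latency_budget_ms):
    plan = {"nack": False, "fec_overhead": 0.0}
    # A retransmitted packet arrives roughly one RTT (plus slack) later,
    # so NACK is only useful when that fits inside the latency budget.
    if rtt_ms * 1.5 < latency_budget_ms and loss_rate < 0.20:
        plan["nack"] = True
    # Heavier loss, or no time to retransmit, calls for proactive FEC;
    # scale redundancy with the loss rate, capped at 50% extra packets.
    if loss_rate >= 0.05 or not plan["nack"]:
        plan["fec_overhead"] = min(0.5, loss_rate * 2)
    return plan

print(recovery_plan(loss_rate=0.02, rtt_ms=40, latency_budget_ms=400))
print(recovery_plan(loss_rate=0.30, rtt_ms=200, latency_budget_ms=400))
```

Real stacks layer more on top (jitter buffers, reference-frame selection, bitrate back-off), but the shape is the same: the strategy shifts from reactive to proactive as loss rises and the latency budget tightens.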

Looking at many audio and video companies, their great competitive strength lies in adapting well to the device side. Some users are on Apple devices, some on Android, and Android especially: there are countless Android phone models, each with very different audio and video capabilities. There is also network switching, such as from Wi-Fi to 4G, where handling the handover point is also critical.

So as overall latency gets lower and lower, the technical bar keeps rising. Controlling stutter well is the biggest problem facing companies in this type of business.

The key technologies here are push, distribution, and what the pull side does to control latency. Push is mainly about the protocol layer and packet-loss resistance, while distribution is mainly about the network behind it.

Companies doing video business usually take one of a few approaches: build the whole audio/video network directly on a cloud vendor's CDN, or build their own on edge computing nodes. Whichever scheme is used, one problem always remains: how do you schedule so many nodes well? This is a very complex scheduling problem, because each node's bandwidth capacity and computing capacity may differ, so the whole network must be scheduled according to the users' situation.
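A minimal sketch of that scheduling decision, with hypothetical node names and a made-up scoring rule that balances proximity (RTT) against remaining capacity:

```python
# Sketch of edge-node scheduling as the text describes it: nodes differ in
# latency to the user and in remaining capacity, so the scheduler must
# weigh both. Node data and scoring weights are hypothetical.

def pick_node(nodes, needed_mbps):
    """nodes: list of dicts with name, rtt_ms, free_bandwidth_mbps."""
    candidates = [n for n in nodes if n["free_bandwidth_mbps"] >= needed_mbps]
    if not candidates:
        return None  # in practice: degrade bitrate or fall back to a parent node
    # Lower RTT is better, but prefer nodes with headroom so load does not
    # pile onto one nearly-full node close to the user.
    def score(n):
        headroom = n["free_bandwidth_mbps"] / (needed_mbps * 10)
        return n["rtt_ms"] - min(headroom, 1.0) * 10
    return min(candidates, key=score)["name"]

edge_nodes = [
    {"name": "hangzhou-1", "rtt_ms": 12, "free_bandwidth_mbps": 40},
    {"name": "shanghai-2", "rtt_ms": 18, "free_bandwidth_mbps": 900},
    {"name": "beijing-1",  "rtt_ms": 35, "free_bandwidth_mbps": 500},
]
print(pick_node(edge_nodes, needed_mbps=5))   # nearest node still has room
print(pick_node(edge_nodes, needed_mbps=50))  # nearest node is too full
```

Production schedulers add far more signals (ISP, cost per node, predicted demand), but the core tension is the same one the text names: per-node capacity varies, so pure proximity routing is not enough.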

UHD is the future, but many technical issues remain to be worked out

At the bandwidth level, everyone is wondering who will use the bandwidth once 5G makes it larger; someone has to use it. Just as with 4G, it was video, short video in particular, that filled 4G's bandwidth; video is now the bulk of Internet traffic. The 5G era will be the same: for greater bandwidth consumption, we must see a big change from the business side.

This is the progression of definitions you see all the time: 720p, 1080p, 4K, and the 8K we now see in a few scenarios. In practice 8K is rarely seen, because it places very high demands on the screen; it basically takes a large screen to show 8K's effect.
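A rough calculation shows why each step up this ladder leans so hard on bandwidth: raw (uncompressed) bitrate grows linearly with pixel count, so 8K carries sixteen times the raw data of 1080p before any codec does its work. The assumptions below (30 fps, 4:2:0 chroma sampling at 12 bits per pixel) are common defaults, not figures from the talk:

```python
# Rough bandwidth arithmetic for the resolution ladder. Raw bitrate is
# width * height * fps * bits-per-pixel; 12 bits/pixel corresponds to
# 8-bit 4:2:0 sampling. Assumed defaults, not figures from the talk.

resolutions = {            # (width, height)
    "720p":  (1280, 720),
    "1080p": (1920, 1080),
    "4K":    (3840, 2160),
    "8K":    (7680, 4320),
}

def raw_gbps(width, height, fps=30, bits_per_pixel=12):
    return width * height * fps * bits_per_pixel / 1e9

for name, (w, h) in resolutions.items():
    print(f"{name}: raw {raw_gbps(w, h):.2f} Gbps")
```

Codecs compress this by two to three orders of magnitude, but the ratios between rungs survive compression, which is why every step up in definition multiplies delivery cost.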

A few years ago, Ali did a Winter Olympics demo called 5G + 8K, watching the Olympic skiing; the sense of motion is very strong, so the effect is very obvious. VR/AR, now very popular, needs even higher definition. Much VR today is still 4K, which feels grainy; combined with 8K, the graininess improves greatly and the picture comes closer to reality.

Only with more bandwidth can clarity be pushed further forward. On clarity, someone once observed that if you ask people, they will say things are already clear enough and need not be clearer; but give them something clearer and they find they want it. The most typical case is Apple's Retina display: once it appeared, people's experience improved.

Short-video companies are now also pushing 4K. Many people used to think short video did not need to be that clear, because the screen is too small to see the difference at 4K.

But from the industry's development, the trend is clear: moving toward ever-higher clarity certainly has appeal. So why is progress slow? There are several reasons. First, pushing clarity forward involves not just the playback side but the production side. Many cameras today may shoot 4K, but editing and processing 4K video afterwards is actually very complicated, to say nothing of the bandwidth it consumes. And beyond whether the bandwidth is even available, every release brings bandwidth consumption, and bandwidth consumption is all cost.

We think ultra-HD is a good direction, but how to solve the many problems along the way is what the technical side needs to focus on.

UHD technology involves a great deal. Simply put: from video input, that is, shooting a video, to the moment the video is finally seen by the user, what exactly should we do?

You may have heard some of the terms in the figure above. Simply put: when a phone shoots a 2K video, how do you upscale it to 4K so that viewers see a near-4K effect? This addresses cost on the production side, because many producers cannot shoot ultra-high-definition themselves.

You may also have heard of narrowband HD and similar technologies, which address how to deliver a high-definition video while controlling overall bandwidth cost. If you are in the HD business, cost is very important. Long video is very typical: most long-video services offer many resolution options, and most companies keep providing clearer and better experiences; Youku, for example, offers its "Frame Enjoy" tier so people can see a better, different experience.

There are also many scenario-specific problems: different kinds of shooting, aerial photography, and sports video all have relatively high clarity requirements, sports especially. Youku broadcasts the World Cup, and you can clearly feel that if the definition is not high enough, you often cannot even see where the ball is in a wide shot. During that period people worked hard on making the picture clearer.

So I think that for many companies, UHD technology needs to evolve to solve the whole chain from production to distribution, processing, and playback. Bandwidth is fundamental: only as bandwidth grows will this become reality.

Because I am now more involved with video, of the two propositions, 5G and cloud native, video is where I see the closest combination.

5G is mainly about low latency and high bandwidth. We need to think about what new business innovations become possible and how the innovation model changes as latency keeps falling. As it does, we are seeing more and more business changes in video scenarios, many of them completely different from before.

Because of the maturity of video, many things that had to happen offline could move online during the epidemic. As society and technology progress, every business system has to think about this; video is where it is most obvious. The other side is bandwidth: which services will consume more and more of it.

Take computing resource consumption as another example. Originally, most computing resources went to online business systems, such as trading systems, which consumed a lot of machines. Then big data became the dominant consumer of computing resources, and after that, AI.

Everything is changing. In every business scenario we should be thinking: what will ever-lower latency bring, what will the change in bandwidth bring, and finally, where are the opportunities to innovate faster on top of the cloud? What matters most about cloud native is how to complete business iteration and innovation better, faster, and with more experimentation. For everyone doing system architecture, this is a topic to think through gradually in combination with one's own business.

As noted above, in the second half of the 5G and cloud-native era, video is the newest and biggest certainty. The video cloud helps content shift from text and images to video, and it changes the way information is exchanged from offline to online.