Videoconferencing has become an increasingly common part of people's daily lives, especially in the wake of the COVID-19 pandemic. The resulting rapid growth of the videoconferencing market has driven continuous updates to Cisco's network video technology. In this session, we invited Thomas Davies, Chief Engineer of Cisco's Collaboration Technologies Group, to share the development history of AV1, the challenges of developing it, and the future of AV2 and its role in real-time communication.

Text / Thomas Davies

Transcript / LiveVideoStack

Hello, I'm Thomas Davies. I'm the Chief Engineer of the Collaboration Technologies Group at Cisco, and today I'd like to talk to you about AV1, Cisco WebEx, and the next generation of video conferencing.

My talk today covers several topics. First, I want to talk about the recent explosion of videoconferencing applications caused by COVID-19. Videoconferencing had already been common for some time, but COVID-19 created a tipping point that changed the landscape of real-time communications. Then I want to give the historical context: the history of the Alliance for Open Media and real-time communication. How did we get to where AV1 is today, and what real-time communication considerations were we weighing while AV1 was being developed? Then I'd like to talk about our AV1 codec on Cisco WebEx and what we are working on for our rollout. Finally, I want to talk about AV2's role in real-time communication: can we achieve even more in this area?

# 1. The development of video conferencing

The first is the explosion of video conferencing.

I think we can all agree we have had an extraordinary year, and from a conferencing point of view it opened a new chapter for us. Starting in February of last year, the number of meetings on our platform grew dramatically; you can see it climbing by the end of February, and by March we had a huge increase in traffic: 10 times, 20 times, 30 times, with over 500 million people attending meetings each month and over 25 billion minutes of meetings per month. Another interesting factor is that we started hosting more team meetings, more educational meetings, and so on, so meeting size increased by 33%, and our use cases shifted somewhat. The huge increase in traffic obviously had an impact on us, because we had to support it, but it also underscores the need to scale. The feedback we get from our customers is that we need to take our technology to the next level. People now know the technology works, but we need to keep improving the tools and techniques to improve the experience. In some cases we have added artificial intelligence and new technologies for background noise suppression or real-time speech-to-text transcription and translation, but we cannot bypass basic video and audio when it comes to improving the quality of the user experience. That means revisiting our video processing pipelines, but it also means new codecs, and we have been using H.264 for a long time.

But the question to ask about the recent experience is: "Is this the new normal?" In one sense, obviously not, because COVID-19 is probably a once-in-a-lifetime event. But even before COVID-19, telecommuting had been growing fairly steadily, up about 30 percent over the last decade. What we see is that people increasingly use video apps and video calling where they might previously have used voice. When the outbreak occurred, many companies began to examine the way their teams work and to consider what will happen once it is over. 74% of US CFOs predict they will continue to rely heavily on remote work after COVID-19; whether that holds remains to be seen, but I suspect many companies will fundamentally change the way they work. The videoconferencing market is expected to grow by 11 percent a year, more than doubling in the next seven years, and it will also become more and more free of charge. In many cases video is replacing audio: people often make video calls simply because their devices come with video calling apps, which changes the way they interact with those devices and the way they use them. COVID-19 will not last, but I think it will have a lasting impact on technology that will change the way people work for a long time to come.

# 2. AOM and real-time communication

We have seen this before. In the early years of the Alliance for Open Media there was also something of a turning point in videoconferencing, particularly the proliferation of software-based platforms, which demanded greater technical capability. We wanted to move forward with other solutions, and the Alliance for Open Media gave us an opportunity to move forward with codecs.

Cisco is a founding member of the Alliance for Open Media. We felt that the existing standards did not serve open media well, and that licensing fees were a barrier, especially for H.265. We had developed a solution around H.265, but its licensing model did not fit: we deploy codecs on software platforms that may have millions of users. At the same time we needed a next-generation video codec, since H.264 had been in use for nearly 20 years, so we developed the Thor codec for RTC. Thor shows how carefully we balance complexity against compression performance. Thor was integrated into the first AV1 test model, which was based on VP9. From the beginning we brought a real-time communication focus to the new standard, to understand the impact of each tool on our use cases.

But what does that really mean? We identified three main requirements for a new video codec in real-time communication. First, it must be less complex than in other use cases, and we particularly wanted to run it in software on commodity PC platforms. That is not because we don't use hardware, but because hardware takes a few years to appear, and then a few more years before good encoders are built on top of it. The second requirement is network resilience: we need standards-compliant tools to detect errors and help recover from them. The third requirement is perhaps more controversial: we wanted to limit the number of profiles in the standard, at least from a tools point of view. Every new profile behaves like a new codec for interoperability purposes, and we would rather interoperate with one new codec than juggle several. We have already seen limited adoption of the H.264 High profile for this reason, as well as limited use of scalability because of the differences between H.264 and H.265 profiles.

In terms of complexity, compared with other use cases we pay a price in quality to get faster running speed. The red circle on the chart marks where we want the codec to operate: we have to be fast, and complexity must be bounded. That means we cannot merely hit the target on average; it is not enough to encode at 30 frames per second overall, we need to meet the time budget for every frame. One of our goals, and I think this is a feature of a well-designed codec standard, is to achieve real gains even at similar complexity, which requires fast operating points. In a video-on-demand scenario you might instead move to higher complexity to achieve those gains; there you might tolerate 5 or 10 times the complexity for a 40 percent bit-rate reduction. But since we have to replace the previous standard with a new one in software, we have only a limited envelope in which to increase complexity. So we need to achieve real gains at similar complexity; we can tolerate a modest increase, but not a huge one.
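To make the per-frame deadline concrete: at 30 fps the encoder has roughly 33 ms per frame, and occasional slow frames cause visible stalls even when the average is fine. Below is a minimal sketch, with hypothetical names and thresholds (this is not WebEx code), of a deadline-driven controller that nudges the encoder's speed preset when recent frames approach the budget:

```cpp
#include <algorithm>
#include <deque>

// Hypothetical deadline-driven speed controller: raises the encoder's
// speed preset (faster, lower quality) when frames near their budget,
// and lowers it again when there is comfortable headroom.
class SpeedController {
public:
    SpeedController(double fps, int min_speed, int max_speed)
        : budget_ms_(1000.0 / fps), speed_(min_speed),
          min_speed_(min_speed), max_speed_(max_speed) {}

    // Call once per encoded frame with the measured encode time.
    int OnFrameEncoded(double encode_ms) {
        recent_.push_back(encode_ms);
        if (recent_.size() > 30) recent_.pop_front();  // ~1 s window at 30 fps
        double worst = *std::max_element(recent_.begin(), recent_.end());
        if (worst > 0.9 * budget_ms_) {
            speed_ = std::min(speed_ + 1, max_speed_);  // speed up
        } else if (worst < 0.5 * budget_ms_) {
            speed_ = std::max(speed_ - 1, min_speed_);  // spend headroom on quality
        }
        return speed_;  // e.g. mapped to a preset like libaom's cpu-used
    }

private:
    double budget_ms_;
    int speed_, min_speed_, max_speed_;
    std::deque<double> recent_;
};
```

Note that the controller reacts to the worst recent frame, not the mean, which is exactly the "not merely average" constraint described above.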

AV1 fulfills this need to a large extent. It gives us the big enhancements any new standard needs, such as screen-content tools and powerful loop filtering, while keeping the complexity of the core tools for good video encoding modest. The loop filters have a certain amount of complexity, but the multi-symbol arithmetic coding is less complex than similar techniques in other standards, the interpolation filters are very simple, and the transforms have fast decompositions. For network resilience, we can detect errors through frame numbers, so we can tell whether we are in sync with the reference frames. With the resilience mode we can parse a frame even when a reference frame has been lost, and we can conceal missing frames: we can still recover information such as motion vectors even without the reference frame and use it for interpolation. We also have scalability in the standard itself. This relates to the third point: there is only one main profile. There are chroma-sampling-based profiles such as 4:4:4, but the main profile contains all the tools, including scalability, which is very useful if you are building an encoder because it gives you the complete toolkit to explore. And the good news is that even where the people defining the standard thought the simple tools might not be the best choice, you can implement them without being constrained by that judgment.
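As an illustration of how these RTC-oriented features surface in practice, here is roughly how one would configure the public libaom reference encoder for low-latency use. This is the reference implementation, not Cisco's WebEx encoder, and the specific settings are assumptions for illustration:

```cpp
#include <aom/aom_encoder.h>
#include <aom/aomcx.h>

// Sketch: configure libaom's AV1 encoder for low-latency RTC.
// Error handling is reduced to a bool for brevity.
bool ConfigureRtcEncoder(aom_codec_ctx_t* codec, int w, int h, int fps,
                         int bitrate_kbps) {
    aom_codec_iface_t* iface = aom_codec_av1_cx();
    aom_codec_enc_cfg_t cfg;
    if (aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_REALTIME) !=
        AOM_CODEC_OK)
        return false;

    cfg.g_w = w;
    cfg.g_h = h;
    cfg.g_timebase = {1, fps};
    cfg.g_lag_in_frames = 0;               // no lookahead: low latency
    cfg.g_error_resilient = 1;             // frames decodable despite loss
    cfg.rc_target_bitrate = bitrate_kbps;  // in kbps
    cfg.rc_end_usage = AOM_CBR;            // constant-bitrate rate control

    if (aom_codec_enc_init(codec, iface, &cfg, 0) != AOM_CODEC_OK)
        return false;

    aom_codec_control(codec, AOME_SET_CPUUSED, 8);  // fast speed preset
    aom_codec_control(codec, AV1E_SET_TUNE_CONTENT, AOM_CONTENT_SCREEN);
    return true;
}
```

The screen-content tuning enables tools such as palette mode, which matter for the shared-content use case discussed below.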

# 3. AV1 development for Cisco WebEx

During AV1's development, we were also developing our own encoder for the standard for Cisco WebEx, implemented as software on ordinary personal computer hardware.

We presented the world's first real-time AV1 HD video encoding: 720p for camera video and 1080p for screen content, demonstrated in New York City in the summer of 2019. Since then our encoder has become around 60% faster, and we have been working hard to provide all the integration and system support AV1 needs for an end-to-end solution.

So what were our concerns? We had to choose between camera input and shared video content, and we decided to start with shared content, because it represents some of the most challenging video we need to encode. Some of it is very simple, like this slide, but people increasingly share all kinds of material: charts, slides, YouTube videos, or hybrid video playing in a browser rather than a native application. So the fidelity requirements are high. Some very colorful material has a very low frame rate but very high resolution, while other scenes have very high motion, and we built an adaptive system to handle both kinds of motion and content.
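As a rough illustration of that kind of adaptation (hypothetical logic and thresholds, not the WebEx implementation): a screen-share encoder can trade resolution against frame rate based on a cheap motion measurement over consecutive captures.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical screen-content mode selection: static content (slides,
// charts) gets full resolution at a low frame rate; high-motion content
// (embedded video) gets a higher frame rate, dropping resolution if the
// bit rate cannot sustain both.
struct CaptureMode { int width, height, fps; };

// Fraction of pixels that changed by more than a threshold between two
// captures of n pixels; a crude proxy for motion.
double MotionFraction(const uint8_t* prev, const uint8_t* cur, int n) {
    int changed = 0;
    for (int i = 0; i < n; ++i)
        if (std::abs(cur[i] - prev[i]) > 12) ++changed;
    return static_cast<double>(changed) / n;
}

CaptureMode SelectMode(double motion_fraction, int bitrate_kbps) {
    if (motion_fraction < 0.05) return {1920, 1080, 5};   // mostly static
    if (bitrate_kbps >= 2000)   return {1920, 1080, 30};  // motion, enough rate
    return {1280, 720, 30};                               // motion, tight rate
}
```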

We needed to integrate AV1 in phases. In the first phase we rolled out AV1 for high-motion content sharing, because high-motion video is the hardest case: it can contain anything found in natural video. We launched that mode in our February release, and our next phase will add a high-resolution mode and automatic adaptation between the two, which we aim to complete in the first half of this year. Future phases will include camera video and transcoding, which matters for interoperating with H.264 attendees in the same conference. Today we run in a backwards-compatible mode: if an AV1-capable attendee and an H.264-only attendee join the same meeting, everyone falls back to H.264. That is obviously inefficient, so we hope to introduce selective transcoding for this case. It does not necessarily mean we will keep that practice forever, and it may not be the most efficient approach, but it will increase AV1 usage and provide the greatest benefit to the largest number of attendees.
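A minimal sketch of that fallback decision follows (hypothetical types; in a real conferencing system this negotiation happens through SDP offer/answer and the media servers):

```cpp
#include <vector>

enum class Codec { kAV1, kH264 };

struct Attendee { std::vector<Codec> supported; };

bool Supports(const Attendee& a, Codec c) {
    for (Codec s : a.supported)
        if (s == c) return true;
    return false;
}

// Backwards-compatible policy: use AV1 only if every attendee can decode
// it; otherwise the whole meeting falls back to H.264. A transcoding
// media server could instead serve AV1 to capable attendees and H.264
// to the rest, which is the selective-transcoding idea described above.
Codec SelectMeetingCodec(const std::vector<Attendee>& attendees) {
    for (const Attendee& a : attendees)
        if (!Supports(a, Codec::kAV1)) return Codec::kH264;
    return Codec::kAV1;
}
```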

What challenges did we encounter in developing the encoder? I think the biggest was keeping AV1's CPU impact very small compared with H.264. That does not mean it has no impact, nor that we never use more CPU; where more CPU is available we can use it. But it does mean that in some cases we must get by with minimal CPU and still achieve the gain, which is very challenging from an encoder-optimization point of view. The second challenge is more of a solution question: how to balance quality and bit rate. One thing COVID-19 changed, in a way, is the emphasis: we really need to provide higher quality, and bit rate is not always the most important thing, but there is usually a trade-off between the two. If you want higher quality, or if the bit rate drops to very low levels, AV1 can help. As I mentioned earlier, we also have to support backwards-compatible behavior in interoperability scenarios. More generally, we face a multidimensional problem: based on the CPU power of whatever device we are running on, we adjust encoder complexity settings, resolution, and bit rate. We can reduce or increase complexity by changing encoder settings, which loses or gains some quality, or we can change the encoding resolution, or the bit rate, each of which also changes the complexity. These involve different trade-offs, and we need a good decision engine to make those choices.
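To illustrate the shape of such a decision engine (a toy sketch under assumed cost and quality models; the real WebEx engine is not public), one can score candidate (speed, resolution, bitrate) operating points against a measured CPU budget:

```cpp
#include <vector>

// One candidate operating point: an encoder speed preset, an encode
// resolution, and a target bit rate, with a rough cost/quality model
// (e.g. calibrated from offline measurements).
struct OperatingPoint {
    int speed;        // higher = faster, lower quality
    int width, height;
    int bitrate_kbps;
    double cpu_cost;  // predicted fraction of one core
    double quality;   // predicted quality score
};

// Pick the highest-quality point that fits the CPU and bandwidth budgets.
// A production system would also smooth decisions over time to avoid
// oscillating between points.
const OperatingPoint* SelectPoint(const std::vector<OperatingPoint>& points,
                                  double cpu_budget, int bandwidth_kbps) {
    const OperatingPoint* best = nullptr;
    for (const auto& p : points) {
        if (p.cpu_cost > cpu_budget || p.bitrate_kbps > bandwidth_kbps)
            continue;
        if (!best || p.quality > best->quality) best = &p;
    }
    return best;  // nullptr if nothing fits: caller must degrade further
}
```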

As we move forward, we are offering more hybrid conferencing, and as I mentioned before, combining multi-stream conferencing with multiple encoders raises a question: if you send multiple layers at different qualities, where should the new codec sit? You can put it at the very low bit rates, in the lowest layer, and make absolutely sure that works, or you can aim it at the highest-quality layer to deliver even better quality. There is also a decoding problem: you may have multiple decoders running different codec standards, and you have to integrate and manage them within the CPU envelope. These are quite difficult technical challenges when building a complete solution.
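For example, in a simulcast setup the "where should AV1 sit" question might be expressed like this (purely illustrative layer ladder; this is not WebEx's policy):

```cpp
#include <vector>

enum class Codec { kAV1, kH264 };

struct SimulcastLayer {
    int width, height, bitrate_kbps;
    Codec codec;
};

// Two illustrative placements for AV1 in a simulcast ladder: at the
// bottom, where its compression gain rescues low-bandwidth receivers,
// or at the top, where it buys extra quality headroom.
std::vector<SimulcastLayer> BuildLadder(bool av1_at_bottom) {
    std::vector<SimulcastLayer> layers = {
        {320, 180, 150, Codec::kH264},    // low
        {640, 360, 500, Codec::kH264},    // mid
        {1280, 720, 1500, Codec::kH264},  // high
    };
    if (av1_at_bottom) layers.front().codec = Codec::kAV1;
    else               layers.back().codec  = Codec::kAV1;
    return layers;
}
```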

# 4. AV2 and RTC

What about AV2 and the next generation of codecs? Where do we see real-time communication going?

In a sense, our requirements have not changed since AV1. Consider the trade-off between quality and speed again: you want to operate at the same or slightly higher complexity than with AV1, but still achieve real gains. One of our testing goals for AV2 is to prove that those gains will materialize. That will be very difficult, because nobody is going to develop a fully optimized real-time encoder while the AV2 standard is still in progress; you can get early insight along the way, but you will not have a completely optimized solution at every operating point. AV2 is unique, though, in having a software implementation working group, which promises insight into implementation issues, perhaps not at real-time-communication speeds, but certainly faster than the maximum-compression settings encoders can provide. For video on demand, I still think codecs cannot support an enormous complexity increase over their predecessors. Ideally they should seek a modest increase, perhaps beyond what we tolerate for real-time communication; 5 to 10 times is probably a reasonable goal, but still much lower than past jumps. If those curves do not overlap, then I think we have done a pretty good job; but how do we make sure they do not overlap, so that we have gains across both the speed range and the quality range? (As the curves show, if we can reduce complexity, we can apply that budget to quality: if you move the blue curve to the right or up, you have more room to improve quality at the same speed, or speed at the same quality.)

I think we need to keep some principles in mind. The first is that a new software encoder cannot rely on having more CPU. That may seem surprising, because we expect a next-generation codec to ride Moore's law. That is partly right, but people keep their computers longer and longer, and single-core performance has not improved the way it used to, even though core counts keep rising. It is also important to remember that other programs on the device are using the CPU too. One challenge for a web-conferencing application is that we must share the CPU with other running programs that use a lot of computing power, and we have to fit in without causing problems for those applications, so we need to be very careful about how much CPU is actually available. At the same time we want a huge gain, ideally another 50% bit-rate reduction. To implement AV2 software encoding with low complexity on these ordinary computers, I think we need what might be called "scalable complexity": we want a path through the standard, and through still-simple encoders, that is ideally even simpler than the previous standard. That means the common core tools of AV1 should maintain or reduce their complexity over time. This is hard to do, because the way these tools are improved makes it difficult to predict how they will be optimized in real encoders, but the prize is very large: reducing the complexity of all the tools means an actual increase in quality at real-time communication speeds.

Now, as these coding modes are strengthened, the number of choices throughout the process increases, so the reference implementation slows down, which is obviously not good for any encoder. But that is not a complete disaster, because an intelligent encoder can avoid part of that complexity. What you cannot avoid, and do not want to avoid, is the complexity of the tools themselves, because they are so useful. It also means that pre-analysis and machine learning for encoder complexity management will become more and more important: we do not have time to evaluate every mode exhaustively, so we have to do something that saves time and effort, and more and more work will go into reducing full search algorithms to a minimal set of operations.
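A toy sketch of what such pre-analysis-driven pruning can look like (illustrative only; real encoders use far richer features and trained models than a single scalar estimate):

```cpp
#include <cfloat>
#include <vector>

// A candidate coding mode for a block. The cheap estimate comes from
// pre-analysis (e.g. SAD plus a per-mode bias); the full RD cost stands
// in for the expensive transform/quantize/bit-count evaluation.
struct ModeCandidate {
    double estimated_cost;  // cheap pre-analysis estimate
    double full_rd_cost;    // exact rate-distortion cost
};

// Evaluate candidates in order of their cheap estimates, and stop the
// full search once an estimate exceeds the best exact cost found so far
// by a margin; the margin trades search time against decision quality.
int SearchWithEarlyExit(const std::vector<ModeCandidate>& sorted_by_estimate,
                        double margin) {
    double best_cost = DBL_MAX;
    int best_index = -1;
    for (int i = 0; i < static_cast<int>(sorted_by_estimate.size()); ++i) {
        const ModeCandidate& m = sorted_by_estimate[i];
        if (m.estimated_cost > best_cost * margin) break;  // prune the rest
        if (m.full_rd_cost < best_cost) {
            best_cost = m.full_rd_cost;
            best_index = i;
        }
    }
    return best_index;
}
```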

# Summary

In short, the Alliance for Open Media ecosystem has helped us take the next step in videoconferencing technology. It has helped us move beyond the venerable H.264 video codec, and we are now shipping real-time AV1 with complexity similar to that of H.264/AVC while achieving significant gains. I think that shows AV1 is a well-designed standard. It is not perfect, but its core is very useful for real-time communication applications, and that is our expectation for AV2 as well: if it is designed with the same good principles, and we equip it with intelligent encoders, then AV2 will achieve further gains. Finally, I want to say thank you very much, and I welcome your questions.

Thank you!