By Chad Hart

Original link: webrtchacks.com/webrtc-toda…

Scalable Video Coding (SVC)

Scalable Video Coding (SVC) is arguably a better way than simulcast to send media from one sender that must adapt to different conditions for each receiver in a group call. In many ways it is also considered more complex. Sergio and Gustavo have written an excellent article on the subject.

**Chad:** If simulcast isn't there, where is SVC?

**Bernard:** In some ways SVC is easier than simulcast. Today it's in Chromium as an experimental implementation of temporal scalability. Temporal scalability is also supported in Plan B – so it already exists and is supported by conferencing servers. So for most conferencing services this is actually an easier step forward in some ways than, for example, supporting both RID and MID.

MID is the SDP media identifier, and RID is a newer restriction identifier used to restrict individual streams. I'll leave it to the reader to review the various SDP specifications for more background on these.

**Bernard:** I think RID and MID are supported by several conferencing servers – both Medooze and Janus. One of the things to understand about SVC is that it is required in both VP8 and VP9 – the decoder must support it. Therefore there is nothing to negotiate; the encoder can just send it. The SFU doesn't even have to drop [the SVC layers] if it doesn't want to, although it's obviously better if it does.
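For readers who want to see roughly what the WebRTC-SVC control surface looks like, here is a minimal sketch (my addition, not from the interview). It asks the encoder for temporal layers via `scalabilityMode`; support is still experimental and varies by browser and codec.

```typescript
// Hedged sketch: request temporal scalability through the WebRTC-SVC extension.
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

pc.addTransceiver(track, {
  direction: 'sendonly',
  // 'L1T3' = one spatial layer, three temporal layers; an SFU can drop the
  // upper temporal layers for constrained receivers without re-encoding.
  sendEncodings: [{ scalabilityMode: 'L1T3' }],
});
```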

AV1

Chris Wendt wrote here a long time ago about the codec wars between the H.26x and VPx camps and the potential for one codec to rule them all. Today that codec exists, and it is called AV1.

So when will AV1 be usable in WebRTC?

**Bernard:** The challenge [with AV1] is trying to figure out how to make encoding at full resolution useful and usable before a lot of devices support it.

**Chad:** I should explain for our listeners that AV1 is the next-generation open source, royalty-free codec.

**Bernard:** AV1 itself does not require any changes to the WebRTC PeerConnection API. But, as an example, AV1 supports many new scalability modes, so you need a way to control them – that is where WebRTC-SVC comes in.

The other thing is that AV1 has very efficient screen content coding tools that you want to be able to turn on. So we added something called a content hint, which can cause the AV1 screen content coding tools to be enabled.
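The content hint Bernard mentions is the existing `MediaStreamTrack.contentHint` property. A rough sketch of using it for screen content follows; whether a given encoder actually acts on the hint is up to the browser.

```typescript
// Hedged sketch: mark a screen-capture track as text/screen content so encoders
// with screen-content tools (like AV1's) may enable them.
const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
const [track] = stream.getVideoTracks();
track.contentHint = 'text'; // other video hints: 'motion', 'detail', ''
```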

Florent [Castelli] proposed what is called mixed-codec simulcast. The idea is that if, for example, you want to send something like 360p and 720p and you have a machine that can do it, you can encode the AV1 at a low bitrate. You can do that in software – no hardware acceleration required. Then, at the higher resolution, you use another codec, for example VP8 or VP9.

This way you can introduce AV1 right away without forcing everything to move over at once. With mixed-codec simulcast, content hints, and SVC, it is basically just a matter of when the AV1 encoder and decoder land in the WebRTC PeerConnection. I don't think AV1 has been given much thought yet, but with these extensions (and very few tweaks to the API) our goal is to make it available immediately.
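As a rough illustration of the mixed-codec simulcast idea, here is a hedged sketch reusing `pc` and `track` from the SVC sketch above. The per-encoding `codec` field is a newer extension to `RTCRtpEncodingParameters` and may not be available in your browser; the rid names and scale factors are just examples.

```typescript
// Hedged sketch of mixed-codec simulcast: a low-resolution AV1 layer encoded in
// software next to a higher-resolution VP9 layer.
const caps = RTCRtpSender.getCapabilities('video');
const av1 = caps?.codecs.find((c) => c.mimeType === 'video/AV1');
const vp9 = caps?.codecs.find((c) => c.mimeType === 'video/VP9');

pc.addTransceiver(track, {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'low',  scaleResolutionDownBy: 4.0, codec: av1 }, // e.g. ~360p AV1 in software
    { rid: 'high', scaleResolutionDownBy: 1.0, codec: vp9 }, // e.g. 720p VP9
  ],
});
```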

We are not far from that goal. Dr. Alex is writing a test suite. The encoder and decoder libraries are there, so that part is not particularly complicated. The RTP encapsulation is not particularly complicated either – it is very, very simple.

**Chad:** So what makes it hard?

**Bernard:** The tricky part is what we call the dependency descriptor header extension, which the SFU uses for forwarding. The hard part is building support for it into the conferencing server. AV1 inherently supports end-to-end encryption [E2EE], which is where Insertable Streams come in.

In fact, AV1 as a codec is not that different [from earlier codecs]. I think of it as the next step in the VP8/VP9 lineage. It has some H.264-style NAL unit semantics, so it's kind of a crossover between H.264 and VP9.

But from the point of view of the overall conferencing server usage model, it is quite unique because of end-to-end encryption: for example, the SFU should not parse the AV1 OBUs. The SFU should make its forwarding decisions purely from the dependency descriptor in order to allow end-to-end encryption. So, in essence, you have moved to the next model, where the SFU can be codec-independent.

Insertable Streams and SFrame

Insertable Streams is a topic loosely related to codec independence and directly related to end-to-end encryption (E2EE). In fact, we have already published on the subject, and Emil Ivov took a deep look at E2EE at Kranky Geek a few weeks ago.

I'll let Bernard explain the use cases for the Insertable Streams API.

Insertable Streams slide from TPAC. Source: TPAC 2020 meeting slides (docs.google.com/presentatio…

**Bernard:** End-to-end encryption is not just one simple use case. Insertable Streams is really this idea – in the Insertable Streams API model, one way to think about it is that you have access to the frame. You can perform operations on the frame, but you do not have access to the RTP headers or RTP header extensions or the like. You should not change the size of the frame significantly, so you cannot add a lot of metadata to it. You operate on the frame and then essentially hand it back to the packetizer, which packetizes it as RTP and sends it out. So it is tied to RTP.
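As a concrete sketch of that model (my addition): this uses the Chromium `createEncodedStreams()` shape of the API, while the standards-track variant moves the transform into a worker via `RTCRtpScriptTransform`.

```typescript
// Hedged sketch: tap the sender's encoded frames, transform them, and hand them
// back to the packetizer. You see frame payloads, never RTP headers.
const pc = new RTCPeerConnection({ encodedInsertableStreams: true });
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const sender = pc.addTrack(stream.getVideoTracks()[0], stream);

const { readable, writable } = sender.createEncodedStreams();
readable
  .pipeThrough(new TransformStream({
    transform(encodedFrame, controller) {
      const payload = new Uint8Array(encodedFrame.data);
      // ... e.g. encrypt `payload` in place with your own keys (E2EE) ...
      encodedFrame.data = payload.buffer; // do not grow the frame significantly
      controller.enqueue(encodedFrame);   // back to the packetizer -> RTP
    },
  }))
  .pipeTo(writable);
```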

There are other APIs under development that operate on the same idea of giving you access to video frames.

The most prominent are WebCodecs and Insertable Streams for raw media. One way to think of the raw-media version is as an extension of MediaStreamTrack, because Insertable Streams for raw media do not depend on RTCPeerConnection, whereas Insertable Streams for encoded media do. In all of these APIs you can access video frames (raw or encoded), perform operations on them, and then eventually hand them back. In the Insertable Streams case, the frame is then packetized and sent over the network.
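A corresponding sketch for the raw-media side (my addition): `MediaStreamTrackProcessor` and `MediaStreamTrackGenerator` are currently Chromium-only and need no RTCPeerConnection.

```typescript
// Hedged sketch: read raw VideoFrames from a camera track, optionally process
// them, and feed them back out as a new track.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [camera] = stream.getVideoTracks();

const processor = new MediaStreamTrackProcessor({ track: camera });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });

await processor.readable
  .pipeThrough(new TransformStream({
    transform(frame, controller) {
      // Inspect or redraw the raw VideoFrame here (funny hats, background blur, ...).
      // Always close frames you do not forward.
      controller.enqueue(frame);
    },
  }))
  .pipeTo(generator.writable);

// `generator` behaves like a normal video MediaStreamTrack you can render or send.
```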

There are some tricky aspects and some bugs have been filed. It works with VP8 and VP9 today, but it does not work with H.264 – I'm not sure about that part, but there is a bug we are still working on.

Equally important is the idea that we are not trying to tell developers how to do their encryption or which key management scheme to use. We are standardizing the end-to-end encryption format, SFrame, in the IETF. We have not fully agreed on a key management scheme – as it turns out, there are multiple scenarios that may require different key management.

Secure Frames, or SFrame, is a relatively new proposal for allowing end-to-end encrypted media through an SFU by encrypting whole media frames rather than individual packets. Since there can be many packets per frame, this is more efficient.

IETF Secure Frame (SFrame) proposal

(datatracker.ietf.org/doc/draft-o…

**Bernard:** One of the cool things that makes SFrame more scalable is that you operate on a complete frame, not on packets. That means if you create a signature, you do it once over the entire frame. Digitally signing every packet is not really considered feasible – a keyframe, for example, can span a lot of packets, each of which would need to be signed. With SFrame, you just sign each frame.

As a result, it leads to a significant reduction in the signing workload. So it now becomes practical to do origin authentication – knowing who each frame is coming from – which is not possible in a per-packet model.
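To make the arithmetic concrete, here is a hedged toy sketch (my addition, not the actual SFrame wire format or key management) that authenticates once per encoded frame with an HMAC: a keyframe split across dozens of RTP packets still costs a single tag.

```typescript
// Toy illustration of per-frame authentication (not real SFrame): one HMAC tag
// per encoded frame instead of one signature per RTP packet.
const key = await crypto.subtle.generateKey(
  { name: 'HMAC', hash: 'SHA-256' },
  false,
  ['sign', 'verify'],
);

async function tagFrame(encodedFrame: { data: ArrayBuffer }) {
  const tag = await crypto.subtle.sign('HMAC', key, encodedFrame.data);
  // A keyframe spanning ~50 RTP packets still produces exactly one tag here;
  // a per-packet scheme would need ~50 signatures.
  return { frame: encodedFrame, tag: new Uint8Array(tag) };
}
```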

Everyone seems to agree that only one SFrame format is needed, but key management is a much trickier business. At TPAC we discussed the possibility of building SFrame into the browser – having a native implementation of SFrame. We are not at the point where we think we can have native key management; that is a very tricky thing because you could end up with five key management schemes in the browser.

WebCodecs

WebCodecs continues the theme of giving developers deeper access and more control lower in the media stack. I'll let Bernard explain what it is:

**Bernard:** WebCodecs gives you lower-level access to the codecs that are already in the browser. Essentially, the way to think about it is that it is similar to Insertable Streams in that you get access to frames. For example, you can get access to an encoded frame, or you can feed in a raw frame and get an encoded frame out.
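A minimal sketch of that encode path (my addition; codec strings, bitrates, and hardware support all vary):

```typescript
// Hedged sketch of the WebCodecs encode path: feed raw VideoFrames in, get
// EncodedVideoChunks out.
const encoder = new VideoEncoder({
  output: (chunk, metadata) => {
    // chunk is an EncodedVideoChunk ('key' or 'delta') ready for you to packetize.
    console.log(chunk.type, chunk.byteLength, chunk.timestamp);
  },
  error: (e) => console.error(e),
});

encoder.configure({
  codec: 'vp8',
  width: 640,
  height: 480,
  bitrate: 500_000, // bits per second
  framerate: 30,
});

// Later, for each raw frame (e.g. from MediaStreamTrackProcessor):
// encoder.encode(frame, { keyFrame: false }); frame.close();
```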

**Chad:** OK, so it's lower-level, more direct access to the encoder and the decoder?

**Bernard:** Yes. On the decoding side it is similar to what we call MSE.

**Chad:** Media Source Extensions?

The Media Source Extensions (MSE) API replaced much of what Flash did for streaming with standardized JavaScript. It lets developers feed containerized media to the browser for playback, even media with DRM content protection. Here's the MDN link for more information.

developer.mozilla.org/en-US/docs/…

So how does this compare to MSE?

**Bernard:** On the decoding side, the way you deal with WebCodecs is similar to MSE, except that the media is not containerized – you work with encoded video frames directly. In that way they are similar.

People ask me, "How do all these things fit together?" For example, if you want to stream games or movies, you can use WebTransport to receive the encoded media. Then you can render it with MSE, or you can render it with WebCodecs. The difference with MSE is that you have to deliver containerized media; with WebCodecs the media is not containerized, so it is a little bit different. With MSE you can also get content protection support. With WebCodecs, at least today, you do not get content protection support.
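A hedged sketch of that decode path (my addition), assuming encoded payloads are arriving over something like WebTransport; the `onEncodedPayload` helper and its framing are made up for illustration.

```typescript
// Hedged sketch: unlike MSE, WebCodecs takes bare (non-containerized) chunks.
const decoder = new VideoDecoder({
  output: (frame) => {
    // Render the decoded VideoFrame (e.g. ctx.drawImage(frame, 0, 0)), then release it.
    frame.close();
  },
  error: (e) => console.error(e),
});
decoder.configure({ codec: 'vp8' });

// Hypothetical callback for each encoded payload received from the network:
function onEncodedPayload(data: Uint8Array, timestampUs: number, isKey: boolean) {
  decoder.decode(new EncodedVideoChunk({
    type: isKey ? 'key' : 'delta',
    timestamp: timestampUs,
    data,
  }));
}
```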

**Chad:** How do MSE and WebCodecs differ on the encoding side?

**Bernard:** That is interesting because, if you think about it, with a cloud game or a movie coming down from the cloud you never encode in the browser, you only decode. So that scenario does not actually need WebCodecs for encoding at all. An encoding scenario would be something like video upload: if you want to upload a video, you can encode it with WebCodecs and send it over the network. You can send it either over a reliable stream or as datagrams. If you use datagrams, then you have to do your own retransmission and your own forward error correction.

If you are not too concerned about latency for a video upload, you can just use a reliable stream. So that is a scenario that uses WebCodecs as an encoder, and I think that use case is one where WebCodecs has a real advantage, because you don't have to play weird tricks like drawing to a canvas or anything like that.
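A hedged sketch of that upload case (my addition): the URL, the length-prefixed framing, and the codec settings are all illustrative, and a datagram variant would need its own retransmission/FEC.

```typescript
// Hedged sketch: encode with WebCodecs and push each chunk over a single
// reliable WebTransport stream.
const transport = new WebTransport('https://upload.example.com/ingest'); // hypothetical endpoint
await transport.ready;
const writer = (await transport.createUnidirectionalStream()).getWriter();

const encoder = new VideoEncoder({
  output: async (chunk) => {
    const buf = new Uint8Array(chunk.byteLength);
    chunk.copyTo(buf);
    // Minimal length-prefixed framing so the server can find chunk boundaries.
    const header = new Uint8Array(new Uint32Array([buf.byteLength]).buffer);
    await writer.write(header);
    await writer.write(buf);
  },
  error: (e) => console.error(e),
});
encoder.configure({ codec: 'vp8', width: 1280, height: 720, bitrate: 2_000_000 });
```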

Does WebRTC have a future in the face of these alternatives?

Sending video is one of the big things WebRTC does. Will WebTransport combined with other APIs like WebCodecs, or your own codec compiled to WASM, replace WebRTC? In fact, that is what Zoom does (as we have discussed before), and some members of the Google Chrome team even promoted the approach in a recent webcast.

**Chad:** Is the direction to let people figure that out and do it themselves? Or do you think there will be a parallel track to standardize these mechanisms?

**Bernard:** That is a real question. In a sense you are free to do whatever you want if you control everything end to end. But, for example, many people today want to use an open source SFU. You can't just send whatever you want to an open source SFU – it has expectations about what it will receive. That may not matter in a simple scenario like video upload, but it does matter in a scenario like conferencing, where the important thing is that there is a standard so everyone understands what is expected.

Another thing to consider is performance, because I know people have raised the possibility of trying to build a conferencing service on WebTransport. I am deeply concerned about that because, especially today, if you look at conferencing services there is huge demand for larger and larger galleries – seven by seven, or who knows how big they will get.

The demand from education seems insatiable – teachers want to see everyone in a class, and classes can be huge. In that case you are looking at a surprising number of streams, possibly in HD, and performance really matters. With that disaggregated model, a lot of the code runs in WASM, and it is a real question how many times everything gets copied – that is how it works today. In WebTransport, for example, you have two copies on receive. Any time you pass something into WASM you have another copy. And not everything is moved off to a separate thread.

**Chad:** I think there's a lot of potential for inefficient use of resources – the browser has a lot of work to do managing all of those resources.

**Bernard:** Yes. People do complain that WebRTC is monolithic, but on the other hand there are huge optimization opportunities when it is a single code base that is not all running in JavaScript. You can eliminate the large number of copies that can exist in the disaggregated model.

Machine learning

ML is a pervasive topic in computing. We even focused our 2018 Kranky Geek event on ML in RTC a few years ago. We have seen improvements in ML within JavaScript, such as my "don't touch your face" experiment and the progress in background removal/replacement in various WebRTC applications. Most of these work around WebRTC rather than using it directly. In fact, ML seems conspicuously absent from WebRTC's lower levels, so I asked Bernard about it.

**Bernard:** When we started talking about WebRTC-NV, one of the things we did was put together the NV use cases and try to evaluate what people were passionate about doing. It turns out that one of the things people are most interested in, besides end-to-end encryption, is access to raw media, because that opens up the whole world of machine learning.

**Chad:** Let me clarify – is access to raw media mainly about reducing latency? In my experiments I found it hard to get these things running in real time when there is a lot of inherent latency in the stack.

**Bernard:** Many of the scenarios we see involve local processing. For example, you have captured media and you want to do something to it before you send it. Many Snapchat-style effects work this way. This is what we call "funny hats", where you find the face position and put a hat on it or something. Another very popular feature is custom backgrounds, where you detect the background and change it in some way – there are dynamic backgrounds and so on.
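As a toy illustration of that local-processing pattern (my addition): blur each captured frame on an OffscreenCanvas before sending. A real background-replacement pipeline would run a segmentation model and composite the result rather than blurring the whole frame.

```typescript
// Hedged sketch: per-frame processing for a raw VideoFrame pipeline
// (e.g. inside the MediaStreamTrackProcessor transform shown earlier).
const canvas = new OffscreenCanvas(640, 480);
const ctx = canvas.getContext('2d')!;

async function processFrame(frame: VideoFrame): Promise<VideoFrame> {
  ctx.filter = 'blur(8px)'; // stand-in for real segmentation + compositing
  ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);
  const processed = new VideoFrame(canvas, { timestamp: frame.timestamp });
  frame.close(); // release the original frame promptly
  return processed;
}
```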

Today a lot of machine learning is done in the cloud – speech transcription or translation, for example. You send the audio to the cloud, and I don't know whether we could do that locally, let alone in a web browser. There are other things that can be done locally, like face detection and body pose.

The overall long-term goal is to be able to do on the web anything you can do natively. That requires not only access to raw media, but access to it in an efficient way. In a native implementation, for example, we often see everything staying on the GPU and being processed there. We are not there yet. To get there we need to capture to the GPU without copies and then let the machine learning operations run to completion without copying back to main memory and uploading and downloading again.

Source: Kranky Geek Virtual 2020 – Google WebRTC Project Update (youtu.be/-thoaymtJP8…

On a related note, Google actually mentioned "zero-copy video capture" for keeping frames on the GPU:

**Bernard:** This was a topic that came up at a W3C workshop. One of the concepts that came out of it is the Web Neural Network API (WebNN). What you have seen so far is a lot of libraries like TensorFlow.js using things like WebGL or WebGPU. But if you think about it, that is not a completely efficient approach. What you really want is for something as basic as matrix multiplication to run efficiently, and just expressing it as WebGPU or WebGL operations does not necessarily get you there. So WebNN tries to handle these operations – such as matrix multiplication – at a higher level.

One key point is that all of these APIs have to work together so that data can be handed to the right place without copying it for another API. For example, WebCodecs does support the concept of GPU buffers, but with some limitations, because many times those GPU buffers are not writable – they are read-only. So if your goal is to do machine learning and modify things in GPU memory, you cannot do that without copies, but you still try to squeeze out as much performance as possible.
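For a rough sense of the WebNN shape Bernard describes, here is a hedged sketch (my addition). The API is still evolving, and the method and descriptor names below follow one recent draft, so treat it as an assumption-laden illustration rather than a stable API reference.

```typescript
// Hedged sketch of the WebNN idea: express matrix multiplication as a graph op
// rather than hand-rolled WebGL/WebGPU shaders. Names may differ across drafts.
const context = await navigator.ml.createContext(); // backend selection (CPU/GPU/NPU)
const builder = new MLGraphBuilder(context);

const a = builder.input('a', { dataType: 'float32', shape: [1, 256] });
const b = builder.input('b', { dataType: 'float32', shape: [256, 128] });
const c = builder.matmul(a, b); // a single high-level matmul operation

const graph = await builder.build({ c });
// Execution then goes through the context (compute()/dispatch() depending on the draft).
```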

One product that really caught my eye in 2020 is NVIDIA's Maxine. NVIDIA uses generative adversarial networks (GANs) on its GPUs to capture a small number of keyframes, continuously extract facial keypoints, and then combine the keyframe data with the keypoints to reconstruct the face. NVIDIA claims this approach uses a tenth of the bandwidth of H.264. It also enables new features, since the reconstructed model can be adjusted to do things like super resolution, face re-alignment, or animated avatars.

**Chad:** This seems like one of the more revolutionary uses of ML in RTC. Is this also a direction for the standards?

**Bernard:** If you look at research into the next generation of codecs, there is a lot of machine learning work going on right now – just on the codec side. The way I think about it: look around during the pandemic and what you see is a fusion of entertainment and real-time meetings. A lot of shows – Saturday Night Live, for example – are being made over video conferencing. In the plays I have seen, the characters have their own backgrounds. We have even seen some movies made with conferencing technology. In Microsoft Teams we have seen what is called Together Mode, which essentially takes the user input from the meeting service and transfers it into a completely synthetic new event. The basketball players are real, but it combines the game with fans who are not actually there. So you construct the environment – augmented reality/virtual reality. I see a fusion of entertainment and real-time scenarios, and that is reflected in tools like WebTransport and WebCodecs. There is both RTC and streaming; all of these scenarios blend together.

Machine learning can be the director, it can be the camera operator, it can be the editor – it can tie the whole thing together. Every aspect of it can be influenced by machine learning.

I don't think of it as just traditional media. I don't think we should see these APIs as just a way of doing the same meetings as before with a new API – there is not much incentive for anyone to rewrite their meeting service on a whole new set of things. But I think it enables whole new combinations of entertainment and meetings that we can't even imagine today. Many of them seem to have an AR/VR flavor to them.

**Chad:** OK, so there will be more fusion of real-time media types, directed by artificial intelligence.

What to do now?

**Chad:** Before we go, is there anything else you want to say?

**Bernard:** A lot of this new technology is available in origin trials. It is very enlightening to use it and try to put things together to see how it works, because you are sure to find a lot of shortcomings. I'm not saying all of these APIs are in any sense finished – they are not. But I think it gives you a sense of what is possible and what you can do. People will be surprised by how quickly some of these technologies come out. I would say that in 2021 these things will probably start hitting the market and you will see some of them in commercial applications. People often dismiss this as something that doesn't exist today or that they don't need to think about – I think they are wrong, and those who do will end up very surprised.