In recent years, RTC (real-time audio and video communication) has been widely used in chat rooms, live co-hosting (connecting the mic), video conferencing, interactive classrooms and other scenarios. With a delay of 200-300ms it can meet the interaction requirements of most scenarios. However, some scenarios still have very strict latency requirements, where the level of delay directly affects the user experience, such as online KTV and cloud gaming. Based on a talk given by Chen Xi, industry solution director at ZEGO, in a LiveVideoStack public class, this article analyzes in detail the ideas for optimizing the ultra-low latency experience, combined with ZEGO's engineering experience in the real-time chorus scenario.

Speaker | Chen Xi

Transcribed by | LiveVideoStack

Chen Xi's public class session

Hi, everyone. I'm Chen Xi from ZEGO. I'm currently mainly responsible for solution architecture, including the design of new products and new scenarios and the maintenance of online projects.

The theme of this session is how real-time technology can be optimized for ultra-low latency, and how new scenarios, such as real-time chorus, can be built on top of ultra-low latency support. The content is divided into four parts: first, the development of RTC and innovation in pan-entertainment scenes; second, ideas for optimizing the ultra-low latency experience; third, extreme engineering experience in the real-time chorus scenario; and finally, new scenarios to be launched and the future prospects of the RTC industry.

01 RTC development and pan-entertainment scene innovation

Let's start by reviewing the evolution of online real-time entertainment. More than ten years ago there was no concept of real-time interaction; it was mainly one-way live streaming, and, limited by the technology of the time, the delay was as high as 3-5s. Over the past decade, traditional CDN technology has barely been able to compress the delay to about 1s, so "pseudo-real-time interaction" appeared. With a 1s one-way delay, a round trip is about 2s, which people used to face-to-face communication cannot adapt to. With the rise of Google's WebRTC technology framework, the delay has in recent years been gradually compressed to 500ms, then 300ms and 200ms, which are the typical delays of today's online voice chat, co-hosting (connecting the mic) and team-up gaming.

Looking at the existing innovation scenarios, they mainly revolve around the real-time shared experience. Humans are social animals, hard-wired to share words, emotions, and even body language and expressions with others. Because of the pandemic, together with growing work pressure and a faster pace of life, people may prefer to stay at home on weekends rather than socialize offline. Creating an online real-time sharing experience close to the offline one has therefore become a mainstream demand. At present, the shared-experience scenarios mainly revolve around "together +": playing, watching, listening and singing together, and even the future trend of real-time online interaction, the metaverse, which delivers an immersive experience through VR/AR and other devices.

In daily life, couples and friends share a pair of headphones to enjoy a good song together. In the "together +" series, "Listen Together" brings this sharing mode online. Ximalaya launched "Listen Together" in March this year: users can invite friends into their rooms to share music, comment at any time, open the mic and chat, and so on.

When we share a good movie or a funny clip with friends, just sending a link does not tell us whether they have watched it or how they feel about it. "Watch Together" avoids this problem: by pulling friends into the room where the video is playing, everyone can intuitively see each other's reactions, connect the mic, turn on the camera and communicate in real time, truly simulating the experience of offline social interaction.

In the past, Douyu game hosts could only interact weakly with fans through typing while streaming games, and might miss many messages while playing. In "Play Together", the host or any player can stream a game in real time with video and sound effects while maintaining a voice connection with fans and teammates, which is a qualitative leap compared with the previous interaction mode.

"Sing Together", which includes the immersive "watch a concert together" in the metaverse, requires far more extreme delay. For the scenarios introduced above, about 200-300ms meets current demand. "Sing Together" simulates friends going to a KTV room offline, each holding a microphone, with two, three or even four people singing a song at the same time; the melody of the song is fixed and will not stop. If A and B sing in chorus, A needs to hear B singing at the same pace and rhythm, without network transmission delay disrupting the rhythm. These requirements could not be met in the industry before, so what existed was mostly solo singing, singing in turns, or grabbing the mic, where the transitions in the latter two leave room to absorb the delay.

You can listen to the chorus effect of an app built on the "Sing Together" solution. The male voice in the audio sings the harmony, roughly a third or a fifth above the lead singer's melody. Whether from the audience's point of view or from the two singers' points of view, you can hear that the delay is very low, about 70-80 milliseconds, and you can hardly feel the network between them.

02 Ideas for optimizing the ultra-low latency experience

The RTC experience on the viewing side is evaluated mainly in terms of three indicators: low latency, smoothness, and high quality (audio and picture). As RTC scenarios are continuously optimized, these three indicators behave like the three angles of a triangle whose sum is always 180°: they cannot all be improved at the same time. For example, reducing delay means reducing the buffer, so smoothness declines and the achievable bit rate drops; and since many users have unstable downlink bandwidth, they cannot sustain the high bit rate that high audio and picture quality requires.

Even so, after 20 years of accumulated operating experience and a daily accumulated usage time of 2 billion, and based on analysis of users' usage behavior, we have obtained the tuning data and have so far basically achieved low delay, smoothness, high audio quality and high picture quality together.

Whether for audio or video, the whole chain runs from capture, pre-processing (beautification, voice changing, echo cancellation, etc.), encoding (for example AAC), transmission through MSDN, to decoding, post-processing and hardware rendering on the playback side. Each step both produces data and consumes the data of the previous step. As a result, an increase of delay in any link becomes the short plank in the barrel effect, increasing the delay of the whole pipeline or even breaking it. The "shelf life" of information in each step is very short, so the speed of every stage has to match the others and must not become the short plank when jitter appears. This has to be achieved step by step, through balancing and trade-offs based on a large amount of online operating experience and deep low-level optimization — this is the idea of extreme engineering.

On the capture side, whether iOS or Android (especially Android), we use new low-level interfaces and tune parameters to achieve lower capture latency. In the pre-processing stage, we minimize unnecessary pre-processing and further streamline the necessary algorithms such as 3A and voice beautification, reducing the delay further. In the encoding stage, Opus should be used wherever possible (in the figure on the right, the vertical axis is delay and the horizontal axis is bit rate); the figure shows that the delay of Opus stays very low as the bit rate goes from low to high. Optimization of network transmission includes real-time monitoring and active probing. The same applies to the playback-side buffer: pushing the stream is only half of the story, and how low the overall latency can go also depends on low-latency rendering and decoding on the pull side. You might guess that we simply set the JitterBuffer very low, to just a few tens of milliseconds, but the JitterBuffer is in fact a dilemma: when it is set high, the stutter rate drops but the delay rises, so the setting has to be balanced with a large amount of data.
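As an illustration of this trade-off, here is a minimal sketch (not ZEGO's actual implementation) of how a playback-side jitter buffer target could be derived from observed arrival jitter; the class name, window size and thresholds are assumptions.

```python
import collections

class AdaptiveJitterBuffer:
    """Toy jitter-buffer sizing: pick a target depth from recent arrival jitter.

    Illustrative sketch only. A larger target absorbs more jitter (fewer
    stutters) but adds delay; a smaller target does the opposite, which is
    exactly the dilemma described above.
    """

    def __init__(self, min_ms=20, max_ms=200):
        self.min_ms = min_ms          # lowest delay we allow (a few tens of ms)
        self.max_ms = max_ms          # cap so delay never grows unbounded
        self.jitter_samples = collections.deque(maxlen=500)

    def on_packet(self, expected_interval_ms, actual_interval_ms):
        # Jitter = deviation of the actual inter-arrival gap from the nominal
        # frame interval (e.g. 20 ms for 50 audio frames per second).
        self.jitter_samples.append(abs(actual_interval_ms - expected_interval_ms))

    def target_depth_ms(self):
        if not self.jitter_samples:
            return self.min_ms
        ordered = sorted(self.jitter_samples)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]   # cover 95% of observed jitter
        # Keep enough buffer to ride out typical jitter, but clamp both ways.
        return max(self.min_ms, min(self.max_ms, int(p95 * 2)))
```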

Finally, on the premise of guaranteed stability and smoothness, the iOS delay is below 70ms and the average Android delay is below 86ms (below 86ms on Qualcomm chips, slightly above 86ms but below 100ms on HiSilicon chips); Windows performance is in the normal range. From the perspective of vocal music and human hearing, in the real-time chorus scenario, as long as the end-to-end delay is stably controlled within 130ms, the lead and accompanying singers can hear each other's singing without feeling out of step, so the performance is not affected.

ZEGO keeps the network side stable and reduces RTT and jitter as much as possible by optimizing the network transmission link. The optimization has two parts:

First, more than 500 nodes are deployed globally, covering more than 200 countries. Nodes actively probe each other to detect the links, the SDN routes and the current state between nodes, rather than passively relying on logs collected from online user traffic.

Second, ZEGO's product SDK also carries out an active probing process before access.

Active probing before access has to work together with the scheduling policy, which generally combines two strategies:

The default scheduling policy is that the scheduling server obtains the network quality and server quality of different nodes in real time and, after aggregating them with an algorithm, assigns a node server from a cluster to users in a given region by default. However, because the network changes quickly (network jitter, node quality changes), the default scheduling policy may not respond in time. Therefore the SDK client needs to actively probe and select routes before access: access probing compares the RTT, packet loss rate and download speed of the node assigned by the default policy against other candidate nodes and selects a high-quality node.
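The following is a minimal sketch, under assumed names and weights (it is not the actual SDK logic), of how such client-side probing could score candidates by RTT, packet loss and download speed and keep or override the default node.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    node: str
    rtt_ms: float          # round-trip time measured by the probe
    loss_rate: float       # packet loss rate in [0, 1]
    download_kbps: float   # measured download speed

def score(p: ProbeResult) -> float:
    """Lower is better. The weights are illustrative assumptions,
    not values from ZEGO's scheduler."""
    return p.rtt_ms + 1000.0 * p.loss_rate - 0.01 * p.download_kbps

def pick_node(default_node: ProbeResult, candidates: list[ProbeResult]) -> str:
    """Compare the default node against the other probed candidates."""
    best = min([default_node, *candidates], key=score)
    return best.node

# Example: the default node has higher RTT and loss than a nearby alternative.
default = ProbeResult("node-default", rtt_ms=120, loss_rate=0.05, download_kbps=3000)
others = [ProbeResult("node-a", rtt_ms=45, loss_rate=0.01, download_kbps=5000)]
print(pick_node(default, others))   # -> "node-a"
```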

03 Extreme engineering experience in the real-time chorus scenario

Karaoke was originally "weak interaction" or even no interaction: users recorded songs and uploaded them to the community or a cloud server, other users commented after listening, and those who sang well became popular in the community. Later, multi-party online KTV emerged, and the market mainly focused on grabbing the mic and singing in turns; few products could even achieve "pseudo-real-time chorus" (explained below). The live chorus projects launched this year with ZEGO's support have extended the chorus to two, three or even dozens of people.

The willingness to sing and the payment rate of users in China's online karaoke industry are gradually increasing.

Among Generation Z users, who have huge potential, as many as 50% of the top ten online music products they use are related to karaoke, such as WeSing (Quanmin K Ge), Changba and Kugou; Kugou alone has Kugou Live, Kugou Sing and Kugou Qunqun.

At present, online KTV ranks first in both customer acquisition and user stickiness; it is a very hot scenario with very strong appeal for users' social interaction.

As mentioned above, the most popular "pseudo-real-time chorus" on the market is actually a serial chorus. Its defining characteristic is that the lead singer basically cannot hear the accompanying singer, and the delay between them is as high as 500ms, which for a fast song means a difference of one or two beats and seriously disrupts the lead singer's rhythm. Therefore, in typical "pseudo-real-time chorus" apps, the lead singer is simply prevented from hearing the accompanying singer to avoid interference, which is completely different from real offline karaoke. In addition, multi-person chorus is difficult to implement, because a serial chorus can only chain the lead singer to the accompanying singer, and the accompanying singer to the audience; it cannot support many people singing together.

The serial chorus architecture is very simple: the lead singer sings locally into the microphone, the captured voice is mixed with the accompaniment into one stream and pushed out; the accompanying singer follows the rhythm of the lead singer's stream and sings into their own microphone at the same time, so the final stream contains three parts: the lead vocal, the accompanying vocal and the accompaniment. Finally it is played to the audience over a low-latency link or CDN.

The disadvantage of this scheme is that the accompanying singer can only start once the lead singer's voice arrives, and the lead singer only hears the accompanying singer after that voice travels back. The round-trip delay of about 500ms makes the experience poor for both singers, and the lead singer cannot hear the other voices at all, losing the social element.

To solve these pain points and truly bring online karaoke close to the offline experience, we designed the real-time online karaoke chorus technical architecture.

This architecture is completely different from the serial one. It is a restructured architecture that does not distinguish between a lead singer and an accompanying singer, only between singer 1 and singer 2. On the business side, one person is selected as the mic host; the mic host and the other singers each push the vocals captured by their microphones to ZEGO RTC. The difference is that the mic host additionally pushes the accompaniment music, i.e. two streams (one accompaniment and one voice), while the other chorus members only push their own voice.

The red box at the bottom right of the figure can represent one or more singers. ZEGO RTC aligns the n+1 incoming streams frame by frame according to their timestamps. The timestamp here is an absolute NTP timestamp: before singing, each singer obtains the absolute time from the same NTP time server over the NTP protocol, so it is the same for everyone. On this premise the mixing server can take out each frame of each person's audio — audio has 50 frames per second, and each frame carries its own timestamp — compare it with the timestamps of all the other streams, merge the frames with the same timestamp into one stream and play it to the audience. The audience then hears a single, coherent song in the mixed stream, provided of course that the singers keep accurate rhythm.
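A minimal sketch of this frame-by-frame alignment, under assumed data structures (it is not the mixing server's real code): frames are bucketed by NTP timestamp (20 ms per frame at 50 fps) and frames sharing a bucket are summed into one output frame.

```python
from collections import defaultdict

FRAME_MS = 20  # 50 audio frames per second -> 20 ms per frame

def mix_aligned(streams):
    """streams: one list per singer of (ntp_timestamp_ms, samples) tuples.

    Illustrative only: buckets frames by NTP timestamp and sums the samples
    of frames that fall in the same 20 ms slot, producing one mixed stream.
    """
    buckets = defaultdict(list)
    for stream in streams:
        for ts_ms, samples in stream:
            buckets[ts_ms // FRAME_MS].append(samples)

    mixed = []
    for slot in sorted(buckets):
        frames = buckets[slot]
        # Naive mix: sample-wise sum (a real mixer would also normalize/limit).
        out = [sum(vals) for vals in zip(*frames)]
        mixed.append((slot * FRAME_MS, out))
    return mixed
```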

So what is the prerequisite for this architecture to work? How can each singer push their vocals to the cloud at the same pace as everyone else, i.e. how do we ensure that everyone sings in step? The singing progress depends on the accompaniment progress, so the real question is how to ensure that every singer hears the accompaniment at the same progress. An innovative solution is used here. Unlike the serial scheme, where one side pushes the accompaniment for the others and the audience to listen to, in this scheme each singer plays the accompaniment locally. Based on each singer's local clock, which has been synchronized to absolute (NTP) time in advance, everyone agrees on an absolute start time and counts down to it; once the agreed time arrives, every participant's local media player starts the same accompaniment at the same moment. As long as the accompaniment is aligned, everyone's singing progress is naturally aligned.
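A minimal sketch of this scheduled start, with hypothetical names (not the actual SDK API): each client estimates its offset from the shared NTP server, converts the agreed absolute start time to its local clock, and starts the preloaded accompaniment at that moment.

```python
import time

def local_start_time(agreed_ntp_start_ms, ntp_offset_ms):
    """Convert the agreed absolute (NTP) start time into this device's local clock.

    ntp_offset_ms = ntp_time - local_time, estimated beforehand by querying the
    same NTP server that every singer uses. Names are illustrative.
    """
    return agreed_ntp_start_ms - ntp_offset_ms

def wait_and_play(player, agreed_ntp_start_ms, ntp_offset_ms):
    target_local_ms = local_start_time(agreed_ntp_start_ms, ntp_offset_ms)
    # The accompaniment must already be preloaded so play() starts within a few ms.
    while time.time() * 1000 < target_local_ms:
        time.sleep(0.001)
    player.play()   # every singer starts the same accompaniment at (almost) the same instant
```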

This audio is a demonstration of the KTV effect of a three-person real-time chorus. You can hear that the progress of the three voices is very even, and it is hard to tell that it is an online chorus. Each singer sings in real time following the local accompaniment progress, and the audience and the other singers on the mic hear the same effect, so the user experience is very good.

The steps described above will not be repeated here.

There are two prerequisites for each singer to play the accompaniment locally: first, each end obtains the NTP network time in advance; second, the local media player preloads the song resources, and only after everyone has preloaded does the 5-4-3-2-1 countdown start, so that the player can begin playing within a few milliseconds. At present the progress error can be kept within 8ms, which has a negligible impact on the overall delay.

A synchronized start can be guaranteed, but if someone joins the chorus midway, or the player suddenly stutters halfway through (one stutter is about 50ms), or a user plugging or unplugging headphones causes the underlying streaming engine to restart and pause for a few milliseconds, the error gradually accumulates: the playback progress error grows from 8ms to 50ms or even 100ms, and with the network delay added it could reach 200ms.

For these situations a dedicated algorithm is used. The mic host's current playback progress is obtained in real time and written into the host's stream as SEI, at a customizable frequency. Non-host singers continuously read the progress value from the host's stream, together with the NTP timestamp at which it was sent. By comparing it with their own local playback progress they know what the host's progress was at that NTP time; subtracting the two timestamps eliminates the network delay error and lets them accurately predict the host's current playback progress. The non-host singers then only need to seek their local accompaniment to align with the host. This raises the question of seek precision: many vendors cannot get seek precision very low, around 100 milliseconds, whereas through a series of breakthroughs we can currently complete the above action with seek precision at the 10-millisecond level.
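A minimal sketch of this correction step, with hypothetical names and threshold (not ZEGO's actual SEI format): the non-host predicts the host's current progress from the last SEI sample and seeks only when the drift exceeds a threshold.

```python
def predict_host_progress_ms(sei_progress_ms, sei_ntp_ms, now_ntp_ms):
    """The host was at sei_progress_ms when the NTP clock read sei_ntp_ms; since
    the accompaniment advances in real time, its progress now is that value plus
    the elapsed NTP time. Using NTP time on both sides cancels the network delay."""
    return sei_progress_ms + (now_ntp_ms - sei_ntp_ms)

def maybe_seek(player, sei_progress_ms, sei_ntp_ms, now_ntp_ms, threshold_ms=40):
    """Seek the local accompaniment only when drift from the host exceeds the
    threshold; threshold_ms is an illustrative value, not a documented one."""
    host_now = predict_host_progress_ms(sei_progress_ms, sei_ntp_ms, now_ntp_ms)
    drift = host_now - player.current_position_ms()
    if abs(drift) > threshold_ms:
        player.seek_ms(host_now)   # only effective if the player's seek is precise (~10 ms)
```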

As mentioned above, the mixing server aligns all the singers' audio streams frame by frame and mixes them together to guarantee the audience's listening experience.

To avoid a poor experience for users in a bad network environment, we have made some encoding and parameter optimizations. After integrating the scenario, it is recommended that a speed test be run before users sing together, mainly to measure RTT and packet loss rate. If the user's network is poor, the business layer gives friendly guidance, for example advising the user to improve the network before joining the chorus.
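As an illustration of such a pre-chorus check, here is a minimal sketch under assumed thresholds; the function name and limits are hypothetical, not values recommended by ZEGO.

```python
def preflight_check(rtt_ms: float, loss_rate: float,
                    max_rtt_ms: float = 150, max_loss: float = 0.05) -> str:
    """Run before entering the chorus room; thresholds are illustrative only."""
    if rtt_ms > max_rtt_ms or loss_rate > max_loss:
        # The business layer would surface this as friendly guidance to the user.
        return "Your network is unstable; please switch to Wi-Fi or move closer to the router."
    return "ok"
```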

Besides ultra-low latency and the ability to match the scenario, real-time chorus also requires playing music locally, so how is copyright obtained? In compliance with the national anti-piracy campaign, users cannot be asked to obtain audio from other channels. To let more platforms integrate real-time KTV, ZEGO and TME have opened a channel for copyright purchase: the copyright fee is less than 1/5 of that of other channels, the payment methods are flexible, and the music library is very large.

The music library (Sonic Engine) basically covers songs for elderly, middle-aged and young listeners as well as Douyin hits, meeting the demands of users of all ages.

04 Future prospects of ultra-low latency scenarios

Rather than the future of ultra-low latency scenarios, this is really the future trend of real-time interaction between users, of real-time pan-entertainment, and even of the whole online society. In the future there will inevitably be more powerful terminals; codecs may advance to H.266 or H.267, with ever higher quality and compression rates; networks may evolve to 6G or 7G, with ever faster transmission and ever lower packet loss and delay.

Based on these system cornerstones, the future network is bound to be a complete virtual world dominated by the metaverse, full of VR/AR interaction, body interaction and expression control, with immersive experience and a virtual second life, fully meeting users' demand to simulate offline social interaction online.

The metaverse

Many years ago there was "The Matrix", which I believe you all know. The movie describes a future society in which human bodies are imprisoned in life-support equipment, and what everyone takes to be the real society is a virtual one simulated by a super AI. What we see of reality, whether with the naked eye or through the Hubble telescope, is only what we perceive. From this point of view, as long as the simulation is realistic enough, what is the difference between the virtual and the real? There is now even an academic theory that our universe is a simulation run by a giant super AI, citing for example the indivisibility of Planck time, as if the super AI's clock frequency could not be subdivided further.

In short, focusing on the near future, the metaverse must be a complete, fully self-consistent virtual world, whose economy may be derived from blockchain. It relies mainly on ultra-low latency, because everyone has their own avatar, and social interaction between avatars has to be almost exactly the same as real social interaction, requiring even lower latency than the current 70ms. The immersive and social experience relies more on virtual-device manufacturers such as HTC, HP and Sony, and the VR rendering engines rely on game developers.

For example, during the pandemic last year, Travis Scott used VR to hold a virtual concert in Fortnite, a popular game at the time. It lasted only about 10 minutes but was attended by over 12 million people. Players could watch the concert from different angles inside the game, and the singer switched through about 10 scenes within those 10 minutes, a diversity of scenes that offline concerts cannot match.

Starting from Second Life nearly 20 years ago, there have been a number of avatar-based social games built on VR engines. The picture shows a VR virtual classroom: the characters are virtual avatars generated by the VR engine, and the voices come from real players. The top left corner shows the user's avatar in the game; they can communicate with each other in real time, exactly like a classroom scene and interaction in the real world.

Cloud Game Scenarios

The picture shows a cloud gaming scene. Cloud games do not consume local hardware resources; the game runs on the server, which computes the game engine in real time and transmits the game picture to the user's terminal with ultra-low delay. Based on the ultra-low latency technology described above, the picture delay can currently be kept within tens of milliseconds, and with the new signaling channel the delay can be compressed to less than 10ms. There is basically no difference from running the game locally, and it saves the cost of a graphics card, PC or flagship phone: any terminal can run the game without overheating.

Cloud games + play together + live

We hope to combine cloud games, Play Together and live streaming: without the hardware load of running huge 3D games locally, users can stream while playing, connect the mic with friends, form teams, or chat and interact with fans.

Virtual avatars

The video clip shows a group of singing avatars from our avatar engine, which will launch in October. The avatar engine provides a character model together with related capabilities, actions and facial expressions; the avatar's facial expressions can mimic the user's expressions captured by the camera. It will be packaged as a complete set of capabilities that a platform can adopt quickly just by integrating the SDK, without needing professional game developers. Although it is still far from real VR and an immersive experience, quick access to a virtual avatar is the first step.

VR scene

In the future we will add VR engine output that supports VR devices; the supported devices are the common hardware on the market, such as Sony, HTC, HP and so on.

That’s all for this sharing, thank you!

