Tencent Conference launched last year and expanded rapidly during the epidemic, reaching more than 10 million daily active accounts within two months and becoming the most widely used video conferencing app in China. Behind this breakthrough lies a question: how does end-to-end real-time voice technology ensure smooth communication? This article is organized from a talk given by Shang Shidong, Senior Director of the Audio Technology Center at Tencent Multimedia Lab, at the "Yunjia Community Salon Online". From the evolution of real-time voice communication to the future of the voice experience under 5G, it walks through the topics one by one.


1. Evolution of the communication system

1. From analog to digital phones

When it comes to the real-time voice end-to-end solution behind Tencent Conference, the first thing that may come to mind is the PSTN telephone. After more than 100 years of development since Bell Labs created the analog telephone, the voice communication and telephone system has changed enormously. Especially in the last 30 years, voice calls have moved from analog signals to digital signals, from fixed phones to mobile phones, and from circuit switching to today's packet switching.

The PSTN telephone system originally used analog phones. The advantages of digital phones over analog phones are obvious: voice quality, resistance to interference, and resistance to signal attenuation over long distances are all superior. So the first step in the evolution of the telephone system was upgrading terminals from analog to digital phones, with networks upgraded to ISDN (Integrated Services Digital Network), which supports digital voice and data services.

The most important feature of ISDN is end-to-end digital connectivity, which integrates voice and data services so that both can be carried on the same network. But ISDN is still essentially a circuit-switched network.

Circuit switching sets up a dedicated circuit between two phones. The benefit of a dedicated circuit is consistent call quality: it guarantees the stability of the link, the quality of communication, and the privacy of the conversation. But the disadvantages of the circuit-switched PSTN are also obvious, especially for long-distance calls: because they occupy dedicated lines, they can be very expensive.

At the same time, the IP-based Internet began to flourish, and communication began to evolve from circuit switching to packet switching. The advantage of packet switching, as shown in the figure above, is that bandwidth can be shared: a link is no longer dedicated to a single call but shared by many calls. Sharing greatly reduces cost, which in turn further drove the development of voice communication technology.

2. From digital phones to IP phones

From about 2000, networks began the evolution from circuit switching to IP packet switching, and over the last decade we have faced a new challenge: both the network and the communication terminals have become far more complex and diverse than before.

In the past it was mainly phone to phone; now we can use all kinds of clients based on IP networks: PCs, mobile apps, IP phones, and so on. A phone-to-phone call may go through a traditional circuit switch or over an IP network between digital phones. This leads to an obvious problem: the network has become incredibly complex, and the terminals incredibly diverse.

In such an evolution, how do we ensure interoperability? How do traditional telephone terminals intercommunicate with the various Internet telephone terminals, and how do we guarantee call quality and call experience?

3. H.323 and SIP

For voice calls, whether based on VoIP technology or on traditional circuit switching, two problems must be solved. First, the terminal needs to register with the telephone network. Then, during a call, three questions must be answered: how to set up the call, how to maintain it, and how to tear it down.

After a call is established, capability negotiation is carried out. For an IP phone, the essence of capability negotiation is that the two parties exchange IP addresses and port numbers to establish a logical channel before media can flow.

In the evolution from the PSTN to IP telephony, two interesting protocol families appeared. The first is H.323, which comes from the ITU (International Telecommunication Union), the international body that has traditionally set telegraph and telecommunication standards. The other comes from the Internet side: the IETF (Internet Engineering Task Force), which has produced standards for many aspects of the Internet. Each of these two standards organizations launched its own solution for Internet telephony.

The H.323 protocol family carries the ITU's characteristically rigorous, all-inclusive style; the whole family is defined very completely and in great detail, from the application layer down to the transport layer. H.225.0 RAS is used to register the endpoint, H.225.0 call signaling to establish and maintain the call, and H.245 to negotiate capabilities and exchange IP addresses throughout the call. The family also specifies that audio and video streams are carried over RTP, that RTCP handles bandwidth control and statistics reporting for those streams, and which audio and video coding formats run on top of RTP. H.323 is defined in such detail and completeness that it could serve as the standard for audio and video calls over the Internet.

This standard was adopted by many large companies, such as Cisco and Microsoft, whose products follow H.323. But even with such a complete and detailed definition, H.323 was slow to gain traction in the market.

The SIP protocol comes from the IETF. The IETF's style is open and flexible, and SIP fully inherits that thinking; its overall architecture is very simple. Compared with H.323, SIP specifies only signaling and says nothing about the media streams themselves. SIP messages follow the style of HTTP, already widely used on the Internet, and are written in plain text. Communication between its entities, including user agents (phones), proxies, DNS, and location servers, is therefore open and flexible.

SIP does not specify concrete media content; it only defines the framework of the protocol, the network structure, and the companion protocols its modules use to talk to each other, for example SDP. Its message format is text rather than binary, whereas H.323 is binary, which is difficult to extend and hard to read.
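To make the text format concrete, here is a sketch of what a minimal SIP INVITE carrying an SDP body looks like, plus a tiny parser. All names, addresses, and tags below are illustrative examples, not taken from the article or any real deployment.

```python
# A hypothetical minimal SIP INVITE with an SDP body. Every identifier
# here (users, hosts, tags) is made up for illustration.
INVITE = "\r\n".join([
    "INVITE sip:bob@example.com SIP/2.0",
    "Via: SIP/2.0/UDP client.example.com:5060;branch=z9hG4bK776asdhds",
    "From: Alice <sip:alice@example.com>;tag=1928301774",
    "To: Bob <sip:bob@example.com>",
    "Call-ID: a84b4c76e66710@client.example.com",
    "CSeq: 314159 INVITE",
    "Content-Type: application/sdp",
    "",  # blank line separates SIP headers from the SDP body
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 client.example.com",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 111",   # media line: audio on UDP port 49170
    "a=rtpmap:111 opus/48000/2",   # dynamic payload type 111 -> Opus
])

def parse_invite(msg: str):
    """Split a SIP message into method, headers, and the SDP audio port."""
    head, _, body = msg.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method = lines[0].split()[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    port = None
    for line in body.split("\r\n"):
        if line.startswith("m=audio"):
            port = int(line.split()[1])  # where the peer should send RTP
    return method, headers, port
```

Because everything is plain text, the capability exchange the article describes (IP addresses, ports, codecs) is readable and extensible in a way a binary encoding is not.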

Because SIP is so open and flexible, it is widely used by many companies and products for call establishment and maintenance in Internet calls. But it has its own disadvantage: SIP solutions from different vendors are often difficult to interoperate.

Because of their different positioning and the different styles of the two standards organizations, neither H.323 nor SIP has absolute dominance in the market; each retains its share. And it is precisely the emergence of H.323 and SIP that made IP-based voice and video calls over the Internet possible.

The audio solution in the Tencent Conference system builds on exactly these two protocol families and frameworks: H.323 is used in the signaling path to interoperate with PSTN phones, and SIP is used to communicate with VoIP clients over the Internet.

4. Difficulties and challenges of VoIP technology

VoIP technology runs on today's complex IP network environment and therefore faces many challenges. With circuit switching, quality can be guaranteed, although it is expensive, because resources are dedicated. VoIP solutions run over packet networks where resources are shared, so they face many network-level challenges as well as acoustic ones.

(1) Packet loss challenge

The first network challenge is packet loss: the link is not dedicated, and the UDP protocol does not guarantee that every packet reaches its destination.

(2) Delay challenge

The second challenge is delay. An IP path crosses many switches and intermediate nodes, and each hop adds delay.

(3) Jitter

The third challenge is a concept unique to packet switching: jitter, the variation of delay. Although the first packet is sent earlier than the second, the second may arrive at the destination first; the receiver cannot play the audio if it has received the second packet but not the first.

(4) Echo problem

Compared with PSTN phones, VoIP phones face much larger delays, which changes how echo must be handled.

Most traditional phones do not need to consider echo, because the delay of a local call can be kept within 50 milliseconds, and the human ear cannot distinguish such an echo from one's own voice. But as delay on the Internet increases, it may even exceed 150 milliseconds, so the echo problem must be solved properly, otherwise it is very uncomfortable to the ear.

(5) Bandwidth problems

The bandwidth of the network is also related to call quality. If capacity is insufficient, both the number of concurrent VoIP calls and their quality suffer greatly.

5. Tencent Conference audio and video solutions

The following figure shows the main framework of the VoIP protocol stack, including the H.323 and SIP protocols: which layers they correspond to in the OSI reference model and how the different layers interact with each other.

Given that H.323 and SIP signaling can set up a call in Tencent Conference voice communication, how are the audio and video media streams, the most important part, actually transmitted over the Internet once the call is established?

(1) Real-time voice communication: RTP protocol

RTP is widely used for real-time voice communication. RTP runs on top of UDP, and because UDP, unlike TCP, does not guarantee delivery, it simply sends each packet toward the destination on a best-effort basis.

Bare UDP alone is not enough for voice communication, because voice is sensitive to the gaps caused by packet loss and jitter; but TCP cannot be used either, because its retransmissions introduce too much delay.

So today everyone uses RTP. The RTP header carries two key fields: the sequence number and the timestamp. With these two fields, the receiver can detect whether arriving voice packets are missing or out of order and restore the correct order, so that voice quality does not degrade as long as jitter and packet loss are not too large.

In the RTP protocol, a telephone call can carry multiple streams. RTP defines an SSRC (synchronization source) identifier to distinguish multiple speakers' audio and video: different SSRCs correspond to different audio streams, and both the client and the server can mix or forward the streams as the situation requires.
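The way a receiver uses the sequence number in practice is a jitter buffer: hold a few packets, reorder them, then release them for playback. Here is a toy sketch of that idea; the class name, fixed depth, and lack of sequence-number wraparound handling are all simplifications, not Tencent Conference's actual implementation.

```python
import heapq

class JitterBuffer:
    """Toy jitter buffer: reorders RTP packets by sequence number and
    releases them only once a small reordering window has filled.
    A real implementation must also handle 16-bit sequence-number
    wraparound and adapt its depth to the measured jitter."""

    def __init__(self, depth=3):
        self.depth = depth
        self.heap = []          # min-heap ordered by sequence number

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        """Return in-order packets once more than `depth` are buffered."""
        out = []
        while len(self.heap) > self.depth:
            out.append(heapq.heappop(self.heap))
        return out
```

The depth of the buffer is the trade-off the article describes: a deeper buffer absorbs more jitter but adds delay.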

(2) Opus voice engine

Internet-based VoIP solutions actually have many codec choices: from the earliest G.711 series used with H.323 onward, dozens of standards have appeared over the past 20 to 30 years, but today there is a trend of unifying on Opus.

As can be seen from the figure below, Opus covers a wide bit-rate range, from a few kbps to dozens of kbps and beyond, and supports not only voice but also music scenarios. Music scenarios will also take a growing share of Tencent Conference's business in the future.

At the same time, Opus is a low-latency voice engine. Delay is critical in real-time voice communication: more than about 200 milliseconds is clearly unacceptable for real-time conversation.
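To see where that 200 ms ceiling goes, it helps to add up a one-way "mouth-to-ear" delay budget. The codec terms below use Opus defaults (20 ms frames and a 6.5 ms lookahead, per RFC 6716); the network, jitter-buffer, and playout figures are illustrative guesses, not measurements from Tencent Conference.

```python
def mouth_to_ear_delay_ms(frame_ms=20.0, codec_lookahead_ms=6.5,
                          network_ms=50.0, jitter_buffer_ms=60.0,
                          playout_ms=10.0):
    """Rough one-way delay budget for a VoIP call.

    frame_ms + codec_lookahead_ms come from the codec (Opus defaults);
    the remaining components are assumed example values for the
    network path, the receiver's jitter buffer, and device playout."""
    return (frame_ms + codec_lookahead_ms + network_ms
            + jitter_buffer_ms + playout_ms)
```

With these example numbers the total stays under the roughly 200 ms limit the article cites; a slow network path or a deep jitter buffer can easily blow the budget.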

2. Pain points and technical difficulties of Tencent Conference users

Even with these technologies, solving the audio problems in Tencent Conference runs into many difficulties and pain points. During development we found that, for a variety of reasons, many problems show up in the actual user experience.

1. Common sound problems

(1) Silence problems

The first problem is silence. When you join a meeting through a VoIP client or a phone, you may hear nothing at all, caused by driver exceptions, hardware exceptions, missing microphone permissions, device initialization failures, or interruption by an incoming phone call.

(2) Echo leakage

Echo leakage may occur during real-time voice. In the traditional PSTN telephone system, echo was not a problem, because latency was low and most calls were made in handset mode. With VoIP clients such as PCs and mobile phones, however, more and more people prefer speakerphone playback rather than putting earphones on, which causes echo problems.

(3) Noise

There is also the problem of noise. In mobile scenarios, outdoors or in the office, people on the other end of a VoIP call will often hear keyboard typing or the clink of a water glass. These noises were not noticeable in traditional handset-mode phone calls.

(4) Low and erratic volume

Volume can also be low or erratic, which is again related to the peripherals and the usage scenario: playback level set too high on a PC, Mac, or mobile device; excessive system CPU load; the posture of holding the microphone; speech misclassified as music; and slowdown caused by network jitter. All of these lead to various problems during meeting speech that basically did not occur in traditional calls.

(5) Muddy voice and poor intelligibility

There is also the problem of muddy voice and poor intelligibility. Real-time call scenarios are much more complex than before: calls made in heavily reverberant rooms or with poor capture devices easily end up with poor sound quality.

(6) Audio stutter

There are also problems like audio stutter, which are common to all VoIP calls. One immediately assumes that stutter is network-related, but in actually solving the problem we found that many causes can produce it. The network is a big part of it, but not all of it.

For example, when the source quality is poor, stutter can appear during signal processing, because very quiet speech may be cancelled as noise. Similarly, CPU overload and playback-thread scheduling failures also lead to stutter. When echo processing runs with capture and playback out of sync, leaked echo can also manifest as stutter. So during a meeting, many causes from many directions can combine to damage the final sound quality.

(7) Wideband speech becoming narrowband speech

In addition, we found a very interesting phenomenon: many people in our company use IP phones. The sound quality of phone-to-phone calls is usually good, but once such a phone joins a Tencent Conference meeting, the voice drops from the original wideband to narrowband.

Why is that? In many cases it has a lot to do with the network topology of the company's IP telephony system. Many internal network segments are not directly connected, so calls must pass through a transcoding voice gateway. To guarantee maximum compatibility, these gateways tend to transcode high-quality voice directly into G.711, a narrowband standard from 30 to 40 years ago that every unit and system supports; compatibility is maximal, but the sound quality becomes narrowband accordingly.
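Why G.711 implies narrowband is easy to see from its numbers: 8,000 samples per second, each compressed to 8 bits by a logarithmic companding curve, gives 64 kbps and a passband of roughly 300 to 3,400 Hz. The sketch below uses the textbook continuous μ-law formula; real G.711 uses a segmented piecewise-linear approximation of this curve, so this is an illustration of the principle, not the exact codec.

```python
import math

MU = 255  # mu-law constant used by G.711 in North America and Japan

def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an 8-bit code with the
    continuous mu-law curve: quiet samples get fine quantization,
    loud samples get coarse quantization."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))   # map [-1, 1] to 0..255

def mulaw_decode(code: int) -> float:
    """Invert the companding curve back to a sample in [-1, 1]."""
    y = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

Eight bits times 8 kHz is where the 64 kbps figure comes from; the gateway trades away everything above ~3.4 kHz for universal compatibility.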

Narrowband voice and heavily reverberant rooms both damage sound quality, and we found that the effect of heavy reverberation on the ear is related to volume: when the volume feels unsuitable or too loud, a heavily reverberant room may damage the acoustics further. Add stutter or noise on top, and as these factors pile up, the voice quality of a VoIP call is severely challenged.

2. Joining a meeting with multiple devices in the same room

Using Tencent Conference, problems arise when multiple devices join from the same place. In the telephone era we never hit this, because a room had only one phone; there was no case of multiple phones or multiple acoustic devices joining from the same location. Today, with collaborative meeting clients installed on everyone's computer, people are used to bringing their laptops to the meeting to share screens and slides. Everyone enters the meeting and turns on screen sharing, and suddenly one conference room has multiple terminals and multiple acoustic devices joined in the same place, and the immediate problem is echo.

For a single device, the playback signal can be captured as a reference to cancel the echo. But with multiple devices, my laptop's echo canceller cannot obtain a reference for the sound played from someone else's speaker, because that device's network delay and processing delay differ from mine. So a client can usually only cancel its own local echo, and there is no good way to handle the sound played by several source devices in the same room. In the better case you get leaked echo; in the worse case you get outright howling.

Tencent Conference has a detection scheme: we use the correlation between multiple devices to solve the problem of multiple co-located devices joining the same meeting, which is elaborated in detail below.

3. AI technology improves the meeting audio experience

Inside Tencent Conference, what methods do we adopt to improve the user's call experience?

1. Super-resolution bandwidth extension in the audio field

First, we apply narrowband-to-wideband super-resolution extension to narrowband voice, especially narrowband voice coming from the PSTN.

Traditional PSTN phones have an upper frequency limit of 3.4 kHz, so the sound is not bright or detailed enough to the human ear, and compares unfavorably with VoIP phones. With the help of AI, the missing high-frequency components are predicted and generated from the low-frequency information, so that originally dull, thin speech becomes brighter and the sound quality fuller.

2. Packet loss concealment technology

Secondly, artificial intelligence can help with the packet-loss challenge of IP networks. There are many solutions at the transport level, such as FEC at the network level, but solving packet loss there has its limits: both ARQ and FEC schemes come with extra bandwidth or extra delay, resulting in a worse experience.

At the acoustic level, there is correlation between speech signals and between adjacent speech frames. When a person speaks, a word lasts roughly 200 milliseconds (assuming at most five words per second), while a speech packet is only 20 milliseconds long. With packet loss concealment, not every packet has to arrive: as long as the loss is sporadic, caused by jitter or random drops rather than a sudden large burst, digital signal processing and deep learning at the acoustic level can reconstruct the lost audio.

So when a speech frame is lost, its encoding parameters can be predicted through digital signal processing and deep learning, and the missing signal can be synthesized at the signal level through various filters. Combined again with FEC or ARQ at the network transport layer, this solves the packet loss and jitter challenges of the network very well.
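The simplest classical form of packet loss concealment, long predating the learned approach described above, is "repeat and attenuate": replay the last good frame for a lost packet and fade consecutive losses toward silence. This toy sketch illustrates only that baseline idea, not Tencent Conference's parameter-prediction scheme.

```python
def conceal(frames, decay=0.7):
    """Toy packet-loss concealment over a list of 20 ms frames, where
    None marks a lost packet. A lost frame is replaced by an attenuated
    copy of the last good frame, and consecutive losses decay further,
    so a long gap fades out instead of buzzing. The decay factor is an
    illustrative choice."""
    out, last, gain = [], None, 1.0
    for f in frames:
        if f is not None:            # frame arrived: play it as-is
            out.append(f)
            last, gain = f, 1.0
        elif last is not None:       # frame lost: repeat and attenuate
            gain *= decay
            out.append([s * gain for s in last])
        else:                        # loss before any good frame
            out.append([0.0])
    return out
```

Because adjacent 20 ms frames are strongly correlated, even this crude substitution is inaudible for a single sporadic loss; burst losses are where the model-based reconstruction becomes necessary.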

3. Speech enhancement and noise reduction

Another strong demand in voice communication is noise reduction: we do not want to hear environmental noise; we mostly want to attend to the voice itself. After 30 to 40 years of development, traditional noise-reduction technology, whether based on statistics or other methods, can handle stationary noise well, because stationary noise can be estimated accurately.

However, for the non-stationary, sudden noises that are common today, classical speech processing falls short. Tencent Conference's audio solution uses machine learning to train models that learn the characteristics of such transient noises, for example their spectral signatures, and finally eliminates the sounds that traditional digital signal processing cannot: keyboard and mouse clicks, water-glass sounds, phone vibration, and other sudden noises.
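For contrast, the classical stationary-noise approach the article credits to traditional DSP is spectral subtraction: estimate the noise magnitude in each frequency bin (easy when the noise is stationary) and subtract it, clamping to a small floor to avoid "musical noise" artifacts. The sketch below shows the per-frame rule on an already-computed magnitude spectrum; the floor value is an illustrative choice.

```python
def spectral_subtract(frame_mag, noise_mag, floor=0.05):
    """Classical spectral subtraction on one frame's magnitude
    spectrum: per frequency bin, subtract the estimated stationary
    noise magnitude and clamp the result to a fraction of the original
    so no bin goes fully to zero (which causes musical noise).
    Assumes the caller has already done the FFT and noise estimation."""
    return [max(m - n, floor * m) for m, n in zip(frame_mag, noise_mag)]
```

This works precisely because stationary noise has a stable spectrum to estimate; a keyboard click has no such stable estimate, which is why the learned models described above are needed for transient noise.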

4. Speech/music classifier

In addition, the presence of music must be considered in meetings. For example, when teachers give lectures, they often share video content with high-quality background music. If our scheme could only process speech and not music, it would be a great limitation for such scenarios. Therefore, as shown in the figure below, we developed a speech/music classifier, which allows background music to be carried well in meeting audio.

4. Audio quality assessment system

Real-time monitoring of audio and sound-quality evaluation is very important for an Internet product like Tencent Conference that supports tens of millions of DAU. Throughout development, our implementation largely follows the ITU's (International Telecommunication Union) test programs for assessing communication quality. As shown in the figure below, in the sound-quality test scenarios we are equipped with standard artificial heads and standard reference devices to test and evaluate overall voice quality.

Our complete evaluation scheme refers to ITU and 3GPP standards, which fully define the test scenarios: different acoustic environments, different test bitstreams, and different sound sources. For single-talk, double-talk, echo leakage, and noise reduction, there are corresponding standards for the SMOS and NMOS scores of the evaluated speech.

How do we score the sound quality processed by Tencent Conference, and judge whether it meets the requirements? We have built a complete voice-quality evaluation system to assess end-to-end voice communication quality.

In a live voice call there is no reference signal, so quality assessment without a reference has generally relied on parametric QoS models: starting from the codec type, the packet loss during the call, the delay, and the bitrate of the stream, the model derives the call quality from these parameters.

This kind of solution is useful to operators and network-planning departments, who can obtain these parameters. For users, it is not as intuitive.
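The best-known parametric model of this kind is the ITU-T G.107 E-model: start from a maximum transmission rating, subtract impairments for delay and packet loss, and map the resulting R-factor to a MOS score. The sketch below is a simplified version; the delay-impairment approximation, the choice of Ie = 0, and the loss-robustness factor Bpl = 10 are illustrative simplifications for a loss-tolerant codec, not a full G.107 implementation.

```python
def e_model_mos(delay_ms: float, loss_pct: float) -> float:
    """Simplified E-model (after ITU-T G.107): R starts at the default
    maximum of 93.2, loses a delay impairment Id (with a knee at
    177.3 ms) and an effective equipment impairment Ie_eff driven by
    packet loss, then R is mapped to MOS with the standard formula."""
    d = delay_ms
    id_ = 0.024 * d + (0.11 * (d - 177.3) if d > 177.3 else 0.0)
    ie_eff = 95.0 * loss_pct / (loss_pct + 10.0)   # Ie=0, Bpl=10 assumed
    r = max(0.0, min(100.0, 93.2 - id_ - ie_eff))
    if r <= 0:
        return 1.0
    return 1.0 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
```

This is exactly the operator's-eye view described above: useful when you have the network parameters, but blind to acoustic problems like echo or muddiness, which is why the QoE approach below is needed.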

What users perceive directly is whether echo leaks, whether the voice is continuous, whether the sound is natural, and so on: they care about the QoE perspective, judging from personal experience whether the call is satisfying. We further refined the QoE metrics, looking mainly at the noise level during the call, the coloration of the voice (how natural it sounds, and whether there are warbling, mechanical, or other unnatural artifacts), and whether the speech stutters during the call.

Speech itself has pauses: I say a word and pause for a moment before the next. Such pauses are clearly different from stutter caused by network packet loss and jitter. Using digital signal processing and machine learning, we score the audio without any reference along these three QoE dimensions, so that we know, on the live network, what communication quality users are actually experiencing. As shown in the figure below, we fit our no-reference scoring model against full-reference scores on the same data, and the fit is very close.

Based on this no-reference voice-call model, we have a good grasp of the quality of live network communication. We do not need the reference signal of a particular utterance: from the signal received at the player alone, we can tell whether the current call quality is normal, and if not, where the problem may lie.

5. Future outlook for the conference audio system

In the area of meeting audio, besides the call itself, there is also a need for meeting transcription.

At the beginning of 2019, Microsoft announced Project Denmark, which can capture the voices of different meeting speakers with phones and tablets and separate them. As we know, in a conference room where more than one person speaks at the same time, ASR alone cannot attribute the speech. The ideal method is to separate the different speakers first and then feed each into the ASR back end for speech-to-text conversion.

Once the voice is converted to text, you can do a lot of things later, such as generating meeting notes, retrieving content, sending emails to non-attendees for viewing, and so on.

Cisco is doing the same thing. It recently bought a company that transcribes meetings.

But conference voice transcription still has several problems to solve. ASR recognition itself is in good shape: there are many good recognition solutions, including recognition of dialects and of common proper nouns, and ASR vendors also provide good options for tuning.

The biggest challenge for transcribing a multi-person meeting in one room is: how do we detect and segment consecutive speakers? If someone interrupts me while I am speaking, or two voices overlap, how do we cut and separate the audio effectively? If the speakers do not overlap on the timeline, it is relatively easy to segment the audio and identify the different speakers through voiceprint recognition.

But if the voices overlap, how do we separate them further? And how do we cluster the segments we have cut out, attributing each one to the right speaker? These technologies of speaker segmentation, separation, and clustering are very important for transcribing an entire conference, and many companies, including Tencent, are investing in this area.

In addition to transcription requirements, VoIP technology itself keeps evolving. People often ask: what does 5G mean for voice communication? Some think it means little: 5G bandwidth is so large, while voice-call bandwidth is so small.

In fact, 5G will provide a bigger and better stage for VoIP technology. The first gain is bandwidth for meeting voice communication. Although voice itself needs only a few tens of kbps, meeting audio is far more complicated: besides voice, a meeting also transmits high-quality audio, which cannot be done in a dozen or so kbps. Meeting audio can consume bandwidth quickly and cause congestion when network conditions do not allow it. Once 5G raises the bandwidth ceiling, it provides a bigger and better stage for meeting audio, and we can deliver higher-quality sound.

5G can also greatly improve delay. A large part of today's hundreds of milliseconds of latency is spent in transmission, and 5G can cut the transmission delay to a tenth of what it was, which is a great improvement for the real-time interactive experience.

Therefore, 5G can provide a better and more immersive sound experience for voice calls. With bandwidth no longer the constraint, AR- and VR-based immersive meeting audio becomes possible; with delay greatly reduced, meetings become more interactive. If interactivity improves further, remote communication will not differ much from face-to-face conversation, and that is where the technology is heading.

From the business perspective as a whole, we are seeing many changes. Converged communication is increasingly consumed as a service in more and more scenarios: fewer and fewer people deploy their own telephone equipment, moving to the cloud instead, because the reduction in up-front cost is very significant.

Artificial intelligence technology will also bring better voice communication experience in the future. For example, the intelligent noise reduction and intelligent packet loss compensation technologies mentioned above can solve some of the original problems, thus providing better voice quality experience than the original PSTN network.

WebRTC technology will also keep spreading. WebRTC has its own protocol family, and now that it is widely supported in browsers, VoIP technology can reach many scenarios through WebRTC. As VoIP is widely popularized, it will be used more and more in in-app communications.

VoIP technology is also on the rise in the IoT field. In the future, smart speakers, smart refrigerators and other devices at home will have some communication functions and be connected through IP networks.

Smarter VoIP assistants will build on the AI voice assistants provided during VoIP calls to solve voice problems in communication.

6. Q&A

Q: Can you recommend classic books and open source projects on real-time audio and video communication?

A: WebRTC is a good open source project. There are books based on WebRTC, and there are good blogs with solid introductions to WebRTC's architecture and core technologies, which can be found online.

Q: Can you explain the local multi-device solution in detail?

A: For local multiple devices, the situation is this: although the local client can capture its own playback signal for echo cancellation, it cannot obtain the playback signal of another device in the room. That is the core of the problem. We cannot get the other device's signal, but there is strong correlation between the local device and the other devices in the room, because they all play the same source, just with different distortion and delay after passing through different networks and devices. So while we may not be able to suppress the noise or echo caused by co-located devices, we must at least detect co-located devices. Once they are detected, different product strategies can be used to resolve the situation, because fully eliminating multi-device echo is very difficult: if three or five devices join at once with their microphones on, it is a disaster, and solving it acoustically costs far too much CPU to be worth it.
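The correlation idea behind that detection can be sketched with a normalized cross-correlation over a small lag search: two captures of the same playback (at different delays and levels) correlate strongly, while unrelated rooms do not. This is only a toy illustration of the principle, not Tencent Conference's actual detector, and the lag range and threshold would be tuned in practice.

```python
def norm_xcorr_max(a, b, max_lag=8):
    """Peak of the normalized cross-correlation between two capture
    signals over lags in [-max_lag, max_lag]. Near 1.0 suggests the
    two devices are hearing the same source (co-located); near 0
    suggests independent rooms."""
    def dot(x, y):
        return sum(p * q for p, q in zip(x, y))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        x, y = (a[lag:], b) if lag >= 0 else (a, b[-lag:])
        n = min(len(x), len(y))
        x, y = x[:n], y[:n]
        denom = (dot(x, x) * dot(y, y)) ** 0.5
        if denom > 0:
            best = max(best, abs(dot(x, y)) / denom)
    return best
```

Once the peak crosses a threshold, a product-level strategy (for example, prompting all but one device to mute) avoids the expensive acoustic cancellation discussed in the answer.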

Q: WebRTC is being used in many live-broadcast rooms. Could you talk about the future of WebRTC?

A: WebRTC has a lot of promise. First, it is an open-source project. In real-time audio and video transmission, especially NAT traversal, WebRTC's solutions are quite unique. It also provides solutions for audio and video codecs and audio pre-processing, so WebRTC is a very good choice in many scenarios.

Q: How can dereverberation improve speech clarity?

A: The first approach is multi-channel capture using microphone-array technology and directionality. Say I am speaking in this room: my voice bounces off the walls and the table and is picked up by the microphone, causing interference. With a microphone array, we can track the speaker's sound source, capture mostly the direct sound, and shield the reflections from the walls and desktop, which handles dereverberation well. For single-channel capture, both classical digital signal processing and machine learning can address the problem, but because it is ultimately a filtering process, it may damage sound quality, so dereverberation is not easy under single-channel conditions.
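The directional pickup described above can be sketched with a minimal delay-and-sum beamformer. This is a textbook illustration, not the lab's implementation; in a real system the per-microphone delays would come from DOA estimation, while here they are simply given.

```python
# Illustrative delay-and-sum beamformer: each microphone's signal is
# shifted by its per-mic delay (in samples) toward the target source,
# then averaged. The direct sound adds coherently across microphones,
# while reflections arriving from other directions add incoherently
# and are attenuated.

def delay_and_sum(mic_signals, delays_in_samples):
    """Align each mic signal toward the source and average them."""
    n = min(len(s) - d for s, d in zip(mic_signals, delays_in_samples))
    out = []
    for i in range(n):
        acc = sum(s[i + d] for s, d in zip(mic_signals, delays_in_samples))
        out.append(acc / len(mic_signals))
    return out
```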

Q: What are the advantages and disadvantages of VoIP compared to VoLTE?

A: VoIP and VoLTE take different approaches. VoLTE carries audio and video streams with QoS protection: voice has relatively high priority, so when network congestion occurs, voice is transmitted first. VoLTE therefore gets guaranteed bandwidth and low latency. In QoS terms VoLTE has some advantage, but as the 5G "highway" gets better and better, that advantage over VoIP will shrink. With the mass adoption of 5G in the future, VoIP quality can be made very good.

Q: What is the specific solution to the stuttering problem?

A: There are many specific solutions; the key is the specific cause of the stutter. Is the network causing it, or the device itself? If it is the network, is it packet loss or jitter? FEC can solve packet loss; if jitter is too large, enlarging the jitter buffer hurts latency but resolves jitter-induced stutter. If the device is at fault, CPU usage may be too high. Sometimes the source itself causes a stall: for example, if I suddenly turn my head while speaking, the microphone's directional pickup no longer matches my voice, the voice suddenly becomes quieter, and the background-sound processing also suffers. So the causes of stutter are complicated, and you need to analyze the cause and take targeted measures.
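As an illustration of the FEC idea mentioned in the answer, here is a minimal XOR-parity scheme: one parity packet per group lets the receiver rebuild any single lost packet in that group. This is a toy sketch; real RTP FEC adds headers, covers unequal packet sizes, and tunes the group size to the loss rate, all of which are omitted here.

```python
# Minimal XOR-parity FEC sketch: for every group of k media packets we
# send one parity packet (the bytewise XOR of the group). If exactly one
# packet of the group is lost, XOR-ing the surviving packets with the
# parity reconstructs the lost one. Equal-length payloads are assumed.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """Build the parity packet for one FEC group."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return parity

def recover(received, parity):
    """received: the group with exactly one None (the lost packet)."""
    acc = parity
    for p in received:
        if p is not None:
            acc = xor_bytes(acc, p)
    return acc
```

The trade-off is bandwidth: each group of k packets costs one extra packet, so a smaller k tolerates more frequent losses at higher overhead.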

Q: Large live broadcasts, such as sports events and press conferences, mainly use HLS and FLV. Can WebRTC technology be used in the 5G era?

A: The two scenarios are different. In a live broadcast, or when VOD playback is involved, a delay of 200 ms, 500 ms, or even a second does not matter much. But in real-time voice communication, hearing the other side 300 milliseconds or even a second late is definitely unacceptable. I do not think large broadcasts will use RTC; they will keep using something like RTMP to push the stream, or HLS to segment and deliver it, because although that adds latency, it handles network jitter and many other issues better. Different scenarios will continue to suit different technologies.

Q: With multiple devices in the same place, there is no way to get the reference signal of the other devices. How can echo cancellation be achieved?

A: Co-located devices indeed cannot obtain each other's reference signals, but there is in fact a certain correlation between the sounds they capture, which the algorithm can use for detection and processing.

Q: What is the difference between deep learning algorithm and traditional methods for audio pre-processing?

A: There are differences. Traditional digital signal processing methods struggle to adapt precisely to different scenarios; for example, they have no good way to deal with burst noise. Deep learning handles such nonlinear sounds very well and, through fitting, can better solve the cases that traditional methods handle poorly, such as residual echo suppression, burst noise, and noise reduction.

Q: Is Tencent Conference built on the WebRTC framework?

A: No, Tencent Conference is not developed on the WebRTC framework.

Q: Are the IoT applications smart home products?

A: Yes, more and more smart home devices will use IoT technology; products such as smart speakers will also integrate voice communication in the future.

Q: Voice problems occur all the time. I am curious how Tencent Conference collects and learns about these problems. How does an online voice and video product monitor the audio quality its users experience?

A: We need a no-reference speech quality evaluation system. With such a system, we can know the speech quality of calls on the live network: whether there are problems, what kind of problems, and in which region, which time period, or with which peripherals they occur.

Q: Is there any good material on using microphone arrays for sound source localization?

A: For sound source localization with microphone arrays there are many applicable techniques, such as DOA estimation; traditional speech-signal-processing algorithms are used, and many extended techniques have been developed from them. You can find detailed introductions online for a deeper answer; I can only give a rough overview here.

Q: Which parameters are appropriate for subjective and objective evaluation of audio quality?

A: Subjective evaluation means having listeners score the audio. For objective evaluation, the ITU has the P.863 standard; scoring objective indicators against such a speech standard lets you further evaluate noise, stutter, and speech quality.

Q: Regarding packet-loss handling and compensation: in a university communications course we were taught to interleave across frames, spreading lost packets among the frames and compensating for the loss using the correlation between frame data. Is Tencent Conference's packet-loss handling similar? And what is the general idea of the deep-learning approach?

A: What your teacher described is dispersing packets into different groups so that a sudden burst loss becomes isolated losses within each group; FEC within the group can then recover the lost packets from the ones received. The difference here is that group interleaving can solve certain packet-loss problems, but the cost is too much delay. When you split packets into different groups and interleave them, you must wait for all the packets to arrive before recovering the voice stream, which causes excessive speech delay.
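The interleaving trade-off in the answer above can be sketched as a simple block interleaver: packets are buffered into a grid, sent column by column, and a burst loss on the wire turns into scattered single losses after de-interleaving, at the cost of buffering the whole block before playback, which is exactly the delay the answer warns about. This is an illustrative sketch, not Tencent Conference's scheme.

```python
# Block-interleaver sketch: packets are written into a rows x cols grid
# row by row and transmitted column by column. A burst loss of up to
# `rows` consecutive packets on the wire becomes isolated single-packet
# losses after de-interleaving, which per-group FEC can then repair.
# The receiver must buffer rows * cols packets, adding delay.

def interleave(packets, rows, cols):
    assert len(packets) == rows * cols
    # Read the row-major buffer out in column-major (transmission) order.
    return [packets[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(packets, rows, cols):
    grid = [[None] * cols for _ in range(rows)]
    i = 0
    for c in range(cols):
        for r in range(rows):
            grid[r][c] = packets[i]
            i += 1
    return [p for row in grid for p in row]
```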

Q: Can Tencent provide services for setting up forwarding servers?

A: As for the traversal technology provided in WebRTC, Tencent Cloud also offers solutions, but the related technology used by Tencent Conference is for Tencent Conference's own use. If you need Tencent Cloud to provide NAT traversal technology in your solution, that can be done.

Q: Is it possible to perform quality assessment by sampling locally and sending the samples asynchronously to the server (since real time is not needed, they can be sent directly over TCP), where the server samples the same interval of the real-time audio stream for comparison?

A: It can be done during testing, and of course on the live network as well, but sampling itself has great limitations. For a product with tens of millions of DAU like Tencent Conference, sampling is impractical, and it also limits how well the live network can be evaluated. We suggest building a no-reference quality evaluation model and evaluating all live-network data in real time.

Lecturer introduction

Shang Shidong joined Tencent Multimedia Lab in early 2019 as senior director of its Audio Technology Center. Before joining Tencent, he founded the Dolby Beijing engineering team in 2010 and served as senior director of the Dolby Beijing and Sydney engineering teams for nine years. At Tencent, he leads the Audio Technology Center of the Multimedia Lab, responsible for the design and development of the audio engine and audio processing in the real-time audio and video SDK.

Follow the Yunjia Community official account and reply "online salon" to get the speaker's slides.