
This article was published in the Cloud + Community column by the Tencent Cloud AI Center.

Broadly speaking, intelligent speech technology covers several areas, and these are some of the common hot scenarios. Speech recognition was partly covered in Mr. Luo's talk just now. Speech synthesis is the conversion of text into speech, which we will discuss in more detail later. Next is voiceprint recognition. Many functions in smart cars are controlled by voice commands, and voiceprint matters there: if a child in the car suddenly cries out an inappropriate command, such as one to open a car door, the system must be able to tell who is speaking before acting on it. Voiceprint recognition is the process of identifying and authenticating a person by their voice, and it will be quite popular in future application scenarios. What is the big challenge in practice? Most other biometrics rely on relatively stable features such as faces or fingerprints, but a voiceprint is not stable: someone sings karaoke all night and loses their voice the next day. Recognizing a biometric feature that keeps changing is a major challenge.

Acoustic event monitoring: you have a sound system or a monitoring system at home, and it listens for a baby crying or any other abnormal sound. That is the application of acoustic event monitoring. The technology itself is not that difficult now, but it will evolve very quickly as more scenarios come in.

Natural language processing, in essence, does its work at the semantic level.

Continuing on: Mr. Luo already said a lot about speech recognition just now, so I will not repeat it. Linking these technologies together gives you the framework and architecture of a speech recognition system.

What are the challenges of speech recognition, the practical difficulties we encounter and need to solve? As mentioned before: how accurate is speech recognition? If you achieve about 90% in a given scenario, people will say that other vendors claim 95% to 97% while you only reach 90%. But there is a premise: what is the quality of the audio you provide? If the sound is clear and there is no noise, 97% can be achieved.

The first challenge of speech recognition is colloquial speech. Spontaneous talk is rarely as orderly and logical as written text; it is full of colloquial expressions. In a meeting scenario, for example, many people talk at the same time and scramble to speak. You cannot expect speech recognition to be particularly accurate on a noisy recording. The distance from the microphone, the microphone technology itself, and so on all need to improve.

Then there is the problem of dialect. The corpus we use for speech recognition training is fairly standard Mandarin, so for speakers with an accent we add accented Mandarin to the training corpus. A mild accent is not a problem, but if the accent is strong enough that only a local could understand it, that is a big challenge for speech recognition. So when we deploy the technology, we should consider the scenario and the recording conditions in order to get good results. To give customers a good experience both offline and online, we try to improve the recognition engine itself, but the recording conditions matter as well: with a certain degree of coordination between the two, the results are better.

Since speech recognition has already been covered at length, let us move on to other topics; next I will talk about speech synthesis. Speech synthesis is more of an art. Speech recognition has an objective measure: when you say a sentence and it is transcribed into text, how close is the text to what you actually said? The difficulty of speech synthesis is that there is no such objective, unified standard. What is the ultimate goal of speech synthesis? We hope the machine's pronunciation is close to normal human pronunciation. Judging the quality of speech synthesis is more artistic: whether the voice sounds good and whether it sounds like a real person are subjective impressions.

As for why synthesis matters: with recognition alone and no synthesis, the machine can only listen but not speak, so the interactive experience is incomplete. Speech synthesis technology is becoming more and more popular, and many scenarios need it; I will elaborate on this later.

The pipeline of speech recognition and synthesis is fairly straightforward, so what is difficult about it? When a real person talks, the pronunciation is accurate, the speech is fluent, and the intonation rises and falls with the situation. Much of the time you can still tell that it is a robot talking, and for speech synthesis that counts as failure, because the ultimate goal of synthesis is to sound genuine: once a listener hears a robot, they stop listening seriously. This is where the technological breakthroughs need to happen.

The technical difficulty is subjectivity: it is hard to find an objective index. Someone says this synthesized speech is not good, and I ask what exactly is not good. He says, "It just doesn't feel comfortable." Different occasions have different requirements: does your voice meet them, is your voice suitable for a voice assistant? I will give some examples later to show the technological breakthroughs we are making.

Another point is that many customers want voice customization. Why? For a large company making a smart refrigerator or other smart hardware, the voice matters a great deal: to them the voice is like the brand's logo, and they want users to hear the voice of their brand or their application, not share it with anyone else. Such needs are very common, and they also challenge speech synthesis technology. Some well-funded manufacturers can invite celebrities into the recording studio, and the quality of the recording determines the quality of the synthesis. Synthesis used to require about 8 hours of studio recording to achieve a good result, but that threshold keeps coming down.

Inside Tencent this year, Ma Huateng sent a red envelope to everyone on WeChat, accompanied by a spoken message that was produced by speech synthesis. We collected high-quality recordings of Ma Huateng's voice from conferences, and training on these high-quality samples was enough to achieve a good synthesis result.

A brief history of speech synthesis. The earliest approach was waveform concatenation: record each unit of a person's pronunciation and stitch the waveforms together. Then came HMM+GMM parametric synthesis, then neural-network parametric synthesis, and then WaveNet synthesis. The quality of WaveNet synthesis is very close to a real human recording. MOS (Mean Opinion Score) is the standard metric for speech synthesis: a vocoder that reaches 4.2 is already very good, while real human recordings generally score around 4.5. The best WaveNet results I have seen are around 4.52, very close to a real recording. At this year's AI Conference, Google released a human-robot dialogue demo whose voice was synthesized with WaveNet. Compared with the obviously robotic voices of earlier systems, today's synthesis technology is a qualitative improvement.
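To make the MOS figures above concrete: a MOS is simply the mean of subjective listener ratings on a 1-to-5 scale. A minimal sketch (the ratings below are made up for illustration):

```python
# MOS (Mean Opinion Score): each listener rates a synthesized sample
# from 1 (bad) to 5 (excellent); the score is the arithmetic mean.
# These ratings are invented for illustration only.
ratings = [5, 4, 4, 5, 4, 3, 5, 4]
mos = sum(ratings) / len(ratings)
print(round(mos, 2))  # 4.25
```

In practice, scores like the 4.2 and 4.5 mentioned above are averaged over many listeners and many test sentences.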

A few words about WaveNet. Since I am on the product side I may not go deep enough; we can discuss further afterwards. WaveNet is an end-to-end synthesis technique proposed by Google. The earliest WaveNet was relatively slow and consumed a lot of resources; at the end of 2017 Google released an updated WaveNet that is about 1,000 times faster than before.
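WaveNet's core building block is the dilated causal convolution: each output sample depends only on past samples, and stacking layers with growing dilations (1, 2, 4, 8, ...) expands the receptive field exponentially, which is how it covers long audio contexts. A minimal NumPy sketch of a single such layer (not Google's implementation; the filter weights and signal are arbitrary):

```python
import numpy as np

def dilated_causal_conv(x, weights, dilation):
    """1-D dilated causal convolution: the output at time t depends only
    on inputs at t, t-dilation, t-2*dilation, ... (never future samples)."""
    k = len(weights)
    pad = (k - 1) * dilation                      # left-pad to stay causal
    x_p = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(weights[i] * x_p[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# With weights [1, -1] and dilation 2 this computes x[t] - x[t-2],
# a simple causal difference filter over a stride of 2.
signal = np.arange(8, dtype=float)
out = dilated_causal_conv(signal, weights=[1.0, -1.0], dilation=2)
print(out)  # [0. 1. 2. 2. 2. 2. 2. 2.]
```

A real WaveNet stacks many such layers with gated activations and residual connections, and predicts a probability distribution over the next audio sample rather than a filtered value.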

That was mostly about our technical reserves; later we will turn to more application scenarios.

Speaking of digitization: why is speech important in the digital age? In the digital age, serving users well often means a better human-computer interaction experience, and pursuing more channels for that interaction. Take customer service as one scenario: if users can only type to chat with a robot, and handling their feedback depends entirely on human agents, the labor cost is high when the volume is large. After-sales service in particular benefits from this mode of human-computer interaction.

Here are some examples from mobile apps. First is the voice input method: press and hold a key to dictate. Then there are voice reminders: "call so-and-so in 5 minutes" or "remind me to drink water", a reminder function driven by a simple spoken input, which I believe many people use. In WeChat, for example, if I receive a voice message while I am in a meeting, I can long-press the message and tap the convert-to-text button; the voice message is converted into text, so I get the information in real time even when it is inconvenient to listen.

Here is a voice demo from one of our previous projects, an application embedded in a bank's app. We built it in-house, so the testers knew what would happen next, and the interaction does not feel very coherent. Beyond the information-dialogue scenarios mentioned just now, many functional products in this application have voice technology embedded. As voice recognition technology matures, voice interaction can be used to access many business scenarios.

That was the mobile side; now let us step away from mobile and talk about hardware. Intelligent hardware has been popular for a long time, and speech recognition, speech synthesis, and semantic understanding are very important functions there: smart wearables such as watches, and more commonly smart speakers, as well as in-vehicle voice technology. Voice is especially important in the automotive environment. While driving, the driver has no time to tap the phone, and continuing to use the phone is dangerous; voice interaction is a good entry point for control, such as simply turning on the air conditioning or playing some music. Compared with pressing buttons by hand, it is not only more convenient but also safer.

Here is a scenario applying several intelligent hardware schemes in hotels. We set up some sample rooms in a hotel in Beijing and put a voice interaction platform in the guest rooms. Through the platform you can ask it to close the curtains, play music, or turn off the lights. Many people are lazy and do not want to get up to turn off the hotel lights, so it does that for you. It can also report weather, traffic, news, and so on: the effect of a voice assistant, realized in a hotel room, is convenient for many hotel guests.

Having covered applications on the phone and on intelligent hardware, there is another large application scenario: the customer service robot. The problems we actually encounter are these: you need to give feedback 24 hours a day, you need to be online all the time, and 80% of the questions customers ask are repeated and not difficult enough to require a human to look them up, for example carriers answering queries about phone charges. If 80% of the problems are repeated, we try to have robots solve them. On the concept of omni-channel: the earliest customer service robots covered public accounts, service accounts, and some web customer service, while the telephone seats were still staffed by humans. A telephone robot puts a layer of speech recognition in front of the agent, plus speech synthesis, and if either is done poorly the customer experience is very bad. You may have received such a call: you realize it is a robot, the speech recognition is poor, and you quickly lose patience. Or the synthesis is good enough that you did not think it was a robot at first, but after two sentences it gives you the same canned feedback and you know it is a robot, because the recognition or the semantic parsing failed. Telephone customer service is a very comprehensive and challenging product that we need to continue to explore. If done well, it can largely solve the problems above and spare many customers from waiting. All of this needs further optimization in the future.

The previous part covered some scenario-level problems; this part covers how they were later implemented on our cloud, including some solutions for offline scenarios and some directions we are working on. To review: Mr. Luo also talked about Tencent Cloud's speech capabilities, including speech recognition, speech synthesis, and so on. We have packaged these technologies into solutions. What practical problems can these solutions solve?

The first is the solution for live-broadcast content safety. In the field of Internet content, security review has always been a topic of great importance to regulators. For a live broadcast platform, the content is not controllable: if an anchor behaves improperly in the live room, the platform receives many reports, which causes a lot of trouble. The earliest content moderation worked at the image level, which sometimes failed to solve the problem: perhaps there is nothing wrong with the picture, but the words spoken are wrong, or the sound is abnormal throughout the broadcast; or the picture is fine and only small overlaid text is the violation, and it still needs to be detected. If anything against the rules is said during a live broadcast, the platform takes the stream offline or receives an alarm. So content review needs to combine images, speech, and even audio and sound recognition.
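The multi-modal decision described above can be sketched very simply: a stream is flagged if either the image model or the speech transcript trips a rule. This is a hypothetical illustration, not the actual product logic; the model score, banned words, and thresholds are all invented:

```python
# Hypothetical multi-modal review decision. A stream is actioned if
# EITHER the image model score exceeds a threshold OR the ASR
# transcript contains a banned phrase. All names/values are made up.
def review(image_score: float, transcript: str,
           banned=("gambling", "violence"), threshold=0.8) -> str:
    if image_score >= threshold:
        return "offline"        # clear visual violation: take stream down
    if any(word in transcript.lower() for word in banned):
        return "alert"          # audio-only violation: raise an alarm
    return "pass"

print(review(0.2, "welcome to my cooking stream"))           # pass
print(review(0.2, "place your bets, gambling starts now"))   # alert
print(review(0.9, "hello everyone"))                         # offline
```

A real system would combine calibrated model scores rather than hard keyword matches, but the structure — fusing image and speech signals into one decision — is the same.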

Customer service quality inspection is another scenario, for the many telephone service operations. The quality of dialogue between agents and customers is not controllable; mature platforms handle it well, but at some fast-growing Internet finance companies, for example in payment collection, the staff's language is not standard, sometimes even abusive. That leads to many complaints that a certain platform's customer service is particularly uncivil. Quality inspection used to rely on humans: with, say, 20 seats, a person can only spot-check a few calls a day. With speech recognition this good, we record the entire telephone call, convert it into text, and score the text based on keywords or specific business logic to evaluate whether the customer service complies with management practices. This too is based on speech recognition.
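The keyword-and-rule scoring step can be sketched as follows. This is a hypothetical illustration: the rule sets, weights, and the example transcript are all invented, not a real product configuration:

```python
# Hypothetical transcript scoring for customer-service QC.
# Forbidden/required phrase lists and point deductions are made up.
FORBIDDEN = {"stupid", "shut up"}       # abusive language -> heavy penalty
REQUIRED = {"hello", "thank you"}       # mandatory courtesy phrases

def score_transcript(text: str) -> dict:
    lower = text.lower()
    violations = [w for w in FORBIDDEN if w in lower]
    missing = [w for w in REQUIRED if w not in lower]
    # Start from 100 and deduct per issue; weights are illustrative.
    score = 100 - 40 * len(violations) - 10 * len(missing)
    return {"score": max(score, 0),
            "violations": violations,
            "missing": missing}

report = score_transcript("Hello, your bill is 30 yuan. Thank you for calling.")
print(report["score"])  # 100
```

In production the input text would come from the ASR output of the recorded call, and the scoring rules would encode the operator's actual service standards.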

The smart court solution is also interesting. The clerk needs to make a record of what everyone said, and in the courtroom, stenographers' personal habits differ: something is missing here, something extra is added there, and the court record ends up hard to read. In this scenario we put speech recognition in: the judge has a microphone in front, and the defendant and the plaintiff each have their own microphone. The microphone tells us who said each sentence; voice is converted to text, and combining the two produces a structured record: what the prosecutor said, what the judge said, what the defendant said. Subsequent retrieval of case files is then based on this record.
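The merging step — combining per-microphone recognition results into one speaker-attributed record — can be sketched like this. The speaker names, timestamps, and utterances are invented for illustration:

```python
# Hypothetical merge of per-microphone ASR output into a courtroom
# record. Each microphone identifies its speaker; sorting by timestamp
# interleaves the utterances into one readable transcript.
judge     = [(3.2, "Please state your name."), (40.1, "Any objections?")]
plaintiff = [(10.5, "My name is Wang."), (45.0, "No objection.")]
defendant = [(20.0, "I deny the claim.")]

streams = [("Judge", judge), ("Plaintiff", plaintiff), ("Defendant", defendant)]
tagged = sorted((t, speaker, text)
                for speaker, utts in streams
                for t, text in utts)
record = [f"{speaker}: {text}" for _, speaker, text in tagged]
print(record[0])  # Judge: Please state your name.
```

Using one microphone per speaker sidesteps the much harder problem of speaker diarization on a single mixed channel, which is one reason this courtroom setup works well.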

The scenarios above describe how speech recognition technology can help many traditional industries and government agencies offline.

Who is Xiao Wei? We have packaged a human-computer interaction platform; the mature overseas counterpart is Amazon's. On this platform you can use simple dialogue, such as "check the weather for me". Behind Xiao Wei is a technology platform that pulls together speech recognition, speech synthesis, dialogue processing, and a variety of other capabilities to achieve the effect of human-computer interaction.

Beyond the ability to converse, is the platform actually useful? This is also Tencent's own advantage: we package capabilities such as Tencent Music into the Xiao Wei platform, so Xiao Wei's users can use them very conveniently.

The voice interaction platform maps onto various hardware to some extent, including robots and so on. Based on these hardware partners, a complete ecosystem of human-computer interaction terminals is finally formed. Harman Kardon, for example, makes very high-end speakers, and in cooperation with Tencent Xiao Wei the bass and midrange are very good: a plain speaker is just a human-computer interaction endpoint, but if you also want to listen to music, this meets the higher requirement.

Q&A:

**Q: ** For some minority languages the corpus is not sufficient, so the recognition rate is not high. How do you calculate the recognition rate?

**A: ** The speech recognition output is compared word by word against the reference text; the industry also uses sentence error rate.
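The standard metric behind this word-by-word comparison is the Word Error Rate (WER): the minimum number of substitutions, insertions, and deletions needed to turn the recognized word sequence into the reference, divided by the reference length. A minimal sketch (the example sentences are made up):

```python
# Word Error Rate: edit distance between recognized and reference
# word sequences, divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = min edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion ("the") + one substitution ("conditioning" -> "conditioner")
# over a 5-word reference gives 2/5 = 0.4.
print(wer("turn on the air conditioning", "turn on air conditioner"))  # 0.4
```

Sentence error rate, also mentioned in the answer, is simpler still: the fraction of sentences whose transcription is not an exact match.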

**Q: ** I want to ask about Xiao Wei; similar products from others are already out. For example, Microsoft's, Amazon's, and Google's have corresponding ecosystems, as well as development tools, application scenarios, and features. Have you compared the ease of use of the SDKs or development platforms? Can you share the comparison results?

**A: ** To be honest, every company is progressing quickly, so it is difficult to give a precise comparison. The foreign products with high maturity are Amazon's and Google's, because their ecosystems are relatively complete, with a large number of developers and underlying application platforms; China is still at an initial stage. Tencent's advantages here include friendliness to developers and the capabilities of the underlying hardware. Tencent also has strong original content: we have excellent content such as QQ Music, and we help developers improve in this area; engagement with the hardware platform is also very high. It is hard to give an exact figure, because the market is still early and everyone is developing in parallel.
