Text to speech is so easy

This article includes audio samples; see the original post to listen to them.

Preface

Hello everyone, I'm Pony. Over the past two days I have been looking into text-to-speech. Sometimes I worry that my Mandarin is not standard enough, and when recording a video, speaking off the cuff can leave me stuck for words. In those cases you can prepare the text first and let AI generate the audio. Here is what I found.

Text To Speech

There is plenty of text-to-speech software, but most of it is paid. Developers usually go with the APIs offered by cloud vendors, such as Aliyun speech synthesis [1] and Tencent Cloud speech synthesis [2]. The best option, though, is Microsoft TTS [3]. It covers 119 languages and variants with more than 270 neural voices, and offers realistic synthesized speech with fine-grained audio controls; many developers consider it the best speech synthesis software available. I went to sign up too, but got stuck at the very first step: registration requires a Visa card, which I do not have. Just as I was about to apply for one, an idea came to me: could I record the output with the recently popular WebRTC API instead?

How it works

The Microsoft TTS website has a JavaScript-based text-to-speech demo (Aliyun and Tencent Cloud have one too), so we can record straight from the official demo. The idea is to inject a JS script into that page. I wrote it as a Tampermonkey userscript, and it uses the WebRTC API to do the recording. See the original post for the result.

Usage

Step 1: Install the Tampermonkey extension for Chrome, then install this script [5].

Step 2: When recording starts, allow the recording prompt at the top of the Chrome window. If nothing is recorded on a Mac, enable the recording permission in Settings.

Step 3: Enter the text you want, click Play first, then click Start to begin recording. Click Stop to end the recording, after which you can download the audio file.

SSML syntax

The text input area has an SSML tab. SSML stands for Speech Synthesis Markup Language. It is XML-based, much like HTML, and lets you fine-tune the synthesized speech: syllables, pronunciation, speaking rate, volume, and so on.
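
As a small sketch of the rate and volume controls mentioned above, the standard SSML prosody element takes rate, pitch, and volume attributes. The voice name is reused from the other examples in this article and the sentences are made up for illustration; the exact values each voice supports may differ, so treat this as a starting point rather than a reference.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <!-- Default delivery, for comparison -->
        This sentence is spoken normally.
        <!-- Slower and louder than the default -->
        <prosody rate="slow" volume="loud">
            This sentence is spoken slowly and loudly.
        </prosody>
        <!-- Faster, with a higher pitch -->
        <prosody rate="fast" pitch="high">
            This sentence is spoken quickly at a higher pitch.
        </prosody>
    </voice>
</speak>
```

Named levels such as slow or loud are the easiest to start with; once the result sounds roughly right, they can be swapped for relative or numeric values.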

Using multiple voices

For example, the following code simulates a conversation between two people:

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
<voice name="en-US-JennyNeural">Good morning!</voice>
<voice name="en-US-ChristopherNeural">Good morning to you too Jenny!</voice>
</speak>

Adjust your speaking style

Use the mstts:express-as element to express emotions (such as cheerfulness, empathy, and calm). Voices can also be optimized for different scenarios, such as customer service, newscasts, and voice assistants.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            That'd be just amazing!
        </mstts:express-as>
    </voice>
</speak>

Adjust the speaking language

Use the lang element to change the speaking language:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="zh-CN-XiaoxiaoNeural">Welcome to follow the WeChat official account JS cool. In English, that is:<lang xml:lang="en-US">
            Welcome to follow the WeChat official account JS cool
        </lang>
    </voice>
</speak>

Style intensity

Adjust the intensity of the speaking style to better suit your scenario. The styledegree attribute specifies a stronger or softer style, making the speech more expressive or more subdued. Style intensity adjustment is supported for the Chinese (Mandarin, Simplified) neural voices.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
    <voice name="zh-CN-XiaoxiaoNeural">
        <mstts:express-as style="sad" styledegree="2">Let's go. Be careful on the way. Go early and return early.</mstts:express-as>
    </voice>
</speak>
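
For contrast, a value below 1 should tone the style down. This sketch reuses the voice and style from the example above; the value 0.5 and the sentence are illustrative, and both the default of 1 and the accepted range are assumptions to confirm in the official documentation.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
    <voice name="zh-CN-XiaoxiaoNeural">
        <!-- Assumption: 1 is the default intensity, so 0.5 softens the sad style -->
        <mstts:express-as style="sad" styledegree="0.5">Take care on the road.</mstts:express-as>
    </voice>
</speak>
```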

Add a pause

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        Welcome to Microsoft Cognitive Services <break time="100ms" /> Text-to-Speech API.
    </voice>
</speak>
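
The break element also accepts a strength attribute in place of an explicit duration. A minimal sketch, assuming the same voice as above; the sentence is illustrative, and how long each strength actually pauses depends on the service.

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <!-- strength is a relative level: none, x-weak, weak, medium, strong, x-strong -->
        First point. <break strength="strong" /> Second point.
    </voice>
</speak>
```

Use time when you need an exact pause, and strength when a relative pause is enough.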

Specify paragraphs and sentences

The p and s elements mark paragraphs and sentences, respectively:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <p>
            <s>Introducing the sentence element.</s>
            <s>Used to mark individual sentences.</s>
        </p>
        <p>
            Another simple paragraph.
            Sentence structure in this paragraph is not explicitly marked.
        </p>
    </voice>
</speak>

For more information, please refer to the official documentation [4].

Application example

The weekend is here, a good time to relax while learning. Let's enjoy a video together.

What did I do? I downloaded a trailer from a trailer site, looked up the film's synopsis, converted the synopsis to audio, and then combined the two into a video.

See the trick? Many "original" videos on Douyin are simply reposted footage plus AI dubbing.

<speak
  version="1.0"
  xmlns="http://www.w3.org/2001/10/synthesis"
  xmlns:mstts="https://www.w3.org/2001/mstts"
  xml:lang="en-US"
>
  <voice name="zh-CN-XiaoxiaoNeural">
    <p>
      <w>The Watergate Bridge of Chosin Lake</w> is a war film produced by Chen Kaige, Tsui Hark, and Dante Lam, directed by <w>Tsui Hark</w>, and starring Wu Jing and Yi Yangqianxi.</p>
    <p>Set against the Battle of Jangjin-ho during the second campaign of the Korean War, the film tells the story of the soldiers of the Seventh Company, who are given an even tougher mission after the battles of Sinhung-ni and Hagaru-ri.</p>
    <p>
      <w>The Watergate Bridge of Chosin Lake</w>, the sequel to the film <w>The Chosin Reservoir</w>, tells how the soldiers of the Seventh Company, after the battles of Sinhung-ni and Hagaru-ri, face more challenges and fiercer fire. They must block the enemy at the Watergate Bridge, a chokepoint on the retreat route of the US Marine division. The mission is tougher, the battle scenes are more intense, and the enormous sacrifice made for victory is all the more unforgettable.</p>
  </voice>
</speak>

Audio: see the original post.

Summary

1. At present, navigator.mediaDevices.getDisplayMedia() cannot capture the computer's audio output directly. The computer has to play the sound through its speakers while it is being recorded, so record in a quiet environment.

2. Playback may stutter when the network is slow, so use a better connection. I later switched to my phone's hotspot and the stuttering stopped.

That’s all the content of this article. I hope this article is helpful to you. You can also refer to my previous articles or share your thoughts and experiences in the comments section.

[1] Aliyun speech synthesis: ai.aliyun.com/nls/tts

[2] Tencent Cloud speech synthesis: cloud.tencent.com/product/tts

[3] Microsoft TTS: azure.microsoft.com/zh-cn/servi…

[4] Official documentation: docs.microsoft.com/zh-cn/azure…

[5] Tampermonkey script: greasyfork.org/zh-CN/scrip…