“Early in the morning, I open the door, and a white world greets my eyes…”

In the movie “Love Malatang” (“Spicy Love Soup”), there is a scene in which a boy records Gao Yuanyuan’s voice as she reads a text and chats in everyday conversation, then edits the recordings into a single sentence: “I really like it.”

This film is 20 years old.


You can watch the version with bullet comments (danmu) on Bilibili: www.bilibili.com/video/av186…

These days, you no longer need such a primitive method to get your goddess to say something sweet.

Conan, the schoolboy detective, has a magic bow-tie voice changer that can turn his voice into anyone else’s. Of course, QbitAI’s topic today is not Conan himself, but how, with the help of artificial intelligence, this technology is moving from science-fiction cartoons into reality.

Two South Korean AI researchers recently launched a research project: an AI system that lets a “goddess” mimic whatever you say. In other words, whatever you want to hear, just tell the “goddess”, and she will repeat it after you~

Or you could call it the geek version of that old trick.

Speak a sentence in a male voice, and the system can make actress Kate Winslet “say” the same thing.


Pictured: Winslet, the female lead of Titanic



How did they do it?


The technology behind this is called voice conversion with non-parallel data.

The authors are AI researchers Dabi Ahn and Kyubyong Park from Kakao Brain in South Korea.

A side note: Kakao is the largest mobile social networking company in South Korea, and its KakaoTalk is often called the Korean version of WeChat, although KakaoTalk actually launched about a year earlier than WeChat. Tencent is now Kakao’s second-largest shareholder.

When they started working on voice style transfer, their goal was to convert anyone’s voice into the voice of a specific target, so that anyone could imitate the voice of a famous actor or singer.



Their first impersonation target was the actress Winslet.

To achieve this, the authors built a deep neural network and trained it using more than two hours of Winslet’s audio material as a dataset.


The proposed framework


This is a many-to-one voice conversion system. The main significance of the work is that speech in the target’s voice can be generated without any parallel data; only raw audio waveforms of the target are needed.


Architecturally, this model consists of two modules:

Net1

This is a phoneme classifier.

  • Process: waveform -> spectrogram -> MFCCs -> phoneme distribution.

  • Net1 classifies the spectrogram at each time step into phonemes, taking a log-magnitude spectrogram as input and outputting the corresponding phoneme distribution (see the sketch after this list).

  • The objective function of Net1 is the cross-entropy loss.

  • The dataset used is TIMIT.

  • The test accuracy is over 70%. Phonemes are speaker-independent, while waveforms are speaker-dependent.
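To make the Net1 idea more concrete, here is a minimal, hypothetical sketch of a frame-level phoneme classifier trained with cross-entropy, written in PyTorch. The layer sizes, the 61-phoneme label set, the MFCC dimension, and the use of MFCC frames as the only input are assumptions for illustration; the authors’ actual implementation is in their TensorFlow repo.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only: 40 MFCC coefficients per frame and a
# 61-phoneme label set (the exact TIMIT inventory depends on how labels are collapsed).
N_MFCC, N_PHONEMES = 40, 61

class PhonemeClassifier(nn.Module):
    """Frame-level phoneme classifier in the spirit of Net1:
    MFCC frames in, a phoneme prediction out at every time step."""
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(N_MFCC, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, N_PHONEMES)

    def forward(self, mfccs):                  # mfccs: (batch, time, N_MFCC)
        h, _ = self.rnn(mfccs)                 # (batch, time, 2 * hidden)
        return self.out(h)                     # per-frame phoneme logits

model = PhonemeClassifier()
criterion = nn.CrossEntropyLoss()              # the cross-entropy objective mentioned above

# Dummy batch: 8 clips of 100 frames each, with random phoneme labels.
mfccs = torch.randn(8, 100, N_MFCC)
labels = torch.randint(0, N_PHONEMES, (8, 100))

logits = model(mfccs)
loss = criterion(logits.reshape(-1, N_PHONEMES), labels.reshape(-1))
loss.backward()
```

A softmax over the per-frame logits gives the phoneme distribution that the synthesis module consumes.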


Net2

This is a speech synthesizer, which contains Net1 as a subnet.

  • Process: Net1 (phoneme distribution) -> spectrogram -> waveform.

  • The CBHG module from Tacotron is used here, namely a 1-D convolution bank + highway network + bidirectional GRU. CBHG is good at capturing features in sequential data (see the sketch after this list). Tacotron paper: https://arxiv.org/abs/1703.10135

  • The loss is the reconstruction error between input and output.

  • Griffin-Lim reconstruction is used to recover the waveform from the spectrogram (demonstrated in the sketch after the settings below).

  • Two datasets are used here:

    • Target 1 (anonymous female): the CMU Arctic dataset. Address: http://www.festvox.org/cmu_arctic/

    • Target 2 (Winslet): a non-public dataset containing over two hours of audio of Winslet reading.
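For readers who want a feel for the CBHG block mentioned above, here is a rough PyTorch sketch of the idea: a 1-D convolution bank, max pooling, a projection with a residual connection, highway layers, and a bidirectional GRU. All layer sizes are made up, and this is not the authors’ TensorFlow implementation; see the Tacotron paper and the GitHub repo for the real thing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """One highway layer: y = t * H(x) + (1 - t) * x."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * F.relu(self.H(x)) + (1.0 - t) * x

class MiniCBHG(nn.Module):
    """Rough sketch of a CBHG block: convolution bank -> max-pool ->
    projection with a residual connection -> highway layers -> bidirectional GRU."""
    def __init__(self, in_dim, bank_size=8, hidden=128):
        super().__init__()
        # Convolution bank: kernel sizes 1..bank_size, outputs concatenated on channels.
        self.bank = nn.ModuleList(
            nn.Conv1d(in_dim, hidden, kernel_size=k, padding=k // 2)
            for k in range(1, bank_size + 1)
        )
        self.proj = nn.Conv1d(bank_size * hidden, in_dim, kernel_size=3, padding=1)
        self.highways = nn.ModuleList(Highway(in_dim) for _ in range(4))
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, in_dim)
        t = x.size(1)
        y = x.transpose(1, 2)                   # -> (batch, in_dim, time)
        y = torch.cat([F.relu(conv(y))[..., :t] for conv in self.bank], dim=1)
        y = F.max_pool1d(y, kernel_size=2, stride=1, padding=1)[..., :t]
        y = self.proj(y).transpose(1, 2) + x    # residual back to (batch, time, in_dim)
        for hw in self.highways:
            y = hw(y)
        out, _ = self.gru(y)                    # (batch, time, 2 * hidden)
        return out

# Quick shape check with dummy data: 61-dim phoneme distributions over 100 frames.
features = torch.randn(2, 100, 61)
print(MiniCBHG(in_dim=61)(features).shape)      # torch.Size([2, 100, 256])
```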


Implementation


Settings

  • Sample rate: 16,000 Hz

  • Window length: 25 ms

  • Hop length: 5 ms
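Here is a minimal sketch of how these settings translate into STFT parameters, and how Griffin-Lim turns a magnitude spectrogram back into audio, using librosa. The file name and the FFT size are assumptions, and in the real pipeline the spectrogram would come from Net2 rather than from a wav file; the authors’ repo has its own implementation details.

```python
import librosa
import numpy as np
import soundfile as sf

# Settings listed above: 16 kHz sample rate, 25 ms window, 5 ms hop.
sr = 16000
win_length = int(0.025 * sr)   # 400 samples
hop_length = int(0.005 * sr)   # 80 samples
n_fft = 512                    # assumption: the next power of two above the window

# The magnitude spectrogram here comes from an arbitrary wav file just to
# demonstrate the round trip; in the pipeline it would be predicted by Net2.
y, _ = librosa.load("some_speech.wav", sr=sr)
mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length))

# Griffin-Lim iteratively estimates the missing phase and inverts to a waveform.
y_rec = librosa.griffinlim(mag, n_iter=60, hop_length=hop_length,
                           win_length=win_length, n_fft=n_fft)
sf.write("reconstructed.wav", y_rec, sr)
```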


Procedure

Net1 and Net2 are trained in sequence: first Net1, then Net2 (see the toy sketch below).
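As a toy illustration of that order, here is a hypothetical two-stage loop in PyTorch: a phoneme classifier is trained with cross-entropy, then frozen and used as a subnet while a small synthesis head is trained with a reconstruction loss. Every shape, module, and tensor below is made up; the real training scripts are in the repo linked next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up sizes: MFCC frames in, phoneme distribution in the middle, spectrogram bins out.
N_MFCC, N_PHONEMES, N_SPEC = 40, 61, 257

net1 = nn.Sequential(nn.Linear(N_MFCC, 128), nn.ReLU(), nn.Linear(128, N_PHONEMES))
net2_head = nn.Sequential(nn.Linear(N_PHONEMES, 128), nn.ReLU(), nn.Linear(128, N_SPEC))

# Stage 1: train Net1 on phoneme classification (random stand-in for TIMIT frames).
opt1 = torch.optim.Adam(net1.parameters())
mfcc = torch.randn(256, N_MFCC)
phoneme = torch.randint(0, N_PHONEMES, (256,))
loss1 = F.cross_entropy(net1(mfcc), phoneme)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: freeze Net1, then train Net2 on the target voice with a reconstruction loss.
for p in net1.parameters():
    p.requires_grad = False
opt2 = torch.optim.Adam(net2_head.parameters())
target_mfcc = torch.randn(256, N_MFCC)
target_spec = torch.rand(256, N_SPEC)
pred_spec = net2_head(net1(target_mfcc).softmax(dim=-1))   # Net1 acts as a subnet of Net2
loss2 = F.mse_loss(pred_spec, target_spec)
opt2.zero_grad(); loss2.backward(); opt2.step()
```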

For details of the training, see the code the authors have made available on GitHub. The address is github.com/andabi/deep…






As for the project’s future development, the authors list several goals:

  • Adopt adversarial training

  • Produce clearer, cleaner audio

  • Work across languages

  • Build a many-to-many voice conversion system


In an interesting experiment, the authors are also moving from large target-voice datasets to much smaller ones for training. In other words, letting the AI listen to just one minute of the target’s voice can still yield good voice conversion!



via QbitAI