Introduction: Since the beginning of 2021, the explosive popularity of Clubhouse has drawn capital attention to the audio social market, and related domestic products such as Lizhi (Lychee) and Inke (Yingke) have seen sharp rises in their stock prices. Audio is expected to see long-term, broad demand in the social field.

Supported by mobile Internet technology, audio social networking can not only meet social needs across multiple scenarios but also bring innovation in experience; AI and 5G in particular are expected to drive it further. Many social products use audio technology to add voice changing, voice beautification, stereo, reverb and scene sound effects to enrich the user’s listening experience. This technical article explains which algorithms are used to implement voice changing in audio social scenarios.

01 How does voice changing work?

When we watch videos, we sometimes play them at double speed. When the video speeds up, a male voice sounds a bit like a “female voice”; when we slow it down, we hear something like the slow-talking “sloth voice” from Zootopia. These are simple voice changes.

From a technical point of view this is easy to understand: if we capture a 100 Hz sine wave at a 16 kHz sampling rate and play it back at 32 kHz or 8 kHz, the frequency of the sine wave is doubled (200 Hz) or halved (50 Hz). Raising or lowering the frequency of audio in this way is very simple: to double the frequency, drop every other sample; to halve it, insert a linearly interpolated sample between each pair of neighboring samples. In technical terms, this is resampling. Resampling does change the pitch, but the audio also becomes shorter or longer than the original input, which is not acceptable in real-time communication. What real-time communication needs is pitch change without speed change, which resampling alone cannot provide. Besides resampling, however, there are other methods that can meet this requirement.
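The resampling effect described above can be checked with a small sketch. It assumes a fixed 16 kHz playback rate and uses hypothetical helper names (`sine`, `decimate_by_two`, `interpolate_by_two`, `estimate_freq`); the frequency estimate is a rough zero-crossing count, used here only for illustration.

```python
import math

def sine(freq_hz, sample_rate, n_samples):
    """Generate n_samples of a sine wave at freq_hz, sampled at sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def decimate_by_two(x):
    """Keep every other sample: pitch doubles, length halves."""
    return x[::2]

def interpolate_by_two(x):
    """Insert a linearly interpolated sample between neighbors:
    pitch halves, length (almost) doubles."""
    y = []
    for a, b in zip(x, x[1:]):
        y.append(a)
        y.append((a + b) / 2)
    y.append(x[-1])
    return y

def estimate_freq(x, sample_rate):
    """Rough frequency estimate from the number of upward zero crossings."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(x)

SAMPLE_RATE = 16000
original = sine(100, SAMPLE_RATE, SAMPLE_RATE)   # one second of 100 Hz
higher = decimate_by_two(original)               # about 200 Hz, half a second
lower = interpolate_by_two(original)             # about 50 Hz, two seconds

for name, signal in (("original", original), ("higher", higher), ("lower", lower)):
    print(name, len(signal) / SAMPLE_RATE, "s,",
          round(estimate_freq(signal, SAMPLE_RATE)), "Hz (approx.)")
```

The printed durations show the side effect discussed in the text: the pitch changes, but so does the length of the audio.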

02 What are the common voice-changing algorithms?

Common pitch-shifting algorithms fall into time-domain, frequency-domain and parametric methods. Time-domain methods are easy to implement: speed change without pitch change (time-scale modification), followed by resampling, yields pitch change without speed change. Frequency-domain and parametric methods are more complex and require much more computation than time-domain methods. This article briefly introduces some common time-domain and frequency-domain algorithms.
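As a worked example of how the two time-domain stages combine, the sketch below computes the factors for a pitch shift expressed in semitones. The function name `pitch_shift_plan` and the use of semitones are assumptions made for illustration; they do not come from any particular library.

```python
def pitch_shift_plan(semitones):
    """Return (pitch_ratio, stretch_factor, resample_factor) for a pitch shift.

    Stage 1 (time-scale modification) makes the audio `stretch_factor` times
    longer while keeping the pitch. Stage 2 (resampling) multiplies the length
    by `resample_factor`, which multiplies the pitch by `pitch_ratio` and
    restores the original duration.
    """
    pitch_ratio = 2 ** (semitones / 12)   # equal-tempered semitone ratio
    stretch_factor = pitch_ratio
    resample_factor = 1 / pitch_ratio
    return pitch_ratio, stretch_factor, resample_factor

ratio, stretch, resample = pitch_shift_plan(4)
print(f"pitch x{ratio:.3f}: stretch x{stretch:.3f}, then resample x{resample:.3f}")
```

Because the two length factors multiply to 1, the output has the same duration as the input while the pitch is scaled by the requested ratio.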

In the time domain, the main algorithms belong to the overlap-add (OLA) family: OLA, Synchronized Overlap-Add (SOLA), Synchronized Overlap-Add with Fixed Synthesis (SOLAFS), Time-Domain Pitch-Synchronized Overlap-Add (TD-PSOLA) and Waveform Similarity Overlap-Add (WSOLA). In the frequency domain, the main one is Pitch-Synchronized Overlap-Add (PSOLA), among others.

1) OLA

OLA is the simplest and crudest time-scale modification (TSM) approach. The original speech is divided into frames, and some of the frames are repeated or discarded to reconstruct the speech, which gives a simple voice-changing effect. The principle, as shown in the figure, is as follows:

A. Split the time-domain audio into frames and process it frame by frame;

B. Take the first frame from the input signal x and apply a Hann window to it;

C. Take the second frame at a fixed interval Ha after the first frame;

D. Window the second frame and overlap-add it with the first frame.

Repeating this until the end of the speech reconstructs a new, voice-changed speech signal. However, the algorithm has limitations: it cannot guarantee that the speech is continuous, so pitch breaks may appear. These sound like clicks and distort the audio.
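Steps A to D can be sketched as follows. This is a minimal illustration, not a production implementation: the function names, the 512-sample frame, the 50% synthesis overlap and the `stretch` parameter are all assumptions. Frames are read from the input every Ha samples and written to the output every Hs samples, so the ratio Hs/Ha sets the change in duration.

```python
import math

def hann(n):
    """Periodic Hann window; at 50% overlap, shifted copies sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def ola_time_stretch(x, stretch, frame_len=512):
    """Naive OLA time-scale modification.

    stretch > 1 lengthens the signal, stretch < 1 shortens it.
    frame_len is assumed to be even.
    """
    hs = frame_len // 2                      # synthesis hop (output spacing)
    ha = max(1, round(hs / stretch))         # analysis hop Ha (input spacing)
    window = hann(frame_len)
    n_frames = (len(x) - frame_len) // ha + 1 if len(x) >= frame_len else 0
    if n_frames == 0:
        return []
    y = [0.0] * ((n_frames - 1) * hs + frame_len)
    for m in range(n_frames):
        frame = x[m * ha : m * ha + frame_len]       # step C: frame at m * Ha
        for i, (sample, w) in enumerate(zip(frame, window)):
            y[m * hs + i] += sample * w              # steps B and D
    return y

# One second of a 200 Hz tone at 16 kHz, stretched to roughly two seconds.
tone = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(16000)]
stretched = ola_time_stretch(tone, 2.0)
print(len(tone), "->", len(stretched), "samples")
```

Nothing in this sketch aligns the waveforms of overlapping frames, which is the source of the discontinuities described above.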

2) Waveform Similarity Overlap-Add (WSOLA)

The limitations of the simple, crude OLA algorithm are clear, and so is their cause: phase discontinuity. To reduce pitch breaks and phase discontinuities, Verhelst and Roelands proposed Waveform Similarity Overlap-Add (WSOLA). This algorithm is currently used in the open-source SoundTouch library. Its principle is as follows:

A. Take the first frame from the original audio, window it and output it to the signal y;

B + C. Search for the second frame within the blue dotted range so that its phase is aligned with that of the first frame: within that range, find the most similar frame and use it as the second frame of the signal y;

D. Window the most similar frame and overlap-add it with the first frame in the signal y.

The key is how to find the most similar frame in steps B and C. One of the most straightforward ways, presented in many papers, is to compute a correlation between the candidate frames. Although WSOLA solves the problems of pitch breaks and phase discontinuity, it affects the timbre, which is most noticeable when WSOLA is applied to percussive audio.
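The similarity search in steps B and C can be sketched as below. This is an assumption-laden illustration: the function name `best_offset`, its parameters and the use of an unnormalized cross-correlation are choices made here, not taken from SoundTouch or from the original paper. The reference is the segment that would naturally follow the previously copied frame in the input, and each candidate frame within the tolerance range is scored against it.

```python
import math

def best_offset(x, target_start, natural_start, frame_len, tolerance):
    """Return the offset in [-tolerance, tolerance] such that the frame at
    x[target_start + offset] best matches the natural continuation
    x[natural_start : natural_start + frame_len]."""
    reference = x[natural_start : natural_start + frame_len]
    best, best_score = 0, float("-inf")
    for offset in range(-tolerance, tolerance + 1):
        start = target_start + offset
        if start < 0 or start + frame_len > len(x):
            continue                     # candidate would fall outside the input
        candidate = x[start : start + frame_len]
        score = sum(r * c for r, c in zip(reference, candidate))
        if score > best_score:
            best, best_score = offset, score
    return best

# A sine with a 100-sample period: the best candidate is the one whose
# phase matches the reference segment.
signal = [math.sin(2 * math.pi * n / 100) for n in range(2000)]
offset = best_offset(signal, target_start=530, natural_start=250,
                     frame_len=200, tolerance=50)
print("chosen offset:", offset)
```

A normalized correlation or an average magnitude difference could be used instead; the unnormalized form is shown only because it is the shortest to write.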

3) Pitch-Synchronized Overlap-Add (PSOLA)

PSOLA works on a different principle from WSOLA. PSOLA is processed in the frequency domain and additionally achieves pitch synchronization. In this algorithm, speed change and pitch change are two independent processes controlled by different parameters. The pitch is detected first and the pitch periods are marked. The speech is then divided into synthesis units according to the marked pitch periods. Speed is controlled by repeating or discarding synthesis units. The fundamental frequency can be changed by changing the overlap length of adjacent synthesis units, or by resampling combined with speed change.

Because PSOLA modifies only the fundamental frequency, the formants are well preserved and the timbre is not affected much. However, the algorithm works in the frequency domain and is computationally expensive, so it is difficult to meet the requirements of real-time speed and pitch processing.
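The first stage of PSOLA, detecting the pitch so that pitch periods can be marked, might be sketched as follows. The article does not say which pitch detector is used; the simple time-domain autocorrelation estimator below, along with the name `estimate_pitch_period` and its lag bounds, is an assumption for illustration only.

```python
import math

def estimate_pitch_period(frame, min_lag, max_lag):
    """Return the lag (in samples) with the highest autocorrelation,
    searched over [min_lag, max_lag]. The lag bounds should bracket the
    expected pitch range of the voice."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

SAMPLE_RATE = 16000
# A 160 Hz test tone, i.e. a 100-sample pitch period at 16 kHz.
frame = [math.sin(2 * math.pi * 160 * n / SAMPLE_RATE) for n in range(800)]
period = estimate_pitch_period(frame, min_lag=40, max_lag=200)
print("period:", period, "samples, f0 about", SAMPLE_RATE / period, "Hz")
```

Once the period is known, pitch marks can be placed roughly one period apart, and the synthesis units described above are taken around those marks. Real speech needs a more robust detector, including handling of unvoiced segments.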

03 Conclusion

This has been a brief introduction to three common voice-changing algorithms, which can roughly produce an “uncle” voice, a “girl” voice, a “monster” voice and so on. Making the result sound more real and natural requires further optimization and tuning. Beyond these algorithms there are other voice effects, such as the common “thriller” effect built on vibrato or tremolo, and the “ethereal valley” effect built on echo; all of these are based on traditional signal processing. In addition to traditional signal processing, there is a more advanced approach: AI voice changing. Compared with traditional signal processing methods, AI voice changing makes the voice sound more real and natural.
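For the tremolo and echo effects mentioned above, a minimal sketch could look like this. The function names and the parameters (`rate_hz`, `depth`, `delay_s`, `decay`) are illustrative assumptions; real products typically add more stages and tuning.

```python
import math

def tremolo(x, sample_rate, rate_hz=5.0, depth=0.5):
    """Amplitude modulation by a low-frequency sine.
    The gain swings between 1 - depth and 1."""
    return [s * (1 - depth * 0.5 * (1 + math.sin(2 * math.pi * rate_hz * n / sample_rate)))
            for n, s in enumerate(x)]

def echo(x, sample_rate, delay_s=0.25, decay=0.5):
    """Feedback delay: each output sample adds a decayed copy of the
    output from delay_s seconds earlier, so echoes repeat and fade."""
    delay = max(1, int(delay_s * sample_rate))
    y = list(x)
    for n in range(delay, len(y)):
        y[n] += decay * y[n - delay]
    return y

SAMPLE_RATE = 16000
voice = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wobbly = tremolo(voice, SAMPLE_RATE)
valley = echo(voice, SAMPLE_RATE)
print(len(wobbly), len(valley))
```

Both effects keep the length of the signal unchanged, so they can be applied frame by frame in a real-time pipeline as long as the echo keeps its delay buffer between frames.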

Note: The figures in this article are from “A Review of Time-Scale Modification of Music Signals”.

Follow Pano, and we will continue to share technical knowledge about our products, audio and video, and real-time communication in future articles.