The physical principles of speech recognition and the human voice, and the concepts of true voice, falsetto, head voice and mixed voice

Continuing the recent topic: Zhihu has many articles on speech recognition, but most of them go from black box to black box and never touch the physical process behind it.

In fact, human vocalization is an interesting process. Let’s look at it from the first principles of physics, which also helps in understanding the terms used in singing.

1. Spectrum, and the sources of vowels and consonants

Sound comes from vibration. The key tool for studying it is spectral analysis (Fourier analysis), which breaks a sound down into its frequency components.
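To make the idea of a frequency breakdown concrete, here is a minimal sketch using NumPy. The test signal is synthetic (a 130 Hz fundamental plus two weaker overtones, chosen for illustration), and the helper name `magnitude_spectrum` is my own; it is not how the mini program computes its plots.

```python
import numpy as np

def magnitude_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, magnitudes) for a real-valued signal."""
    window = np.hanning(len(signal))            # taper the edges to reduce leakage
    spectrum = np.fft.rfft(signal * window)     # FFT of a real signal
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, np.abs(spectrum)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate        # one second of samples
# Synthetic "voice": 130 Hz fundamental plus two weaker overtones (assumed values)
signal = (1.00 * np.sin(2 * np.pi * 130 * t)
          + 0.50 * np.sin(2 * np.pi * 260 * t)
          + 0.25 * np.sin(2 * np.pi * 390 * t))

freqs, mags = magnitude_spectrum(signal, sample_rate)
strongest = freqs[np.argmax(mags)]              # frequency of the tallest peak
print(f"strongest component: {strongest:.1f} Hz")
```

With one second of audio the frequency bins are 1 Hz apart, so each sine wave in the mixture shows up as a narrow peak at its own frequency.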

Now look at the following three spectra, taken from the WeChat mini program “Sound wizard”, which I wrote. You can find it on WeChat:

This is the “MA” sound (the vertical scale covers frequencies from 20 to 20,000 Hz):

This is the “MI” sound:

This is the “MU” sound:

These three sounds have about the same pitch (the fundamental frequency, which is the leftmost peak, circled in red), around 130 Hz. So why can we tell them apart? The answer is the location of their formants.

Here, vowel A is particularly strong at 700 Hz and 1400 Hz, vowel I is particularly strong at 2000 Hz and 3500 Hz, and the peaks of vowel U are not obvious (because the corresponding frequencies are low). If you are interested, you can measure your own formant frequencies: the exact values differ from person to person, but the overall pattern is the same.

As shown in the figure below, the formant peaks differ because we can flexibly change the volume and shape of each cavity when producing different vowels, which changes the resonance and shapes the desired spectrum. All of this is done subconsciously, which is why the human voice can vary in countless ways.

The frequency spectrum produced by the vibration of the vocal cords themselves is very simple, but it is shaped by the cavities to form different vowels, and even different timbres.
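The idea of a simple source shaped by cavity resonances can be sketched with a toy model. Everything here beyond the formant frequencies quoted above is an assumption made for illustration: the 1/k roll-off of the source, the Gaussian shape and 150 Hz width of each resonance, the small baseline gain, and the function name `vowel_harmonics`.

```python
import numpy as np

def vowel_harmonics(f0, formants, bandwidth=150.0, max_freq=5000.0):
    """Toy source-filter model: return (harmonic frequencies, amplitudes)."""
    k = np.arange(1, int(max_freq // f0) + 1)   # harmonic numbers 1, 2, 3, ...
    harmonics = k * float(f0)                   # the source only contains multiples of f0
    source = 1.0 / k                            # assumed 1/k roll-off of the vocal-cord source
    gain = np.full(len(k), 0.02)                # assumed small gain away from any resonance
    for centre in formants:                     # each cavity resonance adds a Gaussian bump
        gain += np.exp(-0.5 * ((harmonics - centre) / bandwidth) ** 2)
    return harmonics, source * gain

f0 = 130.0                                      # same pitch for both vowels
freqs_a, amps_a = vowel_harmonics(f0, formants=(700.0, 1400.0))    # "A"
freqs_i, amps_i = vowel_harmonics(f0, formants=(2000.0, 3500.0))   # "I"
strongest_a = freqs_a[np.argmax(amps_a)]
strongest_i = freqs_i[np.argmax(amps_i)]
print(strongest_a, strongest_i)
```

Both vowels use the same set of harmonics (multiples of 130 Hz); only the gain applied by the "cavities" differs, which moves the strongest harmonic to a different part of the spectrum.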

The analysis above covers vowels. Where do consonants come from? Consonants come from how the sound changes over time, so they have to be viewed on a spectrogram, which shows how the spectrum evolves over time. A skilled reader can read speech directly from one:

Speech recognition can of course be done on a spectrogram like this. But working with the full spectrogram is expensive, so people use various shortcuts, or simply work on the raw waveform.
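A spectrogram is just a sequence of spectra computed on short, overlapping frames. The following is a minimal sketch with NumPy; the frame length, hop size and the two-tone test signal are assumptions picked so the result is easy to check, and real speech front ends usually add further steps that are not shown here.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=512, hop=256):
    """Short-time Fourier magnitudes: returns (times, freqs, mags[frame, bin])."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))  # one spectrum per frame
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return times, freqs, mags

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate        # one second of samples
# The sound changes halfway through: 250 Hz first, then 1000 Hz
signal = np.where(t < 0.5,
                  np.sin(2 * np.pi * 250 * t),
                  np.sin(2 * np.pi * 1000 * t))

times, freqs, mags = spectrogram(signal, sample_rate)
dominant = freqs[np.argmax(mags, axis=1)]       # strongest frequency in each frame
print(dominant[0], dominant[-1])
```

A single spectrum of the whole second would show both tones at once; the per-frame view is what reveals that one comes before the other.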

2. True voice, falsetto, head voice, mixed voice, marginalization, pharyngeal voice, closure

The various vocal concepts in singing also show up clearly in the spectrum. The mini program “Sound wizard” has an “authenticity” score, which reflects the number of overtones.

In general, the higher the authenticity, the higher the quality of the sound. Of course, sometimes we use falsetto to achieve special emotional effects.
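One way to turn "number of overtones" into a number is to count the harmonics that stand out in the spectrum. The sketch below is only an illustration of that idea, not the formula the mini program uses: the function name `count_overtones`, the -30 dB threshold, the 5000 Hz limit and the two synthetic test tones are all assumptions, and the fundamental frequency is passed in rather than detected.

```python
import numpy as np

def count_overtones(signal, sample_rate, f0, threshold_db=-30.0, max_freq=5000.0):
    """Count harmonics 2*f0, 3*f0, ... within threshold_db of the strongest peak."""
    window = np.hanning(len(signal))
    mags = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    floor = mags.max() * 10 ** (threshold_db / 20.0)   # magnitude cut-off
    count = 0
    k = 2                                              # k = 1 is the fundamental itself
    while k * f0 <= max_freq:
        nearest_bin = np.argmin(np.abs(freqs - k * f0))
        if mags[nearest_bin] >= floor:
            count += 1
        k += 1
    return count

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
f0 = 220.0
# "Rich" tone: ten harmonics with 1/k amplitudes; "thin" tone: fundamental plus one weak overtone
rich = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))
thin = np.sin(2 * np.pi * f0 * t) + 0.2 * np.sin(2 * np.pi * 2 * f0 * t)

rich_count = count_overtones(rich, sample_rate, f0)
thin_count = count_overtones(thin, sample_rate, f0)
print(rich_count, thin_count)
```

A tone built from many harmonics scores higher than one that is almost a pure sine wave, which matches the contrast between true voice and falsetto described here.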

True voice is characterized by many overtones (peaks) and a low frequency. Try to get a feel for singing with an authenticity above 2, which can be called a fairly good “chest voice”. This usually takes a slightly higher volume, so keep some distance from the phone to make sure the recording does not clip:

Falsetto is characterized by few overtones (few peaks), and therefore low “authenticity”. For example, here is a pure falsetto “A” at a high pitch, where only the fundamental frequency and the formant of “A” stand out:

If the pitch goes higher still, even the formant of “A” disappears and the “authenticity” becomes very low. The reason is that overtones can only appear at integer multiples of the fundamental frequency, so a formant is visible only where an overtone lands on it. Here the fundamental is 659 Hz, and the second harmonic, 659 × 2 = 1318 Hz, already lies above the formant of “A”.
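The arithmetic behind this can be checked in a few lines. The helper `harmonics_up_to` is a hypothetical name of mine; the frequencies 130 Hz, 659 Hz, 700 Hz and 1400 Hz are the ones quoted in this article.

```python
def harmonics_up_to(f0, limit_hz):
    """Return the harmonic frequencies f0, 2*f0, ... that do not exceed limit_hz."""
    count = int(limit_hz // f0)
    return [k * f0 for k in range(1, count + 1)]

# Low pitch: five harmonics fit below the 700 Hz formant of "A"
print(harmonics_up_to(130, 700))    # → [130, 260, 390, 520, 650]

# High falsetto pitch: only the fundamental itself fits below 700 Hz
print(harmonics_up_to(659, 700))    # → [659]

# Up to 1400 Hz there are ten harmonics at 130 Hz but only two at 659 Hz
print(len(harmonics_up_to(130, 1400)), harmonics_up_to(659, 1400))   # → 10 [659, 1318]
```

The higher the fundamental, the more widely spaced the harmonics are, so fewer of them are available to pick out the shape of the formants.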

How can falsetto be made to sound more “real”? As mentioned earlier, the formants of “I” are very high, so many amateurs who are starting to work on high notes experiment with adding an “I” sound to their falsetto to make it sound more “real”. This is a vocal cord marginalization technique. Here, for example, is a falsetto with “I” mixed in, where the rear (high-frequency) part of the spectrum rises:

At the extreme of marginalization is the “pharyngeal voice”, which sounds a bit like singing and is used in folk singing; it can also be added to pop songs in moderation. However, while these methods make falsetto sound more “solid”, the result is still sharp, and they cannot produce a mixed voice by themselves.

Mixed voice is an important goal of vocal practice. It lets the high notes sound as if there were no register break, with a tone that is beautiful, pure and full. For example, JJ Lin (Lin Junjie) has shown very good mixed-voice technique in recent years, and Li Jian also sings in a mixed voice.

In the video below, Lin’s voice is so smooth in the high passage starting at 40 seconds that even the female voice sounds rough by comparison. This is typical mixed voice.

I want to sing live with you. JJ Lin and Mei Luo sing “Songs written for No one”. JJ is really good… (k.sina.com.cn)

The spectrum of a high note in mixed voice is shown below. The overtones are as rich as in true voice (there are multiple high peaks), and the frequency can be very high. (This was recorded on Android, and you can see that the lowest and highest frequencies are cut off at both ends, so I recommend recording with an iPhone.)

A good mixed voice has an “authenticity” score of 2 or more. If you can do that, you must be a master of closure.

This is because mixed voice works by combining the Bernoulli effect of the airflow, muscle control and the cooperation of the cavities so that the vocal cords are almost closed, which turns the voice into something like a flute or pipe instrument. Such instruments are characterized by rich overtones and a melodious sound.

It sounds simple, but it is actually very difficult. Once you can do it, you get the “feeling of finding a fulcrum in the voice”. You can then switch into mixed voice before reaching the register break, which lets you pass through the break with ease, keeps the timbre unified, and sounds very good.

Simpler than mixed voice is head voice. Head voice is higher in frequency and has fewer overtones, but it still sounds good:

Finally, there is the whistle register (dolphin voice). The whistle is also produced through closure. On the spectrum it looks like falsetto, but that is probably just because the fundamental frequency is so high that it is hard for any formant to show up. The frequency of a whistle can go extremely high:

3. Summary

This article has introduced the most commonly used concepts of vocalization. You can search for the “Sound wizard” mini program on WeChat to get an intuitive picture of the type and quality of your own voice.

AI-assisted voice training will be added in the future, along with more interesting features (such as matching your voice with artists or other users, matching songs to your voice, and finding problems). If you like it, please recommend it to your friends.
