This article was originally published by Walker AI.
The vast majority of audio features originated in speech recognition tasks, where they condense the raw sampled waveform into a more compact representation that helps machines understand the semantic content of the audio. Since the late 1990s, these audio features have also been applied to music information retrieval tasks such as instrument recognition, and more features designed specifically for musical audio have emerged.
1. Categories of audio features
The point of understanding the different categories of audio features is not to classify each feature precisely, but to deepen our understanding of its physical meaning. In general, audio features can be distinguished along the following dimensions:
(1) Whether the feature is extracted directly from the signal by a model, or is a statistic computed from the model's output, such as the mean or variance;
(2) Whether the feature represents a transient value or a global value. Transient values are typically given per frame, while global values cover a longer time span;
(3) The degree of abstraction of the feature. Low-level features are the least abstract and the easiest to extract from the raw audio signal; they can be further processed into mid-level features that represent common musical elements in a score, such as pitch and note onset times; high-level features are the most abstract and are mostly used for tasks involving musical style and emotion;
(4) The feature extraction process: features extracted directly from the original signal (e.g., zero-crossing rate), features obtained after transforming the signal to the frequency domain (e.g., spectral centroid), features that require a specific model (e.g., melody), and features whose quantization scale is adjusted according to human auditory perception (e.g., MFCCs).
Taking the feature extraction process as the main classification criterion, common audio features can be grouped under each of these categories.
At the same time, some features do not belong exclusively to one category. MFCC, for example, is extracted by transforming the signal from the time domain to the frequency domain and then applying a Mel-scale filter bank that simulates the human auditory response, so it is both a frequency-domain feature and a perceptual feature.
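As a quick illustration of these distinctions, the sketch below (assuming librosa is installed; `example.wav` is a placeholder path) computes a time-domain feature and a frequency-domain feature per frame, then summarizes the frame-level values into global statistics:

```python
import librosa

# Placeholder input file; any mono audio clip works here.
y, sr = librosa.load("example.wav", sr=22050)

zcr = librosa.feature.zero_crossing_rate(y)               # time-domain, one value per frame
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # frequency-domain, one value per frame

# Frame-level (transient) features can be reduced to global statistics such as mean and variance.
print(zcr.mean(), zcr.var())
print(centroid.mean(), centroid.var())
```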
2. Common extraction tools
Here are some common tools and platforms for extracting audio features.
| Name | Website | Language |
| --- | --- | --- |
| Aubio | aubio.org | C/Python |
| Essentia | essentia.upf.edu | C++/Python |
| Librosa | librosa.org | Python |
| Madmom | madmom.readthedocs.org | Python |
| pyAudioAnalysis | github.com/tyiannak/py… | Python |
| Vamp plugins | www.vamp-plugins.org | C++/Python |
| Yaafe | yaafe.sourceforge.net | Python/MATLAB |
3. Audio signal processing
An audio digital signal is a sequence of numbers representing, in the time domain, samples of a continuously varying waveform, i.e., what we usually call the "waveform". To analyze and operate on the signal digitally, it must be sampled and quantized.
Sampling is the process of discretizing continuous time; uniform sampling means sampling at equal time intervals, and the number of samples collected per second is called the sampling frequency. The 44.1 kHz and 11 kHz commonly seen in audio files refer to this sampling rate (frequency).
Quantization transforms the continuous waveform amplitudes into discrete numbers. The full amplitude range is divided into a finite set of quantization steps (the divisions may be equally or unequally spaced), and all sample values falling within a given step are assigned the same quantized value. The bit depth of an audio file indicates the quantization resolution; a 16-bit depth means the amplitude is quantized into 2^16 levels.
The Nyquist theorem states that a signal can be accurately reconstructed from its samples if the sampling frequency is at least twice the highest frequency component of the signal; in practice, the sampling frequency used is usually significantly higher than the Nyquist rate.
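As a small sketch of how the sampling rate shows up in practice (assuming librosa is available; the file path is a placeholder), a file can be loaded at its native rate or resampled on load:

```python
import librosa

y_native, sr_native = librosa.load("example.wav", sr=None)  # keep the file's native sampling rate
y, sr = librosa.load("example.wav", sr=22050)               # resample to 22050 Hz on load

print(f"native: {sr_native} Hz, {len(y_native)} samples")
print(f"resampled: {sr} Hz, {len(y)} samples, {len(y) / sr:.2f} s")
```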
4. Common transformations
4.1 STFT
The Short-Time Fourier Transform (STFT) is suited to spectral analysis of slowly time-varying signals and is widely used in audio and image analysis and processing. The idea is to divide the signal into frames and apply the Fourier transform to each frame. Each frame of a speech signal can be regarded as a segment cut from a different stationary waveform, so the short-time spectrum of each frame approximates the spectrum of that stationary waveform.
Because a speech signal is short-time stationary, it can be processed frame by frame; computing the Fourier transform of each frame yields the short-time Fourier transform.
The Fourier transform (FFT) maps a signal from the time domain to the frequency domain, while the inverse Fourier transform (IFFT) maps it back to the time domain. Transforming from the time domain to the frequency domain is the most common operation in audio signal processing. In audio work, the spectrum obtained by the STFT is also called the spectrogram (or speech spectrogram).
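A minimal STFT sketch with librosa (the file path is a placeholder): frame the signal, apply the FFT per frame, and convert the magnitude to decibels for a spectrogram:

```python
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050)

# 2048-sample windows with a 512-sample hop; each column of D is the FFT of one frame.
D = librosa.stft(y, n_fft=2048, hop_length=512)

# D is complex-valued; the magnitude in dB is what a spectrogram displays.
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
print(D.shape)  # (1 + n_fft // 2, number of frames)
```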
4.2 Discrete cosine transform
The Discrete Cosine Transform (DCT) is a transform related to the Fourier transform. It is similar to the DFT but uses only real numbers. The DCT is equivalent to a DFT of roughly twice its length applied to a real even function (because the Fourier transform of a real even function is still real and even), and some variants shift the input or output by half a sample.
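A small sketch with SciPy showing that the type-II DCT with orthonormal scaling is invertible; this is the variant used for MFCCs later in this article:

```python
import numpy as np
from scipy.fftpack import dct, idct

x = np.random.randn(8)
X = dct(x, type=2, norm='ortho')       # forward DCT (type II)
x_rec = idct(X, type=2, norm='ortho')  # inverse DCT (type III)
print(np.allclose(x, x_rec))           # True: the round trip recovers the signal
```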
4.3 Discrete wavelet transform
The Discrete Wavelet Transform (DWT) is very useful in numerical analysis and time-frequency analysis. It discretizes the scale and translation parameters of a basic (mother) wavelet.
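A minimal sketch of a one-level DWT and its inverse, assuming the PyWavelets package (`pywt`) is installed:

```python
import numpy as np
import pywt

x = np.random.randn(256)
approx, detail = pywt.dwt(x, 'db4')       # approximation and detail coefficients (Daubechies-4 wavelet)
x_rec = pywt.idwt(approx, detail, 'db4')  # reconstruction from the two coefficient sets
print(np.allclose(x, x_rec))              # True up to floating-point error
```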
4.4 Mel spectrum and Mel cepstrum
A spectrogram is usually a very large matrix. To obtain sound features of a more manageable size, it is often passed through a Mel-scale filter bank, yielding the Mel spectrum.
The human ear's perception of pitch is roughly linear in the logarithm of the sound's fundamental frequency. On the Mel scale, if the Mel frequencies of two sounds differ by a factor of two, the pitches perceived by the human ear also differ by roughly a factor of two. At low frequencies, the Mel value changes rapidly with Hz; at high frequencies, it rises slowly and the slope of the curve is small. This reflects the fact that the human ear is sensitive to low-frequency tones and much less sensitive to high-frequency ones, and it is what inspired the Mel-scale filter bank.
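This compression of high frequencies can be seen with librosa's `hz_to_mel` helper (an illustrative sketch using the classic HTK Mel formula via `htk=True`):

```python
import librosa

# A 16x jump in frequency above 1 kHz maps to far less than 16x in Mel.
for hz in [100, 200, 1000, 2000, 8000, 16000]:
    print(f"{hz:>6} Hz -> {float(librosa.hz_to_mel(hz, htk=True)):8.1f} mel")
```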
The Mel-scale filter bank consists of a number of triangular filters that are dense with large amplitude at low frequencies, and sparse with small amplitude at high frequencies. This matches the observation that the higher the frequency, the duller the ear. The filter form described above is called a Mel filter bank with the same bank area. It is widely used in human-voice domains (speech recognition, speaker recognition), but it loses a lot of high-frequency information when applied to non-voice signals; in that case, a Mel filter bank with the same bank height may be preferred.
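The two filter-bank variants can be built with `librosa.filters.mel` (a sketch assuming a recent librosa with keyword arguments): the default normalization gives filters of equal area, while `norm=None` gives filters of equal height:

```python
import librosa

mel_area = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=40)               # same bank area (default)
mel_height = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=40, norm=None)  # same bank height
print(mel_area.shape)  # (n_mels, 1 + n_fft // 2) = (40, 1025)
```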
Mel spectrum implementation in librosa (simplified excerpt from the librosa source):
import numpy as np

# `_spectrogram` and `filters` are librosa internals
# (librosa.core.spectrum and librosa.filters).
def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
                   power=2.0, **kwargs):
    # Compute the power spectrogram from the waveform (unless S was passed in)
    S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)
    # Build a Mel filter bank and project the spectrogram onto it
    mel_basis = filters.mel(sr, n_fft, **kwargs)
    return np.dot(mel_basis, S)
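In practice, the public API is called as follows (a usage sketch; the file path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050)
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)  # log-compressed Mel spectrogram
print(S_mel.shape)  # (n_mels, number of frames)
```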
The Mel cepstrum is obtained by applying cepstral analysis (taking the logarithm, then a DCT) to the Mel spectrum; its coefficients are the MFCCs. In librosa:
# -- Mel spectrogram and MFCCs -- #
# Simplified excerpt from the librosa source; `power_to_db` and `scipy.fftpack`
# come from librosa's imports, and `melspectrogram` is defined above.
def mfcc(y=None, sr=22050, S=None, n_mfcc=20, dct_type=2, norm='ortho', **kwargs):
    if S is None:
        # Compute a log-power Mel spectrogram unless one was supplied
        S = power_to_db(melspectrogram(y=y, sr=sr, **kwargs))
    # Cepstral analysis: DCT along the frequency axis, keep the first n_mfcc coefficients
    return scipy.fftpack.dct(S, axis=0, type=dct_type, norm=norm)[:n_mfcc]
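Typical usage through the public API looks like this (a sketch; the file path is a placeholder):

```python
import librosa

y, sr = librosa.load("example.wav", sr=22050)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfccs.shape)  # (20, number of frames)
```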
4.5 Constant-Q transform
In music, pitches are organized into octaves of twelve-tone equal temperament, corresponding to the twelve semitones within an octave on the piano. The frequency ratio between adjacent semitones is 2^(1/12). Obviously, the same scale degree one octave higher has twice the frequency of the lower octave. Musical pitches are therefore distributed exponentially in frequency, while the spectrum obtained by the Fourier transform is linearly spaced; the frequency bins of the two do not correspond one to one, which causes errors in the estimation of some scale frequencies. Modern music analysis therefore generally uses a time-frequency transform whose bins follow the same exponential distribution: the Constant-Q Transform (CQT).
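A one-line check of the arithmetic: twelve equal-tempered semitone steps multiply the frequency by exactly two:

```python
semitone = 2 ** (1 / 12)              # frequency ratio between adjacent semitones
a4 = 440.0                            # A4
print(round(a4 * semitone ** 12, 2))  # 880.0, i.e. exactly one octave (double the frequency)
```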
The CQT corresponds to a filter bank whose center frequencies are exponentially distributed, with different bandwidths per filter but a constant ratio Q of center frequency to bandwidth. It differs from the Fourier transform in that the frequency axis of its spectrum is not linear but log2-based, and the window length of each filter varies with frequency for better resolution. Since the CQT bins follow the same distribution as the frequencies of the musical scale, the amplitude of a music signal at each note frequency can be read directly from its CQT spectrum.
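A minimal CQT sketch with librosa, using 12 bins per octave so that each bin lines up with one semitone (the file path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050)
C = librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz('C1'), n_bins=84, bins_per_octave=12)
C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)
print(C.shape)  # (84, number of frames): 7 octaves x 12 semitones
```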
References
A Tutorial on Deep Learning for Music Information Retrieval
STFT and Spectrograms, Mel Spectrum (Mel Bank Features) and Mel cepstrum (MFCCs)
“Spectrum Conversion Algorithm based on Music Recognition — CQT”
The librosa documentation
PS: For more technical content, follow our WeChat official account [xingzhe_ai] and join the discussion with Walker AI!