A lot of progress has been made in deep learning for the visual and language domains; this article shows, step by step, what kinds of models and processes to use when working with audio data.

By Dimitre Oliveira

Original link: pub.towardsai.net/a-gentle-in…

Photo: www.tensorflow.org/tutorials/a…

There have been a lot of recent advances in deep learning in the visual and language areas, and it is intuitive to see why: CNNs work well on images because of the local correlation of pixels, and sequential models like RNNs or Transformers work well on language because it is sequential data. But what about audio? What kinds of models and processes do we use when working with audio data?

In this article, you will learn how to approach a simple audio classification problem. You will learn some common and effective methods, and the TensorFlow code to implement them.

Note: The code presented in this article is based on my work for the Rainforest Connection Species Audio Detection Kaggle competition, but for demonstration purposes I will use the Speech Commands dataset.

Waveform figure

We usually have audio files in the ".wav" format. They are commonly called waveforms: a time series holding the amplitude of the signal at each point in time. If we visualize one of these waveform samples, it looks like this:
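If you want to produce such a plot yourself, a rough sketch looks like the following (the file path is just an illustrative example taken from the dataset used later in this article):

import tensorflow as tf
import matplotlib.pyplot as plt

# Illustrative path only; point this at any .wav file you have locally.
audio_binary = tf.io.read_file('data/mini_speech_commands/up/50f55535_nohash_0.wav')
# decode_wav returns the samples as float32 in [-1, 1] plus the sample rate.
waveform, sample_rate = tf.audio.decode_wav(audio_binary)
waveform = tf.squeeze(waveform, axis=-1)  # drop the channel dimension

plt.plot(waveform.numpy())
plt.xlabel('Sample index')
plt.ylabel('Amplitude')
plt.show()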

Intuitively, one might consider modelling this data with some kind of RNN, treating it as a regular time series (like a stock price prediction). This can in fact be done, but since we are working with audio signals, a more appropriate choice is to convert the waveform samples into spectrograms.


Spectrograms

A spectrogram is a graphical representation of a waveform signal that shows its frequency intensity over time. It is useful when you want to evaluate how the frequency content of a signal is distributed over time. The figure below is the spectrogram representation of the waveform image above.

The X-axis is the sampling time, and the Y-axis is the frequency


The Speech Commands use case

To make this tutorial easier, we will use the "Speech Commands" dataset, which contains one-second audio clips of spoken words such as "down", "go", "left", "no", "right", "stop", "up", and "yes".
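If you want to follow along, the mini version of the dataset used by the TensorFlow tutorial referenced above can be downloaded roughly like this (the URL and target directory are taken from that tutorial and may change over time):

import pathlib
import tensorflow as tf

data_dir = pathlib.Path('data/mini_speech_commands')
if not data_dir.exists():
  # URL from the TensorFlow "Simple audio recognition" tutorial.
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin='http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip',
      extract=True,
      cache_dir='.',
      cache_subdir='data')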


Audio processing with TensorFlow

Now that we have an idea of how deep learning models can process audio data, we can move on to the code implementation. Our pipeline will follow the simple workflow described below:

Simple audio processing diagram

It is worth noting that in our use case the data is loaded directly from ".wav" files at step 1, and that step 3 is optional, since the audio files are only one second long each. If the files were longer, cropping the audio could be a good idea, also to keep all samples at a fixed length.
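Before step 1 we need the list of .wav file paths. A minimal sketch of how this could be gathered and split, using the data_dir directory from the download step above (the 6400/800/800 split follows the TensorFlow tutorial and is otherwise arbitrary):

# Every sample lives in a sub-folder named after its command,
# e.g. data/mini_speech_commands/up/*.wav
filenames = tf.io.gfile.glob(str(data_dir) + '/*/*.wav')
filenames = tf.random.shuffle(filenames)

train_files = filenames[:6400]
val_files = filenames[6400:7200]
test_files = filenames[7200:]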

Load the data

def load_dataset(filenames):
  # Build a tf.data.Dataset whose elements are the individual file paths.
  dataset = tf.data.Dataset.from_tensor_slices(filenames)
  return dataset

The load_dataset function is responsible for loading the .wav files and converting them into a TensorFlow dataset.

Extract waveform and label

import os
import numpy as np
import tensorflow as tf

commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']

def decode_audio(audio_binary):
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)

def get_label(filename):
  # The label is the parent folder name, encoded as its index in `commands`.
  label = tf.strings.split(filename, os.path.sep)[-2]
  label = tf.argmax(label == commands)
  return label

def get_waveform_and_label(filename):
  label = get_label(filename)
  audio_binary = tf.io.read_file(filename)
  waveform = decode_audio(audio_binary)
  return waveform, label

After loading the .wav files, you can decode them with tf.audio.decode_wav, which turns the .wav data into a float tensor. Next, we need to extract the label for each file. In this particular use case we can retrieve the label from each sample's file path, and then one-hot encode it.
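As a quick sanity check, the functions above can be mapped over the dataset and one element inspected (a sketch, assuming the train_files split from earlier):

files_ds = load_dataset(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=tf.data.AUTOTUNE)

# The waveform is a 1-D float tensor, the label an integer index into `commands`.
for waveform, label in waveform_ds.take(1):
  print(waveform.shape, label.numpy(), commands[label.numpy()])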

An example

First, we get a file path that looks like this:

"data/mini_speech_commands/up/50f55535_nohash_0.wav"

We then extract the parent folder name from the path (the component right before the file name); in this case the label is "up". Finally, the label is one-hot encoded using the commands list.

Commands: ['up' 'down' 'go' 'stop' 'left' 'no' 'yes' 'right']
Label: 'up'

After one-hot encoding:
Label: [1, 0, 0, 0, 0, 0, 0, 0]
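Note that get_label as written returns the integer index from tf.argmax rather than a one-hot vector. Since the model below is trained with categorical cross-entropy, one way to get the one-hot form shown above is a small extra mapping step (a sketch, not part of the original pipeline):

def one_hot_label(waveform, label):
  # Turn the integer index (0 for 'up' in the example above) into the
  # one-hot vector [1, 0, 0, 0, 0, 0, 0, 0].
  return waveform, tf.one_hot(label, depth=len(commands))

# This could be applied right after get_waveform_and_label, e.g. dataset.map(one_hot_label).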

Convert the waveforms into spectrograms

The next step is to convert the waveforms into spectrograms. Fortunately, TensorFlow has a function that does this: tf.signal.stft applies the short-time Fourier transform (STFT) to convert the audio into the time-frequency domain. Note that tf.signal.stft has parameters, such as frame_length and frame_step, that affect the generated spectrogram. I won't go into the details of how to tune them, but you can learn more in this video: www.coursera.org/lecture/aud…

def get_spectrogram(waveform, padding=False, min_padding=48000):
  waveform = tf.cast(waveform, tf.float32)
  # Short-time Fourier transform; the magnitude is the spectrogram.
  spectrogram = tf.signal.stft(waveform, frame_length=2048, frame_step=512, fft_length=2048)
  spectrogram = tf.abs(spectrogram)
  return spectrogram

def get_spectrogram_tf(waveform, label):
  spectrogram = get_spectrogram(waveform)
  spectrogram = tf.expand_dims(spectrogram, axis=-1)  # add a channel dimension
  return spectrogram, label
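To get a feel for what these parameters produce, here is a rough shape check on a one-second clip at 16 kHz (the sample rate of the Speech Commands audio):

waveform = tf.zeros([16000])  # one second of silence at 16 kHz
spectrogram = get_spectrogram(waveform)
print(spectrogram.shape)
# (28, 1025): 1 + (16000 - 2048) // 512 = 28 time frames,
# and fft_length // 2 + 1 = 1025 frequency bins.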

Convert the spectrogram to an RGB image

The final step is to convert the spectrogram into an RGB image. This step is optional, but here we will use a model pre-trained on the ImageNet dataset, which requires 3-channel image inputs; otherwise you could keep the spectrogram as a single channel.

def prepare_sample(spectrogram, label):
  spectrogram = tf.image.resize(spectrogram, [HEIGHT, WIDTH])
  spectrogram = tf.image.grayscale_to_rgb(spectrogram)
  return spectrogram, label
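Since the spectrogram has a single channel, tf.image.grayscale_to_rgb simply replicates that channel three times. A quick shape check (128x128 is the size used below):

spec = tf.zeros([128, 128, 1])
rgb = tf.image.grayscale_to_rgb(spec)
print(rgb.shape)  # (128, 128, 3): the single channel repeated across the three RGB channels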

Put it all together

HEIGHT, WIDTH = 128, 128
AUTO = tf.data.AUTOTUNE

def get_dataset(filenames, batch_size=32):
  dataset = load_dataset(filenames)
  dataset = dataset.map(get_waveform_and_label, num_parallel_calls=AUTO)
  dataset = dataset.map(get_spectrogram_tf, num_parallel_calls=AUTO)
  dataset = dataset.map(prepare_sample, num_parallel_calls=AUTO)
  dataset = dataset.shuffle(256)
  dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  dataset = dataset.prefetch(AUTO)
  return dataset

Putting everything together, the get_dataset function takes the filenames as input and, after performing all the steps described above, returns a TensorFlow dataset of RGB spectrogram images and their labels.
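Using it could look like this (a sketch; train_files and val_files come from the earlier split, and the batch size is arbitrary):

train_dataset = get_dataset(train_files, batch_size=32)
val_dataset = get_dataset(val_files, batch_size=32)

# Each batch is an (images, labels) pair; the images have shape (32, HEIGHT, WIDTH, 3).
for images, labels in train_dataset.take(1):
  print(images.shape, labels.shape)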


Model

import efficientnet.tfkeras as efn  # assuming the `efficientnet` pip package for the `efn` alias
from tensorflow.keras import layers as L
from tensorflow.keras.models import Model

def model_fn(input_shape, N_CLASSES):
  inputs = L.Input(shape=input_shape, name='input_audio')
  base_model = efn.EfficientNetB0(input_tensor=inputs,
                                  include_top=False,
                                  weights='imagenet')

  x = L.GlobalAveragePooling2D()(base_model.output)
  x = L.Dropout(.5)(x)
  output = L.Dense(N_CLASSES, activation='softmax', name='output')(x)

  model = Model(inputs=inputs, outputs=output)

  return model

Our model has an EfficientNetB0 backbone; on top of it we add a GlobalAveragePooling2D layer, then a Dropout layer, and finally a Dense layer that performs the actual multi-class classification.

EfficientNetB0 can be a good baseline for a small dataset: even though it is a fast and lightweight model, it still achieves decent accuracy.


Training

from tensorflow.keras import losses, metrics

# CHANNELS, N_CLASSES and FILENAMES are assumed to be defined elsewhere
# (e.g. 3 channels, the number of commands, and the training file paths).
model = model_fn((None, None, CHANNELS), N_CLASSES)

model.compile(optimizer=tf.optimizers.Adam(),
              loss=losses.CategoricalCrossentropy(),
              metrics=[metrics.CategoricalAccuracy()])

model.fit(x=get_dataset(FILENAMES),
          steps_per_epoch=100,
          epochs=10)

The training code is pretty standard for a Keras model, so you probably won't find anything new here.
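Once the model is trained, inference on a single clip is just the same pre-processing followed by model.predict. A rough sketch (the file path is the illustrative one from earlier):

sample_file = 'data/mini_speech_commands/up/50f55535_nohash_0.wav'

waveform, _ = get_waveform_and_label(sample_file)
spectrogram, _ = get_spectrogram_tf(waveform, 0)   # the dummy label is ignored here
spectrogram, _ = prepare_sample(spectrogram, 0)

probs = model.predict(tf.expand_dims(spectrogram, axis=0))[0]
print(commands[np.argmax(probs)])  # predicted command, e.g. 'up'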


Conclusion

You should now have a better idea of the workflow for applying deep learning to audio files. While this is not the only way to do it, it offers a good trade-off between ease of use and performance. If you are going to model audio, you may also want to consider other promising approaches, such as Transformers.

As an additional pre-processing step, cropping or padding the waveforms can be a good idea if your samples have different lengths, or if only a small part of each clip is relevant. You can find code for this in the references below.
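A minimal sketch of such cropping and zero-padding, assuming a fixed one-second target length at 16 kHz:

def pad_waveform(waveform, target_len=16000):
  # Crop clips longer than target_len, then zero-pad shorter ones
  # so every waveform has exactly target_len samples.
  waveform = waveform[:target_len]
  zero_padding = tf.zeros([target_len] - tf.shape(waveform), dtype=waveform.dtype)
  return tf.concat([waveform, zero_padding], axis=0)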

References

– Simple audio recognition: Recognizing keywords
  www.tensorflow.org/tutorials/a…

– Rainforest: Audio classification TensorFlow starter
  www.kaggle.com/dimitreoliv…

– Rainforest: Audio classification TF Improved
  www.kaggle.com/dimitreoliv…
