The problem

Identify bird calls in soundscape recordings

The challenge in this competition is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. This is exactly the problem scientists face when they try to automate the remote monitoring of bird populations. This competition builds on the previous one by adding soundscapes from new locations, more bird species, richer metadata about the test set recordings, and soundscapes to the train set.

File introduction

train_short_audio – The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users. Where applicable, these files have been downsampled to 32 kHz to match the test set audio and converted to the ogg format. The training data should contain nearly all relevant files; the organizers expect no benefit from searching the original source for more.

train_soundscapes – Audio files that are closely comparable to the test set. They are all roughly ten minutes long and in the ogg format. The test set also contains soundscapes from the two recording locations represented here.

test_soundscapes – When you submit a notebook, the test_soundscapes directory is populated with approximately 80 recordings used for scoring. These are roughly 10 minutes long and in the ogg audio format. The file names include the date of the recording, which can be especially useful for identifying migratory birds.

This folder also contains text files with the name and approximate coordinates of each recording location, as well as a CSV with the set of dates on which the test set soundscapes were recorded.

test.csv – Only the first three rows are available for download; the full test.csv is in the hidden test set.

row_id: The ID of the row.

site: The ID of the recording site.

seconds: The second at which the time window ends.

audio_id: The ID of the audio file.

train_metadata.csv – Provides a wide range of metadata for the training data. The most directly relevant fields are:

primary_label: The code for the bird species. Details about a bird code can be viewed by appending it to the species page URL, as for the American Crow, for example.…

Recordist: The user who provided the recording.

latitude & longitude: The coordinates of the recording location. Some bird species may have local call "dialects", so you may want to seek geographic diversity in the training data.

date: Some bird calls, such as alarm calls, can be heard year-round, while others are limited to a specific season. You may want to seek temporal diversity in the training data.

filename: The name of the associated audio file.
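
One rough way to check the geographic and temporal diversity mentioned above is to count, per species, how many distinct months and coarse location cells the recordings cover. The sketch below is dependency-free and runs on a few made-up rows. The helper name diversity_by_species, the 1-degree grid, and the assumption that date is formatted as YYYY-MM-DD are illustrative choices, not part of the competition's data description.

```python
import math
from collections import defaultdict

def diversity_by_species(rows):
    """Count distinct recording months and 1-degree grid cells per species."""
    months = defaultdict(set)
    cells = defaultdict(set)
    for row in rows:
        label = row["primary_label"]
        # Assumes dates look like "YYYY-MM-DD"; the month is the middle field.
        months[label].add(int(row["date"].split("-")[1]))
        # Round coordinates down to a coarse 1-degree cell.
        cells[label].add((math.floor(row["latitude"]), math.floor(row["longitude"])))
    return {
        label: {"months": len(months[label]), "cells": len(cells[label])}
        for label in months
    }

# Hypothetical rows, for illustration only.
rows = [
    {"primary_label": "amecro", "date": "2019-05-02", "latitude": 42.4, "longitude": -76.5},
    {"primary_label": "amecro", "date": "2018-11-20", "latitude": 35.1, "longitude": -106.6},
    {"primary_label": "amecro", "date": "2020-05-15", "latitude": 42.9, "longitude": -76.1},
    {"primary_label": "norcar", "date": "2017-06-01", "latitude": 39.0, "longitude": -77.0},
]
summary = diversity_by_species(rows)
print(summary)
```

A species whose recordings fall into a single month or a single cell would be a candidate for closer inspection before training.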

train_soundscape_labels.csv –

row_id: The ID of the row.

site: The ID of the recording site.

seconds: The second at which the time window ends.

audio_id: The ID of the audio file.

birds: A space-delimited list of any bird songs present in the 5-second window. The label nocall means that no call occurred.

sample_submission.csv – A properly formed sample submission file. Only the first three rows are public; the rest will be made available to your notebook as part of the hidden test set.

birds: A space-delimited list of any bird songs present in the 5-second window. If there is no bird call, use the label nocall.
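
To make the row format concrete, here is a small sketch of how one submission row could be assembled. It assumes that row_id is the audio ID, the site, and the window end time joined by underscores, which is how the inference code later in this article builds it. The helper name make_row and the example values are hypothetical.

```python
def make_row(audio_id, site, end_second, species):
    """Build one (row_id, birds) pair for a 5-second window."""
    row_id = f"{audio_id}_{site}_{end_second}"
    # Several species are joined by single spaces; an empty list means no call.
    birds = " ".join(species) if species else "nocall"
    return row_id, birds

print(make_row(10534, "SSW", 5, []))
print(make_row(10534, "SSW", 10, ["amecro", "norcar"]))
```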

Data download address…

Understand the problem

This competition is a bird call classification task with a total of 397 classes. The approach taken here is to cut the time-domain waveform of each training recording into 5-second windows, compute a spectrogram for each window with a Fourier transform, and classify the resulting image with a neural network.

Note that the labels list every bird call present in each 5-second window, so the training audio should likewise be split into one image per 5 seconds.
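
The windowing can be sketched without any audio library by computing the sample boundaries of each full 5-second window. The function name window_bounds is hypothetical; the 32 kHz sample rate is the one stated for the competition audio, and dropping the incomplete tail mirrors what the conversion script in this article does.

```python
SAMPLE_RATE = 32000   # Hz, as stated for the competition audio
SIGNAL_LENGTH = 5     # seconds per window

def window_bounds(num_samples, sample_rate=SAMPLE_RATE, seconds=SIGNAL_LENGTH):
    """Return (start, end, end_time_in_seconds) for each full window."""
    step = sample_rate * seconds
    bounds = []
    for start in range(0, num_samples - step + 1, step):
        end = start + step
        bounds.append((start, end, end // sample_rate))
    return bounds

# A 12.5-second recording yields two full windows; the 2.5-second tail is dropped.
print(window_bounds(int(12.5 * SAMPLE_RATE)))
```

The third element of each tuple is the window end time, which corresponds to the seconds column in the label files.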


Audio data to image

The audio-to-image conversion mainly relies on Librosa. Each 5-second chunk is converted into a single-channel image of 224×224 pixels.

Installation command: pip install librosa or conda install -c conda-forge librosa
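
As a rough check of how a 5-second chunk ends up exactly 224 columns wide, the hop length can be derived from the sample rate and the target width. The sketch assumes librosa's default centered framing, where the number of frames is 1 + n_samples // hop_length; the constant names match the conversion script in this section.

```python
SAMPLE_RATE = 32000
SIGNAL_LENGTH = 5          # seconds
SPEC_SHAPE = (224, 224)    # height (mel bands) x width (time frames)

samples_per_chunk = SIGNAL_LENGTH * SAMPLE_RATE
# Spread the chunk over (width - 1) hops so that centered framing gives `width` frames.
hop_length = int(samples_per_chunk / (SPEC_SHAPE[1] - 1))
# With centered framing the frame count is 1 + n_samples // hop_length.
n_frames = 1 + samples_per_chunk // hop_length

print(hop_length, n_frames)  # 717 224
```

The height of 224 comes directly from n_mels, so no further resizing is needed.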

import os
import warnings
warnings.filterwarnings(action='ignore')
import pandas as pd
import librosa
import numpy as np
from sklearn.utils import shuffle
from PIL import Image
from tqdm import tqdm
# Global vars
RANDOM_SEED = 1337
SAMPLE_RATE = 32000
SIGNAL_LENGTH = 5  # seconds
SPEC_SHAPE = (224, 224)  # height x width
FMIN = 20
FMAX = 16000
# Code adapted from:
#
# Make sure to check out the entire notebook.
# Load metadata file
train = pd.read_csv('../input/birdclef-2021/train_metadata.csv')
# Count the recordings per species.
# No minimum-count threshold is applied here, so every species is kept.
birds_count = {}
counts = train.groupby('primary_label')['primary_label'].count()
for bird_species, count in counts.items():
    birds_count[bird_species] = count
most_represented_birds = [key for key, value in birds_count.items()]
TRAIN = train.query('primary_label in @most_represented_birds')
LABELS = sorted(TRAIN.primary_label.unique())
# Let's see how many species and samples we have left
print('NUMBER OF SPECIES IN TRAIN DATA:', len(LABELS))
print('NUMBER OF SAMPLES IN TRAIN DATA:', len(TRAIN))
print('LABELS:', most_represented_birds)
# Shuffle the training data
TRAIN = shuffle(TRAIN, random_state=RANDOM_SEED)
# Define a function that splits an audio file,
# extracts spectrograms and saves them in a working directory
def get_spectrograms(filepath, primary_label, output_dir):
    # Open the file with librosa (limited to the first 15 seconds)
    sig, rate = librosa.load(filepath, sr=SAMPLE_RATE, offset=None, duration=15)
    # Split signal into five second chunks
    sig_splits = []
    for i in range(0, len(sig), int(SIGNAL_LENGTH * SAMPLE_RATE)):
        split = sig[i:i + int(SIGNAL_LENGTH * SAMPLE_RATE)]
        # End of signal?
        if len(split) < int(SIGNAL_LENGTH * SAMPLE_RATE):
            break
        sig_splits.append(split)
    # Extract mel spectrograms for each audio chunk
    s_cnt = 0
    saved_samples = []
    for chunk in sig_splits:
        hop_length = int(SIGNAL_LENGTH * SAMPLE_RATE / (SPEC_SHAPE[1] - 1))
        mel_spec = librosa.feature.melspectrogram(y=chunk,
                                                  sr=SAMPLE_RATE,
                                                  n_fft=2048,
                                                  hop_length=hop_length,
                                                  n_mels=SPEC_SHAPE[0],
                                                  fmin=FMIN,
                                                  fmax=FMAX)
        mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
        # Normalize to the range [0, 1]
        mel_spec -= mel_spec.min()
        mel_spec /= mel_spec.max()
        # Save as image file
        save_dir = os.path.join(output_dir, primary_label)
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        save_path = os.path.join(save_dir, filepath.rsplit(os.sep, 1)[-1].rsplit('.', 1)[0] +
                                 '_' + str(s_cnt) + '.png')
        im = Image.fromarray(mel_spec * 255.0).convert("L")
        im.save(save_path)
        saved_samples.append(save_path)
        s_cnt += 1
    return saved_samples
print('FINAL NUMBER OF AUDIO FILES IN TRAINING DATA:', len(TRAIN))
# Parse audio files and extract training samples
input_dir = '../input/birdclef-2021/train_short_audio/'
output_dir = '../working/melspectrogram_dataset/'
samples = []
with tqdm(total=len(TRAIN)) as pbar:
    for idx, row in TRAIN.iterrows():
        pbar.update(1)
        if row.primary_label in most_represented_birds:
            audio_file_path = os.path.join(input_dir, row.primary_label, row.filename)
            samples += get_spectrograms(audio_file_path, row.primary_label, output_dir)
print(samples)
# Store the list of generated image paths for the next step
str_samples = ','.join(samples)
TRAIN_SPECS = shuffle(samples, random_state=RANDOM_SEED)
filename = open('a.txt', 'w')
filename.write(str_samples)
filename.close()

The following image is the result:

Split the training set and the validation set

Use train_test_split from sklearn.model_selection to split the generated images into a training set and a validation set in a 7:3 ratio.

import os
import warnings
warnings.filterwarnings(action='ignore')
from sklearn.model_selection import train_test_split
import shutil
# Read back the comma-separated list of image paths written in the previous step
filename = open('a.txt', 'r')
str_samples = filename.read()
filename.close()
str_samples = str_samples.replace("\\", "/")
samples = str_samples.split(',')
trainval_files, test_files = train_test_split(samples, test_size=0.3, random_state=42)
train_dir = '../working/train/'
val_dir = '../working/val/'
def copyfiles(file, dir):
    # The parent folder name is the class label; keep it in the target path
    filelist = file.split('/')
    filename = filelist[-1]
    lable = filelist[-2]
    cpfile = dir + "/" + lable
    if not os.path.exists(cpfile):
        os.makedirs(cpfile)
    cppath = cpfile + '/' + filename
    shutil.copy(file, cppath)
for file in trainval_files:
    copyfiles(file, train_dir)
for file in test_files:
    copyfiles(file, val_dir)
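
The two ideas in this step, a seeded 7:3 split and taking the class label from the parent folder name, can be illustrated without any files on disk. The sketch below uses only the standard library instead of train_test_split, so the exact assignment of files differs from sklearn's. The file paths and the helper names split_70_30 and label_of are hypothetical.

```python
import random

def split_70_30(paths, seed=42):
    """Shuffle with a fixed seed and cut the list at 70 percent."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

def label_of(path):
    """The class label is the name of the folder that contains the image."""
    return path.replace("\\", "/").split("/")[-2]

# Hypothetical spectrogram paths, for illustration only.
paths = [f"../working/melspectrogram_dataset/bird{i % 2}/clip{i}_0.png" for i in range(10)]
train_files, val_files = split_70_30(paths)
print(len(train_files), len(val_files))
print(label_of(paths[3]))
```

Because the split is random rather than stratified, a rare species can end up entirely in one of the two sets; passing the labels to the stratify parameter of train_test_split is one way to avoid that.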


EfficientNet-B3 was used as the pretrained model, and datasets.ImageFolder was used to load the datasets. After about 20 epochs, the accuracy reached 95 percent.

import torch.optim as optim
import torch
import torch.nn as nn
import torch.nn.parallel
from torch.autograd import Variable
import torch.optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from efficientnet_pytorch import EfficientNet
import os
import time
# Set the hyperparameters
BATCH_SIZE = 16  # assumed value; BATCH_SIZE is used below but not defined in the source
momentum = 0.9
class_num = 397
EPOCHS = 500
lr = 0.001
use_gpu = True
net_name = 'efficientnet-b3'
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Data preprocessing
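transform = transforms.Compose([
    # ToTensor is assumed here; Normalize operates on tensors, not PIL images
    transforms.ToTensor(),
    # The mean/std values are missing in the source; 0.5 per channel is an assumed placeholder
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
dataset_train = datasets.ImageFolder('../working/train', transform)
dataset_val = datasets.ImageFolder('../working/val', transform)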
# Label of the corresponding folder
dset_sizes = len(dataset_train)
dset_sizes_val = len(dataset_val)
print("dset_sizes_val Length:", dset_sizes_val)
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset_val, batch_size=BATCH_SIZE, shuffle=True)
def exp_lr_scheduler(optimizer, epoch, init_lr=0.001, lr_decay_epoch=10):
    """Decay learning rate by a factor of 0.8 every lr_decay_epoch epochs."""
    lr = init_lr * (0.8 ** (epoch // lr_decay_epoch))
    print('LR is set to {}'.format(lr))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return optimizer
def train_model(model_ft, criterion, optimizer, lr_scheduler, num_epochs=50) :
    train_loss = []
    since = time.time()
    best_model_wts = model_ft.state_dict()
    best_acc = 0.0
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        optimizer = lr_scheduler(optimizer, epoch)
        running_loss = 0.0
        running_corrects = 0
        count = 0
        for data in train_loader:
            inputs, labels = data
            labels = torch.squeeze(labels.type(torch.LongTensor))
            if use_gpu:
                inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
            else:
                inputs, labels = Variable(inputs), Variable(labels)
            outputs = model_ft(inputs)
            loss = criterion(outputs, labels)
            _, preds = torch.max(outputs.data, 1)
            # Backpropagate and update the weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            count += 1
            if count % 30 == 0 or outputs.size()[0] < BATCH_SIZE:
                print('Epoch:{}: loss:{:.3f}'.format(epoch, loss.item()))
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)
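        epoch_loss = running_loss / dset_sizes
        epoch_acc = running_corrects.double() / dset_sizes
        train_loss.append(epoch_loss)
        print('Loss: {:.4f} Acc: {:.4f}'.format(
            epoch_loss, epoch_acc))
        if epoch_acc > best_acc:
            best_acc = epoch_acc
            # Clone the tensors so later updates do not overwrite this snapshot
            best_model_wts = {k: v.clone() for k, v in model_ft.state_dict().items()}
    # save best model
    save_dir = 'model'
    os.makedirs(save_dir, exist_ok=True)
    model_out_path = save_dir + "/" + net_name + '.pth'
    model_ft.load_state_dict(best_model_wts)
    # Save the whole model object so that torch.load can restore it at inference time
    torch.save(model_ft, model_out_path)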
    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    return train_loss, best_model_wts
model_ft = EfficientNet.from_pretrained('efficientnet-b3')
num_ftrs = model_ft._fc.in_features
model_ft._fc = nn.Linear(num_ftrs, class_num)
criterion = nn.CrossEntropyLoss()
if use_gpu:
    model_ft = model_ft.cuda()
    criterion = criterion.cuda()
optimizer = optim.Adam((model_ft.parameters()), lr=lr)
train_loss, best_model_wts = train_model(model_ft, criterion, optimizer, exp_lr_scheduler, num_epochs=EPOCHS)
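
The step decay implemented by exp_lr_scheduler is easy to check in isolation. The sketch below reproduces only its arithmetic, with the same defaults as the code above; the function name decayed_lr is hypothetical.

```python
def decayed_lr(epoch, init_lr=0.001, decay=0.8, lr_decay_epoch=10):
    """Step decay: multiply the rate by `decay` once every `lr_decay_epoch` epochs."""
    return init_lr * (decay ** (epoch // lr_decay_epoch))

for epoch in (0, 9, 10, 25):
    print(epoch, decayed_lr(epoch))
```

The rate stays at 0.001 for epochs 0 to 9, drops to about 0.0008 at epoch 10, and is about 0.00064 by epoch 25.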



The test set is split into 5-second segments, which are then converted into images. The images produced here have a single channel, whereas the images loaded by datasets.ImageFolder during training had three channels. Because the model expects three-channel input, the test images are passed through transforms.Lambda(lambda x: x.repeat(3, 1, 1)), which repeats the single channel three times. The rest of the processing follows the same logic as for the training set.
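
The effect of repeating the channel can be shown on a tiny array. This sketch uses NumPy's np.tile as a stand-in for the tensor method x.repeat(3, 1, 1); both copy a channel-first array of shape (1, height, width) three times along the first axis. The array contents are arbitrary.

```python
import numpy as np

# A single-channel "image" in channel-first layout: (1, height, width).
gray = np.arange(6, dtype=np.float32).reshape(1, 2, 3)

# Repeat the channel axis three times, analogous to x.repeat(3, 1, 1) on a tensor.
rgb_like = np.tile(gray, (3, 1, 1))

print(gray.shape, rgb_like.shape)
```

All three channels hold identical values, so no information is added; the step only makes the shape compatible with a network that expects three input channels.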

import os
import pandas as pd
import torch
import librosa
import numpy as np
# Global vars
SAMPLE_RATE = 32000  # same value as in the training script; used below but missing here
SIGNAL_LENGTH = 5  # seconds
SPEC_SHAPE = (224, 224)  # height x width
FMIN = 20
FMAX = 16000
# Load metadata file
train = pd.read_csv('../input/birdclef-2021/train_metadata.csv')
# Second, assume that birds with the most training samples are also the most common
# A species needs at least 200 recordings with a rating above 4 to be considered common
birds_count = {}
for bird_species, count in zip(train.primary_label.unique(),
                               train.groupby('primary_label') ['primary_label'].count().values):
    birds_count[bird_species] = count
most_represented_birds = [key for key, value in birds_count.items()]
TRAIN = train.query('primary_label in @most_represented_birds')
LABELS = sorted(TRAIN.primary_label.unique())
# Let's see how many species and samples we have left
print('LABELS:', most_represented_birds)
# First, get a list of soundscape files to process.
# We'll use the test_soundscape directory if it contains "ogg" files
# (which it only does when submitting the notebook),
# otherwise we'll use the train_soundscape folder to make predictions.
def list_files(path):
    return [os.path.join(path, f) for f in os.listdir(path) if f.rsplit('.', 1)[-1] in ['ogg']]
test_audio = list_files('../input/birdclef-2021/test_soundscapes')
if len(test_audio) == 0:
    test_audio = list_files('../input/birdclef-2021/train_soundscapes')
print('{} FILES IN TEST SET.'.format(len(test_audio)))
path = test_audio[0]
data = path.split(os.sep)[-1].rsplit('.', 1)[0].split('_')
print('FILEPATH:', path)
print('ID: {}, SITE: {}, DATE: {}'.format(data[0], data[1], data[2]))
# This is where we will store our results
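pred = {'row_id': [], 'birds': []}
# torch.load expects the file to contain the whole pickled model object.
# Recent PyTorch versions may need weights_only=False to unpickle a full model.
model = torch.load("./model/efficientnet-b3.pth")
model.eval()
import torchvision.transforms as transforms
from PIL import Image
transform = transforms.Compose([
    # ToTensor is assumed here; repeat() needs a tensor, not a PIL image
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.repeat(3, 1, 1)),
    # The mean/std values are missing in the source; 0.5 per channel is an assumed
    # placeholder and must match whatever was used during training
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")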
# Analyze each soundscape recording
# Store results so that we can analyze them later
data = {'row_id': [], 'birds': []}
for path in test_audio:
    path = path.replace("\\", "/")
    # Open file with Librosa
    # Split file into 5-second chunks
    # Extract spectrogram for each chunk
    # Predict on spectrogram
    # Get row_id and birds and store result
    # (maybe using a post-filter based on location)
    # The above steps are just placeholders, we will use mock predictions.
    # Our "model" will predict "nocall" for each spectrogram.
    sig, rate = librosa.load(path, sr=SAMPLE_RATE)
    # Split signal into 5-second chunks
    # Just like we did before (well, this could actually be a separate function)
    sig_splits = []
    for i in range(0, len(sig), int(SIGNAL_LENGTH * SAMPLE_RATE)):
        split = sig[i:i + int(SIGNAL_LENGTH * SAMPLE_RATE)]
        # End of signal?
        if len(split) < int(SIGNAL_LENGTH * SAMPLE_RATE):
            break
        sig_splits.append(split)
    # Get the spectrograms and run inference on each of them
    # This should be the exact same process as we used to
    # generate training samples!
    seconds, scnt = 0, 0
    for chunk in sig_splits:
        # Keep track of the end time of each chunk
        seconds += 5
        # Get the spectrogram
        hop_length = int(SIGNAL_LENGTH * SAMPLE_RATE / (SPEC_SHAPE[1] - 1))
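        # Same parameters as in the training script
        mel_spec = librosa.feature.melspectrogram(y=chunk,
                                                  sr=SAMPLE_RATE,
                                                  n_fft=2048,
                                                  hop_length=hop_length,
                                                  n_mels=SPEC_SHAPE[0],
                                                  fmin=FMIN,
                                                  fmax=FMAX)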
        mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
        # Normalize to match the value range we used during training.
        # That's something you should always double check!
        mel_spec -= mel_spec.min()
        mel_spec /= mel_spec.max()
        im = Image.fromarray(mel_spec * 255.0).convert("L")
        im = transform(im)
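        # Add a batch dimension and move the tensor to the model's device;
        # without this step the forward pass raises an error.
        # (The right-hand side is missing in the source; this line is a reconstruction.)
        im = im.unsqueeze(0).to(device)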
        # Predict
        p = model(im)[0]
        # Get highest scoring species
        idx = p.argmax()
        species = LABELS[idx]
        score = p[idx]
        # Prepare submission entry
        spath = path.split('/')[-1].rsplit('_', 1)[0]
        data['row_id'].append(spath + '_' + str(seconds))
        # Decide if it's a "nocall" or a species by applying a threshold
        if score > 0.75:
            data['birds'].append(species)
            scnt += 1
        else:
            data['birds'].append('nocall')
    print('SOUNDSCAPE ANALYSIS DONE. FOUND {} BIRDS.'.format(scnt))
# Make a new data frame and look at a few "results"
results = pd.DataFrame(data, columns=['row_id', 'birds'])
# Convert our results to csv
results.to_csv("submission.csv", index=False)
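
A note on the 0.75 threshold: a network trained with nn.CrossEntropyLoss outputs raw scores (logits) rather than probabilities, so one option is to apply a softmax before comparing against the threshold. The sketch below shows that idea with the standard library only. It is not what the code above does, and the function names softmax and decide as well as the label list are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decide(scores, labels, threshold=0.75):
    """Return the top label if its probability exceeds the threshold, else 'nocall'."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] > threshold else "nocall"

labels = ["amecro", "norcar", "rewbla"]  # hypothetical label list
print(decide([4.0, 0.5, 0.2], labels))   # one clearly dominant score
print(decide([1.0, 0.9, 0.8], labels))   # no score stands out
```

With a PyTorch model, the equivalent step would be torch.softmax(model(im), dim=1) before taking the maximum.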
