Is it a banger? Make your own dataset

These are some brief instructions on how to make the dataset I used in the article Is it a banger?.

I’m also going to assume you have downloaded the files in the GitHub repository.

Folder structure

We want to create a directory called data, with a subdirectory for each label, e.g.

data
├── label_1
├── label_2
├──    ·
├──    ·
├──    ·
└── label_k

In each label subdirectory, we have a text-file, where each line is the URL of a YouTube track or playlist with the relevant audio data.

For the article, we simply have

data
├── banger
│   └── URL_banger.txt
└── not_a_banger
    └── URL_not_a_banger.txt

You can see the URLs used in the article at URL_banger.txt and URL_not_a_banger.txt

We then need to run the following command in the directory is_it_a_banger/scripts/

./scripts/prepare_data_files.sh data 5

where 5 is the audio segment length in seconds. Note that this script requires ffmpeg and youtube-dl to work.

After running the script, you should have in each label subdirectory a bunch of 5 second .wav audio files.

We then need the following python to generate the pandas DataFrame from the generated audio files.

Imports

    import os
    import glob
    import librosa
    import numpy as np
    np.random.seed(1234)
    import pandas as pd

Get filenames and directories

    parent_dir = '../data'
    parent_dir_contents = [os.path.join(parent_dir, dirname) for dirname in os.listdir(parent_dir)]
    sub_dirs = [filename if os.path.isdir(filename) else None for filename in parent_dir_contents]
    sub_dirs = list(filter(None.__ne__, sub_dirs))
    labels_list = [os.path.relpath(path, parent_dir) for path in sub_dirs]

Extract Features

We’re going to use the librosa library for processing the audio signal. We’ll keep the raw audio samples and compute a log spectrogram.

Note that we clip samples at the end of the audio file, as the combination of running ffmpeg earlier and resampling to 22.05kHz means the audio sample arrays don’t have uniform length.

    def extract_features(file_name, sample_rate=22050, segment_time=5, samples_to_clip=500):
        audio, sample_rate = librosa.load(file_name, sr=sample_rate)
        end_idx = (sample_rate * segment_time) - samples_to_clip # remove some end samples as not strictly uniform size
        audio = audio[0:end_idx]
        log_specgram = librosa.logamplitude(np.abs(librosa.stft(audio))**2, ref_power=np.max)
        features = {"audio": audio, "log_specgram": log_specgram}
        return features

Turn labels into ‘one-hot’ vector encoding

    def one_hot_encode(label, labels_list):
        n_labels = len(labels_list)
        one_hot_encoded = np.zeros(n_labels)
        for idx, cmp in enumerate(labels_list):
            if label == cmp:
                one_hot_encoded[idx] = 1                     
        return one_hot_encoded

Trim file list

Only include a fraction of audio files for a given track to avoid training set 1) having too many highly correlated data points, and 2) having too large a file size.

    def trim_file_list(fnames_list, p_include=1.0):
        fnames_list = np.asarray(fnames_list)
        include = np.random.rand(*fnames_list.shape)
        fnames_list = fnames_list[include < p_include]
        return fnames_list

Build DataFrame from files

    def parse_audio_files(parent_dir, sub_dirs_list, labels_list, file_ext='*.wav', p_include=1.0,\
                          sample_rate=22050, segment_time=5, samples_to_clip=500):
        data = []
        index = []
        for label_idx, sub_dir in enumerate(sub_dirs_list):
            fnames_list = glob.glob(os.path.join(sub_dir, file_ext))
            fnames_list = trim_file_list(fnames_list, p_include=p_include)
            for fname in fnames_list:
                print("Processing " + os.path.basename(fname))
                features = extract_features(fname, segment_time=segment_time, \
                                            sample_rate=sample_rate, samples_to_clip=samples_to_clip)
                label = labels_list[label_idx]
                label_one_hot = one_hot_encode(label, labels_list)
                features['label'] = label
                features["label_one_hot"] = label_one_hot
                data.append(features)
                index.append(os.path.basename(fname))
        return pd.DataFrame(data, index=index)


    df = parse_audio_files(parent_dir, sub_dirs, labels_list, p_include=0.1, segment_time=5, samples_to_clip=1100)
    df = df.iloc[np.random.permutation(len(df))] # shuffle rows
    df.to_pickle(os.path.join(parent_dir, 'processed_dataset.pkl'))