Is it a banger? Audio classification in TensorFlow
In Parks and Recreation Season 6 Episode 18 “Prom”, Tom Haverford famously tells us about his test of whether a song is a “banger” or not. There are many questions in this test: “does it feature any acoustic instruments?”, “how many drops?”, “how dope are the drops?” etc.
I think we can make his test even more rigorous: why don’t we use a deep neural network, trained on examples of bangers (and non-bangers), to tell us if a song is a banger or not?
In this Jupyter notebook, we’re going to construct, train and test this neural network.
Initial Environment
import matplotlib.pyplot as plt
import librosa.display
import numpy as np
np.random.seed(1337)
import pandas as pd
%matplotlib inline
The Dataset
df = pd.read_pickle("../data/processed_dataset.pkl")
This data set was generated using the instructions in this notebook. Let’s take a look.
df[:9]
audio  label  label_one_hot  log_specgram  

Cliff Richard  Greatest Hits 19581962 (Not Now Music) [Full Album]_0415.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  not_a_banger  [1.0, 0.0]  [[80.0, 54.1524, 35.3907, 33.0633, 39.626... 
Selected New Year Mix_0121.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  banger  [0.0, 1.0]  [[67.3112, 51.5708, 53.4622, 72.6484, 80.... 
Rihanna  Stay ft. Mikky Ekko_0036.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  not_a_banger  [1.0, 0.0]  [[64.2413, 50.564, 57.0061, 37.2135, 37.0... 
The Lumineers  Slow It Down (Live on KEXP)_0049.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  not_a_banger  [1.0, 0.0]  [[80.0, 73.9336, 59.1297, 49.4456, 45.314... 
Passenger _ Let Her Go (Official Video)_0016.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  not_a_banger  [1.0, 0.0]  [[80.0, 79.4122, 63.2455, 56.2228, 56.834... 
Low Steppa  Vocal Loop (Premiere)_0032.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  banger  [0.0, 1.0]  [[65.6515, 31.3697, 21.9142, 25.2813, 61.... 
Stardust  Music Sounds Better (Mistrix Dub) (Free Download)_0049.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  banger  [0.0, 1.0]  [[80.0, 80.0, 78.6725, 79.2538, 80.0, 80... 
Ed Sheeran  Thinking Out Loud [Official Video]_0033.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  not_a_banger  [1.0, 0.0]  [[80.0, 65.3586, 53.2574, 44.407, 50.1324... 
Best Of 2017 Tech House Yearmix_0145.wav  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  banger  [0.0, 1.0]  [[80.0, 57.0543, 39.8118, 61.7071, 38.360... 
We can see in the first column the names of the tracks (in .wav format) with a numeric identifier at the end. Each track has been clipped into 5-second segments (at a 22.05 kHz sample rate) and the identifier tells us which segment we have.
The audio column is a numpy array with the audio sample values.
The label column tells us if the given file is labelled as a banger or not. For the most part, the labels are obvious to us (but not the machine): Ed Sheeran, The Lumineers, Cliff Richard… clearly NOT A BANGER. Various tech house mixes and artists: BANGERZ.
The label_one_hot column gives us the vectorised, “one-hot” encoding of the label: [0.0, 1.0] == banger, [1.0, 0.0] == not_a_banger.
The final column, log_specgram, is the most interesting, and it supplies the input features for the neural net. It holds the log spectrogram of the audio signal: the squared magnitude of the Short-Time Fourier Transform (STFT), expressed on a logarithmic (decibel) scale. This gives us the frequency content of the signal within short time windows.
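To make that definition concrete, here is a minimal NumPy sketch of a log spectrogram. The features in this notebook were generated with librosa, so the function name log_spectrogram, the frame size n_fft, the hop length and the 80 dB floor below are all assumptions chosen for illustration, not the settings used to build the dataset.

```python
import numpy as np

def log_spectrogram(signal, n_fft=2048, hop_length=512, top_db=80.0):
    """Hypothetical helper: squared-magnitude STFT in dB relative to its peak."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Slice the signal into overlapping, windowed frames
    frames = np.stack([
        signal[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; transpose so rows are frequency bins, columns are time
    stft = np.fft.rfft(frames, axis=1).T
    power = np.abs(stft) ** 2
    # Convert to decibels relative to the loudest bin, with a floor at -top_db
    reference = max(power.max(), 1e-10)
    log_power = 10.0 * np.log10(np.maximum(power, 1e-10) / reference)
    return np.maximum(log_power, -top_db)

# Example: one second of a 440 Hz sine at the dataset's 22.05 kHz sample rate
sr = 22050
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

Each column of spec is the spectrum of one short window, so a steady tone shows up as a horizontal line and a kick drum as a regular series of low-frequency bursts.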
We’re going to use a common image classification tool, a ConvNet, on the log spectrogram image to do our classification.
Let’s take a closer look at the dataset.
bangerz = df.loc[df['label'] == "banger"]
clangerz = df.loc[df['label'] == "not_a_banger"]
num_bangerz = bangerz.index.size
num_clangerz = clangerz.index.size
print("Dataset has %g audio clips." % df.index.size)
print( "This is split between %g \"banger\"s and %g \"not_a_banger\"s" % (num_bangerz, num_clangerz) )
Dataset has 875 audio clips.
This is split between 422 "banger"s and 453 "not_a_banger"s
So we are split more or less 50:50 between bangers and clangers. Now we want to look at the audio signal and log spectrogram for some examples.
def plot_waveforms(df, idx):
    audio = df.iloc[idx].audio
    log_specgram = df.iloc[idx].log_specgram
    filename = df.iloc[idx].name
    label = df.iloc[idx].label
    # audio is np.array holding sample values, log_specgram is 2-dim np.array
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    # waveplot exists in librosa < 0.10; newer releases provide waveshow instead
    librosa.display.waveplot(audio, sr=22050)
    plt.subplot(1, 2, 2)
    librosa.display.specshow(log_specgram, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.suptitle(filename + ", label = \"" + label + "\".")
[plot_waveforms(bangerz, i) for i in [0, 1, 2]];
For the first two bangers we can see a sharp, rhythmic, percussive signal, focused on the low end of the frequency spectrum. This is the kick drum!
In the third banger, we are likely in a section where the producer has used a high-pass filter, since there is virtually no low-frequency content here, yet we can still see some regularity from the kick in the higher end of the spectrum.
Now for the clangers!
[plot_waveforms(clangerz, i) for i in [0, 1, 2]];
Here we see a less percussive, less rhythmic signal across the board, with far less low-frequency content.
Hopefully our ConvNet will be able to use this to its advantage.
Establish baseline
We can calculate a baseline classification accuracy, if we just choose the majority label in the dataset for any example. This is the accuracy we need to beat.
Ideally, we should run “Haverford’s algorithm” and compare, but I really didn’t feel like doing this for 875 examples! Volunteers welcome…
naive_accuracy = max(num_bangerz, num_clangerz) / float(df.index.size)
print("This is the accuracy if we always guess max{#banger, #not_a_banger}: %.3f" % naive_accuracy)
This is the accuracy if we always guess max{#banger, #not_a_banger}: 0.518
Form the training and testing data sets
Let’s set aside 80% of the data for training and 20% for testing.
train_frac = 0.8
def split_train_test(df, train_frac=0.8):
    # Draw one uniform random number per clip; clips below the threshold go to training
    include = np.random.rand(*df.index.shape)
    is_train = include < train_frac
    train_data = df[is_train]
    test_data = df[~is_train]
    return train_data, test_data
train_data, test_data = split_train_test(df, train_frac)
print( "Training data has %g clips, test data has %g clips." % (train_data.index.size, test_data.index.size))
Training data has 711 clips, test data has 164 clips.
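Because each track is cut into several 5-second segments, a per-clip random split can place segments of the same track in both the training and test sets. A possible alternative, sketched here under assumptions, is to split by track instead. The helpers track_of and split_by_track are hypothetical names, and the sketch assumes every clip name ends in an underscore followed by the segment number, as in the table above.

```python
import numpy as np

def track_of(clip_name):
    # "Some Track_0049.wav" -> "Some Track" (assumes the "_NNNN.wav" suffix)
    return clip_name.rsplit("_", 1)[0]

def split_by_track(clip_names, train_frac=0.8, seed=1337):
    """Hypothetical helper: boolean mask that keeps whole tracks together."""
    rng = np.random.RandomState(seed)
    tracks = sorted({track_of(name) for name in clip_names})
    rng.shuffle(tracks)
    n_train = int(round(train_frac * len(tracks)))
    train_tracks = set(tracks[:n_train])
    return np.array([track_of(name) in train_tracks for name in clip_names])

# Example with made-up clip names: 10 tracks, 3 segments each
clip_names = ["Track %d_%04d.wav" % (t, s) for t in range(10) for s in range(3)]
is_train = split_by_track(clip_names)
```

With the notebook's DataFrame this mask could be applied as df[is_train] and df[~is_train], where clip_names would be list(df.index).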
TensorFlow
Having prepped the training and test datasets, we’re ready to set up our ConvNet. We will closely follow the structure of the TensorFlow deep MNIST example neural net with some small modifications: if it ain’t broke, don’t fix it!
The deep network will look something like this:
We feed the image of the log spectrogram into a convolutional layer, conv1, followed by a max-pooling layer, hpool1, which reduces the size of the image. We then feed this image into another convolutional layer, conv2, followed by another max-pooling layer, hpool2, which reduces the image size further. We then have two consecutive fully connected layers, fc1 and fc2, between which we use dropout (this randomly zeroes the outputs of some neurons at each training step to mitigate overfitting). Finally, we classify.
We’re going to use the Adam (adaptive moment estimation) optimizer, with a cross-entropy cost function.
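As a rough illustration of the cost function, here is a NumPy sketch of softmax cross-entropy computed directly from logits. The function name softmax_cross_entropy is an assumption for this example. It mirrors the idea behind the TensorFlow op used later (subtracting the largest logit before exponentiating so nothing overflows), but it is not TensorFlow's implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Per-example cross-entropy between softmax(logits) and one-hot labels."""
    # Shifting by the row maximum leaves the softmax unchanged but avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_softmax).sum(axis=1)

# Two classes, as in this notebook: [not_a_banger, banger]
logits = np.array([[0.0, 0.0], [1000.0, 0.0]])
labels = np.array([[0.0, 1.0], [1.0, 0.0]])
losses = softmax_cross_entropy(logits, labels)
```

An undecided prediction (equal logits) costs log 2, while a confident, correct prediction costs close to zero even when the raw logits are very large.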
Setup
import tensorflow as tf
tf.set_random_seed(1234)
# convolution params
log_specgram_shape = df.iloc[0]["log_specgram"].shape
CONV_STRIDE_LENGTH = 1
CONV_WINDOW_LENGTH = 5
MAX_POOL_STRIDE_LENGTH = 2
# features
CONV_1_NUM_FEATURES = 32
CONV_2_NUM_FEATURES = 16
DENSE_NUM_FEATURES = 256
# training
NUM_LABELS = df.label.unique().size
BATCH_SIZE = 50
NUM_EPOCHS = 1000
LEARNING_RATE = 1e-4
LOG_TRAIN_STEPS = 1
Draw the computational graph
# This node is where we feed a batch of the training data and labels at each training step
x = tf.placeholder(tf.float32,shape=(None, *log_specgram_shape, 1))
y_ = tf.placeholder(tf.float32, shape=(None, len(df.label.unique())))
# Weight initialisation functions
# small noise for symmetry breaking and non-zero gradients
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
# ReLU neurons: initialise with small positive bias to stop 'dead' neurons
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, CONV_STRIDE_LENGTH, CONV_STRIDE_LENGTH, 1], padding='SAME')
# ksize is filter size
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, MAX_POOL_STRIDE_LENGTH, MAX_POOL_STRIDE_LENGTH, 1],
                          strides=[1, MAX_POOL_STRIDE_LENGTH, MAX_POOL_STRIDE_LENGTH, 1], padding='SAME')
First Convolutional Layer
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute CONV_1_NUM_FEATURES features for each CONV_WINDOW_LENGTH $\times$ CONV_WINDOW_LENGTH patch. Its weight tensor will have a shape of [CONV_WINDOW_LENGTH, CONV_WINDOW_LENGTH, 1, CONV_1_NUM_FEATURES]. The first two dimensions are the patch size, the next is the number of input channels (mono audio, so 1), and the last is the number of output channels. We will also have a bias vector with a component for each output channel.
W_conv1 = weight_variable([CONV_WINDOW_LENGTH, CONV_WINDOW_LENGTH, 1, CONV_1_NUM_FEATURES])
b_conv1 = bias_variable([CONV_1_NUM_FEATURES])
h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
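To show what a single feature map of this layer computes, here is a naive NumPy sketch of a stride-1 convolution with 'SAME' zero padding, for one input channel and one odd-sized kernel. The function name conv2d_same is an assumption for illustration. Like tf.nn.conv2d, it slides the kernel without flipping it, but it is far slower and handles only this simplified case.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive single-channel, stride-1 convolution with zero ('SAME') padding."""
    kh, kw = kernel.shape  # assumed odd, e.g. 5x5
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Weighted sum of the patch centred on pixel (i, j)
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((6, 7))
feature_map = conv2d_same(image, np.ones((5, 5)))
```

The output has the same height and width as the input, which is why only the pooling layers shrink the image in this network.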
Second Convolutional Layer
W_conv2 = weight_variable([CONV_WINDOW_LENGTH, CONV_WINDOW_LENGTH, CONV_1_NUM_FEATURES, CONV_2_NUM_FEATURES])
b_conv2 = bias_variable([CONV_2_NUM_FEATURES])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
# 2x2 maxpool gives image dimensions np.ceil(np.array(log_specgram_shape)/2).astype(int)
Densely Connected Layer
Now that the image size has been reduced, we add a fully-connected layer with 256 neurons. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU activation function.
def scale_shape_maxpool2x2(shape_tuple):
    return np.ceil(np.array(shape_tuple)/2).astype(int)
log_specgram_shape_reduced = scale_shape_maxpool2x2(scale_shape_maxpool2x2(log_specgram_shape))
W_fc1 = weight_variable([np.prod(log_specgram_shape_reduced) * CONV_2_NUM_FEATURES, DENSE_NUM_FEATURES])
b_fc1 = bias_variable([DENSE_NUM_FEATURES])
# -1 lets TensorFlow infer the batch dimension
h_pool2_flat = tf.reshape(h_pool2, [-1, np.prod(log_specgram_shape_reduced) * CONV_2_NUM_FEATURES])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
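The shape arithmetic behind the flattened layer can be checked by hand. The sketch below uses a hypothetical spectrogram shape of (1025, 216); the real value of log_specgram_shape is not shown in this notebook, so treat the numbers as an illustration only. The feature counts 16 and 256 match CONV_2_NUM_FEATURES and DENSE_NUM_FEATURES above.

```python
import math

def pooled_shape(shape, stride=2):
    # 'SAME' pooling with stride 2 rounds each dimension up after halving
    return tuple(math.ceil(n / stride) for n in shape)

input_shape = (1025, 216)                # hypothetical (frequency bins, time frames)
after_pool1 = pooled_shape(input_shape)  # shape leaving the first max-pool
after_pool2 = pooled_shape(after_pool1)  # shape leaving the second max-pool

flat_size = after_pool2[0] * after_pool2[1] * 16  # 16 feature maps from conv2
fc1_weights = flat_size * 256                     # size of W_fc1
```

Under these assumed dimensions the image shrinks to (257, 54) and W_fc1 alone would hold roughly 57 million weights, so the dense layer would dominate the model size.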
Dropout
To reduce overfitting, we will apply dropout before the readout layer. We create a placeholder for the probability that a neuron’s output is kept during dropout. This allows us to turn dropout on during training, and turn it off during testing.
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
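As a rough picture of what this op does, here is a NumPy sketch of dropout with a keep probability. The function name dropout and its rng argument are assumptions for this example. Like tf.nn.dropout, it scales the surviving outputs by 1 / keep_prob so that the expected activation stays the same, which is why setting keep_prob to 1.0 at test time needs no further correction.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and rescale the rest."""
    if keep_prob >= 1.0:
        return activations  # dropout switched off, e.g. during testing
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h = np.ones((1000, 100))
h_train = dropout(h, 0.5, rng)  # about half the entries become 0, the rest 2
h_test = dropout(h, 1.0, rng)   # unchanged
```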
Readout Layer
W_fc2 = weight_variable([DENSE_NUM_FEATURES, NUM_LABELS])
b_fc2 = bias_variable([NUM_LABELS])
y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
Training
Batching function
We need a function to feed in batches of data for training.
def return_batch(df, batch_size=10):
    # Draw batch_size clips at random (without replacement within the batch)
    batch_df = df.sample(batch_size)
    x = np.vstack(batch_df["log_specgram"]).reshape(batch_df.index.size, *log_specgram_shape, 1).astype(np.float32)
    y = np.vstack(batch_df["label_one_hot"]).astype(np.float32)
    return x, y
Time logging
We want some rough idea of how long training is going to take. On my laptop it was around 14 hours! 😱
import time
def estimate_time_remaining(time_in, current_step, steps_gap, total_steps):
    # Time elapsed since the last log, averaged over the steps in between
    current_time = time.time() - time_in
    time_per_step = current_time / steps_gap
    time_remaining = (total_steps - current_step) * time_per_step
    m, s = divmod(time_remaining, 60)
    h, m = divmod(m, 60)
    print("Approximately %d hours, %02d minutes, %02d seconds remaining." % (h, m, s))
Train and Evaluate the Model
We’re using the numerically stable tf.nn.softmax_cross_entropy_with_logits function here. This is the long part.
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(LEARNING_RATE).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
with sess.as_default():
    current_time = time.time()
    for i in range(NUM_EPOCHS):
        batch = return_batch(train_data, BATCH_SIZE)
        # logging
        if i % LOG_TRAIN_STEPS == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('Epoch %d, training accuracy %.3f' % (i, train_accuracy))
            estimate_time_remaining(current_time, i, LOG_TRAIN_STEPS, NUM_EPOCHS)
            current_time = time.time()
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
Note: I’ve deleted the output of the above cell to keep the notebook short.
Save model and variables
with sess.as_default():
    saver = tf.train.Saver()
    save_path = saver.save(sess, "../data/model.ckpt")
    print("Model saved in file: %s" % save_path)
Model saved in file: ../data/model.ckpt
Testing
Now we have a trained model, we want to test out how well it works on the test set.
with sess.as_default():
    test_batch = return_batch(test_data, test_data.index.size)
    test_accuracy = accuracy.eval(feed_dict={x: test_batch[0], y_: test_batch[1], keep_prob: 1.0})
    print("Test accuracy: %.3f" % test_accuracy)
Test accuracy: 0.915
Yay! We have done a lot better than the baseline of 0.518.
What’s Next?
It looks like our first attempt with a ConvNet trained on log spectrogram data has worked well. However, there are a bunch of things we could try to improve it:

Feature selection: the librosa library which generated the log spectrogram can compute a whole host of different audio features such as mel spectrogram and decompositions of the signal into percussive and melodic components.

Inspecting misclassified data: digging into which audio clips were misclassified might give us insight into why they were misclassified. We could use this information to improve the model.

ConvNet: There are plenty of hyperparameters to tune here and even the architecture can be changed. Thinking more carefully about the structure of the input features and what design to use could help here.

Other models: perhaps another machine learning model, such as an SVM or a nearest-neighbours classifier, could be more effective (and certainly would be quicker!)
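For the second idea in the list above, a first step could look like the NumPy sketch below. It assumes we have the network's logits for the test set (for example from evaluating y_conv) alongside the one-hot labels. The helper names misclassified_indices and confusion_matrix are hypothetical, and the arrays here are made-up values, not results from the model.

```python
import numpy as np

def misclassified_indices(logits, one_hot_labels):
    """Row positions where the predicted class differs from the true class."""
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return np.flatnonzero(predicted != actual)

def confusion_matrix(logits, one_hot_labels):
    """counts[i, j] = number of clips with true class i predicted as class j."""
    n_classes = one_hot_labels.shape[1]
    counts = np.zeros((n_classes, n_classes), dtype=int)
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    for a, p in zip(actual, predicted):
        counts[a, p] += 1
    return counts

# Made-up example: columns are [not_a_banger, banger]
logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.2, 0.9]])
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
wrong = misclassified_indices(logits, labels)
counts = confusion_matrix(logits, labels)
```

The positions in wrong could then be used to look up the corresponding clip names and plot them with plot_waveforms.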
Thanks for reading, and as Tommy H would say, keep it 💯.