
MuseGAN: Using GANs to generate original Music

With playable audio files to listen to the generated music

Here is the github repo (ads) of this project:

GANs are highly versatile: they can generate anything that can be represented as an image. This makes it possible to generate content that is quite unorthodox, at least from a machine learning perspective. This article shares a project in which I used GANs to generate baroque music, based on MIDI files of Bach compositions.

This is not the first time that I have used GANs to generate creative content. My previous GAN generated art, based on famous artworks by well-known artists. From that project I have two key takeaways about GANs and how to balance them:

Quality over quantity. When I first trained the GAN to generate art, I used a massive jumble of realistic, abstract and impressionist artworks. The results paled in comparison with those from models trained on a single category each.
Balance is key. The thing that holds the GAN together is the adversarial relationship between the discriminator and the generator. If the discriminator gets too good at recognizing fake generations, the generator gets stuck in its current position. If the discriminator gets too weak at recognizing fake generations, the generator starts to exploit it and generates content that tricks the discriminator but does not imitate the real data points.

With these two key takeaways, I got to work on the program:

Data Preprocessing:

The first step in machine learning is data preprocessing. For this project, it consists of 3 steps:

Access Midi Files:

I found an online dataset of Bach’s compositions, scraped from a website. I extracted all the MIDI files and put them into a folder.

Convert Midi Files into images:

I found a GitHub page with 2 programs that use the music21 library to convert MIDI files into images and back.

Each note can be represented as a white block. The height of the block defines the pitch, and the length defines how long the note is played.
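To make the representation concrete, here is a minimal sketch that draws notes into a black image. It is not the conversion program from the GitHub page: the `notes_to_image` helper, the (pitch_row, start, duration) tuple format and the row orientation are all assumptions for illustration.

```python
import numpy as np

def notes_to_image(notes, height=106, width=106):
    """Draw (pitch_row, start, duration) notes as white blocks on a black image."""
    image = np.zeros((height, width), dtype=np.uint8)
    for pitch_row, start, duration in notes:
        # Row index = pitch, columns = the time steps during which the note sounds
        image[pitch_row, start:start + duration] = 255
    return image

# Hypothetical notes: a short low note followed by a longer, higher one
roll = notes_to_image([(40, 0, 4), (47, 4, 8)])
print(roll.shape, int((roll == 255).sum()))  # (106, 106) 12
```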

I then wrote a script to integrate these two programs with my midi files, to create new images in a different directory:

import os

path = 'XXXXXXXXX'

os.chdir(path)
# Collect the full path of every file in the MIDI directory
midis = []
for name in os.listdir():
    midis.append(os.path.join(path, name))

This script goes to the MIDI directory and adds all the MIDI file paths to a list, to be accessed later.

from music21 import midi

mf = midi.MidiFile()
mf.open(midis[0]) 
mf.read()
mf.close()
s = midi.translate.midiFileToStream(mf)
s.show('midi')

This script opens the first MIDI file and plays it to make sure that the program is working. It might not work if you run it in a non-interactive environment.

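import os
import numpy as np
import py_midicsv as pm
from PIL import Image

# midi2image is the conversion function from the GitHub page mentioned above
os.chdir(path)
midi_paths = [os.path.join(path, name) for name in os.listdir()]

new_dir = 'XXXXXXXX'
for midi_path in midi_paths:
    try:
        midi2image(midi_path)
        basewidth = 106
        img_name = os.path.basename(midi_path).replace(".mid", ".png")
        img_path = os.path.join(new_dir, img_name)
        print(img_path)
        img = Image.open(img_path)
        hsize = 106
        # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
        img = img.resize((basewidth, hsize), Image.LANCZOS)
        img.save(img_path)
    except Exception as err:
        # Skip files that fail to convert, but report them
        print('skipped', midi_path, err)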

This script uses the midi2image function from the GitHub page to convert every MIDI file at the given path. Each image is then resized to (106, 106). Why? 106 is the height of the images that the conversion program produces, with one row per note pitch it can represent. Also, squares are much easier to work with for transposed convolutions.
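The size arithmetic behind this choice can be checked by hand. For a transposed convolution with `padding='same'`, the output side length is the input side length times the stride, so a 53x53 feature map upsampled with stride 2 gives exactly 106x106. A small sketch (the helper name is an assumption for illustration):

```python
def transposed_conv_same_size(size, stride):
    """Output side length of a transposed convolution with padding='same'."""
    return size * stride

side = transposed_conv_same_size(53, stride=2)
print(side)  # 106
```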

Construct Dataset:

import os
from PIL import Image
from matplotlib import pyplot as plt 
import numpy as np

path = 'XXXXXXXXXXXXXX'
img_list = os.listdir(path)

def access_images(img_list, path, length):
    pixels = []
    imgs = []
    # Slicing never indexes past the end of the file list
    for name in img_list[:length]:
        if name.endswith('.png'):
            try:
                img = Image.open(os.path.join(path, name), 'r')
                img = img.convert('1')  # pure black and white
                pix = np.array(img.getdata())
                pix = pix.astype('float32')
                pix /= 255.0  # 255 (white) -> 1.0, 0 (black) -> 0.0
                pixels.append(pix.reshape(106, 106, 1))
                imgs.append(img)
            except Exception as err:
                # Skip unreadable images, but report them
                print('skipped', name, err)
    return np.array(pixels), imgs

def show_image(pix_list):
    # Scale the 0/1 values back to 0/255 so that notes show up as white
    array = np.array(pix_list.reshape(106, 106) * 255, dtype=np.uint8)
    new_image = Image.fromarray(array)
    new_image.show()

pixels, imgs = access_images(img_list, path, 200)

This script goes to the directory that contains all the images and records their pixel values. These are the real samples that will be fed into the discriminator, along with the computer-generated samples. The pixel values are divided by 255 so that each value is either 1 or 0 (white or black), which matches the 0-to-1 range of the generator's sigmoid output.

np.unique(pixels)

This script just makes sure that the pixel values have been normalized to 0 and 1.
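A slightly stricter version of this check, as a sketch: it tests that only the values 0 and 1 occur. The `fake_pixels` array is a stand-in for the real `pixels` array, and `is_binary` is a hypothetical helper.

```python
import numpy as np

def is_binary(pixels):
    """True if every value in the array is exactly 0 or 1."""
    return bool(np.isin(np.unique(pixels), [0.0, 1.0]).all())

# Stand-in for the real dataset: raw values of 0 or 255, scaled like the images above
fake_pixels = np.array([0, 255, 255, 0], dtype='float32') / 255.0
print(is_binary(fake_pixels))        # True
print(is_binary(fake_pixels * 0.5))  # False, because 0.5 is present
```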

Creating the GAN:

Imports:

There are quite a lot of prerequisites for this program to work:

from numpy import zeros
from numpy import ones
from numpy import vstack
from numpy.random import randn
from numpy.random import randint
from keras.optimizers import Adam
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Reshape
from keras.layers import Flatten,BatchNormalization
from keras.layers import Conv2D
from keras.layers import Conv2DTranspose
from keras.layers import LeakyReLU
from keras.layers import Dropout
from matplotlib import pyplot
from IPython.display import clear_output

These are basically all the layers and NumPy functions needed to run the GAN. clear_output just clears the screen every 10 epochs, so that it does not get clogged up.

Define Discriminator:

def define_discriminator(in_shape = (106,106,1)):
    model = Sequential()
    model.add(Conv2D(64, (3,3), strides=(2, 2), padding='same', input_shape=in_shape))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dropout(0.5))
    model.add(Conv2D(64, (3,3), strides=(2, 2), padding='same'))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(BatchNormalization())
    model.add(Dense(1, activation='sigmoid'))
    opt = Adam(learning_rate=0.0002, beta_1=0.5)  # the argument was called lr in older Keras versions
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    return model

This discriminator has been fine-tuned after a lot of experimentation. The convolutional layers have a low number of filters, so that the generator can catch up before the discriminator gets too far ahead. The dropout layers are also necessary so that the discriminator does not overfit the data.
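To see how small the discriminator's feature maps get, the layer sizes can be worked out by hand. With `padding='same'`, a strided convolution produces ceil(size / stride) outputs per side. A sketch of that arithmetic (the helper name is an assumption):

```python
import math

def conv_same_size(size, stride):
    """Output side length of a convolution with padding='same'."""
    return math.ceil(size / stride)

side = 106
for _ in range(2):            # the two Conv2D layers, each with stride 2
    side = conv_same_size(side, stride=2)
flattened = side * side * 64  # 64 filters in the last Conv2D layer
print(side, flattened)        # 27 46656
```

So the final Dense layer sees 46,656 inputs per image.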

Define Generator:

def define_generator(latent_dim):
    model = Sequential()
    n_nodes = 128 * 53 * 53
    model.add(Dense(n_nodes, input_dim=latent_dim))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Reshape((53, 53, 128)))
    model.add(Dense(1024))
    model.add(Conv2DTranspose(1024, (4,4), strides=(2,2), padding='same'))
    model.add(Dense(1024))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1024))
    model.add(Conv2D(1, (7,7) , padding='same',activation = 'sigmoid'))
    return model

The generator is especially deep because, in nearly all cases, the generator falls behind. Leaky ReLU is used to prevent the “dying ReLU” problem, where units whose inputs are negative output zero, receive zero gradient, and stop training. The latent dimension for this GAN must not be too big, as it could slow down the generator’s training. The value that I decided on is 100.
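To illustrate the difference between the two activations, here is a NumPy sketch of their gradients. The slope 0.2 matches the `alpha` used above; the function names are hypothetical.

```python
import numpy as np

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.2):
    """Gradient of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 1.5])
print(relu_grad(x).tolist())        # [0.0, 0.0, 1.0]
print(leaky_relu_grad(x).tolist())  # [0.2, 0.2, 1.0]
```

With plain ReLU the negative inputs pass no gradient back at all, while leaky ReLU still lets a fraction of it through.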

def define_gan(g_model, d_model):
    d_model.trainable = False
    model = Sequential()
    model.add(g_model)
    model.add(d_model)
    opt = Adam(learning_rate=0.0002, beta_1=0.5)  # the argument was called lr in older Keras versions
    model.compile(loss='binary_crossentropy', optimizer=opt)
    return model

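This script combines the discriminator and the generator, so that the loss from the discriminator can be backpropagated into the generator. The discriminator's weights are frozen inside this combined model, so only the generator is updated when it is trained.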
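Concretely, the generator is trained through this combined model by labelling its fake samples as real (1). Binary cross-entropy then reduces to -log(D(G(z))), which is small when the discriminator is fooled. A NumPy sketch with made-up discriminator outputs:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with clipping to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

y_gan = np.ones(3)                   # fake samples are labelled "real"
fooled = np.array([0.9, 0.8, 0.95])  # made-up D(G(z)): the discriminator is fooled
caught = np.array([0.1, 0.2, 0.05])  # made-up D(G(z)): the discriminator spots the fakes
print(binary_crossentropy(y_gan, fooled) < binary_crossentropy(y_gan, caught))  # True
```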

Generate Samples:

def generate_real_samples(dataset, n_samples):
    ix = randint(0, dataset.shape[0], n_samples)
    X = dataset[ix]
    y = ones((n_samples, 1))
    return X, y
 
def generate_latent_points(latent_dim, n_samples):
    x_input = randn(latent_dim * n_samples)
    x_input = x_input.reshape(n_samples, latent_dim)
    return x_input

def generate_fake_samples(g_model, latent_dim, n_samples):
    x_input = generate_latent_points(latent_dim, n_samples)
    X = g_model.predict(x_input)
    y = zeros((n_samples, 1))
    return X, y

These three functions generate all the data that the GAN needs to function: the latent points serve as the input of the generator, while the fake and real samples are used to train the discriminator.
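A quick way to sanity-check these helpers is to look at the shapes they produce. The sketch below repeats `generate_latent_points` and uses stand-in arrays instead of real images and generator output, so it runs without a trained model.

```python
import numpy as np
from numpy.random import randn

def generate_latent_points(latent_dim, n_samples):
    x_input = randn(latent_dim * n_samples)
    return x_input.reshape(n_samples, latent_dim)

# Stand-ins for real images and generator output, only to show the shapes
X_real, y_real = np.ones((5, 106, 106, 1)), np.ones((5, 1))
X_fake, y_fake = np.zeros((5, 106, 106, 1)), np.zeros((5, 1))
X, y = np.vstack((X_real, X_fake)), np.vstack((y_real, y_fake))

print(generate_latent_points(100, 5).shape)  # (5, 100)
print(X.shape, y.shape)                      # (10, 106, 106, 1) (10, 1)
```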

Train GAN:

def train(g_model, d_model, gan_model, dataset, latent_dim, n_epochs=51, n_batch=10):
    bat_per_epo = int(dataset.shape[0] / n_batch)
    half_batch = int(n_batch / 2)
    for i in range(n_epochs):
        for j in range(bat_per_epo):
            X_real, y_real = generate_real_samples(dataset, half_batch)
            X_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch)
            X, y = vstack((X_real, X_fake)), vstack((y_real, y_fake))
            d_loss, _ = d_model.train_on_batch(X, y)
            X_gan = generate_latent_points(latent_dim, n_batch)
            y_gan = ones((n_batch, 1))
            g_loss = gan_model.train_on_batch(X_gan, y_gan)
            print('>%d, %d/%d, d=%.3f, g=%.3f' % (i+1, j+1, bat_per_epo, d_loss, g_loss))
        if (i+1) % 10 == 0:
            # summarize_performance is not defined in this article; define it before training
            summarize_performance(i, g_model, d_model, dataset, latent_dim)
            clear_output()

This function trains the GAN. It orchestrates all the functions defined above and prints the loss for both the discriminator and the generator, which allows you to check the balance between them.
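The `train` function calls `summarize_performance`, which is not shown in this article. Here is a minimal sketch of what such a function could look like, assuming it only needs to report the discriminator's accuracy on real and generated samples. The `n_samples` parameter and the printed format are my assumptions, not the original implementation.

```python
import numpy as np

def summarize_performance(epoch, g_model, d_model, dataset, latent_dim, n_samples=100):
    """Print the discriminator's accuracy on real and on generated samples."""
    # Real samples, labelled 1
    ix = np.random.randint(0, dataset.shape[0], n_samples)
    X_real, y_real = dataset[ix], np.ones((n_samples, 1))
    # Generated samples, labelled 0
    latent = np.random.randn(n_samples, latent_dim)
    X_fake, y_fake = g_model.predict(latent), np.zeros((n_samples, 1))
    # evaluate() returns [loss, accuracy] when the model is compiled with metrics=['accuracy']
    _, acc_real = d_model.evaluate(X_real, y_real, verbose=0)
    _, acc_fake = d_model.evaluate(X_fake, y_fake, verbose=0)
    print('>Epoch %d accuracy real: %.0f%%, fake: %.0f%%'
          % (epoch + 1, acc_real * 100, acc_fake * 100))
    return acc_real, acc_fake
```

Note that `train` calls `clear_output()` right after this function, so the printed line disappears almost immediately. Appending the returned accuracies to a list is one way to keep them.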

latent_dim = 100
d_model = define_discriminator()
g_model = define_generator(latent_dim)
gan_model = define_gan(g_model, d_model)
print(pixels.shape)
train(g_model, d_model, gan_model, np.array(pixels), latent_dim)

This script just calls the functions and runs the program.

Visualizing results:

from keras.models import load_model
from numpy.random import randn
from matplotlib import pyplot

def generate_latent_points(latent_dim, n_samples):
    x_input = randn(latent_dim * n_samples)
    x_input = x_input.reshape(n_samples, latent_dim)
    return x_input

model = g_model
latent_points = generate_latent_points(latent_dim,1)
X = g_model.predict(latent_points)

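# Threshold the sigmoid outputs before casting: values in [0, 1) cast straight
# to uint8 would all be truncated to 0. The 0.5 cut-off is an assumption.
array = (X.reshape(106,106) > 0.5).astype(np.uint8) * 255
new_image = Image.fromarray(array)
new_image.save('composition.png')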

This script calls upon the model to make predictions on latent points, which results in an array. This array is then converted into an image using PIL.
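One detail to watch when converting sigmoid outputs to an image: casting floats in [0, 1) to `uint8` truncates them to 0, so the values have to be scaled or thresholded before the cast, not after. A sketch with made-up generator outputs; the 0.5 threshold is an assumption.

```python
import numpy as np

X = np.array([[0.02, 0.97], [0.6, 0.4]])  # made-up sigmoid outputs

truncated = X.astype(np.uint8) * 255            # cast first: every value below 1 becomes 0
thresholded = (X > 0.5).astype(np.uint8) * 255  # threshold first, then scale to 0/255

print(truncated.tolist())    # [[0, 0], [0, 0]]
print(thresholded.tolist())  # [[0, 255], [255, 0]]
```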

image2midi('composition.png')

After converting the image into a MIDI file, you can run these commands in a notebook cell to listen to it.

!apt install fluidsynth
!cp /usr/share/sounds/sf2/FluidR3_GM.sf2 ./font.sf2
!fluidsynth -ni font.sf2 composition.mid -F output.wav -r 44100
from IPython.display import Audio
Audio('output.wav')

Results:

Here are a few of my favourite cuts from the AI generated music:

The model has begun to pick up on song structure, basic harmony and rhythm, although it does start to sound a bit like jazz music.
