Generating music using images
with Hugging Face’s new diffusers package
Aphex Twin embedded a self-portrait in the spectrogram of Equation (image credit Jarmo Niinisalo)
[UPDATE: I’ve also trained the model on 30,000 samples that have been used in music, sourced from WhoSampled and YouTube. The idea is that the model could be used to generate loops or “breaks” that can be sampled to make new tracks. People (“crate diggers”) go to great lengths, or pay a lot of money, to find breaks in old records.]
I have been astonished by the recent improvements in deep learning models in the domains of image generation (DALL-E 2, MidJourney, Imagen, Make-A-Scene, etc.) and text generation (GPT-3, BLOOM, BART, T5, etc.) but, at the same time, surprised by the relative lack of progress in audio generation. A few notable exceptions come to mind: MuseNet treats sheet music as sequential tokens (similar to text) and leverages GPT-2, while Jukebox and WaveNet generate music from raw waveforms. Even so, is audio generation a laggard because there is less interest in it, or because it is intrinsically more challenging?
Whatever the case, audio can easily be converted to an image and vice versa, by way of a mel spectrogram.
Left as an exercise for the reader to determine which song this was taken from.
The horizontal axis of the spectrogram is time, the vertical axis is frequency (on the mel scale, which is roughly logarithmic) and the shade represents amplitude (also on a log scale). The mel scale is designed to approximate how the human ear perceives pitch.
If we can now easily generate convincing-looking photos of celebrities using AI, why not try to generate plausible spectrograms and convert them into audio? This is exactly what I have done using the new Hugging Face diffusers package.
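To make the frequency axis concrete, here is a minimal sketch of the Hz-to-mel mapping. It assumes the widely used HTK-style formula m = 2595 · log10(1 + f / 700); other variants of the mel scale exist, and the function names are illustrative, not taken from any particular library. It shows that bands of equal width on the mel scale get wider in Hz as the frequency rises.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Edges (in Hz) of n_bands bands equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]


# Eight illustrative bands between 0 Hz and 8 kHz.
edges = mel_band_edges(0.0, 8000.0, 8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} Hz to {hi:8.1f} Hz")
```

Each printed band is wider than the one before it, so the low frequencies, where the ear is most discriminating, get more rows of the image per Hz than the high frequencies.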
TL;DR
So, how well does it work? Check out some automatically generated loops:
You can also generate more for yourself on Google Colab.
You can choose between a model trained on almost 500 tracks (around 20,000 spectrograms) from my Spotify “liked” playlist or one trained on 30,000 samples that have been used in music.
In the above repo, you will find utilities to create a dataset of spectrogram images from a directory of audio files, train a model to generate similar spectrograms and convert the generated spectrograms into audio. You will also find notebooks that allow you to play around with the pre-trained model.
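The dataset-building step can be pictured as cutting each track into fixed-length, non-overlapping slices, one per spectrogram. The sketch below is a rough illustration and not the repo's actual code: `slice_bounds` is a hypothetical helper, and the 22,050 Hz sample rate and 200-second track length are assumed values. With five-second slices, a 200-second track gives 40 slices, which is consistent with almost 500 tracks producing around 20,000 spectrograms.

```python
def slice_bounds(n_samples: int, slice_len: int) -> list:
    """Start/end sample indices of non-overlapping slices.

    Any remainder shorter than slice_len at the end of the track is dropped.
    """
    return [(start, start + slice_len)
            for start in range(0, n_samples - slice_len + 1, slice_len)]


SAMPLE_RATE = 22_050   # assumed sample rate in Hz
SLICE_SECONDS = 5      # slice length quoted in the post
slice_len = SAMPLE_RATE * SLICE_SECONDS

track = 200 * SAMPLE_RATE            # hypothetical 200-second track
bounds = slice_bounds(track, slice_len)
print(len(bounds))                   # slices from one such track
print(len(bounds) * 500)             # slices from 500 such tracks
```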
If you are interested in the details of the model, then I recommend you read the Denoising Diffusion Probabilistic Models paper. According to OpenAI, diffusion models beat GANs at their own game. The basic idea is that a model is trained to recover an image from a version of it that has been corrupted by Gaussian noise. If the model is trained with photos of celebrities, for example, it will come to learn what typical (or maybe not so typical!) facial features look like. To generate a random face of a celebrity, the model is given a completely random image and run repeatedly on its own output; after each step, the image is slightly less noisy than before and looks a little more like a face (or a spectrogram in our case).
For simplicity, I chose to create square spectrogram images of 256 x 256 pixels, which correspond to five seconds of reasonable-quality audio. I used Hugging Face’s accelerate package to split the batches into shards that would fit on my single RTX 2080 Ti GPU. The training took about 40 hours.
Bear in mind that my Spotify playlist is a bit of a mix of different styles of music. I felt it was important to use a training dataset that I knew intimately, so that I would be able to judge how much the model was creating and how much it was just regurgitating (as well as there being a good chance I might actually like the results). In the same way that the celebrities dataset is relatively homogeneous, it would be interesting to train the model with piano-only music, or just techno music, for example, to see whether it is able to learn anything about a particular genre.
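The corruption step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the training code: it uses the linear noise schedule from the DDPM paper (1,000 steps, beta from 1e-4 to 0.02), a four-element list stands in for an image, and the function names are made up for this example. The closed form x_t = sqrt(abar_t) · x_0 + sqrt(1 − abar_t) · eps lets us jump straight to any noise level t.

```python
import math
import random


def linear_betas(n_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list:
    """Linear noise schedule (values from the DDPM paper)."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def alpha_bars(betas: list) -> list:
    """Cumulative product of (1 - beta_t): share of signal variance left at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(x0: list, t: int, abar: list, rng: random.Random):
    """Sample x_t from x_0 in one shot and return it with the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    s, n = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    xt = [s * x + n * e for x, e in zip(x0, eps)]
    return xt, eps  # in the DDPM parameterization, the network learns to predict eps


rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for normalized pixel values
for t in (0, 499, 999):
    xt, _ = add_noise(x0, t, abar, rng)
    print(t, round(abar[t], 4), [round(v, 2) for v in xt])
```

At the first step the output is almost identical to the input; by the last step almost no signal is left, which is why generation can start from pure noise and work backwards.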
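The link between image width and clip length is simple arithmetic: each pixel column is one spectrogram frame, and consecutive frames are one hop apart. The sketch below assumes a hop length of 512 samples and a sample rate of 22,050 Hz. These are common defaults rather than values stated in this post, and `spectrogram_seconds` is a hypothetical helper. With those numbers, 256 columns cover a little under six seconds, in the same range as the figure quoted above.

```python
def spectrogram_seconds(n_frames: int, hop_length: int, sample_rate: int) -> float:
    """Approximate duration covered by n_frames spectrogram columns."""
    return n_frames * hop_length / sample_rate


# Assumed parameters (not stated in the post): hop of 512 samples at 22,050 Hz.
duration = spectrogram_seconds(n_frames=256, hop_length=512, sample_rate=22_050)
print(round(duration, 2))
```

Halving the hop length would halve the duration covered by the same 256 columns while doubling the time resolution, so image size, hop length and clip length have to be chosen together.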