A new AI model, Pop2Piano, generates piano covers from pop songs by analyzing and reinterpreting the music with a transformer-based architecture.
Abstract
The article discusses the Pop2Piano model, an AI capable of generating piano covers from pop songs. The model turns the original multi-instrumental tracks into piano arrangements after being trained on a dataset of piano covers synchronized with their original songs. It uses a transformer architecture, specifically T5-small, with an encoder-decoder structure and a learnable embedding layer that captures the arranger's style. Despite some limitations, such as the inability to handle rhythms finer than eighth notes, the AI has produced plausible results in both objective and subjective evaluations. Users can test the model through a provided GitHub repository and Google Colab notebook, which allows for the creation and download of MIDI piano covers.
Opinions
The author acknowledges the complexity of creating piano covers, emphasizing the need for musical skill and creativity.
The article suggests that the AI's output is not only plausible but also comparable to human arrangers, as supported by a subjective evaluation where 70% of participants preferred the model's work over human arrangements.
The author provides a balanced view by discussing the limitations of the model, such as the use of four-beat length audio for context and the inability to generate covers with rhythms like triplets or trills.
The author is optimistic about the potential of AI in music, drawing parallels to advancements in image generation and highlighting the work of other tech giants like Microsoft and Google in the field of AI-generated music.
The author encourages reader engagement by inviting them to try the model and provides resources, including a GitHub repository and links to other related articles, to foster further exploration into AI and machine learning.
Generate a piano cover with AI
A new model generates a piano cover from a pop song: how does it work, and how can you try it?
A piano cover is a version of a song in which all the instruments are replaced by the piano alone. Lots of them can be found on YouTube, and they may seem almost trivial to make (spoiler: they are not).
To create a piano cover, a person must recognize all the musical elements in the song and reinterpret them using only the piano. This takes musical skill and also creativity in recreating the melody. If it is already difficult for a human being, can an AI succeed?
As the authors state in the paper, this challenge has been attempted before. The idea is to extract the tracks of the various instruments from the audio and rearrange them. The task is not easy, because a good cover is shaped by both the atmosphere of the song and the arranger’s style.
The authors started from a dataset of roughly 300 hours of synchronized piano covers. Instead of using raw music alone, they paired original songs with their piano covers, synchronized each cover with its original, and divided the pairs into segments. The covers were converted to MIDI and the notes were quantized to eighth-note units. In total, they collected 5989 piano covers from 21 arrangers on YouTube; after filtering, they kept 4989 of them (307 hours).
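To make the eighth-note reduction concrete, here is a minimal Python sketch of snapping note onsets to an eighth-note grid. It assumes that beat times are already available (for example from a beat tracker); the function names and the sample values are hypothetical and are not taken from the authors' code.

```python
from bisect import bisect_right


def eighth_note_grid(beat_times):
    """Interleave the beat times with the midpoints between consecutive beats."""
    grid = []
    for start, end in zip(beat_times, beat_times[1:]):
        grid.append(start)
        grid.append((start + end) / 2)  # the off-beat eighth note
    grid.append(beat_times[-1])
    return grid


def quantize_onsets(onsets, beat_times):
    """Map each onset (in seconds) to the index of the nearest grid point."""
    grid = eighth_note_grid(beat_times)
    indices = []
    for t in onsets:
        pos = bisect_right(grid, t)
        # Only the grid points just before and just after t can be the nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(grid)]
        indices.append(min(candidates, key=lambda i: abs(grid[i] - t)))
    return indices


# Hypothetical example: beats every 0.5 s (120 BPM) and four slightly loose onsets.
beats = [0.0, 0.5, 1.0, 1.5, 2.0]
onsets = [0.02, 0.27, 0.49, 1.13]
print(quantize_onsets(onsets, beats))  # [0, 1, 2, 5]
```

After this step every note lives on an integer grid position, which is what makes it possible to describe the cover with a small, discrete vocabulary.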
“Fig. 1. A preprocessing pipeline for synchronizing and filtering paired {Pop, Piano Cover} audio data”. image source from the original article (source)
The Pop2Piano model architecture is T5-small [7] used for [9]. It is a Transformer network with an encoder-decoder structure. The number of learnable parameters is about 59M. Unlike [9], the relative positional embedding of the original T5 is used instead of the absolute positional embedding. Additionally, A learnable embedding layer is used for embedding the arranger style. — from the original article (source)
As can be seen from the figure below, the model consists of an encoder and a decoder.
“Fig. 2. The architecture of our model is an encoder-decoder Transformer. Each input position for the encoder is one frame of the spectrogram. We concatenated an embedding vector representing a target arranger style to the spectrogram. Output midi tokens are autoregressively generated from the decoder.” image source from the original article (source)
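As a rough illustration of the caption above, the sketch below prepends an arranger-style vector to a sequence of spectrogram frames. This is only a toy version under stated assumptions: the embedding table is randomly initialized rather than learned, the concatenation is assumed to happen along the time axis, and all names and sizes except the 21 arrangers are hypothetical.

```python
import random

NUM_ARRANGERS = 21  # number of arrangers reported in the article
FRAME_DIM = 8       # hypothetical feature size, kept tiny for readability

random.seed(0)
# Stand-in for a learnable embedding layer: one vector per arranger.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(FRAME_DIM)]
    for _ in range(NUM_ARRANGERS)
]


def build_encoder_input(spectrogram_frames, arranger_id, table):
    """Return the frame sequence with the arranger vector as an extra first position."""
    style_vector = table[arranger_id]
    if any(len(frame) != len(style_vector) for frame in spectrogram_frames):
        raise ValueError("frames and style vector must share the same dimension")
    return [style_vector] + list(spectrogram_frames)


# Hypothetical input: 5 spectrogram frames of silence.
frames = [[0.0] * FRAME_DIM for _ in range(5)]
encoder_input = build_encoder_input(frames, arranger_id=3, table=embedding_table)
print(len(encoder_input))  # 6 positions: 1 style vector + 5 frames
```

Changing only `arranger_id` changes the first position of the input, which is how a single model could, in principle, be steered toward different arranging styles for the same song.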
The authors also show an example of how the piano output is tokenized:
“Fig. 3. An example of piano tokenization. the beat shift token means a relative time shift from that point in time.” image source from the original article (source)
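The tokenization described in the caption can be sketched roughly as follows. This is a simplified illustration, not the authors' tokenizer: it assumes notes are already quantized to grid steps, and the token names (`beat_shift_N`, `note_on`, `note_off`, `pitch_N`, `EOS`) are hypothetical labels chosen for readability.

```python
def notes_to_tokens(notes):
    """Turn (onset_step, offset_step, midi_pitch) triples into a flat token list."""
    events = []
    for onset, offset, pitch in notes:
        events.append((onset, 1, pitch))   # 1 marks a note-on event
        events.append((offset, 0, pitch))  # 0 marks a note-off event
    # Sort by time; at the same step, note-offs come before note-ons.
    events.sort()

    tokens = []
    current_step = 0
    for step, is_on, pitch in events:
        if step > current_step:
            # Relative shift from the current position, not an absolute time.
            tokens.append(f"beat_shift_{step - current_step}")
            current_step = step
        tokens.append("note_on" if is_on else "note_off")
        tokens.append(f"pitch_{pitch}")
    tokens.append("EOS")
    return tokens


# Hypothetical example: a C-E dyad for two steps, followed by a G for two steps.
example = [(0, 2, 60), (0, 2, 64), (2, 4, 67)]
print(notes_to_tokens(example))
```

Because each shift is relative to the previous position, the same musical pattern produces the same tokens wherever it occurs in the song.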
Although the original song is complex (composed of several instruments and the vocal part), the piano accompaniment seems plausible. Not only does it sound plausible, it also resembles the arranger’s work.
Moreover, the output also held up in a subjective evaluation with 25 participants, none of whom were musicians. Participants listened to 10-second excerpts of 25 songs and compared the model’s cover with an arrangement made by a human. Seventy percent preferred the model’s work.
Here is a video released by the authors as an example:
Also, on the project website, you can test other songs and arrangements (you can find them here).
The authors acknowledge that there are still limitations:
We recognize that some improvements can be made to our model. For instance, Pop2Piano uses only four-beat length audio for the context of input. Therefore, features such as melody contour or texture of accompaniment have less consistency when generating longer than four-beat. Also, time quantization based on eighth note beats prevents the model from generating piano covers with other rhythms such as triplets, 16th notes, and trills. — from the original article (source)
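To see what a four-beat context means in practice, here is a small sketch that cuts a track into consecutive four-beat windows. It assumes the beat times are already known; the function name and the example values are hypothetical and only illustrate the idea, not the authors' implementation.

```python
def four_beat_windows(beat_times, beats_per_window=4):
    """Split a track into consecutive (start, end) windows of four beats each."""
    windows = []
    # A window needs its starting beat and the beat that closes it.
    for i in range(0, len(beat_times) - beats_per_window, beats_per_window):
        windows.append((beat_times[i], beat_times[i + beats_per_window]))
    return windows


# Hypothetical example: 13 beat marks at 120 BPM, i.e. 12 full beats.
beats = [i * 0.5 for i in range(13)]
for start, end in four_beat_windows(beats):
    print(f"window from {start:.1f}s to {end:.1f}s")
```

If each window is processed on its own, nothing carries over from one window to the next, which is consistent with the authors' remark that melody contour and accompaniment texture become less consistent over longer spans.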
First, change the runtime: in the menu at the top, select Runtime, then Change Runtime Type, and choose GPU in the drop-down menu. Once that is done, run the first block of code (CTRL+ENTER, or click the small play button). This may take a few minutes; as soon as it is complete, go to the second block.
Run the second block in the same way. It should take about a minute.
The third block should also take a short time (how long depends on your connection, since it downloads the model).
This block allows you to choose the arranger. In the drop-down menu, pick the arranger whose style you prefer (if you want some guidance, the project site shows the differences between the various arrangers).
In this block, you can upload the audio track whose piano cover you want to create (WAV and MP3 are both accepted; I used an MP3 converted from a YouTube video).
Run this block of code (it shouldn’t take long).
Finally, run this block to download the piano cover (in MIDI format). You will find it in the same folder as the original track.
Conclusions
Once a song is loaded, the model lets you download the cover as a MIDI track (mind you, it is not synchronized with the vocals as in the examples on the project site). I have tried several songs, and it works quite well with pop songs but less well with other genres (for example, when there is a long drum sequence).
In general, the result is interesting, especially considering the architecture and the fact that the number of parameters is not very large (only about 59 million). As we have seen, Microsoft has also launched a model that generates music, and Google itself is investing in the same field. It seems that after images, music is the next frontier. What do you think? Have you tried it? Let me know in the comments.
If you have found it interesting:
You can look for my other articles, subscribe to get notified when I publish new ones, and connect with or reach me on LinkedIn. Thanks for your support!
Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.