Salvatore Raieli



Microsoft’s Museformer: AI music is the new frontier

AI art is exploding; music could be next.

Image generated with OpenAI Dall-E 2

Transformers have revolutionized natural language processing. In recent months, however, we have seen how AI has been applied to art and image generation. A few days ago, Microsoft announced Museformer, a model for music generation.

Music can be represented as an organized, discrete sequence of tokens (after all, music consists of a series of sounds in sequence). Transformers have proven effective at text generation, which is itself nothing more than producing a sequence of tokens. Their success rests on the self-attention mechanism, which captures long-range dependencies in a text. Modeling music likewise requires capturing long-range dependencies and the correlation between distant parts of the musical sequence. This lays the foundation for using a transformer to generate music.

The music score of Twinkle, Twinkle, Little Star, and its corresponding token representation. from the original article (here)
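To make the idea of music as a token sequence concrete, here is a minimal sketch in Python. The token names (`pos-`, `pitch-`, `dur-`, `bar`) are hypothetical and chosen only for illustration; they are not the vocabulary used in the paper. The sketch also assumes 4/4 time and notes that never cross a bar line.

```python
from typing import List, Tuple

# (MIDI pitch, duration in quarter notes): opening of "Twinkle, Twinkle, Little Star" in C major.
MELODY: List[Tuple[int, int]] = [
    (60, 1), (60, 1), (67, 1), (67, 1),  # C C G G
    (69, 1), (69, 1), (67, 2),           # A A G (half note)
]

BEATS_PER_BAR = 4  # assumption: 4/4 time


def tokenize(melody: List[Tuple[int, int]], beats_per_bar: int = BEATS_PER_BAR) -> List[str]:
    """Flatten notes into position/pitch/duration tokens and close each full bar with a 'bar' token."""
    tokens: List[str] = []
    beat = 0
    for pitch, duration in melody:
        tokens += [f"pos-{beat}", f"pitch-{pitch}", f"dur-{duration}"]
        beat += duration
        if beat >= beats_per_bar:  # bar is full (assumes notes do not cross bar lines)
            tokens.append("bar")
            beat = 0
    return tokens


tokens = tokenize(MELODY)
print(len(tokens))  # 7 notes x 3 tokens + 2 bar tokens
print(tokens[:7])
```

Even this seven-note melody for a single instrument already needs 23 tokens, which hints at how quickly real multi-instrument pieces become long sequences.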

Of course, it is not that easy:

  • Long sequence modeling: musical sequences are very long (especially when there are several instruments). This is a problem because the attention mechanism has quadratic complexity, so the computational cost grows with the square of the sequence length.
  • Music structure modeling: music has its own unique structure, with definite patterns that are repeated and can have variations. These patterns sometimes recur at long distances in the sequence, which makes them harder to model.
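As a quick illustration of the first point, full self-attention scores every token against every other token, so the number of query-key pairs is the square of the sequence length. The sequence lengths below are arbitrary examples, not figures from the paper.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return seq_len * seq_len


for n in (1_000, 2_000, 4_000, 16_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} pairs")
```

Doubling the sequence length quadruples the work: steep growth, but polynomial rather than exponential.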

This is not the first time that attempts have been made to handle long sequences with transformers. There are two approaches that have been used primarily:

  • Local focusing, as in the case of Transformer-XL and Longformer, where the focus is only on part of the input sequence and the rest is dropped. In the case of music, the retained part may not contain the important part of the musical structure.
  • Global approximation, used by the Linear Transformer, where the sequence is compressed. Although this compression reduces complexity, it does not capture the correlation between the various parts of a musical sequence.
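The weakness of local focusing can be seen with a generic causal sliding-window mask. This is only a sketch of the general idea, not the exact mechanism of Transformer-XL or Longformer, and the window size is an arbitrary assumption.

```python
from typing import List


def local_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j,
    i.e. j is one of the last `window` positions up to and including i."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


mask = local_causal_mask(seq_len=8, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The last token sees only positions 5 to 7. If the passage it should repeat sits at position 0, that information is simply unavailable.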

The insight of this paper is that, although these two approaches are each inadequate, one can take the best of both. Indeed, not all parts of a musical sequence are equally important (the information is not evenly distributed). So we need to preserve the important parts and, when generating music, attend closely to the important repetitions. The remaining, less important passages can be approximated. In summary, the idea is to focus on the important parts while reducing the complexity and computational cost.

This idea is put into practice by a mechanism called fine- and coarse-grained attention (FC-Attention), which replaces the classical self-attention module:

The general idea is that we do not need to focus on the whole sequence with the same importance level given that the complexity of pair-wise full attention is unacceptably high, but instead we combine two different attention schemes — fine-grained attention for the structure-related bars, and coarse-grained attention for the other bars.

The fine- and coarse-grained attention. from the original article (here)

In other words, there are two steps: summarization and aggregation. The first step reduces the complexity by creating a kind of “summary token” for each bar; the second then aggregates the information (as in classical attention, this allows contextualization of the information):

The basic idea of FC-Attention is that, instead of directly attending to all the tokens which causes the quadratic complexity, a token of a specific bar only directly attends to the structure-related bars that are essential for generating structured music (fine-grained attention), and for the other bars, the token only attends to their summary tokens to obtain concentrated information (coarse-grained attention). To achieve this, we first summarize the local information of each bar through the summarization step, and then aggregate the fine-grained and coarse-grained information through the aggregation step.
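The aggregation step described in this quote can be sketched as a function that, for one query token, lists which earlier tokens it attends to directly (fine-grained) and which earlier bars it sees only through a summary token (coarse-grained). This is an illustrative sketch, not the authors' implementation: the function name, the `structure_offsets` parameter and its default value `(1, 2, 4)` are assumptions made for the example, and the summarization step that produces the summary tokens is not shown.

```python
from typing import List, Sequence, Tuple


def fc_attention_targets(
    bar_of_token: Sequence[int],
    query: int,
    structure_offsets: Sequence[int] = (1, 2, 4),  # hypothetical "structure-related" distances
) -> Tuple[List[int], List[int]]:
    """Return (fine_token_indices, coarse_bar_indices) for one query token.

    fine:   tokens up to and including the query that lie in the current bar
            or in a structure-related previous bar.
    coarse: previous bars that are visible only through their summary token.
    """
    current_bar = bar_of_token[query]
    fine_bars = {current_bar} | {
        current_bar - k for k in structure_offsets if current_bar - k >= 0
    }
    fine = [j for j in range(query + 1) if bar_of_token[j] in fine_bars]
    coarse = [b for b in range(current_bar) if b not in fine_bars]
    return fine, coarse


# Six bars with three tokens each: token t belongs to bar t // 3.
bar_of_token = [t // 3 for t in range(18)]
fine, coarse = fc_attention_targets(bar_of_token, query=16)
print(fine)    # tokens attended to directly
print(coarse)  # bars attended to via one summary token each
print(len(fine) + len(coarse), "attended items instead of", 16 + 1)
```

With only three tokens per bar the saving is small (13 items instead of 17). The gain grows with real bars that contain many tokens, because every coarse-grained bar costs a single summary token regardless of its length.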

A bar here is simply a segment of the sequence. The last important step is to figure out which bars carry the important information and are likely to be repeated in the musical sequence. For this, the authors used simple summary statistics, calculating the similarity between pairs of bars along the entire sequence.
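A statistic of this kind can be sketched as follows. The sketch assumes a Jaccard-style overlap between the note sets of two bars and a hypothetical note encoding `(onset within bar, pitch, duration)`; the paper's exact similarity definition may differ.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Note = Tuple[int, int, int]  # (onset within bar, MIDI pitch, duration): hypothetical encoding


def bar_similarity(a: FrozenSet[Note], b: FrozenSet[Note]) -> float:
    """Jaccard overlap of two bars' note sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_by_distance(bars: List[FrozenSet[Note]]) -> Dict[int, float]:
    """Average similarity over all bar pairs, grouped by how many bars apart they are."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for i in range(len(bars)):
        for j in range(i + 1, len(bars)):
            totals[j - i] += bar_similarity(bars[i], bars[j])
            counts[j - i] += 1
    return {d: totals[d] / counts[d] for d in sorted(totals)}


# Toy piece: a two-bar motif (A, B) played twice.
A = frozenset({(0, 60, 1), (1, 60, 1), (2, 67, 1), (3, 67, 1)})
B = frozenset({(0, 69, 1), (1, 69, 1), (2, 67, 2)})
stats = similarity_by_distance([A, B, A, B])
print(stats)
```

In this toy piece the similarity peaks at a distance of two bars, which is exactly the length of the repeated motif. Peaks like this, measured over a whole dataset, indicate which previous bars deserve fine-grained attention.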

They applied the same approach to analyze different styles, finding that some patterns are repeated across different genres and styles:

We further conduct the similarity statistics on different datasets involving music of various genres and styles. The results shown in Appendix A interestingly indicate that this pattern is universally applicable to the music of the great diversity. We believe that it can be regarded as a general rule applicable to most music in our daily life.

The authors state that this structure allows the model to comply with music characteristics and cover the structure-related information (both short-term and long-term). In addition, the model preserves information, in contrast to models that use sparse attention to reduce complexity and lose a large amount of information as a result.

A snippet of a generated song. From the original article (https://arxiv.org/pdf/2210.10349.pdf)

The authors trained the model on the Lakh MIDI (LMD) dataset (https://colinraffel.com/projects/lmd/), which contains multi-instrument music in MIDI format. In total they used nearly thirty thousand songs, or 1,700 hours of music.

The model has 4 layers with hidden size 512, 8 attention heads, and a feed-forward layer of size 2,048. To evaluate it, in addition to perplexity and similarity error, the authors invited 10 people (including seven with musical backgrounds) to rate 100 randomly generated music pieces. The raters scored each piece on several criteria: musicality (whether the piece was pleasant and interesting), short-term structure, long-term structure, and an overall score.
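To get a feel for these hyperparameters, the sketch below stores them in a small config object and derives the per-head dimension and a rough weight count. The class and method names are hypothetical, and the estimate assumes a standard Transformer layer (four attention projections plus two feed-forward matrices), ignoring biases, layer norms, and embeddings. The real Museformer parameter count may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    num_layers: int = 4
    hidden_size: int = 512
    num_heads: int = 8
    ffn_size: int = 2048

    @property
    def head_dim(self) -> int:
        """Dimension handled by each attention head."""
        return self.hidden_size // self.num_heads

    def approx_layer_weights(self) -> int:
        """Rough weight count for one standard layer (assumption: Q, K, V and output
        projections plus two feed-forward matrices; biases and norms ignored)."""
        attention = 4 * self.hidden_size * self.hidden_size
        feed_forward = 2 * self.hidden_size * self.ffn_size
        return attention + feed_forward


config = TransformerConfig()
print(config.head_dim)
print(config.num_layers * config.approx_layer_weights())
```

Under these assumptions each head works on 64 dimensions and the four layers hold roughly 12.6 million weights, a small model by current standards.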

They also compared their model with previous models, showing that theirs was superior:

From the original article (https://arxiv.org/pdf/2210.10349.pdf)
From the original article (https://arxiv.org/pdf/2210.10349.pdf)

Although the model manages to generate music with good quality and structure, it is still far from perfect.

First, since Museformer takes random samplings during inference and does not receive manual control, it can hardly ensure that every generated music piece is well-structured in an expected way. Techniques to enhance its reliability and controllability can be further explored. Furthermore, the musicality and creativity of the generated music are still far behind that of human-made music, which remains a problem for all the existing music generation models.

You can listen to some of the generated musical examples on the project page: https://ai-muzic.github.io/museformer/

In addition, Museformer is part of a larger Microsoft project called Muzic (GitHub repository: https://github.com/microsoft/muzic).

image source: official repository

The project aims first to understand music (recognize, find, transcribe) and later to generate it. The repository already contains several projects that can be tested.

image source: official repository

Microsoft is not the only one engaging in projects dedicated to music. In fact, a few days ago Google also presented its own model, which can continue a song or speech between people. Will what happened with AI art happen with music? What do you think?

If you have found it interesting:

You can look for my other articles, subscribe (https://salvatore-raieli.medium.com/subscribe) to get notified when I publish new ones, and connect with or reach me on LinkedIn (https://www.linkedin.com/in/salvatore-raieli/). Thanks for your support!

Here is the link to my GitHub repository (https://github.com/SalvatoreRa/tutorial), where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

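Or feel free to check out some of my other articles on Medium:

  • AI reimagines the world’s 20 most beautiful words (https://readmedium.com/ai-reimagines-the-worlds-20-most-beautiful-words-cd07090ea59b)
  • How AI Could Help Preserve Art (https://towardsdatascience.com/how-ai-could-help-preserve-art-f40c8376781d)
  • How artificial intelligence could save the Amazon rainforest (https://towardsdatascience.com/how-artificial-intelligence-could-save-the-amazon-rainforest-688fa505c455)
  • Nobel prize Cyberpunk (https://readmedium.com/nobel-prize-cyberpunk-e1803aa0e087)
  • Mlearning.ai Submission Suggestions (https://readmedium.com/mlearning-ai-submission-suggestions-b51e2b130bfb)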
