Turn Your Words into Music with Cutting-Edge AI — No Musical Skills Required

Imagine Midjourney but for music

Imagine having an AI music assistant. Illustration generated with Midjourney.

Your words have unlocked a new skill: music creation.

Google has created an AI system for song creation called MusicLM. It’s not the first of its kind (Riffusion and Jukebox came earlier), but none of the others has been able to do what MusicLM does.

With all the buzz around AI chatbots (ChatGPT) and AI-generated visual art (Midjourney, Dall-E), sound is the third wave of this AI revolution, and you’re now part of it.

It hasn’t been easy. The team behind MusicLM had to overcome issues such as audio quality, melodic coherence, and the scarcity of paired audio-text data, but they managed to build a system that generates realistic music from simple text.

And it’s impressive. There are no musicians or composers behind the curtain. Just 280,000 hours of music experience and powerful algorithms to put your words into music.

Here’s how you can sing your story.

What can MusicLM do

The tool’s power is impressive. It’s so versatile that it needs a separate section for each of the amazing things it can do. Here are six different ways to use it.

Turning complex text into audio

What we’ve heard so far is that it can generate music from text, and not just one kind of text. The prompt can range from a couple of words (e.g. melodic techno) to a complex description. Listen for yourself:

“melodic techno”. Audio here.
“swing”. Audio here.
“A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe while being danceable.” Audio here.
“Slow tempo, bass-and-drums-led reggae song. Sustained electric guitar. High-pitched bongos with ringing tones. Vocals are relaxed with a laid-back feel, very expressive.” Audio here.

Some of these tunes are 5 minutes long!

Getting a complex song from a simple audio input

If you’re not good with words, you can still use MusicLM by humming, whistling, or playing the melody on an instrument, and the AI will generate a music clip based on your input.

Let’s say you want to listen to Beethoven’s “Ode to Joy” in a different style. You make an amateur humming version (like this) and then ask MusicLM to transform it into:

“A cappella chorus”. Audio here.
“Jazz with the saxophone”. Audio here.
“Tribal drums and flute”. Audio here.

If you’re not into vocals, you can also record yourself playing the melody on a piano, guitar, marimba, or on a string instrument and it will still get the job done. Amazing.

Transforming a sequence of texts into an audio tale

You can write a sequence of text prompts, like different music genres, and the AI will generate a coherent song that transitions between these different parts.

This is called the story mode.

Here’s an example where the prompts are statements of music genres.

jazz song (0:00–0:15)
pop song (0:15–0:30)
rock song (0:30–0:45)
death metal song (0:45–1:00)
rap song (1:00–1:15)
string quartet with violins (1:15–1:30)
epic movie soundtrack with drums (1:30–1:45)
scottish folk song with traditional instruments (1:45–2:00)
Audio here.

The transitions are impressive for an AI-generated song. In such a short amount of time, the AI can link these different parts together to create a coherent song.
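To make the structure of a story-mode prompt concrete, here is a minimal Python sketch that lays a list of prompts out on a timeline of equal 15-second windows, like the example above. It only builds the schedule. It does not call MusicLM, and the function names (build_story_schedule, format_timestamp) are made up for illustration.

```python
def format_timestamp(total_seconds):
    """Format a number of seconds as M:SS, e.g. 75 -> '1:15'."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


def build_story_schedule(prompts, segment_seconds=15):
    """Give each prompt a consecutive time window of equal length.

    Hypothetical helper: the 15-second default mirrors the example
    above and is an assumption, not a documented MusicLM setting.
    """
    schedule = []
    for index, prompt in enumerate(prompts):
        start = index * segment_seconds
        end = start + segment_seconds
        schedule.append((format_timestamp(start), format_timestamp(end), prompt))
    return schedule


genres = ["jazz song", "pop song", "rock song", "death metal song", "rap song"]
for start, end, prompt in build_story_schedule(genres):
    print(f"{prompt} ({start}-{end})")
```

Each entry pairs a prompt with its start and end time, so the fifth prompt lands on 1:00 to 1:15, just like the rap song in the example.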

Providing text and an image to get a song

Next, we have the AI film score composer. MusicLM can also create a brief song from an image and description, such as a painting and its accompanying text.

It’s like one of the first exercises you get in film scoring, but this time it’s created in a matter of seconds.

So if we use this image:

With this description:

“Made early in his career, Matisse’s Dance, 1910, shows a group of red dancers caught in a collective moment of innocent freedom and joy, holding hands as they whirl around in space. Simple and direct, the painting speaks volumes about our deep-rooted, primal human desire for connection, movement, rhythm, and music.” (TheCollector)

We get this: audio here.

Asking for different skill levels for a song

Remember how ChatGPT allows you to rewrite an article in a way a 5-year-old will understand? Or write in the style of a 7th grader?

Well, the same goes for music.

You can generate music for different skill levels: beginner, intermediate, or professional level.

Here’s how it sounds for piano:

Beginner level: audio here.
Intermediate level: audio here.
Professional level: audio here.

It might not be completely accurate and the audio could also be cleaner, but it’s a good start.

Getting different versions of a song

When you like the song but you’re not completely convinced by the interpretation, you can ask for a different version of it.

It’s similar to asking Midjourney for a variation on the main image (V1, V2, etc.).

For instance, let’s say you like a progressive rock guitar solo that the AI created for you but the execution is not quite right. You ask for a different version of it.

Guitar solo 1: audio here.
Guitar solo 1 with more drums: audio here.
Guitar solo 1 with some variation on the breaks: audio here.

Or you can maintain the prompt but expect a different song with each new iteration.

Let’s say you want a string quartet with percussion, so you try the prompt a couple of times and see which result you like the most.

A very percussive happy version: audio here.
A more laid-back oriental version: audio here.
A classical version with a regular beat: audio here.

Why doesn’t it sound as polished as Midjourney looks?

In the image domain, there are massive datasets, and they have made very high-quality illustrations possible.

For instance, the original Dall-E was trained on roughly 250 million image-text pairs, while MusicCaps, the captioned dataset Google released alongside MusicLM, holds only about 5.5 thousand music clips with text descriptions. That’s a huge gap, so naturally the output won’t be as polished.

We’re used to describing images in words, so going the other way, from words to an image, doesn’t seem as complicated as translating text into audio.

It’s not as straightforward as capturing the essence of an image in a couple of words. You might be a little less ambiguous with an acoustic scene (e.g. sound of a horse carriage passing by an urban street), but what about describing music? There’s rhythm, melody, harmony, timbre, and so on. It becomes a nightmare!

And then there’s time.

Music unfolds over time, unlike a painting. Even if you somehow managed to define all the other features of your AI-generated song, how would you describe its melodic sequence in just a few words?

It’s an extra variable that doesn’t apply to paintings.
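To put a rough number on that extra variable, here is a back-of-the-envelope Python sketch that counts how many audio samples a clip contains. It assumes a 24 kHz sample rate, the output rate reported for MusicLM, and uses the five-minute length mentioned earlier. The function name is made up for illustration.

```python
# Assumption: 24 kHz output, the sample rate reported for MusicLM.
SAMPLE_RATE_HZ = 24_000


def audio_sample_count(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples in a mono clip of the given duration."""
    return duration_seconds * sample_rate_hz


five_minutes = 5 * 60  # seconds
print(audio_sample_count(1))             # samples in one second
print(audio_sample_count(five_minutes))  # samples in a five-minute tune
```

Under that assumption a five-minute tune is 7.2 million samples, all of which have to stay coherent from the first second to the last, and a short text prompt has to steer every one of them.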

Testing available text-to-audio tools on the market

I tried two text-to-audio models already on the market to compare their output. I used a simple description: “baroque song using modern instruments” and this is what I got:

Riffusion: a weird-sounding loop that has some resemblance to baroque music with some interesting instrument selection.
Mubert: a nice slow melody on the violin with some sound effects with no resemblance to baroque music whatsoever.

The researchers compared three tools (MusicLM, Riffusion, and Mubert) and found that MusicLM came out on top for audio quality and faithfulness to the original prompt, based on 1,200 participant ratings.

The takeaway

We’re lucky enough to be part of one revolutionary tech after another in such a short time.

The journey from human-like conversations with AI to the creation of amazing art with just a couple of words has led us into a new relationship with digital technology.

The leap has been astounding.

And if that wasn’t enough, we’ve been amazed once again in the field of sound.

MusicLM is versatile, intuitive, fast, and resourceful.

And if these samples don’t feel like top-notch pieces of art, think of them as the early stage of the videogame industry. We were amazed by just watching a 2D game with simple pixel art, limited color palettes, basic animations, and low resolution. And look at what we’ve got now.

This retro look is what we’re now experiencing with AI-generated music.

And it’s going to change everything.

This article was written with the assistance of OpenAI’s language model, ChatGPT.