Summary

The article discusses the challenges and techniques involved in making AI voices, specifically Azure Speech, sing effectively, highlighting the complexity of achieving natural-sounding singing through SSML (Speech Synthesis Markup Language) and considering alternative methods for AI-generated singing.

Abstract

The author of the article has experimented with Azure Speech to create a singing voice, acknowledging the difficulty of this task due to the service's voices being trained primarily for speaking rather than singing. The article delves into the use of SSML to manipulate pitch, rate, and volume to simulate singing, but notes that this approach can be tedious and often results in subpar "bad karaoke" quality. The author also explores the structure of SSML, including the use of nodes like <speak>, <voice>, and <prosody> to control various aspects of the speech output. Despite these efforts, the complexity of modulating each word to capture the nuances of singing proves to be a significant hurdle. As an alternative, the author suggests using song-to-song voice cloning technologies, which can mimic the intonation of a recorded song with a chosen AI voice. This method is seen as a more promising approach to AI-generated singing, with services like Revocalize.ai and VoiceMod.net offering examples of this technology.

Opinions

The author believes that Azure Speech voices are not inherently suited for singing, as they are primarily trained for spoken language.
There is an opinion that manually adjusting SSML parameters for singing is labor-intensive and often yields unsatisfactory results.
The author suggests that the future of AI-generated singing may lie in voice cloning technologies that can replicate the intonation and style of existing songs.
The article implies that while SSML can provide some control over speech modulation, it is currently insufficient for producing high-quality singing voices with Azure Speech.
The author seems optimistic about the potential of AI voice cloning services, despite their current limitations and personal use focus.

Making AI Sing is Hard

I made a bad karaoke singer with Azure Speech.

Can you sing?

Azure Speech voices are trained for speaking, not singing, so it's hard to make them sound like they're singing. One way to approach this is with Speech Synthesis Markup Language (SSML). I tried applying pitch and rate, word by word, to a string of text. I got a result, but it sounds like bad karaoke.

What is SSML?

SSML is an XML-based markup language. It controls how an Azure Speech service voice delivers its speech. Like any XML document, it starts with a root node and a namespace.

<speak 
    version="1.0" 
    xmlns="http://www.w3.org/2001/10/synthesis" 
    xml:lang="en-US">
</speak>

Speak is the root node for SSML. This node declares the spoken language to use for the document. Child nodes contain the text to speak, along with modifiers for voice, pitch, speed and volume.

Next comes the Voice node. This defines the Voice that speaks the enclosed text. You can switch between voices as often as needed. The list of all Voices is available at Microsoft’s site.

<voice name="en-US-GuyNeural">
    The quick brown fox jumps over the lazy dog.
</voice>
<voice name="en-US-JennyNeural">
    Why does the fox jump?
</voice>

The multi-lingual voices don’t support all the SSML elements. I chose Guy and Jenny because they have the widest support.

Language support - Speech service - Azure AI services (learn.microsoft.com)
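Switching voices by hand gets repetitive, so here is a minimal Python sketch that assembles the same kind of document from (voice, text) pairs. The helper name build_ssml is made up for illustration and is not part of any Azure SDK. It only builds the string and checks that the XML parses; it does not call the Speech service.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml(segments, lang="en-US"):
    """Return an SSML document with one <voice> element per (voice_name, text) pair."""
    voices = "\n".join(
        # escape() protects characters such as & and < in the spoken text
        f"    <voice name={quoteattr(name)}>{escape(text)}</voice>"
        for name, text in segments
    )
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>\n'
        f"{voices}\n"
        "</speak>"
    )

ssml = build_ssml([
    ("en-US-GuyNeural", "The quick brown fox jumps over the lazy dog."),
    ("en-US-JennyNeural", "Why does the fox jump?"),
])
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the result before sending it anywhere is a cheap way to catch typos such as curly quotes around an attribute value.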

Rhythm and structure

Azure Speech will automatically add pauses to the output. Punctuation helps to trigger these pauses. The Break node adds an explicit pause.

<voice name="en-US-GuyNeural">
    The quick brown fox <break strength="strong"/>jumps over the lazy dog.
</voice>

You can give the algorithm hints by marking up the text with paragraphs and sentences. With those in place, you can also adjust the pause time: the silence between sentences, and at the start and end of the text, can all be changed.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
     xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-GuyNeural">
        <mstts:silence type="Tailing" value="2000ms"/>
        <p><s>The quick brown fox jumps over the lazy dog.</s></p>
    </voice>
    <voice name="en-US-JennyNeural">
        <p><s>Why does the fox jump?</s></p>
    </voice>
</speak>

Note that a second namespace is required for mstts, Microsoft's extension to SSML. The extension brings several features, including adjusting inter-sentence pauses.
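As a rough sketch of how this structure could be generated rather than typed, the Python below wraps sentences in p and s nodes and optionally adds an mstts:silence node. The function name speak_with_pauses and its parameters are hypothetical. The mstts namespace URI is copied from the example above, and the silence type is passed in by the caller; as far as I can tell, Microsoft's documentation spells the end-of-text type "Tailing".

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"  # copied from the example above

def speak_with_pauses(voice, paragraphs, silence_type=None, silence_ms=0):
    """Build SSML where each paragraph is a <p> made of <s> sentences.

    `paragraphs` is a list of lists of sentence strings. When `silence_type`
    is given, an <mstts:silence> element is placed first inside <voice>.
    """
    lines = [
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" xml:lang="en-US">',
        f"    <voice name={quoteattr(voice)}>",
    ]
    if silence_type is not None:
        lines.append(
            f'        <mstts:silence type={quoteattr(silence_type)} value="{int(silence_ms)}ms"/>'
        )
    for sentences in paragraphs:
        body = "".join(f"<s>{escape(s)}</s>" for s in sentences)
        lines.append(f"        <p>{body}</p>")
    lines += ["    </voice>", "</speak>"]
    return "\n".join(lines)

ssml = speak_with_pauses(
    "en-US-GuyNeural",
    [["The quick brown fox jumps over the lazy dog.", "Why does the fox jump?"]],
    silence_type="Tailing",  # assumption: end-of-text silence, per Microsoft's spelling
    silence_ms=2000,
)
ET.fromstring(ssml)  # well-formedness check only; Azure may still reject the values
print(ssml)
```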

Prosody

I used the <prosody> node the most. Prosody controls the modulation of the Voice when speaking. Eventually I was modifying each word.

I changed the pitch.

I changed the volume.

I changed the speed.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
    xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="fast">The <prosody volume="x-loud" rate="slow" pitch="+30">phones</prosody> 
        are <prosody rate="slow" pitch="+30">alive</prosody> 
        with the <prosody rate="slow" pitch="-50">sound</prosody> 
        of <prosody rate="slow" pitch="+30">A</prosody>
        <prosody rate="default" pitch="+30">I</prosody>.</prosody>
    </voice>
</speak>

It took a lot of trial and error to get this far.

Why singing is hard

Singing with Azure Speech is difficult because it's a lot of work to modulate each word. A particular phrase or even an individual word needs its own intonation.

So the SSML markup grows fast.

I had to change each individual word so that I could control the pitch and rate exactly. This took a lot of repeated listening to figure out the right sound for each word.
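One way to keep that markup manageable is to hold the per-word settings in a list and generate the prosody nodes from it. The Python below is only a sketch under that assumption: sing_line is a made-up helper, and the pitch, rate, and volume values are copied from the example in the Prosody section, not tuned any further.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

def sing_line(words):
    """Wrap each word in its own <prosody> element.

    `words` is a list of (text, settings) pairs, where settings is a dict of
    prosody attributes such as {"pitch": "+30", "rate": "slow"}. An empty
    dict leaves the text unmodified.
    """
    parts = []
    for text, settings in words:
        if settings:
            attrs = " ".join(f"{key}={quoteattr(value)}" for key, value in settings.items())
            parts.append(f"<prosody {attrs}>{escape(text)}</prosody>")
        else:
            parts.append(escape(text))
    return " ".join(parts)

line = sing_line([
    ("The", {}),
    ("phones", {"volume": "x-loud", "rate": "slow", "pitch": "+30"}),
    ("are", {}),
    ("alive", {"rate": "slow", "pitch": "+30"}),
    ("with the", {}),
    ("sound", {"rate": "slow", "pitch": "-50"}),
    ("of", {}),
    ("A", {"rate": "slow", "pitch": "+30"}),
    ("I", {"rate": "default", "pitch": "+30"}),
])

# The outer rate="fast" prosody mirrors the hand-written example.
ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    f'<voice name="en-US-JennyNeural"><prosody rate="fast">{line}.</prosody></voice>'
    "</speak>"
)
ET.fromstring(ssml)  # raises ParseError if the generated markup is malformed
print(ssml)
```

This does not make the tuning any faster, since every value still has to be found by ear, but changing one word becomes a one-line edit to the list.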

Alternatives

All this word-by-word work calls for a shortcut. The easiest approach now is song-to-song singing voices.

Take a recording of a person singing a song, with all its intonation.
Pick an AI voice, even your own cloned voice.
Output the song using the new voice and the original intonation.

Revocalize.ai is a good example of this. It’s just for personal use now, but it’s a sign of the copycat game to come. The machine will learn to mimic someone else’s skill so you can pretend to have it.

There aren't a lot of current examples of AI-generated singing voices. Some of them use SSML. Others are generated more procedurally from simple text, according to a music genre that you pick. VoiceMod.net is a good example of this type.

Maybe Azure Speech custom Voices could be fed only singing voices for training. Then it would be a trained voice that could sing text. But at this point it doesn’t look like singing with Azure Speech will work.