Making AI Sing is Hard
I made a bad karaoke singer with Azure Speech.
Summary
The article discusses the challenges and techniques involved in making AI voices, specifically Azure Speech, sing effectively, highlighting the complexity of achieving natural-sounding singing through SSML (Speech Synthesis Markup Language) and considering alternative methods for AI-generated singing.
Abstract
The author of the article has experimented with Azure Speech to create a singing voice, acknowledging the difficulty of this task due to the service's voices being trained primarily for speaking rather than singing. The article delves into the use of SSML to manipulate pitch, rate, and volume to simulate singing, but notes that this approach can be tedious and often results in subpar "bad karaoke" quality. The author also explores the structure of SSML, including the use of nodes like <speak>, <voice>, and <prosody> to control various aspects of the speech output. Despite these efforts, the complexity of modulating each word to capture the nuances of singing proves to be a significant hurdle. As an alternative, the author suggests using song-to-song voice cloning technologies, which can mimic the intonation of a recorded song with a chosen AI voice. This method is seen as a more promising approach to AI-generated singing, with services like Revocalize.ai and VoiceMod.net offering examples of this technology.
Opinions
Azure Speech voices are trained for speaking, not singing, so it’s pretty hard to make them sound like they’re singing. One way to approach this is with Speech Synthesis Markup Language (SSML). I tried applying pitch and rate changes word by word to a string of text. I got a result, but it sounds like bad karaoke.
SSML is an XML-based markup language. It controls how an Azure Speech Service voice delivers its speech. Like any XML document, it starts with a root node and a namespace.
<speak
version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
</speak>
Speak is the root node for SSML. This node declares the spoken language to use for the document. Child nodes contain the text to speak, along with modifiers for voice, pitch, speed and volume.
Next comes the Voice node. This defines the Voice that speaks the enclosed text. You can switch between voices as often as needed. The list of all Voices is available at Microsoft’s site.
<voice name="en-US-GuyNeural">
The quick brown fox jumps over the lazy dog.
</voice>
<voice name="en-US-JennyNeural">
Why does the fox jump?
</voice>
The multilingual voices don’t support all the SSML elements. I chose Guy and Jenny because they have the widest support.
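Typos such as curly quotes pasted around an attribute value are easy to miss and make the document invalid XML. Here is a minimal sketch of a local check, assuming Python and its standard-library XML parser. The helper name `is_well_formed` and the sample strings are made up for illustration, and the check only covers XML syntax, not whether Azure Speech accepts the elements.

```python
import xml.etree.ElementTree as ET


def is_well_formed(ssml: str) -> bool:
    """Return True if the SSML string parses as XML (syntax check only)."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample document for illustration.
GOOD = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<voice name="en-US-GuyNeural">The quick brown fox jumps over the lazy dog.</voice>'
    '<voice name="en-US-JennyNeural">Why does the fox jump?</voice>'
    "</speak>"
)
# Same document, but with curly quotes around one attribute value.
BAD = GOOD.replace('"en-US-JennyNeural"', "\u201den-US-JennyNeural\u201d")

if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(BAD))
```

Running a check like this before sending SSML to the service catches quoting and nesting mistakes early.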
Azure Speech will automatically add pauses to the output. Punctuation helps to trigger these pauses. The Break node adds an explicit pause.
<voice name="en-US-GuyNeural">
The quick brown fox <break strength="strong"/>jumps over the lazy dog.
</voice>
You can give the algorithm hints by marking up the text with paragraphs and sentences. With paragraphs and sentences in place, you can also adjust the pause time. The pauses before, after, and between sentences, and at the start and end of the text, can all be changed.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-GuyNeural">
<mstts:silence type="Tailing" value="2000ms"/>
<p><s>The quick brown fox jumps over the lazy dog.</s></p>
</voice>
<voice name="en-US-JennyNeural">
<p><s>Why does the fox jump?</s></p>
</voice>
</speak>
Note that another namespace is required for mstts, Microsoft’s extension to SSML. The extension brings several features, including adjusting inter-sentence pauses.
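The paragraph, sentence and silence markup can also be generated instead of typed by hand. The following is a rough sketch in Python, not part of the original experiment. The function name `paragraph_ssml` is hypothetical, and the `Sentenceboundary` silence type is my assumption of the value Azure uses for pauses between sentences.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "http://www.w3.org/2001/mstts"  # same URI as the example above


def paragraph_ssml(sentences, voice, pause_ms=None):
    """Wrap sentences in <p>/<s> inside one <voice>, with an optional mstts:silence."""
    parts = [
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="en-US">',
        f"<voice name={quoteattr(voice)}>",
    ]
    if pause_ms is not None:
        # Assumption: "Sentenceboundary" sets the pause between sentences.
        parts.append(
            f'<mstts:silence type="Sentenceboundary" value="{int(pause_ms)}ms"/>'
        )
    # escape() protects characters such as & and < in the spoken text.
    body = "".join(f"<s>{escape(s)}</s>" for s in sentences)
    parts.append(f"<p>{body}</p>")
    parts.append("</voice></speak>")
    return "\n".join(parts)


if __name__ == "__main__":
    ssml = paragraph_ssml(
        ["The quick brown fox jumps over the lazy dog.", "Why does the fox jump?"],
        "en-US-GuyNeural",
        pause_ms=500,
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    print(ssml)
```

Building the string in one place keeps both namespaces declared on the root node, which is the part that is easiest to forget.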
I used the <prosody> node the most. Prosody controls the modulation of the Voice when speaking. Eventually I was modifying each word.
I changed the pitch.
I changed the volume.
I changed the speed.
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
<voice name="en-US-JennyNeural">
<prosody rate="fast">The <prosody volume="x-loud" rate="slow" pitch="+30">phones</prosody>
are <prosody rate="slow" pitch="+30">alive</prosody>
with the <prosody rate="slow" pitch="-50">sound</prosody>
of <prosody rate="slow" pitch="+30">A</prosody>
<prosody rate="default" pitch="+30">I</prosody>.</prosody>
</voice>
</speak>
It took a lot of trial and error to get this far.
Singing with Azure Speech is difficult because it’s a lot of work to modulate each word. A particular phrase or even an individual word needs its own intonation.
So the SSML markup grows fast.
I had to change each individual word so that I could control the pitch and rate exactly. This took a lot of repeated listening to figure out the appropriate sound for each word.
All this word-by-word work calls for a shortcut. The easiest approach now is song-to-song singing voices.
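One way to keep the word-by-word markup manageable is to hold the settings in a list and generate the SSML from it. This is a minimal sketch in Python, not the process I used. The name `sing_ssml` and the note values are placeholders, and it uses only the named prosody values (such as `high`, `low`, `slow`) rather than tuned numbers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def sing_ssml(notes, voice="en-US-JennyNeural"):
    """Build SSML with one <prosody> element per marked-up word.

    notes: list of (word, attrs) pairs, where attrs maps prosody attribute
    names (pitch, rate, volume) to values; an empty dict leaves the word plain.
    """
    words = []
    for word, attrs in notes:
        text = escape(word)
        if attrs:
            attr_text = " ".join(f"{k}={quoteattr(v)}" for k, v in attrs.items())
            text = f"<prosody {attr_text}>{text}</prosody>"
        words.append(text)
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{' '.join(words)}</voice>"
        "</speak>"
    )


# Placeholder settings; finding values that sound right is still trial and error.
NOTES = [
    ("The", {}),
    ("phones", {"volume": "x-loud", "rate": "slow", "pitch": "high"}),
    ("are", {}),
    ("alive", {"rate": "slow", "pitch": "high"}),
    ("with", {}),
    ("the", {}),
    ("sound", {"rate": "slow", "pitch": "low"}),
]

if __name__ == "__main__":
    ssml = sing_ssml(NOTES)
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    print(ssml)
```

This does not remove the listening and tweaking, but changing one word becomes a one-line edit to the list instead of a hunt through nested tags.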
Revocalize.ai is a good example of this. It’s just for personal use now, but it’s a sign of the copycat game to come. The machine will learn to mimic someone else’s skill so you can pretend to have it.
There aren’t a lot of current examples of AI-generated singing voices. Some of them use SSML. Others are procedurally generated from simple text according to a music genre that you pick. VoiceMod.net is a good example of this type.
Maybe Azure Speech custom Voices could be fed only singing voices for training. Then it would be a trained voice that could sing text. But at this point it doesn’t look like singing with Azure Speech will work.
