Make Your Text To Speech Emotional

Summary

Azure Speech adds emotional expressions to Text-to-Speech (TTS) with Speech Synthesis Markup Language (SSML) and Microsoft's extension, MSTTS.

Abstract

Azure Voice Text to Speech allows voice customization using the Speech Synthesis Markup Language (SSML). Microsoft has added an extension called MSTTS to SSML, which includes several voice commands not present in standard SSML. The express-as command in MSTTS allows you to configure the expression of the given text, with properties like style, styledegree, and role. Styles include more than 14 variations of intonation and expression, with more styles available with Chinese voices. The express-as command is only available in SSML processed with the Azure Voice service.

Opinions

The author believes that adding emotional expressions to TTS using SSML and MSTTS is a significant development.
The author suggests that the express-as command could be an easy add-on for a generative conversational AI function.
The author expresses interest in creating a custom GPT for classifying the emotional intent of a text and tagging dialogue with appropriate styles.
The author implies that the express-as command could be used to make a bad karaoke singer with Azure Speech, as mentioned in another article.
The author does not provide a clear opinion on the effectiveness or quality of the emotional expressions added to TTS using SSML and MSTTS.
The author does not discuss any potential drawbacks or limitations of using SSML and MSTTS for adding emotional expressions to TTS.

Make Your Text To Speech Emotional

Azure Speech adds emotional expressions to TTS with SSML MSTTS

Azure Voice Text to Speech allows voice customization using the Speech Synthesis Markup Language (SSML). As part of this markup language, Microsoft has added an extension called MSTTS.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">

You can learn more about SSML in another article.
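
The opening tag shown above has to be closed with a matching </speak> before any service can parse it, and the mstts prefix only works when its namespace is declared on that root element. Here is a minimal Python sketch of both points, using only the standard library. The helper names wrap_speak and is_well_formed are made up for this illustration, and the namespace URIs are copied from the tag above. The check only confirms the XML is well-formed, not that Azure will accept it.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # copied from the opening tag in the article


def wrap_speak(body: str, lang: str = "en-US") -> str:
    """Wrap an SSML fragment (for example a <voice> element) in a closed <speak> root."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="{lang}">'
        f"{body}</speak>"
    )


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML (a local check, not an Azure validation)."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


document = wrap_speak('<voice name="en-US-JennyNeural">Hello there.</voice>')
print(document)
print(is_well_formed(document))
```

A fragment that uses an mstts: element without the surrounding <speak> declaration fails the same check, because the prefix is unbound.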

MSTTS adds several voice commands not present in standard SSML. Extensions in MSTTS include adding background audio, silence duration configs, and lip movement output.

The express-as command allows you to configure the expression of the given text. It takes three attributes: style, styledegree, and role. Use style to select an expression, and styledegree to set how far to push that expression. styledegree values range from 0.01 to 2, with 2 being the most expressive.

<voice name="en-US-JennyNeural">
    <mstts:express-as style="excited" styledegree="2">
        The quick brown fox jumps over the lazy dog.
    </mstts:express-as>
</voice>
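
If you generate this markup from code, it helps to clamp styledegree to the 0.01 to 2 range and escape the text so stray < or & characters do not break the XML. The following is a rough Python sketch of that idea. The function names express_as and voice are hypothetical helpers, not part of any Azure SDK, and the default styledegree of 1 is an assumption.

```python
from xml.sax.saxutils import escape, quoteattr

# Range stated in the article: 0.01 (barely expressed) to 2 (most expressive).
MIN_DEGREE, MAX_DEGREE = 0.01, 2.0


def express_as(text, style, styledegree=1.0, role=None):
    """Build an mstts:express-as element; the default degree of 1 is an assumption."""
    degree = min(max(float(styledegree), MIN_DEGREE), MAX_DEGREE)
    degree_text = f"{degree:g}"  # "2" rather than "2.0"
    attrs = f"style={quoteattr(style)} styledegree={quoteattr(degree_text)}"
    if role:
        attrs += f" role={quoteattr(role)}"
    return f"<mstts:express-as {attrs}>{escape(text)}</mstts:express-as>"


def voice(name, inner):
    """Wrap already-built SSML in a voice element."""
    return f"<voice name={quoteattr(name)}>{inner}</voice>"


snippet = voice(
    "en-US-JennyNeural",
    express_as("The quick brown fox jumps over the lazy dog.", "excited", 2),
)
print(snippet)
```

The result still needs to sit inside a <speak> root that declares the mstts namespace before it is sent to the service.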

Currently, roles are supported only for Chinese-language voices.

<voice name="zh-CN-XiaomoNeural">
    <mstts:express-as style="sad" styledegree="0.5" role="YoungAdultFemale">
        敏捷的棕色狐狸跳过了懒狗
    </mstts:express-as>
</voice>

Styles include more than 14 variations of intonation and expression, and more styles are available with Chinese voices. The styles I prefer are the ones listed below, which come from the JennyNeural voice.

angry

assistant

chat

cheerful

customerservice

excited

friendly

hopeful

newscast

sad

shouting

terrified

unfriendly

whispering

The express-as command is available only in SSML processed by the Azure Voice service. I can see this additional markup being an easy add-on for a generative conversational AI function: the AI classifies the emotional intent of a text and tags each line of dialogue with an appropriate style. Maybe I’ll make a custom GPT for just that purpose.
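
To make the tagging idea concrete, here is a toy Python sketch of the pipeline shape: classify each line, then wrap it in express-as with the chosen style. The keyword lookup is only a stand-in for the generative AI classifier. The cue words, the names classify_style and tag_dialogue, and the fallback to the chat style are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical keyword cues; a real pipeline would ask a language model instead.
STYLE_CUES = {
    "excited": ("amazing", "can't wait", "wow"),
    "sad": ("sorry", "miss", "lost"),
    "angry": ("furious", "unacceptable", "hate"),
    "whispering": ("secret", "quietly"),
}
DEFAULT_STYLE = "chat"  # assumed fallback when no cue matches


def classify_style(line: str) -> str:
    """Return the first style whose cue appears in the line (crude substring match)."""
    lowered = line.lower()
    for style, cues in STYLE_CUES.items():
        if any(cue in lowered for cue in cues):
            return style
    return DEFAULT_STYLE


def tag_dialogue(lines, voice_name="en-US-JennyNeural"):
    """Wrap each line in express-as with its classified style, inside one voice element."""
    parts = [
        f'<mstts:express-as style="{classify_style(line)}">{escape(line)}</mstts:express-as>'
        for line in lines
    ]
    return f'<voice name="{voice_name}">' + "".join(parts) + "</voice>"


print(tag_dialogue(["Wow, this is amazing!", "The meeting is at noon."]))
```

Swapping classify_style for a call to a language model that returns one of the style names would keep the rest of the sketch unchanged, as long as the returned name is checked against the styles the chosen voice supports.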

SSML MSTTS Text To Speech in Azure

Making AI Sing is Hard

I made a bad karaoke singer with Azure Speech.