Tortoise-TTS Fully Explained | Part 5

Summary

The article provides a detailed explanation of the Vocoder (UnivNet) component within the Tortoise-TTS model, which converts Mel spectrograms into waveform audio, and includes practical code examples, video tutorials, and sound clips for a comprehensive understanding.

Abstract

In the fifth part of a series exploring the Tortoise-TTS model architecture, the author delves into the role of the Vocoder, specifically the UnivNet vocoder, in transforming Mel spectrograms into audible waveform audio. The article not only offers theoretical insights but also practical code snippets demonstrating the process of denormalizing Mel spectrograms and using the vocoder for audio generation. Additionally, the author encourages readers to listen to the generated voice samples and provides embedded YouTube videos and SoundCloud audio clips for reference. The article concludes with an invitation for collaboration and discussion, as well as a call to action for readers to subscribe and follow the author for future content on AI models.

Opinions

The author is enthusiastic about the capabilities of the Tortoise-TTS model and its Vocoder component, as indicated by the celebratory "Woohoo! 🎉" when discussing the generation of speech samples.
The author values both theoretical explanations and practical applications, as evidenced by the inclusion of code examples and multimedia content.
There is an emphasis on the importance of community and collaboration, with the author inviting readers to engage in discussions and provide feedback on the content.
The author is optimistic about the future of AI, expressing intent to publish more articles on the subject and recommending an AI service as a cost-effective alternative to ChatGPT Plus (GPT-4).

Tortoise-TTS Fully Explained | Part 5 | Vocoder (UnivNet)

In this series, I take you on a deep dive into the architecture of the Tortoise-TTS model and explain in detail how it works, both in theory and with accompanying code.

If you prefer video, feel free to check out my YouTube video for this article.

Step 12: Transform Mel Spectrogram to Waveform Audio

Now that we have generated k different Mel spectrograms, we can convert them back to waveform audio. For this, the Tortoise-TTS model uses the UnivNet vocoder. Simply put, the vocoder is a neural network that takes a generated Mel spectrogram as input and synthesizes the corresponding waveform audio.
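To get a feel for the sizes involved, here is a minimal sketch of how the number of Mel frames relates to the length of the generated waveform. It assumes the vocoder produces 256 audio samples per Mel frame (the hop length, an assumption not stated in this article) and uses the 24 kHz sample rate from Step 13. The function names are hypothetical helpers, not part of Tortoise-TTS.

```python
SAMPLE_RATE = 24_000  # Hz, the rate used when saving the audio in Step 13
HOP_LENGTH = 256      # assumed number of audio samples per Mel frame


def waveform_length(num_mel_frames, hop_length=HOP_LENGTH):
    """Number of audio samples produced for a given number of Mel frames."""
    return num_mel_frames * hop_length


def duration_seconds(num_mel_frames, hop_length=HOP_LENGTH,
                     sample_rate=SAMPLE_RATE):
    """Duration of the generated audio in seconds."""
    return waveform_length(num_mel_frames, hop_length) / sample_rate


frames = 375
print(waveform_length(frames))   # 96000
print(duration_seconds(frames))  # 4.0
```

Under these assumptions, a Mel spectrogram with 375 frames corresponds to four seconds of audio.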

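from tortoise.utils.audio import denormalize_tacotron_mel

# temporary_cuda, tts, generated_mel_spectograms and output_seq_len
# are assumed to be defined in the earlier parts of this series
with temporary_cuda(tts.vocoder) as vocoder:
    generated_speech = []
    for mel_spectogram in generated_mel_spectograms:
        # the generated normalized Mel spectrogram must be
        # "denormalized" before it is passed to the vocoder;
        # the slice trims the time axis to output_seq_len frames
        mel_spectogram = denormalize_tacotron_mel(mel_spectogram.to(tts.device))[:, :, :output_seq_len]
        # transform generated Mel spectrogram to waveform audio
        wav = vocoder.inference(mel_spectogram)
        # move the waveform back to the CPU for saving and playback
        generated_speech.append(wav.cpu())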
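The "denormalize" step in the loop above can be pictured as undoing a min-max scaling: values in the range [-1, 1] are mapped back to the log-Mel range the vocoder expects. The following is only a sketch of that idea. The names `denormalize_mel` and `normalize_mel` are hypothetical, and the range constants are placeholder values chosen for illustration; the library defines its own.

```python
import math

# Placeholder log-Mel range (assumptions for illustration only)
MEL_MIN = math.log(1e-5)  # log of a small amplitude floor, about -11.51
MEL_MAX = 2.3


def normalize_mel(mel, mel_min=MEL_MIN, mel_max=MEL_MAX):
    """Scale values from [mel_min, mel_max] to [-1, 1]."""
    return 2 * (mel - mel_min) / (mel_max - mel_min) - 1


def denormalize_mel(norm_mel, mel_min=MEL_MIN, mel_max=MEL_MAX):
    """Inverse of normalize_mel: map [-1, 1] back to [mel_min, mel_max]."""
    return (norm_mel + 1) / 2 * (mel_max - mel_min) + mel_min


# -1 maps to the bottom of the range, +1 to the top
print(denormalize_mel(-1.0), denormalize_mel(1.0))
```

Because the functions only use basic arithmetic, the same expressions also work element-wise on a PyTorch tensor.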

Step 13: Play Generated Speech

Woohoo! 🎉 Now it’s time to listen to the k different generated voice samples. For each one, we should check that it speaks our input text in the voice from our voice samples.

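import IPython.display
import torchaudio
for i, speech in enumerate(generated_speech):
    # one file per sample; display() is needed to show a player inside a loop
    torchaudio.save(f'generated_{i}.wav', speech.squeeze(0), 24000)
    IPython.display.display(IPython.display.Audio(f'generated_{i}.wav'))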
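If you are curious what saving a waveform at 24 kHz involves, here is a minimal sketch that writes mono float samples to a 16-bit PCM WAV file using only Python's standard library. It is not what torchaudio does internally. The `save_wav` helper and the sine tone are hypothetical stand-ins for a generated speech sample.

```python
import math
import struct
import wave


def save_wav(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to the valid range
        frames += struct.pack('<h', int(s * 32767))  # little-endian int16
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Stand-in for generated speech: one second of a 440 Hz sine tone
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(24_000)]
save_wav('example_tone.wav', tone)
```

The sample rate written to the file header must match the rate the audio was generated at, otherwise the speech plays back too fast or too slow.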

Final Thoughts

I hope you enjoyed this article. I will publish more articles about how to use AI models and how they work in the future. Follow me if that sounds interesting to you. :-)

Isn’t collaboration great? I’m always happy to answer questions or discuss ideas proposed in my articles. So don’t hesitate to reach out to me! 🙌 Also, make sure to subscribe or follow to not miss out on new articles.
