Martin Thissen

Tortoise-TTS Fully Explained | Part 4 | Diffusion Model

In this series, I will take you on a deep dive into the architecture of the Tortoise-TTS model and explain in detail how the Tortoise-TTS model works. This will not only be done theoretically, but will also be accompanied by code.

If you prefer videos, feel free to check out my YouTube video accompanying this article.

Table of Contents

  • Part 1 — The Overall Architecture
  • Part 2 — The Autoregressive Model
  • Part 3 — The CLVP Model
  • Part 4 — The Diffusion Model
  • Part 5 — The Vocoder Model

Diffusion Model

Input: Speaker Conditioning, Latent Mel token representation, Timestep Signal

Output: Mel spectrogram

Step 10: Initializing the Diffusion Sampler

Alright, now it’s time to leave our highly compressed Mel token space and return to the “Mel spectrogram space”. For this, we will use a diffusion model. Diffusion models have proven quite effective at using low-dimensional guidance signals to reconstruct the high-dimensional space that those guidance signals were derived from. In other words, they allow us to recreate a Mel spectrogram based on the latent Mel token representations and the speaker conditioning. This is also known as upsampling or super-resolution.

The diffusion model used in the Tortoise architecture is a spaced diffusion model, which allows for faster diffusion by skipping steps of the basic diffusion process (DDIM). There are faster samplers available nowadays, but for the purposes of this article we will stick to the spaced diffusion model. In the sampler configuration used in this article, the 4000 diffusion steps used to train the diffusion model are strided into 250 time steps, so only 250 denoising iterations are run instead of 4000, which significantly speeds up the denoising process.

It is also important to note that the denoising process uses what Tortoise calls conditioning-free diffusion. The name is a little misleading: I initially wondered whether all the effort of computing the results of the autoregressive model was for nothing, but conditioning-free diffusion actually means that the denoising process uses both a conditioned and a conditioning-free signal. Specifically, it performs two forward passes for each diffusion step: one with conditioning and one without. The two results are mixed according to the conditioning_free_k value. The author of the Tortoise architecture states that conditioning-free diffusion significantly improves the realism of the generated speech.
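To make the mixing of the two forward passes concrete, here is a minimal sketch of the blending rule in plain Python. The function name `blend_conditioning_free` and the toy numbers are assumptions made for illustration only; in the real model the same arithmetic is applied to tensors predicted by the diffusion network.

```python
from typing import List, Sequence


def blend_conditioning_free(
    cond_out: Sequence[float],
    uncond_out: Sequence[float],
    k: float,
) -> List[float]:
    """Blend a conditioned and a conditioning-free prediction.

    Implements: output = cond * (k + 1) - uncond * k
    which is the same as: uncond + (k + 1) * (cond - uncond)
    """
    return [c * (k + 1) - u * k for c, u in zip(cond_out, uncond_out)]


# Toy predictions (hypothetical values, not taken from a real model)
cond = [0.8, -0.2, 0.5]
uncond = [0.5, 0.0, 0.5]

for k in (0, 1, 2):
    print(k, blend_conditioning_free(cond, uncond, k))
```

With k = 0 the conditioned prediction is returned unchanged. For larger k, the result is pushed further in the direction in which the conditioned prediction differs from the conditioning-free one, while entries where both predictions agree stay where they are.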

# OpenAI Paper (Prafulla Dhariwal, Alex Nichol) - Diffusion Models Beat GANs on Image Synthesis.
# https://github.com/openai/guided-diffusion/blob/main/guided_diffusion/respace.py
from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule

trained_diffusion_steps = 4000  # number of steps the diffusion model was trained with
desired_diffusion_steps = 250   # number of steps actually run at inference
sampler = SpacedDiffusion(
    use_timesteps=space_timesteps(trained_diffusion_steps, [desired_diffusion_steps]),
    model_mean_type='epsilon',  # the model predicts the added noise epsilon
    model_var_type='learned_range',  # the model also predicts the variance, interpolated within a fixed range
    loss_type='mse',  # training loss, not relevant for inference
    # noise schedule of the original 4000-step process; the sampler derives
    # the schedule for the retained 250 steps from it
    betas=get_named_beta_schedule('linear', trained_diffusion_steps),
    # Conditioning-free diffusion performs two forward passes for each diffusion step:
    # one with the outputs of the autoregressive model and one with no conditioning priors.
    # The outputs of the two are blended according to the cond_free_k value below.
    # Conditioning-free diffusion dramatically improves realism.
    conditioning_free=True,
    # cond_free_k sets the blending weight: 0 returns the conditioned output unchanged,
    # larger values push the result further away from the conditioning-free output.
    # Formula is: output = cond_present_output * (cond_free_k + 1) - cond_absent_output * cond_free_k
    conditioning_free_k=2,
)
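To illustrate what the striding from 4000 to 250 steps involves, here is a simplified, standard-library-only sketch. It is not the library code: `evenly_spaced_steps`, `linear_betas` and `respaced_betas` are hypothetical helper names, the sketch only handles a single evenly spaced section, and the linear schedule endpoints (0.0001 and 0.02, scaled by 1000 / number of steps) are an assumption based on the guided-diffusion code linked above.

```python
import math
from typing import List


def linear_betas(num_steps: int) -> List[float]:
    """Linear noise schedule; endpoints are scaled so that schedules with
    different step counts add a comparable total amount of noise (assumption)."""
    scale = 1000 / num_steps
    start, end = scale * 1e-4, scale * 0.02
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def evenly_spaced_steps(num_trained: int, num_desired: int) -> List[int]:
    """Pick num_desired step indices spread evenly over [0, num_trained - 1].
    Requires num_desired >= 2."""
    stride = (num_trained - 1) / (num_desired - 1)
    return sorted({round(i * stride) for i in range(num_desired)})


def respaced_betas(betas: List[float], kept_steps: List[int]) -> List[float]:
    """Derive betas for the shortened process so that the cumulative noise
    level at every kept step matches the original process."""
    kept = set(kept_steps)
    new_betas = []
    alpha_cumprod = 1.0       # running product of (1 - beta) in the original process
    last_kept_cumprod = 1.0   # value of that product at the previously kept step
    for t, beta in enumerate(betas):
        alpha_cumprod *= 1.0 - beta
        if t in kept:
            new_betas.append(1.0 - alpha_cumprod / last_kept_cumprod)
            last_kept_cumprod = alpha_cumprod
    return new_betas


trained_steps, desired_steps = 4000, 250
kept = evenly_spaced_steps(trained_steps, desired_steps)
betas = linear_betas(trained_steps)
short_betas = respaced_betas(betas, kept)

print(len(kept), kept[:4], kept[-1])
print(len(short_betas), max(betas), max(short_betas))
```

Each of the 250 retained steps has to remove roughly as much noise as about 16 of the original steps, which is why the respaced betas are larger than the original ones. This is also why the original noise schedule still matters at inference time.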

Step 11: Iteratively Denoise Noisy Input Image to Mel Spectrogram

After all conditions have been preprocessed and the diffusion sampler has been initialized, we can now generate a corresponding Mel spectrogram. As described in the previous step, the generation of the Mel spectrogram is guided by the latents of the autoregressive model as well as the speaker and time conditioning. Each Mel token stands for four frames of the Mel spectrogram, so the output sequence is four times as long as the latent sequence. In addition, the UnivNet vocoder works with a sampling rate of 24 kHz, while the Mel tokens were derived from audio with a sampling rate of 22.05 kHz. We have to take this into account when calculating the length of the output sequence. The vocoder also expects the Mel spectrogram to have 100 mel filter channels, which we need to take into account when initializing the output_shape. Then we combine the autoregressive latents and the speaker conditioning into a single conditioning. Now we can sample a noisy image from a normal distribution using the defined output_shape and denoise this image iteratively with the spaced diffusion sampler. Finally, we obtain the generated Mel spectrogram:

import torch

# `tts`, `temporary_cuda`, `preprocessed_gpt_latents` and `diffusion_conditioning`
# are assumed to be defined in the previous steps of this series.
# The parenthesized multi-item `with` statement requires Python 3.10 or newer.
with (
    temporary_cuda(tts.diffusion) as diffusion_model,
    torch.no_grad()
):
    generated_mel_spectograms = []
    for gpt_latents in preprocessed_gpt_latents:
        gpt_latents = gpt_latents.unsqueeze(0).to(tts.device)
        # 4x because the Mel codes compress a Mel spectrogram by factor 4
        # * 24000 // 22050 rescales the length from the 22.05 kHz rate of the Mel tokens
        # to the 24 kHz rate expected by the vocoder (multiply first, then floor-divide)
        output_seq_len = gpt_latents.shape[1] * 4 * 24000 // 22050
        # 100 = mel-filter channels (or banks) expected by the vocoder
        output_shape = (gpt_latents.shape[0], 100, output_seq_len)
        # preparing the conditioning (uses the model moved to the GPU above)
        precomputed_embeddings = diffusion_model.timestep_independent(gpt_latents, diffusion_conditioning, output_seq_len, False)
        # define noisy image which will be iteratively denoised
        img = torch.randn(output_shape, device=gpt_latents.device)
        for i in reversed(range(desired_diffusion_steps)):
            # current time step, one entry per batch element
            t = torch.tensor([i] * output_shape[0], device=tts.device)
            # denoise img at time step t using the precomputed conditioning
            # and the spaced diffusion sampler
            out = sampler.p_sample(
                diffusion_model,
                img,
                t,
                model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
            )
            # assign img with prediction from the diffusion model for
            # img at time step t-1
            img = out["sample"]
        generated_mel_spectograms.append(img.cpu())
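The output length calculation in the loop above can be checked in isolation. The following is a small sketch with a hypothetical helper `mel_output_shape`; the value of 200 latent positions is made up for illustration.

```python
from typing import Tuple


def mel_output_shape(
    num_latents: int,
    batch_size: int = 1,
    mel_channels: int = 100,   # mel filter channels expected by the vocoder
    compression: int = 4,      # one Mel token covers four spectrogram frames
    vocoder_sr: int = 24000,   # sampling rate of the vocoder
    token_sr: int = 22050,     # sampling rate the Mel tokens were derived from
) -> Tuple[int, int, int]:
    # Multiply first and floor-divide last; `24000 // 22050` on its own
    # would evaluate to 1 and the rate correction would be lost.
    seq_len = num_latents * compression * vocoder_sr // token_sr
    return (batch_size, mel_channels, seq_len)


print(mel_output_shape(200))  # (1, 100, 870)
```

For 200 latent positions, the factor of 4 alone would give 800 frames. The sampling rate correction stretches this to 870 frames.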

Running this code can take some time. I used an NVIDIA RTX 6000 Ada GPU, on which generating four Mel spectrograms took around 26 seconds. NVIDIA kindly provided me with the RTX 6000 Ada GPU to support my YouTube and Medium channels. The RTX 6000 Ada, which is a high-end GPU, has the following specs:

  • 568 Tensor Cores
  • 960 GB/s Memory Bandwidth
  • 48GB VRAM
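As a rough back-of-the-envelope check, the timing can be broken down per forward pass. This assumes that the 26 seconds were measured with the 250 diffusion steps and the conditioning-free setting configured earlier, which is an assumption on my side, and it ignores all overhead outside the diffusion model.

```python
num_spectrograms = 4    # spectrograms generated in the timed run
diffusion_steps = 250   # assumed: desired_diffusion_steps from the sampler setup
passes_per_step = 2     # conditioned + conditioning-free forward pass
total_seconds = 26.0    # measured wall-clock time from the article

forward_passes = num_spectrograms * diffusion_steps * passes_per_step
ms_per_pass = total_seconds * 1000 / forward_passes

print(forward_passes, ms_per_pass)
```

Under these assumptions the run consists of 2000 forward passes, which works out to an upper bound of about 13 ms per pass.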

Final Thoughts

I hope you enjoyed this article. I will publish more articles about how to use AI models and how they work in the future. Follow me if that sounds interesting to you. :-)

Isn’t collaboration great? I’m always happy to answer questions or discuss ideas proposed in my articles. So don’t hesitate to reach out to me! 🙌 Also, make sure to subscribe or follow to not miss out on new articles.

YouTube: https://bit.ly/3LqA1Os

LinkedIn: http://bit.ly/3i5Sc1g
