Summary

The web content provides an in-depth explanation of the Tortoise-TTS model architecture, focusing on the CLVP model and its role in text-to-speech synthesis, along with a giveaway for an NVIDIA RTX 3080 Ti GPU.

Abstract

The article series delves into the details of the Tortoise-TTS model, a text-to-speech synthesis system. Part 3 of the series specifically addresses the CLVP model, which is responsible for selecting the best Mel token sequences by calculating similarity scores between the input text and the generated sequences. The process involves using the autoregressive model to obtain latent representations, conditioning the diffusion model, and pre-processing intermediate latents for optimal speech synthesis. The author also announces a giveaway contest for an NVIDIA RTX 3080 Ti GPU, encouraging readers to attend the NVIDIA GTC 2024 conference and submit proof of attendance. The article emphasizes the importance of understanding AI models and invites readers to engage in discussions and follow for future content.

Opinions

The author believes that the Tortoise-TTS model's architecture is a significant advancement in AI, particularly in text-to-speech synthesis.
The CLVP model is highlighted as a key component in the Tortoise-TTS system, showcasing its effectiveness in selecting high-quality Mel token sequences.
The "Tortoise Trick" of conditioning the diffusion model with latent representations from the autoregressive model is presented as a notable contribution to the field.
The author expresses enthusiasm about the NVIDIA GTC 2024 conference, suggesting that it is a valuable opportunity for learning about the latest developments in AI.
The giveaway of an NVIDIA RTX 3080 Ti GPU is seen as an exciting incentive for the community to participate in the conference and engage with the content provided by the author.
The author values community interaction and encourages readers to reach out with questions or ideas, fostering a collaborative environment.

Tortoise-TTS Fully Explained | Part 3 | CLVP Model (CLIP)

In this series, I will take you on a deep dive into the architecture of the Tortoise-TTS model and explain in detail how it works. The explanation is not purely theoretical: every step is accompanied by code.

If you prefer video, feel free to check out my YouTube video for this article.

Part 1 — The Overall Architecture
Part 2 — The Autoregressive Model
Part 3 — The CLVP Model
Part 4 — The Diffusion Model
Part 5 — The Vocoder Model

CLVP Model

Input: Text, N x Mel token sequences

Output: N x Score (float value)

Step 7: Pick Top-K MEL Token Sequences Using the CLVP Model (CLIP)

Once we have generated the Mel token sequences, it is time to select the k best sequences, which are the only ones we will generate audio for. To do this, we use the CLVP model, which calculates a similarity score between the input text and each Mel token sequence. Only the k sequences with the highest scores are processed further.

# number of MEL token sequences (with highest CLVP score) to generate audio for
top_k = 4

with torch.no_grad(), temporary_cuda(tts.clvp) as clvp:
    for i in range(generated_mel_codes.shape[0]):
        # the fix_autoregressive_output function performs some padding to fix a
        # mismatch issue between what the diffusion model was trained on and what
        # the autoregressive code generator creates (which has no padding or end).
        generated_mel_codes[i] = fix_autoregressive_output(generated_mel_codes[i], tts.autoregressive.stop_mel_token)

    text_input = text_tokens.repeat(generated_mel_codes.shape[0], 1).to(tts.device)
    # calculate the CLVP scores for all MEL code sequences
    clvp_scores = clvp(text_input, generated_mel_codes, return_loss=False)
    # continue only with the top_k best ranked MEL code sequences
    topk_mel_codes_indices = torch.topk(clvp_scores, k=top_k).indices
    generated_mel_codes = generated_mel_codes[topk_mel_codes_indices]
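To make the selection step more concrete, here is a minimal sketch in plain Python. It assumes that a CLIP-style model scores a text/speech pair by the cosine similarity of two embedding vectors. The toy embeddings and the helpers cosine_similarity and top_k_indices are hypothetical stand-ins for illustration only; they are not part of the Tortoise-TTS code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy 2-D embeddings (assumption: real embeddings are much larger).
text_embedding = [1.0, 0.0]
speech_embeddings = [
    [0.0, 1.0],   # orthogonal to the text embedding
    [1.0, 0.0],   # same direction as the text embedding
    [1.0, 1.0],   # 45 degrees away from the text embedding
    [-1.0, 0.0],  # opposite direction
]

# One score per candidate sequence, then keep the two best candidates.
scores = [cosine_similarity(text_embedding, s) for s in speech_embeddings]
best = top_k_indices(scores, k=2)
print(best)
```

In the article's code, torch.topk(clvp_scores, k=top_k).indices plays the role of top_k_indices, and indexing generated_mel_codes with those indices keeps only the selected sequences.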

Autoregressive Model

Input: Text, N x Mel token sequences

Output: N x Latent representations inside AR model

Step 8: Obtaining GPT Latents for Generated MEL Code Sequences as Input for the Diffusion Model

You may be wondering why we are using the autoregressive model again. One of the key contributions of the Tortoise-TTS model is the idea of conditioning the diffusion model on the latent representations of the Mel tokens inside the autoregressive model. This is also called the Tortoise Trick. As you might imagine, the latent representation of a Mel token is much more expressive than the token itself: a latent vector can encode many features such as phoneme, intonation, rhythm, pitch, timbre, or tone, whereas a token is a single discrete value.
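A rough back-of-the-envelope comparison shows why a latent vector can carry more than a discrete token. The sketch below assumes a Mel code vocabulary of 8192 entries and a latent width of 1024 float32 values; both numbers are assumptions made for illustration, so check them against the model configuration you are using. The float32 figure is only a loose upper bound on capacity, not a measurement of how much information the model actually stores.

```python
import math

MEL_VOCAB_SIZE = 8192  # assumption: number of distinct Mel codes
LATENT_DIM = 1024      # assumption: width of one latent vector
BITS_PER_FLOAT32 = 32

# A discrete token can only say "which of the vocabulary entries".
bits_per_token = math.log2(MEL_VOCAB_SIZE)

# A latent vector holds LATENT_DIM continuous values per token position.
bits_per_latent_upper_bound = LATENT_DIM * BITS_PER_FLOAT32

print(f"discrete token: {bits_per_token:.0f} bits")
print(f"latent vector:  up to {bits_per_latent_upper_bound} bits")
```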

with torch.no_grad(), temporary_cuda(tts.autoregressive) as autoregressive:
    # repeat speech_conditioning k-times to match shape of generated_mel_codes
    speech_conditioning = gpt_conditioning.squeeze().repeat(top_k, 1).to("cuda")
    
    # repeat text_tokens k-times to match shape of generated_mel_codes
    text_inputs = text_tokens.repeat(top_k, 1).to("cuda")

    # number of input text tokens as a scalar (can be ignored,
    # since this argument is not used internally)
    text_lengths = torch.tensor([text_tokens.shape[-1]], device="cuda")

    # wav_lengths -> length of the output audio (number of Mel codes * Mel compression rate)
    # The Mel spectrogram compresses the audio 256 times. The Mel codes compress the Mel spectrogram 4 times.
    # Mel compression rate = 256 * 4 = 1024
    wav_lengths = torch.tensor([generated_mel_codes.shape[-1]*tts.autoregressive.mel_length_compression], device="cuda")
    
    # obtain the latent representations of each mel token inside the
    # autoregressive model
    gpt_latents = autoregressive(speech_conditioning,
                                 text_inputs,
                                 text_lengths,
                                 generated_mel_codes,
                                 wav_lengths,
                                 return_latent=True,
                                 clip_inputs=False)
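The wav_lengths arithmetic from the comments above can be checked with a small worked example. The compression factors 256 and 4 come from the code comments; the sample rate of 22,050 Hz and the helper names are assumptions I use for illustration, so adjust them to your setup.

```python
MEL_HOP_LENGTH = 256   # audio samples per Mel spectrogram frame (from the comments above)
MEL_CODE_STRIDE = 4    # Mel spectrogram frames per Mel code (from the comments above)
MEL_LENGTH_COMPRESSION = MEL_HOP_LENGTH * MEL_CODE_STRIDE  # 1024 samples per Mel code

def wav_length_in_samples(num_mel_codes):
    """Number of audio samples covered by a Mel code sequence."""
    return num_mel_codes * MEL_LENGTH_COMPRESSION

def duration_in_seconds(num_mel_codes, sample_rate=22050):
    """Audio duration; the default sample rate is an assumption."""
    return wav_length_in_samples(num_mel_codes) / sample_rate

num_codes = 250
print(wav_length_in_samples(num_codes))          # samples covered by 250 Mel codes
print(round(duration_in_seconds(num_codes), 2))  # duration in seconds at the assumed rate
```

Under these assumptions, a sequence of 250 Mel codes corresponds to 256,000 audio samples, or roughly 11.6 seconds of speech.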

Step 9: Pre-Process Intermediate Latents for Better Speech Synthesis Results

Now that we have obtained the latent representations from the autoregressive model, a final pre-processing step is required. Since all generated Mel token sequences are stored in one batch, they all have the same length. However, depending on the length of the input text, a sequence may end in many stop_token or padding_token entries that carry no information relevant to the diffusion model. It therefore makes sense to ignore such tokens and their latent representations when conditioning the diffusion model. The following code truncates the latents of the autoregressive model as soon as the corresponding Mel token sequence contains more than 8 consecutive calm tokens. This works because all stop_token and padding_token entries were already replaced by the calm token when fix_autoregressive_output was called in step 7. Overall, this improves efficiency while still leaving 8 calm tokens of “breathing room” for the diffusion model to terminate the speech.

preprocessed_gpt_latents = []

# Token id of the "calm" token. All EOS and padding tokens were replaced by it
# when the fix_autoregressive_output function was called earlier.
calm_token = 83

for i in range(generated_mel_codes.shape[0]):
    mel_codes = generated_mel_codes[i].unsqueeze(0).cpu()

    # Count consecutive calm tokens and trim the latents once the run is too long.
    calm_tokens = 0
    for j in range(mel_codes.shape[-1]):
        if mel_codes[0, j] == calm_token:
            calm_tokens += 1
        else:
            calm_tokens = 0
        if calm_tokens > 8:  # 8 calm tokens give the diffusion model some "breathing room" to terminate speech.
            preprocessed_gpt_latents.append(gpt_latents[i, :j].cpu())
            break
    else:
        # No run of more than 8 calm tokens was found, so keep the full latents
        # instead of silently dropping this sequence.
        preprocessed_gpt_latents.append(gpt_latents[i].cpu())
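The trimming rule is easier to verify on a plain Python list. The sketch below mirrors the counting logic from step 9; trim_index and the example token values are hypothetical and only serve as an illustration. It also returns the full length when no long run of calm tokens exists, which is my assumption about the sensible fallback.

```python
def trim_index(mel_codes, calm_token=83, max_calm_run=8):
    """Return the position at which to cut the sequence (and its latents).

    The cut happens at the first token that makes a run of consecutive
    calm tokens longer than max_calm_run. If no such run exists, the
    full length is returned, so nothing is trimmed.
    """
    calm_run = 0
    for j, code in enumerate(mel_codes):
        calm_run = calm_run + 1 if code == calm_token else 0
        if calm_run > max_calm_run:
            return j
    return len(mel_codes)

# 4 "speech" codes (one of them happens to be 83) followed by 12 calm tokens.
codes = [5, 7, 83, 9] + [83] * 12
cut = trim_index(codes)
trimmed = codes[:cut]

print(cut)           # position of the cut
print(len(trimmed))  # 4 speech codes + 8 calm tokens remain
```

Note that the isolated calm token at position 2 does not trigger a cut, because the counter is reset by the following non-calm token. Exactly 8 trailing calm tokens survive the trimming.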

Giveaway — Win an NVIDIA RTX 3080 Ti GPU 🎉

I have exciting news for you. One of you can win an NVIDIA RTX 3080 Ti GPU with 12 GB VRAM, 320 Tensor Cores, and 912 GB/s memory bandwidth. What do you need to do to win this GPU? Attend NVIDIA’s GTC 2024 conference (March 18–21) and send me a screenshot as proof of attendance, that’s it! The GTC conference takes place both online and in person. In case you haven’t heard about it yet, GTC covers a wide range of AI topics and gives you a great idea of what’s coming next in AI. There are more than 600 sessions, and people from all major players in the field of AI, such as Meta, OpenAI, Google DeepMind, NVIDIA, and RunwayML, will be giving talks. Personally, I find the talks “What’s Next in Generative AI”, “The Fastest Stable Diffusion in the World”, and “Human-Like AI Voices: Exploring the Evolution of Voice Technology” very interesting. Good luck to everyone, and don’t miss out on this one!

Win an NVIDIA RTX 3080 Ti 🎉

Step 1: Register for GTC 2024

Step 2: Send Your Proof of Attendance (Deadline March 22nd)

Final Thoughts

I hope you enjoyed this article. I will publish more articles about how to use AI models and how they work in the future. Follow me if that sounds interesting to you. :-)

Isn’t collaboration great? I’m always happy to answer questions or discuss ideas proposed in my articles. So don’t hesitate to reach out to me! 🙌 Also, make sure to subscribe or follow to not miss out on new articles.

YouTube: https://bit.ly/3LqA1Os

LinkedIn: http://bit.ly/3i5Sc1g