Eva Rtology

Machine Learning Art

Speech Cloning

SOTA speech synthesis models

https://mlearning.substack.com/

I’m going to show you how easy it is to create state-of-the-art speech synthesis models with the Toucan toolkit. You’ll see how everything is based on pure Python and PyTorch to keep it as simple and beginner-friendly as possible.

IMS Toucan is a toolkit for teaching, training and using state-of-the-art speech synthesis models, developed at the Institute for Natural Language Processing (IMS), University of Stuttgart, Germany. The basic PyTorch modules of FastSpeech 2 are taken from ESPnet, and the PyTorch modules of HiFiGAN are taken from the ParallelWaveGAN repository, which is also authored by the brilliant Tomoki Hayashi.

Project Page (scroll down)

New Features

🔵 Vocoders can be used to perform super-resolution and spectrogram inversion simultaneously.

🔵 Articulatory representations of phonemes as the input for all models. This makes it easy to use multilingual data to benefit less resource-rich languages.

🔵 A checkpoint trained with a variant of model-agnostic meta-learning, from which you should be able to fine-tune a model with very little data in almost any language (except for tonal languages).

🔵 A small self-contained Aligner that is trained with CTC and an auxiliary spectrogram reconstruction objective, inspired by this implementation.

🔵 By conditioning the TTS on an ensemble of speaker embeddings as well as an embedding lookup table for language embeddings, multilingual and multi-speaker models are possible.

🔵 Vocoders can also be used to do some slight speech-enhancement by corrupting a small percentage of their input spectrograms.

🟠 Exactly cloning the speaking style of a reference utterance is also possible and it works in conjunction with everything else! So any utterance in any language spoken by any speaker can be replicated and controlled to allow for maximum customizability.
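To make the articulatory-input idea from the feature list more concrete, here is a minimal sketch in plain Python. The five-feature inventory, the table `PHONEME_TO_FEATURES` and both helper functions are hypothetical and heavily simplified for illustration; they are not the toolkit's actual feature set or code. The point is only that phonemes become vectors of shared features, so sounds from different languages overlap in their representation instead of being unrelated IDs.

```python
# Hypothetical, heavily simplified articulatory inventory (an assumption for
# illustration; a real feature set is much larger).
FEATURES = ("voiced", "bilabial", "alveolar", "plosive", "nasal")

# Each phoneme is described by which articulatory features it has (1) or lacks (0).
PHONEME_TO_FEATURES = {
    "p": (0, 1, 0, 1, 0),  # voiceless bilabial plosive
    "b": (1, 1, 0, 1, 0),  # voiced bilabial plosive
    "m": (1, 1, 0, 0, 1),  # voiced bilabial nasal
    "t": (0, 0, 1, 1, 0),  # voiceless alveolar plosive
    "d": (1, 0, 1, 1, 0),  # voiced alveolar plosive
    "n": (1, 0, 1, 0, 1),  # voiced alveolar nasal
}


def encode(phonemes):
    """Turn a phoneme sequence into a sequence of articulatory feature vectors."""
    return [PHONEME_TO_FEATURES[p] for p in phonemes]


def feature_distance(a, b):
    """Count the articulatory features in which two phonemes differ."""
    return sum(x != y for x, y in zip(PHONEME_TO_FEATURES[a], PHONEME_TO_FEATURES[b]))


print(encode(["b", "m"]))
print(feature_distance("p", "b"))  # the two differ only in voicing
print(feature_distance("p", "n"))  # the two differ in every listed feature
```

With plain phoneme IDs, "p" and "b" would be as unrelated as "p" and "n". With feature vectors, a model that has learned "p" from one language already knows most of what it needs for "b" in another, which is the intuition behind sharing multilingual data.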

Installation

conda create --prefix ./toucan_conda_venv --no-default-packages python=3.8

pip install --no-cache-dir -r requirements.txt

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Creating a new Pipeline

To create a new pipeline to train a HiFiGAN vocoder, you only need a set of audio files. To create a new pipeline for FastSpeech 2, you need audio files, corresponding text labels, and an already trained Aligner model to estimate the duration information that FastSpeech 2 needs as input.
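As a rough illustration of pairing audio files with their text labels, the sketch below builds a dictionary from audio path to transcript. The metadata format (one `file_id|transcript` entry per line), the directory name and the function `build_path_to_transcript_dict` are assumptions made for this example, not something prescribed by the toolkit; adapt the parsing to however your corpus stores its labels.

```python
from pathlib import Path


def build_path_to_transcript_dict(metadata_lines, audio_dir):
    """Map each audio file path to its transcript.

    Assumes (hypothetically) that every metadata line looks like
    'file_id|transcript' and that the audio lives in audio_dir/file_id.wav.
    """
    audio_dir = Path(audio_dir)
    path_to_transcript = {}
    for line in metadata_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Split only on the first '|' so transcripts may contain the character.
        file_id, transcript = line.split("|", maxsplit=1)
        wav_path = (audio_dir / f"{file_id}.wav").as_posix()
        path_to_transcript[wav_path] = transcript.strip()
    return path_to_transcript


# Made-up example data, only to show the shape of the result.
example_metadata = [
    "utt_0001|Hello world.",
    "utt_0002|Speech synthesis is fun.",
]
pairs = build_path_to_transcript_dict(example_metadata, "my_corpus/wavs")
for path, text in pairs.items():
    print(path, "->", text)
```

For a vocoder pipeline the transcripts are not needed, so the keys of such a dictionary (the list of audio paths) would already be enough.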

Training a Model

Once you have a pipeline built, training is super easy.

python run_training_pipeline.py <shorthand of the pipeline>
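If you are wondering how a "shorthand" can select a whole pipeline, one plausible pattern is a lookup table from shorthand strings to training functions. The sketch below shows that pattern with `argparse`; the shorthands, the two placeholder functions and `run` are hypothetical names for this example and do not claim to mirror the toolkit's actual script.

```python
import argparse


def train_hifigan_example():
    # Placeholder: a real pipeline would load data and start vocoder training.
    return "training HiFiGAN vocoder"


def train_fastspeech2_example():
    # Placeholder: a real pipeline would load data and start TTS training.
    return "training FastSpeech 2"


# Hypothetical shorthands mapped to the function that runs each pipeline.
PIPELINES = {
    "hifigan_example": train_hifigan_example,
    "fastspeech2_example": train_fastspeech2_example,
}


def run(argv):
    """Parse the shorthand from argv and call the matching pipeline."""
    parser = argparse.ArgumentParser(description="Run a training pipeline by shorthand.")
    parser.add_argument("pipeline", choices=sorted(PIPELINES),
                        help="shorthand of the pipeline")
    args = parser.parse_args(argv)
    return PIPELINES[args.pipeline]()


# Equivalent to calling the script with "hifigan_example" as its argument.
print(run(["hifigan_example"]))
```

Adding a new pipeline then only means writing one function and registering it under a new shorthand; an unknown shorthand is rejected by `argparse` with a list of the valid choices.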

Using a trained Model for Inference

You can load your trained models using an inference interface. Simply instantiate it with the proper directory handle identifying the model you want to use, and the rest should work out in the background. You might want to set a language embedding or a speaker embedding. The methods for that should be self-explanatory.

Conclusion:

Synthesizing natural, high-quality speech from text involves two aspects. First, waveforms are modeled and generated directly at a sampling rate of 48 kHz, with a good compromise between the acoustic model and the HiFiNet vocoder, which provides higher perceptual quality than previous systems with lower sampling rates. Second, the variation information in speech is modeled by a systematic design that includes both explicit and implicit modeling, which improves prosody and naturalness. Overall, the naturalness (MOS) of System F is significantly higher than that of all other systems and is not significantly different from natural speech; the speaker similarity (SMOS) of System F is better than that of all other systems, showing the superiority of the Toucan system.

@inproceedings{lux2021toucan,
  title={{The IMS Toucan system for the Blizzard Challenge 2021}},
  author={Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu},
  year={2021},
  booktitle={Proc. Blizzard Challenge Workshop},
  volume={2021},
  publisher={{Speech Synthesis SIG}}
}
@article{lux2022laml,
  title={{Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features}},
  author={Florian Lux and Ngoc Thang Vu},
  year={2022},
  journal={arXiv preprint arXiv:2203.03191},
}

Project page:

https://github.com/DigitalPhonetics/IMS-Toucan

Demo:

https://huggingface.co/spaces/Flux9665/IMS-Toucan

🟠 SpeechCloning Demo:

https://huggingface.co/spaces/Flux9665/SpeechCloning

I invite you to explore the concept of “AI creativity” by reading and learning from the many articles found on 🔵 MLearning.ai 🟠

I am an Art Curator, founder at EvArtology. I advise companies and institutions in the creative industries on using AI tools in their daily work. Human collaboration with ML models can be very creative and bring huge benefits. The new era begins now.

Data Scientists must think like an artist when finding a solution when creating a piece of code. Artists enjoy working on interesting problems, even if there is no obvious answer.

