Beginner’s Guide to NVIDIA NeMo

A toolkit to develop and train speech and language models

This piece gives you a glimpse of the fundamental concepts behind NVIDIA NeMo. It is a powerful toolkit for building your own state-of-the-art models for conversational AI. A typical conversational AI pipeline consists of the following domains:

Automated Speech Recognition (ASR)
Natural Language Processing (NLP)
Text to Speech (TTS)

If you are looking for a full-fledged toolkit to train or fine-tune models for these domains, you might want to have a look at NeMo. It allows researchers and model developers to build their own neural network architectures using reusable components called Neural Modules (NeMo). According to the official documentation, neural modules are

“… conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations.”

One major plus point for NeMo is that it can be used to train new models or to perform transfer learning on existing pre-trained models. On top of that, quite a number of pre-trained models are available at NVIDIA GPU Cloud (NGC). At the time of this writing, the GPU-accelerated cloud platform has the following pre-trained models:

ASR

Jasper 10x5 — LibriSpeech
Multi-dataset Jasper 10x5 — LibriSpeech, Mozilla Common Voice, WSJ, Fisher, and Switchboard
AI-shell2 Jasper 10x5 — AI-shell2 Mandarin Chinese
Quartznet — LibriSpeech with speed perturbation
QuartzNetLibrispeechMCV — LibriSpeech, Mozilla Common Voice
Multi-dataset Quartznet — LibriSpeech, Mozilla Common Voice, WSJ, Fisher, and Switchboard
WSJ-Quartznet — Wall Street Journal, LibriSpeech, Mozilla Common Voice
AI-shell2 Quartznet — AI-shell2 Mandarin Chinese

NLP

BertLargeUncased — BERT Large trained on uncased Wikipedia and BookCorpus with a sequence length of 512
BertBaseCased — BERT Base trained on cased Wikipedia and BookCorpus with a sequence length of 512
BertBaseUncased — BERT Base trained on uncased Wikipedia and BookCorpus with a sequence length of 512
Transformer-Big — WikiText-2

TTS

Tacotron2 — LJSpeech
Waveglow — LJSpeech

Setup

Make sure that you fulfill the following requirements:

Python 3.6 or 3.7
PyTorch 1.4.* with GPU support
NVIDIA APEX (optional)

For your information, NVIDIA APEX is a set of utilities that helps to streamline mixed precision and distributed training in PyTorch. It is not required, but it helps to improve performance and training time. If you intend to use NVIDIA APEX, it is highly recommended to use a Linux operating system, as support for Windows is still experimental.
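Before installing anything, it can help to check the environment against the list above. The snippet below is only a minimal sketch and is not part of NeMo: `python_version_supported` and `has_package` are hypothetical helper names, and the supported versions are simply the ones listed in this article.

```python
import importlib.util
import sys

# Versions taken from the requirements listed above; other NeMo
# releases may support different ones (assumption).
SUPPORTED_PYTHON = {(3, 6), (3, 7)}


def python_version_supported(version=None):
    """Return True if (major, minor) is one of the supported versions."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def has_package(name):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python OK:", python_version_supported())
    print("PyTorch found:", has_package("torch"))
    print("APEX found (optional):", has_package("apex"))
```

Note that this only checks whether the packages can be found. It does not verify the PyTorch version or whether GPU support is enabled.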

Installation is pretty straightforward if you are using Docker.

docker run --runtime=nvidia -it --rm --shm-size=16g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:v0.11

If you have signed up for the NVIDIA NGC PyTorch container, execute the following commands one by one.

Pull the Docker image

docker pull nvcr.io/nvidia/pytorch:20.01-py3

Run the container, mounting your local NeMo folder at /NeMo

docker run --gpus all -it --rm -v <nemo_github_folder>:/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:20.01-py3

The remaining installation steps continue from here. If you are running NeMo locally or via Google Colab, you should start the installation from this point. Run the following command to install the necessary dependencies.

apt-get update && apt-get install -y libsndfile1 ffmpeg && pip install Cython

Once you are done, the next step is to pip install the NeMo package that matches your use case. If you are using it to train a new model, run the following command

pip install nemo_toolkit

To install NeMo together with the Automated Speech Recognition collection, run

pip install nemo_toolkit[asr]

NeMo and Natural Language Processing collections can be installed via

pip install nemo_toolkit[nlp]

To install NeMo and Text to Speech collections, run the following command

pip install nemo_toolkit[tts]

If you are looking for a full installation which includes all of the collections, you can do so via

pip install nemo_toolkit[all]

Programming Model

Each application that is based on the NeMo API typically uses the following workflow:

Creation of NeuralModuleFactory and necessary NeuralModule
Defining a Directed Acyclic Graph (DAG) of NeuralModule
Call to “action” such as train

One important thing to note is that NeMo follows a lazy execution model. This means that no actual computation is performed until an action such as training or inference is called.
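The workflow and the lazy execution model described above can be illustrated with a small, self-contained sketch. This is plain Python and not NeMo's implementation: `LazyNode` and the three toy functions are hypothetical names made up for this illustration. Wiring the nodes together only describes the graph, and the functions run only when `evaluate()` is called.

```python
class LazyNode:
    """A node in a toy computation graph; nothing runs until evaluate()."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def evaluate(self):
        # Recursively evaluate upstream nodes, then apply this node's function.
        args = [node.evaluate() for node in self.inputs]
        return self.func(*args)


calls = []  # records which functions have actually executed


def load_data():
    calls.append("load_data")
    return [1.0, 2.0, 3.0]


def double(values):
    calls.append("double")
    return [2 * v for v in values]


def total(values):
    calls.append("total")
    return sum(values)


# Steps 1 and 2: create the "modules" and wire them into a DAG.
data = LazyNode(load_data)
doubled = LazyNode(double, data)
result = LazyNode(total, doubled)
print(calls)              # [] because nothing has run yet

# Step 3: the "action" triggers the actual computation.
print(result.evaluate())  # 12.0
print(calls)              # ['load_data', 'double', 'total']
```

In NeMo, the role of `evaluate()` is played by actions such as train, while calling a module on the outputs of another module corresponds to the wiring step.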

Neural Types

All input and output ports of every neural module in NeMo are typed. They are implemented with the Python class NeuralType and helper classes derived from ElementType, AxisType and AxisKindAbstract. A Neural Type consists of the following data:

axes — represents what varying a particular axis means (batch, time)
elements_type — represents the semantics and properties of what is stored inside the activations (audio signal, text embedding, logits)

Initialization

Neural Types are mostly instantiated inside your module’s input_ports and output_ports properties. The NeuralType constructor accepts the following arguments

axes: Optional[Tuple] = None, elements_type: ElementType = VoidType(), optional=False

Let’s have a look at the following example for (audio) data layer output ports.

{
    'audio_signal': NeuralType(('B', 'T'), AudioSignal(freq=self._sample_rate)),
    'a_sig_length': NeuralType(tuple('B'), LengthsType()),
    'transcripts': NeuralType(('B', 'T'), LabelsType()),
    'transcript_length': NeuralType(tuple('B'), LengthsType()),
}

B — represents AxisKind.Batch
T — represents AxisKind.Time
D — represents AxisKind.Dimension
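To make the axis notation more concrete, here is a small sketch in plain Python. It does not use NeMo: `ToyAxisKind` and `describe_axes` are hypothetical stand-ins that only mirror the three letters listed above. It also shows why the example writes `tuple('B')` for a single axis.

```python
from enum import Enum


class ToyAxisKind(Enum):
    """Hypothetical stand-in for the axis kinds named above."""
    BATCH = "B"
    TIME = "T"
    DIMENSION = "D"


def describe_axes(axes):
    """Map single-letter axis codes, e.g. ('B', 'T'), to their kinds."""
    return tuple(ToyAxisKind(code) for code in axes)


print(describe_axes(("B", "T")))
# tuple('B') builds the one-element tuple ('B',), so it describes a
# tensor with a single batch axis, such as a vector of lengths.
print(describe_axes(tuple("B")))
```

An unknown letter raises a ValueError in this sketch, which is roughly the kind of early feedback typed ports are meant to give.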

Compare

You can compare two NeuralType objects via the compare() function. It returns a NeuralTypeComparisonResult with one of the following values, where A is the type on which compare() is called and B is the type passed to it

SAME = 0
LESS = 1 (A is B)
GREATER = 2 (B is A)
DIM_INCOMPATIBLE = 3 (Dimension is not compatible. Resize connector might fix incompatibility)
TRANSPOSE_SAME = 4 (Format of the data is not compatible but a transpose and/or converting between lists and tensors will make them same)
CONTAINER_SIZE_MISMATCH = 5 (A and B contain different number of elements)
INCOMPATIBLE = 6 (A and B are incompatible)
SAME_TYPE_INCOMPATIBLE_PARAMS = 7 (A and B are of the same type but parametrized differently)
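The idea behind a few of these result codes can be sketched with a toy type checker. This is a heavily simplified illustration and an assumption about the general idea, not NeMo's actual comparison logic: `ToyType` and `ToyComparison` are hypothetical names, only four of the codes are modelled, and element types are plain strings.

```python
from enum import Enum
from typing import NamedTuple, Tuple


class ToyComparison(Enum):
    """Subset of the result codes listed above (toy version)."""
    SAME = 0
    DIM_INCOMPATIBLE = 3
    TRANSPOSE_SAME = 4
    INCOMPATIBLE = 6


class ToyType(NamedTuple):
    axes: Tuple[str, ...]  # e.g. ("B", "T", "D")
    element: str           # e.g. "Spectrogram"

    def compare(self, other):
        if self.element != other.element:
            return ToyComparison.INCOMPATIBLE
        if self.axes == other.axes:
            return ToyComparison.SAME
        if sorted(self.axes) == sorted(other.axes):
            # Same axes in a different order: a transpose would align them.
            return ToyComparison.TRANSPOSE_SAME
        return ToyComparison.DIM_INCOMPATIBLE


spec_btd = ToyType(("B", "T", "D"), "Spectrogram")
spec_bdt = ToyType(("B", "D", "T"), "Spectrogram")
labels = ToyType(("B", "T"), "Labels")

print(spec_btd.compare(spec_btd).name)  # SAME
print(spec_btd.compare(spec_bdt).name)  # TRANSPOSE_SAME
print(spec_btd.compare(labels).name)    # INCOMPATIBLE
```

Checks like this let a framework reject a badly wired graph when it is built, before any data flows through it.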

Let’s move on to the next section to explore a few examples.

Examples

Basic example

Let’s have a look at the following example, which builds a model that learns the Taylor coefficients for y=sin(x).

import nemo

# use NeMo's logger; the callback below calls logging.info
logging = nemo.logging

# instantiate Neural Factory with supported backend
nf = nemo.core.NeuralModuleFactory()

# instantiate necessary neural modules
# RealFunctionDataLayer defaults to f=torch.sin, sampling from x=[-4, 4]
dl = nemo.tutorials.RealFunctionDataLayer(
    n=10000, batch_size=128)
fx = nemo.tutorials.TaylorNet(dim=4)
loss = nemo.tutorials.MSELoss()

# describe activation's flow
x, y = dl()
p = fx(x=x)
lss = loss(predictions=p, target=y)

# SimpleLossLoggerCallback will print loss values to console.
callback = nemo.core.SimpleLossLoggerCallback(
    tensors=[lss],
    print_func=lambda x: logging.info(f'Train Loss: {str(x[0].item())}'))

# Invoke "train" action
nf.train([lss], callbacks=[callback],
         optimization_params={"num_epochs": 3, "lr": 0.0003},
         optimizer="sgd")
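To see what kind of problem this example solves without installing NeMo, here is a rough pure-Python sketch of the same idea: fitting the coefficients of a cubic polynomial to sin(x) on [-4, 4] by gradient descent on the mean squared error. It is not the NeMo code. The fixed grid of inputs, the full-batch updates and the number of steps are assumptions made to keep the sketch short and deterministic, and `predict`, `mse` and `gradient_step` are hypothetical helper names.

```python
import math

DEGREE = 3  # a cubic has 4 coefficients, mirroring dim=4 above (assumption)
xs = [i / 10 for i in range(-40, 41)]  # fixed grid on [-4, 4]
ys = [math.sin(x) for x in xs]


def predict(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + c3*x^3."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def mse(coeffs):
    """Mean squared error of the polynomial against sin(x) on the grid."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(coeffs, lr):
    """One full-batch gradient descent step on the mean squared error."""
    grads = [0.0] * len(coeffs)
    for x, y in zip(xs, ys):
        err = predict(coeffs, x) - y
        for k in range(len(coeffs)):
            grads[k] += 2 * err * x ** k / len(xs)
    return [c - lr * g for c, g in zip(coeffs, grads)]


coeffs = [0.0] * (DEGREE + 1)
initial_loss = mse(coeffs)
for _ in range(500):
    coeffs = gradient_step(coeffs, lr=0.0003)
final_loss = mse(coeffs)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
print("coefficients:", [round(c, 4) for c in coeffs])
```

The fitted values are least-squares coefficients for this interval, so they need not match the textbook Taylor expansion around zero. The NeMo version expresses the same data, model and loss as neural modules and leaves the optimization loop to the train action.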

ASR

Check out the following notebooks to kick-start your project on speech recognition:

NLP

Collection of notebooks using NeMo for natural language processing tasks:

TTS

Sample notebook for text to speech task using NeMo:

Tacotron + WaveGlow to Generate Audio

Conclusion

Let’s recap on what we have learned today.

We started off with a brief introduction to the NVIDIA NeMo toolkit. We were also introduced to a few pre-trained models that are readily available at NVIDIA GPU Cloud (NGC).

Then, we installed the toolkit either via Docker or locally with pip install.

We explored the programming model and NeuralType, which make up the basic concepts behind NeMo.

Lastly, we played around and tested a few examples for automated speech recognition, natural language processing and text to speech tasks.

Thanks for reading this piece. Hope to see you again in the next article!