Beginner’s Guide to NVIDIA NeMo
A toolkit to develop and train speech and language models

This piece gives you a glimpse of the fundamental concepts behind NVIDIA NeMo. It is an extremely powerful toolkit for building your own state-of-the-art models for conversational AI. For your information, a typical conversational AI pipeline consists of the following domains:
- Automated Speech Recognition (ASR)
- Natural Language Processing (NLP)
- Text to Speech (TTS)
If you are looking for a full-fledged toolkit to train or fine-tune models for these domains, you might want to have a look at NeMo. It allows researchers and model developers to build their own neural network architectures using reusable components called Neural Modules (NeMo). Based on the official documentation, neural modules are
“… conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations.”
One major plus point for NeMo is that it can be used to train new models or perform transfer learning on existing pre-trained models. On top of that, quite a number of pre-trained models are available at NVIDIA GPU Cloud (NGC). At the time of this writing, the GPU-accelerated cloud platform has the following pre-trained models:
ASR
- Jasper 10x5: LibriSpeech
- Multi-dataset Jasper 10x5: LibriSpeech, Mozilla Common Voice, WSJ, Fisher, and Switchboard
- AI-shell2 Jasper 10x5: AI-shell2 Mandarin Chinese
- Quartznet: LibriSpeech with speed perturbation
- QuartzNetLibrispeechMCV: LibriSpeech, Mozilla Common Voice
- Multi-dataset Quartznet: LibriSpeech, Mozilla Common Voice, WSJ, Fisher, and Switchboard
- WSJ-Quartznet: Wall Street Journal, LibriSpeech, Mozilla Common Voice
- AI-shell2 Quartznet: AI-shell2 Mandarin Chinese
NLP
- BertLargeUncased: uncased Wikipedia and BookCorpus, sequence length 512, using BERT Large
- BertBaseCased: cased Wikipedia and BookCorpus, sequence length 512, using BERT Base
- BertBaseUncased: uncased Wikipedia and BookCorpus, sequence length 512, using BERT Base
- Transformer-Big: WikiText-2
TTS
Setup
Make sure that you fulfill the following requirements:
- Python 3.6 or 3.7
- PyTorch 1.4.* with GPU support
- NVIDIA APEX (optional)
For your information, NVIDIA APEX is a set of utilities that helps to streamline mixed precision and distributed training in PyTorch. It is not required, but it helps to improve performance and training time. If you intend to use NVIDIA APEX, it is highly recommended to use a Linux operating system, as support for Windows is still at an experimental phase.
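If you would like to check the requirements above before installing anything, a small script along the following lines may help. It is only a sketch: the function names check_requirements and installed_torch_version are made up for this article, and the version rules simply mirror the list above (Python 3.6 or 3.7, PyTorch 1.4.*).

```python
import sys
from typing import List, Optional, Tuple


def check_requirements(python_version: Tuple[int, int],
                       torch_version: Optional[str]) -> List[str]:
    """Return a list of problems; an empty list means the listed requirements are met."""
    problems = []
    # The guide lists Python 3.6 or 3.7
    if tuple(python_version) not in ((3, 6), (3, 7)):
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, expected 3.6 or 3.7")
    # The guide lists PyTorch 1.4.*
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif not torch_version.startswith("1.4."):
        problems.append(f"PyTorch {torch_version} found, expected 1.4.*")
    return problems


def installed_torch_version() -> Optional[str]:
    """Return the installed PyTorch version string, or None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__


if __name__ == "__main__":
    issues = check_requirements(sys.version_info[:2], installed_torch_version())
    for issue in issues:
        print(issue)
    if not issues:
        print("Environment matches the listed requirements")
```

The check is kept as a pure function so that you can also call it with hard-coded versions. GPU support is not verified here; if PyTorch is installed, torch.cuda.is_available() is the usual way to check that.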
Installation is pretty straightforward if you are using Docker:
docker run --runtime=nvidia -it --rm --shm-size=16g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:v0.11
If you have signed up for the NVIDIA NGC PyTorch container, execute the following commands one by one.
Pull the Docker image:
docker pull nvcr.io/nvidia/pytorch:20.01-py3
Run the following command:
docker run --gpus all -it --rm -v <nemo_github_folder>:/NeMo --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:20.01-py3
The following steps are the continuation point for the rest of the installation. If you are running it locally or via Google Colab, you should start the installation from here. Run the following command to install the necessary dependencies:
apt-get update && apt-get install -y libsndfile1 ffmpeg && pip install Cython
Once you are done, the next step is to pip install the NeMo package that matches your use case. If you are using it to train a new model, run the following command:
pip install nemo_toolkit
To get NeMo together with the Automated Speech Recognition collection:
pip install nemo_toolkit[asr]
NeMo and the Natural Language Processing collection can be installed via:
pip install nemo_toolkit[nlp]
To install NeMo and the Text to Speech collection, run the following command:
pip install nemo_toolkit[tts]
If you are looking for a full installation which includes all of the collections, you can do so via:
pip install nemo_toolkit[all]
Programming Model
Each application that is based on the NeMo API typically uses the following workflow:
- Creation of NeuralModuleFactory and necessary NeuralModule
- Defining a Directed Acyclic Graph (DAG) of NeuralModule
- Call to “action” such as train
One important thing to note is that NeMo follows a lazy execution model. This means that connecting modules only describes the graph; no actual computation is performed until an action such as training or inference is invoked.
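To make the idea of lazy execution more concrete, here is a toy sketch in plain Python. It does not use NeMo at all: LazyTensor, ToyModule and run are hypothetical names invented for this illustration. Calling a module only records a node in the graph, and the work happens when the "action" is invoked.

```python
from typing import Callable, Tuple


class LazyTensor:
    """Symbolic handle: remembers how to compute a value without computing it."""

    def __init__(self, fn: Callable, inputs: Tuple["LazyTensor", ...] = ()):
        self.fn = fn
        self.inputs = inputs


class ToyModule:
    """Hypothetical stand-in for a neural module; calling it only extends the graph."""

    def __init__(self, name: str, fn: Callable):
        self.name = name
        self.fn = fn
        self.calls = 0  # number of times the wrapped function really ran

    def __call__(self, *inputs: LazyTensor) -> LazyTensor:
        def compute(*values):
            self.calls += 1
            return self.fn(*values)

        return LazyTensor(compute, inputs)


def run(tensor: LazyTensor):
    """The 'action': evaluate the inputs first, then the node itself."""
    values = [run(parent) for parent in tensor.inputs]
    return tensor.fn(*values)


# Step 1: create the modules
data = ToyModule("data", lambda: [1.0, 2.0, 3.0])
double = ToyModule("double", lambda xs: [2 * x for x in xs])
total = ToyModule("total", lambda xs: sum(xs))

# Step 2: describe the graph; nothing is computed yet
out = total(double(data()))
calls_before_run = double.calls  # still 0 at this point

# Step 3: call the action; only now do the modules execute
result = run(out)  # 12.0
```

This sketch does not cache intermediate values, so a node shared by two branches would be evaluated twice. A real framework handles that (and much more), but the three steps map onto the workflow listed above.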
Neural Types
All input and output ports of every neural module in NeMo are typed. They are implemented with the Python class NeuralType and helper classes derived from ElementType, AxisType and AxisKindAbstract. A Neural Type consists of the following data:
- axes: represents what varying a particular axis means (batch, time)
- elements_type: represents the semantics and properties of what is stored inside the activations (audio signal, text embedding, logits)
Initialization
Instantiation is mostly done inside your module’s input_ports and output_ports properties. The NeuralType constructor takes the following arguments:
NeuralType(axes: Optional[Tuple] = None, elements_type: ElementType = VoidType(), optional=False)
Let’s have a look at the following example of output ports for an (audio) data layer.
{
    'audio_signal': NeuralType(('B', 'T'), AudioSignal(freq=self._sample_rate)),
    'a_sig_length': NeuralType(tuple('B'), LengthsType()),
    'transcripts': NeuralType(('B', 'T'), LabelsType()),
    'transcript_length': NeuralType(tuple('B'), LengthsType()),
}
- B: represents AxisKind.Batch
- T: represents AxisKind.Time
- D: represents AxisKind.Dimension
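The point of typing every port is that a mismatch can be caught when modules are connected, before any data flows. The sketch below shows that idea with a toy class. ToyNeuralType and describe_mismatch are hypothetical names invented for this article, the element types are plain strings rather than NeMo's ElementType classes, and the rule is far simpler than what NeMo actually does.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyNeuralType:
    """Toy stand-in holding the two pieces of data described above."""
    axes: Tuple[str, ...]   # e.g. ('B', 'T') for batch and time
    elements_type: str      # e.g. 'AudioSignal', kept as a plain string here
    optional: bool = False


def describe_mismatch(provided: ToyNeuralType,
                      expected: ToyNeuralType) -> Optional[str]:
    """Return None if the port types match, otherwise a short explanation."""
    if provided.axes != expected.axes:
        return f"axes differ: {provided.axes} vs {expected.axes}"
    if provided.elements_type != expected.elements_type:
        return (f"element types differ: "
                f"{provided.elements_type} vs {expected.elements_type}")
    return None


# Two of the output ports from the data layer example above
audio_signal = ToyNeuralType(('B', 'T'), 'AudioSignal')
a_sig_length = ToyNeuralType(('B',), 'LengthsType')

# A hypothetical input port that expects an audio signal
expected_input = ToyNeuralType(('B', 'T'), 'AudioSignal')

ok = describe_mismatch(audio_signal, expected_input)       # None: types match
problem = describe_mismatch(a_sig_length, expected_input)  # axes differ
```

Feeding the lengths tensor into a port that expects audio is rejected on its axes alone, which is the kind of mistake typed ports are meant to surface early.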
Compare
You can compare two NeuralType instances via the compare() function. It returns a NeuralTypeComparisonResult with one of the following values:
- SAME = 0
- LESS = 1 (A is B)
- GREATER = 2 (B is A)
- DIM_INCOMPATIBLE = 3 (Dimension is not compatible. Resize connector might fix incompatibility)
- TRANSPOSE_SAME = 4 (Format of the data is not compatible but a transpose and/or converting between lists and tensors will make them same)
- CONTAINER_SIZE_MISMATCH = 5 (A and B contain different number of elements)
- INCOMPATIBLE = 6 (A and B are incompatible)
- SAME_TYPE_INCOMPATIBLE_PARAMS = 7 (A and B are of the same type but parametrized differently)
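To get a feel for how these values might be used, the sketch below copies them into a standalone enum and applies a deliberately simplified rule to axis layouts only. ToyComparisonResult and compare_axes are hypothetical names for this illustration. NeMo's real compare() also considers element types and their parameters, so treat this as a mental model rather than the actual algorithm.

```python
from enum import IntEnum
from typing import Tuple


class ToyComparisonResult(IntEnum):
    """Standalone copy of the values listed above (not imported from NeMo)."""
    SAME = 0
    LESS = 1
    GREATER = 2
    DIM_INCOMPATIBLE = 3
    TRANSPOSE_SAME = 4
    CONTAINER_SIZE_MISMATCH = 5
    INCOMPATIBLE = 6
    SAME_TYPE_INCOMPATIBLE_PARAMS = 7


def compare_axes(a: Tuple[str, ...], b: Tuple[str, ...]) -> ToyComparisonResult:
    """Simplified comparison that looks only at axis kinds such as 'B', 'T', 'D'."""
    if a == b:
        return ToyComparisonResult.SAME
    # Same axis kinds in a different order: a transpose would make them equal
    if sorted(a) == sorted(b):
        return ToyComparisonResult.TRANSPOSE_SAME
    return ToyComparisonResult.INCOMPATIBLE


same = compare_axes(('B', 'T'), ('B', 'T'))
transposed = compare_axes(('B', 'T', 'D'), ('B', 'D', 'T'))
incompatible = compare_axes(('B', 'T'), ('B',))
```

Here a (batch, time, dimension) layout compared with (batch, dimension, time) yields TRANSPOSE_SAME, matching the description of value 4 in the list above.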
Let’s move on to the next section to explore some examples.
Examples
Basic example
Let’s have a look at the following example, which builds a model that learns Taylor coefficients for y=sin(x).
import logging

import nemo

# print_func below uses standard-library logging; INFO level makes the loss lines visible
logging.basicConfig(level=logging.INFO)

# instantiate Neural Factory with supported backend
nf = nemo.core.NeuralModuleFactory()

# instantiate necessary neural modules
# RealFunctionDataLayer defaults to f=torch.sin, sampling from x=[-4, 4]
dl = nemo.tutorials.RealFunctionDataLayer(
    n=10000, batch_size=128)
fx = nemo.tutorials.TaylorNet(dim=4)
loss = nemo.tutorials.MSELoss()

# describe activation's flow
x, y = dl()
p = fx(x=x)
lss = loss(predictions=p, target=y)

# SimpleLossLoggerCallback will print loss values to console.
callback = nemo.core.SimpleLossLoggerCallback(
    tensors=[lss],
    print_func=lambda x: logging.info(f'Train Loss: {str(x[0].item())}'))

# Invoke "train" action
nf.train([lss], callbacks=[callback],
         optimization_params={"num_epochs": 3, "lr": 0.0003},
         optimizer="sgd")
ASR
Check out the following notebooks to kick-start your project on speech recognition:
- Introduction to End-To-End Automatic Speech Recognition
- Automatic speech recognition from a microphone’s stream in NeMo
- Speech Command recognition based on QuartzNet model
NLP
Collection of notebooks using NeMo for natural language processing tasks:
- BERT Pretraining
- BioBERT for Question Answering
- BioBERT for Named-entity Recognition
- BioBERT for Relationship Extraction
TTS
Sample notebook for text to speech task using NeMo:
Conclusion
Let’s recap what we have learned today.
We started off with a brief introduction to the NVIDIA NeMo toolkit. We also looked at a few pre-trained models that are readily available at NVIDIA GPU Cloud (NGC).
Then, we installed the toolkit either via Docker or locally with pip install.
We explored the programming model and NeuralType, which make up the basic concepts behind NeMo.
Lastly, we looked at a basic example and a few notebooks for automatic speech recognition, natural language processing and text to speech tasks.
Thanks for reading this piece. Hope to see you again in the next article!





