ML Paper Challenge Day 18 — Deep Speech 2: End-to-End Speech Recognition in English and Mandarin


Abstract

_WA.png"><figcaption></figcaption></figure><p id="0ef1"><b>Input:</b> log-spectrograms of power-normalised audio clips, calculated on 20 ms windows. <b>Output:</b> the alphabet of each language. <b>Inference:</b> CTC models paired with a language model trained on a bigger corpus of text.</p><h2 id="2d82">Batch Normalisation for Deep RNNs</h2><p id="a0ec"><b>Objective:</b> To keep networks trainable with gradient descent as their size and depth increase.</p><p id="bb50"><b>Two ways to apply it:</b></p><figure id="0999"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*hJE5Ov6ZNa8eQXpHGe1t7g.png"><figcaption>Way 1</figcaption></figure><ol><li>Insert a BatchNorm transformation, B(·), immediately before every non-linearity. <b>Not effective.</b></li><li>Batch normalise only the vertical connections. For each hidden unit, compute the mean and variance statistics over all items in the mini-batch and over the length of the sequence.</li></ol><figure id="926d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*4fKHgD_2QysKvvs8g5hpAQ.png"><figcaption>Way 2</figcaption></figure><h2 id="e921">SortaGrad</h2><p id="332a"><b>Objective: </b>Make training more stable. It also accelerates training and results in better generalisation.</p><p id="7c44">Use the length of the utterance as a heuristic for difficulty and train on the shorter (easier) utterances first.</p><p id="8559"><i>In the first training epoch,</i> iterate through mini-batches in the training set in increasing order of the length of the longest utterance in the mini-batch.</p><p id="4ef3"><i>After the first epoch,</i> training reverts to a random order over mini-batches.</p><h2 id="0f02">GRU vs LSTM</h2><p id="a83a">GRUs and LSTMs reach similar accuracy for the same number of parameters, but GRUs are faster to train and less likely to diverge.</p><h2 id="89e6">Frequency Convolutions</h2><p id="4529"><b>Objective:</b> Model spectral variance due to speaker variability more concisely than is possible with large fully connected networks.</p><p id="7d34">Multiple layers of time-and-frequency domain (2D) convolution do better than one layer.</p><h2 id="75ff">Lookahead Convolution and Unidirectional Models</h2><p id="ce07"><b>Objective: </b>Give a forward-only model a small, controllable amount of future context. Bidirectional RNN models are challenging to deploy in an online, low-latency setting because they cannot stream the transcription process as the utterance arrives from the user. However, models with only forward recurrences routinely perform worse than similar bidirectional models.</p><p id="2f98">The lookahead convolution layer learns weights to linearly combine each neuron’s activations τ time-steps into the future, which lets us control the amount of future context needed.</p><figure id="c360"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*PkSrMzk5JQO16kiJLMsGXA.png"><figcaption></figcaption></figure><h2 id="de7c">Adaptation to Mandarin</h2><ul><li>The only architectural changes made to the networks are due to the characteristics of the Chinese character set.</li><li>Use a character-level language model in Mandarin, as words are not usually segmented in text.</li></ul></article></body>

Batch Normalisation for Deep RNNs

Objective: To keep networks trainable with gradient descent as their size and depth increase.

Two ways to apply it:

[Figure: Way 1]

1. Insert a BatchNorm transformation, B(·), immediately before every non-linearity. Not effective.

2. Batch normalise only the vertical connections. For each hidden unit, compute the mean and variance statistics over all items in the mini-batch and over the length of the sequence.

[Figure: Way 2]
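
The second way can be sketched roughly as below. This is a minimal NumPy illustration, not the paper's implementation. It assumes equal-length (unpadded) sequences, a plain ReLU non-linearity, and training-time statistics only. The names `sequence_wise_batch_norm`, `rnn_layer`, `W`, `U`, `gamma` and `beta` are hypothetical.

```python
import numpy as np


def sequence_wise_batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise x of shape (batch, time, hidden).

    The mean and variance of each hidden unit are taken over all items
    in the mini-batch AND over all time steps (axes 0 and 1).
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def rnn_layer(inputs, W, U, gamma, beta):
    """Simple recurrent layer with batch norm on the vertical connection only.

    inputs: (batch, time, in_dim), W: (in_dim, hidden), U: (hidden, hidden).
    Assumption: ReLU non-linearity and a zero initial hidden state.
    """
    batch, time, _ = inputs.shape
    hidden = U.shape[0]
    # Vertical (input-to-hidden) connection, normalised once for the whole sequence.
    vertical = sequence_wise_batch_norm(inputs @ W, gamma, beta)
    h = np.zeros((batch, hidden))
    outputs = np.empty((batch, time, hidden))
    for t in range(time):
        # The horizontal (recurrent) term h @ U is left un-normalised.
        h = np.maximum(0.0, vertical[:, t] + h @ U)
        outputs[:, t] = h
    return outputs
```

With variable-length utterances, the padded frames would need to be masked out of the statistics, and inference would need stored running averages. Both are left out here.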

SortaGrad

Objective: Make training more stable. It also accelerates training and results in better generalisation.

Use the length of the utterance as a heuristic for difficulty and train on the shorter (easier) utterances first.

In the first training epoch, iterate through mini-batches in the training set in increasing order of the length of the longest utterance in the mini-batch.

After the first epoch, training reverts to a random order over mini-batches.
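
The ordering rule above could look roughly like this in plain Python. It is a sketch under assumptions: mini-batches are already formed, each utterance is a sequence whose `len()` is its length, and `sortagrad_order` is a hypothetical helper name.

```python
import random


def sortagrad_order(batches, epoch, rng=None):
    """Return the mini-batches in the order to visit them for this epoch.

    Epoch 0: increasing length of the longest utterance in each mini-batch.
    Later epochs: a random order over mini-batches.
    The input list is not modified.
    """
    if epoch == 0:
        return sorted(batches, key=lambda batch: max(len(u) for u in batch))
    if rng is None:
        rng = random.Random()
    shuffled = list(batches)
    rng.shuffle(shuffled)
    return shuffled
```

A training loop would call `sortagrad_order(batches, epoch)` at the start of each epoch and iterate over the result.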

Lookahead Convolution and Unidirectional Models

Objective: Give a forward-only model a small, controllable amount of future context. Bidirectional RNN models are challenging to deploy in an online, low-latency setting because they cannot stream the transcription process as the utterance arrives from the user. However, models with only forward recurrences routinely perform worse than similar bidirectional models.

The lookahead convolution layer learns weights to linearly combine each neuron’s activations τ time-steps into the future, which lets us control the amount of future context needed.
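
A minimal NumPy sketch of such a layer for a single utterance follows. It assumes the output for neuron i at time t is the sum over j = 0..τ of W[i, j] * h[t + j, i], with zeros past the end of the utterance. The function name `lookahead_conv` and the zero padding are assumptions, not details from the post.

```python
import numpy as np


def lookahead_conv(h, W):
    """Combine each neuron's current and future activations.

    h: (time, d) activations of the layer below.
    W: (d, tau + 1) learned weights, one row per neuron.
    Returns r of shape (time, d) with r[t, i] = sum_j W[i, j] * h[t + j, i].
    """
    time, d = h.shape
    context = W.shape[1]  # tau + 1 steps: the current one plus tau future ones
    # Assumption: zero-pad the end so the last frames still see tau "future" steps.
    padded = np.vstack([h, np.zeros((context - 1, d))])
    r = np.zeros((time, d))
    for j in range(context):
        # Each neuron only mixes its own activations, never other neurons'.
        r += padded[j:j + time] * W[:, j]
    return r
```

A larger τ gives the model more future context, but it must wait for τ more frames before emitting an output.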


Papers with Code - Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese…

Model Architecture

Batch Normalisation for Deep RNNs

SortaGrad

GRU vs LSTM

Frequency Convolutions

Lookahead Convolution and Unidirectional Models

Adaptation to Mandarin