Chun-kit Ho



ML Paper Challenge Day 18 — Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Day 18: 2020.04.29
Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Category: Model/Deep Learning/Speech Recognition

Model Architecture

Input: log-spectrograms of power-normalised audio clips, calculated on 20ms windows
Output: the alphabet of each language
Inference: CTC models paired with a language model trained on a bigger corpus of text
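As a rough illustration of the input features, here is a minimal NumPy sketch of a log-spectrogram over 20ms windows. Only the 20ms window comes from the notes above; the 16 kHz sample rate, the 10ms hop, the Hann window, and the unit-mean-power normalisation are assumptions made for this example, and the function name `log_spectrogram` is hypothetical.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, window_ms=20, hop_ms=10, eps=1e-10):
    """Return a (frames, freq_bins) log power spectrogram of a 1-D signal.

    Assumes the clip is at least one window long.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Power-normalise the clip to unit mean power (one possible convention, assumed here).
    signal = signal / np.sqrt(np.mean(signal ** 2) + eps)

    win = int(sample_rate * window_ms / 1000)  # 20 ms -> 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)     # assumed 10 ms stride
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)                   # assumed window function

    # Slice the clip into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop:i * hop + win] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)

# Usage: one second of a 1 kHz tone.
t = np.arange(16000) / 16000
spec = log_spectrogram(np.sin(2 * np.pi * 1000 * t))
```

Each row of `spec` is one time step fed to the network, and each column is one frequency bin.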

Batch Normalisation for Deep RNNs

Objective: To keep training with gradient descent effective as the size and depth of the network increase

2 ways to apply:

  1. Way 1: Insert a BatchNorm transformation, B(·), immediately before every non-linearity. This was not effective.
  2. Way 2: Batch normalise only the vertical connections. For each hidden unit, compute the mean and variance statistics over all items in the mini-batch and over the length of the sequence.
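The second approach can be sketched in NumPy as follows. This is a simplified illustration, not the paper's implementation: it assumes every sequence in the mini-batch has the same length (no padding mask), uses a plain ReLU as the non-linearity, and omits the learnable scale and shift. The names `sequence_batch_norm` and `recurrent_layer` are hypothetical.

```python
import numpy as np

def sequence_batch_norm(x, eps=1e-5):
    """x: (batch, time, hidden). Normalise each hidden unit over batch AND time."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def recurrent_layer(x, W, U):
    """Simple RNN layer where only the vertical (input-to-hidden) term is normalised.

    x: (batch, time, in_dim), W: (in_dim, hidden), U: (hidden, hidden).
    """
    batch, time, _ = x.shape
    vertical = sequence_batch_norm(x @ W)   # B(W h) computed once for the whole sequence
    h = np.zeros((batch, U.shape[0]))
    outputs = []
    for t in range(time):
        # The recurrent term h @ U is deliberately left un-normalised.
        h = np.maximum(0.0, vertical[:, t] + h @ U)
        outputs.append(h)
    return np.stack(outputs, axis=1)

# Usage with random data.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 7, 3))
W = rng.normal(size=(3, 5))
U = rng.normal(size=(5, 5)) * 0.1
normed = sequence_batch_norm(x @ W)
out = recurrent_layer(x, W, U)
```

Because the statistics are pooled over time as well as over the mini-batch, each hidden unit gets a single mean and variance per mini-batch rather than one per time step.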

SortaGrad

Objective: Make training more stable, accelerate it, and improve generalisation

Use the length of the utterance as a heuristic for difficulty and train on the shorter (easier) utterances first.

In the first training epoch, iterate through mini-batches in the training set in increasing order of the length of the longest utterance in the mini-batch.

After the first epoch, training reverts to a random order over mini-batches.
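The ordering rule can be sketched in plain Python. How the mini-batches are formed is not stated in the notes above; this sketch assumes utterances are first sorted by length and then chunked, so that each mini-batch holds utterances of similar length. The function name `sortagrad_batches` and its parameters are hypothetical.

```python
import random

def sortagrad_batches(utterances, batch_size, epoch, length=len, seed=0):
    """Return the list of mini-batches for a given epoch (0-based)."""
    # Assumption: group utterances of similar length into the same mini-batch.
    ordered = sorted(utterances, key=length)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    if epoch == 0:
        # First epoch: increasing order of the longest utterance in each mini-batch.
        batches.sort(key=lambda batch: max(length(u) for u in batch))
    else:
        # Later epochs: random order over mini-batches.
        random.Random(seed + epoch).shuffle(batches)
    return batches

# Usage: strings stand in for utterances, with string length as utterance length.
utts = ["aaaa", "a", "aaa", "aa", "aaaaaa", "aaaaa"]
first = sortagrad_batches(utts, batch_size=2, epoch=0)
later = sortagrad_batches(utts, batch_size=2, epoch=1)
```

Only the order of the mini-batches changes between epochs here; their contents stay the same.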

GRU vs LSTM

GRU and LSTM reach similar accuracy for the same number of parameters, but the GRUs are faster to train and less likely to diverge.

Frequency Convolutions

Objective: Model spectral variance due to speaker variability more concisely than is possible with large fully connected networks

Multiple layers of time-and-frequency-domain (2D) convolution do better than a single layer.

Lookahead Convolution and Unidirectional Models

Objective: Bidirectional RNN models are challenging to deploy in an online, low-latency setting because they cannot stream the transcription process as the utterance arrives from the user. However, models with only forward recurrences routinely perform worse than similar bidirectional models.

The lookahead convolution layer learns weights to linearly combine each neuron’s activations τ time-steps into the future, which lets us control the amount of future context needed.
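A minimal NumPy sketch of such a layer is given below. It assumes one weight vector of length τ + 1 per neuron (covering the current step plus τ future steps) and zero-padding past the end of the utterance; both choices, and the name `lookahead_conv`, are assumptions for illustration.

```python
import numpy as np

def lookahead_conv(h, W):
    """h: (time, units) activations; W: (units, tau + 1) per-unit weights.

    Returns r with r[t, i] = sum over j of W[i, j] * h[t + j, i],
    treating activations beyond the last time step as zero.
    """
    h = np.asarray(h, dtype=np.float64)
    W = np.asarray(W, dtype=np.float64)
    time, units = h.shape
    tau = W.shape[1] - 1
    padded = np.vstack([h, np.zeros((tau, units))])  # zero-pad the future
    r = np.zeros_like(h)
    for j in range(tau + 1):
        # Each unit only mixes its own activations across time, never other units.
        r += padded[j:j + time] * W[:, j]
    return r

# Usage: 4 time steps, 2 units, tau = 1.
h = np.arange(8.0).reshape(4, 2)
W = np.array([[1.0, 0.5],   # unit 0: current step plus half of the next step
              [1.0, 0.0]])  # unit 1: current step only
r = lookahead_conv(h, W)
```

Since the output at step t depends on at most τ future steps, a unidirectional model using this layer only needs to wait τ steps before emitting a result.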

Adaptation to Mandarin

  • The only architectural changes we make to our networks are due to the characteristics of the Chinese character set.
  • Use a character-level language model for Mandarin, as words are not usually segmented in text.
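To illustrate why a character-level model suits unsegmented text, here is a small pure-Python sketch of character bigram counting with add-alpha smoothing. The bigram order, the smoothing, and the helper names are assumptions chosen for brevity, not the language model used in the paper.

```python
from collections import Counter

def char_bigram_counts(sentences):
    """Count character bigrams, with <s> and </s> marking sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        chars = ["<s>"] + [c for c in sentence if not c.isspace()] + ["</s>"]
        counts.update(zip(chars, chars[1:]))
    return counts

def bigram_prob(counts, prev, cur, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(cur | prev)."""
    prev_total = sum(n for (p, _), n in counts.items() if p == prev)
    return (counts[(prev, cur)] + alpha) / (prev_total + alpha * vocab_size)

# Usage: whitespace splitting yields a single token, so the model works on characters.
sentences = ["今天天气好", "今天很好"]
whitespace_tokens = sentences[0].split()
counts = char_bigram_counts(sentences)
vocab_size = len({cur for (_, cur) in counts})
p = bigram_prob(counts, "今", "天", vocab_size)
```

No word segmenter is needed: the vocabulary is simply the set of characters seen in the text.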