_WA.png"><figcaption></figcaption></figure>
<p id="0ef1"><b>Input:</b> log-spectrograms of power-normalised audio clips, calculated on 20 ms windows <b>Output: </b>characters from the alphabet of each language <b>Inference:</b> CTC models paired with a language model trained on a bigger corpus of text</p>
<h2 id="2d82">Batch Normalisation for Deep RNNs</h2>
<p id="a0ec"><b>Objective:</b> Keep networks trainable with gradient descent as their size and depth increase</p>
<p id="bb50"><b>2 Ways to apply:</b></p>
<figure id="0999"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*hJE5Ov6ZNa8eQXpHGe1t7g.png"><figcaption>Way 1</figcaption></figure>
<ol><li>Insert a BatchNorm transformation, B(·), immediately before every non-linearity &rarr; <b>Not effective</b></li><li>Batch normalise only the vertical connections. For each hidden unit, compute the mean and variance statistics over all items in the mini-batch and over the length of the sequence.</li></ol>
<figure id="926d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*4fKHgD_2QysKvvs8g5hpAQ.png"><figcaption>Way 2</figcaption></figure>
<h2 id="e921">SortaGrad</h2>
<p id="332a"><b>Objective: </b>Make training more stable. This curriculum also accelerates training and results in better generalisation.</p>
<p id="7c44">Use the length of the utterance as a heuristic for difficulty and train on the shorter (easier) utterances first.</p>
<p id="8559"><i>In the first training epoch,</i> iterate through mini-batches in the training set in increasing order of the length of the longest utterance in the mini-batch.</p>
<p id="4ef3"><i>After the first epoch,</i> training reverts to a random order over mini-batches.</p>
<h2 id="0f02">GRU vs LSTM</h2>
<p id="a83a">GRUs and LSTMs reach similar accuracy for the same number of parameters, but GRUs are faster to train and less likely to diverge.</p>
<h2 id="89e6">Frequency Convolutions</h2>
<p id="4529"><b>Objective:</b> Model spectral variance due to speaker variability more concisely than is possible with large fully connected networks</p>
<p id="7d34">Multiple layers of time-and-frequency domain (2D) convolution do better than one layer.</p>
<h2 id="75ff">Lookahead Convolution and Unidirectional Models</h2>
<p id="ce07"><b>Objective: </b>Bidirectional RNN models are challenging to deploy in an online, low-latency setting because they cannot stream the transcription process as the utterance arrives from the user. However, models with only forward recurrences routinely perform worse than similar bidirectional models.</p>
<p id="2f98">The lookahead convolution layer learns weights to linearly combine each neuron’s activations τ time-steps into the future, which lets us control the amount of future context needed.</p>
<figure id="c360"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*PkSrMzk5JQO16kiJLMsGXA.png"><figcaption></figcaption></figure>
<h2 id="de7c">Adaptation to Mandarin</h2>
<ul><li>The only architectural changes we make to our networks are due to the characteristics of the Chinese character set.</li><li>Use a character-level language model in Mandarin, as words are not usually segmented in text.</li></ul></article></body>
Day 18: 2020.04.29
Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Category: Model/Deep Learning/Speech Recognition
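As a rough illustration of the SortaGrad schedule summarised in these notes, the sketch below orders mini-batches by their longest utterance in the first epoch and shuffles them afterwards. The helper names (`make_batches`, `sortagrad_epochs`) and the choice to bucket utterances of similar length into the same mini-batch are assumptions made for this example, not details taken from the paper.

```python
import random


def make_batches(utterances, batch_size):
    """Group utterances into mini-batches of similar length (an assumed bucketing step)."""
    ordered = sorted(utterances, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def sortagrad_epochs(utterances, batch_size, num_epochs, seed=0):
    """Yield (epoch, batches) pairs following a SortaGrad-style schedule."""
    rng = random.Random(seed)
    batches = make_batches(utterances, batch_size)
    for epoch in range(num_epochs):
        if epoch == 0:
            # First epoch: increasing order of the longest utterance in each mini-batch.
            order = sorted(batches, key=lambda batch: max(len(u) for u in batch))
        else:
            # Later epochs: random order over mini-batches.
            order = batches[:]
            rng.shuffle(order)
        yield epoch, order


if __name__ == "__main__":
    # Each utterance is stood in for by a list of dummy frames.
    toy_utterances = [[0] * n for n in [5, 1, 9, 3, 7, 2]]
    for epoch, batches in sortagrad_epochs(toy_utterances, batch_size=2, num_epochs=3):
        print(epoch, [[len(u) for u in batch] for batch in batches])
```

Only the order of the mini-batches changes between epochs here; the composition of each mini-batch is fixed once, which is a simplification chosen for this sketch.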

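The second way of applying batch normalisation (statistics per hidden unit, taken over every item in the mini-batch and every time-step) can be sketched in plain Python as follows. The function name `sequence_wise_batch_norm` and the nested-list layout `[batch][time][hidden]` are assumptions for illustration; the learnable scale and shift parameters and the running statistics used at inference time are deliberately left out.

```python
import math


def sequence_wise_batch_norm(x, eps=1e-5):
    """Normalise x, laid out as [batch][time][hidden], per hidden unit.

    The mean and variance of each hidden unit are computed over all items
    in the mini-batch and all time-steps of every sequence.
    """
    num_hidden = len(x[0][0])
    count = sum(len(seq) for seq in x)  # total number of frames in the mini-batch
    means = [
        sum(frame[k] for seq in x for frame in seq) / count
        for k in range(num_hidden)
    ]
    variances = [
        sum((frame[k] - means[k]) ** 2 for seq in x for frame in seq) / count
        for k in range(num_hidden)
    ]
    return [
        [
            [(frame[k] - means[k]) / math.sqrt(variances[k] + eps)
             for k in range(num_hidden)]
            for frame in seq
        ]
        for seq in x
    ]


if __name__ == "__main__":
    # Two sequences of different lengths, two hidden units each.
    batch = [[[1.0, 10.0], [2.0, 20.0]],
             [[3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]]
    for seq in sequence_wise_batch_norm(batch):
        print(seq)
```

In a recurrent layer this normalisation would be applied to the input (vertical) contribution only, leaving the recurrent (horizontal) contribution untouched.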

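The lookahead convolution described in these notes can be sketched as a per-neuron weighted sum over the current frame and the next τ frames. In this example the function name `lookahead_convolution`, the `[time][d]` list layout, and the treatment of frames past the end of the utterance as zeros are all assumptions made for illustration.

```python
def lookahead_convolution(h, weights):
    """Combine each neuron's activations over the current and next tau time-steps.

    h:       activations laid out as [time][d]
    weights: per-neuron weights laid out as [d][tau + 1]
    Returns r with the same [time][d] layout, where
    r[t][i] = sum over j of weights[i][j] * h[t + j][i].
    """
    num_steps = len(h)
    out = []
    for t in range(num_steps):
        row = []
        for i, neuron_weights in enumerate(weights):
            total = 0.0
            for j, w in enumerate(neuron_weights):
                if t + j < num_steps:  # frames past the end are treated as zero (assumption)
                    total += w * h[t + j][i]
            row.append(total)
        out.append(row)
    return out


if __name__ == "__main__":
    activations = [[1.0, 1.0], [2.0, 0.0], [3.0, -1.0]]
    tau_one_weights = [[1.0, 0.5], [0.0, 1.0]]  # tau = 1: current frame plus one future frame
    print(lookahead_convolution(activations, tau_one_weights))
```

The length of each weight row sets the amount of future context: a row of τ + 1 weights means the output at time t needs frames up to t + τ, so latency grows with τ.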