Two minutes NLP — How the DeepMind RETRO model decouples reasoning and memorization

Language Models, Retrieval Databases, GPT-3, Jurassic-1, and the Pile

In recent years, significant performance gains in language modeling have been achieved by increasing the number of parameters in Transformer models. This has led to a huge increase in training energy costs and resulted in a generation of large Language Models with 100+ billion parameters. At the same time, large datasets containing trillions of words have been collected to train these models.

The benefits of increasing the number of parameters come from two factors:

More reasoning capabilities in the form of computations at training and inference time.
More memorization of the training data.

DeepMind is exploring how to decouple these aspects, i.e. how to efficiently augment language models with a massive-scale memory without significantly increasing computations. Specifically, DeepMind suggests retrieval from a large text database as a complementary path to scaling language models.

With this goal in mind, DeepMind introduced the Retrieval-Enhanced Transformer (RETRO) model: a language model that predicts the next words by conditioning on document chunks retrieved from a large corpus.
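To make "conditioning on document chunks" more concrete, here is a minimal sketch of how a corpus might be cut into chunks, each paired with the text that follows it. The function name `build_chunk_pairs`, the whitespace tokenization, and the chunk size of 4 are assumptions chosen for readability; the RETRO paper works with 64-token chunks.

```python
from typing import List, Tuple

def build_chunk_pairs(tokens: List[str], chunk_size: int) -> List[Tuple[List[str], List[str]]]:
    """Split tokens into consecutive chunks and pair each chunk
    with the chunk that follows it (its continuation)."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    # The last chunk has no continuation, so it is paired with an empty list.
    return [
        (chunk, chunks[i + 1] if i + 1 < len(chunks) else [])
        for i, chunk in enumerate(chunks)
    ]

# Illustrative text only; whitespace splitting stands in for a real tokenizer.
text = "the quick brown fox jumps over the lazy dog near the river bank"
pairs = build_chunk_pairs(text.split(), chunk_size=4)
for chunk, continuation in pairs:
    print(" ".join(chunk), "->", " ".join(continuation))
```

Each pair is what a retrieval database could store: the chunk is used to find a match, and the continuation is what the model gets to read.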

How RETRO works

Consider the sample query “The 2021 Women’s US Open was won”. A standard language model would predict a plausible continuation using only the knowledge stored in its network parameters. RETRO, instead, looks for similar sequences in the Retrieval Database, retrieves their continuations, and conditions on them to predict a new plausible continuation.

The search for similar sequences is a nearest-neighbor search over BERT embeddings, which are pre-computed for all the text chunks stored in the Retrieval Database.
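The retrieval step can be sketched as follows. This is a toy illustration, not DeepMind's implementation: `embed` is a bag-of-words stand-in for a BERT embedding, `DATABASE` is a hypothetical three-entry Retrieval Database, and the search is brute force, whereas a database with trillions of tokens would need an approximate nearest-neighbor index.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for a BERT embedding: an L2-normalised bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())

# Hypothetical Retrieval Database of (chunk, continuation) pairs.
DATABASE: List[Tuple[str, str]] = [
    ("Emma Raducanu won the 2021 Women's US Open", "beating Leylah Fernandez in the final."),
    ("Daniil Medvedev won the 2021 Men's US Open", "beating Novak Djokovic in the final."),
    ("The recipe calls for two cups of flour", "and a pinch of salt."),
]
# Embeddings are computed once, ahead of time, for every stored chunk.
KEYS = [embed(chunk) for chunk, _ in DATABASE]

def retrieve(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return the k database entries whose chunks are most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(DATABASE)), key=lambda i: cosine(q, KEYS[i]), reverse=True)
    return [DATABASE[i] for i in ranked[:k]]

neighbours = retrieve("The 2021 Women's US Open was won", k=2)
for chunk, continuation in neighbours:
    print(f"{chunk} | {continuation}")
```

The retrieved chunks and continuations are the extra context the model would condition on when predicting the next words.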

Overall architecture of the RETRO model. Image by DeepMind.

By working on texts extracted from the Retrieval Database, RETRO increases the interpretability of model predictions and provides a route for direct interventions to improve the safety of text continuation.
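As a rough illustration of such an intervention, the sketch below removes a faulty entry from a tiny, hypothetical database and shows that the retrieved evidence changes without any retraining. The word-overlap scoring is a simplification standing in for embedding search, and the deliberately wrong entry is invented for the example.

```python
from typing import List, Tuple

Entry = Tuple[str, str]  # (chunk, continuation)

def overlap(a: str, b: str) -> int:
    """Number of distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def best_match(query: str, database: List[Entry]) -> Entry:
    """Return the entry whose chunk shares the most words with the query."""
    return max(database, key=lambda entry: overlap(query, entry[0]))

database: List[Entry] = [
    ("The capital of Australia is", "Sydney."),       # deliberately wrong entry
    ("Australia has its capital in", "Canberra."),
]
query = "The capital of Australia is"

# Interpretability: the retrieved entry can be inspected directly.
before = best_match(query, database)

# Intervention: delete the faulty entry; the model weights are untouched.
database = [entry for entry in database if entry[1] != "Sydney."]
after = best_match(query, database)

print(before[1], "->", after[1])
```

Because the evidence lives in the database rather than in the parameters, inspecting or editing it is an ordinary data operation.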

RETRO performance

RETRO obtains comparable performance to GPT-3 and Jurassic-1 on the Pile dataset (a standard language modeling benchmark), despite using 25× fewer parameters.

On the Pile dataset, a 7.5 billion parameter RETRO model outperforms the 175 billion parameter Jurassic-1 on 10 out of 16 datasets and outperforms the 280 billion parameter Gopher on 9 out of 16 datasets, despite being over an order of magnitude smaller than either.

RETRO, Gopher, and Jurassic-1 performance on the Pile dataset relative to a 7B-parameter baseline without retrieval. Image by DeepMind.
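The size comparison can be checked with a few lines of arithmetic. The parameter counts and win counts below are the ones quoted in this article; the variable names are only illustrative.

```python
RETRO_PARAMS_B = 7.5   # billions of parameters, as quoted above
NUM_DATASETS = 16

# name -> (parameters in billions, datasets on which RETRO wins)
baselines = {
    "Jurassic-1": (175.0, 10),
    "Gopher": (280.0, 9),
}

ratios = {name: size / RETRO_PARAMS_B for name, (size, _) in baselines.items()}
win_share = {name: wins / NUM_DATASETS for name, (_, wins) in baselines.items()}

for name in baselines:
    print(f"{name}: {ratios[name]:.1f}x larger than RETRO, "
          f"RETRO wins on {win_share[name]:.1%} of datasets")
```

Both ratios are above 10, which is what "over an order of magnitude smaller" means here, and RETRO wins on more than half of the datasets against each baseline.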

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
