Two minutes NLP — New DeepMind’s Gopher Language Model performance in a nutshell

Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG

In its new paper, Scaling Language Models: Methods, Analysis & Insights from Training Gopher, DeepMind analyzes Transformer-based language model performance across a wide range of model scales, from models with tens of millions of parameters up to a 280-billion-parameter model called Gopher. The models are evaluated on 152 diverse tasks and achieve state-of-the-art performance on the majority of them.

Gopher comparison with previous language model State of the Art

The figure shows the percentage change in performance metric (higher is better) of Gopher versus state-of-the-art language model performance across 124 tasks, where each bar represents a task. Gopher improves on roughly 80% of the tasks. The best published results it is compared against include GPT-3 (175B parameters), Jurassic-1 (178B parameters), and Megatron-Turing NLG (530B parameters).
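As a rough illustration of what each bar in such a figure measures, here is a minimal sketch of a per-task percentage change and the share of tasks that improve. The task names and scores are made up for the example and are not taken from the paper, and the exact formula DeepMind used is an assumption here.

```python
def percent_change(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent (higher-is-better metric)."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical scores, NOT from the paper: task -> (new model score, previous SOTA score)
scores = {
    "task_a": (0.80, 0.50),
    "task_b": (0.66, 0.60),
    "task_c": (0.45, 0.50),  # a task where the new model is worse
    "task_d": (0.90, 0.75),
}

# One value per task, i.e. one bar per task in a figure like the one described above
changes = {task: percent_change(new, old) for task, (new, old) in scores.items()}

# Fraction of tasks with a positive change
share_improved = sum(change > 0 for change in changes.values()) / len(changes)

for task, change in sorted(changes.items(), key=lambda item: item[1], reverse=True):
    print(f"{task}: {change:+.1f}%")
print(f"Improved on {share_improved:.0%} of tasks")
```

With real per-task scores, the same two steps would give the height of every bar and the "roughly 80%" style summary.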

Gopher displays the most uniform improvement across the reading comprehension, humanities, ethics, STEM, medicine, and fact-checking categories. For common sense reasoning, logical reasoning, and maths, the improvements are much smaller, and several tasks show a deterioration in performance. The general trend is less improvement on reasoning-heavy tasks and a larger, more consistent improvement on knowledge-intensive tasks.
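One way to check a category-level trend like this is to group the per-task percentage changes by category and look at the average change and the number of regressions. The sketch below assumes this simple aggregation. The categories and numbers are hypothetical and are not the paper's results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, percent change) pairs, NOT from the paper
task_changes = [
    ("fact_checking", 40.0),
    ("fact_checking", 30.0),
    ("maths", 5.0),
    ("maths", -3.0),  # a task that got worse
]

# Collect the per-task changes under their category
by_category = defaultdict(list)
for category, change in task_changes:
    by_category[category].append(change)

# Average change and count of tasks with a negative change, per category
summary = {
    category: {
        "mean_change": mean(values),
        "regressions": sum(value < 0 for value in values),
    }
    for category, values in by_category.items()
}

for category, stats in summary.items():
    print(category, stats)
```

A category with a high mean and no regressions corresponds to a "uniform improvement", while a low mean with some regressions matches the pattern described for the reasoning-heavy categories.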

Gopher performance across the 57 tasks in Massive Multitask Language Understanding (MMLU)

Gopher performance across the 57 tasks in MMLU. Image by DeepMind.

MMLU tasks consist of real-world human exams covering a range of academic subjects. Gopher improves over the prior supervised SOTA models by a considerable margin (>30%); however, it is still far from human expert performance.

Gopher's accuracy falls between the 2022 and 2023 forecasts of average SOTA accuracy on the MMLU tasks made by 73 expert human forecasters.

Performance Improvements with Scale

Gopher comparison with smaller language models across 124 tasks. Image by DeepMind.

The figure compares the performance of Gopher (280B parameters) with the best performance among the smaller models, which go up to 7.1B parameters. It shows the percentage change in performance metric (higher is better) across 124 tasks, where each bar represents a task.
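The baseline in this comparison is the best score among the smaller models on each task, which is not necessarily the score of the largest of them. A minimal sketch of that comparison follows. The model labels and scores are hypothetical and are not the paper's data.

```python
# Hypothetical scores, NOT from the paper: task -> {smaller model label: score}
smaller_models = {
    "task_a": {"small": 0.26, "mid": 0.30, "7.1B": 0.40},
    "task_b": {"small": 0.50, "mid": 0.55, "7.1B": 0.52},  # "mid" is best here
}

# Hypothetical scores of the large model on the same tasks
large_model = {"task_a": 0.60, "task_b": 0.66}


def change_vs_best_smaller(task: str) -> float:
    """Percent change of the large model over the best smaller model on `task`."""
    best_small = max(smaller_models[task].values())
    return 100.0 * (large_model[task] - best_small) / best_small


for task in large_model:
    print(f"{task}: {change_vs_best_smaller(task):+.1f}% vs best smaller model")
```

Taking the maximum per task makes the baseline as strong as possible, so a positive change means the large model beats every smaller model on that task.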

In nearly every case, Gopher outperforms the best smaller model. Some of the largest benefits of scale are seen in the Medicine, Science, Technology, Social Sciences, and Humanities task categories, which are also the categories with the greatest improvement over the previous language model SOTA. On the other hand, scale brings a smaller benefit for tasks in the Maths, Logical Reasoning, and Common Sense categories.

The results suggest that scale alone is unlikely to lead to breakthroughs in performance on certain kinds of mathematical or logical reasoning tasks.

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!

Two minutes NLP related posts

Two minutes NLP — 11 word embeddings models you should know

TF-IDF, Word2Vec, GloVe, FastText, ELMO, CoVe, BERT, RoBERTa, etc.

Two minutes NLP — Topic Modeling and Semantic Search with Top2Vec

Top2Vec, Doc2Vec, UMAP, HDBSCAN, and topic vectors

Two minutes NLP — 33 important NLP tasks explained

Information Retrieval, Knowledge Bases, Chatbots, Text Generation, Text-to-Data, Text Reasoning, etc.
