Ching (Chingis)


Abstract

ver, due to the implementation, it is much faster than the original implementation.</li></ul><div id="2e26" class="link-block"> <a href="https://arxiv.org/abs/2205.14135"> <div> <div> <h2>FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</h2> <div><h3>Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*jiDvS7q1nAv4ziO-)"></div> </div> </div> </a> </div><p id="4f78"><b>“Linformer: Self-Attention with Linear Complexity”</b></p><p id="343b">The paper discusses the challenges of training and deploying large transformer models for long sequences due to the high computational and memory costs of the standard self-attention mechanism. The authors propose a new self-attention mechanism called Linformer that approximates the self-attention matrix as a low-rank matrix. This new mechanism reduces the overall complexity of self-attention from quadratic to linear in both time and space.</p><ul><li>They hypothesize that self-attention maps are low-rank matrices (provided long sequences). 
Therefore, projecting sequences into smaller spaces won’t result in major information loss.</li></ul><div id="5ea4" class="link-block"> <a href="https://arxiv.org/abs/2006.04768"> <div> <div> <h2>Linformer: Self-Attention with Linear Complexity</h2> <div><h3>Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*U2NQgZZH87Ez1q6c)"></div> </div> </div> </a> </div><p id="4e65"><b>Local Windowed Attention</b></p><ul><li>The following paper empirically demonstrates that language models might not need full global attention in all layers.</li><li>In fact, a transformer needs local attention in the bottom layers, with the top layers reserved for global attention to integrate the findings of previous layer.</li></ul><div id="6d72" class="link-block"> <a href="https://aclanthology.org/2020.acl-main.672/"> <div> <div> <h2>Do Transformers Need Deep Long-Range Memory?</h2> <div><h3>Jack Rae, Ali Razavi. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 
2020.</h3></div> <div><p>aclanthology.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*pxOGC0glXa26vZqm)"></div> </div> </div> </a> </div><div id="d927" class="link-block"> <a href="https://arxiv.org/abs/2004.05150"> <div> <div> <h2>Longformer: The Long-Document Transformer</h2> <div><h3>Transformer-based models are unable to process long sequences due to their self-attention operation, which scales…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*EgfQMGxficBEOGf_)"></div> </div> </div> </a> </div><h2 id="3b58">BERT variants</h2><p id="7162"><b>Note: reading the following papers will get you familiar with pretraining strategies and architectural details that can be leveraged in the future when you will be building your own models.</b></p><p id="e5e9"><b>RoBERTa: A Robustly Optimized BERT Pretraining Approach</b></p><ul><li>RoBERTa is trained on a large corpus of unlabeled text from the internet, similar to BERT.</li><li>RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.</li><li>It uses a masked language modeling objective, where it predicts missing words in a sentence. RoBERTa also incorporates a next sentence prediction task during pre-training.</li><li>It has a larger training dataset and longer training time compared to BERT, which leads to improved performance.</li><li>RoBERTa outperforms previous models on a wide range of natural language pro


cessing tasks, including text classification, named entity recognition, and question answering.</li></ul><div id="7006" class="link-block"> <a href="https://arxiv.org/abs/1907.11692"> <div> <div> <h2>RoBERTa: A Robustly Optimized BERT Pretraining Approach</h2> <div><h3>Language model pretraining has led to significant performance gains but careful comparison between different approaches…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*Z0FLWtWl0EyxVTy6)"></div> </div> </div> </a> </div><p id="eee0"><b>ALBERT: A Lite BERT for Self-supervised Learning of Language Representations</b></p><ul><li>ALBERT uses repeating layers which results in a small memory footprint yet computationally remains similar to a BERT.</li><li>They also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs.</li><li>Next sentence prediction is replaced by a sentence ordering prediction.</li></ul><div id="ccab" class="link-block"> <a href="https://arxiv.org/abs/1909.11942"> <div> <div> <h2>ALBERT: A Lite BERT for Self-supervised Learning of Language Representations</h2> <div><h3>Increasing model size when pretraining natural language representations often results in improved performance on…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*fjpsjqaDK6yl4IvL)"></div> </div> </div> </a> </div><p id="ef68"><b>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</b></p><ul><li>Compared to RoBERTa models, which provide results with improved performances, DistilBERT aims to reduce computation time. 
To compress the model size, DistilBERT applies the “teacher-student” framework also referred to as knowledge distillation where a larger model or the “teacher” network is trained and the knowledge is passed on to the smaller model also known as the “student” network.</li><li>DistilBERT retains 97% performance of BERT with 40% fewer parameters and faster inference time.</li></ul><div id="3927" class="link-block"> <a href="https://arxiv.org/abs/1910.01108"> <div> <div> <h2>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</h2> <div><h3>As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP)…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*0pYMcKE6bDsadQDO)"></div> </div> </div> </a> </div><h1 id="3885">Some Last Words</h1><p id="53e0">As we look ahead to the year 2023, the field of NLP is expected to witness continued growth and innovation. By studying the papers mentioned in this blog, aspiring NLP researchers and practitioners can gain valuable insights into the foundational concepts, state-of-the-art models, and emerging trends in the field.</p><p id="35e7">I might have missed some good papers but I believe these papers are important not only to understand the model architectures but also pretraining strategies and attention mechanisms. If you have any papers to suggest, please feel free to share them in the comment section.</p><p id="f5f3">I will be writing part 2 soon where I will be extending this list up until we reach Large Language Models (LLM). Thank you for your time reading this piece! 
Cheers!</p><h2 id="54ec">WRITER at MLearning.ai // Code Interpreter // Animate Midjourney</h2><div id="493a" class="link-block"> <a href="https://readmedium.com/mlearning-ai-submission-suggestions-b51e2b130bfb"> <div> <div> <h2>Mlearning.ai Submission Suggestions</h2> <div><h3>How to become a writer on Mlearning.ai</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*6xCb1sNpjadaSBuVLPTFQQ.png)"></div> </div> </div> </a> </div></article></body>

Carefully Curated Papers to Study NLP in 2023 for Beginners (Part 1)

Natural Language Processing (NLP) has evolved rapidly over the last decade, with groundbreaking research and advancements across its subdomains. In this blog, I list the papers that are, in my opinion, must-reads if you’re studying NLP in 2023 and want to stay at the forefront of this exciting field.

Fundamentals

Linear Classifier: An Often-Forgotten Baseline for Text Classification

  • This paper does not propose a novel architecture. Instead, it shows that the simplest baselines, like a linear classifier on TF-IDF features, can be competitive with, and on some datasets outperform, well-established models like BERT.
  • This paper is good for beginners to realize that bigger models do not guarantee the best results. In fact, over-parameterization is a real concern, and many works tackle it with techniques such as knowledge distillation and pruning.
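To make the baseline concrete, here is a minimal sketch of a linear classifier on TF-IDF features, using only the Python standard library. The toy sentences, the perceptron update rule, and the helper names (`fit_idf`, `tfidf`, `train_perceptron`, `predict`) are my own assumptions for illustration; the paper itself may use different linear models and tooling.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0
            for term, count in df.items()}

def tfidf(doc, idf):
    """Sparse TF-IDF vector as a {term: weight} dict; unseen terms are skipped."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}

def predict(vec, weights, bias):
    """Linear decision rule: +1 if the score is positive, otherwise -1."""
    score = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1 if score > 0 else -1

def train_perceptron(vectors, labels, epochs=10):
    """Classic perceptron: update the weights only on misclassified examples."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, label in zip(vectors, labels):
            if predict(vec, weights, bias) != label:
                for term, value in vec.items():
                    weights[term] = weights.get(term, 0.0) + label * value
                bias += label
    return weights, bias

# Toy sentiment data (hypothetical): +1 is positive, -1 is negative.
train_docs = ["great movie loved it", "wonderful acting great plot",
              "terrible movie hated it", "awful plot boring acting"]
train_labels = [1, 1, -1, -1]

idf = fit_idf(train_docs)
weights, bias = train_perceptron([tfidf(d, idf) for d in train_docs], train_labels)
```

The whole model is one weight per vocabulary term plus a bias, which is why it trains in milliseconds and makes a useful sanity check before reaching for a large model.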

Attention is All You Need

This paper introduced the Transformer architecture, which revolutionized the field of sequence modeling. The self-attention mechanism proposed in this paper replaced recurrent neural networks (RNNs) and convolutional neural networks (CNNs) as the go-to methods for many NLP tasks.
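The core of self-attention is scaled dot-product attention, softmax(QKᵀ/√d)V. The sketch below is a plain-Python version for a single head, with no batching, masking, or learned projections; the function names are my own.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

Every query is compared with every key, so the score matrix has one entry per pair of tokens. That quadratic cost is what the attention papers later in this list try to reduce.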

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • This paper introduced the BERT (Bidirectional Encoder Representations from Transformers) model, which laid the foundation for the modern era of NLP.
  • The architecture is based on Transformer’s encoder architecture, so it shouldn’t be difficult after understanding Transformers.
  • It demonstrated the effectiveness of pre-training a deep bidirectional Transformer model on a large corpus and then fine-tuning it on downstream NLP tasks, including question answering, sentiment analysis, and named entity recognition.
  • Outside of that, understanding BERT is fundamental for sub-fields like representation learning, self-supervised learning, and multi-modal learning in NLP.

Attention

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  • The FlashAttention algorithm reorders the attention computation and uses classic techniques such as tiling and recomputation to greatly increase speed and reduce memory use. Memory drops from quadratic to linear in the sequence length, while the amount of computation remains quadratic.
  • It computes exact attention, so its outputs match standard attention up to floating-point rounding. The speedup comes from the IO-aware implementation, which cuts reads and writes between GPU high-bandwidth memory and on-chip SRAM.
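The tiling idea can be sketched in plain Python for a single query: visit the keys and values one block at a time while keeping a running maximum and a running softmax denominator, so the full row of attention scores is never stored. This is only an illustration of the online-softmax trick under my own naming; the real FlashAttention is a fused GPU kernel, and nothing here reflects its actual code.

```python
import math

def attention_row_reference(q, K, V):
    """Standard attention for one query: materializes every score first."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total
            for j in range(len(V[0]))]

def attention_row_tiled(q, K, V, block_size=2):
    """Same result, but keys/values are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])  # unnormalized output accumulator
    for start in range(0, len(K), block_size):
        k_block = K[start:start + block_size]
        v_block = V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to rounding, which is the sense in which the method is exact rather than approximate.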

Linformer: Self-Attention with Linear Complexity

The paper discusses the challenges of training and deploying large transformer models for long sequences due to the high computational and memory costs of the standard self-attention mechanism. The authors propose a new self-attention mechanism called Linformer that approximates the self-attention matrix as a low-rank matrix. This new mechanism reduces the overall complexity of self-attention from quadratic to linear in both time and space.

  • They hypothesize that self-attention maps are approximately low-rank. Therefore, projecting the keys and values along the sequence-length dimension into a smaller, fixed size won’t result in major information loss.
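A rough sketch of the projection idea: multiply the keys and values by projection matrices of shape (k, n) so the attention map shrinks from n × n to n × k. In Linformer these projections are learned; the random matrices from the hypothetical `rand_matrix` helper below are stand-ins, and the function names are my own.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random stand-in for learned weights or inputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def linformer_attention(Q, K, V, E, F):
    """Attention with keys/values projected from n rows down to k rows."""
    d = len(Q[0])
    K_proj = matmul(E, K)  # (k, n) @ (n, d) -> (k, d)
    V_proj = matmul(F, V)  # (k, n) @ (n, d) -> (k, d)
    scores = [[s / math.sqrt(d) for s in row]
              for row in matmul(Q, transpose(K_proj))]  # (n, k), not (n, n)
    weights = softmax_rows(scores)
    return matmul(weights, V_proj), weights  # output is (n, d)
```

With k fixed, the size of the attention map grows linearly with the sequence length n instead of quadratically.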

Local Windowed Attention

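  • “Do Transformers Need Deep Long-Range Memory?” (Rae and Razavi, ACL 2020) empirically demonstrates that language models might not need long-range attention in all layers.
  • In fact, short-range local attention is sufficient in the bottom layers, with longer-range attention reserved for the top layers to integrate the findings of the previous layers. “Longformer: The Long-Document Transformer” applies a related idea, combining sliding-window attention with a small number of global tokens.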
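Local attention is usually expressed as a mask over the score matrix. Below is a minimal sketch with a symmetric window; the helper name `local_attention_mask` and the symmetric (encoder-style) window are my own assumptions, and real implementations avoid building the full n × n mask at all.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

def allowed_pairs(mask):
    """Number of (query, key) pairs that survive the mask."""
    return sum(sum(row) for row in mask)
```

For a fixed window, each token attends to at most 2 × window + 1 positions, so the number of allowed pairs grows linearly with the sequence length.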

BERT variants

Note: reading the following papers will familiarize you with pretraining strategies and architectural details that you can leverage later when building your own models.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

  • RoBERTa is trained on a large corpus of unlabeled text from the internet, similar to BERT.
  • RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
  • It uses a masked language modeling objective, where it predicts masked-out tokens in a sentence. Unlike BERT, RoBERTa drops the next sentence prediction task and re-samples the mask dynamically during pre-training instead of fixing it once during preprocessing.
  • It has a larger training dataset and longer training time compared to BERT, which leads to improved performance.
  • At publication, RoBERTa matched or exceeded previous models on the GLUE, RACE, and SQuAD benchmarks, which cover text classification, reading comprehension, and question answering.
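The masked language modeling objective can be sketched as a function that hides a random subset of tokens and keeps the originals as prediction targets. This is a simplified illustration: the helper name `mask_tokens` is my own, and it always substitutes `[MASK]`, whereas BERT-style masking also sometimes keeps the token or swaps in a random one. Calling it with a fresh random state each time a sentence is seen gives a different mask on every pass, which is the idea behind RoBERTa’s dynamic masking.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (inputs, labels): masked tokens and the targets to predict."""
    n_masked = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # Only masked positions contribute to the loss; the rest are ignored.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels
```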

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

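  • ALBERT shares parameters across layers (the same layer is applied repeatedly), which results in a much smaller memory footprint, yet the computation remains similar to a BERT model of the same depth and width. It also factorizes the embedding matrix to cut parameters further.
  • The authors also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs.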
  • Next sentence prediction is replaced by a sentence ordering prediction.
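A back-of-the-envelope calculation shows why repeating one layer saves so much memory. The estimate below counts only the weight matrices of a standard Transformer encoder layer (four attention projections plus a feed-forward block with inner size 4 × hidden) and ignores biases, layer norms, and embeddings, so treat the numbers as rough; the function names are my own.

```python
def encoder_layer_params(hidden):
    """Approximate weight count of one Transformer encoder layer."""
    attention = 4 * hidden * hidden            # Q, K, V, and output projections
    feed_forward = 2 * hidden * (4 * hidden)   # two linear layers, inner size 4*hidden
    return attention + feed_forward

def encoder_params(hidden, num_layers, share_layers):
    """With sharing, one set of layer weights is reused at every depth."""
    unique_layers = 1 if share_layers else num_layers
    return unique_layers * encoder_layer_params(hidden)
```

For hidden size 768 and 12 layers, this gives roughly 85M encoder weights without sharing and roughly 7M with sharing. The forward pass still runs 12 layers either way, which is why the compute cost stays about the same.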

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

  • Whereas models such as RoBERTa aim for better accuracy, DistilBERT aims to reduce model size and computation time. To compress the model, DistilBERT applies the “teacher-student” framework, also known as knowledge distillation: a larger “teacher” network is trained first, and its knowledge is then passed on to a smaller “student” network.
  • DistilBERT retains 97% of BERT’s language-understanding performance with 40% fewer parameters and about 60% faster inference.
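The teacher-student idea is commonly implemented as a soft-target loss: the student is trained to match the teacher’s temperature-softened output distribution. The sketch below shows only that term, with my own function names and an arbitrary temperature; DistilBERT combines a distillation loss of this kind with other training signals.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's soft predictions against the teacher's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

The loss is smallest when the student reproduces the teacher’s distribution exactly, so the student learns from the teacher’s relative confidence across all classes, not only from its top prediction.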

Some Last Words

As we look ahead to the year 2023, the field of NLP is expected to witness continued growth and innovation. By studying the papers mentioned in this blog, aspiring NLP researchers and practitioners can gain valuable insights into the foundational concepts, state-of-the-art models, and emerging trends in the field.

I might have missed some good papers but I believe these papers are important not only to understand the model architectures but also pretraining strategies and attention mechanisms. If you have any papers to suggest, please feel free to share them in the comment section.

I will be writing part 2 soon, where I will extend this list up to Large Language Models (LLMs). Thank you for taking the time to read this piece! Cheers!
