Carefully Curated Papers to Study NLP in 2023 for Beginners (Part 1)


ver, due to the implementation, it is much faster than the original implementation.</li></ul><div id="2e26" class="link-block"> <a href="https://arxiv.org/abs/2205.14135"> <div> <div> <h2>FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</h2> <div><h3>Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*jiDvS7q1nAv4ziO-)"></div> </div> </div> </a> </div><p id="4f78"><b>“Linformer: Self-Attention with Linear Complexity”</b></p><p id="343b">The paper addresses the high computational and memory cost of training and deploying large transformer models on long sequences, a cost that stems from the standard self-attention mechanism. The authors propose Linformer, a self-attention mechanism that approximates the self-attention matrix with a low-rank matrix. This reduces the complexity of self-attention from quadratic to linear in the sequence length, in both time and space.</p><ul><li>They hypothesize that self-attention maps are low-rank for long sequences, so projecting the keys and values into a shorter sequence won’t result in major information loss.</li></ul><div id="5ea4" class="link-block"> <a href="https://arxiv.org/abs/2006.04768"> <div> <div> <h2>Linformer: Self-Attention with Linear Complexity</h2> <div><h3>Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*U2NQgZZH87Ez1q6c)"></div> </div> </div> </a> </div><p id="4e65"><b>Local Windowed Attention</b></p><ul><li>The first paper below empirically demonstrates that language models might not need full global attention in all layers.</li><li>In fact, local (short-range) attention is enough in the bottom layers, while the top layers can be reserved for global attention that integrates the findings of the previous layers.</li></ul><div id="6d72" class="link-block"> <a href="https://aclanthology.org/2020.acl-main.672/"> <div> <div> <h2>Do Transformers Need Deep Long-Range Memory?</h2> <div><h3>Jack Rae, Ali Razavi. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 
2020.</h3></div> <div><p>aclanthology.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*pxOGC0glXa26vZqm)"></div> </div> </div> </a> </div><div id="d927" class="link-block"> <a href="https://arxiv.org/abs/2004.05150"> <div> <div> <h2>Longformer: The Long-Document Transformer</h2> <div><h3>Transformer-based models are unable to process long sequences due to their self-attention operation, which scales…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*EgfQMGxficBEOGf_)"></div> </div> </div> </a> </div><h2 id="3b58">BERT variants</h2><p id="7162"><b>Note: reading the following papers will familiarize you with pretraining strategies and architectural details that you can draw on later when building your own models.</b></p><p id="e5e9">“<b>RoBERTa: A Robustly Optimized BERT Pretraining Approach</b>”</p><ul><li>Like BERT, RoBERTa is pretrained on a large corpus of unlabeled text from the internet.</li><li>RoBERTa has the same architecture as BERT, but uses a byte-level BPE tokenizer (the same as GPT-2) and a different pretraining scheme.</li><li>It uses a masked language modeling objective, in which it predicts masked-out tokens in a sentence. Unlike BERT, RoBERTa drops the next sentence prediction task and applies dynamic masking during pre-training.</li><li>It is trained on a larger dataset and for longer than BERT, which leads to improved performance.</li><li>At the time of publication, RoBERTa achieved state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks, which cover text classification, reading comprehension, and question answering.</li></ul><div id="7006" class="link-block"> <a href="https://arxiv.org/abs/1907.11692"> <div> <div> <h2>RoBERTa: A Robustly Optimized BERT Pretraining Approach</h2> <div><h3>Language model pretraining has led to significant performance gains but careful comparison between different approaches…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*Z0FLWtWl0EyxVTy6)"></div> </div> </div> </a> </div><p id="eee0">“<b>ALBERT: A Lite BERT for Self-supervised Learning of Language Representations</b>”</p><ul><li>ALBERT shares parameters across layers (in effect, the same layer is repeated), which gives it a small memory footprint, although its computational cost remains similar to that of BERT.</li><li>The authors also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show that it consistently helps downstream tasks with multi-sentence inputs.</li><li>Concretely, next sentence prediction is replaced by sentence order prediction.</li></ul><div id="ccab" class="link-block"> <a href="https://arxiv.org/abs/1909.11942"> <div> <div> <h2>ALBERT: A Lite BERT for Self-supervised Learning of Language Representations</h2> <div><h3>Increasing model size when pretraining natural language representations often results in improved performance on…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*fjpsjqaDK6yl4IvL)"></div> </div> </div> </a> </div><p id="ef68">“<b>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</b>”</p><ul><li>Whereas models such as RoBERTa aim for better results, DistilBERT aims to reduce computation time. 
To compress the model, DistilBERT applies knowledge distillation, also referred to as the “teacher-student” framework: a larger “teacher” network is trained first, and its knowledge is then passed on to a smaller “student” network.</li><li>DistilBERT retains 97% of BERT’s language understanding performance while being 40% smaller and 60% faster at inference.</li></ul><div id="3927" class="link-block"> <a href="https://arxiv.org/abs/1910.01108"> <div> <div> <h2>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</h2> <div><h3>As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP)…</h3></div> <div><p>arxiv.org</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*0pYMcKE6bDsadQDO)"></div> </div> </div> </a> </div><h1 id="3885">Some Last Words</h1><p id="53e0">As we look ahead to the year 2023, the field of NLP is expected to witness continued growth and innovation. By studying the papers mentioned in this blog, aspiring NLP researchers and practitioners can gain valuable insights into the foundational concepts, state-of-the-art models, and emerging trends in the field.</p><p id="35e7">I might have missed some good papers, but I believe these papers are important for understanding not only model architectures but also pretraining strategies and attention mechanisms. If you have any papers to suggest, please feel free to share them in the comment section.</p><p id="f5f3">I will be writing part 2 soon, where I will extend this list until we reach Large Language Models (LLMs). Thank you for taking the time to read this piece! 
Cheers!</p></article></body>

Natural Language Processing (NLP) has evolved rapidly over the last decade, with groundbreaking research and advancements in various subdomains. In this blog, I list the papers that are, in my opinion, must-reads if you’re studying NLP in 2023: a curated selection that every enthusiast should study to stay at the forefront of this exciting field.


Fundamentals

Linear Classifier: An Often-Forgotten Baseline for Text Classification

Large-scale pre-trained language models such as BERT are popular solutions for text classification. Due to the superior…

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an…

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations…
