Two minutes NLP — 11 word embeddings models you should know

Summary

The article provides an overview of 11 influential word embedding models in NLP, categorizing them into context-independent and context-dependent types, and emphasizes their significance in deep learning tasks.

Abstract

The article "Two minutes NLP — 11 word embeddings models you should know" outlines a variety of word embedding models that have been pivotal in the field of Natural Language Processing (NLP). These models are crucial for providing input features to tasks such as sequence labeling and text classification. The models are divided into two main categories: context-independent and context-dependent. Context-independent models like Bag-of-words, TF-IDF, Word2Vec, GloVe, and FastText assign a unique representation to each word, disregarding the context in which they appear. In contrast, context-dependent models, including ELMo, CoVe, BERT, XLM, RoBERTa, and ALBERT, generate word representations that vary according to the surrounding text, thereby capturing nuanced meanings. The article also highlights the evolution of these models, from traditional machine learning approaches to more sophisticated deep learning techniques, and their impact on advancing NLP applications.

Opinions

The author suggests that the role of word embeddings is critical in deep models for downstream NLP tasks.
The taxonomy provided by the author indicates a clear progression in the complexity and capability of word embedding models over time.
The author implies that context-dependent models, particularly those based on transformers like BERT and RoBERTa, represent the cutting edge in word representation learning.
By mentioning the ability of FastText to handle rare and out-of-vocabulary words, the author points out the practical advantages of certain models in real-world applications.
The author encourages further exploration into NLP by inviting readers to follow NLPlanet for more insights and information in the field.

TF-IDF, Word2Vec, GloVe, FastText, ELMo, CoVe, BERT, RoBERTa, etc.

Taxonomy of word embeddings. Image by the author.

Word embeddings provide the input features that deep models rely on for downstream tasks like sequence labeling and text classification. Several word embedding methods have been proposed in the past decade.

Context-independent

Each word gets a single, fixed representation that does not depend on the context in which the word appears.

Context-independent without machine learning

Bag-of-words: a text, such as a sentence or a document, is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
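As a minimal sketch of the bag-of-words idea, the snippet below counts words with Python's standard library. The helper name bag_of_words and the lowercase-and-split tokenization are assumptions made for illustration, not part of any particular library.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring grammar and word order."""
    # Naive tokenization (an assumption): lowercase, then split on whitespace.
    return Counter(text.lower().split())

bow_a = bag_of_words("the dog chased the cat")
bow_b = bag_of_words("the cat chased the dog")

print(bow_a["the"])    # multiplicity is kept
print(bow_a == bow_b)  # word order is discarded
```

The two sentences mean different things but produce the same bag, which is exactly the information this representation throws away.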

TF-IDF: scores how important a term is to a document within a corpus by multiplying the term's frequency in that document (TF) by the term's inverse document frequency (IDF), so terms that appear in many documents are weighted down.
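The TF times IDF product can be sketched in a few lines. This version assumes tf = count / document length and idf = log(N / df); libraries often use smoothed variants, so the exact numbers may differ from what a given toolkit reports. The function name tf_idf is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumes tf = count / document length and idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and its score vanishes, while "sat" (one document) outranks "cat" (two documents).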

Context-independent with machine learning

Word2Vec: shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2Vec can use either of two model architectures: continuous bag-of-words (CBOW) or continuous skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.
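The difference between the two architectures is easiest to see in the training examples they consume. The sketch below only builds those examples; it does not train a network. The helper names and the window size are assumptions chosen for illustration.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words, target word): predict the word from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(center word, context word): predict each context word from the center."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # context words as input, current word as target
print(skipgram[:3])  # current word as input, one context word as target
```

CBOW yields one example per position, while skip-gram yields one example per (center, context) pair, so skip-gram produces more training examples from the same text.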

GloVe (Global Vectors for Word Representation): training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit linear substructures in the word vector space, such as analogies that show up as consistent vector differences.
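The "aggregated global word-word co-occurrence statistics" that GloVe trains on can be sketched as a simple counting pass over the corpus. This is only the counting step, with unweighted counts inside a symmetric window as a simplifying assumption; it is not the GloVe training objective, and the function name is hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair co-occurs within a window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["ice is cold", "steam is hot", "ice is solid"]
counts = cooccurrence_counts(corpus, window=2)

print(counts[("ice", "is")])
print(counts[("ice", "hot")])
```

The counts are accumulated over the whole corpus before any vectors are learned, which is what distinguishes this setup from the local, window-at-a-time training of Word2Vec.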

FastText: unlike GloVe, it embeds words by treating each word as being composed of character n-grams rather than as an indivisible whole. This lets it learn better representations for rare words and also build representations for out-of-vocabulary words from their n-grams.
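A minimal sketch of the character n-gram decomposition follows. It assumes the word is wrapped in "<" and ">" boundary markers and that n ranges from 3 to 6; the helper name char_ngrams is hypothetical and this is not the FastText library's API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

trigrams = char_ngrams("where", 3, 3)
print(trigrams)

# A word never seen in training still shares n-grams with known words,
# so a vector for it can be assembled from those shared pieces.
shared = set(char_ngrams("nowhere", 3, 3)) & set(trigrams)
print(sorted(shared))
```

Because a word's vector is built from its n-gram vectors, an unseen word like "nowhere" is not a dead end: it overlaps with n-grams already learned from "where".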

Context-dependent

Unlike context-independent word embeddings, context-dependent methods learn different embeddings for the same word based on its context.

Context-dependent and RNN based

ELMo (Embeddings from Language Models): learns contextualized word representations based on a neural language model with a character-based encoding layer and two BiLSTM layers.

CoVe (Contextualized Word Vectors): uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors.

Context-dependent and transformer-based

BERT (Bidirectional Encoder Representations from Transformers): transformer-based language representation model trained on a large cross-domain corpus. It is pretrained with a masked language modeling objective, which predicts words that are randomly masked in a sequence, together with a next-sentence-prediction task for learning the associations between sentences.
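The masking step of the masked language modeling objective can be sketched as follows. This is a simplification: every selected token is replaced with "[MASK]", whereas BERT's actual recipe sometimes keeps the token or swaps in a random one. The function name, the mask_prob value, and the seed are assumptions for illustration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return (masked input, labels).

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a model would only be scored on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)
print(labels)
```

Because the model must recover a masked word from both its left and right neighbors, the representation it learns for each position depends on the whole surrounding sentence.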

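XLM (Cross-lingual Language Model): it’s a transformer pretrained using one of three objectives: causal language modeling (next token prediction), a BERT-like masked language modeling objective, or a translation language modeling objective that applies masking to pairs of parallel sentences.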

RoBERTa (Robustly Optimized BERT Pretraining Approach): it builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

ALBERT (A Lite BERT for Self-supervised Learning of Language Representations): it presents parameter-reduction techniques to lower memory consumption and increase the training speed of BERT.
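One of ALBERT's parameter-reduction techniques, factorized embedding parameterization, can be illustrated with simple arithmetic: instead of one vocabulary-by-hidden-size embedding matrix, a small embedding is projected up to the hidden size. The sizes below (30,000 tokens, hidden size 768, embedding size 128) are assumed for illustration, and the function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Number of parameters in the token embedding layer.

    Without factorization: one vocab_size x hidden_size matrix.
    With factorization: vocab_size x embedding_size, followed by an
    embedding_size x hidden_size projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

full = embedding_params(30_000, 768)
factorized = embedding_params(30_000, 768, embedding_size=128)

print(full)        # 23040000
print(factorized)  # 3938304
```

Under these assumed sizes the embedding layer shrinks by roughly a factor of six, which gives a sense of how such techniques lower memory consumption.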

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
