Summary

This article is a guide to fine-tuning GloVe embeddings with the Mittens library in Python.

Abstract

The article discusses word embeddings in natural language processing (NLP), covering static embeddings such as Word2vec and GloVe and dynamic embeddings such as ELMo, BERT, RoBERTa, ALBERT, and XLNet. It highlights two problems with small datasets: part of their vocabulary may be missing from the pre-trained model, and they are too small to train a model from scratch. The proposed solution is to fine-tune the pre-trained model with new data from the dataset using the Mittens library. The process involves three steps: loading the pre-trained model, building a co-occurrence matrix of the new dataset, and training new embeddings. The article provides code snippets for each step and a link to the full code on GitHub.

Bullet points

Word embeddings are popular in NLP, with static embeddings like Word2vec and GloVe and dynamic embeddings like ELMo, BERT, RoBERTa, ALBERT, and XLNet.
Small datasets and vocabulary not present in pre-trained models pose challenges when using pre-trained models.
Fine-tuning pre-trained models with new data from the dataset can solve these challenges.
Mittens is a Python library for fine-tuning GloVe embeddings.
The process involves three steps: loading the pre-trained model, building a co-occurrence matrix of the new dataset, and training new embeddings.
Code snippets are provided for each step, and the full code is available on GitHub.

Fine-tune GloVe embeddings using Mittens

After 2013, word embeddings became really popular, even outside the NLP community. Word2vec and GloVe belong to the family of static word embeddings. Then came a series of dynamic embeddings: ELMo, BERT, RoBERTa, ALBERT, XLNet. Unlike static embeddings, these give a word a vector that depends on its context words. In this post, let's see how we can fine-tune the static embeddings.

Have you ever had a really small dataset and wanted to apply static word embeddings, only to face the following problems?

Some of the dataset's vocabulary is not present in the pretrained model
The dataset is too small to train a whole model from scratch

The solution is to load the pretrained model and fine-tune it with the new data from the dataset, so that the unseen vocabulary is also added to the model.

Why can't we fine-tune word2vec?

Gensim is the most widely used library for word2vec, and fine-tuning its embeddings has some issues. The embeddings of the new dataset's vocabulary are trained without any changes to the old embeddings, which results in a discrepancy between the pretrained embeddings and the new ones.

fastText does not provide fine-tuning features either.

Fine-tuning GloVe

Mittens is a Python library for fine-tuning GloVe embeddings. The process has three simple steps: loading the pretrained model, building the co-occurrence matrix of the new dataset, and training the new embeddings.
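For intuition, the objective that Mittens minimizes can be sketched roughly as below. This is paraphrased from the Mittens paper, not something stated in this post, and the notation is an assumption made here: R is the set of vocabulary words that already have a pretrained vector r_i, and μ is the weight of the penalty.

```latex
J_{\text{Mittens}} = J_{\text{GloVe}} + \mu \sum_{i \in R} \lVert \hat{w}_i - r_i \rVert^2,
\qquad \hat{w}_i = w_i + \tilde{w}_i
```

The first term fits the new co-occurrence counts, and the second keeps words that already have a pretrained vector close to it. Words without a pretrained vector are trained by the GloVe term alone.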

Load pretrained model

Mittens needs the pretrained model to be loaded as a dictionary, so let's do just that. Get the pretrained model from https://nlp.stanford.edu/projects/glove

import csv
import numpy as np

def glove2dict(glove_filename):
    # Each line holds a word followed by its space-separated vector components.
    with open(glove_filename, encoding='utf-8') as f:
        reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
        embed = {line[0]: np.array(list(map(float, line[1:])))
                 for line in reader}
    return embed

glove_path = "glove.6B.50d.txt"
pre_glove = glove2dict(glove_path)
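To check the loader without downloading the full GloVe file, a minimal sketch like the following can help. The two toy vectors and the temporary file name are made up for illustration; only the file layout (a word followed by its space-separated components) mirrors the real GloVe text files.

```python
import csv
import os
import tempfile

import numpy as np


def glove2dict(glove_filename):
    """Read a GloVe-style text file into a {word: vector} dictionary."""
    with open(glove_filename, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=" ", quoting=csv.QUOTE_NONE)
        return {row[0]: np.array(list(map(float, row[1:]))) for row in reader}


# Hypothetical 3-dimensional vectors written in the GloVe text layout.
toy_lines = ["the 0.1 0.2 0.3", "cat -0.5 0.0 0.25"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_glove.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_lines) + "\n")
    toy_glove = glove2dict(toy_path)

print(sorted(toy_glove))       # the words that were loaded
print(toy_glove["cat"].shape)  # every vector has one entry per dimension
```

csv.QUOTE_NONE matters here because the GloVe vocabulary contains quote characters as tokens, which the default csv settings would try to interpret.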

Data pre-processing

Let's do some pre-processing on the dataset before building the co-occurrence matrix of the words.

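from nltk.corpus import brown  # needs nltk.download('brown') once
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sw = set(ENGLISH_STOP_WORDS)  # set membership is faster than list lookup
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if token.lower() not in sw]
oov = [token for token in brown_nonstop if token not in pre_glove]  # repeats kept for counting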
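The same filtering logic can be traced on a handful of tokens. Everything in this sketch is a hypothetical stand-in: a three-word stop list instead of sklearn's, a three-word dictionary instead of the pretrained GloVe vectors, and a made-up sentence instead of the Brown corpus.

```python
# Hypothetical stand-ins for the real objects used in the article.
toy_stop_words = {"the", "is", "a"}
toy_pretrained = {"cat": None, "sat": None, "mat": None}  # only the keys matter here
toy_tokens = ["The", "cat", "sat", "on", "a", "Mittens", "mat", "Mittens"]

# Lowercase first, then drop stop words.
toy_nonstop = [t.lower() for t in toy_tokens if t.lower() not in toy_stop_words]

# Whatever the pretrained dictionary does not know is out of vocabulary (OOV).
# Duplicates are kept on purpose so that frequencies can be counted later.
toy_oov = [t for t in toy_nonstop if t not in toy_pretrained]

print(toy_nonstop)
print(toy_oov)
```

Note that "on" ends up in the OOV list only because the toy dictionary is tiny. With the real GloVe vocabulary, the OOV list tends to contain domain terms, misspellings, and unusual tokens.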

We have used the Brown corpus as a sample dataset, and oov holds the tokens that are not present in the pretrained GloVe vocabulary. The co-occurrence matrix is built from these OOV words. For a vocabulary of n words the matrix is n x n, so it needs O(n^2) space once it is converted to a dense array. Thus the really rare OOV words sometimes have to be filtered out to save space. This is an optional step.

from collections import Counter

def get_rareoov(tokens, val):
    # Words that occur at most `val` times in the token list
    return [k for (k, v) in Counter(tokens).items() if v <= val]

oov_rare = get_rareoov(oov, 1)
corp_vocab = list(set(oov) - set(oov_rare))

Remove those rare OOV words, if needed, and prepare the dataset.

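oov_rare_set = set(oov_rare)
brown_tokens = [token for token in brown_nonstop if token not in oov_rare_set]
brown_doc = [' '.join(brown_tokens)]
corp_vocab = list(set(oov) - oov_rare_set)  # keep the removed rare words out of the vocabulary too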
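A toy run makes the effect of the frequency threshold easier to see. The token list is invented, and get_rare is a renamed copy of the helper above. Sorting the vocabulary is a choice made in this sketch, not part of the original code.

```python
from collections import Counter

# Hypothetical OOV tokens, with repeats, as produced by the pre-processing step.
toy_oov = ["mittens", "glove", "mittens", "retrofit", "glove", "mittens"]


def get_rare(tokens, max_count):
    """Return the tokens that occur at most max_count times."""
    return [tok for tok, count in Counter(tokens).items() if count <= max_count]


toy_rare = get_rare(toy_oov, 1)

# Sorting gives a stable row order for the co-occurrence matrix and embeddings.
toy_vocab = sorted(set(toy_oov) - set(toy_rare))

print(toy_rare)
print(toy_vocab)
```

list(set(...)) can return the words in a different order on every run, so a sorted vocabulary is useful if the matrix or the embeddings are ever saved separately from the word list.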

Building the co-occurrence matrix

We need a word-word co-occurrence matrix, not the usual term-document matrix. sklearn's CountVectorizer transforms the documents into a document-term count matrix X. The matrix product X^T X then gives the word-word co-occurrence matrix.

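from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1,1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)  # document-term counts (sparse)
Xc = (X.T @ X)                   # word-word co-occurrence (sparse)
Xc.setdiag(0)                    # ignore a word co-occurring with itself
coocc_ar = Xc.toarray()          # dense array that is passed to Mittens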
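To see what the product X^T X actually computes, here is a small sketch that builds the count matrix by hand with NumPy instead of CountVectorizer. The vocabulary and the two documents are hypothetical.

```python
import numpy as np

toy_vocab = ["glove", "mittens", "retrofit"]
toy_docs = [
    ["mittens", "glove", "mittens"],
    ["glove", "retrofit"],
]

# Document-term count matrix: one row per document, one column per word.
index = {word: j for j, word in enumerate(toy_vocab)}
X = np.zeros((len(toy_docs), len(toy_vocab)), dtype=int)
for i, doc in enumerate(toy_docs):
    for token in doc:
        X[i, index[token]] += 1

# Entry (a, b) sums count_a * count_b over all documents.
Xc = X.T @ X
np.fill_diagonal(Xc, 0)  # drop self co-occurrence, like Xc.setdiag(0)

print(Xc)
```

Here "glove" and "mittens" get a count of 2 because "mittens" occurs twice in the document they share, while "mittens" and "retrofit" never share a document and stay at 0. When the whole corpus is joined into a single document, as in the snippet above, every pair of vocabulary words co-occurs, and each entry is simply the product of the two word counts. Splitting the corpus into sentences or paragraphs is one way to get a more local notion of co-occurrence.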

Fine-tuning the Mittens model

To install Mittens, run pip install -U mittens. Check out the full documentation for more info. Just instantiate the model and run the fit function.

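from mittens import Mittens

# n is the embedding size and must match the pretrained vectors (50 for glove.6B.50d)
mittens_model = Mittens(n=50, max_iter=1000)
new_embeddings = mittens_model.fit(
    coocc_ar,
    vocab=corp_vocab,
    initial_embedding_dict=pre_glove)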

Save the model as pickle for future use.

import pickle

# Map each vocabulary word to its row in the trained embedding matrix
newglove = dict(zip(corp_vocab, new_embeddings))
with open("repo_glove.pkl", "wb") as f:
    pickle.dump(newglove, f)
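Once the dictionary is pickled, a quick sanity check is to load it back and look at nearest neighbours by cosine similarity. The sketch below uses made-up 2-dimensional vectors and a temporary file. The helper names cosine and nearest are hypothetical and not part of Mittens.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical 2-dimensional embeddings standing in for newglove.
toy_embeddings = {
    "mittens": np.array([1.0, 0.0]),
    "glove": np.array([0.9, 0.1]),
    "retrofit": np.array([0.0, 1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


# Round-trip through a pickle file, as in the saving step.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_glove.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_embeddings, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(nearest("mittens", loaded))
```

With real embeddings, checking a few domain words this way gives a rough idea of whether the fine-tuned vectors place related terms close together.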

Here’s the full code.

Thanks for reading the article. Feel free to contact me via GitHub, Twitter and LinkedIn. Cheers!

Sources:
1. https://github.com/roamanalytics/mittens
2. https://surancy.github.io/co-occurrence-matrix-visualization