Fine tune GloVe embeddings using Mittens

After 2013, word embeddings became really popular, even outside the NLP community. Word2vec and GloVe belong to the family of static word embeddings. Then came a series of dynamic embeddings: ELMo, BERT, RoBERTa, ALBERT, XLNet. All of these embeddings depend on context words. In this post, let’s see how we can fine-tune static embeddings.
Have you ever had a really small dataset, wanted to apply static word embeddings to it, and run into the following problems?
- Dataset vocabulary is not present in pretrained model
- Unable to train the whole model from the dataset because it’s too small
The solution is to load the pretrained model and fine-tune it on the new dataset, so that the unseen vocabulary is also added to the model.
Why can’t we fine-tune word2vec:
Gensim is the most widely used library for word2vec, and fine-tuning its embeddings has some issues. The embeddings of the words in the new dataset are trained without any changes to the old embeddings. This results in a discrepancy between the pretrained embeddings and the new ones.
fastText also does not provide a fine-tuning feature.
Fine-tuning GloVe
Mittens is a Python library for fine-tuning GloVe embeddings. The process has three simple steps: loading the pretrained model, building the co-occurrence matrix of the new dataset, and training the new embeddings.
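For intuition about what the last step optimizes: as far as I understand the Mittens paper (Dingwall and Potts, 2018), the objective is the usual GloVe loss plus a penalty that keeps every word that has a pretrained vector close to it. The notation below is a sketch and not taken from this post. X is the co-occurrence matrix, f is GloVe’s weighting function, w_i and \tilde{w}_j are the word and context vectors, \hat{w}_i = w_i + \tilde{w}_i is the learned embedding, r_i is the pretrained vector, R is the set of words that have one, and \mu sets the strength of the penalty.

```latex
J_{\text{Mittens}}
  = \sum_{i,j=1}^{V} f(X_{ij})
      \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
  + \mu \sum_{i \in R} \left\lVert \hat{w}_i - r_i \right\rVert^2
```

With \mu = 0 this reduces to plain GloVe. Words outside R are shaped by the first term only.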
Load pretrained model
Mittens needs the pretrained model to be loaded as a dictionary, so let’s do just that. Get the pretrained model from https://nlp.stanford.edu/projects/glove
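Each line of a GloVe text file holds a token followed by its vector components, separated by single spaces. Here is a minimal sketch of that layout and of the token-to-vector dictionary we are after. It uses a made-up two-line file with 3-dimensional vectors; the file name toy_glove.txt and the values are purely illustrative, and plain lists stand in for NumPy arrays.

```python
import os
import tempfile

# Made-up sample in the GloVe text layout: token, then space-separated floats.
sample = "the 0.1 0.2 0.3\ncat -0.5 0.0 1.5\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_glove.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Parse each line into token -> list of floats.
    toy_glove = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip("\n").split(" ")
            toy_glove[token] = [float(v) for v in values]

print(toy_glove["cat"])  # [-0.5, 0.0, 1.5]
```

The function below does the same job with csv.reader and NumPy arrays. Passing csv.QUOTE_NONE matters there because the GloVe vocabulary contains quote characters as tokens.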

import csv
import numpy as np

def glove2dict(glove_filename):
    # Each line is a token followed by its space-separated vector values.
    with open(glove_filename, encoding='utf-8') as f:
        reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
        embed = {line[0]: np.array(list(map(float, line[1:])))
                 for line in reader}
    return embed

glove_path = "glove.6B.50d.txt"
pre_glove = glove2dict(glove_path)

Data pre-processing
Let’s do some pre-processing on the dataset before building the co-occurrence matrix of the words.
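
from nltk.corpus import brown  # needs nltk.download('brown') once
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sw = set(ENGLISH_STOP_WORDS)  # a set gives fast membership tests
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if token.lower() not in sw]
# Tokens that have no vector in the pretrained GloVe dictionary
oov = [token for token in brown_nonstop if token not in pre_glove]

We have used the Brown corpus as a sample dataset, and oov holds the tokens that are not present in the pretrained GloVe vocabulary. The co-occurrence matrix is built from these OOV words. The matrix is sparse, but the dense array that gets passed to Mittens takes O(n^2) space for n vocabulary words, so it sometimes helps to filter out the really rare OOV words to save space. This is an optional step.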
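To get a feel for the quadratic growth, here is a rough back-of-the-envelope estimate of the dense matrix size. It assumes 8 bytes per entry (as for a 64-bit integer or float array), and the helper name dense_matrix_bytes is just for illustration.

```python
def dense_matrix_bytes(vocab_size, bytes_per_entry=8):
    """Bytes needed for a dense vocab_size x vocab_size array."""
    return vocab_size * vocab_size * bytes_per_entry

for n in (1_000, 10_000, 50_000):
    gib = dense_matrix_bytes(n) / 1024**3
    print(f"{n:>6} words -> {gib:.2f} GiB")
```

A 10,000-word vocabulary already needs 800,000,000 bytes, and growing the vocabulary tenfold multiplies that by one hundred.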

from collections import Counter

def get_rareoov(xdict, val):
    # Return the tokens that occur at most val times
    return [k for (k, v) in Counter(xdict).items() if v <= val]

oov_rare = get_rareoov(oov, 1)
corp_vocab = list(set(oov) - set(oov_rare))

Remove those rare OOV words, if needed, and prepare the dataset:

rare = set(oov_rare)
brown_tokens = [token for token in brown_nonstop if token not in rare]
brown_doc = [' '.join(brown_tokens)]
# If you skipped the rare-word filtering, use the full OOV vocabulary instead:
# corp_vocab = list(set(oov))

Building co-occurrence matrix:
We need a word-word co-occurrence matrix, not the usual term-document matrix. sklearn’s CountVectorizer transforms the documents into a document-term count matrix X. The matrix product X^T X then gives the word-word co-occurrence matrix.
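
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)
Xc = X.T @ X             # word-word co-occurrence counts
Xc.setdiag(0)            # ignore a word's co-occurrence with itself
coocc_ar = Xc.toarray()  # dense array for Mittens
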
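To make the X^T X step concrete, here is a small sketch in plain Python (no sklearn) on a made-up three-document corpus. The helper names doc_term_counts and cooccurrence are illustrative and not part of any library.

```python
from collections import Counter

def doc_term_counts(docs, vocab):
    """One row of term counts per document, columns ordered as in vocab."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[w] for w in vocab])
    return rows

def cooccurrence(rows, n_terms):
    """Compute X^T X with the diagonal left at zero."""
    cooc = [[0] * n_terms for _ in range(n_terms)]
    for row in rows:
        for i in range(n_terms):
            for j in range(n_terms):
                if i != j:
                    cooc[i][j] += row[i] * row[j]
    return cooc

vocab = ["cat", "dog", "fish"]
docs = ["cat dog cat", "dog fish", "cat"]

X = doc_term_counts(docs, vocab)   # [[2, 1, 0], [0, 1, 1], [1, 0, 0]]
C = cooccurrence(X, len(vocab))    # [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
print(C)
```

Here cat and dog get a count of 2 because the first document contains cat twice and dog once. When the whole corpus is passed as a single document, as with brown_doc, X has a single row, so each off-diagonal entry is simply the product of the two words’ total counts.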
Fine-tuning the Mittens model
To install Mittens, try

pip install -U mittens

Check out the full documentation for more info. Just instantiate the model and run the fit function.
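
from mittens import Mittens

# n must match the dimensionality of the pretrained vectors (50 for glove.6B.50d)
mittens_model = Mittens(n=50, max_iter=1000)
new_embeddings = mittens_model.fit(
    coocc_ar,
    vocab=corp_vocab,
    initial_embedding_dict=pre_glove)

Save the model as a pickle for future use.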

import pickle

newglove = dict(zip(corp_vocab, new_embeddings))
with open("repo_glove.pkl", "wb") as f:
    pickle.dump(newglove, f)
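To reuse the saved vectors later, the pickle can be loaded back and layered over the pretrained dictionary. This is a minimal sketch with made-up 2-dimensional vectors standing in for pre_glove and newglove; the file name toy_glove.pkl and all values are illustrative, and plain lists stand in for NumPy arrays.

```python
import os
import pickle
import tempfile

# Stand-ins for the real dictionaries; values are made up for illustration.
pre_glove_toy = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
newglove_toy = {"mittens": [0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_glove.pkl")

    # Save the fine-tuned vectors.
    with open(path, "wb") as f:
        pickle.dump(newglove_toy, f)

    # Later: load them back.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Fine-tuned entries take precedence over pretrained ones on key clashes.
combined = {**pre_glove_toy, **loaded}
print(sorted(combined))  # ['cat', 'dog', 'mittens']
```

The combined dictionary then covers both the original GloVe vocabulary and the words added from the new dataset.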





