avatarStephen Ho

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

4283

Abstract

d. But I actually like a lot of other packages too. For WMD problem, I first tried out <a href="http://cvxopt.org/">cvxopt</a> first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in <a href="https://radimrehurek.com/gensim/">gensim</a>. But I later found out I should use <a href="https://pythonhosted.org/PuLP/">PuLP</a>, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.</p><p id="8add">An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this <a href="http://people.brunel.ac.uk/~mastjjb/jeb/or/morelp.html">link</a>, with the <a href="https://github.com/stephenhky/PyWMD/blob/master/PULPDemo.ipynb">Jupyter Notebook</a> demonstrating this.</p><h1 id="1722">Implementation of WMD using PuLP</h1><p id="dbcd">The demonstration can be found in the <a href="https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb">Jupyter Notebook</a>.</p><p id="a1e4">Load the necessary packages:</p><div id="5e70"><pre><span class="hljs-title">from</span> itertools <span class="hljs-keyword">import</span> product <span class="hljs-title">from</span> collections <span class="hljs-keyword">import</span> defaultdict

<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np <span class="hljs-title">from</span> scipy.spatial.distance <span class="hljs-keyword">import</span> euclidean <span class="hljs-keyword">import</span> pulp <span class="hljs-keyword">import</span> gensim</pre></div><p id="ec98">Then define the functions the gives the BOW document vectors:</p><div id="e969"><pre>def tokens_to_fracdict(tokens): cntdict = <span class="hljs-built_in">defaultdict</span>(lambda : <span class="hljs-number">0</span>) for token in tokens: cntdict[token] += <span class="hljs-number">1</span> totalcnt = <span class="hljs-built_in">sum</span>(cntdict.<span class="hljs-built_in">values</span>()) return {token: <span class="hljs-built_in">float</span>(cnt)/totalcnt for token, cnt in cntdict.<span class="hljs-built_in">items</span>()}</pre></div><p id="6d09">Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a <code>pulp.LpProblem</code> class:</p><div id="4af8"><pre>def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None): all_tokens = list(<span class="hljs-keyword">set</span>(first_sent_tokens+second_sent_tokens)) wordvecs <span class="hljs-comment">= {token: wvmodel[token] for token in all_tokens}</span>

first_sent_buckets <span class="hljs-comment">= tokens_to_fracdict(first_sent_tokens)</span>
second_sent_buckets <span class="hljs-comment">= tokens_to_fracdict(second_sent_tokens)</span>

T <span class="hljs-comment">= pulp.LpVariable.dicts(</span><span class="hljs-comment">'T_matrix'</span><span class="hljs-comment">, list(product(all_tokens, all_tokens)), lowBound=0)</span>

prob <span class="hljs-comment">= pulp.LpProblem(</span><span class="hljs-comment">'WMD'</span><span class="hljs-comment">, sense=pulp.LpMinimize)</span>
prob <span class="hljs-comment">+= pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])</span>
                    for <span class="hljs-comment">token1, token2 in product(all_tokens, all_tokens)])</span>
for <span class="hljs-comment">token2 in second_sent_buckets:</span>
    prob <span class="hljs-comment">+= pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]</span>
for <span class="hljs-comment">token1 in first_sent_buckets:</span>
    prob <span class="hljs-comment">+= pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]</span>

if <span class="hljs-comment">lpFile!=None:</span>
    prob.writeLP(lpFile)

prob.solve()

return <span class="hljs-comment">prob</span></pre></div><p id="889d">To extract the value, just run <code>pulp.value(prob.objective)<

Options

/code></p><p id="ce4c">We use Google Word2Vec. Refer the</p><figure id="4ed8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*HVqp0yJgAP0KCDPZ."><figcaption></figcaption></figure><p id="a0dc">matrices in the <a href="https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb">Jupyter Notebook</a>. Running this by a few examples:</p><ol><li>document1 = President, talk, Chicago document2 = President, speech, Illinois WMD = 2.88587622936</li><li>document1 = physician, assistant document2 = doctor WMD = 2.8760048151</li><li>document1 = physician, assistant document2 = doctor, assistant WMD = 1.00465738773 (compare with example 2!)</li><li>document1 = doctors, assistant document2 = doctor, assistant WMD = 1.02825379372 (compare with example 3!)</li><li>document1 = doctor, assistant document2 = doctor, assistant WMD = 0.0 (totally identical; compare with example 3!)</li></ol><p id="e0b7">There are more examples in the notebook.</p><h1 id="24c9">Conclusion</h1><p id="96ea">WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.</p><p id="daab">This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.</p><p id="d4d2">Example codes can be found in my Github repository: <a href="https://github.com/stephenhky/PyWMD">stephenhky/PyWMD</a>.</p><p id="8217">P.S.:</p><ol><li>Original post: <a href="https://datawarrior.wordpress.com/2017/08/16/word-movers-distance-as-a-linear-programming-problem/">https://datawarrior.wordpress.com/2017/08/16/word-movers-distance-as-a-linear-programming-problem/</a></li><li>This same code has been implemented in the Python package <a href="http://shorttext.readthedocs.io/en/latest/">shorttext</a>. Tutorial: <a href="http://shorttext.readthedocs.io/en/latest/tutorial_metrics.html">http://shorttext.readthedocs.io/en/latest/tutorial_metrics.html</a></li><li>Feature image adapted from the original paper by Kusner <i>et. al.</i></li></ol><p id="4672">References:</p><ul><li>Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger, “From Word Embeddings To Document Distances,” <i>Proceedings of the 32nd International Conference on Machine Learning</i>, PMLR 37:957–966 (2015). [<a href="http://proceedings.mlr.press/v37/kusnerb15.html">PMLR</a>]</li><li>Github: mkusner/wmd. [<a href="https://github.com/mkusner/wmd">Github</a>]</li><li>Kwan-Yuet Ho, “Toying with Word2Vec,” <i>Everything About Data Analytics</i>, WordPress (2015). [<a href="https://datawarrior.wordpress.com/2015/10/25/codienerd-2-toying-with-word2vec/">WordPress</a>]</li><li>Kwan-Yuet Ho, “On Wasserstein GAN,”<i>Everything About Data Analytics</i>, WordPress (2017). [<a href="https://datawarrior.wordpress.com/2017/03/30/on-wasserstein-gan/">WordPress</a>]</li><li>Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN,” arXiv:1701.07875 (2017). [<a href="https://arxiv.org/abs/1701.07875">arXiv</a>]</li><li>Alison L. Gibbs, Francis Edward Su, “On Choosing and Bounding Probability Metrics,” arXiv:math/0209021 (2002) [<a href="https://arxiv.org/abs/math/0209021">arXiv</a>]</li><li>cvxopt: Python Software for Convex Optimization. [<a href="http://cvxopt.org/">HTML</a>]</li><li>gensim: Topic Modeling for Humans. [<a href="https://radimrehurek.com/gensim/">HTML</a>]</li><li>PuLP: Optimization for Python. [<a href="https://pythonhosted.org/PuLP/">PythonHosted</a>]</li><li>Demonstration of PuLP: Github: stephenhky/PyWMD. [<a href="https://github.com/stephenhky/PyWMD/blob/master/PULPDemo.ipynb">Jupyter</a>]</li><li>Implemenation of WMD: Github: stephenhky/PyWMD. [<a href="https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb">Jupyter</a>]</li><li>Github: stephenhky/PyWMD. [<a href="https://github.com/stephenhky/PyWMD">Github</a>]</li></ul></article></body>

Word Mover’s Distance as a Linear Programming Problem

Much about the use of word-embedding models such as Word2Vec and GloVe have been covered. However, how to measure the similarity between phrases or documents? One natural choice is the cosine similarity, as I have toyed with in a previous post. However, it smoothed out the influence of each word. Two years ago, a group in Washington University in St. Louis proposed the Word Mover’s Distance (WMD) in a PMLR paper that captures the relations between words, not simply by distance, but also the “transportation” from one phrase to another conveyed by each word. This Word Mover’s Distance (WMD) can be seen as a special case of Earth Mover’s Distance (EMD), or Wasserstein distance, the one people talked about in Wasserstein GAN. This is better than bag-of-words (BOW) model in a way that the word vectors capture the semantic similarities between words.

Word Mover’s Distance (WMD)

The formulation of WMD is beautiful. Consider the embedded word vectors

, where d is the dimension of the embeddings, and n is the number of words. For each phrase, there is a normalized BOW vector

, and

, where i’s denote the word tokens. The distance between words are the Euclidean distance of their embedded word vectors, denoted by

, where i and j denote word tokens. The document distance, which is WMD here, is defined by

, where T is a n times n matrix. Each element

denote how nuch of word i in the first document (denoted by d) travels to word j in the new document (denoted by d’).

Then the problem becomes the minimization of the document distance, or the WMD, and is formulated as:

given the constraints:

, and

This is essentially a simplified case of the Earth Mover’s distance (EMD), or the Wasserstein distance. (See the review by Gibbs and Su.)

Using PuLP

The WMD is essentially a linear optimization problem. There are many optimization packages on the market, and my stance is that, for those common ones, there are no packages that are superior than others. In my job, I happened to handle a missing data problem, in turn becoming a non-linear optimization problem with linear constraints, and I chose limSolve, after I shop around. But I actually like a lot of other packages too. For WMD problem, I first tried out cvxopt first, which should actually solve the exact same problem, but the indexing is hard to maintain. Because I am dealing with words, it is good to have a direct hash map, or a dictionary. I can use the Dictionary class in gensim. But I later found out I should use PuLP, as it allows indices with words as a hash map (dict in Python), and WMD is a linear programming problem, making PuLP is a perfect choice, considering code efficiency.

An example of using PuLP can be demonstrated by the British 1997 UG Exam, as in the first problem of this link, with the Jupyter Notebook demonstrating this.

Implementation of WMD using PuLP

The demonstration can be found in the Jupyter Notebook.

Load the necessary packages:

from itertools import product
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import euclidean
import pulp
import gensim

Then define the functions the gives the BOW document vectors:

def tokens_to_fracdict(tokens):
    cntdict = defaultdict(lambda : 0)
    for token in tokens:
        cntdict[token] += 1
    totalcnt = sum(cntdict.values())
    return {token: float(cnt)/totalcnt for token, cnt in cntdict.items()}

Then implement the core calculation. Note that PuLP is actually a symbolic computing package. This function return a pulp.LpProblem class:

def word_mover_distance_probspec(first_sent_tokens, second_sent_tokens, wvmodel, lpFile=None):
    all_tokens = list(set(first_sent_tokens+second_sent_tokens))
    wordvecs = {token: wvmodel[token] for token in all_tokens}

    first_sent_buckets = tokens_to_fracdict(first_sent_tokens)
    second_sent_buckets = tokens_to_fracdict(second_sent_tokens)

    T = pulp.LpVariable.dicts('T_matrix', list(product(all_tokens, all_tokens)), lowBound=0)

    prob = pulp.LpProblem('WMD', sense=pulp.LpMinimize)
    prob += pulp.lpSum([T[token1, token2]*euclidean(wordvecs[token1], wordvecs[token2])
                        for token1, token2 in product(all_tokens, all_tokens)])
    for token2 in second_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token1 in first_sent_buckets])==second_sent_buckets[token2]
    for token1 in first_sent_buckets:
        prob += pulp.lpSum([T[token1, token2] for token2 in second_sent_buckets])==first_sent_buckets[token1]

    if lpFile!=None:
        prob.writeLP(lpFile)

    prob.solve()

    return prob

To extract the value, just run pulp.value(prob.objective)

We use Google Word2Vec. Refer the

matrices in the Jupyter Notebook. Running this by a few examples:

  1. document1 = President, talk, Chicago document2 = President, speech, Illinois WMD = 2.88587622936
  2. document1 = physician, assistant document2 = doctor WMD = 2.8760048151
  3. document1 = physician, assistant document2 = doctor, assistant WMD = 1.00465738773 (compare with example 2!)
  4. document1 = doctors, assistant document2 = doctor, assistant WMD = 1.02825379372 (compare with example 3!)
  5. document1 = doctor, assistant document2 = doctor, assistant WMD = 0.0 (totally identical; compare with example 3!)

There are more examples in the notebook.

Conclusion

WMD is a good metric comparing two documents or sentences, by capturing the semantic meanings of the words. It is more powerful than BOW model as it captures the meaning similarities; it is more powerful than the cosine distance between average word vectors, as the transfer of meaning using words from one document to another is considered. But it is not immune to the problem of misspelling.

This algorithm works well for short texts. However, when the documents become large, this formulation will be computationally expensive. The author actually suggested a few modifications, such as the removal of constraints, and word centroid distances.

Example codes can be found in my Github repository: stephenhky/PyWMD.

P.S.:

  1. Original post: https://datawarrior.wordpress.com/2017/08/16/word-movers-distance-as-a-linear-programming-problem/
  2. This same code has been implemented in the Python package shorttext. Tutorial: http://shorttext.readthedocs.io/en/latest/tutorial_metrics.html
  3. Feature image adapted from the original paper by Kusner et. al.

References:

  • Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger, “From Word Embeddings To Document Distances,” Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:957–966 (2015). [PMLR]
  • Github: mkusner/wmd. [Github]
  • Kwan-Yuet Ho, “Toying with Word2Vec,” Everything About Data Analytics, WordPress (2015). [WordPress]
  • Kwan-Yuet Ho, “On Wasserstein GAN,”Everything About Data Analytics, WordPress (2017). [WordPress]
  • Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein GAN,” arXiv:1701.07875 (2017). [arXiv]
  • Alison L. Gibbs, Francis Edward Su, “On Choosing and Bounding Probability Metrics,” arXiv:math/0209021 (2002) [arXiv]
  • cvxopt: Python Software for Convex Optimization. [HTML]
  • gensim: Topic Modeling for Humans. [HTML]
  • PuLP: Optimization for Python. [PythonHosted]
  • Demonstration of PuLP: Github: stephenhky/PyWMD. [Jupyter]
  • Implemenation of WMD: Github: stephenhky/PyWMD. [Jupyter]
  • Github: stephenhky/PyWMD. [Github]
Machine Learning
Word Mover Distance
Text Mining
Word Embedding Model
Wasserstein
Recommended from ReadMedium