A Gentle Introduction to Graph Embeddings

Instead of treating node classification as a traditional machine learning classification task, we can consider using a graph neural network (GNN). Because the links between nodes are given explicitly, each node is no longer classified independently; the model can also leverage graph structure such as the degree of a node. Using graph properties this way assumes that individual nodes are correlated with other similar nodes.

A typical example is a social media network. Imagine how Facebook connects you with somebody else based on which posts you like, where you check in and so on. A graph is capable of representing this kind of relationship, and we can leverage it to train a GNN. Detailed use cases of GNNs will be covered in later stories.

We will explore graph embeddings this time. Graph embeddings, like word embeddings in NLP, use low-dimensional vectors to represent entities so that semantically similar entities are close together. In other words, similar entities (e.g. apple and orange are both fruit) have similar vector representations.

Model

Many researchers have studied how to learn such embeddings. TransE (Bordes et al., 2013), RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016) will be introduced in this section.

TransE

If you are familiar with word2vec (Mikolov et al., 2013), you can think of TransE (Bordes et al., 2013) as similar to word2vec. Given a subject entity (aka head), a relation and an object entity (aka tail), the object entity embedding should be close to the subject entity embedding plus the relation embedding if the triple holds. Otherwise, the subject embedding plus the relation embedding should be far away from the object embedding.

Word2vec Sample: King - Man + Woman ~= Queen (source)
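To make the translation idea concrete, here is a minimal sketch in plain Python. The function name `transe_score`, the toy entities and the 2-dimensional vectors are all invented for illustration. The score is the negative Euclidean distance, which is one common choice for TransE and not necessarily what any particular library does.

```python
import math

def transe_score(subject, relation, obj):
    """Negative L2 distance between (subject + relation) and obj.

    A higher (less negative) score means the triple is more plausible.
    """
    squared = sum((s + r - o) ** 2 for s, r, o in zip(subject, relation, obj))
    return -math.sqrt(squared)

# Toy embeddings, made up for this example only.
paris = [1.0, 2.0]
capital_of = [0.5, -1.0]
france = [1.5, 1.0]   # equals paris + capital_of, so the triple fits perfectly
japan = [-2.0, 3.0]   # far from paris + capital_of

true_score = transe_score(paris, capital_of, france)
false_score = transe_score(paris, capital_of, japan)
print(true_score, false_score)
```

The triple that holds gets the best possible score (zero distance), while the corrupted triple gets a clearly lower score.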

RESCAL

RESCAL (Nickel et al., 2011) uses multiple matrices to represent the relations among entities. Assume that the total number of entities is n and the total number of relations is m; the graph is then stored as a tensor of size n x n x m. If relation k does not hold between entity i and entity j, the corresponding value is set to zero. RESCAL factorizes this tensor into an entity embedding matrix and one full matrix per relation.

Matrices of Entity (E) and Relation (R) (Nickel et al., 2011)

One of the challenges of RESCAL (Nickel et al., 2011) is scalability. Since each relation matrix stores an interaction weight for every pair of subject and object embedding dimensions, a huge number of parameters is introduced.
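As a rough illustration of why the parameter count grows quickly, the sketch below scores a triple with a full relation matrix (the bilinear form used by RESCAL) and counts the relation parameters. The names `rescal_score` and `rescal_relation_params`, and all the numbers, are assumptions made for this example.

```python
def rescal_score(subject, relation_matrix, obj):
    """Bilinear score: subject^T * R * obj, with a full d x d relation matrix."""
    dim = len(subject)
    return sum(
        subject[i] * relation_matrix[i][j] * obj[j]
        for i in range(dim)
        for j in range(dim)
    )

def rescal_relation_params(dim, num_relations):
    """Each relation needs its own d x d matrix."""
    return dim * dim * num_relations

# Toy values, invented for illustration.
s = [1.0, 2.0]
o = [3.0, 4.0]
R = [[0.0, 1.0],
     [0.0, 0.0]]

forward = rescal_score(s, R, o)   # uses only the (0, 1) entry of R
backward = rescal_score(o, R, s)  # swapping subject and object changes the score
print(forward, backward, rescal_relation_params(100, 10))
```

Because the matrix is not restricted, the score can differ when subject and object are swapped, but the price is dim x dim parameters for every relation.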

DistMult

DistMult (Yang et al., 2015) is similar to RESCAL (Nickel et al., 2011) except for the number of parameters. Instead of using full matrices, Yang et al. reduce the number of relation parameters by restricting each relation matrix to a diagonal matrix. It therefore requires fewer parameters for training: RESCAL can need ten to a hundred times more parameters than DistMult.

DistMult needs only a small number of parameters (the same as TransE) yet achieves superior performance. Computationally, DistMult is similar to TransE, except that DistMult uses a multiplicative interaction while TransE uses an additive interaction.

One of the problems of DistMult is that it can only model symmetric relations, so it is not suitable for general knowledge graphs. This is a consequence of simplifying each relation to a diagonal matrix.
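The symmetry limitation is easy to see in a small sketch. With a diagonal relation matrix the score reduces to an element-wise product, which does not change when subject and object are swapped. The function name `distmult_score` and the toy vectors are assumptions for illustration.

```python
def distmult_score(subject, relation_diag, obj):
    """Score with a diagonal relation matrix: sum_i s_i * w_i * o_i."""
    return sum(s * w * o for s, w, o in zip(subject, relation_diag, obj))

# Toy values, invented for illustration.
s = [1.0, 2.0]
o = [3.0, 4.0]
w = [0.5, -1.0]  # only d parameters per relation instead of d x d

forward = distmult_score(s, w, o)
backward = distmult_score(o, w, s)
print(forward, backward)
```

Both directions give the same score, so a relation such as "is parent of" cannot be distinguished from its inverse.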

ComplEx

To handle both symmetric and antisymmetric relations, Trouillon et al. (2016) proposed to use complex embeddings (with both real and imaginary parts). Here s is a subject entity, R is a relation and o is an object entity. A relation is symmetric if sRo implies oRs for every pair of entities. It is antisymmetric if sRo and oRs can both hold only when s equals o.

The scoring function is similar to DistMult in that a diagonal matrix is used to score the vectors. The DistMult-style real part of the score captures the symmetric part, while the antisymmetric part is handled by the imaginary embeddings.

ComplEx Scoring Function. W: diagonal matrix, e_s: subject entity vector, e_o: object entity vector, Re: real part, Im: imaginary part. (Trouillon et al., 2016)
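A minimal sketch of this scoring function, using Python's built-in complex numbers, is shown here. It computes the real part of the sum of w_i * s_i * conj(o_i). The function name `complex_score` and the 1-dimensional toy embeddings are assumptions for illustration only.

```python
def complex_score(subject, relation, obj):
    """Re( sum_i w_i * s_i * conj(o_i) ) over complex-valued embeddings."""
    total = sum(w * s * o.conjugate() for s, w, o in zip(subject, relation, obj))
    return total.real

# Toy 1-dimensional complex embeddings, invented for illustration.
s = [1 + 2j]
o = [3 - 1j]

symmetric_rel = [2 + 0j]      # purely real relation: behaves like DistMult
antisymmetric_rel = [0 + 1j]  # purely imaginary relation: sign flips on swap

sym_forward = complex_score(s, symmetric_rel, o)
sym_backward = complex_score(o, symmetric_rel, s)
anti_forward = complex_score(s, antisymmetric_rel, o)
anti_backward = complex_score(o, antisymmetric_rel, s)
print(sym_forward, sym_backward, anti_forward, anti_backward)
```

A purely real relation embedding gives the same score in both directions, while a purely imaginary one flips the sign when subject and object are swapped. A general relation embedding mixes both behaviours.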

Training Objective

Link prediction (Liben-Nowell and Kleinberg, 2004) is one of the ways to train entity embeddings. Given the defined relations between nodes (i.e. the graph), we can generate negative samples (i.e. corrupted relations, which we will discuss in a later section) to challenge the model and allow it to learn the relations among entities.

Relations between customer and food (source)

Once we have both positive samples and negative samples, we can use a ranking loss, a logistic loss or a softmax loss as the loss function to score those samples.

Ranking Loss: A loss is incurred if the score of a positive sample does not exceed the score of a negative sample by at least a small margin.
Logistic Loss: Instead of ranking samples, it predicts the probability that an edge exists.
Softmax Loss: It models a probability distribution over an entity's candidate connections, in which the positive sample should receive the highest probability.
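The three losses above can be sketched in a few lines of plain Python. These are textbook formulations written for illustration; the function names, the default margin of 1.0 and the +1/-1 label convention are assumptions, and a library such as PBG may differ in details.

```python
import math

def ranking_loss(pos_score, neg_score, margin=1.0):
    """Margin-based ranking loss: zero once the positive beats the negative by the margin."""
    return max(0.0, margin - pos_score + neg_score)

def logistic_loss(score, label):
    """Binary logistic loss; label is +1 for a real edge and -1 for a negative sample."""
    return math.log(1.0 + math.exp(-label * score))

def softmax_loss(pos_score, neg_scores):
    """Cross-entropy of picking the positive among the positive and its negatives."""
    scores = [pos_score] + list(neg_scores)
    top = max(scores)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(x - top) for x in scores))
    return log_sum - pos_score

print(ranking_loss(3.0, 1.0))      # positive wins by more than the margin
print(ranking_loss(1.0, 0.5))      # positive wins, but by less than the margin
print(logistic_loss(0.0, 1))
print(softmax_loss(0.0, [0.0]))
```

In all three cases the loss shrinks as the positive score grows relative to the negative scores.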

Large Scale Training

We have gone through several unsupervised learning methods to train entity embeddings. Let us imagine that you are working for Facebook or Uber and have an extremely large number of nodes and edges. How can we fit all that data into memory? In 2019, Lerer et al. released PyTorch-BigGraph (PBG), which supports billions of nodes and trillions of edges. PBG provides a way to perform distributed execution across multiple machines.

Partition

The first step of distributed execution is partitioning the data. The steps are:

Assign each entity to a partition (if necessary).
Divide the edges into buckets. An edge is put into the bucket (p1, p2) if its source entity is in partition p1 and its destination entity is in partition p2.
Order the buckets for training. It is important that, except for the first bucket, at least one of the partitions (i.e. p1 or p2 in the bucket (p1, p2)) has already been trained. Empirically, this is better than a random order.

Left: Nodes are split into multiple partitions. Non-overlapping partitions can be executed in parallel. Center: No partition for entities with small cardinality. Right: Order matters; the bucket order guarantees that at least one partition of each bucket has been previously trained. (Lerer et al., 2019)
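The partitioning steps above can be sketched as follows. This is a toy illustration and not PBG's implementation: the modulo-based `partition_of` assignment, the helper names and the sample edges are all assumptions.

```python
from collections import defaultdict

def partition_of(entity_id, num_partitions):
    """Hypothetical assignment: spread entity ids across partitions by modulo."""
    return entity_id % num_partitions

def bucket_edges(edges, num_partitions):
    """Group edges by (source partition, destination partition)."""
    buckets = defaultdict(list)
    for src, rel, dst in edges:
        key = (partition_of(src, num_partitions), partition_of(dst, num_partitions))
        buckets[key].append((src, rel, dst))
    return dict(buckets)

def is_valid_order(bucket_order):
    """Every bucket after the first must share a partition with an earlier bucket."""
    trained = set()
    for index, (p1, p2) in enumerate(bucket_order):
        if index > 0 and p1 not in trained and p2 not in trained:
            return False
        trained.update((p1, p2))
    return True

# Toy edges as (source id, relation, destination id), invented for illustration.
edges = [(0, "likes", 1), (2, "likes", 3), (1, "follows", 2), (4, "likes", 4)]
buckets = bucket_edges(edges, num_partitions=2)
print(buckets)
print(is_valid_order([(0, 0), (0, 1), (1, 1)]))
print(is_valid_order([(0, 0), (1, 1)]))
```

The second ordering is rejected because bucket (1, 1) would be trained before partition 1 has been seen in any earlier bucket.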

There are 2 fundamental changes when performing distributed execution. Negative samples are drawn from the same partition, and edges are no longer sampled independently and identically distributed (i.i.d.). One of the impacts is slower convergence. We will come back to these problems in later sections.

Distribution

After the data is partitioned, it can be sent to different machines for parallel training. The traditional mechanism uses parameter servers to store the embeddings and updates the parameters asynchronously as gradients are sent from the workers. However, one of the drawbacks is the huge network bandwidth overhead. Therefore, Lerer et al. proposed a solution to overcome it.

A Lock Server is proposed to lock partitioned embeddings to a single machine. If the partitions of two buckets are disjoint, the buckets can be trained in parallel; otherwise, the embeddings stay locked by the Lock Server until they are released. Only shared parameters are synchronized.

1. The Rank 2 Trainer requests a bucket from the Lock Server. 2. Swap partitions from a shared partition. 3. Load edges from a shared file system and perform training. (Lerer et al., 2019)
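To illustrate the locking idea, here is a toy, single-process sketch. The class `ToyLockServer` and its `acquire` and `release` methods are hypothetical and are not part of PBG's API; the sketch only shows that buckets with disjoint partitions can be handed out at the same time, while overlapping ones must wait.

```python
class ToyLockServer:
    """Hands out buckets whose partitions are not locked by another trainer."""

    def __init__(self, buckets):
        self.pending = list(buckets)  # buckets still waiting to be trained
        self.locked = set()           # partitions currently held by trainers

    def acquire(self):
        """Return the first pending bucket with no locked partition, else None."""
        for bucket in self.pending:
            if not (set(bucket) & self.locked):
                self.pending.remove(bucket)
                self.locked.update(bucket)
                return bucket
        return None

    def release(self, bucket):
        """Unlock the partitions of a bucket once its training is done."""
        self.locked.difference_update(bucket)

server = ToyLockServer([(0, 1), (0, 2), (2, 3)])
first = server.acquire()    # (0, 1): nothing is locked yet
second = server.acquire()   # (2, 3): (0, 2) is skipped because partition 0 is locked
third = server.acquire()    # None: (0, 2) overlaps with both running buckets
server.release(first)
server.release(second)
fourth = server.acquire()   # (0, 2) is now free to train
print(first, second, third, fourth)
```

Two trainers can work on (0, 1) and (2, 3) in parallel because they touch different partitions, so no embeddings need to be exchanged between them during training.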

Negative Sampling

Three negative sampling methods are available in PBG. The All negatives method is the simplest: it generates all possible edges for all data. The Same-batch negatives method generates all possible edges only within a batch, which reduces the total number of negatives and the network bandwidth overhead. The last method, Uniformly-sampled negatives, generates a fixed number of negative samples within a batch. You may visit the documentation for a deeper understanding of the negative sampling methods.
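A simplified sketch of two of these strategies follows. It only corrupts the destination (tail) of each edge, and the helper names `uniform_negatives` and `same_batch_negatives` are assumptions for illustration; PBG's actual implementation differs in details.

```python
import random

def uniform_negatives(edge, num_entities, k, rng):
    """Corrupt the destination with k uniformly sampled entity ids.

    A sampled edge may occasionally be a true edge; this sketch does not filter it.
    """
    src, rel, _ = edge
    return [(src, rel, rng.randrange(num_entities)) for _ in range(k)]

def same_batch_negatives(batch):
    """Pair each source with every destination seen in the batch, skipping positives."""
    positives = set(batch)
    destinations = [dst for _, _, dst in batch]
    negatives = []
    for src, rel, _ in batch:
        for other in destinations:
            candidate = (src, rel, other)
            if candidate not in positives:
                negatives.append(candidate)
    return negatives

# Toy batch of (source id, relation, destination id), invented for illustration.
batch = [(0, "r", 1), (2, "r", 3)]
rng = random.Random(0)
sampled = uniform_negatives(batch[0], num_entities=10, k=5, rng=rng)
in_batch = same_batch_negatives(batch)
print(sampled)
print(in_batch)
```

Same-batch negatives reuse entities that are already loaded for the batch, which is why they are cheap compared to drawing from the whole entity set.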

Take Away

PyTorch BigGraph (PBG) supports CPU only as of now (Nov 2019), while the authors are working on GPU support. Stay tuned.
PyTorch BigGraph (PBG) implements the TransE, RESCAL, DistMult and ComplEx models (with small modifications). In other words, you just need to provide data in the correct format and PBG will do the rest for you.
Trouillon et al. (2016) evaluated the impact of the number of negatives generated per positive training sample. They found that 50 negative examples per positive training sample is a good trade-off between accuracy and training time.
The aforementioned models and loss functions are re-implemented in PBG, so you may simply call the library to train graph embeddings.

Extra Reading

PyTorch BigGraph source repository
Graph Learning in Uber Eats

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or Github.

Reference

D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. 2004
M. Nickel, V. Tresp and H.-P. Kriegel. A Three-Way Model for Collective Learning on Multi-Relational Data. 2011
S. Bhagat, G. Cormode and S. Muthukrishnan. Node Classification in Social Networks. 2011
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston and O. Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. 2013
B. Yang, W. T. Yih, X. He, J. Gao and L. Deng. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. 2015
T. Trouillon, J. Welbl, S. Riedel, E. Gaussier and G. Bouchard. Complex Embeddings for Simple Link Prediction. 2016
A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose and A. Peysakhovich. PyTorch-BigGraph: A Large-scale Graph Embeddings Framework. 2019