avatarMaciej Zalwert

Summary

The undefined website provides an in-depth tutorial on the LexRank algorithm, detailing its application in summarizing textual content and extracting key information.

Abstract

LexRank is a graph-based algorithm that computes the relative importance of sentences within a text for the purpose of summarization. It leverages eigenvector centrality in a graph representation where sentences are nodes and edges are weighted by cosine similarity of sentence vectors. The algorithm begins with converting text into numerical vectors using word embeddings, then calculates intra-sentence cosine similarity to construct an adjacency matrix. This matrix is refined into a connectivity matrix by applying a threshold to filter out weaker connections, thus avoiding the local trap problem where less important sentences are overrepresented. The final step involves using the power iteration method to determine eigenvector centrality, which identifies the most significant sentences. The tutorial emphasizes the effectiveness of LexRank in summarizing articles, as demonstrated by the author's experience with topics such as weight loss and success on Medium.

Opinions

  • The author believes that LexRank is an effective tool for summarization, as evidenced by their personal use in analyzing articles on diverse topics.
  • The complexity of the theory behind LexRank is acknowledged, but the author asserts that the algorithm is straightforward in practice.
  • Pre-trained word embeddings are recommended for ease of use, suggesting that generating one's own embeddings is resource-intensive.
  • The author expresses that the use of eigenvector centrality and the power iteration method is a key strength of LexRank, allowing it to identify the most important sentences in a text.
  • There is an opinion that LexRank can overcome the challenge of local traps by considering the number of connections and their strengths, which contributes to a more accurate summary.

LexRank algorithm explained: a step-by-step tutorial with examples

Image by Nothing Ahead from Pexels

LexRank is a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. I used LexRank to summarize and extract the most relevant information from hundreds of articles, e.g. analysis on how to lose weight and how to succeed on a Medium as a writer. LexRank proved to work effectively — in my humble opinion.

At the first glance, the theory behind the algorithm seems complicated, but in reality, it’s very simple.

Let’s start with the official general description of the concept.

We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences.

Let’s divide LexRank algorithm into pieces so that step-by-step, with explained mathematical concepts, we get to the model output:

  1. Input to the model
  2. Word embeddings
  3. Intra-sentence cosine similarity
  4. Adjacency matrix
  5. Connectivity matrix
  6. Eigenvector centrality
  7. Output of the model

Input to the model

The basic input to the model may be just an article or set of articles. It depends on whether your goal is to summarize one article or many articles.

LexRank can consume any text like:

  • “This is an example sentence of the article. This is the second sentence. This is the third sentence, that is the most important cause it says other sentences are just examples.”

Most probably when we utilize LexRank in the above example it would give only the last sentence as a summary of the text.

Ok, but how to train a model with a text?

Word embeddings

To apply any mathematical algorithm we need to convert the text input into numbers, typically in the form of a real-valued vectors. Representation of words in the vector format is called word embeddings.

Usually word embeddings are computed such that the words that are represented as similar vectors are expected to be similar in meaning.

Let’s see a simple example on cat, dog and aeroplane

# example representation vector of a word "cat", "dog" and "aeroplane"
cat_representation = [0, 0, 0, 1, 0, 1]
dog_representation = [0, 0, 0, 1, 1, 1]
aeroplane_represenatation = [1, 1, 0, 0, 0, 0]

As you can see a “cat” and a “dog” representation in vectors are very similar whereas “aeroplane” is completely different from both animals words. In simple,“cat” and “dog” have common parts in the vector whereas aeroplane has zero common parts.

Note: generation of word embeddings is far beyond this tutorial. I will create another series of articles on it. To train your own word embeddings you would need a lot of data (like all articles from Wikipedia :)) and tremendous computing power.

Fortunately for us there are many already trained word embeddings that can be easily used e.g. from fast text here.

Ok, but why it is important? How can we use it?

Intra-sentence cosine similarity

We can use word embeddings in the sentence in many ways. However, in LexRank implementation an intra-sentence of sentence is used. It means the average of all word embeddings within a sentence that are used to compare to other sentences.

Let’s see an example

sentence_1 = "this is a cat"
sentence_2 = "this is a dog"
# example representation of words in the form of vectors where each word is represented as a separate vector
this_vector = [0,0,0]
is_vector = [1,0,0]
cat_vector = [0, 1, 0]
dog_vector = [0, 0, 1]
# how sentence looks in the form of vectors:
sentence_1 = "[0,0,0] [1,0,0] [0, 1, 0]"
sentence_2 = "[0,0,0] [1,0,0] [0, 0, 1]"
# how to calculate intra sentence? Just take an average
intra_sentence_1 = [1/3, 1/3, 0]
intra_sentence_2 = [1/3, 0, 1/3]

But what’s a cosine similarity?

It’s also very easy! It’s just a metric (like any other), helpful in determining, how similar the data objects are irrespective of their size. In this case, how similar sentences are represented by vectors.

The formula for calculating the cosine similarity is: 
Cos(x, y) = x . y / ||x|| * ||y||
Where:
x . y - a dot product 
|| x || - length of a vector x
|| y || - length of a vector y

How to calculate it?

x = sentence_1
y = sentence_2
x . y = 1/3*1/3 + 1/3*0 + 0*1/3 = 1/9
|| x || = √ (1/3)^2 + (1/3)^2 + (0)^2 = 0.47
|| y || = √ (1/3)^2 + (0)^2 + (1/3)^2 = 0.47
Cos(x, y) = 1/9 / ((0.47) * (0.47)) = 0.024

Easy, isn’t it?

And now the algorithm calculates it to every single sentence within a given text input.

But how can we represent it?

Adjacency matrix

To represent a similarities across all sentences LexRank uses a connectivity matrix. Let’s see an example:

# Example of adjacency matrix with weights as similarities
Let's use previous sentences and add a new one with example similarities
sentence_1 = "this is a cat"
sentence_2 = "this is a dog"
sentence_3 = "this is an aeroplane"
example of adjacency matrix with similarities

As you can see sentence_1 and sentence_2 are similar whereas sentence_3 has no similarity with any other sentence.

There is one very big problem here. When you have many articles it might happen that in one article there will be many similar sentences. In such a case the model could output only sentences from this one article. In more technical terms it is called a local trap where few unimportant sentences votes to each other.

To get rid of the above problem LexRank takes into account that the importance of sentences should also stem from the importance of the sentences “recommending” it.

But how to do it?

Connectivity matrix

Note: An adjacency matrix is usually a binary matrix with just information whether the two vertices have an edge between them. A connectivity matrix is usually a list of which vertex numbers have an edge between them.

To get rid of the local trap problem LexRank adds a count of connections from other sentences. In simple words, a sentence may have just 2 connections, but very strong, and win.

For example:

sentence_1 is similar to 10 other sentences
sentence_2 is similar to 6 other sentences
sentence_3 is similar to sentence_1 and sentence_2
sentence_3 is a winner because all connections from sentence_1 and setence_2 counts.

To count the number of connection LexRank applies a threshold. For example, it only counts sentences as similar to itself where cosine similarity is more than 0.3 or 0.1.

Let’s see how the example connectivity matrix looks like with threshold 0.2:

example connectivity matrix

As you can see sentence_3 is of the highest importance.

Last step — how to find all sentences with the highest score on the whole matrix?

Eigenvector centrality

To find out the most important sentences LexRank utilizes eigenvector centrality. The method is called power iteration method.

How it works?

  • In the first step each matrix row is multiplied by a 1
  • Secondly, we square rows results and take a root from the sum like: (1² + 1² + 2²)^(1/2)
  • We repeat this step until the normalized value does not change much between any iteration

Output

Finally, we have the output. In the above example sentence_3 got the highest score (0,667).

Hope it helps! Cheers!

Naturallanguageprocessing
NLP
Summarization
Theory
Recommended from ReadMedium