
PageRank for TextRank - Text Summarization

As the name suggests, text summarization aims at producing a summary of a text that is as close as possible to a human-written summary. It can be achieved in two forms:
·         Using Extraction: Relatively easy to perform, it generates an extract of the original text by incorporating the key parts of the text as-is, or with slight modifications.
·         Using Abstraction: Difficult to perform, it semantically analyses the text and generates new sentences to convey its meaning.
There are numerous text summarization techniques. Here, however, we will look at text summarization using the TextRank algorithm, which builds on the classic "Google PageRank algorithm".

GOOGLE - PageRank Algorithm:


PageRank is a web page ranking algorithm that takes a graph-based approach to gauging the importance of web pages.
Key features:
1.       It can be applied to any entity, such as a web page or an author name.
2.       A hyperlink to a page counts as a vote of support. (In text summarization this is analogous to inter-sentence similarity.)


IMG Source: https://www.slideshare.net/pkpramit/page-rank

3.       If a page is linked to by multiple pages of high rank, it will receive a high rank itself.
4.       The PageRank algorithm gives a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.
5.       The sum of the PageRanks of all pages is one. (How does Google manage such a large amount of data, then? Perhaps logarithmically.)
6.       The PageRank computation requires several passes, called "iterations", through the collection, to adjust the approximate PageRank values so that they more closely reflect the theoretical true value.

Working:

The PageRank of a page is calculated as:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where,
- PR(A) is the PageRank of page A
- PR(Ti) is the PageRank of a page Ti that links to page A
- C(Ti) is the number of outbound links on page Ti, and d is the damping factor

“The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A will benefit from a link on T.

The weighted PageRanks of pages Ti are then added up. Therefore, an additional inbound link for page A will always increase page A's PageRank.”
- (Ref: http://pr.efactory.de/e-pagerank-algorithm.shtml - Ah! An outbound link! I might just have reduced the rank of this page.)

The damping factor accounts for an imaginary surfer who keeps clicking on links at random but eventually stops. The probability, at any step, that the surfer will continue is the damping factor d, usually set to 0.85.
Google recalculates PageRank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents. If a page has no links to other pages, it becomes a sink; when the random surfer arrives at a sink page, they pick another URL at random and continue surfing.
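
To make the iteration concrete, here is a minimal Python sketch of the computation, following the formula above. The three-page example graph, the tolerance, and the function name are illustrative assumptions, not part of the original algorithm.

    # A minimal sketch of the iterative PageRank computation.
    def pagerank(links, d=0.85, tol=1.0e-6, max_iter=100):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        pr = {p: 1.0 for p in pages}  # initial approximation
        for _ in range(max_iter):
            new_pr = {}
            for page in pages:
                # Each page T that links to `page` contributes its own rank
                # divided by C(T), the number of outbound links on T.
                incoming = sum(pr[t] / len(links[t])
                               for t in pages if page in links[t])
                new_pr[page] = (1 - d) + d * incoming
            if all(abs(new_pr[p] - pr[p]) < tol for p in pages):
                break  # the values have stabilized
            pr = new_pr
        return pr

    # Hypothetical three-page web: A and B link to each other, C links to A.
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))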

PageRank as a Markov Chain:

Page ranking can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.
For example, if a user at page E makes 100 clicks, they reach A 70 times and stay at E 30 times, and similarly for A, as shown in the figure. The Markov chain helps us find the probability of a random user being at any page after x iterations. This calculation uses second-level probabilities, i.e., the probability of going from i to j in two iterations.

The values stabilize at p8, i.e., at the 8th iteration. These are the values website admins use to focus on the pages of higher rank and earn revenue. (In our case, we use these values to extract the sentences that form a better summary.)
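
The figure's exact transition probabilities are not reproduced here, so the sketch below assumes the 70/30 behaviour from E and plausible values for A. It repeatedly multiplies the surfer's location distribution by the transition matrix and shows the values settling within a few iterations (NumPy assumed):

    import numpy as np

    # Two-state Markov chain: from E the surfer reaches A 70 times out of 100
    # and stays at E 30 times. The row for A (80/20) is an assumed value.
    P = np.array([[0.8, 0.2],   # from A: P(A->A), P(A->E)
                  [0.7, 0.3]])  # from E: P(E->A), P(E->E)

    p = np.array([0.0, 1.0])    # the surfer starts at page E
    for i in range(1, 9):
        p = p @ P               # one more click: p_k = p_{k-1} . P
        print("p%d: A=%.4f  E=%.4f" % (i, p[0], p[1]))
    # By p8 the distribution has settled at the chain's stationary
    # distribution (here roughly A=0.78, E=0.22).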

Example:


‘A’ writes a blog called TECH BEND at ‘Kundanbhaduri.com’. As can be seen below, the Google PageRank for the query ‘TechBend blog’ is highest for ‘Kundanbhaduri.com’, and hence it appears at the top of our search results.


The data given below is a transition matrix for different sites. Clearly, ‘KundanBhaduri.com’ has achieved a higher PageRank through its links with other prominent websites like TED, TechCrunch, etc. Therefore, when a user queries ‘Techbend’, he/she gets ‘KundanBhaduri.com’ as the top link instead of ‘TechBend.com’.


Mapping PageRank to TextRank:

1.      Pre-process the text:
remove all stop words and tokenize the text into sentences, using a trained model or a set of rules.

2.      Create a graph where the vertices are sentences.
3.      Connect every sentence to every other sentence by an edge, where the edge weight defines the similarity of the two sentences.

The weight of an edge reflects how similar the two sentences are. The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences.
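
As a rough sketch, this overlap measure (normalized by the log of the sentence lengths, as in the original TextRank paper) can be written as follows; the whitespace tokenization is a simplifying assumption:

    import math

    # Overlap similarity: shared tokens, normalized so that long sentences
    # are not unduly favoured.
    def sentence_similarity(s1, s2):
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        if not w1 or not w2:
            return 0.0
        common = len(w1 & w2)
        denom = math.log(len(w1)) + math.log(len(w2))
        return common / denom if denom > 0 else 0.0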

Create a sparse matrix of words and sentences, holding the count of each word in each sentence.
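
A minimal sketch of building this count matrix; scikit-learn and the two placeholder sentences are assumptions, and any bag-of-words implementation would do:

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["PageRank ranks web pages.",
                 "TextRank ranks sentences the way PageRank ranks pages."]
    counts = CountVectorizer().fit_transform(sentences)  # sparse matrix
    print(counts.toarray())  # rows = sentences, columns = vocabulary words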
 

Create a similarity matrix after normalization with tf-idf:


Note: Remember the Markov transition matrix discussed in the example above? The similarity matrix plays the same role here.
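
A corresponding sketch with tf-idf normalization, again assuming scikit-learn; pairwise cosine similarities give the matrix of edge weights:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = ["PageRank ranks web pages.",
                 "TextRank ranks sentences the way PageRank ranks pages."]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity_matrix = cosine_similarity(tfidf)  # shape: (n_sent, n_sent)
    print(similarity_matrix.round(2))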

4.      Run the PageRank algorithm on the graph.
5.      Pick the sentences (vertices) with the maximum score. Add them to the summary or send them for further processing. (A sketch of the full pipeline follows below.)
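
Putting steps 1-5 together, here is a compact sketch of the whole pipeline. It assumes networkx and scikit-learn, and uses tf-idf cosine similarity for the edge weights, so it is an approximation of TextRank rather than a reference implementation:

    import networkx as nx
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def textrank_summary(sentences, top_n=2):
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        sim = cosine_similarity(tfidf)           # inter-sentence similarity
        np.fill_diagonal(sim, 0.0)               # drop self-loops
        graph = nx.from_numpy_array(sim)         # vertices are sentence indices
        scores = nx.pagerank(graph, alpha=0.85)  # alpha = damping factor d
        best = sorted(scores, key=scores.get, reverse=True)[:top_n]
        return [sentences[i] for i in sorted(best)]  # keep original order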

Note:

gensim.summarization is a popular implementation of the TextRank algorithm.
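
A quick usage sketch (note that the gensim.summarization module was removed in gensim 4.0, so this requires an older version):

    from gensim.summarization import summarize

    text = "..."  # placeholder: a long, multi-sentence document
    print(summarize(text, ratio=0.2))  # keep roughly the top 20% of sentences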
Refer to the original TextRank paper: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

That's all for this post. You may refer to the 'USELESS OUTBOUND LINKS' mentioned below; they are still useful for a better understanding.

Thanks !!

Links and References:

http://www.cai.sk/ojs/index.php/cai/article/viewFile/37/24
http://www.google.com

Images from :

https://www.slideshare.net/andrewkoo/textrank-algorithm


