As the name suggests, text summarization aims at producing a summary of a text that is as close as possible to one written by a human. It can be achieved in two forms:
· Using Extraction: Relatively easy to perform, it generates an extract of the original text by incorporating its key passages as is, or with slight modifications.
· Using Abstraction: Difficult to perform, it semantically analyses the text.
There are numerous text summarization techniques. Here we will look at summarization using the TextRank algorithm, which builds on the classic Google PageRank algorithm.
GOOGLE - PageRank Algorithm:
PageRank is a web page ranking algorithm that uses a graph-based approach to gauge the importance of web pages.
Key features:
1. It can be applied to any entity, such as a webpage or an author name.
2. A hyperlink to a page counts as a vote of support. (In text summarization this is analogous to inter-sentence similarity.)
(Image source: https://www.slideshare.net/pkpramit/page-rank)
3. If a page is linked to by multiple pages of high rank, it will get a high rank itself.
4. PageRank gives a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.
5. The sum of the PageRanks of all pages is one. (How does Google manage such a large amount of data then? Perhaps logarithmically.)
6. The PageRank computation requires several passes, called "iterations", through the collection to adjust the approximate PageRank values so that they more closely reflect the theoretical true value.
Working:
The PageRank is calculated as:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where:
- PR(A) is the PageRank of page A
- PR(Ti) is the PageRank of each page Ti that links to page A
- C(Ti) is the number of outbound links on page Ti
- d is the damping factor
“The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A benefits from a link from it. The weighted PageRanks of pages Ti are then added up. Therefore an additional inbound link for page A will always increase page A's PageRank.”
(Ref: http://pr.efactory.de/e-pagerank-algorithm.shtml - Ah! An outbound link! I might have reduced the rank of this page.)
The damping factor takes care of the case where an imaginary surfer who is randomly clicking on links eventually stops clicking. The probability, at any step, that the surfer will continue clicking is the damping factor d, usually set to 0.85.
Google recalculates PageRank scores each time it crawls the web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents. If a page has no links to other pages, it becomes a sink; when the random surfer arrives at a sink page, they pick another URL at random and continue surfing.
Page Ranking as a Markov Chain:
Page ranking can be understood as a Markov chain in which the states are pages and the transitions, all equally probable, are the links between pages.
Suppose a user at page E makes 100 clicks: they reach A 70 times and stay at E 30 times, and similarly for A, as shown in the figure. The Markov chain helps us find the probability of a random user being at any particular page after x iterations. This calculation is made using second-level probabilities, i.e. the probability of going from page i to page j in two steps.
The values stabilize at p8, i.e. at the 8th iteration. Website admins use these values to focus on the pages of higher rank to earn revenue. (In our case, we use these values to extract the sentences that form a better summary.)
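The convergence described above can be checked directly by iterating the transition matrix. In this sketch, P(E→A) = 0.7 and P(E→E) = 0.3 come from the example in the text; the row for A (stay with probability 0.8, leave with 0.2) is a made-up assumption, since the original figure is not available.

```python
# Transition probabilities for the two-page example (order: A, E).
# The row for E comes from the text; the row for A is a hypothetical value.
P = [[0.8, 0.2],   # from A: stay at A 80% of the time, go to E 20%
     [0.7, 0.3]]   # from E: go to A 70% of the time, stay at E 30%

def step(dist, P):
    """One click: multiply the row-vector distribution by the transition matrix."""
    return [sum(dist[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]

# Start at E with certainty; applying P twice gives the "second level"
# probabilities, and by around 8 iterations the values have stabilized.
dist = [0.0, 1.0]
for _ in range(8):
    dist = step(dist, P)

# With these numbers the stationary distribution is [7/9, 2/9]:
# it satisfies pi = pi * P, so further clicks no longer change it.
```

After eight iterations the distribution is already within a few parts per billion of the stationary values, which matches the observation that p8 is stable.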
The data given below is a transition matrix for different sites. Clearly, ‘KundanBhaduri.com’ has achieved a higher page rank by being linked from other prominent websites like Ted, Techcrunch etc. Therefore, when a user queries ‘Techbend’, he/she gets ‘KundanBhaduri.com’ as the top link instead of ‘TechBend.com’.
Map the PageRank to TextRank:
1. Pre-process the text: remove all stop words and tokenize the text into sentences using a trained model or a set of rules.
2. Create a graph where the vertices are sentences.
3. Connect every sentence to every other sentence by an edge whose weight defines how similar the two sentences are. The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences. Create a sparse matrix of words vs. sentences with the count of each word in each sentence, then create a similarity matrix after normalization with tf-idf. (Note: remember the Markov transition matrix discussed in the example above?)
4. Run the PageRank algorithm on the graph.
5. Pick the sentences (vertices) with the maximum score. Add them to the summary or send them for further processing.
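The five steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production summarizer: sentence splitting is a simple regex rule rather than a trained model, the stop-word list is a toy one, and the edge weight is the token-overlap measure (normalized by the log sentence lengths, as in the TextRank paper) instead of tf-idf cosine similarity.

```python
import math
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on"}  # toy list

def tokenize(sentence):
    # Step 1: lowercase, keep word tokens, drop stop words.
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def similarity(s1, s2):
    # Step 3: common tokens, normalized by the log lengths of the sentences.
    t1, t2 = set(tokenize(s1)), set(tokenize(s2))
    if len(t1) < 2 or len(t2) < 2:
        return 0.0
    return len(t1 & t2) / (math.log(len(t1)) + math.log(len(t2)))

def textrank_summary(text, top_n=2, d=0.85, iterations=50):
    # Step 1: split into sentences with a simple rule.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    # Steps 2-3: build the weighted similarity graph (vertices = sentences).
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]          # total outgoing weight per sentence
    # Step 4: weighted PageRank over the sentence graph.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(scores[j] * w[j][i] / out[j]
                                    for j in range(n) if w[j][i] > 0)
                  for i in range(n)]
    # Step 5: pick the highest-scoring sentences, kept in original order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(top))

summary = textrank_summary(
    "PageRank ranks web pages by counting link votes. "
    "TextRank ranks sentences instead of web pages. "
    "Ranking sentences with link votes gives summaries. "
    "My cat enjoys sleeping.", top_n=2)
```

In the example call, the unrelated sentence about the cat shares no tokens with the others, so it stays at the (1 - d) floor and is never selected, while the mutually similar ranking sentences score highest.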
Note: Refer to the following link: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
That's all for this post. You may refer to the 'USELESS OUTBOUND LINKS' mentioned below, which are still useful for a better understanding.
Thanks !!
Links and References:
http://www.cai.sk/ojs/index.php/cai/article/viewFile/37/24
http://www.google.com
Images from:
https://www.slideshare.net/andrewkoo/textrank-algorithm