As the name suggests, text summarization aims at producing a summary of a text that is as close as possible to one written by a human. It can be achieved in two forms:
· Using Extraction: Relatively easy to perform, it generates an extract of the original text by incorporating its key passages as is, or with slight modifications.
· Using Abstraction: Difficult to perform, it semantically analyses the text.
There are numerous text summarization techniques. Here we will look at summarization using the TextRank algorithm, which builds on the classic Google PageRank algorithm.
GOOGLE - PageRank Algorithm:
PageRank is a web page ranking algorithm that uses a graph-based approach to gauge the importance of web pages.
Key features:
1. It can be applied to any entity, such as a webpage or an author name.
2. A hyperlink to a page counts as a vote of support. (In text summarization this is analogous to inter-sentence similarity.)
(Image source: https://www.slideshare.net/pkpramit/page-rank)
3. If a page is linked to by multiple pages of high rank, it will get a high rank itself.
4. PageRank gives a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.
5. The sum of the PageRanks of all pages is one. (How does Google manage such a large amount of data then? Perhaps logarithmically.)
6. The PageRank computation requires several passes, called "iterations", through the collection to adjust the approximate PageRank values so that they more closely reflect the theoretical true value.
Working:
The PageRank is calculated as:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where:
- PR(A) is the PageRank of page A
- PR(Ti) is the PageRank of each page Ti that links to page A
- C(Ti) is the number of outbound links on page Ti
- d is the damping factor
“The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A benefits from a link from it. The weighted PageRanks of pages Ti are then added up. Therefore an additional inbound link for page A will always increase page A's PageRank.”
(Ref: http://pr.efactory.de/e-pagerank-algorithm.shtml - Ah! An outbound link! I might have reduced the rank of this page.)
The damping factor takes care of the case where an imaginary surfer who is randomly clicking on links eventually stops clicking. The probability, at any step, that the surfer will continue clicking is the damping factor d, usually set to 0.85.
Google recalculates PageRank scores each time it crawls the web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents. If a page has no links to other pages, it becomes a sink; when the random surfer arrives at a sink page, they pick another URL at random and continue surfing.
Page Ranking as a Markov Chain:
Page ranking can be understood as a Markov chain in which the states are pages and the transitions, all equally probable, are the links between pages.
Suppose a user at page E makes 100 clicks: they reach A 70 times and stay at E 30 times, and similarly for A, as shown in the figure. The Markov chain helps us find the probability of a random user being at any particular page after x iterations. This calculation is made using second-level probabilities, i.e. the probability of going from page i to page j in two steps.
The values stabilize at p8, i.e. at the 8th iteration. Website admins use these values to focus on the pages of higher rank to earn revenue. (In our case, we use these values to extract the sentences that form a better summary.)
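The convergence described above can be checked directly by iterating the transition matrix. In this sketch, P(E→A) = 0.7 and P(E→E) = 0.3 come from the example in the text; the row for A (stay with probability 0.8, leave with 0.2) is a made-up assumption, since the original figure is not available.

```python
# Transition probabilities for the two-page example (order: A, E).
# The row for E comes from the text; the row for A is a hypothetical value.
P = [[0.8, 0.2],   # from A: stay at A 80% of the time, go to E 20%
     [0.7, 0.3]]   # from E: go to A 70% of the time, stay at E 30%

def step(dist, P):
    """One click: multiply the row-vector distribution by the transition matrix."""
    return [sum(dist[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]

# Start at E with certainty; applying P twice gives the "second level"
# probabilities, and by around 8 iterations the values have stabilized.
dist = [0.0, 1.0]
for _ in range(8):
    dist = step(dist, P)

# With these numbers the stationary distribution is [7/9, 2/9]:
# it satisfies pi = pi * P, so further clicks no longer change it.
```

After eight iterations the distribution is already within a few parts per billion of the stationary values, which matches the observation that p8 is stable.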
The data given below is a transition matrix for different sites. Clearly, ‘KundanBhaduri.com’ has achieved a higher page rank by being linked from other prominent websites like Ted, Techcrunch etc. Therefore, when a user queries ‘Techbend’, he/she gets ‘KundanBhaduri.com’ as the top link instead of ‘TechBend.com’.
Map the PageRank to TextRank:
1. Pre-process the text: remove all stop words and tokenize the text into sentences using a trained model or a set of rules.
2. Create a graph where the vertices are sentences.
3. Connect every sentence to every other sentence by an edge whose weight defines how similar the two sentences are. The overlap of two sentences can be determined simply as the number of common tokens between the lexical representations of the two sentences. Create a sparse matrix of words vs. sentences with the count of each word in each sentence, then create a similarity matrix after normalization with tf-idf. (Note: remember the Markov transition matrix discussed in the example above?)
4. Run the PageRank algorithm on the graph.
5. Pick the sentences (vertices) with the maximum score. Add them to the summary or send them for further processing.
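The five steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production summarizer: sentence splitting is a simple regex rule rather than a trained model, the stop-word list is a toy one, and the edge weight is the token-overlap measure (normalized by the log sentence lengths, as in the TextRank paper) instead of tf-idf cosine similarity.

```python
import math
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on"}  # toy list

def tokenize(sentence):
    # Step 1: lowercase, keep word tokens, drop stop words.
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def similarity(s1, s2):
    # Step 3: common tokens, normalized by the log lengths of the sentences.
    t1, t2 = set(tokenize(s1)), set(tokenize(s2))
    if len(t1) < 2 or len(t2) < 2:
        return 0.0
    return len(t1 & t2) / (math.log(len(t1)) + math.log(len(t2)))

def textrank_summary(text, top_n=2, d=0.85, iterations=50):
    # Step 1: split into sentences with a simple rule.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    # Steps 2-3: build the weighted similarity graph (vertices = sentences).
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]          # total outgoing weight per sentence
    # Step 4: weighted PageRank over the sentence graph.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(scores[j] * w[j][i] / out[j]
                                    for j in range(n) if w[j][i] > 0)
                  for i in range(n)]
    # Step 5: pick the highest-scoring sentences, kept in original order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(top))

summary = textrank_summary(
    "PageRank ranks web pages by counting link votes. "
    "TextRank ranks sentences instead of web pages. "
    "Ranking sentences with link votes gives summaries. "
    "My cat enjoys sleeping.", top_n=2)
```

In the example call, the unrelated sentence about the cat shares no tokens with the others, so it stays at the (1 - d) floor and is never selected, while the mutually similar ranking sentences score highest.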
Note: Refer to the following link: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
That's all for this post. You may refer to the 'USELESS OUTBOUND LINKS' mentioned below, which are still useful for a better understanding.
Thanks !!
Links and References:
http://www.cai.sk/ojs/index.php/cai/article/viewFile/37/24
http://www.google.com
Images from:
https://www.slideshare.net/andrewkoo/textrank-algorithm