Exploring both Content and Link Quality for Anti-Spamming
Abstract: Across the world, the internet is steadily becoming a primary source of information. Most internet users turn to the web for news, updates, events, and more, and the first tool almost anyone, even a layman, reaches for is a search engine. Web surfing has become very convenient because of search engines. On the one hand, this technology has put information retrieval at our fingertips; on the other hand, it has given rise to various kinds of malpractices meant to mold our views artificially. These in turn worsen our web experience by degrading the quality of the results retrieved. Attackers achieve this by fooling search engines and exploiting loopholes in search ranking algorithms, compromising the flow of web information for their personal gain, and with it the credibility and reputation of the system. This makes “anti-spamming” a hot topic among researchers. This paper proposes an improvement to the PageRank algorithm to provide better search results: a metric called TrustRank, which ranks web pages as good or bad based on certain factors.
The methodology behind this algorithm can be summarised as follows:
CONTENT QUALITY (QoC(p)): This metric considers the pages that point to page p, and depends on two quantities: the content quality and the link quality of p's parent pages. It focuses on the quality of the incoming links rather than their quantity. If p is pointed to by pages with good link quality, it can be considered a trustable page; otherwise, it may not be assigned a good trust rank.
LINK QUALITY (QoL(p)): In contrast to QoC(p), QoL(p) considers the content and link quality of p's child pages. If p points to credible pages, it too can be considered a good page.
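The parent/child intuition above can be sketched in a few lines of Python. The tiny graph and the per-page quality scores below are invented purely for illustration; they are not taken from the paper:

```python
# Illustrative sketch: QoC(p) looks at p's parents, QoL(p) at p's children.
links = {"A": ["B"], "B": ["C"], "C": []}   # adjacency: page -> pages it links to
quality = {"A": 0.9, "B": 0.0, "C": 0.8}    # assumed per-page quality scores

def parents(p):
    """Return the pages that link to p (its parents in the web graph)."""
    return [q for q, outs in links.items() if p in outs]

# QoC of "B" is driven by the quality of its parent "A";
# QoL of "B" is driven by the quality of its child "C".
qoc_b = sum(quality[q] for q in parents("B")) / len(parents("B"))
qol_b = sum(quality[c] for c in links["B"]) / len(links["B"])
```

Here page B itself has neutral quality, yet it inherits a high QoC from its trusted parent A and a high QoL from its credible child C.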
The computation of link and content quality can be formalized through the two metrics below:
COMPUTING QoC(p) and QoL(p):
We now need to give each page an initial value of link and content quality. There are several ways of doing so:
1. As done in the PageRank algorithm, assign both values as 1/N, where N is the total number of pages.
2. Using a good seed set. We select a subset of pages (say, S) from the set of all available pages and inspect each page in S to find those with either high content or high link quality. We assign a link-quality value of 1 to a page p if it points to a large number of good pages and the number of bad pages it points to is below a certain threshold (here taken as 2). The remaining pages are assigned 0, and the values are then normalized.
3. Using a bad seed set (the opposite of the above): here we assign a negative value (-1) to the bad pages.
4. By merging the above two approaches, we assign initial values to the seed pages. A table like this can be obtained for Fig. 1:
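The merged initialization (method 4) can be sketched as follows. The page names, seed choices, and the normalization by the number of seeds are illustrative assumptions:

```python
# Sketch of seed-set initialization: +1 for good seeds, -1 for bad seeds,
# 0 for everything else, then normalize by the number of seed pages.
def init_quality(pages, good_seeds, bad_seeds):
    raw = {p: (1 if p in good_seeds else -1 if p in bad_seeds else 0)
           for p in pages}
    n_seeds = len(good_seeds) + len(bad_seeds)
    return {p: v / n_seeds for p, v in raw.items()} if n_seeds else raw

pages = ["A", "B", "C", "D"]
init = init_quality(pages, good_seeds={"A"}, bad_seeds={"D"})
```

With one good and one bad seed, page A starts at +0.5, page D at -0.5, and the unlabeled pages B and C at 0.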
Now we finally devise an algorithm that computes the link and content quality over a web graph using an iterative approach. It also takes the attenuation due to propagation into consideration, because trust (or penalty) is propagated along links.
Attenuation factor:
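The iterative propagation can be sketched as below. The concrete update rule, the attenuation constant beta, and the example graph are illustrative assumptions in the spirit of TrustRank-style propagation, not the paper's exact formulas:

```python
# Sketch of iterative quality propagation with an attenuation factor.
def propagate(pages, links, init, beta=0.85, iters=20):
    """Iteratively spread trust/penalty along links.

    QoC(p) is refreshed from p's parents (pages linking to p);
    QoL(p) is refreshed from p's children (pages p links to);
    each hop is attenuated by beta, so distant pages contribute less.
    """
    parents = {p: [q for q in pages if p in links.get(q, [])] for p in pages}
    qoc, qol = dict(init), dict(init)
    for _ in range(iters):
        new_qoc, new_qol = {}, {}
        for p in pages:
            ps, cs = parents[p], links.get(p, [])
            new_qoc[p] = init[p]
            if ps:  # average the combined quality of the parents
                new_qoc[p] += beta * sum(qoc[q] + qol[q] for q in ps) / (2 * len(ps))
            new_qol[p] = init[p]
            if cs:  # average the combined quality of the children
                new_qol[p] += beta * sum(qoc[q] + qol[q] for q in cs) / (2 * len(cs))
        qoc, qol = new_qoc, new_qol
    return qoc, qol

# Tiny example: A (good seed) -> B -> C <- D (bad seed).
pages = ["A", "B", "C", "D"]
links = {"A": ["B"], "B": ["C"], "C": [], "D": ["C"]}
init = {"A": 0.5, "B": 0.0, "C": 0.0, "D": -0.5}
qoc, qol = propagate(pages, links, init)
```

In this toy run, B ends up with a positive QoC (it inherits trust from the good seed A), while C ends up negative (the penalty flowing from the bad seed D outweighs the attenuated trust arriving via B).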
Results achieved when running this algorithm on Figure 1 are listed below:
1. Without seed set:
2. With seed set:
As listed below, there are several advantages as well as limitations of this approach:
Advantages:
1. This approach helps identify the trustability of a web page in terms of both its content and its link quality, based on their trust scores.
2. It does not impose any "Parent Penalty" on a page that points to a bad page, thus preventing a good page from being dragged into the set of bad pages.
3. It focuses not only on the quality of the page content but also on the quality of its outgoing links.
Limitations:
1. It is based on the assumption that good web pages don't point to bad pages very often.
2. Although TrustRank can find more reputable pages than traditional link-based ranking approaches, some good pages may still point to a bad page, so this method can still give bad pages a fair opportunity to be considered good.
3. TrustRank can cover only those pages which are in the propagation set.
Conclusion:
This paper surveyed the different ways in which an attacker can fool information retrieval algorithms, and addressed some of them by proposing the TrustRank metric.