
Revised version of PageRank: Exploring both Content and Link Quality for Anti-Spamming

Exploring both Content and Link Quality for Anti-Spamming


Abstract: Across the world, the internet has steadily become a primary source of information. Most internet users turn to the web for news, updates, events, and more, and the first tool almost anyone, even a layman, reaches for is a search engine. Web surfing has become very convenient because of it. On the one hand, this technology has put information retrieval at our fingertips. On the other hand, it has also given birth to various kinds of malpractice meant to mold our views artificially, which in turn worsens our web experience by degrading the quality of the retrieved results. Attackers achieve this by fooling search engines and taking advantage of loopholes in the search ranking algorithms; they compromise the flow of web information for their personal gain, damaging the credibility and reputation of the system. This makes “anti-spamming” a hot topic among researchers. This paper proposes an improvement to the PageRank algorithm for better search results: a metric called TrustRank, which ranks web pages as good or bad based on certain factors.


The methodology behind this algorithm can be summarised as follows:


CONTENT QUALITY (QoC(p)): This considers the pages that point to the page p. It depends on two metrics: the content quality and the link quality of p's parent pages. It focuses on the quality of the incoming links more than on their quantity. If p is pointed to by pages with good link quality, it can be considered a trustworthy page; otherwise, it may not be assigned a good trust rank.


LINK QUALITY (QoL(p)): In contrast to QoC(p), QoL(p) considers the content and link quality of p's child pages. If p points to credible pages, it can also be considered a good page.

The computation of these link and content qualities can be formalized through the following two metrics:


COMPUTING QoC(p) and QoL(p):
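As a hedged sketch (not the paper's exact notation), with out(q) denoting the out-degree of page q and the two qualities weighted equally, the two scores can be written in the following mutually recursive form:

\[
QoC(p) \;=\; \sum_{q \,:\, q \rightarrow p} \frac{QoC(q) + QoL(q)}{2 \cdot out(q)}
\]

\[
QoL(p) \;=\; \frac{1}{out(p)} \sum_{r \,:\, p \rightarrow r} \frac{QoC(r) + QoL(r)}{2}
\]

Here the first sum runs over the parents of p (QoC rewards being pointed to by good pages), while the second averages over the children of p (QoL rewards pointing to good pages).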


We now need to give each page an initial value of link and content quality. There could be several ways of doing so:


1. As in the PageRank algorithm, assign both values as 1/N, where N is the total number of pages.



2. Using a good seed set. We select a subset of pages (say, S) from the set of all available pages and check each page in S to find the ones with either high content quality or high link quality. We assign a link-quality value of 1 to a page if it points to a large number of good pages and the number of bad pages it points to stays below a certain threshold (here taken as 2). The remaining pages are assigned 0, and the values are then normalized.
3. Using a bad seed set (the opposite of the above). Here we assign a negative value (-1) to the bad pages.
4. By merging the two approaches above, we assign initial values to the seed pages. A table of these values can be obtained for Fig. 1; a short sketch of such a seed-based initialization is given after this list.
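As a rough illustration of this seed-based initialization, here is a minimal Python sketch. The toy graph, the seed labels, the threshold value, and the function name are all illustrative assumptions and are not taken from the paper.

# Illustrative seed-based initialization of QoC/QoL values (not the paper's code).
# The web graph is a dict mapping each page to the pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],   # D links to E, a known bad page
    "E": [],
}

good_seeds = {"A", "B", "C"}   # hand-checked reputable pages (assumed labels)
bad_seeds = {"E"}              # hand-checked spam pages (assumed labels)
BAD_LINK_THRESHOLD = 2         # a good seed keeps its score only if it links to fewer than 2 bad pages


def initial_scores(graph):
    """Assign +1 to good seeds with few bad out-links, -1 to bad seeds, 0 otherwise,
    then normalize the positive mass (mirroring PageRank's uniform 1/N start)."""
    qoc, qol = {}, {}
    for page, children in graph.items():
        bad_links = sum(1 for c in children if c in bad_seeds)
        if page in good_seeds and bad_links < BAD_LINK_THRESHOLD:
            qoc[page] = qol[page] = 1.0
        elif page in bad_seeds:
            qoc[page] = qol[page] = -1.0
        else:
            qoc[page] = qol[page] = 0.0
    positive_mass = sum(v for v in qoc.values() if v > 0)
    if positive_mass > 0:
        for page in qoc:
            if qoc[page] > 0:
                qoc[page] /= positive_mass
                qol[page] /= positive_mass
    return qoc, qol


print(initial_scores(graph))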
Now we finally devise an algorithm for computing the link and content quality over a web graph using an iterative approach. Because trust or penalty is propagated along links, the algorithm also takes the attenuation caused by this propagation into account; a sketch of the iteration is given below.
Attenuation factor: the trust (or penalty) a page contributes is dampened each time it is propagated one link further from its source.
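Below is a minimal Python sketch of one way such an attenuated, iterative propagation could look. The attenuation constant, iteration count, and exact update rule are assumptions made for illustration; they follow the spirit of the description above rather than the paper's published equations.

# Illustrative iterative propagation of content (QoC) and link (QoL) quality.
# Trust and penalty flow along links and are attenuated at every hop.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],
    "E": [],
}

ATTENUATION = 0.85   # assumed per-hop decay, analogous to PageRank's damping factor
ITERATIONS = 20


def propagate(graph, qoc0, qol0):
    """Iteratively refine QoC/QoL, blending propagated scores with the
    initial (seed) scores so that trust decays with distance from the seeds."""
    parents = {p: [] for p in graph}
    for page, children in graph.items():
        for child in children:
            parents[child].append(page)

    qoc, qol = dict(qoc0), dict(qol0)
    for _ in range(ITERATIONS):
        new_qoc, new_qol = {}, {}
        for p in graph:
            # QoC(p): quality flowing in from the pages that point to p.
            inflow = sum(
                (qoc[q] + qol[q]) / (2 * len(graph[q])) for q in parents[p]
            )
            new_qoc[p] = (1 - ATTENUATION) * qoc0[p] + ATTENUATION * inflow
            # QoL(p): average quality of the pages that p points to.
            children = graph[p]
            outflow = (
                sum((qoc[r] + qol[r]) / 2 for r in children) / len(children)
                if children else 0.0
            )
            new_qol[p] = (1 - ATTENUATION) * qol0[p] + ATTENUATION * outflow
        qoc, qol = new_qoc, new_qol
    return qoc, qol


# Uniform 1/N start (initialization option 1); seed-based scores could be plugged in instead.
n = len(graph)
uniform = {p: 1.0 / n for p in graph}
print(propagate(graph, uniform, uniform))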





Results achieved when running this algorithm on Figure 1 are:




Results:

1. Without seed set:





2. With seed set:








There are several advantages as well as limitations of this approach, as listed below:



Advantages:
1. This approach helps identify the trustworthiness of a web page in terms of both its content and its link quality, based on their trust scores.
2. It does not impose any "Parent Penalty" on a page that points to a bad page, thus preventing a good page from being pushed into the set of bad pages.
3. It focuses not only on the quality of the page's content but also on the quality of its outgoing links.


Limitations:
1. It is based on the assumption that good web pages do not point to bad pages very often.
2. Although TrustRank can find more reputable pages than traditional link-based ranking approaches, some good pages do point to bad pages, so the method can still give bad pages a fair opportunity to be considered good.
3. TrustRank can cover only those pages that lie in the propagation set.
Conclusion:


This paper gave an overview of the different ways in which an attacker can fool information retrieval algorithms and addressed some of them by proposing the TrustRank metric.

References:
[1] Pr0 - google’s pagerank 0, 2002. http://pr.efactory.de/epr0.shtml. 
[2] J. Cho and S. Roy. Impact of search engines on page popularity. In WWW2004, May 2004. 
[3] J. Cho, S. Roy, and R. E. Adams. Page quality: In search of an unbiased web ranking. In SIGMOD 2005, June 2005. 
[4] Z. Gyongyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link spam detection based on mass estimation. Technical report, Stanford University, 2005. 
[5] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. Technical report, Stanford University, 2004. 
[6] Z. Gyongyi and H. Garcia-Molina. Link spam alliances. Technical report, Stanford University, 2005. 
[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In VLDB 2004, 2004. 
[8] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. In SIGIR Forum, 36(2) 2002. 
[9] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. 
[10] P. T. Metaxas and J. DeStefano. Web spam, propaganda and trust. In AIRWeb 2005, May 2005.
[11] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998. 
[12] B. Wu and B. D. Davison. Identifying link farm spam pages. In WWW 2005, May 2005.
[13] L. Zhang, Y. Zhang, Y. Zhang, and X. Li. Exploring both content and link quality for anti-spamming. In CIT 2006, September 2006.
