
Cross Language Plagiarism Detection

Plagiarism: the practice of taking someone else's work and presenting it as one's own. Plagiarism detection, then, is determining whether a given document is plagiarised or not. Early detection was naive: people simply compared strings across two documents. As the technology grew, people grew smarter too, and started devising ways to copy a document while evading detection.

In this post I will talk about one of the hardest problems in this domain: cross-language plagiarism detection (CLPD). Someone takes a document, translates it into another language, and publishes it; our task is to detect whether it was copied from a source in another language. I will limit the discussion to documents, not program code. There are also methods for detecting whether a piece of code has been converted into another language (for example C -> Java), but the techniques for documents and code are completely different: code detection typically requires compiling down to a common representation such as assembly, while document detection relies on n-gram language models and related statistical methods.

I will talk about three current state-of-the-art methods for CLPD, but before that, some context on previous work in this task and the architecture we will follow.

Previous work

Potthast et al. [3] offered an overview of the prototypical CLPD process, which is as follows:


  1. Heuristic Retrieval: a set of potentially plagiarised source documents is collected from the full collection.
  2. Detailed Analysis: each candidate document is compared against the suspicious document section by section; if a section is found to be more similar than expected, a potential case of plagiarism is located.
  3. Heuristic Post-Processing: candidates that are not similar enough are discarded, and additional heuristics are applied to merge nearby candidates.
The above model needs a similarity metric to work with for the retrieval and analysis tasks. Below are the five families of models most used for cross-language similarity assessment:

  1. Lexicon-based systems: rely on lexical similarities between languages and linguistic influence.
  2. Thesaurus-based systems: rely on cross-language word mappings, e.g. EuroWordNet.
  3. Comparable-corpus-based systems: trained over comparable corpora, e.g. cross-language explicit semantic analysis (CL-ESA).
  4. Parallel-corpus-based systems: trained on parallel corpora, to find cross-language co-occurrences or to obtain translation models.
  5. Machine-translation-based systems: currently in vogue in CLPD; they reduce the task to a monolingual problem.
We will follow the above three-step architecture for CLPD; only the similarity metric will change to achieve better results, while the core algorithm stays the same.

Algorithm

For CL heuristic retrieval, we select the top 50 documents d' ∈ D' for each suspicious document dq according to sim(dq, d'). In the detailed analysis, dq and each d' are split into chunks of length w with step t; sim(sq, s') computes the similarity between two text fragments on the basis of CL-ASA, CL-CNG or T+MA, and for each sq the 5 most similar fragments s' ∈ S' are retrieved. The resulting pairs {sq, s'} are the input to the post-processing step: if the distance between two pairs is lower than a threshold, they are merged, and only candidates composed of at least 3 of the identified fragments are considered potentially plagiarised.

This is the core algorithm of our approach to plagiarism detection; the similarity between texts can be based on any similarity estimation model, as the sketch below shows.
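A minimal sketch of this pipeline in Python, assuming plain token lists as documents. The chunk length w, step t, merge distance and function names here are illustrative placeholders, not the original implementation; any of the three similarity models described below can be plugged in as sim.

```python
def chunks(tokens, w, t):
    """Split a token sequence into fragments of length w with step t."""
    return [tokens[i:i + w] for i in range(0, max(len(tokens) - w + 1, 1), t)]


def detect(dq, D, sim, w=50, t=25, top_docs=50, top_frags=5,
           merge_dist=2, min_frags=3):
    """Three-step CLPD pipeline: retrieval, detailed analysis, post-processing."""
    # 1. Heuristic retrieval: keep the top documents by whole-document similarity.
    candidates = sorted(D, key=lambda d: sim(dq, d), reverse=True)[:top_docs]

    detections = []
    for d in candidates:
        sq_frags = chunks(dq, w, t)
        s_frags = chunks(d, w, t)
        # 2. Detailed analysis: for each suspicious fragment sq, keep the
        # top_frags most similar source fragments.
        pairs = []
        for i, sq in enumerate(sq_frags):
            ranked = sorted(range(len(s_frags)),
                            key=lambda j: sim(sq, s_frags[j]), reverse=True)
            pairs.extend((i, j) for j in ranked[:top_frags])
        # 3. Post-processing: merge pairs whose suspicious-fragment offsets are
        # close, then keep only cases covering at least min_frags fragments.
        pairs.sort()
        merged = []
        for p in pairs:
            if merged and p[0] - merged[-1][-1][0] <= merge_dist:
                merged[-1].append(p)
            else:
                merged.append([p])
        detections += [(d, case) for case in merged if len(case) >= min_frags]
    return detections
```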

State of the Art Models

We will explore three state-of-the-art similarity models.

Cross-language character n-grams - CL-CNG

The text is case-folded, and punctuation marks and diacritics are removed. Multiple white-space and new-line characters are replaced by a single white-space. Moreover, a single white-space is inserted at the beginning and end of the text. Finally, the resulting text strings are encoded into character n-grams as depicted below, where "-" should be read as white-space and n = 4:

"El espíritu" -> "-el-", "el-e", "l-es", "-esp", "espi", "spir", "piri", "irit", "ritu", "itu-".
Similarity sim(dq, d') is estimated by a smoothed unigram language model of the form:

sim(dq, d') = Π_{q ∈ dq} [ λ · P(q | d') + (1 − λ) · P(q | C) ]

where P(q|d') is the document-level probability of term q in document d', P(q|C) is its probability over the entire collection C, and λ is the smoothing weight. We use n = 4 and λ = 0.7, as these values yielded the best results for English–Spanish in the original work.
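A minimal sketch of CL-CNG under these definitions, using Python's unicodedata for preprocessing. Estimating the collection model from the candidate documents themselves is an assumption of this sketch:

```python
import re
import unicodedata
from collections import Counter

def char_ngrams(text, n=4):
    """Case-fold, strip diacritics and punctuation, pad with spaces, emit n-grams."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", "", text)           # drop punctuation marks
    text = " " + re.sub(r"\s+", " ", text).strip() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cl_cng_sim(dq, d_prime, collection, n=4, lam=0.7):
    """Smoothed unigram language model over character n-grams."""
    d_counts = Counter(char_ngrams(d_prime, n))
    c_counts = Counter(g for doc in collection for g in char_ngrams(doc, n))
    d_total = sum(d_counts.values()) or 1
    c_total = sum(c_counts.values()) or 1
    sim = 1.0
    for q in char_ngrams(dq, n):
        sim *= lam * d_counts[q] / d_total + (1 - lam) * c_counts[q] / c_total
    return sim
```

For example, char_ngrams("El espíritu") reproduces the ten 4-grams shown above, with real spaces in place of "-".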

Cross-language alignment-based similarity analysis - CL-ASA

Similarity sim(dq, d') is computed by estimating the likelihood of d' being a translation of dq. It is an adaptation of Bayes' rule for machine translation that Barrón-Cedeño [4] defines as:

sim(dq, d') = M(d' | dq) · p(dq | d')

where M(d'|dq) is known as the length model (M). The length of dq's translation into L' is closely related to a translation length factor, defined as:

M(d' | dq) = exp( −0.5 · ( (|d'| − μ · |dq|) / (σ · |dq|) )² )

where μ and σ are the mean and standard deviation of the character-length ratio between actual translations from L into L'. If the length of d' is unexpected given dq, it receives a low likelihood.

In statistical MT, the conditional probability p(dq|d') is known as the translation model probability (TM), computed on the basis of a statistical bilingual dictionary. The adaptation of this model is defined as:

p(dq | d') = Π_{x ∈ dq} Σ_{y ∈ d'} p(x, y)

where p(x, y) is the probability that word x ∈ dq is a valid translation of word y ∈ d'.

CL-ASA is considered a parallel-corpus-based system: its parameters are learnt from a parallel corpus, and every potential translation of a word participates in the similarity assessment, which makes the model flexible.
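A minimal sketch of CL-ASA under the definitions above. The bilingual dictionary (a mapping from (x, y) word pairs to probabilities) and the length statistics μ and σ are assumed to have been learnt from a parallel corpus; the default values and the smoothing constant eps are placeholders of this sketch:

```python
import math

def length_model(dq, d_prime, mu, sigma):
    """Penalise candidates whose character length is unexpected for a translation of dq."""
    return math.exp(-0.5 * ((len(d_prime) - mu * len(dq)) / (sigma * len(dq))) ** 2)

def translation_model(dq_words, d_words, bilingual_dict, eps=1e-9):
    """p(dq|d'): product over words x in dq of the summed translation probs p(x, y)."""
    prob = 1.0
    for x in dq_words:
        # eps keeps the product from collapsing to zero on out-of-dictionary words
        prob *= max(sum(bilingual_dict.get((x, y), 0.0) for y in d_words), eps)
    return prob

def cl_asa_sim(dq, d_prime, bilingual_dict, mu=1.1, sigma=0.3):
    """Similarity = length model x translation model."""
    return (length_model(dq, d_prime, mu, sigma)
            * translation_model(dq.split(), d_prime.split(), bilingual_dict))
```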

Translation plus monolingual analysis - T+MA

The first step of this approach is translating all the documents into a common language, say from Spanish into English. Afterwards, the documents' terms are weighted with TF-IDF and the texts are compared using the cosine measure over a bag-of-words representation. When identifying specific plagiarised fragments, the original offsets of the Spanish documents are used.
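A minimal sketch of the monolingual-analysis step using scikit-learn; the translate argument stands in for whatever MT system is available and is not part of the original method's specification:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def t_ma_sim(dq, d_prime, translate):
    """Translate both texts into a common language, then compare with TF-IDF + cosine."""
    texts = [translate(dq), translate(d_prime)]
    tfidf = TfidfVectorizer().fit_transform(texts)  # TF-IDF weighted bag of words
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]
```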


Results

Different experiments were performed to compare the performance of each of the models above. Accuracy is reported in terms of precision and recall, and the models were evaluated on documents of different lengths (long, medium and short).

Experiment 1: detection of entirely plagiarised documents (results figure in [1]).

Experiment 2: identification of specific plagiarised fragments (results figure in [1]).
Conclusion

Different similarity estimation models can be plugged into the architecture proposed above. The strategy was tested extensively in a set of experiments reflecting different steps and scenarios of cross-language plagiarism detection, from the detection of entirely plagiarised documents to the identification of specific borrowed text fragments.

The similarity models showed remarkable performance when detecting plagiarism of entire documents, including further-paraphrased translations. When aiming at detecting specific borrowed fragments and their source, both short and further-paraphrased cases caused difficulties. Still, the precision of cross-language alignment-based similarity analysis was always high (for some types higher than 0.9); as a result, if it identifies a potential case of plagiarism, it is certainly worth analysing.

Future

As future work, researchers are aiming to improve the heuristic retrieval module, i.e., retrieving good potential source documents for a possible case of plagiarism. This is a complicated task as, to the best of the authors' knowledge, no large-scale cross-language corpus with the necessary characteristics exists.

References

[1]: https://ac.els-cdn.com/S0950705113002001/1-s2.0-S0950705113002001-main.pdf

[2]: https://arxiv.org/pdf/1705.08828.pdf

[3]: M. Potthast, B. Stein, A. Eiselt, A. Barrón-Cedeño, P. Rosso, Overview of the 1st International Competition on Plagiarism Detection, vol. 502, CEUR-WS.org, San Sebastian, Spain, 2009, pp. 1–9. <http://ceur-ws.org/Vol-502>. 

[4]: A. Barrón-Cedeño, P. Rosso, D. Pinto, A. Juan, On cross-lingual plagiarism analysis using a statistical model, in: B. Stein, E. Stamatatos, M. Koppel (Eds.), ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2008), vol. 377, CEUR-WS.org, Patras, Greece, 2008, pp. 9–13, <http://ceur-ws.org/Vol-377>.

[5]: A. Barrón-Cedeño, P. Rosso, E. Agirre, G. Labaka, Plagiarism detection across distant language pairs, in: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, 2010.

[6]: B. Stein, S. Meyer zu Eissen, M. Potthast, Strategies for retrieving plagiarized documents, in: C. Clarke, N. Fuhr, N. Kando, W. Kraaij, A. de Vries (Eds.), Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Amsterdam, The Netherlands, 2007, pp. 825–826



