
Cross Language Plagiarism Detection

Plagiarism: the practice of taking someone else's work and presenting it as one's own. Plagiarism detection, then, is determining whether a given document is plagiarised or not. Early detection was naive: people simply compared strings across two documents. As the technology grew, people grew smarter too, and started devising ways to copy a document while evading detection.

In this post I will talk about one of the hardest problems in this domain: cross-language plagiarism detection (CLPD). Someone takes a document, translates it into another language, and publishes it; our task is to detect whether it was copied from a source in another language. I will limit the discussion to documents, not program code. There are also methods for detecting whether a piece of code has been converted into another language (for example C -> Java), but the techniques for documents and code are completely different: code detection typically requires compiling down to a common representation such as assembly, while document detection relies on n-gram language models and related statistical methods.

I will talk about three current state-of-the-art methods for CLPD, but before that, some context on previous work in this task and the architecture we will follow.

Previous work

Potthast et al. [3] offered an overview of the prototypical CLPD process, which is as follows:


  1. Heuristic Retrieval: a set of potentially plagiarised source documents is collected from the full collection.
  2. Detailed Analysis: each candidate document is compared against the suspicious document section by section; if a section is found to be more similar than expected, a potential case of plagiarism is located.
  3. Heuristic Post-Processing: candidates that are not similar enough are discarded, and additional heuristics are applied to merge nearby candidates.
The above model needs a similarity metric to work with for the retrieval and analysis tasks. Below are the five families of models most used for cross-language similarity assessment:

  1. Lexicon-based systems: rely on lexical similarities between languages and linguistic influence.
  2. Thesaurus-based systems: rely on cross-language word mappings, e.g. EuroWordNet.
  3. Comparable-corpus-based systems: trained over comparable corpora, e.g. cross-language explicit semantic analysis (CL-ESA).
  4. Parallel-corpus-based systems: trained on parallel corpora, to find cross-language co-occurrences or to obtain translation models.
  5. Machine-translation-based systems: currently in vogue in CLPD; they reduce the task to a monolingual problem.
We will follow the above three-step architecture for CLPD; only the similarity metric will change to achieve better results, while the core algorithm stays the same.

Algorithm

For CL heuristic retrieval, we select the top 50 documents d' ∈ D' for each suspicious document dq according to sim(dq, d'). In the detailed analysis, dq and each d' are split into chunks of length w with step t; sim(sq, s') computes the similarity between two text fragments on the basis of CL-ASA, CL-CNG or T+MA, and for each sq the 5 most similar fragments s' ∈ S' are retrieved. The resulting pairs {sq, s'} are the input to the post-processing step: if the distance between two pairs is lower than a threshold, they are merged, and only candidates composed of at least 3 of the identified fragments are considered potentially plagiarised.

This is the core algorithm of our approach to plagiarism detection; the similarity between texts can be based on any similarity estimation model, as the sketch below shows.
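A minimal sketch of this pipeline in Python, assuming plain token lists as documents. The chunk length w, step t, merge distance and function names here are illustrative placeholders, not the original implementation; any of the three similarity models described below can be plugged in as sim.

```python
def chunks(tokens, w, t):
    """Split a token sequence into fragments of length w with step t."""
    return [tokens[i:i + w] for i in range(0, max(len(tokens) - w + 1, 1), t)]


def detect(dq, D, sim, w=50, t=25, top_docs=50, top_frags=5,
           merge_dist=2, min_frags=3):
    """Three-step CLPD pipeline: retrieval, detailed analysis, post-processing."""
    # 1. Heuristic retrieval: keep the top documents by whole-document similarity.
    candidates = sorted(D, key=lambda d: sim(dq, d), reverse=True)[:top_docs]

    detections = []
    for d in candidates:
        sq_frags = chunks(dq, w, t)
        s_frags = chunks(d, w, t)
        # 2. Detailed analysis: for each suspicious fragment sq, keep the
        # top_frags most similar source fragments.
        pairs = []
        for i, sq in enumerate(sq_frags):
            ranked = sorted(range(len(s_frags)),
                            key=lambda j: sim(sq, s_frags[j]), reverse=True)
            pairs.extend((i, j) for j in ranked[:top_frags])
        # 3. Post-processing: merge pairs whose suspicious-fragment offsets are
        # close, then keep only cases covering at least min_frags fragments.
        pairs.sort()
        merged = []
        for p in pairs:
            if merged and p[0] - merged[-1][-1][0] <= merge_dist:
                merged[-1].append(p)
            else:
                merged.append([p])
        detections += [(d, case) for case in merged if len(case) >= min_frags]
    return detections
```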

State of the Art Models

We will explore three state-of-the-art similarity models.

Cross-language character n-grams - CL-CNG

The text is case-folded, and punctuation marks and diacritics are removed. Multiple white-space and new-line characters are replaced by a single white-space. Moreover, a single white-space is inserted at the beginning and end of the text. Finally, the resulting text strings are encoded into character n-grams as depicted below, where "-" should be read as white-space and n = 4:

"El espíritu" -> "-el-", "el-e", "l-es", "-esp", "espi", "spir", "piri", "irit", "ritu", "itu-".
Similarity sim(dq, d') is estimated by a smoothed unigram language model of the form:

sim(dq, d') = Π_{q ∈ dq} [ λ · P(q | d') + (1 − λ) · P(q | C) ]

where P(q|d') is the document-level probability of term q in document d', P(q|C) is its probability over the entire collection C, and λ is the smoothing weight. We use n = 4 and λ = 0.7, as these values yielded the best results for English–Spanish in the original work.
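A minimal sketch of CL-CNG under these definitions, using Python's unicodedata for preprocessing. Estimating the collection model from the candidate documents themselves is an assumption of this sketch:

```python
import re
import unicodedata
from collections import Counter

def char_ngrams(text, n=4):
    """Case-fold, strip diacritics and punctuation, pad with spaces, emit n-grams."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", "", text)           # drop punctuation marks
    text = " " + re.sub(r"\s+", " ", text).strip() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cl_cng_sim(dq, d_prime, collection, n=4, lam=0.7):
    """Smoothed unigram language model over character n-grams."""
    d_counts = Counter(char_ngrams(d_prime, n))
    c_counts = Counter(g for doc in collection for g in char_ngrams(doc, n))
    d_total = sum(d_counts.values()) or 1
    c_total = sum(c_counts.values()) or 1
    sim = 1.0
    for q in char_ngrams(dq, n):
        sim *= lam * d_counts[q] / d_total + (1 - lam) * c_counts[q] / c_total
    return sim
```

For example, char_ngrams("El espíritu") reproduces the ten 4-grams shown above, with real spaces in place of "-".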

Cross-language alignment-based similarity analysis - CL-ASA

Similarity sim(dq, d') is computed by estimating the likelihood of d' being a translation of dq. It is an adaptation of Bayes' rule for machine translation that Barrón-Cedeño [4] defines as:

sim(dq, d') = M(d' | dq) · p(dq | d')

where M(d'|dq) is known as the length model (M). The length of dq's translation into L' is closely related to a translation length factor, defined as:

M(d' | dq) = exp( −0.5 · ( (|d'| − μ · |dq|) / (σ · |dq|) )² )

where μ and σ are the mean and standard deviation of the character-length ratio between actual translations from L into L'. If the length of d' is unexpected given dq, it receives a low likelihood.

In statistical MT, the conditional probability p(dq|d') is known as the translation model probability (TM), computed on the basis of a statistical bilingual dictionary. The adaptation of this model is defined as:

p(dq | d') = Π_{x ∈ dq} Σ_{y ∈ d'} p(x, y)

where p(x, y) is the probability that word x ∈ dq is a valid translation of word y ∈ d'.

CL-ASA is considered a parallel-corpus-based system: its parameters are learnt from a parallel corpus, and every potential translation of a word participates in the similarity assessment, which makes the model flexible.
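A minimal sketch of CL-ASA under the definitions above. The bilingual dictionary (a mapping from (x, y) word pairs to probabilities) and the length statistics μ and σ are assumed to have been learnt from a parallel corpus; the default values and the smoothing constant eps are placeholders of this sketch:

```python
import math

def length_model(dq, d_prime, mu, sigma):
    """Penalise candidates whose character length is unexpected for a translation of dq."""
    return math.exp(-0.5 * ((len(d_prime) - mu * len(dq)) / (sigma * len(dq))) ** 2)

def translation_model(dq_words, d_words, bilingual_dict, eps=1e-9):
    """p(dq|d'): product over words x in dq of the summed translation probs p(x, y)."""
    prob = 1.0
    for x in dq_words:
        # eps keeps the product from collapsing to zero on out-of-dictionary words
        prob *= max(sum(bilingual_dict.get((x, y), 0.0) for y in d_words), eps)
    return prob

def cl_asa_sim(dq, d_prime, bilingual_dict, mu=1.1, sigma=0.3):
    """Similarity = length model x translation model."""
    return (length_model(dq, d_prime, mu, sigma)
            * translation_model(dq.split(), d_prime.split(), bilingual_dict))
```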

Translation plus monolingual analysis - T+MA

The first step of this approach is translating all the documents into a common language, say from Spanish into English. Afterwards, the documents' terms are weighted with TF-IDF and the texts are compared using the cosine measure over a bag-of-words representation. When identifying specific plagiarised fragments, the original offsets of the Spanish documents are used.
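A minimal sketch of the monolingual-analysis step using scikit-learn; the translate argument stands in for whatever MT system is available and is not part of the original method's specification:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def t_ma_sim(dq, d_prime, translate):
    """Translate both texts into a common language, then compare with TF-IDF + cosine."""
    texts = [translate(dq), translate(d_prime)]
    tfidf = TfidfVectorizer().fit_transform(texts)  # TF-IDF weighted bag of words
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]
```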


Results

Different experiments were performed to compare the performance of each of the models above. Accuracy is reported in terms of precision and recall, and the models were evaluated on documents of different lengths (long, medium and short).

Experiment 1: detection of entirely plagiarised documents (results figure in [1]).

Experiment 2: identification of specific plagiarised fragments (results figure in [1]).
Conclusion

Different similarity estimation models can be plugged into the architecture proposed above. The strategy was tested extensively in a set of experiments reflecting different steps and scenarios of cross-language plagiarism detection, from the detection of entirely plagiarised documents to the identification of specific borrowed text fragments.

The similarity models showed remarkable performance when detecting plagiarism of entire documents, including further-paraphrased translations. When aiming at detecting specific borrowed fragments and their source, both short and further-paraphrased cases caused difficulties. Still, the precision of cross-language alignment-based similarity analysis was always high (for some types higher than 0.9); as a result, if it identifies a potential case of plagiarism, it is certainly worth analysing.

Future

As future work, researchers are aiming to improve the heuristic retrieval module, i.e., retrieving good potential source documents for a possible case of plagiarism. This is a complicated task as, to the best of the authors' knowledge, no large-scale cross-language corpus with the necessary characteristics exists.

References

[1]: https://ac.els-cdn.com/S0950705113002001/1-s2.0-S0950705113002001-main.pdf

[2]: https://arxiv.org/pdf/1705.08828.pdf

[3]: M. Potthast, B. Stein, A. Eiselt, A. Barrón-Cedeño, P. Rosso, Overview of the 1st International Competition on Plagiarism Detection, vol. 502, CEUR-WS.org, San Sebastian, Spain, 2009, pp. 1–9. <http://ceur-ws.org/Vol-502>. 

[4]: A. Barrón-Cedeño, P. Rosso, D. Pinto, A. Juan, On cross-lingual plagiarism analysis using a statistical model, in: B. Stein, E. Stamatatos, M. Koppel (Eds.), ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2008), vol. 377, CEUR-WS.org, Patras, Greece, 2008, pp. 9–13, <http://ceur-ws.org/Vol-377>.

[5]: A. Barrón-Cedeño, P. Rosso, E. Agirre, G. Labaka, Plagiarism detection across distant language pairs, in: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, 2010.

[6]: B. Stein, S. Meyer zu Eissen, M. Potthast, Strategies for retrieving plagiarized documents, in: C. Clarke, N. Fuhr, N. Kando, W. Kraaij, A. de Vries (Eds.), Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Amsterdam, The Netherlands, 2007, pp. 825–826



