Plagiarism is the practice of taking someone else's work and presenting it as one's own. Plagiarism detection, then, is the task of determining whether a given document is plagiarised or not. Early detection was quite naive: people simply compared strings across two documents. As the technology grew, people grew smarter too, and they started coming up with ways to copy a document while going undetected.
In this post I will talk about a particular type of plagiarism detection that is essentially one of the hardest in this domain: cross-language plagiarism detection (CLPD). People take a document, translate it into another language, and then publish it; our task is to detect whether it was copied from a source in another language. I will limit myself to documents, not programming code. There are also ways to detect whether a piece of code has been translated into another language (e.g. C -> Java), but the methods for documents and code are completely different: code detection requires converting the code to assembly, while documents call for n-gram language models and related techniques.
I will talk about three current state-of-the-art methods for CLPD, but before that I should cover the previous work on this task and the architecture we will follow.
Previous work
Potthast et al. [3] offered an overview of the prototypical CLPD process. It consists of three steps:
- Heuristic retrieval: for a suspicious document, a set of candidate source documents is retrieved from the whole collection.
- Detailed analysis: every candidate is compared against the suspicious document section by section; if a section is found to be more similar than expected, a potential case of plagiarism is located.
- Heuristic post-processing: candidates that are not similar enough are discarded, and additional heuristics are applied to merge nearby candidates.
The model above needs a similarity metric to work with for the retrieval and analysis steps. Below are the five families of models most commonly used for cross-language similarity assessment:
- Lexicon-based systems: rely on lexical similarities between languages and linguistic influence.
- Thesaurus-based systems: rely on cross-language word mappings, e.g. EuroWordNet.
- Comparable corpus-based systems: trained on comparable corpora, e.g. cross-language explicit semantic analysis (CL-ESA).
- Parallel corpus-based systems: trained on parallel corpora to find cross-language co-occurrences or to obtain translation models.
- Machine translation-based systems: these models are in vogue in CLPD; they reduce the task to a monolingual problem.
We will follow the three-step architecture above for CLPD; only the similarity metric will change in order to achieve better results, while the core algorithm stays the same.
Algorithm
For the CL heuristic retrieval step, we select the top 50 documents d' ∈ D' for each suspicious document dq according to sim(dq, d'). At the detailed analysis step, dq (and each d') is split into chunks of length w with step t. sim(sq, s') computes the similarity between two text fragments on the basis of CL-ASA, CL-CNG, or T+MA, and the 5 most similar fragments s' ∈ S' w.r.t. sq are retrieved. The resulting pairs {sq, s'} are the input for the post-processing step: if the distance between two pairs is lower than a threshold, the pairs are merged, and only those candidates that are composed of at least 3 of the identified fragments are considered potentially plagiarised.
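To make this concrete, here is a minimal Python sketch of the three-phase pipeline under the parameters stated above (top 50 documents, top 5 fragments, at least 3 fragments per case). The function names, the chunking defaults w=500 and t=250, and the merge_gap heuristic are illustrative assumptions rather than the authors' implementation; the generic sim callback is exactly where CL-CNG, CL-ASA, or T+MA would be plugged in.

```python
def chunks(text, w, t):
    """Split a text into fragments of length w, sliding with step t."""
    return [text[i:i + w] for i in range(0, max(len(text) - w, 0) + 1, t)]

def clpd_pipeline(dq, D_prime, sim, w=500, t=250,
                  k_docs=50, k_frags=5, merge_gap=2, min_frags=3):
    # 1) Heuristic retrieval: keep the k_docs candidate documents d' that are
    #    most similar to the suspicious document dq.
    candidates = sorted(D_prime, key=lambda d: sim(dq, d), reverse=True)[:k_docs]

    # 2) Detailed analysis: for every suspicious fragment sq, retrieve the
    #    k_frags most similar candidate fragments s'.
    frags = [s for d in candidates for s in chunks(d, w, t)]
    pairs = []
    for i, sq in enumerate(chunks(dq, w, t)):
        best = sorted(frags, key=lambda s: sim(sq, s), reverse=True)[:k_frags]
        pairs.extend((i, s) for s in best)

    # 3) Post-processing: merge pairs whose suspicious fragments lie close to
    #    each other, and keep only cases supported by at least min_frags fragments.
    cases, current = [], []
    for i, s in sorted(pairs, key=lambda p: p[0]):
        if current and i - current[-1][0] > merge_gap:
            cases.append(current)
            current = []
        current.append((i, s))
    if current:
        cases.append(current)
    return [case for case in cases if len(case) >= min_frags]
```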
This is the core algorithm of our approach to plagiarism detection; the similarity between texts can be based on any similarity estimation model.
State of the Art Models
We will explore three state-of-the-art similarity models.
Cross language character n-gram - CL-CNG
The text is case-folded, and punctuation marks and diacritics are removed. Multiple white-space and new-line characters are replaced by a single white-space. Moreover, a single white-space is inserted at the beginning and end of the text. Finally, the resulting text strings are encoded into character n-grams as depicted below, where "-" should be read as white-space and n = 4:

"El espíritu" -> "-el-", "el-e", "l-es", "-esp", "espi", "spir", "piri", "irit", "ritu", "itu-"
Similarity sim(dq, d') is estimated with a unigram language model over these character n-grams, where P(q|d') is the document-level probability of term (n-gram) q in document d' and C denotes the entire collection, used for smoothing. We use n = 4 and a smoothing parameter of 0.7, as these values yielded the best results for English–Spanish in the original work.
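The sketch below shows the CL-CNG preprocessing, character 4-gram extraction, and one plausible reading of the smoothed unigram model described above (document probabilities interpolated with collection probabilities, weight 0.7). The preprocessing and n-gram steps follow the description; the exact interpolation formula and the direction of the weight are my assumptions.

```python
import math
import re
import unicodedata
from collections import Counter

def cl_cng_preprocess(text):
    """Case-fold, remove diacritics and punctuation, collapse whitespace,
    and pad with a single leading and trailing space."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))  # drop diacritics
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation marks
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace and newlines
    return f" {text} "

def char_ngrams(text, n=4):
    text = cl_cng_preprocess(text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def cl_cng_similarity(dq, d_prime, collection, n=4, smoothing=0.7):
    """Score dq against d' with a smoothed unigram model over character n-grams:
    P(q|d') is interpolated with the collection probability P(q|C).
    (How the 0.7 weight enters the interpolation is an assumption.)"""
    doc = Counter(char_ngrams(d_prime, n))
    coll = Counter(g for d in collection for g in char_ngrams(d, n))
    doc_total, coll_total = sum(doc.values()), sum(coll.values())
    score = 0.0
    for q in char_ngrams(dq, n):
        p_doc = doc[q] / doc_total if doc_total else 0.0
        p_coll = coll[q] / coll_total if coll_total else 0.0
        p = smoothing * p_doc + (1 - smoothing) * p_coll
        score += math.log(p) if p > 0 else float("-inf")  # sum of logs to avoid underflow
    return score
```

For instance, char_ngrams("El espíritu") yields exactly the ten 4-grams listed above, with "-" rendered as an actual space.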
Cross language alignment-based similarity analysis - CL-ASA
Similarity sim(dq, d') is computed by estimating the likelihood that d' is a translation of dq. It is an adaptation of Bayes' rule for machine translation that Barrón-Cedeño et al. [4] define as the product of a translation model and a length model. The length model reflects the fact that the length of dq's translation into L' is closely tied to a translation length factor, where μ and σ are the mean and standard deviation of the character-length factor between actual translations from L into L'. If the length of d' is unexpected given dq, it receives a low likelihood.
CL-ASA is considered a parallel corpus-based system: its parameters are learnt from a parallel corpus, and every potential translation of a word participates in the similarity assessment, which makes it flexible.
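A rough sketch of how the two CL-ASA components could be combined: a length model driven by the character-length ratio between dq and d', and a translation model that accumulates bilingual dictionary probabilities over word pairs. The Gaussian shape of the length model and the bilingual_dict structure are assumptions based on the description above and on Barrón-Cedeño et al. [4], not a verbatim reproduction of their formulas.

```python
import math

def length_model(dq, d_prime, mu, sigma):
    """Likelihood that d' has a plausible length for a translation of dq.
    mu and sigma are the mean and standard deviation of the character-length
    factor observed between actual translations from L into L'."""
    ratio = len(d_prime) / max(len(dq), 1)
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)

def translation_model(dq_tokens, d_prime_tokens, bilingual_dict):
    """Accumulate translation probabilities p(x, y) over word pairs.
    bilingual_dict maps (source_word, target_word) -> probability and is
    the part learnt in advance from a parallel corpus."""
    return sum(bilingual_dict.get((x, y), 0.0)
               for x in dq_tokens for y in d_prime_tokens)

def cl_asa_similarity(dq, d_prime, bilingual_dict, mu, sigma):
    """Translation likelihood of d' given dq, weighted by the length model."""
    return (length_model(dq, d_prime, mu, sigma)
            * translation_model(dq.split(), d_prime.split(), bilingual_dict))
```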
Translation plus monolingual analysis - T+MA
The first step of this approach is translating all documents into a common language, say from Spanish into English. Afterwards, the documents' terms are weighted with TF-IDF and the texts are compared using the cosine measure over a bag-of-words representation. When identifying specific plagiarised fragments, the original offsets of the Spanish documents are used.
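Once everything is in the common language, the monolingual comparison is a standard TF-IDF plus cosine setup. Below is a minimal scikit-learn sketch, assuming the inputs have already been machine-translated (the translation step itself is not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tma_similarity(dq_translated, candidates_translated):
    """Compare the translated suspicious document against translated candidates
    using TF-IDF-weighted bag-of-words vectors and the cosine measure."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([dq_translated] + candidates_translated)
    # Row 0 is dq; the remaining rows are the candidate documents d'.
    return cosine_similarity(matrix[0], matrix[1:])[0]
```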
Results
Different experiments were performed to compare the performance of each of the models proposed above. Accuracy is reported in terms of precision and recall, and the models were compared on documents of different lengths (long, medium, and short).
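For reference, precision and recall here follow the usual information-retrieval definitions applied to the detected cases; see [1] for the exact evaluation setup used in the experiments:

```latex
\text{Precision} = \frac{|\text{detected} \cap \text{plagiarised}|}{|\text{detected}|}
\qquad
\text{Recall} = \frac{|\text{detected} \cap \text{plagiarised}|}{|\text{plagiarised}|}
```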
Experiment 1: detection of entirely plagiarised documents (see [1] for the detailed figures).
Experiment 2: identification of specific borrowed fragments and their sources (see [1] for the detailed figures).
Conclusion
Different similarity estimation models can be plugged into the architecture proposed above. The strategy was tested extensively in a set of experiments reflecting different steps and scenarios of cross-language plagiarism detection: from the detection of entirely plagiarised documents to the identification of specific borrowed text fragments.
The similarity models showed remarkable performance when detecting plagiarism of entire documents, including further paraphrased translations. When aiming at detecting specific borrowed fragments and their sources, both short and heavily paraphrased cases caused difficulties. Still, the precision of cross-language alignment-based similarity analysis (CL-ASA) was always high (for some types higher than 0.9); as a result, if it identifies a potential case of plagiarism, it is certainly worth analysing.
Future
As future work, researchers are aiming to improve the heuristic retrieval module, i.e., retrieving good potential source documents for a possible case of plagiarism. This is a complicated task because, to the best of our knowledge, no large-scale cross-language corpus with the necessary characteristics exists.
References
[1]: https://ac.els-cdn.com/S0950705113002001/1-s2.0-S0950705113002001-main.pdf?_tid=7b55d992-b4b1-11e7-846d-00000aacb35e&acdnat=1508406183_035ee3d807023e5ed232135b75305631
[2]: https://arxiv.org/pdf/1705.08828.pdf
[3]: M. Potthast, B. Stein, A. Eiselt, A. Barrón-Cedeño, P. Rosso, Overview of the 1st International Competition on Plagiarism Detection, vol. 502, CEUR-WS.org, San Sebastian, Spain, 2009, pp. 1–9. <http://ceur-ws.org/Vol-502>.
[4]: A. Barrón-Cedeño, P. Rosso, D. Pinto, A. Juan, On cross-lingual plagiarism analysis using a statistical model, in: B. Stein, E. Stamatatos, M. Koppel (Eds.), ECAI 2008 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2008), vol. 377, CEUR-WS.org, Patras, Greece, 2008, pp. 9–13, <http://ceur-ws.org/Vol-377>.
[5]: A. Barrón-Cedeño, P. Rosso, E. Agirre, G. Labaka, Plagiarism detection across distant language pairs, in: [26], 2010.
[6]: B. Stein, S. Meyer zu Eissen, M. Potthast, Strategies for retrieving plagiarized documents, in: C. Clarke, N. Fuhr, N. Kando, W. Kraaij, A. de Vries (Eds.), Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Amsterdam, The Netherlands, 2007, pp. 825–826.