Cross-Lingual Plagiarism Detection Method
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AK6ESGMHU" target="_blank" >RIV/00216208:11320/22:K6ESGMHU - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1007/978-3-031-12285-9_13" target="_blank" >https://doi.org/10.1007/978-3-031-12285-9_13</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-12285-9_13" target="_blank" >10.1007/978-3-031-12285-9_13</a>
Alternative languages
Result language
English
Original language name
Cross-Lingual Plagiarism Detection Method
Original language description
In this paper, we describe a method for cross-lingual plagiarism detection for a distant language pair (Russian-English). All documents in a reference collection are split into fixed-size fragments. These fragments are indexed in a special inverted index that maps each word to a bit array, in which the $i$-th bit shows whether the $i$-th sentence contains that word. This index is used to retrieve candidate fragments. We also use the bit arrays stored in the index to assess the lexical similarity of query and candidate sentences. Before retrieval, the top keywords of a query document are mapped from one language to the other with the help of cross-lingual word embeddings. We also train a language-agnostic sentence encoder that helps compare sentence pairs that have few or no words in common. The combined similarity score of sentence pairs is used by a text alignment algorithm, which tries to find blocks of contiguous, similar sentence pairs. We introduce a dataset for evaluating this task: an automatically translated version of Paraplag (a monolingual dataset for plagiarism detection). The proposed method shows good performance on our dataset in terms of F1. We also evaluate the method on another publicly available dataset, on which it outperforms previously reported results.
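As an illustration of the word-to-bit-array index mentioned in the description, here is a minimal Python sketch. It is not the authors' implementation: the whitespace tokenization, the use of a Python integer as the bit array, and the function names `build_index` and `lexical_overlap` are assumptions made for this example, and the cross-lingual keyword mapping and sentence encoder are left out.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each word to an int used as a bit array.

    Bit i of index[word] is set when sentence i contains the word.
    Tokenization by lowercase whitespace split is an assumption.
    """
    index = defaultdict(int)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            index[word] |= 1 << i
    return dict(index)


def lexical_overlap(index, query_words, n_sentences):
    """Count, per indexed sentence, how many distinct query words it contains."""
    counts = [0] * n_sentences
    for word in set(query_words):
        bits = index.get(word, 0)
        i = 0
        while bits:
            if bits & 1:
                counts[i] += 1
            bits >>= 1
            i += 1
    return counts


# Hypothetical fragment of three sentences.
fragment = ["the cat sat on the mat", "a dog barked", "the dog sat"]
index = build_index(fragment)
counts = lexical_overlap(index, ["dog", "sat"], len(fragment))
```

Here `counts` holds one overlap count per sentence of the fragment, so the sentence sharing the most query words can be picked as a lexical candidate before any embedding-based comparison.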
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2022
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Data Analytics and Management in Data Intensive Domains
ISBN
978-3-031-12285-9
ISSN
—
e-ISSN
—
Number of pages
16
Pages from-to
207-222
Publisher name
Springer International Publishing
Place of publication
—
Event location
Cham
Event date
Jan 1, 2022
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—