Semantic and Similarity Measure Methods for Plagiarism Detection of Students' Assignments
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F16%3A86096395" target="_blank" >RIV/61989100:27240/16:86096395 - isvavai.cz</a>
Alternative codes found
RIV/61989100:27740/16:86096395
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-319-29504-6_12" target="_blank" >http://dx.doi.org/10.1007/978-3-319-29504-6_12</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-319-29504-6_12" target="_blank" >10.1007/978-3-319-29504-6_12</a>
Alternative languages
Result language
English
Title in the original language
Semantic and Similarity Measure Methods for Plagiarism Detection of Students' Assignments
Result description in the original language
This paper aims to detect semantic plagiarism in Czech texts. It combines a similarity measure previously used for text compression with a structured thesaurus of synonyms and a stemming algorithm to detect rewording and restructuring of Czech-language texts. From a 100 GB corpus, we extracted 884 files of B.A., M.A., and Ph.D. students' assignments, semester works, and theses in the Computer Science major. The extracted test data used in our initial experiment totaled 1.98 GB of plain text. The method is tested first on short texts and then applied to the longer texts of students' assignments. On short texts, the method detected paraphrased, semantically similar texts more accurately, but its accuracy was lower for identical texts with rearranged paragraphs. The experiment on the corpus of long texts (students' assignments and theses) shows a semantic plagiarism rate of 23.9%. However, manual inspection of the documents revealed some noise, caused by the same technical terms, scientific definitions, and bibliography references appearing in different documents. These results will be fine-tuned and optimized in the future by building a file-specific stop-word list, adding an exact-match method, and removing references and other standard text templates often used in certain parts of students' assignments and theses.
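The pipeline outlined in the description (synonym normalization, stemming, then a compression-derived similarity measure) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record does not name the exact similarity measure, so the normalized compression distance (NCD) over `zlib` stands in for it, and `SYNONYMS`, `SUFFIXES`, `stem`, `normalize`, and `similarity` are hypothetical toy stand-ins for the paper's structured thesaurus and Czech stemmer.

```python
import re
import zlib

# Hypothetical toy thesaurus: maps each synonym to one canonical word.
# The paper uses a structured Czech thesaurus; this dict is only a stand-in.
SYNONYMS = {"automobil": "auto", "vůz": "auto", "svižný": "rychlý"}

# Hypothetical suffix list for a crude stemmer (not a real Czech stemmer).
SUFFIXES = ("ami", "ech", "ové", "em", "ům", "y", "u", "a", "e")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> str:
    """Lowercase, tokenize, map synonyms to a canonical word, then stem."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(stem(SYNONYMS.get(w, w)) for w in words)


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: near 0 for similar, near 1 for unrelated."""
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    cab = compressed_size(xa + b" " + xb)
    return (cab - min(ca, cb)) / max(ca, cb)


def similarity(a: str, b: str) -> float:
    """Similarity of two texts after synonym and stem normalization."""
    return 1.0 - ncd(normalize(a), normalize(b))
```

In such a sketch, a reworded sentence whose synonyms map to the same canonical words normalizes to the same string as its source, so it scores higher than an unrelated sentence. Any decision threshold on `similarity` would have to be tuned on real data; none is given in the record.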
Classification
Type
D - Article in proceedings
CEP field
IN - Informatics
OECD FORD field
—
Result continuities
Project
—
Continuities
S - Specific research at universities
Others
Publication year
2016
Data confidentiality code
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in the proceedings
Advances in Intelligent Systems and Computing. Volume 427
ISBN
978-3-319-29503-9
ISSN
2194-5357
e-ISSN
—
Number of pages
8
Pages from-to
117-125
Publisher name
Springer Verlag
Place of publication
London
Event location
Paris
Event date
9. 9. 2015
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000371912400012