Multilingual Stylometry. The Influence of Language on the Performance of Authorship Attribution using Corpora from the European Literary Text Collection (ELTeC)
Result identifiers
Result code in IS VaVaI
RIV/68378068:_____/24:00603253 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68378068%3A_____%2F24%3A00603253)
Result on the web
https://ceur-ws.org/Vol-3834/paper9.pdf
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in original language
Multilingual Stylometry. The Influence of Language on the Performance of Authorship Attribution using Corpora from the European Literary Text Collection (ELTeC)
Result description in original language
Stylometric authorship attribution is concerned with the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequency of a selected set of features that represent the texts. The parameters of the analysis, such as feature selection and the choice of similarity measure or classification algorithm, have received significant attention in the past. Two additional key factors for the performance and reliability of stylometric methods, however, have so far received less attention, namely corpus composition and corpus language. As a first step, the aim of this study is to investigate the influence of language on the performance of stylometric authorship attribution. We address this question using four different corpora derived from the European Literary Text Collection (ELTeC). We use machine translation to obtain each corpus in the other three languages. We find that, as expected, the attribution accuracy varies between language-based corpora, and that translated corpora, on average, display a lower attribution accuracy compared to their counterparts in the original language. Overall, our study contributes to a better understanding of stylometric methods of authorship attribution.
Title in English
Multilingual Stylometry. The Influence of Language on the Performance of Authorship Attribution using Corpora from the European Literary Text Collection (ELTeC)
Result description in English
Stylometric authorship attribution is concerned with the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequency of a selected set of features that represent the texts. The parameters of the analysis, such as feature selection and the choice of similarity measure or classification algorithm, have received significant attention in the past. Two additional key factors for the performance and reliability of stylometric methods, however, have so far received less attention, namely corpus composition and corpus language. As a first step, the aim of this study is to investigate the influence of language on the performance of stylometric authorship attribution. We address this question using four different corpora derived from the European Literary Text Collection (ELTeC). We use machine translation to obtain each corpus in the other three languages. We find that, as expected, the attribution accuracy varies between language-based corpora, and that translated corpora, on average, display a lower attribution accuracy compared to their counterparts in the original language. Overall, our study contributes to a better understanding of stylometric methods of authorship attribution.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
60206 - Specific literatures
Result continuities
Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in the proceedings
CHR 2024: Computational Humanities Research 2024: Proceedings of the Computational Humanities Research Conference 2024
ISBN
—
ISSN
1613-0073
e-ISSN
—
Number of pages
23
Pages from-to
386-408
Publisher name
Technical University & CreateSpace Independent Publishing
Place of publication
Aachen
Event location
Aarhus
Event date
4 December 2024
Event type by nationality
EUR - European event
UT code of the article in WoS
—