Hapax remains: Regularity of low-frequency words in authorial texts

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989592%3A15210%2F21%3A73611800" target="_blank" >RIV/61989592:15210/21:73611800 - isvavai.cz</a>
Result on the web
<a href="https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqab077/6413835" target="_blank" >https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqab077/6413835</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1093/llc/fqab077" target="_blank" >10.1093/llc/fqab077</a>

Alternative languages

Result language
English
Title in the original language
Hapax remains: Regularity of low-frequency words in authorial texts
Result description in the original language
This article highlights a common oversight in the literature: the regular occurrence of low-frequency words (hapax legomena) in the texts of specific authors. The oversight arises from the linguistic assumption that low-frequency words occur non-systematically and depend on context in extensive texts, and from the tendency of SVM methods to mark low-frequency words as irrelevant compared to the more frequent lexicon (e.g. Boukhaled, M. A. and Ganascia, J.-G. (2015). Using function words for authorship attribution: bag-of-words vs. sequential rules. In The 11th International Workshop on Natural Language Processing and Cognitive Science, October 2014, Venice, Italy. de Gruyter, Natural Language Processing and Cognitive Science Proceedings 2014, pp. 115–122). Many approaches to authorship attribution are based on the n most frequent ‘function words’, which (1) are grammatically essential, frequent, and therefore included in each text; (2) are not affected by the topic of the text; and (3) reflect the unintentional linguistic activity of the author (Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2): 9–17). Hapax legomena meet these conditions as well, except for frequency (Baayen, H., van Halteren, H., and Tweedie, F. (1996). Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3): 121–32). We test the hypothesis that hapax legomena can serve authorship attribution when only hapaxes are selected from whole texts (or tokens of hapaxes are selected at random) and a specifically pre-processed input (the eigendecomposition of a cosine distance matrix) is passed to the SVM classifier. The method was evaluated on the attribution of texts by fourteen Czech authors (yielding ninety-one pairs in total) and on the data set of Evert et al. (Evert, S., Proisl, T., Jannidis, F. et al. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities, 32(2): 4–16), and proved a suitable tool for identifying the authors of previously unknown texts. Our method identifies a sparse network of regular occurrences of low-frequency words in different authors’ texts.
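The first steps named in the description (selecting hapaxes and building a cosine distance matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenization, the binary set representation of hapaxes, the helper names `hapaxes`, `cosine_distance` and `distance_matrix`, and the toy sentences are all assumptions made for illustration. The eigendecomposition and SVM steps are left out. The last line only checks the arithmetic that fourteen authors yield ninety-one unordered pairs.

```python
from collections import Counter
from itertools import combinations
from math import comb, sqrt


def hapaxes(text):
    """Return the set of words occurring exactly once in `text`.

    Assumption: lowercased whitespace tokenization stands in for
    whatever pre-processing the article actually uses.
    """
    counts = Counter(text.lower().split())
    return {word for word, n in counts.items() if n == 1}


def cosine_distance(a, b):
    """Cosine distance between two sets treated as binary vectors."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / sqrt(len(a) * len(b))


def distance_matrix(texts):
    """Symmetric matrix of pairwise cosine distances between hapax sets."""
    sets = [hapaxes(t) for t in texts]
    n = len(sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = cosine_distance(sets[i], sets[j])
    return matrix


# Toy inputs, invented for illustration only.
toy_texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird flew over a hill",
]
toy_matrix = distance_matrix(toy_texts)

# Fourteen authors give C(14, 2) unordered author pairs.
author_pairs = comb(14, 2)
```

In this toy input the first two sentences share two of their four hapaxes, so their distance is 0.5, while the third shares none with either and sits at distance 1.0.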

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
60203 - Linguistics

Result linkages

Project
—
Linkages
O - Operational programme project

Other

Year of implementation
2021
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Periodical name
Digital Scholarship in the Humanities
ISSN
2055-7671
e-ISSN
2055-768X
Periodical volume
37
Issue number within the volume
3
Publisher's country
GB - United Kingdom of Great Britain and Northern Ireland
Number of pages
23
Pages from-to
693-715
UT WoS code of the article
000763924000001
EID of the result in the Scopus database
2-s2.0-85141297978
