Hapax remains: Regularity of low-frequency words in authorial texts

Result identifiers

  • Result code in IS VaVaI

    RIV/61989592:15210/21:73611800 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989592%3A15210%2F21%3A73611800)

  • Result on the web

    https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqab077/6413835

  • DOI - Digital Object Identifier

    10.1093/llc/fqab077 (http://dx.doi.org/10.1093/llc/fqab077)

Alternative languages

  • Result language

    English

  • Original language name

    Hapax remains: Regularity of low-frequency words in authorial texts

  • Original language description

    This article highlights a common oversight in the literature: the regular occurrence of low-frequency words (hapax legomena) in the texts of specific authors. The oversight arises from the linguistic assumption that low-frequency words occur non-systematically and depend on context in extensive texts, and from the tendency of SVM methods to mark low-frequency words as irrelevant compared with the more frequent lexicon (e.g. Boukhaled, M. A. and Ganascia, J.-G. (2015). Using function words for authorship attribution: bag-of-words vs. sequential rules. In The 11th International Workshop on Natural Language Processing and Cognitive Science, October 2014, Venice, Italy. de Gruyter, Natural Language Processing and Cognitive Science Proceedings 2014, pp. 115–122). Many approaches to authorship attribution are based on the n most frequent ‘function words’, which (1) are grammatically essential, frequent, and therefore included in each text; (2) are not affected by the topic of the text; and (3) reflect the unintentional linguistic activity of the author (Binongo, J. N. G. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2): 9–17). Hapax legomena meet these conditions as well, except for frequency (Baayen, H., van Halteren, H., and Tweedie, F. (1996). Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3): 121–32). We test the hypothesis that hapax legomena can serve authorship attribution when only hapaxes are selected from whole texts (or randomly selected tokens of hapaxes) and a specifically pre-processed input (the eigendecomposition of a cosine distance matrix) is passed to the SVM classifier. The method was evaluated on texts from fourteen Czech authors (yielding ninety-one pairs in total) and on the data set of Evert, S., Proisl, T., Jannidis, F. et al. (2017), Understanding and explaining Delta measures for authorship attribution, Digital Scholarship in the Humanities, 32(2): 4–16, and proved itself a suitable tool for identifying the authors of previously unknown texts. Our method identifies a sparse network of regular occurrences of low-frequency words in different authors’ texts.

  • Czech name

  • Czech description
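
The first steps of the pipeline summarised in the original-language description above (selecting hapaxes, then building a cosine distance matrix) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the tokenizer, the representation of each text as a binary vector over its own hapaxes, the function names, and the toy texts are all assumptions made here. The later steps mentioned in the description (eigendecomposition of the matrix and the SVM classifier) are not shown.

```python
import math
import re
from collections import Counter
from itertools import combinations


def hapaxes(text):
    """Return the set of words that occur exactly once in `text`.

    Assumption: a simple lowercase word tokenizer; the article's
    actual tokenization is not specified in the record.
    """
    counts = Counter(re.findall(r"\w+", text.lower()))
    return {word for word, n in counts.items() if n == 1}


def cosine_distance(a, b):
    """Cosine distance between two hapax sets viewed as binary vectors."""
    if not a or not b:
        return 1.0
    # For binary vectors the dot product is the size of the overlap
    # and each norm is the square root of the set size.
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def distance_matrix(texts):
    """Symmetric matrix of pairwise cosine distances between texts."""
    sets = [hapaxes(t) for t in texts]
    n = len(sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = cosine_distance(sets[i], sets[j])
    return matrix


# Toy texts, invented purely for illustration.
toy_texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird flew over a hill",
]
D = distance_matrix(toy_texts)

# Fourteen authors compared pairwise give C(14, 2) = 91 pairs,
# matching the count stated in the description.
author_pairs = math.comb(14, 2)
```

In this sketch, texts that share more hapaxes end up closer together; a matrix like `D` is the kind of input the description says is pre-processed further before classification.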

Classification

  • Type

    Jimp - Article in a specialist periodical included in the Web of Science database

  • CEP classification

  • OECD FORD branch

    60203 - Linguistics

Result continuities

  • Project

  • Continuities

    O - Operational programme project

Others

  • Publication year

    2021

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Name of the periodical

    Digital Scholarship in the Humanities

  • ISSN

    2055-7671

  • e-ISSN

    2055-768X

  • Volume of the periodical

    37

  • Issue of the periodical within the volume

    3

  • Country of publishing house

    GB - UNITED KINGDOM

  • Number of pages

    23

  • Pages from-to

    693-715

  • UT code for WoS article

    000763924000001

  • EID of the result in the Scopus database

    2-s2.0-85141297978