From web to dialects: how to enhance non-standard Russian lects lemmatisation?

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/23:UBLWDWCL - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3AUBLWDWCL)

  • Result on the web

    https://aclanthology.org/2023.clasp-1.17/

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    From web to dialects: how to enhance non-standard Russian lects lemmatisation?

  • Original language description

    "The growing need for using small data distinguished by a set of distributional properties becomes all the more apparent in the era of large language models (LLM). In this paper, we show that for the lemmatisation of the web as corpora texts, heterogeneous social media texts, and dialect texts, the morphological tagging by a model trained on a small dataset with specific properties generally works better than the morphological tagging by a model trained on a large dataset. The material we use is Russian non-standard texts and interviews with dialect speakers. The sequence-to-sequence lemmatisation with the help of taggers trained on smaller linguistically aware datasets achieves the average results of 85 to 90 per cent. These results are consistently (but not always), by 1-2 per cent. higher than the results of lemmatisation with the help of the large-dataset-trained taggers. We analyse these results and outline the possible further research directions."

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

Others

  • Publication year

    2023

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    "Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)"

  • ISBN

    979-8-89176-000-4

  • ISSN

    2002-9764

  • e-ISSN

  • Number of pages

    9

  • Pages from-to

    167-175

  • Publisher name

    Association for Computational Linguistics

  • Place of publication

    Gothenburg, Sweden

  • Event location

    Gothenburg, Sweden

  • Event date

    Jan 1, 2023

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article