All

What are you looking for?

All
Projects
Results
Organizations

Quick search

  • Projects supported by TA ČR
  • Excellent projects
  • Projects with the highest public support
  • Current projects

Smart search

  • That is how I find a specific +word
  • That is how I leave the -word out of the results
  • “That is how I can find the whole phrase”

Does Size Matter? - Comparing Evaluation Dataset Size for the Bilingual Lexicon Induction

The result's identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F23%3A00133036" target="_blank" >RIV/00216224:14330/23:00133036 - isvavai.cz</a>

  • Result on the web

    <a href="https://raslan2022.nlp-consulting.net/" target="_blank" >https://raslan2022.nlp-consulting.net/</a>

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    angličtina

  • Original language name

    Does Size Matter? - Comparing Evaluation Dataset Size for the Bilingual Lexicon Induction

  • Original language description

    Cross-lingual word embeddings have been a popular approach for inducing bilingual lexicons. However, the evaluation of this task varies from paper to paper, and gold standard dictionaries used for the evaluation are frequently criticised for occurring mistakes. Although there have been efforts to unify the evaluation and gold standard dictionaries, we propose a new property that should be considered when compiling an evaluation dataset: size. In this paper, we evaluate three baseline models on three diverse language pairs (Estonian-Slovak, Czech-Slovak, English-Korean) and experiment with evaluation datasets of various sizes: 200, 500, 1.5K, and 3K source words. Moreover, we compare the results with manual error analysis. In this experiment, we show whether the size of an evaluation dataset impacts the results and how to select the ideal evaluation dataset size. We make our code and datasets publicly available.

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

    I - Institucionalni podpora na dlouhodoby koncepcni rozvoj vyzkumne organizace

Others

  • Publication year

    2023

  • Confidentiality

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

  • Article name in the collection

    Proceedings of the Seventeenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2023

  • ISBN

    9788026317937

  • ISSN

    2336-4289

  • e-ISSN

  • Number of pages

    10

  • Pages from-to

    47-56

  • Publisher name

    Tribun EU

  • Place of publication

    Karlova Studánka

  • Event location

    Karlova Studánka

  • Event date

    Jan 1, 2023

  • Type of event by nationality

    CST - Celostátní akce

  • UT code for WoS article