Does Size Matter? - Comparing Evaluation Dataset Size for the Bilingual Lexicon Induction
The result's identifiers
Result code in IS VaVaI
RIV/00216224:14330/23:00133036 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F23%3A00133036)
Result on the web
https://raslan2022.nlp-consulting.net/
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Does Size Matter? - Comparing Evaluation Dataset Size for the Bilingual Lexicon Induction
Original language description
Cross-lingual word embeddings have been a popular approach for inducing bilingual lexicons. However, the evaluation of this task varies from paper to paper, and the gold standard dictionaries used for the evaluation are frequently criticised for containing mistakes. Although there have been efforts to unify the evaluation and the gold standard dictionaries, we propose a new property that should be considered when compiling an evaluation dataset: size. In this paper, we evaluate three baseline models on three diverse language pairs (Estonian-Slovak, Czech-Slovak, English-Korean) and experiment with evaluation datasets of various sizes: 200, 500, 1.5K, and 3K source words. Moreover, we compare the results with a manual error analysis. In this experiment, we show whether the size of an evaluation dataset impacts the results and how to select the ideal evaluation dataset size. We make our code and datasets publicly available.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Seventeenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2023
ISBN
9788026317937
ISSN
2336-4289
e-ISSN
—
Number of pages
10
Pages from-to
47-56
Publisher name
Tribun EU
Place of publication
Karlova Studánka
Event location
Karlova Studánka
Event date
Jan 1, 2023
Type of event by nationality
CST - National event
UT code for WoS article
—