Construction of Amharic information retrieval resources and corpora
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AU77CT9GK" target="_blank" >RIV/00216208:11320/25:U77CT9GK - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85197301968&doi=10.1007%2fs10579-024-09719-x&partnerID=40&md5=54b748f1a7c16f31baa227ead33e086d" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85197301968&doi=10.1007%2fs10579-024-09719-x&partnerID=40&md5=54b748f1a7c16f31baa227ead33e086d</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10579-024-09719-x" target="_blank" >10.1007/s10579-024-09719-x</a>
Alternative languages
Result language
English
Title in original language
Construction of Amharic information retrieval resources and corpora
Result description in original language
The development of information retrieval systems and natural language processing tools has been made possible for many natural languages by the availability of language resources and corpora. Although Amharic is the working language of Ethiopia, it remains an under-resourced language. To date, there are no adequate resources and corpora for evaluating Amharic ad-hoc retrieval: the existing ones are not publicly accessible and are not suitable for scientific evaluation of information retrieval systems. To promote the development of Amharic ad-hoc retrieval, we built an ad-hoc retrieval test collection that consists of raw text, morphologically annotated stem-based and root-based corpora, a stopword list, stem-based and root-based lexicons, and WordNet-like resources. We also created word embeddings from the raw text and from the morphologically segmented forms of the corpora. In building these resources and corpora, we paid close attention to the morphological characteristics of the language. This paper presents these Amharic resources and corpora, which we have made available to the research community for information retrieval tasks. The resources and corpora were also evaluated experimentally and by linguists. © The Author(s), under exclusive licence to Springer Nature B.V. 2024.
Classification
Type
J<sub>SC</sub> - Article in a periodical indexed in the SCOPUS database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
—
Other
Year of implementation
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Periodical name
Language Resources and Evaluation
ISSN
1574-020X
e-ISSN
—
Periodical volume
2024
Periodical issue within the volume
2024
Country of the periodical's publisher
US - United States of America
Number of pages
29
Pages from-to
1-29
UT WoS article code
—
Result EID in the Scopus database
2-s2.0-85197301968