Set of Ethiopian Web Corpora

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F16%3A00096851" target="_blank" >RIV/00216224:14330/16:00096851 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—

Result language
English
Original language name
Set of Ethiopian Web Corpora
Original language description
A set of 5 corpora for 4 Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed existing corpus with part-of-speech annotation. The released version includes cleaning (especially of numeric expressions) and unification of two versions in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).
Czech name
—
Czech description
—

Project
<a href="/en/project/7F14047" target="_blank" >7F14047: Harvesting big text data for under-resourced languages</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)

Publication year
2016
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Internal product ID
habcorp2016
Technical parameters
Amharic WIC corpus, 200 thousand tokens; amWaC16 Amharic corpus, 20 million tokens; orWaC16 Oromo corpus, 5.1 million tokens; soWaC16 Somali corpus, 80 million tokens; tiWaC16 Tigrinya corpus, 2.5 million tokens.
Economical parameters
Only small text corpora were available so far; this result provides corpora an order of magnitude larger, and their size enables advanced statistical techniques such as word embeddings.
Owner IČO
00216224
Owner name
Masarykova univerzita
