Set of Ethiopian Web Corpora

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216224:14330/16:00096851 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F16%3A00096851)

  • Result on the web

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Set of Ethiopian Web Corpora

  • Original language description

    A set of 5 corpora for 4 Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed version of an existing corpus with part-of-speech annotation. The released version has been cleaned (especially numeric expressions) and unifies two versions in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).

  • Czech name

  • Czech description

Classification

  • Type

    R - Software

  • CEP classification

  • OECD FORD branch

    60200 - Languages and Literature

Result continuities

  • Project

    7F14047: Harvesting big text data for under-resourced languages

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)

Others

  • Publication year

    2016

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Internal product ID

    habcorp2016

  • Technical parameters

    Amharic WIC corpus, 200 thousand tokens; amWaC16 Amharic corpus, 20 million tokens; orWaC16 Oromo corpus, 5.1 million tokens; soWaC16 Somali corpus, 80 million tokens; tiWaC16 Tigrinya corpus, 2.5 million tokens.

  • Economic parameters

    Only small text corpora were available for these languages so far. This result provides corpora an order of magnitude larger, and their size enables the use of advanced statistical techniques such as word embeddings.

  • Owner IČO

    00216224

  • Owner name

    Masarykova univerzita