Corpus Generation to Develop Amharic Morphological Segmenter

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3AXIX2EJI7" target="_blank" >RIV/00216208:11320/23:XIX2EJI7 - isvavai.cz</a>
Result on the web
<a href="https://www.proquest.com/docview/2883174197/abstract/B90879C438B4510PQ/1" target="_blank" >https://www.proquest.com/docview/2883174197/abstract/B90879C438B4510PQ/1</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.14569/IJACSA.2023.01409116" target="_blank" >10.14569/IJACSA.2023.01409116</a>

Alternative languages

Result language
English
Original language name
Corpus Generation to Develop Amharic Morphological Segmenter
Original language description
"Morphological segmenter is an important component in Amharic natural language processing systems. Despite this fact, Amharic lacks large amount of morphologically segmented corpus. Large amount of corpus is often a requirement to develop neural network-based language technologies. This paper presents an alternative method to generate large amount of morph-segmented corpus for Amharic language. First, a relatively small (138,400 words) morphologically annotated Amharic seed-corpus is manually prepared. The annotation enables to identify prefixes, stem, and suffixes of a given word. Second, a supervised approach is used to create a conditional random field-based seed-model (on the seed-corpus). Applying the seed-model (an unsupervised technique on a large unsegmented raw Amharic words) for prediction, a large corpus size (3,777,283) of segmented words are automatically generated. Third, the newly generated corpus is used to train an Amharic morphological segmenter (based on a supervised neural sequence-to-sequence (seq2seq) approach using character embeddings). Using the seq2seq method, an F-score of 98.65% was measured. Results show an agreement with previous efforts for Arabic language. The work presented here has profound implications for future studies of Ethiopian language technologies and may one day help solve the problem of the digital-divide between resource-rich and under-resourced languages."
Czech name
—
Czech description
—

Classification

Type
J<sub>ost</sub> - Miscellaneous article in a specialist periodical
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2023
Confidentiality
S - Complete and accurate project data are not subject to protection under special legal regulations

Data specific for result type

Name of the periodical
"International Journal of Advanced Computer Science and Applications"
ISSN
2158-107X
e-ISSN
—
Volume of the periodical
14
Issue of the periodical within the volume
9
Country of publishing house
US - UNITED STATES
Number of pages
9
Pages from-to
1114 - 1122
UT code for WoS article
001084849700001
EID of the result in the Scopus database
2-s2.0-85173165158

Similar results(10)

Annotated Amharic Corpora
Neural Disambiguation of Lemma and Part of Speech in Morphologically Rich Languages
Supervised Morphological Segmentation Using Rich Annotated Lexicon
