Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A6ZFC7XBA" target="_blank" >RIV/00216208:11320/23:6ZFC7XBA - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85164237824&doi=10.1145%2f3588900&partnerID=40&md5=320153a08893f9a497203024ab7f1904" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85164237824&doi=10.1145%2f3588900&partnerID=40&md5=320153a08893f9a497203024ab7f1904</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1145/3588900" target="_blank" >10.1145/3588900</a>
Alternative languages
Result language
English
Original language name
Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches
Original language description
"Automatic part-of-speech (POS) tagging is a preprocessing step of many natural language processing tasks, such as named entity recognition, speech processing, information extraction, word sense disambiguation, and machine translation. It has already gained promising results in English and European languages. However, in Indian languages, particularly in the Odia language, it is not yet well explored because of the lack of supporting tools, resources, and morphological richness of the language. Unfortunately, we were unable to locate an open source POS tagger for the Odia language, and only a handful of attempts have been made to develop POS taggers for the Odia language. The main contribution of this research work is to present statistical approaches such as the maximum entropy Markov model and conditional random field (CRF), as well as deep learning based approaches, including the convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM) to develop the Odia POS tagger. A publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset is used in our work. However, most of the languages around the globe have used the dataset annotated with the Universal Dependencies (UD) tagset. Hence, to maintain uniformity, the Odia dataset should use the same tagset. Thus, following the BIS and UD guidelines, we constructed a mapping from the BIS tagset to the UD tagset. The maximum entropy Markov model, CRF, Bi-LSTM, and CNN models are trained using the Indian Languages Corpora Initiative corpus with the BIS and UD tagsets. We have experimented with various feature sets as input to the statistical models to prepare a baseline system and observed the impact of constructed feature sets. The deep learning based model includes the Bi-LSTM network, the CNN network, the CRF layer, character sequence information, and a pre-trained word vector. 
Seven different combinations of neural sequence labeling models are implemented, and their performance measures are investigated. It has been observed that the Bi-LSTM model with the character sequence feature and pre-trained word vector achieved a result with 94.58% accuracy. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM."
Czech name
—
Czech description
—
Classification
Type
J<sub>SC</sub> - Article in a specialist periodical, which is included in the SCOPUS database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
"ACM Transactions on Asian and Low-Resource Language Information Processing"
ISSN
2375-4699
e-ISSN
—
Volume of the periodical
22
Issue of the periodical within the volume
6
Country of publishing house
US - UNITED STATES
Number of pages
24
Pages from-to
1-24
UT code for WoS article
001018562700013
EID of the result in the Scopus database
2-s2.0-85164237824