Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AKJY9TPZP" target="_blank" >RIV/00216208:11320/25:KJY9TPZP - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85186192401&doi=10.1145%2f3639565&partnerID=40&md5=05c10fd5f07edc9da61c4a1148eb97c1" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85186192401&doi=10.1145%2f3639565&partnerID=40&md5=05c10fd5f07edc9da61c4a1148eb97c1</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1145/3639565" target="_blank" >10.1145/3639565</a>

Alternative languages

Result language
English
Original language name
Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration
Original language description
This article examines the transliteration behaviors found in Romanized Assamese text on social media. Assamese, an Indo-Aryan language, is one of the 22 scheduled languages of India. With the growing popularity of social media in India and the common use of the English QWERTY keyboard, Indian users on social media often express themselves in their native languages using the Roman/Latin script. Unlike some other popular Asian languages (for example, Pinyin for Chinese), Indian languages have no common standard romanization convention for writing on social media platforms. Assamese and English also differ greatly in their orthography. Thus, considering both the orthographic and phonemic characteristics of the language, this study tries to explain how Assamese vowels, vowel diacritics, and consonants are represented in Roman transliterated form. We collected a dataset of romanized Assamese social media texts from three popular social media sites (Facebook, YouTube, and X, formerly known as Twitter) and manually labeled them with their native Assamese script. A comparative analysis of the transliterated Assamese social media texts against six different Assamese romanization schemes shows that Assamese users on social media do not adhere to any fixed romanization scheme. We built three separate character-level transliteration models from our dataset: (1) a traditional phrase-based statistical machine transliteration (PBSMT) model, (2) a BiLSTM neural seq2seq model with attention, and (3) a neural transformer model. A thorough error analysis was performed on the transliteration results obtained from these three state-of-the-art models, which may help to build a more robust machine transliteration system for the Assamese social media domain in the future. Finally, an attention analysis experiment was carried out using attention weight scores taken from the character-level BiLSTM neural seq2seq transliteration model built from our dataset. © 2024 Association for Computing Machinery. All rights reserved.
Czech name
—
Czech description
—

Classification

Type
J<sub>SC</sub> - Article in a specialist periodical, which is included in the SCOPUS database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

Name of the periodical
ACM Transactions on Asian and Low-Resource Language Information Processing
ISSN
2375-4699
e-ISSN
—
Volume of the periodical
23
Issue of the periodical within the volume
2
Country of publishing house
US - UNITED STATES
Number of pages
36
Pages from-to
1-36
UT code for WoS article
—
EID of the result in the Scopus database
2-s2.0-85186192401

Similar results(10)

Transliteration of Urdu to Latin Script
Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching
Multiclass Event Classification from Text
