A Comparative Study of Lemmatization Approaches for Rojak Language

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A7CW7EANQ" target="_blank" >RIV/00216208:11320/25:7CW7EANQ - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85192744395&doi=10.1007%2f978-981-97-0293-0_1&partnerID=40&md5=f10fe36e39c931361b2a00e2326c3670" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85192744395&doi=10.1007%2f978-981-97-0293-0_1&partnerID=40&md5=f10fe36e39c931361b2a00e2326c3670</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-981-97-0293-0_1" target="_blank" >10.1007/978-981-97-0293-0_1</a>

Alternative languages

Result language
English
Title in the original language
A Comparative Study of Lemmatization Approaches for Rojak Language
Result description in the original language
Lemmatization is an important preprocessing step in most natural language processing (NLP) applications: it extracts a valid, linguistically meaningful lemma from an inflected word. This allows the different inflected forms of a word to be grouped under a common root, the base or dictionary form of the word, known as the lemma. With the rapid spread of code-mixed languages such as Rojak, which mixes English with Malay, NLP applications involving this language need a lemmatizer that can handle it. This work therefore proposes a Rojak lemmatization approach that handles both languages without requiring users to input text in each language separately. Rule-based, corpus-based, machine learning, and deep learning-based methods were tested and compared using the English Web Treebank (EWT) and Indonesian GSD corpora from the Universal Dependencies (UD) framework. The effect of POS tags on lemmatizer performance was also evaluated, based on accuracy on the train and test sets. In the experiments, the corpus-based approach produced the best results, with 99.90% and 92.27% test set accuracy for Malay and English, respectively, whereas the deep learning-based approach with POS tags produced the worst results, 79.78% and 91.15%. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Classification

Type
C - Chapter in a scholarly book
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Book or proceedings title
Lecture Notes on Data Engineering and Communications Technologies
ISBN
978-981-97-0293-0
Number of pages of the result
14
Pages from-to
3-16
Number of pages of the book
250
Publisher name
Springer Science and Business Media Deutschland GmbH
Place of publication
—
UT WoS code of the chapter
—

Similar results (10)

On the Role of Morphological Information for Contextual Lemmatization
How low is too low? A monolingual take on lemmatisation in Indian languages
Nefnir: A high accuracy lemmatizer for Icelandic
