A Comparative Study of Lemmatization Approaches for Rojak Language

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/25:7CW7EANQ - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A7CW7EANQ)

  • Result on the web

    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85192744395&doi=10.1007%2f978-981-97-0293-0_1&partnerID=40&md5=f10fe36e39c931361b2a00e2326c3670

  • DOI - Digital Object Identifier

    10.1007/978-981-97-0293-0_1 (http://dx.doi.org/10.1007/978-981-97-0293-0_1)

Alternative languages

  • Result language

    English

  • Original language name

    A Comparative Study of Lemmatization Approaches for Rojak Language

  • Original language description

    Lemmatization is an important preprocessing step in most natural language processing (NLP) applications, where it extracts a valid and linguistically meaningful lemma from an inflected word. This allows different inflected forms of a word to be grouped under a common base form or dictionary form, known as the lemma. Due to the rapid spread of code-mixed languages like the Rojak language, which mixes English with Malay, a lemmatizer capable of handling the language is needed for NLP applications involving it. Thus, this work proposes a Rojak language lemmatization approach that is able to handle both languages without requiring users to input texts in the different languages separately. Various methods, including rule-based, corpus-based, machine learning, and deep learning-based approaches, were tested and compared using the English Web Treebank (EWT) and Indonesian GSD corpora from the Universal Dependencies (UD) framework. In addition, the effect of POS tags on the performance of the lemmatizers was evaluated based on the accuracy on the train and test sets. In the experiments conducted, the corpus-based approach produced the best results, with 99.90% and 92.27% test set accuracy for Malay and English, respectively, whereas the deep learning-based approach with POS tags produced the worst results, 79.78% and 91.15%. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

  • Czech name

  • Czech description

Classification

  • Type

    C - Chapter in a specialist book

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

Others

  • Publication year

    2024

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Book/collection name

    Lecture Notes on Data Engineering and Communications Technologies

  • ISBN

    978-981-97-0293-0

  • Number of pages of the result

    14

  • Pages from-to

    3-16

  • Number of pages of the book

    250

  • Publisher name

    Springer Science and Business Media Deutschland GmbH

  • Place of publication

  • UT code for WoS chapter