A Comparative Study of Lemmatization Approaches for Rojak Language

Result identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A7CW7EANQ" target="_blank" >RIV/00216208:11320/25:7CW7EANQ - isvavai.cz</a>

  • Result on the web

    <a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85192744395&doi=10.1007%2f978-981-97-0293-0_1&partnerID=40&md5=f10fe36e39c931361b2a00e2326c3670" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85192744395&doi=10.1007%2f978-981-97-0293-0_1&partnerID=40&md5=f10fe36e39c931361b2a00e2326c3670</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/978-981-97-0293-0_1" target="_blank" >10.1007/978-981-97-0293-0_1</a>

Alternative languages

  • Result language

    English

  • Title in the original language

    A Comparative Study of Lemmatization Approaches for Rojak Language

  • Result description in the original language

    Lemmatization is an important preprocessing step in most natural language processing (NLP) applications, where it extracts a valid and linguistically meaningful lemma from an inflected word. This allows different inflected forms of a word to be grouped under a common root, the base form or dictionary form of the word, known as the lemma. Due to the rapid spread of code-mixed languages like the Rojak language, which mixes English with Malay, a lemmatizer capable of handling the language is needed for NLP applications involving it. Thus, this work proposes a Rojak language lemmatization approach that is able to handle both languages without requiring users to input texts in different languages separately. Various methods, including rule-based, corpus-based, machine learning, and deep learning-based approaches, were tested and compared using the English Web Treebank (EWT) and Indonesian GSD corpora from the Universal Dependencies (UD) framework. In addition, the effect of POS tags on the performance of the lemmatizers was evaluated based on accuracy on the train and test sets. In the experiments conducted, the corpus-based approach produced the best results, with 99.90% and 92.27% test set accuracy for Malay and English, respectively, whereas the deep learning-based approach with POS tags produced the worst results, 79.78% and 91.15%. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

  • Title in English

    A Comparative Study of Lemmatization Approaches for Rojak Language

  • Result description in English

    Identical to the result description in the original language above.
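
As an illustration of the corpus-based idea mentioned in the description, the following is a minimal sketch of a lookup lemmatizer that stores English and Malay forms in one shared table, so code-mixed input does not have to be split by language first. It is not the authors' implementation. The function names (`train_lookup`, `lemmatize`) and the toy samples are hypothetical and are not taken from the UD corpora; a real system would read (form, POS, lemma) triples from the treebank files.

```python
from collections import Counter, defaultdict


def train_lookup(samples):
    """Build two lookup tables from (form, pos, lemma) triples.

    The first table is keyed by (lowercased form, POS tag), the second
    by lowercased form only. Each key maps to its most frequent lemma.
    """
    by_form_pos = defaultdict(Counter)
    by_form = defaultdict(Counter)
    for form, pos, lemma in samples:
        key = form.lower()
        by_form_pos[(key, pos)][lemma] += 1
        by_form[key][lemma] += 1
    with_pos = {k: c.most_common(1)[0][0] for k, c in by_form_pos.items()}
    without_pos = {k: c.most_common(1)[0][0] for k, c in by_form.items()}
    return with_pos, without_pos


def lemmatize(form, pos, tables):
    """Look up a lemma, preferring the POS-aware table when a tag is given.

    Unknown forms fall back to the lowercased form itself.
    """
    with_pos, without_pos = tables
    key = form.lower()
    if pos is not None and (key, pos) in with_pos:
        return with_pos[(key, pos)]
    return without_pos.get(key, key)


# Hypothetical toy training data mixing English and Malay entries.
TOY_SAMPLES = [
    ("running", "VERB", "run"),
    ("ran", "VERB", "run"),
    ("books", "NOUN", "book"),
    ("saw", "VERB", "see"),
    ("saw", "VERB", "see"),
    ("saw", "NOUN", "saw"),
    ("membaca", "VERB", "baca"),
    ("makanan", "NOUN", "makan"),
]

tables = train_lookup(TOY_SAMPLES)

# A code-mixed token sequence; None means no POS tag is available.
mixed = [("Saya", None), ("membaca", "VERB"), ("books", "NOUN")]
lemmas = [lemmatize(form, pos, tables) for form, pos in mixed]
print(lemmas)
```

The POS-aware table is what lets an ambiguous form such as "saw" resolve differently as a noun and as a verb, which is one way the effect of POS tags on a lookup lemmatizer can be observed.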

Classification

  • Type

    C - Chapter in a scholarly book

  • CEP field

  • OECD FORD field

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

  • Project

  • Linkages

Other

  • Year of implementation

    2024

  • Data confidentiality code

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

  • Book or proceedings title

    Lecture Notes on Data Engineering and Communications Technologies

  • ISBN

    978-981-97-0293-0

  • Number of pages of the result

    14

  • Pages from-to

    3-16

  • Number of pages of the book

    250

  • Publisher name

    Springer Science and Business Media Deutschland GmbH

  • Place of publication

  • UT WoS code of the chapter