Hybrid embeddings for transition-based dependency parsing of free word order languages

Result identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/23:5KLEQ72G (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A5KLEQ72G)

  • Result on the web

    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85150066344&doi=10.1016%2fj.ipm.2023.103334&partnerID=40&md5=bf97bf992dc6554eb855a6e6dbddd1ae

  • DOI - Digital Object Identifier

    10.1016/j.ipm.2023.103334 (http://dx.doi.org/10.1016/j.ipm.2023.103334)

Alternative languages

  • Result language

    English

  • Title in English

    Hybrid embeddings for transition-based dependency parsing of free word order languages

  • Result description in English

    "Neural dependency parsing relies on embeddings such as word embeddings and part-of-speech (POS) embeddings. We propose embeddings that convey more meaning for Arabic-scripted, morphologically rich, free word order languages. In such languages, the part of speech (POS) and morphological features (feats) of a particular word in a sentence govern the suffixes of another word in the same sentence. Keeping this in view, we extend the famous quote “a word is known by the company it keeps” and propose that “a POS is known by the company of suffixes it keeps” and “a morphological feat is known by the company of suffixes it keeps”. We propose two novel embeddings, XPOSngram and FEATSngram. These embeddings are trained on heterogeneous items, i.e. pairs of language-specific POS (XPOS) tags and n-grams, referred to as ‘XPOSngram’, and pairs of morphological feats and n-grams, called ‘FEATSngram’. We call this new type of embedding hybrid embeddings. We perform experiments on five treebanks, taken from Universal Dependencies (UD), which belong to four Arabic-scripted, morphologically rich, free word order, low-resource languages (i.e. Urdu, Arabic, Persian and Uyghur). These treebanks consist of 42,985 sentences in total. The experimental results show that, on average, the proposed approach gains ≈1.24%, ≈0.84% and ≈3.31% in unlabelled attachment score (UAS) over the state-of-the-art approaches based on language-specific POS embeddings, universal POS embeddings and n-gram embeddings, respectively. We have compared the results of hybrid embeddings for Arabic with the state-of-the-art ArWordVec embeddings. The proposed solution achieves a UAS that is ≈10.27% higher than the UAS achieved by ArWordVec. We have further compared the results of hybrid embeddings for Urdu with two state-of-the-art Urdu word embeddings. The results show that the best hybrid embedding has a UAS ≈3.32% and ≈5.015% higher than the two embeddings.
We have also tested the proposed methodology on five UD treebanks of non-Arabic-scripted languages, namely Belarusian, Dutch, German, Greek, and Hungarian. The experimental results demonstrate that the proposed approach not only outperforms the baselines on Arabic-scripted languages but also generalizes well to non-Arabic-scripted, free word order languages, with an average gain of ≈2.5%, ≈2.8% and ≈7.5% in UAS over the state-of-the-art XPOS, UPOS and n-gram based approaches. © 2023 Elsevier Ltd"
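
As a rough illustration of the two ideas in the description, the Python sketch below pairs each word's XPOS tag with suffix n-grams of its neighbouring words, and computes the unlabelled attachment score (UAS) as the fraction of tokens whose predicted head matches the gold head. The description does not specify how the pairs are actually formed, so the context window, the maximum suffix length, and the function names `suffix_ngrams`, `xpos_ngram_pairs` and `uas` are all assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple


def suffix_ngrams(word: str, max_n: int = 3) -> List[str]:
    """Return the character suffixes of `word` with lengths 1..max_n."""
    return [word[-n:] for n in range(1, min(max_n, len(word)) + 1)]


def xpos_ngram_pairs(
    words: List[str], xpos: List[str], window: int = 1, max_n: int = 3
) -> List[Tuple[str, str]]:
    """Pair each token's XPOS tag with suffix n-grams of nearby words.

    ASSUMPTION: the window-based pairing is a guess at how such
    (XPOS, n-gram) training items could be built; it is not the
    procedure described in the paper.
    """
    if len(words) != len(xpos):
        raise ValueError("words and xpos must have the same length")
    pairs: List[Tuple[str, str]] = []
    for i, tag in enumerate(xpos):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue  # only suffixes of *other* words in the sentence
            for gram in suffix_ngrams(words[j], max_n):
                pairs.append((tag, gram))
    return pairs


def uas(gold_heads: List[int], pred_heads: List[int]) -> float:
    """Unlabelled attachment score: share of tokens with the correct head."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted heads must have the same length")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)
```

Pairs produced this way could then be passed to any skip-gram style embedding trainer. FEATSngram pairs would follow the same pattern with morphological features in place of XPOS tags, again under the assumptions stated above.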

Classification

  • Type

    J_SC - Article in a periodical in the SCOPUS database

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

Others

  • Publication year

    2023

  • Data confidentiality code

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

  • Name of the periodical

    "Information Processing and Management"

  • ISSN

    0306-4573

  • e-ISSN

  • Volume of the periodical

    60

  • Issue of the periodical within the volume

    3

  • Country of the publisher

    US - United States of America

  • Number of pages

    21

  • Pages from-to

    1-21

  • UT code of the WoS article

    000956224800001

  • EID of the result in the Scopus database

    2-s2.0-85150066344