Hybrid embeddings for transition-based dependency parsing of free word order languages
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A5KLEQ72G" target="_blank" >RIV/00216208:11320/23:5KLEQ72G - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85150066344&doi=10.1016%2fj.ipm.2023.103334&partnerID=40&md5=bf97bf992dc6554eb855a6e6dbddd1ae" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85150066344&doi=10.1016%2fj.ipm.2023.103334&partnerID=40&md5=bf97bf992dc6554eb855a6e6dbddd1ae</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.ipm.2023.103334" target="_blank" >10.1016/j.ipm.2023.103334</a>
Alternative languages
Result language
English
Title in the original language
Hybrid embeddings for transition-based dependency parsing of free word order languages
Result description in the original language
"Neural dependency parsing relies on embeddings such as word embeddings and part-of-speech (POS) embeddings. We propose embeddings that convey more information for Arabic-script, morphologically rich, free word order languages. In such languages, the POS and morphological features (feats) of one word in a sentence govern the suffixes of another word in the same sentence. With this in view, we extend the famous quote “a word is known by the company it keeps” and propose that “a POS is known by the company of suffixes it keeps” and “a morphological feat is known by the company of suffixes it keeps”. We propose two novel embeddings, XPOSngram and FEATSngram. These embeddings are trained on heterogeneous items, i.e. pairs of language-specific POS (XPOS) and n-grams, referred to as ‘XPOSngram’, and pairs of morphological feats and n-grams, called ‘FEATSngram’. We call this new type of embedding hybrid embeddings. We perform experiments on five treebanks from Universal Dependencies (UD), which belong to four Arabic-script, morphologically rich, free word order, low-resource languages (Urdu, Arabic, Persian and Uyghur). These treebanks contain 42985 sentences in total. The experimental results show that, on average, the proposed approach gains ≈1.24%, ≈0.84% and ≈3.31% in unlabelled attachment score (UAS) over the state-of-the-art approaches based on language-specific POS embeddings, universal POS embeddings and n-gram embeddings, respectively. We compared the hybrid embeddings for Arabic with the state-of-the-art ArWordVec embeddings: the proposed solution achieves a UAS ≈10.27% higher than that achieved by ArWordVec. We further compared the hybrid embeddings for Urdu with two state-of-the-art Urdu word embeddings. The results show that the best hybrid embedding has a UAS ≈3.32% and ≈5.015% higher than the two embeddings.
We also tested the proposed methodology on five UD treebanks of non-Arabic-script languages: Belarusian, Dutch, German, Greek and Hungarian. The experimental results demonstrate that the proposed approach not only outperforms the baselines on Arabic-script languages but also generalizes well to non-Arabic-script, free word order languages, with average gains of ≈2.5%, ≈2.8% and ≈7.5% in UAS over the state-of-the-art XPOS, UPOS and n-gram based approaches, respectively. © 2023 Elsevier Ltd"
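The abstract describes training on heterogeneous items, pairs of an XPOS tag and n-grams, but does not say how the pairs are built. The following is a minimal sketch of one possible reading, not the authors' method: it assumes that the XPOS tag of each word is paired with the character suffix n-grams of its neighbouring words. The function names, the `window` and `max_n` parameters, and the toy English sentence are all hypothetical.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (word form, XPOS tag)


def suffix_ngrams(word: str, max_n: int = 3) -> List[str]:
    """Return the character suffixes of `word` with lengths 1..max_n."""
    return [word[-n:] for n in range(1, min(max_n, len(word)) + 1)]


def xpos_ngram_pairs(
    sentence: List[Token], window: int = 1, max_n: int = 3
) -> Iterator[Tuple[str, str]]:
    """Yield (XPOS, suffix n-gram) pairs.

    Assumption (not stated in the abstract): the XPOS tag of token i is
    paired with the suffix n-grams of every other token within `window`
    positions of i.
    """
    for i, (_, xpos) in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a tag is paired with its neighbours, not itself
            for gram in suffix_ngrams(sentence[j][0], max_n):
                yield (xpos, gram)


# Toy example with hypothetical tags.
toy_sentence = [("cats", "NNS"), ("run", "VBP")]
toy_pairs = list(xpos_ngram_pairs(toy_sentence, window=1, max_n=2))
```

Pairs of this kind could then be passed to any embedding trainer that accepts (target, context) items. The same sketch would apply to FEATSngram by substituting a morphological feature string for the XPOS tag.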
Classification
Type
J<sub>SC</sub> - Article in a periodical indexed in the SCOPUS database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2023
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Name of the periodical
"Information Processing and Management"
ISSN
0306-4573
e-ISSN
—
Volume of the periodical
60
Issue of the periodical within the volume
3
Country of the periodical's publisher
US - United States of America
Number of pages of the result
21
Pages from-to
1-21
UT WoS code of the article
000956224800001
EID of the result in the Scopus database
2-s2.0-85150066344