Improving linear orthogonal mapping based cross-lingual representation using ridge regression and graph centrality
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AVED7NJD9" target="_blank" >RIV/00216208:11320/25:VED7NJD9 - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85188573401&doi=10.1016%2fj.csl.2024.101640&partnerID=40&md5=6151af2a84f3f7facd35357c17f82d02" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85188573401&doi=10.1016%2fj.csl.2024.101640&partnerID=40&md5=6151af2a84f3f7facd35357c17f82d02</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.csl.2024.101640" target="_blank" >10.1016/j.csl.2024.101640</a>
Alternative languages
Result language
English
Title in the original language
Improving linear orthogonal mapping based cross-lingual representation using ridge regression and graph centrality
Result description in the original language
Orthogonal linear mapping is a commonly used approach for generating cross-lingual embeddings between two monolingual corpora, and it relies on a seed dictionary aligned by word frequency. While this approach is effective for isomorphic language pairs, it does not perform well for distant language pairs with different sentence structures and morphological properties. For a distant language pair, the existing frequency-aligned orthogonal mapping methods suffer from two problems: (i) the frequencies of source and target words are not comparable, and (ii) different word pairs in the seed dictionary may contribute differently. Motivated by these two concerns, this paper proposes a novel centrality-aligned, ridge regression-based orthogonal mapping. The proposed method uses centrality-based alignment for seed dictionary selection and a ridge regression framework to incorporate the influence weights of different word pairs in the seed dictionary. Experiments over five language pairs (both isomorphic and distant) show that the proposed method outperforms baseline methods in the Bilingual Dictionary Induction (BDI) task, the Sentence Retrieval Task (SRT), and Machine Translation. Several further analyses are included to support the proposed method. © 2024 Elsevier Ltd
Title in English
Improving linear orthogonal mapping based cross-lingual representation using ridge regression and graph centrality
Result description in English
Orthogonal linear mapping is a commonly used approach for generating cross-lingual embeddings between two monolingual corpora, and it relies on a seed dictionary aligned by word frequency. While this approach is effective for isomorphic language pairs, it does not perform well for distant language pairs with different sentence structures and morphological properties. For a distant language pair, the existing frequency-aligned orthogonal mapping methods suffer from two problems: (i) the frequencies of source and target words are not comparable, and (ii) different word pairs in the seed dictionary may contribute differently. Motivated by these two concerns, this paper proposes a novel centrality-aligned, ridge regression-based orthogonal mapping. The proposed method uses centrality-based alignment for seed dictionary selection and a ridge regression framework to incorporate the influence weights of different word pairs in the seed dictionary. Experiments over five language pairs (both isomorphic and distant) show that the proposed method outperforms baseline methods in the Bilingual Dictionary Induction (BDI) task, the Sentence Retrieval Task (SRT), and Machine Translation. Several further analyses are included to support the proposed method. © 2024 Elsevier Ltd
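The general idea of combining per-pair weights, a ridge penalty, and an orthogonality constraint can be illustrated with a minimal sketch. This is not the paper's formulation, which the abstract does not specify: it assumes a standard weighted ridge closed form followed by an SVD projection onto the nearest orthogonal matrix. The function names, the `weights` vector (standing in for centrality-derived influence weights), and the `lam` parameter are all hypothetical.

```python
import numpy as np


def weighted_ridge_map(X, Y, weights, lam=1.0):
    """Weighted ridge solution (illustrative, not the paper's exact method).

    Minimises sum_i weights[i] * ||X[i] @ W - Y[i]||^2 + lam * ||W||_F^2,
    where row i of X and Y holds the source and target embedding of
    seed-dictionary pair i.
    """
    d = X.shape[1]
    Xw = X * weights[:, None]          # apply the per-pair weight to each row
    # Closed form: W = (X^T D X + lam I)^{-1} X^T D Y, with D = diag(weights)
    return np.linalg.solve(X.T @ Xw + lam * np.eye(d), Xw.T @ Y)


def nearest_orthogonal(W):
    """Project W onto the closest orthogonal matrix (Frobenius norm) via SVD."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt


def map_source(X, Y, weights, lam=1.0):
    """Map source embeddings into the target space with an orthogonal matrix."""
    Q = nearest_orthogonal(weighted_ridge_map(X, Y, weights, lam))
    return X @ Q, Q
```

In such a sketch, setting all weights to 1 and `lam` close to 0 reduces the first step to ordinary least squares, so any gain would have to come from how the weights and the seed dictionary are chosen.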
Classification
Type
J<sub>SC</sub> - Article in a periodical indexed in the SCOPUS database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
—
Other
Year of application
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Periodical name
Computer Speech and Language
ISSN
0885-2308
e-ISSN
—
Periodical volume
87
Periodical issue number within the volume
2024
Country of the periodical's publisher
US - United States of America
Number of pages of the result
25
Pages from-to
1-25
UT WoS code of the article
—
EID of the result in the Scopus database
2-s2.0-85188573401