Context-based Translation for the Out of Vocabulary Words Applied to Hindi-English Cross-Lingual Information Retrieval
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AGAEAEM9K" target="_blank" >RIV/00216208:11320/22:GAEAEM9K - isvavai.cz</a>
Alternative codes found
RIV/00216208:11320/23:Q9Y7NBVE
Result on the web
<a href="https://doi.org/10.1080/02564602.2020.1843553" target="_blank" >https://doi.org/10.1080/02564602.2020.1843553</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1080/02564602.2020.1843553" target="_blank" >10.1080/02564602.2020.1843553</a>
Alternative languages
Result language
English
Title in original language
Context-based Translation for the Out of Vocabulary Words Applied to Hindi-English Cross-Lingual Information Retrieval
Result description in original language
Cross-Lingual Information Retrieval (CLIR) lets users query in their regional (source) language regardless of the language of the target documents. CLIR relies on two prevailing translation techniques, Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). SMT and NMT achieve good results for foreign languages but not for Indian languages, because the available parallel corpora are not comprehensive. Source-language user queries may contain Out Of Vocabulary (OOV) words that are not present in the parallel corpus; SMT may skip such words without translating them. This paper proposes a context-based translation algorithm that translates OOV words by using two unlabeled and unrelated large raw corpora (one in the source language, one in the target language) and a small bilingual parallel corpus. Because the literature reports that SMT performs better than NMT for Hindi-to-English translation, the experimental results are evaluated on the FIRE datasets against a baseline SMT system. Relative to the baseline SMT, which achieves Recall of 0.8284 and 0.7084 and Mean Average Precision (MAP) of 0.2832 and 0.1885 for FIRE 2010 and FIRE 2011 respectively, the proposed algorithm improves Recall by up to 6.04% (to 0.8785) for FIRE 2010 and by up to 3.96% (to 0.7365) for FIRE 2011, and MAP by up to 14.37% (to 0.3239) for FIRE 2010 and by up to 5.46% (to 0.1988) for FIRE 2011. An analysis of the number of OOV words shows that the proposed algorithm reduces the number of OOV words more effectively, up to 0.81% for FIRE 2010 and 1.73% for FIRE 2011.
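The percentages quoted in the description appear to be relative improvements over the baseline SMT scores rather than absolute differences. The short sketch below recomputes them from the Recall and MAP values given above; the dictionary layout and the helper name `relative_gain` are illustrative assumptions and are not taken from the paper.

```python
# Scores quoted in the description: (baseline SMT, proposed algorithm).
scores = {
    ("Recall", "FIRE 2010"): (0.8284, 0.8785),
    ("Recall", "FIRE 2011"): (0.7084, 0.7365),
    ("MAP", "FIRE 2010"): (0.2832, 0.3239),
    ("MAP", "FIRE 2011"): (0.1885, 0.1988),
}


def relative_gain(baseline: float, proposed: float) -> float:
    """Percentage improvement of `proposed` over `baseline` (hypothetical helper)."""
    return (proposed - baseline) / baseline * 100.0


# Relative gain for every metric/dataset pair.
gains = {key: relative_gain(b, p) for key, (b, p) in scores.items()}

for (metric, dataset), gain in gains.items():
    print(f"{metric:6s} {dataset}: {gain:.2f}%")
```

Under this reading the recomputed gains agree with the reported figures to within about 0.01 percentage points, which is consistent with the reported values having been truncated rather than rounded.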
Classification
Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
—
Other
Year of implementation
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Periodical name
IETE Technical Review (Institution of Electronics and Telecommunication Engineers, India)
ISSN
0256-4602
e-ISSN
0974-5971
Periodical volume
39
Issue number within the volume
2
Country of the periodical's publisher
IN - Republic of India
Number of pages of the result
10
Pages from-to
276-285
UT WoS code of the article
000592603100001
EID of the result in the Scopus database
2-s2.0-85096773748