Context-based Translation for the Out of Vocabulary Words Applied to Hindi-English Cross-Lingual Information Retrieval
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AGAEAEM9K" target="_blank" >RIV/00216208:11320/22:GAEAEM9K - isvavai.cz</a>
Alternative codes found
RIV/00216208:11320/23:Q9Y7NBVE
Result on the web
<a href="https://doi.org/10.1080/02564602.2020.1843553" target="_blank" >https://doi.org/10.1080/02564602.2020.1843553</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1080/02564602.2020.1843553" target="_blank" >10.1080/02564602.2020.1843553</a>
Alternative languages
Result language
English
Original language name
Context-based Translation for the Out of Vocabulary Words Applied to Hindi-English Cross-Lingual Information Retrieval
Original language description
Cross-Lingual Information Retrieval (CLIR) lets users query in their regional (source) language regardless of the language of the target documents. CLIR relies on current translation techniques, namely Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). SMT and NMT achieve good results for foreign languages but not for Indian languages due to non-absoluteness of the parallel corpus. Source-language user queries may contain Out Of Vocabulary (OOV) words that are not present in the parallel corpus; SMT may skip such words without translating them. In this paper, a context-based translation algorithm is proposed to translate the OOV words by utilizing two unlabeled and unrelated large raw corpora (one in the source language and one in the target language) and a small bilingual parallel corpus. Since, according to the literature, SMT performs better than NMT for Hindi-to-English translation, experimental results are evaluated on the FIRE datasets against a baseline SMT. The proposed algorithm improves Recall by up to 6.04% (0.8785) for FIRE 2010 and up to 3.96% (0.7365) for FIRE 2011, and Mean Average Precision (MAP) by up to 14.37% (0.3239) for FIRE 2010 and up to 5.46% (0.1988) for FIRE 2011, compared with the baseline SMT, which achieves Recall of 0.8284 and 0.7084 and MAP of 0.2832 and 0.1885 for FIRE 2010 and 2011, respectively. An analysis of the number of OOV words shows that the proposed algorithm reduces the number of OOV words more effectively, up to 0.81% for FIRE 2010 and 1.73% for FIRE 2011.
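The percentages in the description appear to be relative gains over the baseline SMT scores rather than absolute differences. The sketch below checks that reading using only the figures quoted in the description; the function name `relative_gain_percent` and the dictionary layout are illustrative assumptions, not part of the paper.

```python
def relative_gain_percent(baseline: float, proposed: float) -> float:
    """Relative improvement of `proposed` over `baseline`, in percent.

    Assumption: the abstract's percentages are relative gains,
    i.e. (proposed - baseline) / baseline * 100.
    """
    return (proposed - baseline) / baseline * 100.0


# (baseline SMT score, proposed-algorithm score), copied from the description.
scores = {
    ("Recall", "FIRE 2010"): (0.8284, 0.8785),
    ("Recall", "FIRE 2011"): (0.7084, 0.7365),
    ("MAP", "FIRE 2010"): (0.2832, 0.3239),
    ("MAP", "FIRE 2011"): (0.1885, 0.1988),
}

# Recompute the gain for every measure/dataset pair.
gains = {
    key: relative_gain_percent(baseline, proposed)
    for key, (baseline, proposed) in scores.items()
}

for (measure, dataset), gain in gains.items():
    print(f"{measure} {dataset}: {gain:.2f}% relative gain")
```

Under this assumption the recomputed gains agree with the reported 6.04%, 3.96%, 14.37% and 5.46% to within about 0.01 percentage point; the small residue on the Recall figures is consistent with the reported values having been truncated rather than rounded.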
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2022
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
IETE Technical Review (Institution of Electronics and Telecommunication Engineers, India)
ISSN
0256-4602
e-ISSN
0974-5971
Volume of the periodical
39
Issue of the periodical within the volume
2
Country of publishing house
IN - INDIA
Number of pages
10
Pages from-to
276-285
UT code for WoS article
000592603100001
EID of the result in the Scopus database
2-s2.0-85096773748