OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation

Result code in IS VaVaI
RIV/00216208:11320/19:10405593 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F19%3A10405593)
Result on the web
https://link.springer.com/chapter/10.1007/978-981-13-9282-5_47
DOI - Digital Object Identifier
10.1007/978-981-13-9282-5_47 (http://dx.doi.org/10.1007/978-981-13-9282-5_47)

Result language
English
Original language name
OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
Original language description
A multilingual country like India needs language corpora for low-resource languages, not only to provide its citizens with natural language processing (NLP) technologies readily available in other countries, but also to support its people in their educational and cultural needs. In this work, we focus on one of the low-resource languages, Odia, and build an Odia-English parallel corpus (OdiEnCorp) and an Odia monolingual corpus (OdiMonoCorp). The parallel corpus is based on Odia-English parallel texts extracted from online resources and formally corrected by volunteers. We also preprocess the parallel corpus for machine translation research or training. The monolingual corpus comes from a diverse set of online resources, and we organize it into a collection of segments and paragraphs that are easy for NLP tools to handle. The OdiEnCorp parallel corpus contains 29,346 sentence pairs with 756K English and 648K Odia tokens. OdiMonoCorp contains 2.6 million tokens in 221K sentences in 71K paragraphs. Despite their small size,
Czech name
—
Czech description
—

Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Project
The result was created in the course of more than one project. More information is available in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)

Publication year
2019
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Article name in the collection
Proceedings of the Third International Conference on Smart Computing and Informatics, Volume 1
ISBN
978-981-13-9282-5
ISSN
—
e-ISSN
—
Number of pages
10
Pages from-to
495-504
Publisher name
Springer
Place of publication
Singapore
Event location
Bhubaneshwar, Odisha, India
Event date
Jan 1, 2018
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—
