Parallel texts dataset for Uzbek-Kazakh machine translation
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AVMWYREKG" target="_blank" >RIV/00216208:11320/25:VMWYREKG - isvavai.cz</a>
Result on the web
<a href="https://www.webofscience.com/wos/woscc/summary/121e71ce-d59a-4953-8092-7d6304231303-fed0d941/relevance/1" target="_blank" >https://www.webofscience.com/wos/woscc/summary/121e71ce-d59a-4953-8092-7d6304231303-fed0d941/relevance/1</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.dib.2024.110194" target="_blank" >10.1016/j.dib.2024.110194</a>
Alternative languages
Result language
English
Original language name
Parallel texts dataset for Uzbek-Kazakh machine translation
Original language description
This paper presents a parallel corpus of raw Uzbek and Kazakh texts as a dataset for machine translation applications, focusing on the data collection process, the dataset description, and its potential for reuse. The dataset was built in three stages. The first used a small portion of already available parallel data. The second added texts compiled from openly available resources, such as literary books and web news covering a wide range of topics and genres, which were aligned using a sentence alignment method. In the third stage, which supplied the majority of the dataset, a raw text corpus in Uzbek was manually translated into Kazakh by a group of experts fluent in both languages. The resulting parallel corpus is a resource for researchers and practitioners working on Kazakh and Uzbek language processing tasks, particularly machine translation: the data can be used to test rule-based machine translation models and to train statistical and neural machine translation models. The dataset has been made accessible through the Hugging Face platform, a repository known for facilitating collaborative efforts and advancing Natural Language
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
DATA IN BRIEF
ISSN
2352-3409
e-ISSN
—
Volume of the periodical
53
Issue of the periodical within the volume
2024-04
Country of publishing house
US - UNITED STATES
Number of pages
110194
Pages from-to
1-110194
UT code for WoS article
001199307700001
EID of the result in the Scopus database
—