Parallel texts dataset for Uzbek-Kazakh machine translation

Result identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AVMWYREKG" target="_blank" >RIV/00216208:11320/25:VMWYREKG - isvavai.cz</a>

  • Result on the web

    <a href="https://www.webofscience.com/wos/woscc/summary/121e71ce-d59a-4953-8092-7d6304231303-fed0d941/relevance/1" target="_blank" >https://www.webofscience.com/wos/woscc/summary/121e71ce-d59a-4953-8092-7d6304231303-fed0d941/relevance/1</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1016/j.dib.2024.110194" target="_blank" >10.1016/j.dib.2024.110194</a>

Alternative languages

  • Result language

    English

  • Title in original language

    Parallel texts dataset for Uzbek-Kazakh machine translation

  • Result description in original language

    This paper presents a parallel corpus of raw texts between the Uzbek and Kazakh languages as a dataset for machine translation applications, focusing on the data collection process, dataset description, and its potential for reuse. The dataset-building process includes three separate stages, starting with a tiny portion of already available parallel data, then some more compiled from openly available resources like literature books, and web news texts, which were aligned using the sentence alignment method, encompassing a wide range of topics and genres. Finally, the majority of the dataset was taken from a raw text corpus in Uzbek and manually translated into Kazakh by a group of experts who are fluent in both languages. The resulting parallel corpus serves as a valuable resource for researchers and practitioners interested in Kazakh and Uzbek language processing tasks, particularly in the context of neural machine translation, where the presented data can be used for testing the rule-based machine translation models, or it can be used for both training statistical and neural machine translation models as well. The dataset has been made accessible through the widely recognized Hugging Face platform, a repository known for facilitating collaborative efforts and advancing Natural Language

Classification

  • Type

    J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database

  • CEP field

  • OECD FORD field

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

  • Project

  • Linkages

Other

  • Year of publication

    2024

  • Data confidentiality code

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific to the result type

  • Periodical title

    DATA IN BRIEF

  • ISSN

    2352-3409

  • e-ISSN

  • Periodical volume

    53

  • Issue number within the volume

    2024-04

  • Country of the periodical's publisher

    US - United States of America

  • Number of pages of the result

    110194

  • Pages from-to

    1-110194

  • UT WoS code of the article

    001199307700001

  • EID of the result in the Scopus database