Intersecting Parallel Corpora

Result identifiers

Result code in IS VaVaI
RIV/00216208:11320/12:10134102 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F12%3A10134102)
Alternative codes found
RIV/00216208:11320/12:10194834
Result on the web
https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora
DOI - Digital Object Identifier
—

Alternative languages

Result language
English
Title in the original language
Intersecting Parallel Corpora
Result description in the original language
The organizers of the annual Workshop on Machine Translation (WMT) prepare and distribute parallel corpora that can be used to train systems for the shared tasks. Two core corpora are the News Commentary corpus and the Europarl corpus. Both are available in several language pairs, always between English and another European language: cs-en, de-en, es-en and fr-en. The corpora are not multi-parallel: they come from the same source and overlap significantly, but some sentences are translated into only a subset of the languages, so the bi-parallel subsets do not all have the same number of sentence pairs. Such corpora cannot be used directly to train a system for, e.g., de-cs (German-Czech). However, English can serve as a pivot language. If we identify the intersection of the English parts of cs-en and de-en, we can take the non-English counterparts of the overlapping English sentences to create a de-cs parallel corpus. That is what this software does.
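The pivoting idea described above can be illustrated with a minimal sketch. This is not the code of the software itself: the function name `intersect_via_pivot`, the in-memory list representation, and the toy sentences are all hypothetical, and the sketch assumes that two English sentences count as "the same" only when they match exactly after trimming whitespace.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed representation: each corpus is a sequence of
# (English sentence, non-English sentence) pairs.
Pair = Tuple[str, str]


def intersect_via_pivot(cs_en: Iterable[Pair], de_en: Iterable[Pair]) -> List[Pair]:
    """Return (German, Czech) pairs whose English sides match exactly."""
    # Index the Czech side by its English pivot sentence. If the same
    # English sentence occurs more than once, keep the first Czech one.
    czech_by_english: Dict[str, str] = {}
    for english, czech in cs_en:
        czech_by_english.setdefault(english.strip(), czech.strip())

    de_cs: List[Pair] = []
    seen = set()  # English sentences already emitted, to avoid duplicates
    for english, german in de_en:
        key = english.strip()
        if key in czech_by_english and key not in seen:
            seen.add(key)
            de_cs.append((german.strip(), czech_by_english[key]))
    return de_cs


# Toy data (invented for illustration only).
cs_en = [
    ("The house is small.", "Dům je malý."),
    ("I like tea.", "Mám rád čaj."),
]
de_en = [
    ("The house is small.", "Das Haus ist klein."),
    ("It is raining.", "Es regnet."),
]

print(intersect_via_pivot(cs_en, de_en))
# → [('Das Haus ist klein.', 'Dům je malý.')]
```

Only the sentence present in both English sides survives; sentences translated into just one of the two languages are dropped, which is why the resulting de-cs corpus can be smaller than either input.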

Classification

Type
R - Software
CEP field
AI - Linguistics
OECD FORD field
—

Result linkages

Project
7E11051: EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of implementation
2012
Data confidentiality code
S - Complete and truthful data about the project are not subject to protection under special legal regulations

Data specific to the result type

Internal product identification code
IPC
Technical parameters
No license agreement is required for use. The software is available for download at https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora.
Economic parameters
The tool saves the cost of obtaining, translating and annotating new parallel data in cases where texts already exist for other language pairs.
Owner's organization ID (IČO)
00216208
Owner name
Univerzita Karlova v Praze

Similar results (10)

Automatic Resource Augmentation for Machine Translation in Low Resource Language: EnIndic Corpus
Boosting Unsupervised Machine Translation with Pseudo-Parallel Data
ParaMed: a parallel corpus for English-Chinese translation in the biomedical domain
