Intersecting Parallel Corpora
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F12%3A10134102" target="_blank" >RIV/00216208:11320/12:10134102 - isvavai.cz</a>
Alternative codes found
RIV/00216208:11320/12:10194834
Result on the web
<a href="https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora" target="_blank" >https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Intersecting Parallel Corpora
Original language description
The organizers of the annual Workshop on Machine Translation (WMT) prepare and distribute parallel corpora that can be used to train systems for the shared tasks. The two core corpora are News Commentary and Europarl. Both are available in several language pairs, always between English and another European language: cs-en, de-en, es-en and fr-en. The corpora are not multi-parallel: they come from the same source and overlap significantly, but some sentences are translated into only a subset of the languages, so the bi-parallel subsets do not all have the same number of sentence pairs. Such corpora cannot be used directly to train a system for, e.g., de-cs (German-Czech). However, English can serve as a pivot language. If we identify the intersection of the English parts of cs-en and de-en, we can take the non-English counterparts of the overlapping English sentences to create a de-cs parallel corpus. That is what this software does.
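The pivot idea described above can be illustrated with a minimal Python sketch. It is not the actual tool's code: the function names `normalize` and `pivot_intersect`, the in-memory list-of-pairs input, and the whitespace-only normalization of English sentences are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace so trivially different copies match.

    Hypothetical helper; the real tool may compare sentences differently.
    """
    return " ".join(sentence.split())


def pivot_intersect(
    cs_en: List[Tuple[str, str]],
    de_en: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Build (de, cs) pairs from (cs, en) and (de, en) pairs via English.

    Only English sentences present in both inputs contribute a pair.
    If an English sentence occurs more than once, its first occurrence wins.
    """
    # Index the Czech side by its (normalized) English counterpart.
    en_to_cs: Dict[str, str] = {}
    for cs, en in cs_en:
        en_to_cs.setdefault(normalize(en), cs)

    result: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    for de, en in de_en:
        key = normalize(en)
        # Keep the pair only if the English sentence is in the intersection
        # and has not already been used.
        if key in en_to_cs and key not in seen:
            seen.add(key)
            result.append((de, en_to_cs[key]))
    return result


cs_en = [("Ahoj.", "Hello."), ("Dobrý den.", "Good day."), ("Děkuji.", "Thank you.")]
de_en = [("Hallo.", "Hello."), ("Danke.", "Thank  you."), ("Tschüss.", "Bye.")]
de_cs = pivot_intersect(cs_en, de_en)
```

In this toy input, only "Hello." and "Thank you." occur on both English sides, so `de_cs` holds two German-Czech pairs. Keeping only the first occurrence of a repeated English sentence is a simplifying design choice; repeated sentences could also be aligned by position or dropped entirely.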
Czech name
—
Czech description
—
Classification
Type
R - Software
CEP classification
AI - Linguistics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/7E11051" target="_blank" >7E11051: EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2012
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Internal product ID
IPC
Technical parameters
Concluding a usage agreement is not required. The software is available for download at https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora.
Economical parameters
The tool saves the cost of obtaining, translating and annotating new parallel data in cases where texts exist for other language pairs.
Owner IČO
00216208
Owner name
Univerzita Karlova v Praze