Intersecting Parallel Corpora

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/12:10134102 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F12%3A10134102)

  • Alternative codes found

    RIV/00216208:11320/12:10194834

  • Result on the web

    https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Intersecting Parallel Corpora

  • Original language description

    The organizers of the annual Workshop on Machine Translation (WMT) prepare and distribute parallel corpora that can be used to train systems for the shared tasks. Two core corpora are News Commentary and Europarl. Both are available in several language pairs, always between English and another European language: cs-en, de-en, es-en and fr-en. The corpora are not multi-parallel: they come from the same source and overlap significantly, but some sentences are translated into only a subset of the languages, so the bi-parallel subsets do not all have the same number of sentence pairs. Such corpora cannot be used directly to train a system for a pair such as de-cs (German-Czech). However, English can serve as a pivot language. If we identify the intersection of the English sides of cs-en and de-en, we can take the non-English counterparts of the overlapping English sentences to create a de-cs parallel corpus. That is what this software does.

  • Czech name

  • Czech description
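
The pivot approach described in the original language description above can be illustrated with a minimal Python sketch. It is not the code of the published tool; the function names `normalize` and `pivot_intersect`, the whitespace-only normalization, the in-memory list of sentence pairs, and the keep-first policy for duplicate English sentences are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Pair = Tuple[str, str]  # (non-English sentence, English sentence)


def normalize(sentence: str) -> str:
    """Collapse whitespace so trivially different copies compare equal.

    Assumption: the real tool may match English sentences differently.
    """
    return " ".join(sentence.split())


def pivot_intersect(cs_en: Iterable[Pair], de_en: Iterable[Pair]) -> Iterator[Pair]:
    """Yield (de, cs) pairs whose English sides occur in both corpora."""
    # Index Czech sentences by their English counterpart.
    # Hypothetical policy: keep the first Czech sentence for a repeated English one.
    cs_by_en: Dict[str, str] = {}
    for cs, en in cs_en:
        cs_by_en.setdefault(normalize(en), cs)

    emitted: Set[str] = set()
    for de, en in de_en:
        key = normalize(en)
        # Emit each overlapping English sentence at most once.
        if key in cs_by_en and key not in emitted:
            emitted.add(key)
            yield de, cs_by_en[key]


if __name__ == "__main__":
    cs_en = [("Dobrý den.", "Good morning."), ("Děkuji.", "Thank you."), ("Ano.", "Yes.")]
    de_en = [("Guten Morgen.", "Good  morning."), ("Nein.", "No."), ("Danke.", "Thank you.")]
    for de, cs in pivot_intersect(cs_en, de_en):
        print(f"{de}\t{cs}")
```

Only English sentences present in both inputs produce output, so the resulting de-cs corpus is at most as large as the smaller of the two inputs. Exact matching on the English side is the simplest choice; it misses sentences whose English versions differ between the two distributions.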

Classification

  • Type

    R - Software

  • CEP classification

    AI - Linguistics

  • OECD FORD branch

Result continuities

  • Project

    7E11051: EuroMatrixPlus - Enlarged European Union Bringing Machine Translation for European Languages to the User

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)

Others

  • Publication year

    2012

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Internal product ID

    IPC

  • Technical parameters

    No licence agreement is required. The software is available for download at https://wiki.ufal.ms.mff.cuni.cz/user:zeman:intersecting-parallel-corpora.

  • Economic parameters

    The tool saves the cost of obtaining, translating and annotating new parallel data in cases where texts already exist for other language pairs.

  • Owner IČO

    00216208

  • Owner name

    Univerzita Karlova v Praze