mini-CIEP+: A Shareable Parallel Corpus of Prose
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3A6NZ368IL" target="_blank" >RIV/00216208:11320/25:6NZ368IL - isvavai.cz</a>
Výsledek na webu
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85198659157&partnerID=40&md5=fa45ec502891c1b89565ca0f7d1dce75" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85198659157&partnerID=40&md5=fa45ec502891c1b89565ca0f7d1dce75</a>
DOI - Digital Object Identifier
—
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
mini-CIEP+: A Shareable Parallel Corpus of Prose
Popis výsledku v původním jazyce
In this paper we present mini-CIEP+, a sharable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copy-righted material with a select group of people for their own research. Hence, mini-CIEP+ is not publically available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
Název v anglickém jazyce
mini-CIEP+: A Shareable Parallel Corpus of Prose
Popis výsledku anglicky
In this paper we present mini-CIEP+, a sharable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copy-righted material with a select group of people for their own research. Hence, mini-CIEP+ is not publically available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
—
Návaznosti
—
Ostatní
Rok uplatnění
2024
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Workshop Build. Using Comp. Corpora, BUCC LREC-COLING - Proc.
ISBN
978-249381431-9
ISSN
—
e-ISSN
—
Počet stran výsledku
9
Strana od-do
135-143
Název nakladatele
European Language Resources Association (ELRA)
Místo vydání
—
Místo konání akce
Torino, Italia
Datum konání akce
1. 1. 2025
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
—