DiscoGeM 2.0: A Parallel Corpus of English, German, French and Czech Implicit Discourse Relations
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11320/24:10492918 - https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F24%3A10492918
Alternative codes found
RIV/00216208:11320/25:INH5XRQI
Result on the web
https://aclanthology.org/2024.lrec-main.443/
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
DiscoGeM 2.0: A Parallel Corpus of English, German, French and Czech Implicit Discourse Relations
Original language description
We present DiscoGeM 2.0, a crowdsourced, parallel corpus of 12,834 implicit discourse relations, with English, German, French and Czech data. We propose and validate a new single-step crowdsourcing annotation method and apply it to collect new annotations in German, French and Czech. The corpus was constructed by having crowdsourced annotators choose a suitable discourse connective for each relation from a set of unambiguous candidates. Every instance was annotated by 10 workers. Our corpus hence represents the first multi-lingual resource that contains distributions of discourse interpretations for implicit relations. The results show that the connective insertion method of discourse annotation can be reliably extended to other languages. The resulting multi-lingual annotations also reveal that implicit relations inferred in one language may differ from those inferred in the translation, meaning the annotations are not always directly transferable. DiscoGeM 2.0 promotes the investigation of cross-lin
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
GA24-11132S: Disagreement in corpus annotation and variation in human understanding of text
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
ISBN
978-2-493-81410-4
ISSN
2522-2686
e-ISSN
—
Number of pages
17
Pages from-to
4940-4956
Publisher name
European Language Resources Association
Place of publication
Torino, Italy
Event location
Torino, Italy
Event date
May 22, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—