CsFEVER and CTKFacts: acquiring Czech data for fact verification
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11230%2F23%3A10470715" target="_blank" >RIV/00216208:11230/23:10470715 - isvavai.cz</a>
Nalezeny alternativní kódy
RIV/68407700:21230/23:00372837
Výsledek na webu
<a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=90debLHSb3" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=90debLHSb3</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10579-023-09654-3" target="_blank" >10.1007/s10579-023-09654-3</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
CsFEVER and CTKFacts: acquiring Czech data for fact verification
Popis výsledku v původním jazyce
In this paper, we examine several methods of acquiring Czech data for automated fact-checking, which is a task commonly modeled as a classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of WIKIPEDIA corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation and publish the 127k resulting translations, as well as a version of such dataset reliably applicable for the Natural Language Inference task-the CSFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated using the corpus of 2.2 M articles of Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the task of Natural Language Inference we call CTKFACTSNLI. We analyze both acquired datasets for spurious cues-annotation patterns leading to model overfitting. CTKFACTS is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.
Název v anglickém jazyce
CsFEVER and CTKFacts: acquiring Czech data for fact verification
Popis výsledku anglicky
In this paper, we examine several methods of acquiring Czech data for automated fact-checking, which is a task commonly modeled as a classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of WIKIPEDIA corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation and publish the 127k resulting translations, as well as a version of such dataset reliably applicable for the Natural Language Inference task-the CSFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated using the corpus of 2.2 M articles of Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the task of Natural Language Inference we call CTKFACTSNLI. We analyze both acquired datasets for spurious cues-annotation patterns leading to model overfitting. CTKFACTS is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.
Klasifikace
Druh
J<sub>imp</sub> - Článek v periodiku v databázi Web of Science
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
<a href="/cs/project/TL02000288" target="_blank" >TL02000288: Proměna etických aspektů s nástupem žurnalistiky umělé inteligence</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)
Ostatní
Rok uplatnění
2023
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název periodika
Language Resources and Evaluation
ISSN
1574-020X
e-ISSN
1574-0218
Svazek periodika
57
Číslo periodika v rámci svazku
4
Stát vydavatele periodika
NL - Nizozemsko
Počet stran výsledku
35
Strana od-do
1571-1605
Kód UT WoS článku
000980799100007
EID výsledku v databázi Scopus
2-s2.0-85158137690