CsFEVER and CTKFacts: acquiring Czech data for fact verification
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11230%2F23%3A10470715" target="_blank" >RIV/00216208:11230/23:10470715 - isvavai.cz</a>
Alternative codes found
RIV/68407700:21230/23:00372837
Result on the web
<a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=90debLHSb3" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=90debLHSb3</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10579-023-09654-3" target="_blank" >10.1007/s10579-023-09654-3</a>
Alternative languages
Result language
English
Original language name
CsFEVER and CTKFacts: acquiring Czech data for fact verification
Original language description
In this paper, we examine several methods of acquiring Czech data for automated fact-checking, which is a task commonly modeled as a classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in the form of a factual claim, evidence within the ground-truth corpus, and its veracity label (supported, refuted, or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses, propose a future strategy for their mitigation, and publish the 127k resulting translations, as well as a version of this dataset reliably applicable to the Natural Language Inference task, CsFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, which is annotated using a corpus of 2.2 million articles from the Czech News Agency. We present an extended dataset annotation methodology based on the FEVER approach, and, as the underlying corpus is proprietary, we also publish a standalone version of the dataset for the Natural Language Inference task, which we call CTKFactsNLI. We analyze both acquired datasets for spurious cues: annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.
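The description frames verification as a three-way Natural Language Inference classification of a claim against retrieved evidence. Below is a minimal sketch of that formulation, not the authors' code: the checkpoint name is a placeholder, and the example only illustrates how a claim-evidence pair would be scored against the FEVER-style label set.

```python
# Sketch of the three-way claim-verification (NLI) task described in the
# abstract. The model checkpoint is a hypothetical placeholder, not one
# released with the paper.
from dataclasses import dataclass

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]  # FEVER-style label set


@dataclass
class NliExample:
    claim: str      # factual claim (in Czech)
    evidence: str   # evidence passage from the ground-truth corpus
    label: str      # gold annotation, one of LABELS


def classify(claim: str, evidence: str,
             checkpoint: str = "placeholder-czech-nli-model") -> str:
    """Predict the veracity label of `claim` given `evidence`."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS))
    inputs = tokenizer(evidence, claim, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```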
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
<a href="/en/project/TL02000288" target="_blank" >TL02000288: Transformation of Journalism Ethics in the Advent of Artificial Intelligence</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2023
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Language Resources and Evaluation
ISSN
1574-020X
e-ISSN
1574-0218
Volume of the periodical
57
Issue of the periodical within the volume
4
Country of publishing house
NL - THE KINGDOM OF THE NETHERLANDS
Number of pages
35
Pages from-to
1571-1605
UT code for WoS article
000980799100007
EID of the result in the Scopus database
2-s2.0-85158137690